Back to Case Studies
Case Study

Making arrayJoin query in ClickHouse 66% faster

Optimized by Functio

November 24, 2025
4 min read

Overview

  • Project: ClickHouse
  • Focus Area: arrayJoin heavy queries
  • File Path: ClickHouse/src/Common/SpaceSaving.h

ClickHouse is one of the fastest OLAP databases. More than 2000 enterprise customers adopted it as a DBMS. It is already highly optimized, but in this case study, Functio will further optimize it.

Introduction

Functio is an AI tool that helps pinpoint, analyze, and resolve performance bottlenecks in code.

In this case, Functio was running on ClickHouse, focusing on the query that heavily uses arrayJoin. This case study walks through:

  1. The query being optimized
  2. What Functio was found in the performance analysis
  3. The code changes and optimizations Functio applied
  4. Performance results before and after

Baseline Functionality

  1. We provided Functio with a following query to profile, and improve the performance of
WITH 50000000 AS N SELECT arrayJoin(topK(500)(intHash64(number) % 1000000)) FROM system.numbers_mt WHERE number < N;

The user needs to specify the input query, and Functio automatically handles all profiling, code changes, and benchmarking. At the end of analysis, it outputs the best-performing code patch that fully preserves the functionality.

Performance Analysis & Optimizations Applied

Through profiling Functio spotted that void percolate(Counter * counter) function in ClickHouse/src/Common/SpaceSaving.h is the performance bottleneck for the query.

Also, Functio spotted from the compiler report that the compiler did not vectorize the loop while (current_slot > 0){}, and there was a big overhead from swaps and redundant operations.

Original code:

void percolate(Counter * counter)
    {
        while (counter->slot > 0)
        {
            auto & next = counter_list[counter->slot - 1];
            if (*counter > *next)
            {
                std::swap(next->slot, counter->slot);
                std::swap(counter_list[next->slot], counter_list[counter->slot]);
            }
            else
                break;
        }
    }

Functio refactored to minimize pointer indirection and improve cache efficiency. The loop has an inherent data dependency (bubble-up pattern) that prevents true vectorization, but we've streamlined the scalar execution by:

  1. Caching raw pointers to avoid repeated unique_ptr::get() calls
   // OPTIMIZATION: Get raw pointer once to minimize indirection overhead
 Counter * prev_counter = counter_list[prev_slot].get();
  1. Hoisting the comparison operator call result
// OPTIMIZATION: Inline comparison logic to eliminate function call overhead
// and enable better instruction scheduling by the compiler
  const bool should_swap = (counter_count > prev_count) || 
(counter_count == prev_count && counter_error < prev_error);
  1. Reducing redundant member accesses
// OPTIMIZATION: Load counter values once per iteration to avoid redundant memory accesses
           const UInt64 counter_count = counter->count;
           const UInt64 counter_error = counter->error;
           const UInt64 prev_count = prev_counter->count;
           const UInt64 prev_error = prev_counter->error;
  1. Using direct pointer swaps instead of std::swap on unique_ptrs
// OPTIMIZATION: Direct swap in vector - compiler can optimize this well
std::swap(counter_list[current_slot], counter_list[prev_slot]);
  1. Added an early-exit when the counter is already at slot 0, avoiding entering the loop unnecessarily.
// OPTIMIZATION: Early exit for common case - counter already at top
       if (current_slot == 0)
           return;

Correctness & Validation

All ClickHouse unit tests were passed.

Results

ORIGINAL:

localhost:9000, queries: 1, QPS: 0.160, RPS: 7979601.899, MiB/s: 60.880, result RPS: 0.000, result MiB/s: 0.000.

6.157 sec.

OPTIMIZED:

localhost:9000, queries: 1, QPS: 0.265, RPS: 13258014.240, MiB/s: 101.151, result RPS: 0.000, result MiB/s: 0.000.

3.667 sec.

66.1% increase in RPS, 40.4% decrease in latency

Conclusion

By implementing a few simple optimizations, Functio sped up the joinArray-heavy query in ClickHouse by an incredible 66%.

More Case Studies

Explore additional performance wins delivered by Functio.