Making arrayJoin query in ClickHouse 66% faster
Optimized by Functio
Overview
- Project: ClickHouse
- Focus Area: arrayJoin heavy queries
- File Path:
ClickHouse/src/Common/SpaceSaving.h
ClickHouse is one of the fastest OLAP databases. More than 2000 enterprise customers adopted it as a DBMS. It is already highly optimized, but in this case study, Functio will further optimize it.
Introduction
Functio is an AI tool that helps pinpoint, analyze, and resolve performance bottlenecks in code.
In this case, Functio was running on ClickHouse, focusing on the query that heavily uses arrayJoin. This case study walks through:
- The query being optimized
- What Functio was found in the performance analysis
- The code changes and optimizations Functio applied
- Performance results before and after
Baseline Functionality
- We provided Functio with a following query to profile, and improve the performance of
WITH 50000000 AS N SELECT arrayJoin(topK(500)(intHash64(number) % 1000000)) FROM system.numbers_mt WHERE number < N;
The user needs to specify the input query, and Functio automatically handles all profiling, code changes, and benchmarking. At the end of analysis, it outputs the best-performing code patch that fully preserves the functionality.
Performance Analysis & Optimizations Applied
Through profiling Functio spotted that void percolate(Counter * counter) function in ClickHouse/src/Common/SpaceSaving.h is the performance bottleneck for the query.
Also, Functio spotted from the compiler report that the compiler did not vectorize the loop while (current_slot > 0){}, and there was a big overhead from swaps and redundant operations.
Original code:
void percolate(Counter * counter)
{
while (counter->slot > 0)
{
auto & next = counter_list[counter->slot - 1];
if (*counter > *next)
{
std::swap(next->slot, counter->slot);
std::swap(counter_list[next->slot], counter_list[counter->slot]);
}
else
break;
}
}
Functio refactored to minimize pointer indirection and improve cache efficiency. The loop has an inherent data dependency (bubble-up pattern) that prevents true vectorization, but we've streamlined the scalar execution by:
- Caching raw pointers to avoid repeated unique_ptr::get() calls
// OPTIMIZATION: Get raw pointer once to minimize indirection overhead
Counter * prev_counter = counter_list[prev_slot].get();
- Hoisting the comparison operator call result
// OPTIMIZATION: Inline comparison logic to eliminate function call overhead
// and enable better instruction scheduling by the compiler
const bool should_swap = (counter_count > prev_count) ||
(counter_count == prev_count && counter_error < prev_error);
- Reducing redundant member accesses
// OPTIMIZATION: Load counter values once per iteration to avoid redundant memory accesses
const UInt64 counter_count = counter->count;
const UInt64 counter_error = counter->error;
const UInt64 prev_count = prev_counter->count;
const UInt64 prev_error = prev_counter->error;
- Using direct pointer swaps instead of std::swap on unique_ptrs
// OPTIMIZATION: Direct swap in vector - compiler can optimize this well
std::swap(counter_list[current_slot], counter_list[prev_slot]);
- Added an early-exit when the counter is already at slot 0, avoiding entering the loop unnecessarily.
// OPTIMIZATION: Early exit for common case - counter already at top
if (current_slot == 0)
return;
Correctness & Validation
All ClickHouse unit tests were passed.
Results
ORIGINAL:
6.157 sec.
OPTIMIZED:
3.667 sec.
Conclusion
By implementing a few simple optimizations, Functio sped up the joinArray-heavy query in ClickHouse by an incredible 66%.