Making Boxblur in FFmpeg 2.5x faster
Optimized by Functio
Overview
Project: FFmpeg
Filter: vf_boxblur
Hot Path: middle segment of the 1‑D blur sweep (horizontal/vertical), where the window is fully inside the image
Files optimized:
- C path that calls into the optimized middle kernel when strides are 1 (u8 and u16 variants)
- x86/AVX2 assembly file implementing boxblur_middle (8‑bit) and boxblur_middle16 (16‑bit)
- The C path follows the same structure and naming conventions as other FFmpeg filters, which allows Functio to learn recurring optimization patterns across similar filter designs.
Introduction
Functio is an AI tool that pinpoints, analyzes, and patches performance bottlenecks. In this case study, Functio targeted FFmpeg’s boxblur middle loop, delivering a carefully optimized AVX2 kernel that reduces inner‑loop overhead and exploits wide data parallelism—without altering boundary handling or user‑facing behavior.
This write‑up walks through:
- What the original code does
- The bottlenecks we found
- The AVX2 design and code changes
- Correctness and reproducibility
- Results
Functionality Description (Original)
At heart, boxblur computes a sliding‑window average over a line. The classic C macro looks like:
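/* Body of FFmpeg's BLUR(type, depth) macro; line continuations omitted. */
static inline void blur ## depth(type *dst, int dst_step, const type *src,
                                 int src_step, int len, int radius)
{
    const int length = radius*2 + 1;
    const int inv = ((1<<16) + length/2)/length;   /* 16.16 reciprocal of the window length */
    int x, sum = src[radius*src_step];

    for (x = 0; x < radius; x++)
        sum += src[x*src_step]<<1;
    sum = sum*inv + (1<<15);                       /* scale and add the rounding bias */

    /* Leading edge: the window is mirrored at the start of the line */
    for (x = 0; x <= radius; x++) {
        sum += (src[(radius+x)*src_step] - src[(radius-x)*src_step])*inv;
        dst[x*dst_step] = sum>>16;
    }

    /* Middle: the window lies fully inside the line */
    for (; x < len-radius; x++) {
        sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
        dst[x*dst_step] = sum >>16;
    }

    /* Trailing edge: the window is mirrored at the end of the line */
    for (; x < len; x++) {
        sum += (src[(2*len-radius-x-1)*src_step] -
                src[(x-radius-1)*src_step])*inv;
        dst[x*dst_step] = sum>>16;
    }
}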
The middle loop (the third for) is the hot zone: the window fully overlaps valid input, so each step takes the difference between the incoming and outgoing pixels, multiplies it by inv, adds it to sum, and writes sum >> 16 as the averaged result.
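To make the fixed-point arithmetic concrete, here is a minimal sketch that compares the rolling update with a direct evaluation of each window. It assumes stride 1 and 8‑bit samples; the names boxblur_inv, blur8_scalar, and window_average are illustrative and do not come from FFmpeg.

```c
#include <stdint.h>

/* Illustrative helper: 16.16 fixed-point reciprocal of the window length. */
static int boxblur_inv(int radius)
{
    const int length = radius * 2 + 1;
    return ((1 << 16) + length / 2) / length;
}

/* Stride-1, 8-bit copy of the scalar blur shown above (len >= 2*radius + 1). */
static void blur8_scalar(uint8_t *dst, const uint8_t *src, int len, int radius)
{
    const int inv = boxblur_inv(radius);
    int x, sum = src[radius];

    for (x = 0; x < radius; x++)
        sum += src[x] << 1;
    sum = sum * inv + (1 << 15);

    for (x = 0; x <= radius; x++) {
        sum += (src[radius + x] - src[radius - x]) * inv;
        dst[x] = sum >> 16;
    }
    for (; x < len - radius; x++) {
        sum += (src[radius + x] - src[x - radius - 1]) * inv;
        dst[x] = sum >> 16;
    }
    for (; x < len; x++) {
        sum += (src[2 * len - radius - x - 1] - src[x - radius - 1]) * inv;
        dst[x] = sum >> 16;
    }
}

/* Direct evaluation: sum the whole window, scale once, round, shift. */
static uint8_t window_average(const uint8_t *src, int x, int radius)
{
    int i, window = 0;

    for (i = x - radius; i <= x + radius; i++)
        window += src[i];
    return (uint8_t)((window * boxblur_inv(radius) + (1 << 15)) >> 16);
}
```

For radius 2 the window length is 5 and boxblur_inv returns 13107. Because the update is linear, sum always equals (window total) * inv + (1 << 15) wherever the window is fully inside the line, so blur8_scalar and window_average agree there exactly.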
The Hook (C Changes)
We insert a thin dispatch that keeps edge handling in plain C and routes the middle segment to our AVX2 kernel when both strides are 1 and the segment holds at least one full 16‑sample chunk:
// Middle loop: use optimized function if strides are 1
{
    int middle_start = radius + 1;
    int middle_end = len - radius;
    if (middle_end > middle_start && dst_step == 1 && src_step == 1) {
        int middle_end_mod16 = middle_end - ((middle_end - middle_start) % 16);
        if (dsp && dsp->middle && middle_end_mod16 > middle_start) {
            // AVX2 implementation
            dsp->middle(dst, src, middle_start, middle_end_mod16, radius, inv, &sum);
            x = middle_end_mod16;
        }
        for (; x < middle_end; x++) {
            sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
            dst[x*dst_step] = sum >> 16;
        }
    } else {
        // Scalar Fallback
        for (x = middle_start; x < middle_end; x++) {
            sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
            dst[x*dst_step] = sum >> 16;
        }
    }
}
dsp->middle points to our AVX2 kernel(s). This preserves correctness and fallback paths, and keeps all edge math in the readable scalar implementation.
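The contract between the C code and the kernel can be sketched with a scalar stand-in. This is an illustration under assumptions: the kernel signature is inferred from the call site above, and BoxBlurDSPSketch, middle_standin, vector_end, and run_middle are hypothetical names, not FFmpeg's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature, inferred from the call site above. */
typedef void (*middle_fn)(uint8_t *dst, const uint8_t *src,
                          int x_start, int x_end, int radius, int inv, int *sum);

/* Hypothetical stand-in for the DSP context; the real struct is not shown. */
typedef struct BoxBlurDSPSketch {
    middle_fn middle;
} BoxBlurDSPSketch;

/* Scalar stand-in honouring the kernel's contract: process [x_start, x_end)
 * and leave the running accumulator in *sum for the caller. */
static void middle_standin(uint8_t *dst, const uint8_t *src,
                           int x_start, int x_end, int radius, int inv, int *sum)
{
    int s = *sum;

    for (int x = x_start; x < x_end; x++) {
        s += (src[radius + x] - src[x - radius - 1]) * inv;
        dst[x] = s >> 16;
    }
    *sum = s;
}

/* Largest end index such that [middle_start, end) is a multiple of 16 long. */
static int vector_end(int middle_start, int middle_end)
{
    return middle_end - ((middle_end - middle_start) % 16);
}

/* Stride-1 middle segment: bulk through dsp->middle, remainder in scalar C. */
static void run_middle(const BoxBlurDSPSketch *dsp, uint8_t *dst,
                       const uint8_t *src, int len, int radius, int inv, int *sum)
{
    int x = radius + 1;
    const int middle_end = len - radius;

    if (middle_end <= x)
        return;
    const int vec_end = vector_end(x, middle_end);
    if (dsp && dsp->middle && vec_end > x) {
        dsp->middle(dst, src, x, vec_end, radius, inv, sum);
        x = vec_end;
    }
    for (; x < middle_end; x++) {
        *sum += (src[radius + x] - src[x - radius - 1]) * inv;
        dst[x] = *sum >> 16;
    }
}
```

With len = 100 and radius = 3, the middle segment is [4, 97), which is 93 samples. The kernel would handle [4, 84), that is five chunks of 16, and the scalar loop finishes the remaining 13. Passing the accumulator by pointer is what lets the scalar tail continue exactly where the kernel stopped.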
How Functio Approaches Optimization
Functio supports two workflows:
- Targeted function optimization: provide files/functions; Functio extracts dependencies, builds a microbench, mocks where needed, and drives a main for repeatable tests.
- Workflow‑based optimization: point Functio at a command; it profiles the project, spots top hotspots, then isolates them into a microbench.
This Case: workflow‑based optimization, using the following command to automatically profile and benchmark FFmpeg’s boxblur implementation:
tests/checkasm/checkasm --test=vf_boxblur --bench
Functio intercepted this benchmarking run, identified the boxblur middle loop as the bottleneck, and constructed an isolated microbench from the detected hot path.
Analysis (Bottlenecks Found)
1) Scalar dependency chain
The C middle loop carries a strict data dependency through sum. Even with -O3, out‑of‑order overlap between iterations is limited, so IPC suffers.
2) Per‑element overhead
Pointer arithmetic and conditionals per pixel inflate the loop body; the compiler can’t fully vectorize the rolling sum pattern.
AVX2 Design (What the Kernel Does)
At each iteration (processing 16 u8 or 8 u16):
- Load incoming/outgoing samples for the sliding window: src[x+radius] and src[x‑radius‑1].
- Form difference = incoming − outgoing.
- Scale by inv using pmulld on sign‑extended dwords.
- Compute an in‑register prefix sum over the 16 (or 8) scaled diffs to materialize all intermediate sum values.
- Add the running accumulator (the sum from the prior chunk) to the whole vector.
- Extract the final lane to update the scalar accumulator for the next vector chunk.
- Shift‑right by 16 and pack back to u8 (or u16) for the output.
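The per-chunk steps above can be modelled in plain C to show why the restructured loop produces the same values as the rolling update. This is a scalar model for the u8 case, not the assembly itself; middle_chunked_model and middle_rolling are illustrative names.

```c
#include <stdint.h>

#define CHUNK 16  /* u8 samples per iteration, as described above */

/* Scalar model of the vector steps over whole chunks in [x_start, x_end). */
static void middle_chunked_model(uint8_t *dst, const uint8_t *src,
                                 int x_start, int x_end, int radius,
                                 int inv, int *sum)
{
    int acc = *sum;

    for (int x = x_start; x + CHUNK <= x_end; x += CHUNK) {
        int32_t v[CHUNK];

        /* Load, subtract, scale: every lane is independent here. */
        for (int i = 0; i < CHUNK; i++)
            v[i] = (src[x + i + radius] - src[x + i - radius - 1]) * inv;

        /* Inclusive prefix sum over the scaled differences. */
        for (int i = 1; i < CHUNK; i++)
            v[i] += v[i - 1];

        /* Add the carried-in accumulator, shift, and narrow to u8. */
        for (int i = 0; i < CHUNK; i++)
            dst[x + i] = (uint8_t)((v[i] + acc) >> 16);

        /* The final lane becomes the accumulator for the next chunk. */
        acc += v[CHUNK - 1];
    }
    *sum = acc;
}

/* Reference: the original rolling update over the same range. */
static void middle_rolling(uint8_t *dst, const uint8_t *src,
                           int x_start, int x_end, int radius,
                           int inv, int *sum)
{
    for (int x = x_start; x < x_end; x++) {
        *sum += (src[radius + x] - src[x - radius - 1]) * inv;
        dst[x] = *sum >> 16;
    }
}
```

Only the prefix-sum step and the accumulator hand-off still depend on neighbouring values. The loads, subtraction, scaling, shift, and pack are lane-independent, which is the work a vector unit can do in parallel.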
Key building blocks:
- pmovzxbw/pmovsxwd (u8→u16→s32) and pmovzxwd (u16→u32)
- pslldq + paddd ladder for 128‑bit prefix; explicit cross‑lane carry via vextracti128/vinserti128
- Accumulator broadcast with vpbroadcastd
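The shift-and-add ladder is easier to follow in a portable model. The sketch below imitates the lane behaviour in plain C rather than using intrinsics: byte shifts on a 256‑bit register act within each 128‑bit half, which is why the carry from the low half to the high half has to be explicit. The function names are illustrative.

```c
#include <stdint.h>

/* Model of one 128-bit half holding 4 dword lanes. Shifting by n lanes moves
 * lane i to lane i+n and zero-fills the low lanes, like pslldq by 4*n bytes. */
static void shift_lanes(int32_t out[4], const int32_t in[4], int n)
{
    for (int i = 0; i < 4; i++)
        out[i] = (i >= n) ? in[i - n] : 0;
}

/* Two shift+add steps turn 4 lanes into their inclusive prefix sum. */
static void prefix4(int32_t v[4])
{
    int32_t t[4];

    shift_lanes(t, v, 1);
    for (int i = 0; i < 4; i++)
        v[i] += t[i];               /* models paddd */
    shift_lanes(t, v, 2);
    for (int i = 0; i < 4; i++)
        v[i] += t[i];               /* models paddd */
}

/* 8 dwords = two independent 128-bit halves plus an explicit carry. */
static void prefix8(int32_t v[8], int32_t acc)
{
    prefix4(v);                     /* low half */
    prefix4(v + 4);                 /* high half, unaware of the low half */

    const int32_t carry = v[3];     /* total of the low half */
    for (int i = 4; i < 8; i++)
        v[i] += carry;              /* cross-lane carry into the high half */

    for (int i = 0; i < 8; i++)
        v[i] += acc;                /* models adding the broadcast accumulator */
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8} with acc = 100, prefix8 leaves {101, 103, 106, 110, 115, 121, 128, 136} in v. The ladder needs log2(4) = 2 shift+add steps per half instead of three dependent scalar additions, and both halves run at once.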
Correctness
- Bit‑exactness: The AVX2 kernel reproduces baseline outputs for all tested widths/radii and both u8/u16 types.
- Boundaries: Edges remain on the scalar path; the AVX2 kernel is only called where the window is fully valid.
- FATE‑friendly: The fallback path is unchanged, so FFmpeg tests continue to pass.
Results
AVX2:
vf_boxblur.boxblur_blur8 [OK]
vf_boxblur.boxblur_blur16 [OK]
checkasm: all 2 tests passed
boxblur_blur8_c: 1396.9 (1.00x)
boxblur_blur8_avx2: 541.1 (2.58x)
boxblur_blur16_c: 1256.0 (1.00x)
boxblur_blur16_avx2: 504.2 (2.49x)
This is roughly a 2.5x speedup over the C baseline for both the 8‑bit (2.58x) and 16‑bit (2.49x) paths.
Conclusion
By isolating the middle loop and swapping in a carefully crafted AVX2 prefix‑sum kernel, we achieved a roughly 2.5x speedup in the checkasm benchmark. The architecture stays clean (C for edges, SIMD for bulk), correctness is preserved, and the speedups scale with width and radius.