Making Boxblur in FFmpeg 2.5x faster

Overview

Project: FFmpeg

Filter: vf_boxblur

Hot Path: middle segment of the 1‑D blur sweep (horizontal/vertical), where the window is fully inside the image

Files optimized:

C path that calls into the optimized middle kernel when strides are 1 (u8 and u16 variants)
x86/AVX2 assembly file implementing boxblur_middle (8‑bit) and boxblur_middle16 (16‑bit)
The C path follows the same structure and naming conventions as other FFmpeg filters, which lets Functio learn recurring optimization patterns across similar filter designs.

Introduction

Functio is an AI tool that pinpoints, analyzes, and patches performance bottlenecks. In this case study, Functio targeted FFmpeg’s boxblur middle loop, delivering a carefully optimized AVX2 kernel that reduces inner‑loop overhead and exploits wide data parallelism—without altering boundary handling or user‑facing behavior.

This write‑up walks through:

What the original code does
The bottlenecks we found
The AVX2 design and code changes
Correctness and reproducibility
Results

Functionality Description (Original)

At heart, boxblur computes a sliding‑window average over a line. The body of the classic C macro looks like this (line‑continuation backslashes omitted; type and depth are macro parameters):

static inline void blur ## depth(type *dst, int dst_step, const type *src,
                                 int src_step, int len, int radius)
{
    const int length = radius*2 + 1;
    const int inv = ((1<<16) + length/2)/length;
    int x, sum = src[radius*src_step];

    for (x = 0; x < radius; x++)
        sum += src[x*src_step]<<1;

    sum = sum*inv + (1<<15);

    for (x = 0; x <= radius; x++) {
        sum += (src[(radius+x)*src_step] - src[(radius-x)*src_step])*inv;
        dst[x*dst_step] = sum>>16;
    }

    for (; x < len-radius; x++) {
        sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
        dst[x*dst_step] = sum >>16;
    }

    for (; x < len; x++) {
        sum += (src[(2*len-radius-x-1)*src_step] -
               src[(x-radius-1)*src_step])*inv;
        dst[x*dst_step] = sum>>16;
    }
}

The middle loop (the third for loop, running from radius+1 up to len-radius) is the hot zone: the window lies fully inside valid input, so each step takes the difference between the incoming and outgoing pixels, multiplies it by inv, adds the result to sum, and writes sum>>16 as the averaged output.
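The fixed‑point arithmetic in the listing above can be isolated in a small sketch. The helper names box_inv and box_step_u8 are hypothetical and exist only for illustration; the expressions themselves are copied from the listing, specialized to a unit‑stride 8‑bit line.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point reciprocal of the window length, as in the listing:
 * inv is roughly 65536 / (2*radius + 1), rounded to nearest. */
static int box_inv(int radius)
{
    assert(radius >= 0);
    const int length = radius * 2 + 1;
    return ((1 << 16) + length / 2) / length;
}

/* One middle-loop step at position x (hypothetical helper).
 * *sum holds the window total already scaled by inv, plus the 1<<15
 * rounding bias, so the output is simply the top bits of the sum. */
static uint8_t box_step_u8(const uint8_t *src, int x, int radius,
                           int inv, int *sum)
{
    /* Add the pixel entering the window, drop the one leaving it. */
    *sum += (src[x + radius] - src[x - radius - 1]) * inv;
    return (uint8_t)(*sum >> 16);
}
```

For radius 2 the window length is 5 and box_inv returns 13107, so multiplying by inv and shifting right by 16 approximates a division by 5 without a divide instruction in the loop.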

The Hook (C Changes)

We insert a thin dispatch that keeps edge handling in plain C and hands the middle segment to our AVX2 kernel when both strides are 1 and the segment is long enough for at least one 16‑pixel chunk:

    // Middle loop: use optimized function if strides are 1
    {
        int middle_start = radius + 1;
        int middle_end   = len - radius;
        if (middle_end > middle_start && dst_step == 1 && src_step == 1) {
            int middle_end_mod16 = middle_end - ((middle_end - middle_start) % 16);
            if (dsp && dsp->middle && middle_end_mod16 > middle_start) {
                // AVX2 implementation
                dsp->middle(dst, src, middle_start, middle_end_mod16, radius, inv, &sum);
                x = middle_end_mod16;
            }
            for (; x < middle_end; x++) {
                sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
                dst[x*dst_step] = sum >> 16;
            }
        } else {
            // Scalar fallback
            for (x = middle_start; x < middle_end; x++) {
                sum += (src[(radius+x)*src_step] - src[(x-radius-1)*src_step])*inv;
                dst[x*dst_step] = sum >> 16;
            }
        }
    }

dsp->middle points to our AVX2 kernel(s). This preserves correctness and fallback paths, and keeps all edge math in the readable scalar implementation.
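To make the kernel contract concrete, here is a minimal sketch of the dispatch with a scalar stand‑in plugged into the function pointer. The struct name BoxBlurDSPSketch, the stand‑in middle_standin_u8, and the wrapper run_middle_u8 are assumptions for illustration, not FFmpeg's actual types; only the field name middle and the argument order are taken from the call in the hook above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DSP context; only the `middle` field name and the
 * argument order come from the hook shown above. */
typedef struct BoxBlurDSPSketch {
    void (*middle)(uint8_t *dst, const uint8_t *src, int x_start, int x_end,
                   int radius, int inv, int *sum);
} BoxBlurDSPSketch;

/* Scalar stand-in honoring the assumed contract: process [x_start, x_end),
 * read the running sum on entry and write it back on exit. */
static void middle_standin_u8(uint8_t *dst, const uint8_t *src, int x_start,
                              int x_end, int radius, int inv, int *sum)
{
    int s = *sum;
    for (int x = x_start; x < x_end; x++) {
        s += (src[x + radius] - src[x - radius - 1]) * inv;
        dst[x] = (uint8_t)(s >> 16);
    }
    *sum = s;
}

/* Unit-stride middle segment: the bulk goes to the kernel in multiples
 * of 16 pixels (if a kernel is installed), the remainder stays scalar. */
static void run_middle_u8(const BoxBlurDSPSketch *dsp, uint8_t *dst,
                          const uint8_t *src, int len, int radius,
                          int inv, int *sum)
{
    int x = radius + 1;
    const int end = len - radius;
    assert(radius >= 0 && len > 2 * radius);
    if (end > x) {
        const int bulk_end = end - (end - x) % 16;
        if (dsp && dsp->middle && bulk_end > x) {
            dsp->middle(dst, src, x, bulk_end, radius, inv, sum);
            x = bulk_end;
        }
    }
    for (; x < end; x++) {   /* scalar tail, or the whole segment */
        *sum += (src[x + radius] - src[x - radius - 1]) * inv;
        dst[x] = (uint8_t)(*sum >> 16);
    }
}
```

Because the running sum is passed by pointer, the scalar tail can continue exactly where the kernel stopped, which is what lets the two paths be mixed freely within one line.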

How Functio Approaches Optimization

Functio supports two workflows:

Targeted function optimization — provide files/functions; Functio extracts deps, builds a microbench, mocks where needed, and drives a main for repeatable tests.
Workflow‑based optimization — point Functio at a command; it profiles the project, spots top hotspots, then isolates them into a microbench.

This Case: workflow‑based optimization, using the following command to automatically profile and benchmark FFmpeg’s boxblur implementation:

tests/checkasm/checkasm --test=vf_boxblur --bench

Functio intercepted this benchmarking run, identified the boxblur middle loop as the bottleneck, and constructed an isolated microbench from the detected hot path.

Analysis (Bottlenecks Found)

1) Scalar dependency chain: The C middle loop carries a strict loop‑carried dependency through sum, so each output must wait for the previous one. Even with -O3, out‑of‑order execution has little independent work to overlap, and IPC suffers.

2) Per‑element overhead: Pointer arithmetic and conditionals are paid once per pixel, which inflates the loop body, and the compiler can't fully vectorize the rolling‑sum pattern.

AVX2 Design (What the Kernel Does)

At each iteration (processing 16 u8 or 8 u16):

Load incoming/outgoing samples for the sliding window: src[x+radius] and src[x‑radius‑1].
Form difference = incoming − outgoing.
Scale by inv using pmulld on sign‑extended dwords.
Compute an in‑register prefix sum over the 16 (or 8) scaled diffs to materialize all intermediate sum values.
Add the running accumulator (the sum from prior chunk) to the whole vector.
Extract the final lane to update the scalar accumulator for the next vector chunk.
Shift‑right by 16 and pack back to u8 (or u16) for the output.
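The steps above can be modeled in portable C before any SIMD is involved. This is a sketch under stated assumptions, not the shipped kernel: the function name middle_chunked_model_u8 is hypothetical, the prefix sum is a plain loop, and the chunk size of 16 follows the u8 case described in the text.

```c
#include <assert.h>
#include <stdint.h>

enum { CHUNK = 16 };  /* 16 u8 outputs per iteration, as in the text */

/* Portable model of one kernel call over [x_start, x_end) on a
 * unit-stride 8-bit line. The span must be a multiple of CHUNK. */
static void middle_chunked_model_u8(uint8_t *dst, const uint8_t *src,
                                    int x_start, int x_end,
                                    int radius, int inv, int *sum)
{
    int32_t acc = *sum;
    assert((x_end - x_start) % CHUNK == 0);
    for (int x = x_start; x < x_end; x += CHUNK) {
        int32_t v[CHUNK];
        /* Load, difference, scale: every lane is independent here. */
        for (int i = 0; i < CHUNK; i++)
            v[i] = (src[x + i + radius] - src[x + i - radius - 1]) * inv;
        /* Inclusive prefix sum over the scaled differences. */
        for (int i = 1; i < CHUNK; i++)
            v[i] += v[i - 1];
        /* Add the carried accumulator, shift right by 16, narrow, store. */
        for (int i = 0; i < CHUNK; i++)
            dst[x + i] = (uint8_t)((v[i] + acc) >> 16);
        /* The last lane becomes the accumulator for the next chunk. */
        acc += v[CHUNK - 1];
    }
    *sum = acc;
}
```

The point of the restructuring is that only the final accumulator update depends on the previous chunk; the load, difference, and scale work for all 16 lanes can proceed in parallel.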

Key building blocks:

pmovzxbw/pmovsxwd (u8→u16→s32) and pmovzxwd (u16→u32)
pslldq+paddd ladder for 128‑bit prefix; explicit cross‑lane carry via vextracti128/vinserti128
Accumulator broadcast with vpbroadcastd
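The shift‑and‑add ladder and the cross‑lane carry can be illustrated with a scalar model of eight dword lanes split into two 128‑bit halves. This is only a sketch of the general technique, with hypothetical helper names; it assumes a two‑step ladder (shift by one lane, then by two) inside each half and is not a transcription of the actual assembly.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8, HALF = 4 };  /* 8 dwords per 256-bit register, 4 per half */

/* Models a byte shift by n dwords applied to each 128-bit half on its own:
 * lanes move toward higher indices and zeros fill in from the bottom. */
static void shift_lanes_per_half(int32_t out[LANES], const int32_t in[LANES],
                                 int n)
{
    assert(n > 0 && n < HALF);
    for (int h = 0; h < LANES; h += HALF)
        for (int i = 0; i < HALF; i++)
            out[h + i] = (i >= n) ? in[h + i - n] : 0;
}

/* Lane-wise 32-bit add, standing in for paddd. */
static void add_lanes(int32_t v[LANES], const int32_t t[LANES])
{
    for (int i = 0; i < LANES; i++)
        v[i] += t[i];
}

/* In-place inclusive prefix sum over 8 dword lanes. */
static void prefix_sum8_model(int32_t v[LANES])
{
    int32_t t[LANES];
    /* Ladder inside each half: after shifts of 1 and 2 lanes, every
     * half holds its own 4-element prefix sum. */
    shift_lanes_per_half(t, v, 1);
    add_lanes(v, t);
    shift_lanes_per_half(t, v, 2);
    add_lanes(v, t);
    /* Cross-half carry: the low half's total is added to the high half. */
    const int32_t carry = v[HALF - 1];
    for (int i = HALF; i < LANES; i++)
        v[i] += carry;
}
```

Starting from lanes 1 through 8, the ladder leaves 1, 3, 6, 10 in the low half and 5, 11, 18, 26 in the high half; adding the carry of 10 to the high half yields the full prefix sums 15, 21, 28, 36.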

Correctness

Bit‑exactness: The AVX2 kernel reproduces baseline outputs for all tested widths/radii and both u8/u16 types.
Boundaries: Edges remain on the scalar path; the AVX2 kernel is only called where the window is fully valid.
FATE‑friendly: The fallback path is unchanged, so FFmpeg tests continue to pass.

Results

AVX2:

vf_boxblur.boxblur_blur8 [OK]
vf_boxblur.boxblur_blur16 [OK]
checkasm: all 2 tests passed
boxblur_blur8_c: 1396.9 (1.00x)
boxblur_blur8_avx2: 541.1 (2.58x)
boxblur_blur16_c: 1256.0 (1.00x)
boxblur_blur16_avx2: 504.2 (2.49x)

This is roughly a 2.5x speedup on both paths: 2.58x for the 8‑bit kernel and 2.49x for the 16‑bit kernel.

Conclusion

By isolating the middle loop and swapping in a carefully crafted AVX2 prefix‑sum kernel, we achieved a roughly 2.5x speedup in the checkasm benchmark. The architecture stays clean (C for edges, SIMD for bulk), correctness is preserved, and the speedups scale with width and radius.