If you’ve ever tried to add “record this gameplay” or “save this animation” to a real-time graphics application, you’ve probably hit the same wall: reading pixels back from the GPU is slow, and naively doing it every frame turns a smooth 60 FPS demo into a 12 FPS slideshow. This post walks through the standard solution — asynchronous Pixel Buffer Object (PBO) readback paired with a background encoder thread — using a WebP-encoding setup as the running example. The same pattern works for MP4, GIF, image sequences, or anything else.

Why the Naive Approach Stalls

The naive recording loop looks like this:

flowchart LR
    R[Render<br/>frame] --> RP["glReadPixels()<br/>(synchronous)"]
    RP --> E[Encode]
    E --> W[Write<br/>to disk]
    W --> N[Next<br/>frame]

glReadPixels blocks the calling thread until the GPU finishes all pending work, copies the framebuffer back over the PCIe bus, and returns the bytes to RAM. On modern hardware that’s a few milliseconds per call, but the kicker is that the GPU ends up idle too: the blocked CPU thread can’t submit commands for the next frame until the readback returns.

The result is a punishing serialization, with each processor idling while the other works:

GPU:   ###______###______###______
CPU:   ___######___######___######
       |   |  |
       |   |  +-- encoding
       |   +----- readback (GPU stalled here)
       +--------- rendering

What we want instead is parallel work: GPU rendering frame N+1 while the CPU encodes frame N.
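
To put rough numbers on the difference, here is a back-of-the-envelope sketch. The three per-stage timings are made-up illustrative values, not measurements, and the pipelined case assumes the readback is fully hidden behind the other work:

```python
# Illustrative per-frame costs in milliseconds (assumed values, not measurements).
RENDER_MS = 6.0
READBACK_MS = 3.0
ENCODE_MS = 10.0


def serialized_fps(render_ms: float, readback_ms: float, encode_ms: float) -> float:
    """Naive loop: every stage waits for the previous one to finish."""
    return 1000.0 / (render_ms + readback_ms + encode_ms)


def pipelined_fps(render_ms: float, encode_ms: float) -> float:
    """Overlapped loop: throughput is set by the slowest stage alone."""
    return 1000.0 / max(render_ms, encode_ms)


print(f"serialized: {serialized_fps(RENDER_MS, READBACK_MS, ENCODE_MS):.1f} FPS")  # 52.6
print(f"pipelined:  {pipelined_fps(RENDER_MS, ENCODE_MS):.1f} FPS")                # 100.0
```

With these assumed numbers the serialized loop pays for all three stages on every frame, while the overlapped loop is limited only by the slower of rendering and encoding.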

PBO Async Readback

A Pixel Buffer Object is a driver-managed buffer that can receive pixel data from the GPU and later be mapped into CPU address space. The trick is that when a PBO is bound to GL_PIXEL_PACK_BUFFER, glReadPixels only queues the copy and returns immediately; the CPU blocks only if it tries to map the buffer before the copy has completed.

The readback pattern with a single PBO already helps: instead of calling glReadPixels directly into a CPU buffer, you read into a PBO and map it on the next frame, before that frame’s readback overwrites it:

sequenceDiagram
    participant GPU
    participant PBO as PBO_A
    participant CPU
    Note over GPU,CPU: Frame 1
    GPU->>PBO: glReadPixels (async write)
    Note over GPU,CPU: Frame 2 (CPU now reads frame 1)
    PBO->>CPU: map → copy frame 1 pixels → unmap
    GPU->>PBO: glReadPixels (frame 2)
    CPU->>CPU: encode

Now the GPU can start frame 2 while the previous frame’s data trickles back. The CPU only blocks if mapping the PBO catches the GPU still writing — and with a one-frame delay, that’s almost never.

Using three PBOs in rotation removes the last bit of contention. Each frame the GPU writes into one PBO and the CPU maps a different one that was filled two frames ago:

flowchart LR
    F1["Frame 1<br/>GPU → PBO_A"]
    F2["Frame 2<br/>GPU → PBO_B"]
    F3["Frame 3<br/>GPU → PBO_C<br/>CPU maps PBO_A"]
    F4["Frame 4<br/>GPU → PBO_A<br/>CPU maps PBO_B"]
    F5["Frame 5<br/>GPU → PBO_B<br/>CPU maps PBO_C"]
    F1 --> F2 --> F3 --> F4 --> F5

The GPU and CPU now run fully in parallel, with the PBOs acting as a small queue.
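
The rotation is just index arithmetic, which is easy to get off by one. Here is a minimal sketch of it with no OpenGL calls; `pbo_schedule` and `PBO_NAMES` are hypothetical names for illustration, and frames are numbered from 1 to match the diagram above:

```python
from typing import Optional, Tuple

PBO_NAMES = ["PBO_A", "PBO_B", "PBO_C"]


def pbo_schedule(frame: int, count: int = 3) -> Tuple[int, Optional[int]]:
    """Return (write_index, map_index) for a 1-based frame number.

    The GPU writes into slot (frame - 1) % count. The CPU maps the slot that
    was filled count - 1 frames earlier, or nothing while the ring is filling.
    """
    write_index = (frame - 1) % count
    if frame < count:
        return write_index, None
    # Slot filled at frame - (count - 1): (frame - count) % count == frame % count.
    map_index = frame % count
    return write_index, map_index


for frame in range(1, 6):
    write_index, map_index = pbo_schedule(frame)
    mapped = PBO_NAMES[map_index] if map_index is not None else "nothing yet"
    # In a real renderer the write step would bind the PBO to GL_PIXEL_PACK_BUFFER
    # and call glReadPixels; the map step would call glMapBufferRange, copy the
    # pixels out, and then glUnmapBuffer.
    print(f"Frame {frame}: GPU -> {PBO_NAMES[write_index]}, CPU maps {mapped}")
```

Frame 3 writes PBO_C and maps PBO_A, matching the diagram. The first two frames produce no output, so the recorder has to map the last two PBOs once more after the final frame to flush them.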

The Encoder Thread

PBO mapping gets pixels back to the CPU. But encoding them — especially to anything other than uncompressed PPM — is also expensive. WebP encoding at quality 75 takes on the order of 5-15 ms per 1080p frame depending on CPU. Doing that on the render thread eats your frame budget.

The fix is a background encoder thread with a bounded ring buffer:

flowchart TB
    subgraph render["Render thread"]
        direction LR
        R1[Render frame N] --> R2[Map PBO] --> R3[Push pixel data]
    end
    subgraph buffer["Ring buffer (bounded, e.g. 4 slots)"]
        Q[(queue)]
    end
    subgraph encoder["Encoder thread"]
        direction LR
        E1[Pop frame] --> E2["Encode (WebP)"] --> E3[Append to file]
    end
    render --> buffer --> encoder

The render thread does the cheap part (push), the encoder does the expensive part (encode + write), and they’re decoupled by the queue. As long as the average encode time is under the average frame time, the queue stays roughly stable.
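
A minimal sketch of that split, using Python’s standard `queue.Queue` as the bounded buffer. The `encoder_loop` and `record` names are hypothetical, and `zlib.compress` is only a stand-in for a real WebP encoder call so the example runs without extra dependencies:

```python
import queue
import threading
import zlib

SENTINEL = None  # pushed once to tell the encoder thread to exit


def encoder_loop(frames: "queue.Queue", out: list) -> None:
    """Pop raw frames, compress them, and append the result to `out`."""
    while True:
        pixels = frames.get()
        if pixels is SENTINEL:
            break
        # zlib is only a placeholder for the expensive encode step.
        out.append(zlib.compress(pixels))


def record(raw_frames, slots: int = 4) -> list:
    """Push frames from the calling (render) thread; encode on a worker."""
    frames: "queue.Queue" = queue.Queue(maxsize=slots)
    encoded: list = []
    worker = threading.Thread(target=encoder_loop, args=(frames, encoded))
    worker.start()
    for pixels in raw_frames:
        frames.put(pixels)  # blocks only when all slots are full
    frames.put(SENTINEL)
    worker.join()
    return encoded


fake_frames = [bytes([i]) * (64 * 64 * 4) for i in range(10)]  # tiny RGBA frames
print(len(record(fake_frames)), "frames encoded")
```

One detail the sketch glosses over: pixels read from a mapped PBO have to be copied into memory the queue owns before the buffer is unmapped, because the mapped pointer is no longer valid afterwards.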

What If the Encoder Falls Behind?

The ring buffer has finite slots — what happens when the encoder can’t keep up? Three options:

| Strategy | Behavior | Use when |
| --- | --- | --- |
| Block | Render thread pauses until a slot frees | Recording must be lossless |
| Drop oldest | Discard the oldest queued frame | Recording must not stall the app |
| Drop newest | Discard the incoming frame | Rare: only if older frames matter more |

For an interactive app where the user expects responsive playback, drop oldest is usually correct: the recording becomes slightly choppy, but the app stays smooth. For a benchmark or scientific recording, block is correct — you’d rather slow down than misrepresent the result.

A common compromise is block with a timeout: try to push for up to one frame’s worth of time (about 16.7 ms at 60 FPS), then drop the frame if the queue is still full. This catches one-off encoder spikes without committing to either extreme.
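
The strategies can be sketched as a single push helper. `push_frame` and its `policy` strings are hypothetical names for illustration; the queue calls themselves come from Python’s standard `queue` module. Passing a `timeout_s` with the block policy gives the block-with-timeout compromise:

```python
import queue
from typing import Optional


def push_frame(q: "queue.Queue", frame: bytes, policy: str = "drop_oldest",
               timeout_s: Optional[float] = None) -> bool:
    """Try to enqueue `frame`; return False if the incoming frame was dropped."""
    if policy == "block":
        try:
            # timeout_s=None waits forever; a value gives block-with-timeout.
            q.put(frame, timeout=timeout_s)
        except queue.Full:
            return False
        return True
    if policy == "drop_newest":
        try:
            q.put_nowait(frame)
        except queue.Full:
            return False
        return True
    if policy == "drop_oldest":
        while True:
            try:
                q.put_nowait(frame)
                return True
            except queue.Full:
                try:
                    q.get_nowait()  # discard the oldest queued frame
                except queue.Empty:
                    pass  # the encoder drained the queue meanwhile; retry
    raise ValueError(f"unknown policy: {policy}")
```

For example, `push_frame(q, pixels, policy="block", timeout_s=1 / 60)` waits at most one 60 FPS frame for a free slot before giving up on that frame.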

Frame Pacing

One subtle issue: when do you stamp the timestamp on each frame? If you record real wall-clock timestamps, the resulting video will play back at exactly the speed it was captured at — including any hiccups. If you stamp at the target framerate (say, 60 FPS), the video will be smooth but the simulation will appear slightly time-shifted.

For most recordings, target-rate stamping is preferable:

Wall-clock:    0ms - 17ms - 33ms - 52ms - 67ms - 84ms -
                                    ^
                            19 ms gap (hiccup)

Stamped:       0ms - 17ms - 33ms - 50ms - 67ms - 83ms -
                                    ^
                            smoothed to 17 ms

This gives the viewer a 60 FPS playback experience even if the actual capture jittered. The cost is that the timing of in-simulation events appears slightly compressed: in the example above, a frame captured 19 ms after its predecessor plays back 17 ms after it. That is usually invisible unless the recording captures an event that’s supposed to happen at a precise wall-clock time.
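
Target-rate stamping comes down to one line of arithmetic. In this sketch, `stamp_ms` is a hypothetical helper that derives each timestamp from the frame index rather than adding a fixed 17 ms per frame, which would drift by 20 ms every second at 60 FPS:

```python
def stamp_ms(frame_index: int, fps: float = 60.0) -> int:
    """Presentation timestamp for a 0-based frame index, rounded to whole ms."""
    return round(frame_index * 1000.0 / fps)


stamps = [stamp_ms(i) for i in range(6)]
durations = [b - a for a, b in zip(stamps, stamps[1:])]
print(stamps)     # [0, 17, 33, 50, 67, 83]
print(durations)  # [17, 16, 17, 17, 16]
```

The per-frame durations alternate between 16 and 17 ms so that frame 60 lands exactly on 1000 ms.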

Output Format Trade-offs

Three reasonable choices for the encoded output:

| Format | Pros | Cons | Use when |
| --- | --- | --- | --- |
| Animated WebP | Compact, plays in browsers, transparent backgrounds, lossless mode available | Slower to encode than GIF | Web sharing, demos, soft-body recordings |
| MP4 (H.264) | Universal playback, very compact | Patent-encumbered, no transparency | Long recordings, traditional video |
| GIF | Plays everywhere, no decoder questions | Large files, 256-color limit | Compatibility, embeddable in old systems |

For recordings under ~30 seconds with a small color palette and possible transparency, animated WebP is usually the best balance of file size, encoding speed, and compatibility.

(Inside an Animated Image Player covers the playback side of the loop.)

Safety Mechanisms

Two practical safeguards every recorder should have:

Max-duration cap. Without one, a forgotten recording session fills the disk overnight. A 30-second cap covers >95% of legitimate use cases for short demos and forces users to start a new clip for longer captures.

Disk space check. Before each frame is queued, verify there’s room. Aborting recording cleanly with a “low disk space” message is much friendlier than a corrupted output file.

flowchart LR
    F[Frame from PBO] --> D{Low disk?}
    D -->|Yes| S[Stop and save<br/>partial output]
    D -->|No| Q[Queue for<br/>encoding]
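

Both safeguards fit in one check that runs before each frame is queued. This is a sketch under stated assumptions: `stop_reason` and `free_bytes_at` are hypothetical helpers, the 200 MiB threshold is an arbitrary placeholder, and only `shutil.disk_usage` is a real standard-library call:

```python
import shutil
from typing import Optional

MAX_SECONDS = 30.0              # duration cap discussed above
MIN_FREE_BYTES = 200 * 1024**2  # hypothetical low-disk threshold (200 MiB)


def stop_reason(frames_written: int, fps: float, free_bytes: int) -> Optional[str]:
    """Return a reason string if recording should stop, else None."""
    if frames_written / fps >= MAX_SECONDS:
        return "max duration reached"
    if free_bytes < MIN_FREE_BYTES:
        return "low disk space"
    return None


def free_bytes_at(path: str = ".") -> int:
    """Free space on the volume holding `path`, in bytes."""
    return shutil.disk_usage(path).free


print(stop_reason(frames_written=120, fps=60.0, free_bytes=free_bytes_at(".")))
```

Taking the free-space figure as a parameter keeps the decision logic testable. If querying the disk on every frame turns out to be too expensive, checking every N frames is a reasonable variation.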

What This Doesn’t Cover

Real-world recorders also need:

  • Audio track muxing (only if your app produces audio at all)
  • Hotkey support for start/stop without breaking the focus rules of the host app
  • Live preview of what’s being captured (a thumbnail of the most recent encoded frame works well)
  • Format-specific tuning (e.g., WebP quality and method parameters trade encode time for file size)

Each is a small project on its own, and most users only need a subset.

Wrapping Up

The full recipe in five steps:

  1. Render to the framebuffer as usual.
  2. Trigger an async PBO readback at the end of each frame.
  3. Map a PBO that was filled on an earlier frame and push the bytes to a bounded ring buffer.
  4. Background thread pops from the ring buffer, encodes, appends to file.
  5. Stamp frames at the target framerate, not at wall-clock time.

Get those five right and recording costs almost nothing on the render thread — typically under 1 ms of overhead per frame, which is invisible in any 60 FPS application. The encoder thread does the heavy lifting on a CPU core that would otherwise be idle, and the resulting output is a smooth, accurate recording of exactly what the user saw.

The biggest pitfall I’ve seen in production is teams skipping step 5 — they stamp wall-clock times and then can’t figure out why the recording feels “wrong” even though every frame was captured. Once you decouple capture timing from playback timing, recording stops fighting the rest of the application.