Animated WebP, GIF, and APNG playback feels trivial from the outside — most browsers do it for free, after all. But once you try to build a player yourself, the seemingly simple problem fragments into a half-dozen subtle ones: how do you keep frame timing accurate, what state should the player be in when nothing is playing, what happens when the decoder is slower than the framerate, and how do you stay responsive to user input without dropping frames? This post walks through how those pieces fit together, using a Pillow-based implementation as the running example.
(See Recording Real-Time Graphics for the producer side.)
The Surface vs. the Iceberg
From the outside, the player is just an image and a controls bar. Inside, it’s six interacting components:
flowchart LR
Decoder["Decoder<br/>(Pillow)"] --> Cache["Frame<br/>Cache"]
Clock["Clock<br/>(monotonic)"] --> SM["State<br/>Machine"]
Cache --> SM
Input["Input<br/>Events"] --> SM
SM --> GL["GL Renderer"]
Six components, each with a job. The next sections walk through them.
The Timing Model
Animated images store a per-frame duration, not a fixed framerate. A typical animated WebP loop looks like this — note that frame 3 displays for only half as long as the others (an intentional fast cut):
gantt
title Animated WebP frame timing (one loop ≈ 350 ms)
dateFormat x
axisFormat %L ms
section Frames
Frame 1 (100 ms) :0, 100
Frame 2 (100 ms) :100, 200
Frame 3 (50 ms — fast cut) :200, 250
Frame 4 (100 ms) :250, 350
The file formats store these durations per frame, and Pillow exposes them as info["duration"] (in milliseconds) for the current frame. A naive player ignores this and just calls time.sleep(1/30) between frames, which shows every frame for about 33 ms regardless of its stored duration. Playback drifts visibly on any image whose durations vary, and it is outright wrong even on simple cases like animated emoji, which often use 50 ms or 80 ms frames.
The right pattern uses a monotonic clock and an absolute deadline per frame:
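import time

def sleep_until(deadline):
    # Sleep only for the time remaining; never pass a negative value.
    time.sleep(max(0.0, deadline - time.monotonic()))

# frames and render() are supplied by the rest of the player.
deadline = time.monotonic()
for frame in frames:
    render(frame)
    # Absolute deadline, carried forward so a late frame does not shift later ones.
    deadline += frame.duration_ms / 1000.0
    sleep_until(deadline)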
time.monotonic() is critical: it never goes backward, unlike wall-clock time, which can jump if the user adjusts the system clock or NTP corrects it. Carrying the deadline forward, instead of resetting it each frame, is what corrects drift: a frame that renders late shortens the following sleep rather than pushing every later frame back. That is what keeps a multi-minute animation in sync.
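To see why the carried-forward deadline matters, here is a minimal simulation sketch. It uses a fake clock instead of real sleeping, and the simulate function, its parameters, and the fixed 2 ms per-frame overhead are assumptions made up for illustration, not part of the player described here.

```python
def simulate(durations_ms, overhead_ms, absolute):
    """Return how late (in ms) the last frame ends, using a fake clock.

    overhead_ms(i) is a hypothetical render/decode cost for frame i.
    """
    now = 0.0       # fake monotonic clock, in milliseconds
    deadline = 0.0  # absolute deadline, carried forward across frames
    for i, duration in enumerate(durations_ms):
        now += overhead_ms(i)          # rendering the frame takes time
        if absolute:
            deadline += duration
            now = max(now, deadline)   # equivalent of sleep_until(deadline)
        else:
            now += duration            # equivalent of time.sleep(duration)
    return now - sum(durations_ms)


durations = [100, 100, 50, 100] * 100            # 400 frames
relative_drift = simulate(durations, lambda i: 2.0, absolute=False)
absolute_drift = simulate(durations, lambda i: 2.0, absolute=True)
print(relative_drift, absolute_drift)            # 800.0 0.0
```

With relative sleeps, every frame's overhead is added on top of its duration, so 400 frames at 2 ms each end 800 ms late. With absolute deadlines the overhead is absorbed by a shorter sleep, as long as it stays under the frame's duration.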
The State Machine
The player has three primary states:
stateDiagram-v2
[*] --> PAUSED
PAUSED --> PLAYING: user clicks play
PLAYING --> PAUSED: user clicks pause / ESC
PLAYING --> WOBBLE_FRAME: user clicks image
WOBBLE_FRAME --> PAUSED: wobble released
PAUSED: PAUSED<br/>(frame N)
PLAYING: PLAYING<br/>(advancing)
WOBBLE_FRAME: WOBBLE_FRAME<br/>(frame 1 only)
PAUSED is the default state for any newly opened animated image. This sounds counterintuitive (why not autoplay?), but autoplay is hostile to two things: (1) accessibility, screen readers included, and (2) workflows where the user wants to inspect a single frame. Browsers learned this the hard way, and most now expose a prefers-reduced-motion media query so that pages can honor the user’s preference.
WOBBLE_FRAME is application-specific (an image viewer with interactive deformation), but the general pattern shows up in many players: input cancels playback. A user clicking on an animated image to do anything else — annotate, crop, share — almost always wants the animation to stop first.
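The diagram above maps naturally onto a transition table. The following is a minimal sketch: the state names come from the diagram, while the Event names, the TRANSITIONS table, and the choice to silently ignore events that have no listed transition are assumptions for illustration.

```python
from enum import Enum, auto


class State(Enum):
    PAUSED = auto()
    PLAYING = auto()
    WOBBLE_FRAME = auto()


class Event(Enum):
    PLAY = auto()
    PAUSE = auto()            # pause button or ESC
    CLICK_IMAGE = auto()
    WOBBLE_RELEASED = auto()


# One entry per arrow in the state diagram.
TRANSITIONS = {
    (State.PAUSED, Event.PLAY): State.PLAYING,
    (State.PLAYING, Event.PAUSE): State.PAUSED,
    (State.PLAYING, Event.CLICK_IMAGE): State.WOBBLE_FRAME,
    (State.WOBBLE_FRAME, Event.WOBBLE_RELEASED): State.PAUSED,
}


def next_state(state, event):
    """Return the new state; events with no listed transition leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.PAUSED                          # default for a newly opened image
state = next_state(state, Event.PLAY)         # State.PLAYING
state = next_state(state, Event.CLICK_IMAGE)  # State.WOBBLE_FRAME
```

Keeping the transitions in one table makes it easy to check the code against the diagram arrow by arrow.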
Decoder Pacing: Wait or Skip?
This is where naive implementations fall apart. The decoder isn’t free — Pillow has to seek, decode, and convert each frame. On a slow machine or a large APNG, decoding can occasionally take longer than the frame’s duration:
Frame budget: 100 ms each.
decode took 130 ms
v
Frames: --[1]--[2]--[3]--[4]--[5]--[6]X--[7]--
v v v v v
30 ms behind wall-clock
There are two possible responses:
Skip strategy: drop the late frame and jump to whichever frame the wall clock now demands.
                        wall-clock now
                               v
Frames: --[1]--[2]--[3]--[4]--[6]--[7]--
                            ^
                     frame 5 skipped
Wait strategy: render frame N when it’s ready, even if late. The clock catches up on the next frame budget.
Frames: --[1]--[2]--[3]--[4]--[5]----[6]--[7]--
                                  ^
                     130 ms (late, but rendered)
The skip strategy keeps wall-clock sync but causes visible jumps. The wait strategy preserves visual continuity but lets drift accumulate whenever the decoder stays slower than the frame budget. For most viewers, wait is the right default, because:
- Animated images are usually short (under 10 seconds). Drift never gets large enough to matter.
- A skipped frame is more visually disturbing than a slightly delayed one — humans notice missing frames more than slow ones.
- If the user doesn’t notice a 30ms hiccup, you’ve gained nothing by dropping a frame.
If you do choose skip, prefer doing it only when drift exceeds a noticeable threshold (say, 200 ms, which is six frames at 30 Hz or two of the 100 ms frames above).
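The "wait by default, skip past a threshold" rule can be sketched as a small pure function. The function names, the lag_ms parameter (how far playback is behind its deadline), and the looping behavior are assumptions for illustration. Only the 200 ms threshold comes from the text.

```python
def frame_for_elapsed(durations_ms, elapsed_ms):
    """Index of the frame the wall clock demands in a looping animation."""
    loop_ms = sum(durations_ms)
    if loop_ms <= 0:
        return 0  # degenerate animation: nothing to advance through
    t = elapsed_ms % loop_ms
    for index, duration in enumerate(durations_ms):
        if t < duration:
            return index
        t -= duration
    return len(durations_ms) - 1  # safety net; not reached when durations are positive


def choose_next_frame(current, durations_ms, elapsed_ms, lag_ms,
                      skip_threshold_ms=200):
    """Wait by default; jump to the wall-clock frame only when lag is noticeable."""
    if lag_ms <= skip_threshold_ms:
        return (current + 1) % len(durations_ms)        # wait strategy
    return frame_for_elapsed(durations_ms, elapsed_ms)  # skip strategy


durations = [100] * 10
print(choose_next_frame(4, durations, elapsed_ms=530, lag_ms=30))   # 5
print(choose_next_frame(4, durations, elapsed_ms=750, lag_ms=250))  # 7
```

A 30 ms hiccup simply advances to the next frame, while a 250 ms lag jumps straight to the frame the clock is asking for.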
Memory Strategies
Three reasonable approaches:
| Strategy | Memory | Decode latency | Best for |
|---|---|---|---|
| Preload all | High (full image × N frames) | Zero per frame | Short animations, plenty of RAM |
| On-demand | Low (one frame) | Full decode each frame | Long animations, memory-constrained |
| Sliding window | Medium (k frames around current) | Zero for in-window, full for out | Balanced workloads |
A useful heuristic:
total_frame_bytes = width * height * n_frames * 4  # 4 bytes per RGBA pixel
if total_frame_bytes < 200 * 1024 * 1024:  # under roughly 200 MB
    preload_all()
else:
    sliding_window(k=8)
200 MB is roughly the point where a typical desktop user starts noticing memory pressure from a single image, and 8 frames of slack covers most decoding hiccups without ballooning memory.
For an interactive image viewer where the user can scrub or jump around, preload-all is dramatically simpler and almost always feasible — animated images are usually small. For a video player, sliding window is mandatory.
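For the sliding-window row of the table, here is one possible sketch. The SlidingWindow class, its decode callback (frame index in, decoded frame out), and the choice to keep the window ahead of the current frame are assumptions for illustration. The refill is synchronous to keep the example short, whereas a real player would likely prefetch on a worker thread.

```python
class SlidingWindow:
    """Keep the current frame plus the next k-1 frames decoded (looping)."""

    def __init__(self, decode, n_frames, k=8):
        self._decode = decode          # hypothetical callable: index -> frame
        self._n = n_frames             # assumed to be at least 1
        self._k = min(k, n_frames)
        self._frames = {}

    def get(self, index):
        """Return frame `index` and refill the window ahead of it."""
        wanted = [(index + offset) % self._n for offset in range(self._k)]
        # Drop frames that have fallen out of the window.
        for stale in set(self._frames) - set(wanted):
            del self._frames[stale]
        # Decode whatever is missing from the window.
        for i in wanted:
            if i not in self._frames:
                self._frames[i] = self._decode(i)
        return self._frames[index]


decoded = []

def fake_decode(i):
    decoded.append(i)                  # record which frames were decoded
    return f"frame-{i}"

window = SlidingWindow(fake_decode, n_frames=20, k=8)
window.get(0)   # decodes frames 0..7
window.get(1)   # drops frame 0, decodes only frame 8
```

During sequential playback each call decodes at most one new frame, and memory stays bounded at k frames.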
Format Quirks
Pillow handles all three popular formats, but each has gotchas:
| Format | Quirk | Mitigation |
|---|---|---|
| WebP animated | info["duration"] is sometimes a list (one per frame), sometimes a single int (constant duration). | Coerce to per-frame list at load time. |
| GIF | Frames are in palette mode, not RGB. Direct pixel access requires converting. | Always call .convert("RGBA") before rendering. |
| APNG | Read and write support arrived in Pillow 7.1. Earlier versions silently treat it as static PNG. | Pin minimum Pillow version, or check n_frames > 1. |
A clean abstraction normalizes these into a single Frame(image_rgba, duration_ms) representation right after decode, and the rest of the player never needs to know which format produced it.
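A normalization layer along these lines might look like the sketch below. The Frame field names come from the text. The per_frame_durations and load_frames helpers and the 100 ms fallback for a missing duration are assumptions for illustration. Pillow is imported inside load_frames so the pure helper can be used without it.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Frame:
    image_rgba: Any     # a PIL.Image.Image in RGBA mode
    duration_ms: int


def per_frame_durations(raw, n_frames, default_ms=100):
    """Coerce a duration that may be an int, a list, or missing into a list."""
    if raw is None:
        return [default_ms] * n_frames
    if isinstance(raw, (int, float)):
        return [int(raw)] * n_frames
    durations = [int(d) for d in raw]
    # Pad a short list with its last value (or the default if it is empty).
    fill = durations[-1] if durations else default_ms
    return (durations + [fill] * n_frames)[:n_frames]


def load_frames(path):
    """Decode every frame of an animated image into Frame objects."""
    from PIL import Image, ImageSequence  # lazy import; requires Pillow

    frames = []
    with Image.open(path) as im:
        for i, page in enumerate(ImageSequence.Iterator(im)):
            raw = page.info.get("duration")
            duration = per_frame_durations(raw, i + 1)[i]
            # convert() returns a new image, so it outlives the open file.
            frames.append(Frame(page.convert("RGBA"), duration))
    return frames


print(per_frame_durations(80, 3))          # [80, 80, 80]
print(per_frame_durations([100, 50], 3))   # [100, 50, 50]
```

Because load_frames converts every frame to RGBA up front, it is effectively the preload-all strategy. A sliding-window player would apply the same normalization one frame at a time.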
Edge Cases Worth Handling
Two surprises that bit me in practice:
File modification while playing. If the file’s mtime changes mid-playback (common when the user is editing the image in another tool), the decoder’s internal seek state can become inconsistent. The fix is to stop treating mtime as a “did this file change” signal during playback: track explicit file-changed events from the OS instead, or just keep the decoder bound to the open file handle for the duration of the session.
Truncated files. Some animated WebPs found in the wild have a frame count in the header that doesn’t match the actual decoded frames available. Catch EOFError from seek() and treat the last successfully decoded frame as the loop point.
Header says: 24 frames
Decoder finds: 18 frames before EOFError
|
v
Loop back to frame 1 here
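Finding the real loop point can be sketched as a probe that seeks through the declared frames until one fails. The count_decodable_frames helper is a hypothetical name. It only relies on the image object having a seek() method that raises EOFError, as described above. Corrupt data can surface as other exceptions too (for example when pixel data is actually loaded), which this sketch does not handle.

```python
def count_decodable_frames(im, declared_frames):
    """Seek through the declared frames; return how many were reachable."""
    decoded = 0
    for index in range(declared_frames):
        try:
            im.seek(index)
        except EOFError:
            break          # the header promised more frames than the file holds
        decoded += 1
    return decoded
```

With a Pillow image, a call might look like count_decodable_frames(im, getattr(im, "n_frames", 1)). The player then loops over that many frames instead of trusting the header.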
Wrapping Up
A frame-accurate animated image player is roughly 300-500 lines of Python on top of Pillow once you’ve nailed:
- Monotonic-clock-based deadline timing
- A small explicit state machine
- A “wait, don’t skip” decoder pacing default
- Preload-all memory unless the image is unusually large
- Per-format normalization at the decode boundary
If you’re considering building one, the biggest hidden cost is testing — the only way to catch timing bugs is with a stopwatch and a wall of test images that exercise every duration pattern you can think of. Open-source test corpora like the WebP gallery and Mozilla’s APNG samples are good starting points.
Once the player is working, the natural next features are frame export (let the user save individual frames as PNG), per-frame seek scrubbing, and palette-aware GIF rendering for the rare case where palette transparency matters. Each is its own rabbit hole — but with a solid timing model and state machine in place, they’re all straightforward additions rather than rewrites.