1mb movie

A useless but radical compression project: getting films down to 1mb and back again with language models

A 4×3 grid of mid-clip frames from twelve films reconstructed by the lossy pipeline.

How could we compress a film to <1mb while preserving the most quality?

The dimensions we normally compress video along are resolution (1080p, 480p, etc.), frame rate (movies and TV are usually 24fps), and audio bitrate (160kbps is Spotify’s High quality). Can we crank those down far enough to get under 1mb? We’ll look at that dead end in the next section.

Shakespeare’s plays each probably fit into 1mb as text, but a script is not a film. A film or TV show carries far more detail, like one specific staging of a play.

Language models are good at summarising information, so they seem ideal for compression. Multi-modal models can work with image and audio inputs.

LMs are also ideal for the de-compression, because they’re good at generating video and audio from small text prompts.

In this article I’ll go through my experience, and what I learned, achieving radical compression with language models at the lowest cost.

How we normally compress films

The lowest bitrates we can reach with conventional codecs are around 10kbps for video (AV1) and 6kbps for speech-only audio (Opus), and a whole film at those rates is still far over budget:

Source — 15 sec, ~521 kbps H.264, 978 KB. 124.7 min at this rate ≈ 470 MB.
100 kbps AV1 — 250 KB for 15s. 124.7 min ≈ 93 MB.
30 kbps AV1, 320×136, 12fps — 78 KB. 124.7 min ≈ 29 MB.
10 kbps AV1, 160×68, 8fps — 39 KB. The codec floor. 124.7 min ≈ 15 MB. Still 15× over budget.
6 kbps Opus speech-only — Opus’s hard floor. 124.7 min ≈ 5.6 MB.
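The full-film figures above are just bitrate multiplied by running time. Here is a minimal sketch of that arithmetic, assuming a constant bitrate and ignoring container overhead (which is why the measured 15-second files don't line up exactly with their nominal bitrates). The function names are mine.

```python
FILM_MINUTES = 124.7  # Star Wars IV running time used throughout the post


def size_mb(kbps: float, minutes: float) -> float:
    """File size in MB (10^6 bytes) for a constant bitrate over a running time."""
    seconds = minutes * 60
    return kbps * 1000 / 8 * seconds / 1e6


def max_kbps(budget_mb: float, minutes: float) -> float:
    """Total bitrate (video + audio combined) that fits inside a size budget."""
    return budget_mb * 1e6 * 8 / (minutes * 60) / 1000


for label, kbps in [("AV1 at 100 kbps", 100), ("Opus at 6 kbps", 6)]:
    print(f"{label}: {size_mb(kbps, FILM_MINUTES):.1f} MB")

print(f"1 MB budget allows {max_kbps(1, FILM_MINUTES):.2f} kbps in total")
```

Run the other way round, a 1 MB budget over 124.7 minutes allows only about 1 kbps for video and audio together, which is well below the codec floors listed above.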

Encode: make the film get really small

Hagrid and Hedwig the snowy owl, reconstructed by the lossy pipeline (character-identity-enriched generation).
Prompt fed to Wan 2.1 (character-enriched): “Rubeus Hagrid: A large, imposing man with a massive frame, characterized by his long, wild, unkempt dark hair and a thick, bushy beard…”

Shot by shot

Since VLMs (video language models) generate short videos of around 5-10 seconds and shots in films tend to be a similar length, it would seem that VLMs could be used to generate a film as a sequence of shot videos.

It turns out there’s a Python library for finding shot boundaries in films: PySceneDetect.

Star Wars, for example, has around 2,000 shots, which this library detects pretty well. There are a few exceptions, like the full-screen laser flashes in the first scene, which cause a few erroneous extra shots to be detected.
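For reference, shot detection with PySceneDetect can be sketched roughly like this. `detect` and `ContentDetector` are the library's own; the `merge_short_shots` helper is a hypothetical clean-up I'm adding to illustrate one way to absorb flash-induced false cuts, and the 0.25s cutoff is an assumption, not a tuned value.

```python
def detect_shots(video_path: str, threshold: float = 27.0) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) for each shot PySceneDetect finds."""
    # Imported lazily so the helper below works without PySceneDetect installed.
    from scenedetect import ContentDetector, detect

    scenes = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]


def merge_short_shots(shots, min_len=0.25):
    """Fold any shot shorter than min_len seconds into the shot before it."""
    merged = []
    for start, end in shots:
        if merged and end - start < min_len:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return merged


# A laser flash splits one real shot into a long piece and a 0.1s sliver.
raw = [(0.0, 4.0), (4.0, 4.1), (4.1, 9.0)]
print(merge_short_shots(raw))
```

The cutoff needs care: some genuine shots are well under a second, so anything aggressive here would start eating real cuts.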

ffmpeg is a popular toolset for working with videos. You can use it to extract all sorts of things from videos like individual frames, audio, and clips.
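As a rough illustration, these are the kinds of ffmpeg invocations involved, built as argument lists in Python. The flags are standard ffmpeg options; the helper names, file names, and the choice of libx264 for clips are my own assumptions rather than the exact commands used in the pipeline.

```python
import subprocess


def frame_cmd(src, t, out_png, width=512):
    """Grab a single frame at time t (seconds), scaled to the given width."""
    # scale=W:-2 keeps the aspect ratio and rounds the height to an even number.
    return ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", src,
            "-frames:v", "1", "-vf", f"scale={width}:-2", out_png]


def audio_cmd(src, start, end, out_wav):
    """Extract a shot's audio as 16 kHz mono WAV."""
    # -vn drops the video stream; -ac 1 is mono; -ar 16000 is the sample rate.
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-t", f"{end - start:.3f}",
            "-i", src, "-vn", "-ac", "1", "-ar", "16000", out_wav]


def clip_cmd(src, start, end, out_mp4):
    """Cut one shot out of the film as its own silent video clip."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-t", f"{end - start:.3f}",
            "-i", src, "-an", "-c:v", "libx264", out_mp4]


def run(cmd):
    """Execute a command list, raising if ffmpeg exits with an error."""
    subprocess.run(cmd, check=True)


print(" ".join(frame_cmd("film.mp4", 4.4, "shot_0001_f1.png")))
```

Building the commands as lists rather than shell strings avoids quoting problems with awkward file names.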

Describe each shot

Instead of compressing a film by reducing the pixels or sampling rates, we’ll use an image LM, Google’s Gemini 3.1 Flash-Lite, to summarise each shot in the film, given a few frames sampled from the shot. This Gemini Flash-Lite model is cheap at $0.075 per million input tokens and $0.30 per million output tokens.
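Picking which frames to send is simple enough to sketch. This assumes roughly one frame per second of shot, clamped to between 2 and 8 frames, and samples at the centre of equal slices so the cut frames themselves are avoided. That rule is my guess at a reasonable policy, not necessarily the exact one used.

```python
def sample_times(start: float, end: float,
                 min_frames: int = 2, max_frames: int = 8) -> list[float]:
    """Evenly spaced timestamps (seconds) inside a shot, about one per second."""
    duration = end - start
    n = max(min_frames, min(max_frames, round(duration)))
    step = duration / n
    # Centre of each of the n equal slices of the shot.
    return [round(start + step * (i + 0.5), 3) for i in range(n)]


print(sample_times(0.0, 8.8))     # a long shot hits the 8-frame cap
print(sample_times(10.0, 10.75))  # a sub-second shot still gets 2 frames
```

Each timestamp can then be turned into a 512px frame with ffmpeg and attached to the captioning request.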

Going through security, The Matrix (8.8 seconds)

Close on a black coat and duffel bag as the camera begins to pull back. The camera pulls back further in the lobby. Trinity begins to come into frame. Trinity revealed in a black vinyl coat and sunglasses. Trinity and Neo walking forward together through the lobby. Neo reaches into his coat for his holstered guns. The pair continue walking toward the camera. Wide view of Neo and Trinity striding through the lobby.
The 8 frames sampled from this 8.8s shot — the only input the vision model saw.
{
  "shot_type": "medium",
  "camera_movement": "slow zoom out",
  "subjects": "A woman in a black vinyl coat and a man in a black trench coat, both wearing dark sunglasses.",
  "action": "The two characters walk forward together through a lobby. The man draws two handguns from his coat holsters simultaneously as they continue walking.",
  "lighting": "High-contrast, cool-toned interior lighting with bright highlights reflecting off the vinyl fabric.",
  "color_palette": "Black, dark grey, muted green, and white.",
  "mood": "Intense, cool, determined, and action-oriented.",
  "setting": "A modern, sterile office building lobby with stone walls and a green 'EXIT' sign in the background.",
  "sound": "A mix of intense, rhythmic electronic music, ambient room tone, and the sharp, metallic sound of guns being drawn."
}

Clustering on characters

What about audio?

Movies have a few layers of audio: dialogue, music, and sound effects.

We’ll at least split up dialogue from music and sound effects.

I’ll cheat a bit with dialogue by directly using the subtitles for the film, which are sometimes embedded in the film itself.
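Matching subtitle lines to shots is mostly a matter of parsing SRT timestamps. A minimal sketch, assuming a standard SRT file (for embedded subtitles, something like `ffmpeg -i film.mkv -map 0:s:0 subs.srt` can pull the first text subtitle track out) and assuming a line belongs to whichever shot it starts in:

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp like 00:01:02,500 to seconds."""
    h, m, s, ms = TIME.match(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text: str) -> list[tuple[float, float, str]]:
    """Return (start_sec, end_sec, line) for each cue in an SRT file."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = lines[1].split("-->")
        cues.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return cues


def dialogue_for_shot(cues, shot_start, shot_end):
    """Subtitle lines whose start time falls inside the shot."""
    return [text for start, _, text in cues if shot_start <= start < shot_end]


srt = """1
00:00:01,000 --> 00:00:02,500
Hello.

2
00:00:09,200 --> 00:00:11,000
I find your lack
of faith disturbing.
"""
print(dialogue_for_shot(parse_srt(srt), 9.0, 13.0))
```

A line that straddles a cut ends up entirely in the earlier shot under this rule, which is crude but keeps every line exactly once.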

Listening for sounds

We can use an audio classification model like YAMNet to detect which of 521 everyday sound classes are present at which times.

YAMNet · one reading every 0.48s · 521 class scores
+0.0s Music 0.15 · Arrow 0.09
+1.4s Music 0.87 · Timpani 0.01
+3.3s Domestic animals 0.19 · Dog 0.16
+5.2s Outside, urban 0.12 · Vehicle 0.08
+7.1s Inside, small room 0.17 · Bicycle 0.16
↓ average the 20 frames in this shot, keep the top 5, pick a bucket
per-shot audio summary: bucket "effects" · Music 0.14 · Inside, small room 0.06 · Typing 0.05 · Explosion 0.05 · Gunshot 0.05
↓ hand to the vision model with the sampled frames + subtitle lines
written into the shot entry: "sound": "A mix of intense, rhythmic electronic music, ambient room tone, and the sharp, metallic sound of guns being drawn."
The Matrix lobby shot (8.8s) down the audio side of the encoder. Noisy per-frame guesses — a dog, a bicycle — wash out when the shot’s 20 frames are averaged, leaving a stable bucket and top labels that seed the vision model’s final sound description.
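The averaging step can be sketched in a few lines. This assumes the classifier's output has already been collected as one list of class scores per 0.48s reading. The toy example uses three classes instead of 521, and the bucket-picking step is left out because it depends on a grouping of classes that isn't described here.

```python
def summarise_audio(frame_scores, class_names, top_k=5):
    """Average per-frame class scores over a shot and keep the top_k labels."""
    n_frames = len(frame_scores)
    means = [sum(frame[i] for frame in frame_scores) / n_frames
             for i in range(len(class_names))]
    ranked = sorted(zip(class_names, means), key=lambda pair: pair[1], reverse=True)
    return [(name, round(score, 2)) for name, score in ranked[:top_k]]


classes = ["Music", "Dog", "Bicycle"]
frames = [
    [0.9, 0.0, 0.0],
    [0.8, 0.2, 0.0],  # one noisy reading hears a dog
    [0.7, 0.0, 0.1],  # another hears a bicycle
]
print(summarise_audio(frames, classes, top_k=2))
```

One-off guesses like the dog get divided by the number of frames, while a label present throughout the shot keeps its score, which is the wash-out effect described above.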

Putting it together: the lossy compression encode pipeline

Film · ~600 MB · ~2 hr · 1080p
↓ detect shot boundaries · PySceneDetect
~2,070 shots · ~3.6 s each
↓ for each shot: extract, then analyse
vision: 2–8 frames @ 512px (ffmpeg) → Gemini VLM → scene caption
motion: optical flow (Farnebäck) → camera move: pan / zoom / static
sound: audio, 16 kHz mono (ffmpeg) → YAMNet → bucket + top labels
speech: subtitles, SRT (ffmpeg) → dialogue lines + timings
(motion, sound & speech also feed the VLM as hints)
↓ fuse into one entry per shot
per-shot JSON entry: shot_type · camera_movement · subjects · action · lighting · color_palette · mood · setting · sound · dialogue
↓ collect all ~2,070 entries + character clusters
Manifest: manifest.json · ~88 KB · the whole compressed film (every shot entry + character clusters) · ≈ 7000× smaller
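To make the final step concrete, here is one way a manifest could be packed and measured. Compact JSON plus gzip is an assumption on my part for illustration, and the `shots` / `characters` layout is hypothetical.

```python
import gzip
import json


def pack_manifest(shots: list, characters: dict) -> bytes:
    """Serialise the manifest as compact JSON, then gzip it."""
    manifest = {"shots": shots, "characters": characters}
    compact = json.dumps(manifest, separators=(",", ":"), ensure_ascii=False)
    return gzip.compress(compact.encode("utf-8"))


def unpack_manifest(blob: bytes) -> dict:
    """Inverse of pack_manifest."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    return original_bytes / compressed_bytes


shots = [{"shot_type": "medium", "camera_movement": "slow zoom out",
          "mood": "Intense, cool, determined, and action-oriented."}] * 3
blob = pack_manifest(shots, {"Neo": "A man in a black trench coat and sunglasses."})
print(len(blob), "bytes packed")
print(f"{compression_ratio(600_000_000, 88_000):.0f}x")  # ~600 MB film vs ~88 KB manifest
```

Shot entries repeat the same keys and a lot of the same vocabulary, so general-purpose compression tends to do well on them.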

Decode: making the least crap film from the smallest amount of information for the lowest cost

A reconstructed frame from Fantasia 2000, generated by the lossy pipeline.
Prompt fed to Wan 2.1: “a man and a woman in 1940s-style business attire, holding a small child with a red ball above their heads. the camera zooms out to reveal the couple standing in a spotlight…”

So now we have a set of shot descriptions, one for each shot in the movie:

001 · 0:00–0:09 · A black coat and duffel bag; the camera pulls back to reveal two figures in dark sunglasses walking through a sterile office lobby…
002 · 0:09–0:13 · A woman in a black vinyl coat strides forward as the man beside her draws two handguns from his coat holsters…
003 · 0:13–0:18 · Wide shot of the pair advancing toward the camera, high-contrast cool lighting, a green EXIT sign behind them…
⋮  

Reminder: each description includes shot_type, camera_movement, subjects, action, lighting, color_palette, mood, setting, sound, and dialogue.

Now we’re ready to generate a video for each of these shots with a video language model, before stitching them together to create the movie, whole again!

Selecting a video model

I’m going for the lowest cost for a plausible movie output in 480p/720p.

The best model for output quality is currently ByteDance’s Seedance 2, but it’s too expensive at around $1,100 for a 2-hour movie. It’s also not open-weight, so we’d have to choose from a few hosting providers.

Alibaba’s Wan 2.1 and 2.2 are older models from early/mid 2025. The quality is noticeably worse, but they’re dramatically cheaper to run. They’re open-weight, and the ~10GB (to confirm) of memory they need fits on an Nvidia RTX 4090 GPU, which matters for self-hosting.

Wan 2.1: reconstruction of the opening Star Wars space shot: a blocky ship firing a red laser, surrounded by a busy field of planets and lens flares.
Seedance 1.0 Pro: reconstruction of the same shot: a cleaner, more cinematic ship firing a red laser over a single planet against a starfield.
Same shot, same prompt — Star Wars IV shot 011 (“a spacecraft moves across the frame… a red laser beam fires from the rear of the ship… deep space, orbiting a large brownish-orange planet”). Wan 2.1 (left) clutters the frame and reads as painterly; Seedance 1.0 Pro (right) holds a cleaner, more intentional composition. That quality gap is what makes the cost table below interesting.
| Model | Where | cost / sec | whole film (~2 hr) |
|---|---|---|---|
| Wan 2.1 T2V 1.3B | self-hosted (RunPod 4090) | $0.002 | ≈$16 |
| Wan 2.2 T2V Fast | paid API (Replicate) | ≈$0.01 | ≈$74 |
| Seedance 1.0 Pro | paid API (fal) | $0.05 | ≈$370 |
| Seedance 2.0 | paid API (list price) | ≈$0.15 | ≈$1,100 |
Measured from our own runs on Star Wars IV (124.7 min); Seedance 2.0 is published list price (≈3× Seedance 1.0), not run. Full-film figures extrapolate the per-second rate to one clip per shot — retries add to it. Note the $16-vs-$74 gap is mostly a small model on a rented GPU, not self-host-vs-API alone: the cheap Wan 2.2 API is a distilled “Fast” variant.
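The whole-film column is the per-second rate multiplied by the film's running time. A quick sketch of that extrapolation follows. The `retry_rate` parameter is a hypothetical knob for failed generations, and because the rates in the table are rounded, the results land near the table's figures rather than exactly on them.

```python
FILM_SECONDS = 124.7 * 60  # Star Wars IV

RATES_PER_SEC = {
    "Wan 2.1 T2V 1.3B (self-hosted)": 0.002,
    "Wan 2.2 T2V Fast (API)": 0.01,
    "Seedance 1.0 Pro (API)": 0.05,
    "Seedance 2.0 (list price)": 0.15,
}


def film_cost(rate_per_sec: float, seconds: float = FILM_SECONDS,
              retry_rate: float = 0.0) -> float:
    """Dollars to generate `seconds` of video, with a fraction re-generated."""
    return rate_per_sec * seconds * (1 + retry_rate)


for model, rate in RATES_PER_SEC.items():
    print(f"{model}: ${film_cost(rate):,.0f}")

# With one clip in ten regenerated, every row grows by 10%.
print(f"Seedance 1.0 Pro with 10% retries: ${film_cost(0.05, retry_rate=0.1):,.0f}")
```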

Self hosting

| Wan 2.2 5B, same scene (11 clips, 33.8s) | Billed by | cost / sec | scene total |
|---|---|---|---|
| fal.ai, hosted | per clip ($0.15) | ≈$0.05 | $1.65 |
| RunPod A6000, self-hosted | GPU time ($0.33 / hr) | ≈$0.01 | $0.33 |
The exact same weights — Wan 2.2 5B, wan2.2_ti2v_5B_fp16 — run both ways, so the only variable is who owns the GPU. Measured on one real Star Wars scene: 6 shots, 11 clips, 33.8s. fal charges a flat $0.15 per clip however short the clip; self-hosting pays only for the A6000’s time (full pod uptime at $0.33/hr). ~5× cheaper here — and that’s a pessimistic case for self-hosting, since a single short scene barely amortizes the pod’s spin-up and model load; across a whole film the per-clip GPU cost falls further.
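The two billing models can be compared with a little arithmetic. In this sketch the one hour of pod uptime is inferred from the $0.33 scene total at $0.33/hr, so treat it as an assumption.

```python
SCENE_SECONDS = 33.8
N_CLIPS = 11


def per_clip_cost(n_clips: int, price_per_clip: float) -> float:
    """Hosted API: a flat price per clip, however short the clip is."""
    return n_clips * price_per_clip


def gpu_time_cost(pod_hours: float, price_per_hour: float) -> float:
    """Self-hosted: pay for pod uptime, however many clips it produces."""
    return pod_hours * price_per_hour


hosted = per_clip_cost(N_CLIPS, 0.15)
self_hosted = gpu_time_cost(1.0, 0.33)  # assumed: about one hour of uptime

print(f"hosted:      ${hosted:.2f} (${hosted / SCENE_SECONDS:.3f}/sec)")
print(f"self-hosted: ${self_hosted:.2f} (${self_hosted / SCENE_SECONDS:.3f}/sec)")
print(f"ratio:       {hosted / self_hosted:.1f}x")

# Clips per pod-hour needed before self-hosting beats the per-clip price.
print(f"break-even:  {0.33 / 0.15:.1f} clips per hour")
```

The break-even is only a couple of clips per pod-hour, so self-hosting wins as soon as the GPU is kept even slightly busy.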

The continuity problem: every shot, a different actor

Four reconstructed frames of C-3PO from the same Lars homestead scene in Star Wars IV. He is a different gold droid in each — a different head and face every time.
C-3PO across four shots of one scene (Star Wars IV, shots 279–285), all generated by the same Wan 2.1 model. He’s a different gold droid in every shot — a new head, a new face — because the model only ever saw a fresh ~100-word description of the shot, never the droid it had already drawn.

Since we’re stitching together clips that are generated independently, continuity becomes a major obstacle to making something coherent.

Something that surprised me about VLMs was how small the input prompt is. The sweet spot for these models is generally on the order of 50–150 words:

Star Wars IV · shot 011 · fed to Wan 2.1 · 109 words

A large, blocky spacecraft with a grid of bright glowing thrusters. The spacecraft moves steadily across the frame against the backdrop of a planet and a moon, while a red laser beam fires from the rear of the ship, dissipating as it travels through space. Extreme Wide shot, static. Deep space, orbiting a large, textured, brownish-orange planet with a smaller moon visible in the distance. High contrast space lighting; the ship is illuminated by its own thrusters and the distant light of a star, while the planet below is lit by a sun off-screen. Deep black, vibrant orange, muted grey, and bright white. Epic, cinematic, and adventurous.

That’s the budget we have for describing a shot in our de-compressed movie. So if we want Luke Skywalker to look the same across the clips we’re generating we may only have 100 words to describe Mark Hamill’s face, stature, pose, costume, expression, etc. Maybe other characters are in the shot too.
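Turning a shot entry into a prompt that respects this budget might look something like the sketch below. The field order is my reading of the shot 011 prompt above (subjects, action, framing, setting, lighting, palette, mood), and the hard word cutoff is a deliberately crude assumption.

```python
PROMPT_FIELDS = ["subjects", "action", "shot_type", "camera_movement",
                 "setting", "lighting", "color_palette", "mood"]


def build_prompt(entry: dict, max_words: int = 150) -> str:
    """Concatenate the visual fields of a shot entry, capped at max_words."""
    parts = []
    for field in PROMPT_FIELDS:
        text = entry.get(field, "").strip()
        if text:
            parts.append(text if text.endswith(".") else text + ".")
    words = " ".join(parts).split()
    return " ".join(words[:max_words])  # crude cutoff: later fields are lost first


entry = {
    "shot_type": "medium",
    "camera_movement": "slow zoom out",
    "subjects": "A woman in a black vinyl coat and a man in a black trench coat, both wearing dark sunglasses.",
    "action": "The two characters walk forward together through a lobby.",
    "mood": "Intense, cool, determined, and action-oriented.",
    "sound": "Rhythmic electronic music.",  # not visual, so never sent to the video model
}
prompt = build_prompt(entry)
print(len(prompt.split()), "words:", prompt)
```

Because the cutoff drops whatever comes last, the field order doubles as a priority order: anything added for character identity has to compete with lighting, palette and mood for the same words.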

Character clustering: an attempt

One thing I tried to mitigate the continuity problem was character clustering.

A 3×2 grid of character headshots generated from the pipeline's clustered descriptions: Luke Skywalker, Gandalf, Harry Potter, Rubeus Hagrid, Larry David, and Jimmy McGill.
One canonical portrait per character, built by clustering every shot a character appears in into a single ~60-word description and rendering it with FLUX. The idea: give the video model one fixed reference per character instead of a fresh guess each shot. The wardrobe and silhouette survive the round-trip — the specific actor’s face does not.
A character reference portrait of Luke Skywalker generated from the pipeline's clustered text description.
Clustered from 509 shots of Star Wars IV, rendered with FLUX: “A young male human with fair skin, light brown hair, and blue eyes… a bright orange Rebel Alliance flight suit over a white thermal undershirt… sitting in the cockpit of an X-wing starfighter.”

Long shots

Wan 2.1/2.2 models support output lengths of only around 3-6 seconds. That’s not enough for some shots, like The Dude browsing for milk at the beginning of The Big Lebowski, which goes on for 50 seconds or so.

There’s a good solution to this: we can use the last frame of an output clip as the first frame of the next clip with the first_image parameter, then join the clips into one shot. It works well:

chained: last frame → first frame
independent clips
A landspeeder crossing the desert — two ~5s Wan 2.2 clips joined into one ~10s shot, with the seam at the halfway point. Left: the second clip is seeded with the last frame of the first, so the speeder and the mesas behind it carry across the cut. Right: the two clips generated independently — at the seam the speeder becomes a different vehicle in a different canyon.
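The chaining logic itself is a short loop. In this sketch `generate_clip` and `last_frame` are hypothetical callables standing in for the video model call and a frame-extraction step. Only the `first_image` parameter name comes from the text above, and the 5-second maximum is an assumption.

```python
import math


def plan_segments(shot_seconds: float, max_clip_seconds: float = 5.0) -> list[float]:
    """Split a long shot into equal-length segments no longer than the model's limit."""
    n = max(1, math.ceil(shot_seconds / max_clip_seconds))
    return [shot_seconds / n] * n


def generate_long_shot(prompt, shot_seconds, generate_clip, last_frame,
                       max_clip_seconds=5.0):
    """Generate clips in sequence, seeding each with the previous clip's last frame."""
    clips, seed = [], None
    for seconds in plan_segments(shot_seconds, max_clip_seconds):
        clip = generate_clip(prompt, seconds, first_image=seed)
        clips.append(clip)
        seed = last_frame(clip)  # becomes first_image for the next segment
    return clips


# Stand-ins so the loop can be exercised without a GPU.
seeds_seen = []

def fake_generate(prompt, seconds, first_image=None):
    seeds_seen.append(first_image)
    return f"clip{len(seeds_seen)}"

def fake_last_frame(clip):
    return clip + ":last"

print(generate_long_shot("a landspeeder crosses the desert", 10.0,
                         fake_generate, fake_last_frame))
print(seeds_seen)
```

The catch is that chaining is sequential: each segment has to wait for the one before it, so a 50-second shot can't be parallelised the way independent shots can.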

Short shots

Some movie shots are under a second — shorter than the ~2s floor that Wan will generate. Every model I tried bottoms out around there (Seedance too), so there’s no way to ask for a clip that short.

So the stitcher squeezes the ~2s clip into the slot it needs: same frames, retimed to play faster. For a sub-second shot that’s a ~3× speed-up, and the motion ends up looking like fast-forward — though at well under a second it flashes by too fast to really notice. (The other option would be to trim the clip rather than speed it up — keep the first 0.75s at natural speed and drop the rest — but I haven’t bothered.)

natural speed — the ~2s clip Wan generates
squeezed to fit a 0.75s shot
A 0.75s shot from The Matrix — gloved hands typing as the camera punches in. Left: the clip Wan actually returns, 33 frames at its ~2s floor, playing at natural speed. Right: the same 33 frames squeezed into the 0.75s slot — no frames dropped, just retimed to ~45fps, so the typing and zoom blur into a fast-forward.
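The retiming arithmetic, with the ffmpeg filter that would apply it, can be sketched as follows. The 16fps native rate is an assumption that fits 33 frames lasting about 2 seconds, and `setpts` is ffmpeg's standard filter for rescaling timestamps.

```python
def retime(n_frames: int, slot_seconds: float, native_fps: float = 16.0) -> dict:
    """Speed-up factor and playback fps needed to fit all frames into the slot."""
    native_seconds = n_frames / native_fps
    return {
        "native_seconds": native_seconds,
        "speedup": native_seconds / slot_seconds,
        "fps": n_frames / slot_seconds,
    }


def retime_cmd(src: str, out: str, speedup: float, fps: float) -> list[str]:
    """ffmpeg command that compresses timestamps by `speedup`, keeping every frame."""
    return ["ffmpeg", "-y", "-i", src, "-an",
            "-vf", f"setpts=PTS/{speedup:.4f}", "-r", f"{fps:.2f}", out]


plan = retime(33, 0.75)
print(plan)
print(" ".join(retime_cmd("shot_raw.mp4", "shot_fit.mp4", plan["speedup"], plan["fps"])))
```

For the 0.75s Matrix shot this gives a 2.75× speed-up at 44fps, in line with the roughly 3× and ~45fps described above.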

Generating the audio

For dialogue we’ll just use the cheapest (dullest) ElevenLabs voice to keep costs low. We can’t attribute dialogue to characters from the subtitles alone anyway; inferring that is a harder problem, for another day…

Darth Vader’s “I find your lack of faith disturbing” — delivered mid Force-choke — read back by the cheapest ElevenLabs voice. James Earl Jones’ menace is exactly the kind of performance the round-trip can’t write down.

Music & sound effects

The bowling-alley scene from The Big Lebowski, reconstructed with Wan 2.1 video and an MMAudio sound track generated from the shot descriptions alone — no original audio. The ambient alley noise, the distant clatter of pins and ball returns, is entirely the model’s invention.
| Audio | Model | rate | whole film (~2 hr) |
|---|---|---|---|
| Ambience + SFX | MMAudio, on the video pod | $0 marginal | $0 |
| Ambience + SFX | MMAudio, fal.ai hosted | $0.002 / sec | ≈$2.60 |
| Extra SFX | ElevenLabs SFX | $0.001 / sec | ≈$0.50 |
| Dialogue | ElevenLabs Speech | $0.05 / 1K chars | ≈$2.70 |
| Music / score | not generated | | $0 |
Audio is the cheap part. Best case is $0: MMAudio generates ambience and sound effects on the same RunPod GPU already rented for the video, so it adds no marginal cost. Run it on a hosted API instead and it’s a couple of dollars (≈$2.60 measured on Star Wars IV); dialogue through ElevenLabs adds another ≈$2.70 (55K characters of subtitles). Music never enters the round-trip at all — the model was never told a score existed — so the score is simply gone.

The result: decompressed 1mb films

Reflections

What astonishing film compression! We took a 632 MB film down to under 1 MB. That’s at least 600× smaller, and closer to 7000× going by the ~88 KB manifest.

Now we can carry 1000 films on a 1gb usb drive, and to watch one of them we’ll only need to pay ~$35NZD and wait ~12 hours for processing.

Our decompressed films are lossy. But the output is indeed a full-length 480p (or up to 1080p for more $) video which retains core aspects of the film including all dialog, and a basic visual and audio reconstruction (incl. some of the music & sound effects) of every shot in the input film.

Perhaps we’d get better output from our compressed form of the film by using it to re-shoot the film with a different crew, actors, etc. We surely would. It would certainly handle the continuity aspect we struggled with. But we’re on a budget and can’t afford to spend any more or wait any longer to watch our 1mb films.

I hope this was educational as a tour of some of what’s possible with video models, and what’s necessary (not much) to use them, especially on a budget.

Did we merely build a machine that turns art into slop? Judging by the output, maybe.
