Narrative Threads Pass 1 was speech-only: feed ASR into Gemini, get beats back. mm3 interleaves visual keyframe captions at shot cuts with the ASR, enforces per-3-minute-span coverage, and tags spoiler tiers for safe / teaser / full selection. Across 50 pairwise comparisons judged by Gemini 2.5 Pro, mm3safe was chosen 17 times; v1 won 8.
A five-stage local pipeline wraps a Gemini-chaptering call. Every step is cacheable and resumable; a catalog-wide runner manages resume, cost-cap, and disk cleanup.

    # from run_mm3_pipeline.sh: one title, 25 minutes wallclock
    0. ffmpeg shot cuts        # select=gt(scene,0.3), ~1 cut per 8s
    1. extract keyframes       # greedy max-score, 20s min spacing
    2. caption keyframes       # Gemini 2.5 Flash, 8 workers, thinking=0
    3. pass1_multimodal_v3     # Gemini 2.5 Pro, coverage-enforcing prompt
    4. pass2 ASR-only          # subdivide beats into clip-sized sub-beats
    5. spoiler-safe picker     # 3 modes × 3 trees = 9 outputs

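
Step 1's "greedy max-score, 20s min spacing" can be read as: take shot cuts in descending scene-score order and keep one only if it sits at least 20 seconds from every keyframe already kept. The sketch below illustrates that reading. It is not the pipeline's code; the function name, the tuple layout, and the sample scores are all assumptions.

```python
from typing import List, Tuple


def pick_keyframes(cuts: List[Tuple[float, float]],
                   min_spacing: float = 20.0) -> List[float]:
    """Greedy keyframe pick (hypothetical helper).

    cuts: (timestamp_seconds, scene_score) pairs, in any order.
    Returns the kept timestamps in chronological order.
    """
    chosen: List[float] = []
    # Visit candidates from highest score to lowest; ties go to the earlier cut.
    for ts, _score in sorted(cuts, key=lambda c: (-c[1], c[0])):
        # Keep a cut only if it is far enough from everything kept so far.
        if all(abs(ts - kept) >= min_spacing for kept in chosen):
            chosen.append(ts)
    return sorted(chosen)


# Made-up scores for illustration only.
cuts = [(5.0, 0.9), (12.0, 0.5), (26.0, 0.7), (40.0, 0.4), (47.0, 0.8)]
print(pick_keyframes(cuts))  # [5.0, 26.0, 47.0]
```

The cuts at 12s and 40s are dropped because each sits within 20 seconds of a higher-scoring neighbour.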
The “coverage-enforcing prompt” is the load-bearing change. Before mm3, Titanic's Pass 1 skipped a 15-minute stretch of its third act. The v3 prompt adds a hard rule: every 3-minute span must land in exactly one beat, with a 180s upper bound on any single beat.
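
The rule lives in the prompt, but the same constraint can be checked on the returned beats after the fact. The sketch below is one possible reading of it: beats must tile the runtime with no gaps or overlaps, and no beat may exceed the cap. The function name, the `(start, end)` tuple layout, and the 1-second tolerance are assumptions, and the sample timestamps are invented rather than taken from Titanic.

```python
from typing import List, Tuple

Beat = Tuple[float, float]  # (start_s, end_s); layout is an assumption


def coverage_violations(beats: List[Beat], runtime_s: float,
                        max_beat_s: float = 180.0,
                        tol_s: float = 1.0) -> List[str]:
    """Return human-readable problems; an empty list means the beats pass."""
    problems: List[str] = []
    cursor = 0.0  # end of the region covered so far
    for start, end in sorted(beats):
        if end - start > max_beat_s:
            problems.append(
                f"beat {start:.0f}-{end:.0f}s exceeds {max_beat_s:.0f}s cap")
        if start - cursor > tol_s:
            problems.append(f"gap {cursor:.0f}-{start:.0f}s is not covered")
        elif cursor - start > tol_s:
            problems.append(
                f"overlap at {start:.0f}-{min(cursor, end):.0f}s")
        cursor = max(cursor, end)
    if runtime_s - cursor > tol_s:
        problems.append(f"gap {cursor:.0f}-{runtime_s:.0f}s is not covered")
    return problems


# Invented example with a 15-minute (900s) hole in the middle.
beats = [(0, 170), (170, 340), (1240, 1400)]
print(coverage_violations(beats, runtime_s=1400))
# ['gap 340-1240s is not covered']
```

A post-hoc check like this would flag a skipped stretch even if the model ignores the prompt rule.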
Interleave ASR with [VISUAL] keyframe captions. Feed to Gemini 2.5 Pro. Return beats.
Same transcript, same model. Adds spoiler_level, narrative_role, reveals per beat.
Same schema, new prompt addendum. 180s hard cap per beat; every 3-min span must land in exactly one beat.
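
The interleaving step in the multimodal variants amounts to merging two time-sorted streams into one prompt body. A minimal sketch, assuming both streams arrive as `(seconds, text)` pairs. The `[mm:ss]` line format and the choice to put speech before a caption at the same timestamp are illustrative assumptions, not the pipeline's actual format.

```python
import heapq
from typing import Iterable, List, Tuple

Line = Tuple[float, str]  # (timestamp_seconds, text); layout is an assumption


def interleave(asr: Iterable[Line], captions: Iterable[Line]) -> List[str]:
    """Merge ASR lines and keyframe captions by timestamp.

    Both inputs must already be sorted by time. The middle tuple element
    breaks ties so that speech precedes a caption at the same timestamp.
    """
    speech = ((t, 0, text) for t, text in asr)
    visual = ((t, 1, f"[VISUAL] {text}") for t, text in captions)
    merged = heapq.merge(speech, visual)
    return [f"[{int(t) // 60:02d}:{int(t) % 60:02d}] {line}"
            for t, _kind, line in merged]


# Placeholder text, not real transcript or caption output.
asr = [(1.0, "hello"), (9.0, "look at that")]
captions = [(0.0, "wide shot of a ship deck"),
            (8.0, "close-up of two hands")]
for row in interleave(asr, captions):
    print(row)
```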
Gemini 2.5 Pro judged position-swapped pairs of clip sets for five films: Titanic, Wolf of Wall Street, Breakfast Club, Kill Bill 2, Amistad. Each film contributed 10 comparisons. The judge picked a winner per pair; ties allowed. Numbers below are total wins out of 50.
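
Position swapping means the judge's answer is relative to presentation order, so each verdict has to be mapped back to a variant name before counting. A sketch of that bookkeeping follows. The record layout and the `"first"` / `"second"` / `"tie"` labels are assumptions, and the sample verdicts are invented.

```python
from collections import Counter
from typing import Iterable, Tuple

# (variant shown first, variant shown second, judge verdict); assumed layout
Verdict = Tuple[str, str, str]


def tally(verdicts: Iterable[Verdict]) -> Counter:
    """Count wins per variant, independent of presentation order."""
    wins: Counter = Counter()
    for first, second, verdict in verdicts:
        if verdict == "first":
            wins[first] += 1
        elif verdict == "second":
            wins[second] += 1
        elif verdict == "tie":
            wins["tie"] += 1
        else:
            raise ValueError(f"unexpected verdict: {verdict!r}")
    return wins


# Invented verdicts: the same pair judged in both orders, plus one tie.
sample = [
    ("mm3safe", "v1", "first"),
    ("v1", "mm3safe", "second"),
    ("v1", "mm3safe", "tie"),
]
print(tally(sample))
```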
mm4 = Pass 2 also multimodal; mm4c = mm4 + shot-cut boundary snap. Both add cost without moving the pairwise-win needle against mm3, so mm3 is the current candidate for production. Rendering vertical MP4s for the remaining 32 scored titles and running a human A/B is the next step.
Separate from pairwise judgment, we scored variants against 12 iconic Titanic moments (“Rose on the bow”, “I'll never let go”, etc.). Did the variant land a clip within 45 seconds of the ground-truth moment?
| Variant | Configuration | Recall (of 12 moments) | Notes |
|---|---|---|---|
| mm4cfull | mm3 Pass 1 + Pass 2 multimodal + shot-snap | 33% | Best on iconic, weak on pairwise |
| mm4full | mm3 Pass 1 + Pass 2 multimodal | 33% | Same as mm4cfull without snap |
| mm3full | mm3 Pass 1 + ASR-only Pass 2 | 25% | Full-spoiler mode |
| mm3safe / mmfull | safe mode / mm baseline | 17% | Good clips, not always canonical |
| v1 (pilot) | Gemini 2.5 Flash ASR-only | 0% | Skipped the third-act window entirely |
Takeaway: full-spoiler + multimodal Pass 2 is best for canonical-moment recall, but mm3safe wins the generic “which clip-set is more engaging” judgment. Different tasks, different winners.
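
The recall figure is a hit rate: a moment counts as recalled if any clip lands within 45 seconds of it. With 12 moments, the table's 33%, 25%, and 17% correspond to 4, 3, and 2 hits. The sketch below assumes "within 45 seconds" means the distance from the moment's timestamp to the nearest edge of a clip, with zero distance if the moment falls inside the clip. The function name and sample data are hypothetical.

```python
from typing import List, Tuple

Clip = Tuple[float, float]  # (start_s, end_s); layout is an assumption


def moment_recall(clips: List[Clip], moments: List[float],
                  window_s: float = 45.0) -> float:
    """Fraction of ground-truth moments with a clip within window_s seconds."""

    def distance(t: float, clip: Clip) -> float:
        start, end = clip
        if start <= t <= end:
            return 0.0  # moment falls inside the clip
        return min(abs(t - start), abs(t - end))

    if not moments:
        return 0.0
    hits = sum(1 for t in moments
               if any(distance(t, c) <= window_s for c in clips))
    return hits / len(moments)


# Invented timestamps: three of the four moments are within 45s of a clip.
clips = [(100.0, 160.0), (900.0, 960.0)]
moments = [130.0, 200.0, 1100.0, 950.0]
print(moment_recall(clips, moments))  # 0.75
```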
Five films currently have rendered tree_diff pages with click-to-play 480p proxies, showing every variant's Pass-1 and Pass-2 structure side by side. The remaining 32 mm3-scored titles will be rendered as clips ship.
mm3 produces Pass-1 beats that are 60-600 seconds long: full scene-length cuts, not the 30-90s sub-beats that Pass 2 produces. The mm3direct variant ships those beats directly to the feed viewer as a distinct product surface. Think clip (Pass 2) vs scene (Pass 1).
| Variant | Filter | Clips shipped | Example URL |
|---|---|---|---|
| mm3directsafe | spoiler = none | 16+ | Open → |
| mm3directteaser | spoiler ≤ moderate | 53+ | Open → |
| mm3directfull | all beats | 76+ | Open → |
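
The three filters in the table reduce to a ceiling on an ordered spoiler tier. A minimal sketch, assuming each beat carries a `spoiler_level` string. The tier name `"major"` is hypothetical (the table only names `none` and `moderate`), as are the dict layout and the sample beats.

```python
from typing import Dict, List

# Ordered spoiler tiers. "major" is a placeholder for whatever sits
# above "moderate" in the real schema.
TIER_RANK = {"none": 0, "moderate": 1, "major": 2}

# Highest tier each mode lets through, following the table's Filter column.
MODE_CEILING = {"safe": "none", "teaser": "moderate", "full": "major"}


def select_beats(beats: List[Dict], mode: str) -> List[Dict]:
    """Keep the beats whose spoiler tier is at or below the mode's ceiling."""
    ceiling = TIER_RANK[MODE_CEILING[mode]]
    return [b for b in beats if TIER_RANK[b["spoiler_level"]] <= ceiling]


# Invented beats for illustration.
beats = [
    {"id": 1, "spoiler_level": "none"},
    {"id": 2, "spoiler_level": "moderate"},
    {"id": 3, "spoiler_level": "major"},
]
for mode in ("safe", "teaser", "full"):
    print(mode, [b["id"] for b in select_beats(beats, mode)])
```

Each mode's output is a superset of the stricter mode's, which matches the clip counts in the table rising from safe to teaser to full.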
Cloud Run Jobs pipeline: each task = one movie, renders all 60-600s beats in parallel (4 vCPU / 16 GiB / gen2 / whisper small.en).
v5 smoke rendered 71 clean clips across 3 movies in ~65 min for ~$0.45. Full 35-movie run is in flight; target ~700 clips at ~$15-20 total.
The numbers above grow live as the full run completes.
1. Ship full Pass-1 catalog. v3 worker image deployed; re-executing the 35-task Cloud Run Job brings mm3directfull from 17 clips to ~700+. ~$25-30 at full 4 vCPU / 16 GiB / 1.5h per task.
2. Render vertical MP4s for the 32 unrendered Pass-2 mm3 titles. They already have beats and elite picks; they just need the vionlabs_cut.py pass and whisper subs. Rendering feeds them into the scrolling viewer and the compare grid.
3. Human A/B with 10+ raters. LLM judgment established the baseline. A human rater panel on mm3safe vs mm4teaser vs v1 is the promotion gate.
4. Caption-prompt neutrality ablation. Re-score one film with visual captions stripped of adjective-laden language, to confirm the win isn't an artifact of caption prose quality.