Pass 1 · Chapter-Llama recipe

Multimodal beats win pairwise 2:1 against the v1 pilot.

Narrative Threads Pass 1 was speech-only: feed ASR into Gemini, get beats back. mm3 interleaves visual keyframe captions at shot cuts with the ASR, enforces per-3-minute-span coverage, and tags spoiler tiers for safe / teaser / full selection. Across 50 pairwise comparisons judged by Gemini 2.5 Pro, mm3safe was chosen 17 times; v1 won 8.

17 / 50 mm3safe pairwise wins (2.1× the v1 pilot)
99.3% coverage on Titanic (up from 62.7% without the rule)
37 titles scored (3 rendered as Pass-1 clips · 5 as tree diffs)
$0.30 per title (incl. keyframes + Pass 1 LLM)

How mm3 works

Five local stages wrap the Gemini chaptering call (step 3 in the listing that follows). Every step is cacheable and resumable; a catalog-wide runner manages resume, cost cap, and disk cleanup.

# from run_mm3_pipeline.sh: one title, 25 minutes wallclock
0. ffmpeg shot cuts     # select=gt(scene,0.3), ~1 cut per 8s
1. extract keyframes    # greedy max-score, 20s min spacing
2. caption keyframes    # Gemini 2.5 Flash, 8 workers, thinking=0
3. pass1_multimodal_v3  # Gemini 2.5 Pro, coverage-enforcing prompt
4. pass2 ASR-only       # subdivide beats into clip-sized sub-beats
5. spoiler-safe picker  # 3 modes × 3 trees = 9 outputs
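
Step 1's "greedy max-score, 20s min spacing" can be read as the selection rule sketched here. This is a minimal illustration, not the code in run_mm3_pipeline.sh: the function name, the (timestamp, score) input shape, and the tie-break toward earlier cuts are all assumptions.

```python
from typing import List, Tuple


def pick_keyframes(cuts: List[Tuple[float, float]], min_gap_s: float = 20.0) -> List[float]:
    """Greedy max-score selection with a minimum spacing between picks.

    cuts: (timestamp_seconds, scene_score) pairs, e.g. parsed from a
    shot-cut detector. Returns the chosen timestamps in chronological order.
    """
    chosen: List[float] = []
    # Visit candidates from highest score to lowest; ties go to the earlier cut.
    for ts, _score in sorted(cuts, key=lambda c: (-c[1], c[0])):
        # Keep a candidate only if it is far enough from every earlier pick.
        if all(abs(ts - kept) >= min_gap_s for kept in chosen):
            chosen.append(ts)
    return sorted(chosen)


# Hypothetical shot cuts: (seconds, scene score)
cuts = [(4.0, 0.35), (12.0, 0.80), (25.0, 0.60), (33.0, 0.90), (58.0, 0.40)]
print(pick_keyframes(cuts))  # [12.0, 33.0, 58.0]
```

The cut at 25s loses to the stronger cut at 33s because the two are only 8 seconds apart; the same rule drops the cut at 4s in favor of the one at 12s.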

The “coverage-enforcing prompt” is the load-bearing change. Before mm3, Titanic's Pass 1 skipped a 15-minute stretch of its third act. The v3 prompt adds a hard rule: every 3-minute span must land in exactly one beat, with a 180s upper bound on any single beat.
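
A post-hoc check of the two constraints might look like the sketch below. It assumes beats arrive as (start, end) pairs in seconds, and it uses a loose reading of the rule: flag any beat over the cap and any 3-minute window that no beat touches. The function names and the coverage definition (union of beat intervals over runtime) are illustrative assumptions, not the pipeline's own scorer.

```python
from typing import List, Tuple

Beat = Tuple[float, float]  # (start_s, end_s)


def coverage(beats: List[Beat], runtime_s: float) -> float:
    """Fraction of the runtime covered by the union of beat intervals."""
    covered, cursor = 0.0, 0.0
    for start, end in sorted(beats):
        # Clip each beat to the part not already counted and to the runtime.
        start, end = max(start, cursor), min(end, runtime_s)
        if end > start:
            covered += end - start
            cursor = end
    return covered / runtime_s


def violations(beats: List[Beat], runtime_s: float,
               span_s: float = 180.0, max_beat_s: float = 180.0) -> List[str]:
    """List beats over the cap and fixed 3-minute windows with no beat."""
    problems = []
    for start, end in beats:
        if end - start > max_beat_s:
            problems.append(f"beat {start:.0f}-{end:.0f}s exceeds {max_beat_s:.0f}s")
    t = 0.0
    while t < runtime_s:
        span_end = min(t + span_s, runtime_s)
        # A window is covered if any beat overlaps it at all.
        if not any(s < span_end and e > t for s, e in beats):
            problems.append(f"span {t:.0f}-{span_end:.0f}s has no beat")
        t += span_s
    return problems


# Hypothetical 15-minute title with a gap in the middle
beats = [(0.0, 150.0), (150.0, 330.0), (700.0, 820.0)]
print(f"{coverage(beats, 900.0):.1%}")  # 50.0%
print(violations(beats, 900.0))         # ['span 360-540s has no beat']
```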

Evolution: mm → mm2 → mm3

mm · v1

Chapter-Llama baseline

Interleave ASR with [VISUAL] keyframe captions. Feed to Gemini 2.5 Pro. Return beats.

Titanic: 44 beats vs v1's 29 (+52%).
Avg duration 154s vs 339s.

mm2

Spoiler-aware schema

Same transcript, same model. Adds spoiler_level, narrative_role, reveals per beat.

Unlocks 3-mode selection: safe / teaser / full.
Titanic: 9 none / 11 moderate / 15 climactic.

mm3 · current

Coverage enforcement

Same schema, new prompt addendum. 180s hard cap per beat; every 3-min span must land in exactly one beat.

Titanic coverage: 62.7% → 99.3%.
Closes the 15-min third-act gap that v1 and mm skipped.
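
All three variants share the interleaving step: keyframe captions are merged into the ASR stream by timestamp before the chaptering call. A minimal sketch of that merge follows. Only the [VISUAL] tag comes from the description above; the HH:MM:SS line format, the function names, and the choice to place a caption before speech at the same timestamp are assumptions.

```python
from typing import List, Tuple


def fmt_ts(seconds: float) -> str:
    """Format seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def interleave(asr: List[Tuple[float, str]], captions: List[Tuple[float, str]]) -> str:
    """Merge ASR lines and keyframe captions into one time-ordered transcript."""
    # The middle element is a sort rank: captions (0) precede speech (1) on ties.
    events = [(t, 1, text) for t, text in asr]
    events += [(t, 0, f"[VISUAL] {text}") for t, text in captions]
    events.sort(key=lambda e: (e[0], e[1]))
    return "\n".join(f"{fmt_ts(t)} {text}" for t, _rank, text in events)


# Hypothetical inputs: (seconds, text)
asr = [(2.0, "Hello there."), (65.0, "We should go.")]
captions = [(0.0, "Wide shot of a harbor at dusk."),
            (65.0, "Close-up of two people at a railing.")]
print(interleave(asr, captions))
```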

Pairwise LLM judge — N=5 films, 50 comparisons

Gemini 2.5 Pro judged position-swapped pairs of clip sets for five films: Titanic, Wolf of Wall Street, Breakfast Club, Kill Bill 2, Amistad. Each film contributed 10 comparisons. The judge picked a winner per pair; ties allowed. Numbers below are total wins out of 50.

mm3safe 17
mm4teaser 11
v1 (pilot) 8
mm3full 3
mm4full 2
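
The headline ratio follows directly from these counts. The short worked check below uses only the numbers listed above; how the remaining comparisons split between ties and variants not shown here is not stated, so the sketch reports them as a single remainder.

```python
# Win counts as listed above, out of 50 comparisons.
wins = {"mm3safe": 17, "mm4teaser": 11, "v1 (pilot)": 8, "mm3full": 3, "mm4full": 2}
total_comparisons = 50

decided = sum(wins.values())               # wins attributed to a listed variant
unaccounted = total_comparisons - decided  # ties, or variants not listed
ratio = wins["mm3safe"] / wins["v1 (pilot)"]

print(f"listed wins: {decided} of {total_comparisons}")  # listed wins: 41 of 50
print(f"ties or unlisted: {unaccounted}")                # ties or unlisted: 9
print(f"mm3safe vs v1: {ratio:.3f}x")                    # mm3safe vs v1: 2.125x
```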

mm4 = Pass 2 also multimodal; mm4c = mm4 + shot-cut boundary snap. Both add cost, and neither beats mm3safe in pairwise wins (the best mm4 variant, mm4teaser, has 11 against mm3safe's 17), so mm3 is the current candidate for production. The next steps are rendering vertical MP4s for the remaining 32 scored titles and running a human A/B.

Iconic-moment recall — Titanic, ±45s window

Separate from pairwise judgment, we scored variants against 12 iconic Titanic moments (“Rose on the bow”, “I'll never let go”, etc.). Did the variant land a clip within 45 seconds of the ground-truth moment?

Variant | Model | Recall | Notes
mm4cfull | mm3 Pass 1 + Pass 2 multimodal + shot-snap | 33% | Best on iconic, weak on pairwise
mm4full | mm3 Pass 1 + Pass 2 multimodal | 33% | Same as mm4cfull without snap
mm3full | mm3 Pass 1 + ASR-only Pass 2 | 25% | Full-spoiler mode
mm3safe / mmfull | safe mode / mm baseline | 17% | Good clips, not always canonical
v1 (pilot) | Gemini 2.5 Flash ASR-only | 0% | Skipped the third-act window entirely

Takeaway: full-spoiler + multimodal Pass 2 is best for canonical-moment recall, but mm3safe wins the generic “which clip-set is more engaging” judgment. Different tasks, different winners.
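
The recall metric can be sketched as follows. This assumes a clip counts as a hit when the ground-truth timestamp falls inside the clip or within 45 seconds of either edge; the actual matching rule, the function name, and all timestamps in the example are hypothetical.

```python
from typing import List, Tuple


def iconic_recall(clips: List[Tuple[float, float]], moments: List[float],
                  window_s: float = 45.0) -> float:
    """Share of ground-truth moments with at least one clip within window_s."""
    hits = sum(
        1 for m in moments
        if any(start - window_s <= m <= end + window_s for start, end in clips)
    )
    return hits / len(moments)


# Hypothetical data: 12 moments, one every 10 minutes, and 5 clips (start_s, end_s)
moments = [600.0 * i for i in range(1, 13)]
clips = [(590.0, 650.0), (1230.0, 1290.0), (1700.0, 1760.0),
         (2440.0, 2500.0), (5000.0, 5060.0)]
print(f"{iconic_recall(clips, moments):.0%}")  # 33%
```

With 12 moments, the percentages in the table correspond to whole hit counts: 4 of 12 rounds to 33%, 3 of 12 is 25%, and 2 of 12 rounds to 17%.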

Published films — click to see tree diffs

Five films currently have rendered tree_diff pages with click-to-play 480p proxies, showing every variant's Pass-1 and Pass-2 structure side by side. The remaining 32 mm3-scored titles will be rendered as clips ship.

Titanic · published
Wolf of Wall Street · published
Breakfast Club · published
Kill Bill 2 · published
Amistad · published
+ 32 more mm3 scored, unrendered

Pass-1 direct delivery — new

mm3 produces Pass-1 beats that are 60-600 seconds long — full scene-length cuts, not the 30-90s sub-beats Pass 2 produces. The mm3direct variant ships those directly to the feed viewer as a distinct product surface. Think clip (Pass 2) vs scene (Pass 1).

Variant | Filter | Clips shipped
mm3directsafe | spoiler = none | 16+
mm3directteaser | spoiler ≤ moderate | 53+
mm3directfull | all beats | 76+
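
The three filters amount to a ceiling on each beat's spoiler tier. The sketch below assumes beats are dicts carrying the spoiler_level field from the mm2 schema, with the tiers ordered none < moderate < climactic; the function and constant names are illustrative. The example reuses Titanic's 9 / 11 / 15 tier counts from the mm2 run.

```python
from typing import Dict, List

# Spoiler tiers in increasing severity.
LEVELS = {"none": 0, "moderate": 1, "climactic": 2}
# Highest tier each selection mode lets through.
MODE_CEILING = {"safe": "none", "teaser": "moderate", "full": "climactic"}


def select_beats(beats: List[Dict], mode: str) -> List[Dict]:
    """Keep the beats whose spoiler_level is at or below the mode's ceiling."""
    ceiling = LEVELS[MODE_CEILING[mode]]
    return [b for b in beats if LEVELS[b["spoiler_level"]] <= ceiling]


# Beats with Titanic's reported tier counts: 9 none, 11 moderate, 15 climactic.
beats = (
    [{"spoiler_level": "none"}] * 9
    + [{"spoiler_level": "moderate"}] * 11
    + [{"spoiler_level": "climactic"}] * 15
)
for mode in ("safe", "teaser", "full"):
    print(mode, len(select_beats(beats, mode)))
# safe 9
# teaser 20
# full 35
```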

Cloud Run Jobs pipeline: each task = one movie, renders all 60-600s beats in parallel (4 vCPU / 16 GiB / gen2 / whisper small.en). v5 smoke rendered 71 clean clips across 3 movies in ~65 min for ~$0.45. Full 35-movie run is in flight; target ~700 clips at ~$15-20 total. The numbers above grow live as the full run completes.
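
With one task per movie, a worker can find its title from the task index that Cloud Run Jobs exposes through the CLOUD_RUN_TASK_INDEX environment variable. The sketch below shows that mapping only; the function name and the three-title catalog are hypothetical, and the real worker image may resolve its title differently.

```python
import os
from typing import List, Mapping


def movie_for_task(movies: List[str], env: Mapping[str, str] = os.environ) -> str:
    """Return the title assigned to this task (one task = one movie)."""
    # Cloud Run Jobs sets CLOUD_RUN_TASK_INDEX to 0..N-1; default to 0 locally.
    index = int(env.get("CLOUD_RUN_TASK_INDEX", "0"))
    if not 0 <= index < len(movies):
        raise IndexError(f"task index {index} has no movie (catalog size {len(movies)})")
    return movies[index]


# Hypothetical three-title catalog, as in a smoke run.
catalog = ["Titanic", "Wolf of Wall Street", "Breakfast Club"]
print(movie_for_task(catalog, {"CLOUD_RUN_TASK_INDEX": "2"}))  # Breakfast Club
```

Passing the environment in as a parameter keeps the mapping testable outside Cloud Run.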

What's next

1. Ship full Pass-1 catalog. v3 worker image deployed; re-executing the 35-task Cloud Run Job brings mm3directfull from 17 clips to ~700+. ~$25-30 at full 4 vCPU / 16 GiB / 1.5h per task.

2. Render vertical MP4s for the 32 unrendered Pass-2 mm3 titles. They have beats and elite picks; they just need the vionlabs_cut.py pass and whisper subs. Once rendered, they feed into the scrolling viewer and the compare grid.

3. Human A/B with 10+ raters. LLM judgment established the baseline. A human rater panel on mm3safe vs mm4teaser vs v1 is the promotion gate.

4. Caption-prompt neutrality ablation. Re-score one film with visual captions stripped of adjective-laden language to confirm the win isn't an artifact of caption prose quality.
