How Voyager works

A long-form film becomes 20 vertical clips in five steps.

No Rendercam. No YOLO. No EMA smoothing. The pipeline mirrors Periscope's in-browser crop algorithm frame-for-frame, so the demo renders show exactly the crop your users will see once the production encoder path ships.

Five steps, start to finish

Fetch pan-and-scan coordinates

Periscope · /api/vionlabs_timeline/vertical_video/<video_id>/

VionLabs analyzes the source film and produces per-frame x_center values describing where the crop window should sit horizontally. We cache each title's response as coords/<slug>_periscope.json — an array of {ts, x_center} samples plus source_frame_width (typically 416 px, the analysis-frame width).

Gotcha: VionLabs' has_vertical_video=true flag does not mean a vertical MP4 exists — it only means the coord JSON is available. Rendering is on us.
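A minimal sketch of reading one cached coord file, for orientation only. The top-level key "samples" is an assumed name for the {ts, x_center} array, and the sketch assumes x_center is expressed in analysis-frame pixels, so it is rescaled by source_width / source_frame_width before use on the HD master. Neither assumption is confirmed by the cache format described above.

```python
import json


def load_coords(text):
    """Parse a cached coord file into (timestamps, x_centers, analysis_width)."""
    doc = json.loads(text)
    # "samples" is an assumed key name for the {ts, x_center} array.
    samples = sorted(doc["samples"], key=lambda s: s["ts"])
    ts = [float(s["ts"]) for s in samples]
    xs = [float(s["x_center"]) for s in samples]
    return ts, xs, int(doc["source_frame_width"])


def to_source_pixels(x_center, analysis_width, source_width):
    """Rescale an analysis-frame x position to source-frame pixels (assumed units)."""
    return x_center * source_width / analysis_width


# Made-up sample data in the shape described above.
example = json.dumps({
    "source_frame_width": 416,
    "samples": [{"ts": 0.0, "x_center": 208.0}, {"ts": 0.5, "x_center": 260.0}],
})
ts, xs, analysis_width = load_coords(example)
scaled = [to_source_pixels(x, analysis_width, 1920) for x in xs]
```

With a 1920 px wide source, the analysis-frame midpoint 208 maps to 960, the source midpoint.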

Download the HD source master

gs://cbs_ent/cbs_ent/movies/<id>/<name>_hdpmezz_*.mp4

Source MP4 URIs live in src_url_60.json. batch_cut_60.sh pulls each source once via gcloud storage cp, cuts every variant (v1 / v5 / v6 / g3) from the same local copy, then deletes the source to reclaim ~100 GB of local disk per catalog pass.

Gotcha: VionLabs proxies are often 360p. When we need 1080p, we reach past the proxy to the cbs_ent mezzanine master.
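The download-once, cut-every-variant, delete pattern can be sketched in Python as follows. This is not batch_cut_60.sh itself: process_title, the cut_variant callback, and the injectable run hook are hypothetical names used for illustration. Only the gcloud storage cp command and the four variant names come from the description above.

```python
import subprocess
from pathlib import Path

VARIANTS = ("v1", "v5", "v6", "g3")


def process_title(src_uri, workdir, cut_variant, run=subprocess.run):
    """Download one source, cut every variant from it, then delete it."""
    local = Path(workdir) / src_uri.rsplit("/", 1)[-1]
    run(["gcloud", "storage", "cp", src_uri, str(local)], check=True)
    try:
        for variant in VARIANTS:
            # cut_variant is a hypothetical callback that renders one variant.
            cut_variant(local, variant)
    finally:
        # Reclaim disk even if one of the cuts fails.
        local.unlink(missing_ok=True)
    return local
```

Deleting in a finally block keeps a failed cut from leaving a multi-gigabyte master on disk, which matters when the pass covers a whole catalog.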

Cut vertical — vionlabs_cut.py

ffmpeg + numpy, fully deterministic, CPU-only

Two ffmpeg processes are piped together with numpy in the middle. A producer decodes the clip range to raw BGR24. For each frame, Python computes the media time and interpolates x_center from the coord timeline, then slices a window of width height × 9 ÷ 16 centered on x_center and clamps it to the frame bounds. A consumer re-encodes with a fixed filter chain:

hqdn3d=2:1.5:4:3                                 # gentle denoise
scale=1080:1920:flags=lanczos+full_chroma+accurate_rnd
unsharp=5:5:0.6:5:5:0.0                         # luma-only sharpen
libx264 preset=medium crf=20                    # ~4s encode per clip

Audio is muxed from the source (AAC at 192 kbps), and the output MP4 is written with +faststart. A 5-clip set takes about 22 seconds, roughly 200× faster than a Rendercam full-movie render.
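The per-frame crop in this step can be sketched with numpy as below. This is an illustration under stated assumptions, not the contents of vionlabs_cut.py: linear interpolation via np.interp, x_center in analysis-frame pixels, rounding to whole pixels, and media time computed as clip start plus frame index over fps are all assumptions. The function names are hypothetical.

```python
import numpy as np


def frame_from_bytes(buf, width, height):
    """View one raw BGR24 frame read from the producer pipe as an H x W x 3 array."""
    return np.frombuffer(buf, dtype=np.uint8).reshape(height, width, 3)


def media_time(clip_start, frame_index, fps):
    """Media time of a frame, assuming a constant frame rate."""
    return clip_start + frame_index / fps


def crop_width(height):
    """Width of a 9:16 window that spans the full frame height."""
    return int(round(height * 9 / 16))


def crop_left(t, ts, xs, analysis_width, frame_width, crop_w):
    """Left edge of the crop window at time t, clamped to the frame."""
    # np.interp holds the first or last value outside the timeline's range.
    x_center = float(np.interp(t, ts, xs)) * frame_width / analysis_width
    left = int(round(x_center - crop_w / 2))
    return max(0, min(left, frame_width - crop_w))


def crop_frame(frame, t, ts, xs, analysis_width):
    """Slice the 9:16 window out of one decoded frame."""
    height, width = frame.shape[:2]
    crop_w = crop_width(height)
    left = crop_left(t, ts, xs, analysis_width, width, crop_w)
    return frame[:, left:left + crop_w]
```

The clamp means a target near the frame edge pins the window to that edge rather than sliding past it, so every output frame has the same width.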

Word-level captions — whisper_subs.py

faster-whisper medium.en · word_timestamps=True · VAD

Runs locally on Apple Silicon CPU via faster-whisper with compute_type=int8 and beam_size=1. VAD filter with 400 ms minimum silence. Output is a sibling <clip>.subs.json with per-segment, per-word timings in clip-local seconds:

{
  "segments": [{
    "start": 0.0, "end": 2.84,
    "words": [{"w":"I", "s":0.0, "e":0.12}, ...]
  }, ...]
}
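A sketch of how the settings named in this step map onto the faster-whisper API, and how its segments could be reshaped into the JSON above. It is not whisper_subs.py. The helper names, the two-decimal rounding, and the whitespace stripping on each word are assumptions inferred from the sample output.

```python
from pathlib import Path


def transcribe(clip_path):
    """Run faster-whisper with the settings listed in this step."""
    from faster_whisper import WhisperModel  # third-party, imported lazily

    model = WhisperModel("medium.en", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        str(clip_path),
        beam_size=1,
        word_timestamps=True,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 400},
    )
    return list(segments)  # segments is a lazy generator


def to_subs(segments):
    """Reshape segments into the {segments: [{start, end, words}]} layout."""
    return {"segments": [
        {
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "words": [
                {"w": w.word.strip(), "s": round(w.start, 2), "e": round(w.end, 2)}
                for w in (seg.words or [])
            ],
        }
        for seg in segments
    ]}


def subs_path(clip_path):
    """Sibling path for the word timings, e.g. clip.mp4 -> clip.subs.json."""
    return Path(clip_path).with_suffix(".subs.json")
```

Because transcription runs on the cut clip rather than the full film, the timestamps come out clip-local without any offset arithmetic.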

Runs at ~5–10× realtime on an M2. The feed viewer uses requestVideoFrameCallback to highlight the active word frame-accurately.
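The active-word lookup is simple enough to sketch. The real viewer runs in the browser, so this Python version only illustrates the logic: given the current media time, find the word whose span contains it. Using a binary search, and treating each word as the half-open span [s, e), are assumptions about how the viewer does it.

```python
from bisect import bisect_right


def flatten_words(subs):
    """Collect every word from a .subs.json document into one ordered list."""
    return [w for seg in subs["segments"] for w in seg["words"]]


def active_word(words, t):
    """Index of the word whose [s, e) span contains t, or None in a gap."""
    starts = [w["s"] for w in words]  # precompute once per clip in real use
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["e"]:
        return i
    return None


subs = {"segments": [{"start": 0.0, "end": 0.5, "words": [
    {"w": "I", "s": 0.0, "e": 0.12},
    {"w": "am", "s": 0.2, "e": 0.5},
]}]}
words = flatten_words(subs)
```

Returning None between words lets the viewer clear the highlight during pauses instead of holding the previous word.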

Upload to GCS — Paramount-domain gated

gs://kevin-shortform-demo/clips/

Clips and subs upload to a domain-gated bucket: roles/storage.objectViewer is granted to domain:cbsinteractive.com, so authenticated Paramount users get 200 and external users get 403. CORS allows GitHub Pages, localhost, and trycloudflare origins, so feed.html fetches MP4s and word-timing JSON directly from the browser with no server.

Why not signed URLs: in this setup they expire after 12 hours, so shared links go stale. The domain-gated bucket gives stable, non-expiring links for internal demo sharing.
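For reference, the access setup in this step comes down to two small pieces of configuration, sketched here as Python data. The IAM binding uses the role and member named above. The CORS entry follows the JSON layout GCS accepts for a bucket CORS file, but the methods, response headers, max age, and the example origin are placeholders, not the bucket's actual settings.

```python
import json

# IAM binding: object read access for one Workspace domain.
VIEWER_BINDING = {
    "role": "roles/storage.objectViewer",
    "members": ["domain:cbsinteractive.com"],
}


def cors_config(origins, max_age_seconds=3600):
    """Build a one-rule GCS CORS document for read-only browser fetches."""
    return [{
        "origin": list(origins),
        "method": ["GET", "HEAD"],
        # Assumed header list; range headers matter for MP4 seeking.
        "responseHeader": ["Content-Type", "Content-Range", "Accept-Ranges"],
        "maxAgeSeconds": max_age_seconds,
    }]


# Placeholder origin; the real list covers GitHub Pages, localhost, and trycloudflare.
cors_json = json.dumps(cors_config(["http://localhost:8000"]), indent=2)
```

A file with this content is the shape that gcloud storage buckets update takes through its --cors-file flag.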

Properties of the pipeline

Inputs	GCS source MP4 (HD master) + VionLabs coord JSON (per-frame x_center)
Outputs	1080×1920 H.264 / AAC MP4 + .subs.json word timings
Hardware	CPU only. No GPU path. Runs on Kevin's MacBook.
Determinism	Cut is fully deterministic (same inputs → identical MP4). Whisper is near-deterministic (greedy decode).
Throughput	~4 seconds per clip on M2. A 5-clip set finishes in ~22 seconds.
Filter chain	hqdn3d → scale=lanczos → unsharp → libx264 crf=20
Periscope fidelity	Mirrors the canvas-crop algorithm 1:1. Same x_center interpolation, same clamp behavior.

What's hosted where

The clip bucket is the single source of truth.

Every Voyager page is served from gs://kevin-shortform-demo/. MP4s, .subs.json word timings, narrative tree JSONs, feed_data_*.js per-variant manifests, and every HTML page live at the root or under clips/. There's no intermediate server, no CDN layer, no signed-URL refresh ritual. The browser talks directly to GCS.

What's deliberately not here

Rendercam. Full-movie vertical render was too slow for batch (2–6 h per film). Replaced entirely for this demo.
SAS Push API. ~40 min per clip × 30 clips = 20 hours. Demo uses the local pipeline instead.
HLS / adaptive bitrate. Everything is MP4 progressive. HLS is a post-demo optimization.
Signed URLs. Domain-gated bucket sidesteps the 12-hour expiry.
Any backend service. Static deploy only. A hard constraint.
