All 7 variants pick 5 clips from the same 11 movies. Each is a different ranker selecting what a viewer would find most scroll-stopping in a TikTok-style feed. v1 is the anchor: it is the current production pipeline, Pass 2 sub-beat generation with Gemini 2.5 Flash, which produces ~80 LLM-scored sub-beats per film and then reselects the top 5 by Freytag intensity. Every other variant changes exactly one thing and asks a specific question about what drives quality.
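The v1 reselection step described above (score roughly 80 sub-beats, keep the 5 most intense) can be sketched as a plain top-k selection. This is an illustration only, not the production code: the `SubBeat` fields, the `top_clips` helper, and the synthetic scores are all hypothetical names and data introduced for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class SubBeat:
    """Hypothetical record for one LLM-scored sub-beat (field names assumed)."""
    start_s: float
    end_s: float
    intensity: float  # assumed: the Freytag intensity score assigned in Pass 2


def top_clips(sub_beats, k=5):
    """Return the k sub-beats with the highest intensity, highest first."""
    return heapq.nlargest(k, sub_beats, key=lambda b: b.intensity)


# Synthetic stand-in for ~80 scored sub-beats from one film (made-up scores).
beats = [SubBeat(i * 10.0, i * 10.0 + 8.0, intensity=(i * 37) % 80) for i in range(80)]
picked = top_clips(beats)
```

Under this reading, every Pass-2 variant would share the selection step and differ only in how the `intensity` values are produced.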
The headline quality experiment is the 4 Pass-2 variants (v1 / g3 / v5 / v6), which differ only in which LLM they use for sub-beat generation; these are visible by default. The 3 merge/trim variants (v2 / v2b / v2c) were cost-motivated, asking "can we approximate v1 at $0?", and are superseded under the quality-first directive. They are shown only via the show 7 · all toggle, for reference.
Read each column's hypothesis (Q:) and decision (if wins) boxes: they tell you what clicking "prefer" on that variant actually means. Two columns winning on different genres (e.g. v5 dominant on action, v6 on drama) is itself a valid outcome and suggests genre-based tiering.
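One way to check for the split-by-genre outcome is to tally "prefer" clicks per genre and see whether the winners differ. The sketch below assumes each vote is recorded as a (genre, variant) pair; that format, the `winners_by_genre` helper, and the sample votes are hypothetical and not taken from the tool itself.

```python
from collections import Counter, defaultdict


def winners_by_genre(votes):
    """Map each genre to the variant with the most 'prefer' votes.

    Ties go to whichever tied variant was seen first (Counter.most_common order).
    """
    tallies = defaultdict(Counter)
    for genre, variant in votes:
        tallies[genre][variant] += 1
    return {genre: counts.most_common(1)[0][0] for genre, counts in tallies.items()}


# Made-up votes, for illustration only.
votes = [
    ("action", "v5"), ("action", "v5"), ("action", "v1"),
    ("drama", "v6"), ("drama", "v6"), ("drama", "v5"),
]
winners = winners_by_genre(votes)

# More than one distinct winner across genres hints at genre-based tiering.
split = len(set(winners.values())) > 1
```

With so few votes per genre a split like this would be weak evidence; the sketch only shows the bookkeeping.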