Jiehui Huang1,† ·
Yuechen Zhang2 ·
Bin Xia2 ·
Jiahao Wang3 ·
Xu He4 ·
Zhenchao Tang5 ·
Meng Chu1 ·
Xin Tao3 ·
Pengfei Wan3 ·
Jiaya Jia1,✉
1HKUST · 2CUHK · 3Kling Team, Kuaishou Technology · 4Tsinghua University · 5Sun Yat-sen University
†Work done during internship at the Kling Team · ✉Corresponding Author
- [2026.06] 🎉 Project page, arXiv paper, and the 200-case benchmark — 🤗 KlingTeam/UnityShotsBench — are live.
🔊 Click the preview above to play the intro with audio (43s) · full gallery on the Project Page
UnityShots turns a single-shot audio-video diffusion model (LTX-2.3 22B) into a coherent
multi-shot storyteller. From one structured prompt it generates a k-shot sequence (3–9 shots) as a single continuous .mp4 in which:
- 🎭 Identity persists — same face, wardrobe and body across cuts
- 🌍 World persists — scene, lighting and props stay consistent shot to shot
- 🔊 Audio is generated and synchronized — lip-synced speech + scene-aware ambient and score
- 🎬 Cuts are controllable — a learned cut-type prior becomes an inference-time knob
A single set of weights serves three inference modes — Text-to-Video (T2V), Image-to-Video (I2V), and Reference-to-Video (R2V) — via a per-shot mixed-mode Shots-Forcing training recipe.
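As a rough illustration of what mixing conditioning modes shot by shot could look like, the sketch below models a story as a list of shots, each tagged with one of the three modes. The prompt schema has not been released, so every name here (`Mode`, `Shot`, `validate_story`, `image_path`, `reference_path`) is hypothetical. Only the 3 to 9 shot range and the three mode names come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    """The three conditioning modes named in the README."""
    T2V = "t2v"  # text only
    I2V = "i2v"  # text + starting image
    R2V = "r2v"  # text + identity reference


@dataclass
class Shot:
    """Hypothetical per-shot record; field names are assumptions."""
    script: str
    mode: Mode
    image_path: Optional[str] = None      # assumed input for I2V shots
    reference_path: Optional[str] = None  # assumed input for R2V shots


def validate_story(shots: List[Shot]) -> None:
    """Raise ValueError if the story falls outside the stated 3-9 shot range
    or a shot lacks the input its mode needs."""
    if not 3 <= len(shots) <= 9:
        raise ValueError(f"expected 3-9 shots, got {len(shots)}")
    for i, shot in enumerate(shots):
        if shot.mode is Mode.I2V and shot.image_path is None:
            raise ValueError(f"shot {i}: I2V needs an image")
        if shot.mode is Mode.R2V and shot.reference_path is None:
            raise ValueError(f"shot {i}: R2V needs a reference")


# Example: one story mixing all three modes (placeholder file names).
story = [
    Shot("A woman enters a rainy alley.", Mode.R2V, reference_path="hero.png"),
    Shot("Close-up as she looks up.", Mode.T2V),
    Shot("She opens a red door.", Mode.I2V, image_path="door.png"),
]
validate_story(story)
```

Keeping the mode on each shot rather than on the whole story mirrors the per-shot mixing described for training. Whether the released inference interface exposes it this way is not yet known.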
Two fixed-size memory slots per modality — a Long-Term Memory (LTM) anchored to the opening shot and a Short-Term Memory (STM) holding the immediately preceding tail — are fused at every cut by a Boundary-Aware Gate conditioned on visual cut probability and beat signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a growing audio bank. Memory stays constant-size, so generation scales to long, many-shot stories.
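The constant-memory idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the released implementation: memory slots are flat lists of floats, the gate is a scalar sigmoid over cut probability and a beat signal, and a confident cut is assumed to shift weight toward the LTM. The class and function names (`MemoryBank`, `boundary_gate`, `fuse`, `advance`) and the gate weights are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryBank:
    """Two fixed-size slots for one modality (layout is an assumption)."""
    ltm: List[float]  # anchored to the opening shot, never overwritten
    stm: List[float]  # tail of the immediately preceding shot


def boundary_gate(cut_prob: float, beat: float,
                  w_cut: float = 4.0, w_beat: float = 1.0,
                  bias: float = -2.0) -> float:
    """Scalar gate in (0, 1). Weights are illustrative, not learned values."""
    z = w_cut * cut_prob + w_beat * beat + bias
    return 1.0 / (1.0 + math.exp(-z))


def fuse(bank: MemoryBank, cut_prob: float, beat: float) -> List[float]:
    """Blend LTM and STM element-wise; a higher gate leans on the LTM."""
    g = boundary_gate(cut_prob, beat)
    return [g * l + (1.0 - g) * s for l, s in zip(bank.ltm, bank.stm)]


def advance(bank: MemoryBank, shot_tail: List[float]) -> MemoryBank:
    """After a shot, overwrite the STM and keep the LTM, so size is constant."""
    if len(shot_tail) != len(bank.stm):
        raise ValueError("shot tail must match the fixed STM size")
    return MemoryBank(ltm=bank.ltm, stm=list(shot_tail))


# Example: the bank stays the same size no matter how many shots pass through.
bank = MemoryBank(ltm=[1.0, 1.0], stm=[0.0, 0.0])
for tail in ([0.2, 0.1], [0.4, 0.3], [0.6, 0.5]):
    context = fuse(bank, cut_prob=0.9, beat=0.5)
    bank = advance(bank, tail)
```

Because `advance` replaces the STM instead of appending to it, the memory footprint is independent of the number of shots, which is the property the paragraph above relies on for long stories.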
High-resolution 1376×768 multi-shot generations. Each strip below shows one frame per shot of a single generated story — notice how the character and world stay consistent across every cut.
🔊 Full videos with audio are on the Project Page.
Reference identities are public-domain or AI-generated and are shown for academic, non-commercial demonstration only.
We release UnityShotsBench on the Kling Team Hugging Face org — a 200-case multilingual, multi-cultural multi-shot storytelling benchmark spanning 13 languages and 6 cultural regions, with reference identity images, reference voice clips, and per-shot scripts for all three conditioning modes.
UnityShots performs comparably to strong open-source and closed-source baselines, including LTX-2, Ovi, MovA, IDLora and DreamID-Omni, on cross-shot identity, audio coherence and visual aesthetics. Full numbers and qualitative comparisons are on the project page and in the paper.
We are gradually opening up the full UnityShots stack — please stay tuned:
- 🧠 Model checkpoints — T2V / I2V / R2V weights
- 🛠️ Training code & recipes
- 🤖 Agent system — turns a free-form idea into a structured multi-shot prompt for UnityShots
⭐ If you find this work interesting, please star the repo — it helps us prioritise the open-source release and reach more people. Thanks for your patience! 🙏
Released under CC BY-NC 4.0 (academic, non-commercial research only). Reference identities and generated media are provided for research and demonstration only.
If you find this work useful for your research, please cite:
@article{huang2026unityshots,
  title   = {UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating},
  author  = {Huang, Jiehui and Zhang, Yuechen and Xia, Bin and Wang, Jiahao and
             He, Xu and Tang, Zhenchao and Chu, Meng and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2606.21661},
  year    = {2026}
}





