UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Jiehui Huang¹,† · Yuechen Zhang² · Bin Xia² · Jiahao Wang³ · Xu He⁴ · Zhenchao Tang⁵ ·
Meng Chu¹ · Xin Tao³ · Pengfei Wan³ · Jiaya Jia¹,✉

¹HKUST · ²CUHK · ³Kling Team, Kuaishou Technology · ⁴Tsinghua University · ⁵Sun Yat-sen University

†Work done during an internship at the Kling Team · ✉Corresponding author

📢 Checkpoints, training code & the agent system will be released soon — please stay tuned! ⭐

📢 News

[2026.06] 🎉 Project page, arXiv paper, and the 200-case benchmark — 🤗 KlingTeam/UnityShotsBench — are live.

📖 Introduction

🔊 Click the preview above to play the intro with audio (43s) · full gallery on the Project Page

UnityShots turns a single-shot audio-video diffusion model (LTX-2.3 22B) into a coherent multi-shot storyteller. From one structured prompt it generates a k-shot sequence (3–9 shots) as a single continuous .mp4 in which:

🎭 Identity persists — same face, wardrobe and body across cuts
🌍 World persists — scene, lighting and props stay consistent shot to shot
🔊 Audio is generated and synchronized — lip-synced speech + scene-aware ambient and score
🎬 Cuts are controllable — a learned cut-type prior becomes an inference-time knob

A single set of weights serves three inference modes — Text-to-Video (T2V), Image-to-Video (I2V), and Reference-to-Video (R2V) — via a per-shot mixed-mode Shots-Forcing training recipe.
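The training code is not released yet, so the following is only a minimal sketch of one way to read "per-shot mixed-mode": each shot in a training sequence independently draws one of the three conditioning modes, so the same weights see all of them. The names `ShotSpec` and `assign_modes`, the uniform sampling, and the file-path fields are assumptions made for illustration, not the actual Shots-Forcing recipe.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

MODES = ("T2V", "I2V", "R2V")  # the three modes named in this README


@dataclass
class ShotSpec:
    """Hypothetical per-shot record; not the released prompt format."""
    text: str
    mode: str
    first_frame: Optional[str] = None  # only set for I2V shots
    reference: Optional[str] = None    # only set for R2V shots


def assign_modes(texts: List[str], rng: random.Random,
                 first_frame: str = "frame.png",
                 reference: str = "identity.png") -> List[ShotSpec]:
    """Draw a conditioning mode independently for every shot (assumed uniform)."""
    if not 3 <= len(texts) <= 9:  # the README states 3 to 9 shots per sequence
        raise ValueError("expected between 3 and 9 shots")
    shots = []
    for text in texts:
        mode = rng.choice(MODES)
        shots.append(ShotSpec(
            text=text,
            mode=mode,
            first_frame=first_frame if mode == "I2V" else None,
            reference=reference if mode == "R2V" else None,
        ))
    return shots


if __name__ == "__main__":
    for shot in assign_modes(["wide shot", "close-up", "reaction"], random.Random(0)):
        print(shot.mode, shot.text)
```

At inference time the same record could carry a fixed mode per shot instead of a sampled one; that choice is also an assumption here.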

🎯 Method

Two fixed-size memory slots per modality — a Long-Term Memory (LTM) anchored to the opening shot and a Short-Term Memory (STM) holding the immediately preceding tail — are fused at every cut by a Boundary-Aware Gate conditioned on visual cut probability and beat signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a growing audio bank. Memory stays constant-size, so generation scales to long, many-shot stories.
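The constant-size memory idea can be illustrated with a toy sketch. It assumes a scalar sigmoid gate with made-up weights, treats each shot as a small feature vector, and assumes that a likely hard cut pushes the blend toward LTM; `boundary_gate`, `fuse`, and `memory_contexts` are hypothetical names, and the actual gate in the paper may differ in form and direction.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat: float, w_cut: float = 4.0,
                  w_beat: float = 1.0, bias: float = -2.0) -> float:
    """Hypothetical scalar gate in (0, 1); the weights stand in for learned ones."""
    return sigmoid(w_cut * cut_prob + w_beat * beat + bias)


def fuse(ltm: Vector, stm: Vector, gate: float) -> Vector:
    """Convex blend: gate near 1 leans on LTM, gate near 0 leans on STM."""
    return [gate * l + (1.0 - gate) * s for l, s in zip(ltm, stm)]


def memory_contexts(shots: Sequence[Vector], cut_probs: Sequence[float],
                    beats: Sequence[float]) -> List[Vector]:
    """Return the fused memory context that would condition shots 2..k."""
    ltm = list(shots[0])  # fixed: anchored to the opening shot
    stm = list(shots[0])  # overwritten after every shot, never appended to
    contexts = []
    for shot, cut_prob, beat in zip(shots[1:], cut_probs, beats):
        contexts.append(fuse(ltm, stm, boundary_gate(cut_prob, beat)))
        stm = list(shot)  # only the immediately preceding shot is kept
    return contexts


if __name__ == "__main__":
    toy_shots = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for ctx in memory_contexts(toy_shots, cut_probs=[0.9, 0.1], beats=[1.0, 0.0]):
        print(ctx)
```

Because `ltm` is never updated and `stm` is replaced rather than extended, the memory footprint in this sketch is the same for 3 shots as for 9.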

📊 Results Gallery

High-resolution 1376×768 multi-shot generations. Each strip below shows one frame per shot of a single generated story — notice how the character and world stay consistent across every cut.

🔊 Full videos with audio are on the Project Page.

Reference identities are public-domain or AI-generated and are shown for academic, non-commercial demonstration only.

🤗 Benchmark — released by the Kling Team

We release UnityShotsBench on the Kling Team Hugging Face org — a 200-case multilingual, multi-cultural multi-shot storytelling benchmark spanning 13 languages and 6 cultural regions, with reference identity images, reference voice clips, and per-shot scripts for all three conditioning modes.

UnityShots reaches performance comparable to strong open-source and closed-source baselines including LTX-2, Ovi, MovA, IDLora and DreamID-Omni across cross-shot identity, audio coherence and visual aesthetics. Full numbers and qualitative comparisons are on the project page and in the paper.

🗓️ Roadmap

We are gradually opening up the full UnityShots stack — please stay tuned:

🧠 Model checkpoints — T2V / I2V / R2V weights
🛠️ Training code & recipes
🤖 Agent system — turns a free-form idea into a structured multi-shot prompt for UnityShots

⭐ If you find this work interesting, please star the repo — it helps us prioritise the open-source release and reach more people. Thanks for your patience! 🙏

⚖️ License

Released under CC BY-NC 4.0 (academic, non-commercial research only). Reference identities and generated media are provided for research and demonstration only.

📚 Citation

If you find this work useful for your research, please cite:

@article{huang2026unityshots,
  title   = {UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating},
  author  = {Huang, Jiehui and Zhang, Yuechen and Xia, Bin and Wang, Jiahao and
             He, Xu and Tang, Zhenchao and Chu, Meng and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2606.21661},
  year    = {2026}
}

🚀 Stay Tuned for Updates!

Watch / Star this repo to get notified when we release the checkpoints, training code, and agent system.

Fine-tuned from LTX-2.3 22B · Work completed during an internship at the Kling Team, Kuaishou Technology.
