
JIA-Lab-research/UnityShots



UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating


Jiehui Huang1,† · Yuechen Zhang2 · Bin Xia2 · Jiahao Wang3 · Xu He4 · Zhenchao Tang5 ·
Meng Chu1 · Xin Tao3 · Pengfei Wan3 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Kling Team, Kuaishou Technology · 4Tsinghua University · 5Sun Yat-sen University

† Work done during an internship at the Kling Team  ·  ✉ Corresponding author


📢 Checkpoints, training code & the agent system will be released soon — please stay tuned! ⭐




📖 Introduction

🔊 Intro video with audio (43s)  ·  full gallery on the Project Page

UnityShots turns a single-shot audio-video diffusion model (LTX-2.3 22B) into a coherent multi-shot storyteller. From one structured prompt it generates a k-shot sequence (3–9 shots) as a single continuous .mp4 in which:

  • 🎭 Identity persists — same face, wardrobe and body across cuts
  • 🌍 World persists — scene, lighting and props stay consistent shot to shot
  • 🔊 Audio is generated and synchronized — lip-synced speech + scene-aware ambient and score
  • 🎬 Cuts are controllable — a learned cut-type prior becomes an inference-time knob

A single set of weights serves three inference modes — Text-to-Video (T2V), Image-to-Video (I2V), and Reference-to-Video (R2V) — via a per-shot mixed-mode Shots-Forcing training recipe.
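The prompt schema has not been released yet, so the following is only a hypothetical sketch of what a structured multi-shot request could look like. The class names `Shot` and `StoryPrompt`, the `mode` and `reference_images` fields, and the rule that I2V and R2V need at least one image are assumptions made for illustration. Only the 3 to 9 shot range and the three mode names come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# Mode names come from the README; everything else here is hypothetical.
MODES = ("T2V", "I2V", "R2V")
MIN_SHOTS, MAX_SHOTS = 3, 9


@dataclass
class Shot:
    """One shot of the story: a visual description plus optional dialogue."""
    description: str
    dialogue: str = ""


@dataclass
class StoryPrompt:
    """Hypothetical container for a k-shot request (not the released schema)."""
    shots: List[Shot]
    mode: str = "T2V"
    reference_images: List[str] = field(default_factory=list)

    def validate(self) -> None:
        if not MIN_SHOTS <= len(self.shots) <= MAX_SHOTS:
            raise ValueError(
                f"expected {MIN_SHOTS}-{MAX_SHOTS} shots, got {len(self.shots)}"
            )
        if self.mode not in MODES:
            raise ValueError(f"unknown mode {self.mode!r}")
        # Assumption: image-conditioned modes need at least one image.
        if self.mode in ("I2V", "R2V") and not self.reference_images:
            raise ValueError(f"{self.mode} needs at least one reference image")


story = StoryPrompt(
    shots=[
        Shot("Wide shot: a detective enters a rain-soaked alley."),
        Shot("Close-up: she studies a torn photograph.", dialogue="Who took this?"),
        Shot("Over-the-shoulder: a figure watches from a doorway."),
    ],
)
story.validate()  # passes: 3 shots, T2V, no images required
```

Validating the shot count and mode before generation keeps malformed requests from reaching the model; the actual prompt format may differ once the code is released.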


🎯 Method

UnityShots architecture

Each modality has two fixed-size memory slots: a Long-Term Memory (LTM) anchored to the opening shot and a Short-Term Memory (STM) holding the tail of the immediately preceding shot. At every cut, a Boundary-Aware Gate conditioned on visual cut probability and beat signals fuses the two. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a growing audio bank. Because memory stays constant-size, generation scales to long, many-shot stories.
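As a rough illustration of the memory mechanism described here, the toy sketch below blends two fixed-size slots with a scalar gate driven by a cut probability and a beat signal. It is not the released implementation: the sigmoid form, the weights `w_cut`, `w_beat` and `bias`, the convention that a likely hard cut leans on the LTM, and the use of plain float lists in place of latent tensors are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    ltm: List[float]  # Long-Term Memory: anchored to the opening shot
    stm: List[float]  # Short-Term Memory: tail of the preceding shot


def boundary_gate(cut_prob: float, beat: float,
                  w_cut: float = 4.0, w_beat: float = 1.0,
                  bias: float = -2.0) -> float:
    """Scalar gate in (0, 1); the weights are made-up, not learned values."""
    return 1.0 / (1.0 + math.exp(-(w_cut * cut_prob + w_beat * beat + bias)))


def fuse(mem: Memory, cut_prob: float, beat: float) -> List[float]:
    """Blend the two slots. Assumption: a likely hard cut leans on the LTM."""
    g = boundary_gate(cut_prob, beat)
    return [g * l + (1.0 - g) * s for l, s in zip(mem.ltm, mem.stm)]


def advance(mem: Memory, shot_tail: List[float]) -> Memory:
    """After a shot, overwrite the STM; the LTM is untouched, so size is constant."""
    if len(shot_tail) != len(mem.stm):
        raise ValueError("shot tail must match the fixed STM size")
    return Memory(ltm=mem.ltm, stm=list(shot_tail))


mem = Memory(ltm=[1.0, 1.0], stm=[0.0, 0.0])
hard_cut = fuse(mem, cut_prob=1.0, beat=1.0)  # high gate: mostly LTM
soft_cut = fuse(mem, cut_prob=0.0, beat=0.0)  # low gate: mostly STM
for tail in ([0.2, 0.4], [0.6, 0.8], [0.3, 0.1]):
    mem = advance(mem, tail)                   # memory never grows
```

The point of the sketch is the constant footprint: however many shots are generated, only the two slots are kept, and the gate decides at each boundary how much of each to use.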


📊 Results Gallery

High-resolution 1376×768 multi-shot generations. Each strip below shows one frame per shot of a single generated story — notice how the character and world stay consistent across every cut.

🔊 Full videos with audio are on the Project Page.

Reference identities are public-domain or AI-generated and are shown for academic, non-commercial demonstration only.


🤗 Benchmark — released by the Kling Team

We release UnityShotsBench on the Kling Team Hugging Face org — a 200-case multilingual, multi-cultural multi-shot storytelling benchmark spanning 13 languages and 6 cultural regions, with reference identity images, reference voice clips, and per-shot scripts for all three conditioning modes.


UnityShots reaches performance comparable to strong open-source and closed-source baselines including LTX-2, Ovi, MovA, IDLora and DreamID-Omni across cross-shot identity, audio coherence and visual aesthetics. Full numbers and qualitative comparisons are on the project page and in the paper.


🗓️ Roadmap

We are gradually opening up the full UnityShots stack — please stay tuned:

  • 🧠 Model checkpoints — T2V / I2V / R2V weights
  • 🛠️ Training code & recipes
  • 🤖 Agent system — turns a free-form idea into a structured multi-shot prompt for UnityShots

If you find this work interesting, please star the repo — it helps us prioritise the open-source release and reach more people. Thanks for your patience! 🙏


⚖️ License

Released under CC BY-NC 4.0 (academic, non-commercial research only). Reference identities and generated media are provided for research and demonstration only.


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2026unityshots,
  title   = {UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating},
  author  = {Huang, Jiehui and Zhang, Yuechen and Xia, Bin and Wang, Jiahao and
             He, Xu and Tang, Zhenchao and Chu, Meng and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2606.21661},
  year    = {2026}
}

🚀 Stay Tuned for Updates!

Watch / Star this repo to get notified when we release the checkpoints, training code, and agent system.

Fine-tuned from LTX-2.3 22B · Work completed during an internship at the Kling Team, Kuaishou Technology.

About

This project is the official implementation of "UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating"
