URL: https://huggingface.co/jdopensource/JoyAI-Echo

JoyAI-Echo

🎬 Pushing the Frontier of Long Video Generation

Official model weights for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

For academic research and non-commercial use only.

Model Summary

JoyAI-Echo is a long-form, multi-shot, audio-video generation framework that breaks the barriers of error accumulation, weak temporal coherence, and prohibitive latency in long video generation. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos, while a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.

JoyAI-Echo decisively outperforms HappyOyster (directing mode) on long-form generation, where it is preferred on every aspect of the user study reported under Results, and even surpasses the short-video specialist Wan 2.6 on human-centric tasks.

This repository hosts the released checkpoint. Inference code is released separately; see the Usage section.

Model Details

  • Developed by: Echo Team @ Joy Future Academy, JD
  • Model type: Text-to-(Audio+Video) diffusion transformer, DMD 8-step
  • Modality: Text → synchronized video + audio
  • Backbone: Built on top of LTX-Video
  • Text encoder: google/gemma-3-12b-it (downloaded separately)
  • Resolution / length (by default): 1280 × 736, 241 frames @ 25 fps per shot
  • Max story length: up to 5 minutes (multi-shot)
  • License: LTX-2 Community License Agreement

Highlights

  • 🎞️ Minute-level multi-shot stories: generate a sequence of coherent shots from one prompt JSON.
  • ⚡ DMD-distilled few-step inference: ~7.5× faster than the original pipeline.
  • 🔊 Joint audio-video generation: one pipeline produces synchronized video and audio.
  • 🧠 Paired cross-modal memory bank: conditions each new shot on prior visual identity and voice context for story-level consistency.

Demo Gallery

Explore long-form and short-form JoyAI-Echo cases on the Project Page. 🍿

Usage

Inference is run with the standalone JoyAI-Echo inference repository.

1. Download the checkpoint

huggingface-cli download jdopensource/JoyAI-Echo \
 --local-dir checkpoints

Also download the Gemma text encoder:

huggingface-cli download google/gemma-3-12b-it \
 --local-dir checkpoints/gemma-3-12b

Expected layout:

checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
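
Before launching inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the inference repository: the helper name `missing_checkpoint_entries` is hypothetical, and it only checks for presence, not file integrity.

```python
from pathlib import Path
from typing import List

# Names copied from the "Expected layout" listing above.
EXPECTED_FILE = "echo-longvideo-release.safetensors"
EXPECTED_DIR = "gemma-3-12b"


def missing_checkpoint_entries(root: str = "checkpoints") -> List[str]:
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    missing = []
    if not (base / EXPECTED_FILE).is_file():
        missing.append(EXPECTED_FILE)
    gemma = base / EXPECTED_DIR
    # Treat an empty directory as missing: it usually means the download did not finish.
    if not gemma.is_dir() or not any(gemma.iterdir()):
        missing.append(EXPECTED_DIR + "/")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_entries()
    print("layout OK" if not problems else f"missing: {problems}")
```

An empty list means both entries are present under `checkpoints/`.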

2. Get the inference code

git clone https://github.com/jd-opensource/JoyAI-Echo.git
cd JoyAI-Echo

Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's environment.yml / requirements.txt).

3. Write a story prompt

Enhance your prompt first. We provide prompt enhancers, system prompts that expand a short story or idea into well-formed shot prompts: prompts/long_story_writer_system_prompt.md for long, multi-shot video, and prompts/short_story_writer_system_prompt.md for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

Create a JSON file under prompts/. Each file is a single object with a prompts list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

Inside each string, write these parts in order:

| Part | What to describe |
| --- | --- |
| Roles & Subjects | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| Action & Dialogue | What the subject does and says. |
| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| Background | The setting and scene details behind the subject. |
| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |
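
The file format described above (one object with a `prompts` list, one string per shot, six parts in a fixed order) can be assembled programmatically. The sketch below is illustrative only. The dictionary keys in `PART_ORDER`, the helpers `build_shot` and `write_story`, and the choice to join parts with a single space are all assumptions; only the `prompts` key and the part order come from this section.

```python
import json
from pathlib import Path
from typing import Dict, List

# Order follows the table above. The key names are illustrative,
# not a schema defined by the inference repository.
PART_ORDER = (
    "roles_and_subjects",
    "action_and_dialogue",
    "style",
    "camera_movement",
    "background",
    "sound_effects_and_bgm",
)


def build_shot(parts: Dict[str, str]) -> str:
    """Join the six parts into one shot string, in the documented order."""
    missing = [name for name in PART_ORDER if not parts.get(name, "").strip()]
    if missing:
        raise ValueError(f"shot is missing parts: {missing}")
    # Assumption: parts are concatenated as plain prose separated by spaces.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def write_story(shots: List[Dict[str, str]], path: str) -> None:
    """Write {"prompts": [...]} with one complete string per shot."""
    payload = {"prompts": [build_shot(shot) for shot in shots]}
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```

Calling `write_story([shot_1, shot_2], "prompts/my_story.json")` with two part dictionaries would produce a two-shot story file; the file name here is a placeholder. Running the text through the matching prompt enhancer first is still recommended.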

4. Run

python inference.py

Outputs land in inference_result/outputs/<prompt-name>/inference_<timestamp>/.
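
Since every run writes to a new `inference_<timestamp>` folder, a small helper can locate the most recent one for a given prompt file. This is a sketch under assumptions: `latest_run_dir` is a hypothetical helper, and because the timestamp format is not specified here, it orders runs by modification time rather than by parsing the folder name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(prompt_name: str, root: str = "inference_result/outputs") -> Optional[Path]:
    """Return the most recently modified inference_* directory for a prompt, if any."""
    prompt_dir = Path(root) / prompt_name
    if not prompt_dir.is_dir():
        return None
    runs = [p for p in prompt_dir.glob("inference_*") if p.is_dir()]
    # Sort by mtime because the timestamp format in the name is not documented here.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `latest_run_dir("my_story")` returns `None` until at least one run for that prompt has finished creating its output folder.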

Hardware

Peak GPU memory is ~46–50 GB at the default 1280 × 736 × 241 frame setting; a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower resolution or frame count:

python inference.py --num-frames 121 --video-height 480 --video-width 832
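
To get a feel for how much smaller the reduced setting is, the sketch below compares raw pixel-frame counts and assembles the same command from its parts. Treat the ratio as a rough proxy only: model weights and the text encoder take a fixed amount of memory regardless of resolution, so peak memory will not shrink proportionally. The helper names are hypothetical; the flags are the ones shown above.

```python
from typing import List


def pixel_frames(width: int, height: int, num_frames: int) -> int:
    """Total pixels across all frames; a crude proxy for activation size."""
    return width * height * num_frames


def build_command(num_frames: int, height: int, width: int) -> List[str]:
    """Assemble the inference command with the flags shown above."""
    return [
        "python", "inference.py",
        "--num-frames", str(num_frames),
        "--video-height", str(height),
        "--video-width", str(width),
    ]


default = pixel_frames(1280, 736, 241)
reduced = pixel_frames(832, 480, 121)
print(f"reduced / default pixel-frames: {reduced / default:.3f}")
print(" ".join(build_command(121, 480, 832)))
```

The reduced setting processes roughly 21% of the default pixel-frame count.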

Results

Reported Scale

| Item | Value |
| --- | --- |
| 🎬 Long-form coherent story length | 5 min |
| ⚡ Generation speedup over the original multi-step pipeline | 7.5× |
| 📚 Benchmark stories | 100 |
| 🎞️ Generated evaluation shots | 3,000 |
| 🕒 Frames per shot | 241 @ 25 fps |
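
The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes the 3,000 evaluation shots are split evenly across the 100 benchmark stories, which the table does not state explicitly.

```python
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3000

# Duration of a single shot at the default setting.
seconds_per_shot = FRAMES_PER_SHOT / FPS

# Assumption: shots are distributed evenly across stories.
shots_per_story = EVALUATION_SHOTS / BENCHMARK_STORIES

story_seconds = shots_per_story * seconds_per_shot
story_minutes = story_seconds / 60

print(f"seconds per shot: {seconds_per_shot:.2f}")
print(f"shots per story:  {shots_per_story:.0f}")
print(f"story length:     {story_seconds:.1f} s ({story_minutes:.2f} min)")
```

Under that assumption each story is 30 shots of 9.64 s, about 4.8 minutes in total, which is consistent with the reported 5-minute story length.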

Human Evaluation

GSB (Good/Same/Bad) user study on long- and short-video generation. Each row gives the percentage of raters who preferred JoyAI-Echo, rated the two outputs as a tie, or preferred the baseline.

| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | --- | --- | --- |
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| IP consistency | 59.4% | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | --- | --- | --- |
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
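
One common way to summarize GSB results is the net preference margin, (G − B) / (G + S + B). The sketch below applies it to the percentages above. This summary statistic is an assumption for illustration; the source reports only the raw preference percentages.

```python
# (JoyAI-Echo %, tie %, baseline %) per aspect, copied from the tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN26 = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Net margin in percentage points: positive favors JoyAI-Echo."""
    total = good + same + bad
    return round(100.0 * (good - bad) / total, 1)


for title, table in (
    ("Long video vs HappyOyster (Directing)", LONG_VS_HAPPYOYSTER),
    ("Short video vs Wan 2.6", SHORT_VS_WAN26),
):
    print(title)
    for aspect, (good, same, bad) in table.items():
        print(f"  {aspect}: {net_preference(good, same, bad):+.1f}")
```

On this measure the margin is positive for every long-video aspect, while on short video it is positive for visual aesthetics and prompt following and slightly negative for audio quality.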

Acknowledgements

We gratefully acknowledge the open-source projects this work builds upon, in particular LTX2.3 for the base video generator and Gemma for the text encoder. Thanks to the broader research community whose contributions made this release possible.

Citation

If JoyAI-Echo helps your research or products, please cite:

@techreport{echo2026JoyEcho,
 title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
 author = {{Echo Team @ Joy Future Academy, JD}},
 institution = {Joy Future Academy, JD},
 year = {2026},
 month = {May}
}

License

This project is based on LTX-2 by Lightricks Ltd.

Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only. This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.

All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained. This project remains subject to the LTX-2 Community License Agreement.
