FastVideo/LTX-2.3-Distilled-Diffusers

$0.0350 / second

A fast, step-distilled build of Lightricks' LTX-2.3 diffusion-transformer video model (distilled by FastVideo). Generates high-fidelity text-to-video and image-to-video in just a few denoising steps.

Public

Input

Prompt

text prompt describing the video content

Negative Prompt

Optional negative text prompt; leave blank to fall back to the model's default negative prompt. (Default: uncanny face, mask-like, plastic skin, doll-like, waxy, mannequin, cgi, 3d render, deformed face, distorted face, extra fingers, deformed hands, blurry, washed out, vintage, 1970s, sepia, grainy, low quality)

Seconds

Clip duration: always 5 seconds (fixed/required for this model).

Resolution

Output resolution: always 1080p (fixed/required for this model).

Orientation

Output orientation: always landscape (fixed/required for this model).
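
As a rough illustration of what the listed price means for a fixed 5-second clip, the sketch below multiplies the two numbers from this page. It assumes the per-second price is charged on the duration of the generated clip, which this page does not state explicitly; the function name `estimate_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Values taken from this page.
PRICE_PER_SECOND_USD = Decimal("0.0350")  # listed price per second
CLIP_SECONDS = 5                          # clip duration is fixed at 5 seconds


def estimate_cost_usd(num_clips: int) -> Decimal:
    """Estimate the cost of generating `num_clips` clips.

    Assumption: billing is per second of generated video.
    """
    if num_clips < 0:
        raise ValueError("num_clips must be non-negative")
    return PRICE_PER_SECOND_USD * CLIP_SECONDS * num_clips


print(estimate_cost_usd(1))   # 0.1750
print(estimate_cost_usd(20))  # 3.5000
```

`Decimal` is used instead of `float` so that small currency amounts add up without binary rounding noise.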

Settings

Seed

specify a seed for reproducible output (Default: empty)

Model Information

LTX-2.3 Distilled

LTX-2.3 is a diffusion-transformer (DiT) audio-video foundation model from Lightricks that generates high-fidelity video with synchronized audio from text or a starting image. This endpoint serves the distilled variant, accelerated with FastVideo (Hao AI Lab, UCSD) to produce results in only a few denoising steps.

Capabilities

Text-to-video — generate a clip from a text prompt.
Image-to-video — animate a still image by passing image_url (an http(s) URL or a data: URI).
Synchronized audio — produces a matching audio track (speech, ambient sound, music) alongside the video.
High-fidelity output at 1080p, the fixed resolution on this endpoint.
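
For image-to-video with a local file, the image can be embedded as a data: URI rather than hosted at an http(s) URL. The sketch below shows one way to build such a URI with the Python standard library. The helper names are hypothetical, and the accepted image formats and size limits are not stated on this page, so treat those as open questions.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_uri(path: str) -> str:
    """Read an image file and return it as a data: URI.

    The MIME type is guessed from the file extension.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    return image_bytes_to_data_uri(Path(path).read_bytes(), mime_type)
```

The resulting string can then be passed as the `image_url` value in place of a remote URL.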

Usage

Provide a descriptive prompt. For image-to-video, also pass image_url. Use negative_prompt to steer away from unwanted artifacts and seed for reproducible results. Detailed, concrete prompts — subject, action, setting, lighting, camera motion, and any sound or dialogue — produce the strongest results; for image-to-video, describe the motion you want applied to the supplied image.
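
A minimal sketch of assembling the request inputs described above. The keys `negative_prompt`, `seed`, and `image_url` are the parameter names used on this page; the key `prompt` and the helper name `build_payload` are assumptions. The endpoint URL, authentication, and response format are not covered here and should be taken from the provider's API documentation.

```python
import json
from typing import Optional


def build_payload(
    prompt: str,
    negative_prompt: Optional[str] = None,
    seed: Optional[int] = None,
    image_url: Optional[str] = None,
) -> dict:
    """Build the input dictionary, omitting optional fields left unset."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    payload = {"prompt": prompt}
    if negative_prompt:
        # Leaving this out falls back to the model's default negative prompt.
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        # A fixed seed is intended to make the output reproducible.
        payload["seed"] = seed
    if image_url is not None:
        # image_url must be an http(s) URL or a data: URI.
        if not image_url.startswith(("http://", "https://", "data:")):
            raise ValueError("image_url must be an http(s) URL or a data: URI")
        payload["image_url"] = image_url
    return payload


# Text-to-video example with a concrete prompt covering subject, action,
# setting, lighting, camera motion, and sound.
payload = build_payload(
    prompt=(
        "A red fox trots across a snowy forest clearing at dawn, "
        "soft golden light, slow tracking shot, snow crunching underfoot"
    ),
    seed=42,
)
print(json.dumps(payload, indent=2))
```

Omitting unset optional fields keeps the request aligned with the defaults listed in the Input section instead of overriding them with empty values.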

Limitations

Not intended for generating factual or accurate real-world information.
May reflect or amplify societal biases present in its training data.
Prompt adherence can vary with phrasing and style.
Audio quality is lower for non-speech sounds than for speech.
Can produce unexpected or inappropriate content from some prompts.

Model & credits

Base model: LTX-2.3 by Lightricks — model page · docs · LTX-2 repo
Distillation & inference: FastVideo (Hao AI Lab, UCSD) — GitHub
Paper: LTX-2: Efficient Joint Audio-Visual Foundation Model
License: LTX-2 Community License Agreement

Project link: https://github.com/hao-ai-lab/FastVideo
Paper link: https://arxiv.org/abs/2601.03233
License link: https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE
