Quantized GGUF version of the split LTX-2.3 checkpoint

Original model link: https://huggingface.co/Lightricks/LTX-2.3

Watch us on YouTube: @VantageWithAI

LTX-2.3 Model Card

This model card covers LTX-2.3, a significant update to LTX-2 with improved audio and visual quality and enhanced prompt adherence. LTX-2 was presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model.

💻💻 If you want to dive right into the code, it is available here. 💾💾

LTX-2.3 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

LTX-2 Open Source

Model Checkpoints

| Name | Notes |
|------|-------|
| ltx-2.3-22b-dev | The full model; flexible and trainable in bf16 |
| ltx-2.3-22b-distilled | The distilled version of the full model; 8 steps, CFG=1 |
| ltx-2.3-22b-distilled-lora-384 | A LoRA version of the distilled model, applicable to the full model |
| ltx-2.3-spatial-upscaler-x2-1.0 | A 2x spatial upscaler for the LTX-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution |
| ltx-2.3-spatial-upscaler-x1.5-1.0 | A 1.5x spatial upscaler for the LTX-2.3 latents, used in multi-stage (multiscale) pipelines for higher resolution |
| ltx-2.3-temporal-upscaler-x2-1.0 | A 2x temporal upscaler for the LTX-2.3 latents, used in multi-stage (multiscale) pipelines for higher FPS |
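The upscalers operate on LTX-2.3 latents rather than decoded pixels. The stand-in below is purely illustrative: the real upscalers are learned models and the latent shape shown is made up; plain interpolation is used only to show the shape change each stage performs in a multi-stage pipeline.

```python
import torch
import torch.nn.functional as F

# Illustrative only: the real ltx-2.3 upscalers are learned latent models.
# This interpolation stand-in just shows the shape change of each stage.
latents = torch.randn(1, 128, 9, 16, 24)  # (batch, channels, frames, height, width); sizes are made up

spatial_x2 = F.interpolate(latents, scale_factor=(1.0, 2.0, 2.0), mode="trilinear")
print(spatial_x2.shape)   # torch.Size([1, 128, 9, 32, 48])  -> higher resolution

temporal_x2 = F.interpolate(latents, scale_factor=(2.0, 1.0, 1.0), mode="trilinear")
print(temporal_x2.shape)  # torch.Size([1, 128, 18, 16, 24]) -> higher FPS
```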

Model Details

  • Developed by: Lightricks
  • Model type: Diffusion-based audio-video foundation model
  • Language(s): English

Online demo

LTX-2.3 is accessible right away via the API Playground.

Diffusers 🧨

LTX-2.3 support in the Diffusers Python library is coming soon!
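In the meantime, the GGUF files in this repo can already be inspected with the `gguf` Python package (`pip install gguf`). A minimal sketch, assuming a locally downloaded file; the file name below is illustrative, not a specific file in this repo:

```python
from gguf import GGUFReader

reader = GGUFReader("ltx-2.3-Q4_K_M.gguf")   # illustrative local file name

# Print the metadata keys stored in the file header.
for field in reader.fields.values():
    print(field.name)

# Peek at the first few tensors: name, shape, and quantization type.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```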

General tips:

  • Width and height must each be divisible by 32. The frame count must be of the form 8k + 1 (a multiple of 8, plus 1), e.g. 121 or 257.
  • If the resolution or frame count does not meet these constraints, pad the input with -1 and then crop the output back to the desired resolution and frame count (a minimal sketch follows this list).
  • For tips on writing effective prompts, please visit our Prompting guide.
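A minimal sketch of the padding tip above, assuming a `(frames, channels, height, width)` pixel tensor; `pad_to_valid` is a hypothetical helper, not part of any official pipeline:

```python
import torch

def pad_to_valid(video: torch.Tensor) -> torch.Tensor:
    """Pad a (frames, channels, height, width) tensor so that height and
    width are multiples of 32 and the frame count has the form 8k + 1."""
    f, c, h, w = video.shape
    target_h = ((h + 31) // 32) * 32       # round height up to a multiple of 32
    target_w = ((w + 31) // 32) * 32       # round width up to a multiple of 32
    target_f = ((f - 1 + 7) // 8) * 8 + 1  # round frames up to the form 8k + 1
    padded = torch.full((target_f, c, target_h, target_w), -1.0, dtype=video.dtype)
    padded[:f, :, :h, :w] = video          # original content; padding stays -1
    return padded

clip = torch.rand(30, 3, 500, 700)         # 30 frames at 500x700
print(pad_to_valid(clip).shape)            # torch.Size([33, 3, 512, 704])
```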

Limitations

  • This model is not intended to, and cannot, provide factual information.
  • As a statistical model, this checkpoint might amplify existing societal biases.
  • The model may fail to generate videos that match the prompt perfectly.
  • Prompt following is heavily influenced by prompting style.
  • The model may generate content that is inappropriate or offensive.
  • When generating audio without speech, the audio quality may be lower.

Train the model

The base (dev) model is fully trainable.

You can easily reproduce the LoRAs and IC-LoRAs we publish with the model by following the instructions in the LTX-2 Trainer README.

Training for motion, style, or likeness (sound + appearance) can take less than an hour in many settings.

Citation

@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and Bitterman, Yaki and Kvochko, Andrew and Berkowitz, Avishai and Shalem, Daniel and Lifschitz, Daphna and Moshe, Dudu and Porat, Eitan and Richardson, Eitan and Shiran, Guy and Chachy, Itay and Chetboun, Jonathan and Finkelson, Michael and Kupchick, Michael and Zabari, Nir and Guetta, Nitzan and Kotler, Noa and Bibi, Ofir and Gordon, Ori and Panet, Poriya and Benita, Roi and Armon, Shahar and Kulikov, Victor and Inger, Yaron and Shiftan, Yonatan and Melumian, Zeev and Farbman, Zeev},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
GGUF details

  • Model size: 21B params
  • Architecture: ltxv
  • Available quantizations: 3-bit, 4-bit, 5-bit, 6-bit, 8-bit