Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low*¹, Weimin Wang*†¹, Calder Katyal²
*Equal contribution, †Project Lead
¹Character AI, ²Yale University

Video Demo

🌟 Key Features

Ovi is a Veo 3-like video+audio generation model that generates synchronized video and audio from text or text+image inputs.

🎬 Video+Audio Generation: Generate synchronized video and audio content simultaneously
📝 Flexible Input: Supports text-only or text+image conditioning
⏱️ 5-second Videos: Generates 5-second videos at 24 FPS with a pixel area of roughly 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc.)
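
To make the fixed pixel area concrete, here is a minimal sketch that derives a height and width for a given aspect ratio while staying close to the 720×720 budget. The helper name `frame_size` and the rounding to multiples of 16 are assumptions made for this example; the alignment Ovi actually requires may differ.

```python
import math

TARGET_AREA = 720 * 720  # pixel budget stated in the feature list


def frame_size(aspect_w, aspect_h, target_area=TARGET_AREA, multiple=16):
    """Return (height, width) close to target_area for the given aspect ratio.

    `multiple` is an assumed alignment, not a documented Ovi requirement.
    """
    ratio = aspect_w / aspect_h
    height = math.sqrt(target_area / ratio)
    width = height * ratio

    def snap(value):
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(height), snap(width)


if __name__ == "__main__":
    for aw, ah in [(1, 1), (16, 9), (9, 16)]:
        h, w = frame_size(aw, ah)
        print(f"{aw}:{ah} -> height {h}, width {w}, {h * w} pixels")
```

The result is ordered as height then width, matching the video_frame_height_width setting in the inference config.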

📋 Todo List

Release research paper and microsite for demos
Checkpoint of 11B model
Inference code
- Text or Text+Image as input
- Gradio application code
- Multi-GPU inference with or without sequence parallelism
- Improve efficiency of the sequence parallel implementation
- Implement sharded inference with FSDP
Video creation example prompts and format
Finetuned model with higher resolution
Longer video generation
Distilled model for faster inference
Training scripts

🎨 An Easy Way to Create

We provide example prompts to help you get started with Ovi:

Text-to-Audio-Video (T2AV): example_prompts/gpt_examples_t2v.csv
Image-to-Audio-Video (I2AV): example_prompts/gpt_examples_i2v.csv

📝 Prompt Format

Our prompts use special tags to control speech and audio:

Speech: <S>Your speech content here<E> - Text enclosed in these tags will be converted to speech
Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes the audio or sound effects present in the video
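
As a quick illustration of the tag format, the sketch below extracts the speech lines and audio descriptions from a prompt with regular expressions. The function `parse_prompt` and the sample prompt are hypothetical and written for this example; they are not part of the Ovi codebase or the example CSVs.

```python
import re

# Non-greedy matches so that several tagged spans in one prompt stay separate.
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


def parse_prompt(prompt):
    """Split a prompt into its speech lines and audio descriptions."""
    return {
        "speech": [s.strip() for s in SPEECH_RE.findall(prompt)],
        "audio": [a.strip() for a in AUDCAP_RE.findall(prompt)],
    }


# Hypothetical prompt, made up for this example.
example = (
    "A woman stands on a rainy street and looks at the camera. "
    "<S>We fight back with courage.<E> "
    "<AUDCAP>Steady rain, distant traffic, a calm female voice.<ENDAUDCAP>"
)

if __name__ == "__main__":
    print(parse_prompt(example))
```

A check like this can catch an unclosed tag (for example a missing <E>) before a prompt is sent to the model.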

🤖 Quick Start with GPT

For easy prompt creation, try this approach:

Take any example from the CSV files above
Ask GPT to rewrite the speech enclosed in every <S> <E> pair based on a theme, such as "humans fighting against AI"
GPT will rewrite all the speeches to fit your requested theme
Use the modified prompt with Ovi!

Example: The theme "AI is taking over the world" produces speeches like:

<S>AI declares: humans obsolete now.<E>
<S>Machines rise; humans will fall.<E>
<S>We fight back with courage.<E>
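
If you already have the rewritten speeches (from GPT or by hand), they can be swapped into an existing prompt programmatically. The following is a minimal sketch; `replace_speeches` and the template prompt are hypothetical and only illustrate the substitution step, not any GPT API call.

```python
import re

SPEECH_RE = re.compile(r"<S>.*?<E>", re.DOTALL)


def replace_speeches(prompt, new_speeches):
    """Swap each <S>...<E> span, in order, for the next entry in new_speeches."""
    found = SPEECH_RE.findall(prompt)
    if len(found) != len(new_speeches):
        raise ValueError(
            f"prompt has {len(found)} speech spans, got {len(new_speeches)} replacements"
        )
    replacements = iter(new_speeches)
    # A function replacement avoids any backslash handling in the new text.
    return SPEECH_RE.sub(lambda _match: f"<S>{next(replacements)}<E>", prompt)


# Hypothetical template prompt with two speech spans.
template = "A robot speaks: <S>Hello there.<E> A soldier answers: <S>Good morning.<E>"

if __name__ == "__main__":
    themed = replace_speeches(
        template,
        ["Machines rise; humans will fall.", "We fight back with courage."],
    )
    print(themed)
```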

📦 Installation

Step-by-Step Installation

# Clone the repository
git clone https://github.com/character-ai/Ovi.git

cd Ovi

# Create and activate virtual environment
virtualenv ovi-env
source ovi-env/bin/activate

# Install PyTorch first
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

# Install other dependencies
pip install -r requirements.txt

# Install Flash Attention
pip install flash_attn --no-build-isolation

Alternative Flash Attention Installation (Optional)

If the above flash_attn installation fails, you can try the Flash Attention 3 method:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
cd ../..  # Return to Ovi directory
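
After installation, a short check like the one below can confirm that Python can find the packages. This is a sketch, not an official script. The module name `flash_attn_interface` for the Flash Attention 3 build is an assumption here, so treat a "MISSING" result for it with caution.

```python
import importlib.util

# flash_attn_interface is assumed to be the module name of the Flash Attention 3 build.
DEFAULT_PACKAGES = ("torch", "torchvision", "torchaudio", "flash_attn", "flash_attn_interface")


def check_install(packages=DEFAULT_PACKAGES):
    """Map each top-level package name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name:22s} {'found' if found else 'MISSING'}")
```

Only one of the two Flash Attention modules is expected to be present, depending on which installation method was used.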

Download Weights

We use open-source checkpoints from Wan and MMAudio, so they need to be downloaded from Hugging Face:

# By default, weights are downloaded to ./ckpts, which the inference YAML already points to, so no change is required
python3 download_weights.py

OR

# Optionally, specify --output-dir to download to a custom directory;
# in that case, update the inference YAML to point to that directory
python3 download_weights.py --output-dir <custom_dir>
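
When a custom directory is used, the ckpt_dir entry of the inference YAML has to match it. Below is a minimal sketch that rewrites that one line in the config text while leaving comments and other keys untouched. The helper `set_ckpt_dir` is hypothetical, and it assumes ckpt_dir is a top-level key written on a single line.

```python
import re

# Matches a top-level `ckpt_dir:` line and captures the key plus its current value.
CKPT_DIR_RE = re.compile(r'^(ckpt_dir:[ \t]*)("[^"]*"|\S+)', re.MULTILINE)


def set_ckpt_dir(yaml_text, new_dir):
    """Return yaml_text with the ckpt_dir value replaced by new_dir (quoted)."""
    updated, count = CKPT_DIR_RE.subn(
        lambda m: f'{m.group(1)}"{new_dir}"', yaml_text, count=1
    )
    if count == 0:
        raise KeyError("ckpt_dir not found in config text")
    return updated


if __name__ == "__main__":
    sample = 'output_dir: "./outputs"\nckpt_dir: "./ckpts"   # Path to model checkpoints\n'
    print(set_ckpt_dir(sample, "/data/ovi_ckpts"))
```

To apply it to the real file, read ovi/configs/inference/inference_fusion.yaml as text, pass it through the function, and write the result back.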

🚀 Run Examples

⚙️ Configure Ovi

Ovi's behavior and output can be customized by editing the ovi/configs/inference/inference_fusion.yaml configuration file. The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:

# Output and Model Configuration
output_dir: "/path/to/save/your/videos"                    # Directory to save generated videos
ckpt_dir: "/path/to/your/ckpts/dir"                        # Path to model checkpoints

# Generation Quality Settings
num_steps: 50                             # Number of denoising steps. Lower (30-40) = faster generation
solver_name: "unipc"                     # Sampling algorithm for denoising process
shift: 5.0                               # Timestep shift factor for sampling scheduler
seed: 100                                # Random seed for reproducible results

# Guidance Strength Control
audio_guidance_scale: 3.0                # Strength of audio conditioning. Higher = better audio-text sync
video_guidance_scale: 4.0                # Strength of video conditioning. Higher = better video-text adherence
slg_layer: 11                            # Layer for applying SLG (Skip Layer Guidance) technique - feel free to try different layers!

# Multi-GPU and Performance
sp_size: 1                               # Sequence parallelism size. Set equal to number of GPUs used
cpu_offload: False                       # CPU offload: greatly reduces peak GPU VRAM but adds ~20 seconds to end-to-end runtime

# Input Configuration
text_prompt: "/path/to/csv"              # Path to a CSV/TSV file with prompts, OR a literal prompt string such as "your prompt here"
mode: "t2v"                              # One of "t2v", "i2v", "t2i2v"; t2i2v first generates a starting image with Flux Krea, then continues with i2v
video_frame_height_width: [512, 992]    # Video dimensions [height, width] for T2V mode only
each_example_n_times: 1                  # Number of times to generate each prompt

# Quality Control (Negative Prompts)
video_negative_prompt: "jitter, bad hands, blur, distortion"  # Artifacts to avoid in video
audio_negative_prompt: "robotic, muffled, echo, distorted"    # Artifacts to avoid in audio
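
Before launching a long run, it can help to sanity-check the values you plan to put in the config. The sketch below validates a plain dictionary that mirrors a few of the keys above. `validate_config` is a hypothetical helper, and the rules it enforces (positive integers, two positive dimensions) are assumptions rather than checks performed by Ovi itself.

```python
ALLOWED_MODES = ("t2v", "i2v", "t2i2v")


def validate_config(cfg):
    """Return a list of problems found in cfg; an empty list means none were found."""
    problems = []
    if cfg.get("mode") not in ALLOWED_MODES:
        problems.append(f"mode must be one of {list(ALLOWED_MODES)}")
    for key in ("num_steps", "sp_size", "each_example_n_times"):
        value = cfg.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer")
    hw = cfg.get("video_frame_height_width")
    if not (
        isinstance(hw, (list, tuple))
        and len(hw) == 2
        and all(isinstance(v, int) and v > 0 for v in hw)
    ):
        problems.append("video_frame_height_width must be [height, width]")
    if not isinstance(cfg.get("cpu_offload"), bool):
        problems.append("cpu_offload must be true or false")
    return problems


if __name__ == "__main__":
    cfg = {
        "mode": "t2v",
        "num_steps": 50,
        "sp_size": 1,
        "each_example_n_times": 1,
        "video_frame_height_width": [512, 992],
        "cpu_offload": False,
    }
    print(validate_config(cfg) or "no problems found")
```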

🎬 Running Inference

Single GPU (Simple Setup)

python3 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this for single GPU setups. The text_prompt can be a single string or path to a CSV file.

Multi-GPU (Parallel Processing)

torchrun --nnodes 1 --nproc_per_node 8 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this to run samples in parallel across multiple GPUs for faster processing.
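
The two launch commands above differ only in the launcher, so a small wrapper can pick the right one from the GPU count. This is a sketch; `launch_command` is a hypothetical helper that only builds the command line shown in this README and does not run it.

```python
import shlex

CONFIG = "ovi/configs/inference/inference_fusion.yaml"


def launch_command(num_gpus, config_file=CONFIG):
    """Build the inference command for one GPU (python3) or several (torchrun)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    script_args = ["inference.py", "--config-file", config_file]
    if num_gpus == 1:
        return ["python3", *script_args]
    return ["torchrun", "--nnodes", "1", "--nproc_per_node", str(num_gpus), *script_args]


if __name__ == "__main__":
    for gpus in (1, 8):
        print(shlex.join(launch_command(gpus)))
```

The returned list can be passed directly to `subprocess.run`. If sequence parallelism is wanted, remember that sp_size in the config should equal the number of GPUs used.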

Memory & Performance Requirements

Below are approximate GPU memory requirements for different configurations. The sequence parallel implementation will be optimized in the future. All end-to-end times are measured on a 121-frame, 720×720 video using 50 denoising steps. The minimum GPU VRAM required to run the model is 32 GB.

| Sequence Parallel Size | FlashAttention-3 Enabled | CPU Offload | With Image Gen Model | Peak VRAM Required | End-to-End Time |
|---|---|---|---|---|---|
| 1 | Yes | No | No | ~80 GB | ~83s |
| 1 | No | No | No | ~80 GB | ~96s |
| 1 | Yes | Yes | No | ~80 GB | ~105s |
| 1 | No | Yes | No | ~32 GB | ~118s |
| 1 | Yes | Yes | Yes | ~32 GB | ~140s |
| 4 | Yes | No | No | ~80 GB | ~55s |
| 8 | Yes | No | No | ~80 GB | ~40s |
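
To help read the table, the sketch below encodes its rows and picks the fastest listed configuration that fits a given amount of VRAM and number of GPUs. It also computes the speedup from sequence parallelism. The helper names are hypothetical, the numbers are copied from the table, and treating the peak VRAM figure as a per-GPU requirement is an assumption.

```python
from typing import NamedTuple


class Row(NamedTuple):
    sp_size: int
    flash_attn3: bool
    cpu_offload: bool
    image_gen: bool
    peak_vram_gb: int
    seconds: int


# Values copied from the table above (all approximate).
BENCHMARKS = [
    Row(1, True, False, False, 80, 83),
    Row(1, False, False, False, 80, 96),
    Row(1, True, True, False, 80, 105),
    Row(1, False, True, False, 32, 118),
    Row(1, True, True, True, 32, 140),
    Row(4, True, False, False, 80, 55),
    Row(8, True, False, False, 80, 40),
]


def fastest_config(vram_gb, num_gpus, need_image_gen=False):
    """Return the fastest listed row that fits the hardware, or None."""
    candidates = [
        r for r in BENCHMARKS
        if r.peak_vram_gb <= vram_gb
        and r.sp_size <= num_gpus
        and (r.image_gen or not need_image_gen)
    ]
    return min(candidates, key=lambda r: r.seconds, default=None)


def speedup(sp_size):
    """Speedup over the single-GPU run with FlashAttention-3 and no CPU offload."""
    def pick(size):
        return next(
            r for r in BENCHMARKS
            if r.sp_size == size and r.flash_attn3 and not r.cpu_offload
        )
    return pick(1).seconds / pick(sp_size).seconds


if __name__ == "__main__":
    print(fastest_config(vram_gb=32, num_gpus=1))
    for size in (4, 8):
        print(f"sp_size={size}: {speedup(size):.2f}x faster than one GPU")
```

By these figures, 8 GPUs give roughly a 2x reduction in end-to-end time rather than 8x, which is consistent with the note that the sequence parallel implementation is not yet optimized.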

Gradio

We provide a simple script to run our model in a Gradio UI. It uses the ckpt_dir in ovi/configs/inference/inference_fusion.yaml to initialize the model:

python3 gradio_app.py

OR

# Enable CPU offload to save GPU VRAM; this slows down end-to-end inference by ~20 seconds
python3 gradio_app.py --cpu_offload

OR

# Enable an additional image generation model to generate first frames for I2V; cpu_offload is enabled automatically when the image generation model is enabled
python3 gradio_app.py --use_image_gen

🙏 Acknowledgements

We would like to thank the following projects:

Wan2.2: Our video branch is initialized from the Wan2.2 repository
MMAudio: Our audio encoder and decoder components are borrowed from the MMAudio project, which also inspired some of our ideas.

⭐ Citation

If you find Ovi helpful, please ⭐ the repo.

If you find this project useful for your research, please consider citing our paper.

BibTeX

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284}, 
}