---
license: apache-2.0
base_model:
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: text-to-video
tags:
- diffusion-single-file
- text-to-video
- video-to-video
- realtime
library_name: diffusers
---
Krea Realtime 14B is distilled from the [Wan 2.1 14B text-to-video model](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) using Self-Forcing, a technique for converting regular video diffusion models into autoregressive models. It achieves a text-to-video inference speed of **11fps** using 4 inference steps on a single NVIDIA B200 GPU. For more details on our training methodology and sampling innovations, refer to our [technical blog post](https://www.krea.ai/blog/krea-realtime-14b).
Inference code can be found [here](https://github.com/krea-ai/realtime-video).
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/NI1qn109PHVeO_LvBQIr8.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
- Our model is over **10x larger than existing realtime video models**
- We introduce **novel techniques for mitigating error accumulation,** including **KV Cache Recomputation** and **KV Cache Attention Bias**
- We develop **memory optimizations specific to autoregressive video diffusion models** that facilitate training large autoregressive models
- **Our model enables realtime interactive capabilities**: Users can modify prompts mid-generation, restyle videos on-the-fly, and see first frames within 1 second
# Video To Video
Krea Realtime lets users stream real videos, webcam inputs, or canvas primitives into the model, unlocking controllable video synthesis and editing.
<div align="center">
<table>
<tr>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/iW8bdR6Q4WlZS3PW87Q8c.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/x2JUP1bvzBr1_nuUpHuM-.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
</tr>
<tr>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/lYEE3n5_Ms8B9jCTkfJuq.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/e_rSMargujaaVqHSS-qm5.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
</tr>
</table>
</div>
# Text To Video
Krea Realtime allows users to generate videos in a streaming fashion, with ~1 s time to first frame.
<div align="center">
<table>
<tr>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/4Lz7mRSAHXBPi6Q56Vxmi.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/-rLGuV0eaXRPCDYMcT0Xr.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
</tr>
<tr>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/dg5Tf7lIme_bc-JHrNAnD.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
<td width="50%">
<video width="100%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/62a2712903bf94c3ac3ae004/nRfYFFiMN3KKfshZVYeTz.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</td>
</tr>
</table>
</div>
# Use it with our inference code
Set up
```bash
sudo apt install ffmpeg # install if you haven't already
git clone https://github.com/krea-ai/realtime-video
cd realtime-video
uv sync
uv pip install flash_attn --no-build-isolation
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download krea/krea-realtime-video krea-realtime-video-14b.safetensors --local-dir-use-symlinks False --local-dir checkpoints/krea-realtime-video-14b.safetensors
```
Run
```bash
export MODEL_FOLDER=Wan-AI
export CUDA_VISIBLE_DEVICES=0 # pick the GPU you want to serve on
export DO_COMPILE=true
uvicorn release_server:app --host 0.0.0.0 --port 8000
```
Then open the web app at http://localhost:8000/ in your browser.
(For more advanced use cases and custom pipelines, check out our GitHub repository: https://github.com/krea-ai/realtime-video)
# Use it with 🧨 diffusers
Krea Realtime 14B can be used with the `diffusers` library via the new Modular Diffusers structure.
```bash
# Install diffusers from main
pip install git+https://github.com/huggingface/diffusers.git
```
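To confirm that your installation exposes the Modular Diffusers APIs used below, here is a quick optional sanity check (assuming an install from `main` as above):
```py
# Optional sanity check: the examples below rely on ModularPipeline and
# PipelineState, which are only available in recent diffusers builds.
import diffusers
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

print(diffusers.__version__)
```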
<details>
<summary>Text to Video</summary>
```py
import torch
from tqdm import tqdm
from diffusers.utils import export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)

for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

num_blocks = 9
frames = []
state = PipelineState()
prompt = ["a cat sitting on a boat"]
generator = torch.Generator(device=pipe.device).manual_seed(42)

for block_idx in tqdm(range(num_blocks)):
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=6,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

export_to_video(frames, "output.mp4", fps=24)
```
</details>
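<details>
<summary>Prompt Switching Mid-Generation (sketch)</summary>

Because sampling proceeds block by block, the prompt passed to each call can change mid-generation (the "modify prompts mid-generation" capability mentioned above). The following is a minimal sketch built on the text-to-video loop; the switch point and second prompt are illustrative, and it assumes the pipeline simply picks up the new prompt on the next block.

```py
import torch
from tqdm import tqdm
from diffusers.utils import export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)
for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

num_blocks = 9
switch_at = 4  # illustrative: swap the prompt after this many blocks
prompts = [
    ["a cat sitting on a boat"],
    ["a cat sitting on a boat at sunset, golden light"],
]

frames = []
state = PipelineState()
generator = torch.Generator("cuda").manual_seed(42)
for block_idx in tqdm(range(num_blocks)):
    prompt = prompts[0] if block_idx < switch_at else prompts[1]
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=6,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

export_to_video(frames, "output-prompt-switch.mp4", fps=24)
```
</details>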
<details>
<summary>Video to Video</summary>
```py
import torch
from tqdm import tqdm
from diffusers.utils import load_video, export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)

for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

num_blocks = 9
video = load_video("https://app-uploads.krea.ai/public/a8218957-1a80-43dc-81b2-da970b5f2221-video.mp4")

frames = []
prompt = ["A car racing down a snowy mountain"]
state = PipelineState()
generator = torch.Generator("cuda").manual_seed(42)

for block_idx in tqdm(range(num_blocks)):
    state = pipe(
        state,
        video=video,
        prompt=prompt,
        num_inference_steps=6,
        strength=0.3,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

export_to_video(frames, "output-v2v.mp4", fps=24)
```
</details>
<details>
<summary>Streaming Video to Video</summary>
Using the `video_stream` input processes video frames as they arrive, while maintaining temporal consistency across chunks.
```py
import torch
from tqdm import tqdm
from diffusers.utils import load_video, export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)

for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

n_samples = 9
frame_sample_len = 12
video = load_video(
    "https://app-uploads.krea.ai/public/a8218957-1a80-43dc-81b2-da970b5f2221-video.mp4"
)

# Simulate streaming video input
frame_samples = [
    video[sample_start : sample_start + frame_sample_len]
    for sample_start in range(0, n_samples * frame_sample_len, frame_sample_len)
]

frames = []
state = PipelineState()
prompt = ["A car racing down a snowy mountain road"]
block_idx = 0
generator = torch.Generator("cpu").manual_seed(42)

for frame_sample in tqdm(frame_samples):
    state = pipe(
        state,
        video_stream=frame_sample,
        prompt=prompt,
        num_inference_steps=6,
        strength=0.3,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])
    block_idx += 1

export_to_video(frames, "output-v2v-streaming.mp4", fps=24)
```
</details>
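<details>
<summary>Streaming From a Webcam (sketch)</summary>

The model also accepts webcam input as a video-to-video source. Below is a minimal sketch that feeds webcam frames into the streaming loop above using OpenCV; it assumes `video_stream` accepts the same PIL frames that `diffusers.utils.load_video` returns, and the camera index, prompt, and chunk count are illustrative.

```py
import cv2
import torch
from PIL import Image
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState
from diffusers.utils import export_to_video

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)
for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

frame_sample_len = 12  # frames per chunk, matching the streaming example above
n_chunks = 9           # illustrative: how many chunks to process before stopping
prompt = ["an astronaut floating inside a space station"]

cap = cv2.VideoCapture(0)  # default webcam
frames = []
state = PipelineState()
generator = torch.Generator("cpu").manual_seed(42)

for block_idx in range(n_chunks):
    # Grab a chunk of webcam frames; convert BGR numpy arrays to RGB PIL images,
    # the same frame type returned by diffusers.utils.load_video.
    chunk = []
    while len(chunk) < frame_sample_len:
        ok, frame = cap.read()
        if not ok:
            break
        chunk.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    if len(chunk) < frame_sample_len:
        break

    state = pipe(
        state,
        video_stream=chunk,
        prompt=prompt,
        num_inference_steps=6,
        strength=0.3,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

cap.release()
export_to_video(frames, "output-webcam.mp4", fps=24)
```
</details>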
<details>
<summary>Using LoRAs</summary>
```py
import torch
from tqdm import tqdm
from diffusers.utils import export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)

pipe.transformer.load_lora_adapter(
    "shauray/Origami_WanLora",
    prefix="diffusion_model",
    weight_name="origami_000000500.safetensors",
    adapter_name="origami",
)

for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

num_blocks = 9
frames = []
state = PipelineState()
prompt = ["[origami] a cat sitting on a boat"]
generator = torch.Generator("cuda").manual_seed(42)

for block_idx in tqdm(range(num_blocks)):
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=6,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

export_to_video(frames, "output.mp4", fps=24)
```
</details>
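Since the LoRA is loaded through the standard diffusers PEFT adapter machinery, you can also rescale or temporarily disable it between runs. A small sketch (assuming the remote-code transformer exposes the usual diffusers PEFT adapter mixin methods):
```py
# Rescale the LoRA; the adapter name matches the one loaded above.
# Assumes the transformer exposes the standard diffusers PEFT adapter mixin.
pipe.transformer.set_adapters(["origami"], weights=[0.8])

pipe.transformer.disable_lora()  # run the base model without the LoRA
pipe.transformer.enable_lora()   # re-enable it
```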
<details>
<summary>Optimized Inference</summary>
To optimize inference speed and memory usage on Hopper-class GPUs (e.g., H100), we recommend using `torch.compile`, SageAttention, and FP8 quantization with [torchao](https://github.com/pytorch/ao).
First, set up the dependencies by enabling SageAttention via Hub [kernels](https://huggingface.co/docs/kernels/en/index) and installing the `torchao` and `kernels` packages.
```shell
export DIFFUSERS_ENABLE_HUB_KERNELS=true
pip install -U kernels torchao
```
Alternatively, you can use Flash Attention 3 via kernels by disabling SageAttention:
```shell
export DISABLE_SAGEATTENTION=1
```
Then we will iterate over the blocks of the transformer and apply quantization and `torch.compile`.
```py
import torch
from tqdm import tqdm
from diffusers.utils import export_to_video
from diffusers import ModularPipeline
from diffusers.modular_pipelines import PipelineState
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, quantize_

repo_id = "krea/krea-realtime-video"
pipe = ModularPipeline.from_pretrained(repo_id, trust_remote_code=True)
pipe.load_components(
    trust_remote_code=True,
    device_map="cuda",
    torch_dtype={"default": torch.bfloat16, "vae": torch.float16},
)

for block in pipe.transformer.blocks:
    block.self_attn.fuse_projections()

# Quantize just the transformer blocks
for block in pipe.transformer.blocks:
    quantize_(block, Float8DynamicActivationFloat8WeightConfig())

# Compile just the attention modules
for submod in pipe.transformer.modules():
    if submod.__class__.__name__ in ["CausalWanAttentionBlock"]:
        submod.compile(fullgraph=False)

num_blocks = 9
state = PipelineState()
prompt = ["a cat sitting on a boat"]

# Compile warmup
for block_idx in range(num_blocks):
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=2,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=torch.Generator("cuda").manual_seed(42),
    )

# Reset state
state = PipelineState()
frames = []
generator = torch.Generator("cuda").manual_seed(42)

for block_idx in tqdm(range(num_blocks)):
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=6,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=generator,
    )
    frames.extend(state.values["videos"][0])

export_to_video(frames, "output.mp4", fps=24)
```
</details>
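<details>
<summary>Measuring Throughput (sketch)</summary>

To sanity-check throughput on your own hardware, you can time the block-wise loop directly. A rough sketch, assuming the `pipe`, `prompt`, and `num_blocks` objects set up in the optimized example above and a CUDA device:

```py
import time
import torch
from diffusers.modular_pipelines import PipelineState

# Rough throughput check: time each block call and report effective frames/sec.
state = PipelineState()
generator = torch.Generator("cuda").manual_seed(42)
frames, total_time = [], 0.0

for block_idx in range(num_blocks):
    torch.cuda.synchronize()
    start = time.perf_counter()
    state = pipe(
        state,
        prompt=prompt,
        num_inference_steps=6,
        num_blocks=num_blocks,
        block_idx=block_idx,
        generator=generator,
    )
    torch.cuda.synchronize()
    total_time += time.perf_counter() - start
    frames.extend(state.values["videos"][0])

print(f"{len(frames) / total_time:.1f} frames/sec across {num_blocks} blocks")
```
</details>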