---
library_name: minisora
license: mit
language:
  - en
tags:
  - text-to-video
  - video-diffusion
  - continuation
  - colossalai
pipeline_tag: text-to-video
---

# MiniSora: Fully Open Video Diffusion with ColossalAI

[GitHub: YN35/minisora](https://github.com/YN35/minisora)  
[Author (X / Twitter): @__ramu0e__](https://x.com/__ramu0e__)

---

## 🧾 Overview

**MiniSora** is a fully open video diffusion codebase designed for everything from research to production.

- All training, inference, and evaluation scripts are available
- Supports multi-GPU / multi-node training via **ColossalAI**
- Simple DiT-based video model + pipeline, inspired by Diffusers
- Includes a continuation demo to generate the "next" part of an existing video

This repository hosts the DiT pipeline checkpoint trained on DMLab trajectories, published as `ramu0e/minisora-dmlab`.

---

## 🚀 Inference: Text-to-Video (Minimal Example)

```python
from minisora.models import DiTPipeline

pipeline = DiTPipeline.from_pretrained("ramu0e/minisora-dmlab")

output = pipeline(
    batch_size=1,
    num_inference_steps=28,
    height=64,
    width=64,
    num_frames=20,
)
latents = output.latents  # shape: (B, C, F, H, W)
```

The returned `latents` are video tensors in the same normalized space used during training.  
Use the scripts in the repository to decode or visualize them.
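If the latents live in a `[-1, 1]`-normalized pixel space (an assumption here; the repository's decoding scripts are authoritative), converting them to displayable `uint8` frames might look like this sketch:

```python
import numpy as np

def latents_to_uint8_frames(latents: np.ndarray) -> np.ndarray:
    """Map normalized latents (B, C, F, H, W) to uint8 frames (B, F, H, W, C).

    Assumes values are roughly in [-1, 1]; clips outliers, then rescales to [0, 255].
    """
    frames = np.clip(latents, -1.0, 1.0)                 # guard against outliers
    frames = ((frames + 1.0) * 127.5).astype(np.uint8)   # [-1, 1] -> [0, 255]
    return frames.transpose(0, 2, 3, 4, 1)               # channels-last per frame

# Example on a dummy tensor matching the shape shown above
dummy = np.random.uniform(-1.0, 1.0, size=(1, 3, 20, 64, 64))
frames = latents_to_uint8_frames(dummy)
print(frames.shape)  # (1, 20, 64, 64, 3)
```

From there, each `(H, W, C)` frame can be written out with any video or image library.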

---

## 🎥 Continuation: Generate the Rest of a Video

MiniSora also supports continuation-style generation (like Sora), where subsequent frames are sampled while conditioning on the observed prefix.  
A demo script is bundled to extend existing videos.

```bash
uv run scripts/demo/full_continuation.py \
  --model-id ramu0e/minisora-dmlab \
  --input-video path/to/input.mp4 \
  --num-extend-frames 12 \
  --num-inference-steps 28 \
  --seed 1234
```

See `scripts/demo/full_continuation.py` for the exact arguments and I/O specification.
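Conceptually, prefix-conditioned sampling keeps the observed frames clamped to their known values while the remaining frames are denoised. The following sketch illustrates that mechanic with a hypothetical `denoise_step` callable; it is not the repository's actual sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

def continue_video(prefix, num_extend, num_steps, denoise_step):
    """Illustrative prefix-conditioned sampling loop (not the repo's code).

    prefix: (C, F_obs, H, W) observed frames; new frames start as pure noise.
    denoise_step: hypothetical callable performing one denoising update.
    """
    c, f_obs, h, w = prefix.shape
    # Append noise frames after the observed prefix along the frame axis
    x = np.concatenate([prefix, rng.normal(size=(c, num_extend, h, w))], axis=1)
    for step in range(num_steps):
        x = denoise_step(x, step)
        x[:, :f_obs] = prefix  # re-clamp observed frames after every update
    return x

# Identity-like "denoiser" just to show the mechanics
out = continue_video(np.zeros((3, 8, 64, 64)), 12, 4, lambda x, s: x * 0.9)
print(out.shape)  # (3, 20, 64, 64)
```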

---

## 🧩 Key Features

- **End-to-End Transparency**  
  - Model definition (DiT): `src/minisora/models/modeling_dit.py`  
  - Pipeline: `src/minisora/models/pipeline_dit.py`  
  - Training script: `scripts/train.py`  
  - Data loaders: `src/minisora/data/`  
  Every stage from data to inference is available.

- **ColossalAI for Scale-Out Training**  
  - Zero / DDP plugins  
  - Designed for multi-GPU and multi-node setups  
  - Easy experimentation with large video models

- **Simple, Readable Implementation**  
  - Dependency management via `uv` (`uv sync` / `uv run`)  
  - Minimal Diffusers-inspired video DiT pipeline  
  - Experiments and analysis scripts organized under `reports/`

- **Continuation / Conditioning Ready**  
  - Masking logic to fix conditioned frames  
  - Training scheme that applies noise to only part of the sequence

---

## 🛠 Install & Setup

### 1. Clone the Repository

```bash
git clone https://github.com/YN35/minisora.git
cd minisora
```

### 2. Install Dependencies with `uv`

```bash
uv sync
```

All scripts can then be executed through `uv run ...`.

---

## 📦 This Pipeline (`ramu0e/minisora-dmlab`)

This Hugging Face repository distributes the MiniSora DiT pipeline checkpoint trained on DMLab trajectories.

- **Model type**: DiT-based video diffusion model  
- **Training resolution**: e.g., 64Γ—64 or 128Γ—128 (see `reports/` in the repo)  
- **Frames per sample**: typically 20  
- **Library**: `minisora` (custom lightweight framework)  
- **Use case**: research or sample-quality video generation

---

## 🧪 Training (Summary)

Complete training code is available in the repository.

- Main script: `scripts/train.py`
- Highlights:
  - Rectified-flow style training with `FlowMatchEulerDiscreteScheduler`
  - ColossalAI Booster to switch between Zero / DDP
  - Conditioning-aware objective (noise applied to only a subset of frames)
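The rectified-flow objective with partial-frame noising can be sketched as follows. This is a generic illustration of the scheme (interpolant `x_t = (1 - t) * x0 + t * noise`, velocity target `noise - x0`, binary per-frame mask), not the exact code in `scripts/train.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_target(x0, noise, t):
    """Rectified flow: interpolate x_t = (1 - t) * x0 + t * noise;
    the model regresses the constant velocity v = noise - x0."""
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target

# Video batch (B, C, F, H, W); the mask noises only the last 12 of 20 frames
x0 = rng.normal(size=(2, 3, 20, 16, 16))
noise = rng.normal(size=x0.shape)
t = 0.5
mask = np.zeros((1, 1, 20, 1, 1))
mask[:, :, 8:] = 1.0  # 1 = frame receives noise, 0 = conditioned (kept clean)

x_t, v_target = flow_matching_target(x0, noise, t)
x_t = mask * x_t + (1.0 - mask) * x0  # conditioned frames stay clean

v_pred = np.zeros_like(v_target)      # stand-in for the model's output
loss = ((mask * (v_pred - v_target)) ** 2).mean()  # loss only on noised frames
print(x_t.shape)  # (2, 3, 20, 16, 16)
```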

### Example: Single-Node Training

```bash
uv run scripts/train.py \
  --dataset_type minecraft \
  --data_root /path/to/train_data \
  --outputs outputs/exp1 \
  --batch_size 32 \
  --precision bf16
```

### Example: Multi-Node (torchrun + ColossalAI)

```bash
torchrun --nnodes 2 --nproc_per_node 8 scripts/train.py \
  --dataset_type minecraft \
  --data_root /path/to/train_data \
  --outputs outputs/exp-multinode \
  --batch_size 64 \
  --plugin zero --zero 1
```

Refer to `scripts/train.py` for all available options.

---

## 📚 Repository Structure (Excerpt)

- `src/minisora/models/modeling_dit.py` – core DiT transformer for video
- `src/minisora/models/pipeline_dit.py` – Diffusers-style pipeline (`DiTPipeline`)
- `src/minisora/data/` – datasets and distributed samplers (DMLab, Minecraft)
- `scripts/train.py` – ColossalAI-based training loop
- `scripts/demo/full_vgen.py` – simple end-to-end video generation demo
- `scripts/demo/full_continuation.py` – continuation demo
- `reports/` – experiment notes, mask visualizations, metric scripts

---

## πŸ” Limitations & Notes

- This checkpoint targets research-scale experiments.  
- Quality at higher resolution or longer durations depends on data and hyperparameters.  
- Continuation quality varies with the provided prefix and conditioning setup.

---

## 🤝 Contributions

- Contributions to code, models, and docs are welcome.  
- Please open issues or PRs at [YN35/minisora](https://github.com/YN35/minisora).

---

## 📄 License

- Code and weights are released under the **MIT License**.  
  Commercial use, modification, and redistribution are all permitted (see the GitHub `LICENSE`).

```text
MIT License
Copyright (c) YN
Permission is hereby granted, free of charge, to any person obtaining a copy
...
```