---
license: apache-2.0
pipeline_tag: image-to-image
library_name: diffusers
---

# iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

[![arXiv](https://img.shields.io/badge/arXiv-2511.20635-b31b1b.svg)](https://huggingface.co/papers/2511.20635)
[![Project Page](https://img.shields.io/badge/Project-Page-4b9e5f.svg)](https://kr1sjfu.github.io/iMontage-web/)
[![GitHub](https://img.shields.io/badge/GitHub-Code-keygen.svg?logo=github&style=flat-square)](https://github.com/Kr1sJFU/iMontage)

<p align="center">
  <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/demo/teaser.png" alt="iMontage Teaser" width="85%">
</p>

iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, maintaining strong cross-image contextual consistency and generating scenes with extraordinary dynamics.

## 📦 Features

- ⚡ High-dynamic, high-consistency image generation from flexible inputs
- 🎛️ Robust instruction following across heterogeneous tasks
- 🌀 Video-like temporal coherence, even for non-video image sets
- 🏆 SOTA results across different tasks

## 🚀 Sample Usage

For detailed installation instructions and more complex inference examples, please refer to the [GitHub repository](https://github.com/Kr1sJFU/iMontage).

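Here is a simple Python snippet demonstrating image generation with iMontage. It imports `FlexARInferenceSolver` from a local `inference_solver` module, so run it from an environment where that module is on the Python path: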

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="Kr1sJ/iMontage", # Use this Hugging Face model
    precision="bf16",
    target_size=768, # Ensure target_size is consistent with the checkpoint
)

# The prompt is two lines: the size instruction, then the scene description.
q1 = (
    "Generate an image of 768x768 according to the following prompt:\n"
    "Image of a dog playing water, and a waterfall is in the background."
)

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# Display or save the image
new_image.show()
# new_image.save("generated_dog.png")
```
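
When a call returns several images, as in the multi-output tasks, it can help to write them all to disk in one step. The following is a minimal sketch, assuming the second element of `generated` is a list of PIL images as the comment in the snippet above states. The helper name `save_generated_images`, the `imontage` prefix, and the output directory are hypothetical and not part of the iMontage code.

```python
from pathlib import Path
from typing import Iterable, List


def save_generated_images(images: Iterable, out_dir: str, prefix: str = "imontage") -> List[Path]:
    """Save each image as <out_dir>/<prefix>_<index>.png and return the written paths.

    Hypothetical helper: `images` is assumed to be an iterable of objects
    exposing a PIL-style `.save(path)` method.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the output folder if needed

    paths: List[Path] = []
    for index, image in enumerate(images):
        path = target / f"{prefix}_{index}.png"
        image.save(path)  # PIL.Image.Image.save accepts a str or pathlib.Path
        paths.append(path)
    return paths
```

With the snippet above, a call such as `save_generated_images(generated[1], "outputs")` would write `outputs/imontage_0.png` and so on, one file per returned image.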

The model supports various tasks as illustrated below:

| **Task Type**        | **Input**                                                   | **Prompt**                                                    | **Output**                                                  |
|----------------------|-------------------------------------------------------------|----------------------------------------------------------------|-------------------------------------------------------------|
| **image_editing**    | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/llava.png" width="120">             | *Change the material of the lava to silver.*                  | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/llava_0.png" width="120">    |
| **cref**             | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/Confucius.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/Moses.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/Solon.png" width="80"> | *Confucius from the first image, Moses from the second…*      | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/Confucius_0.png" width="120">             |
| **conditioned_cref**    | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/depth.png" width="80">  <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/girl.png" width="80">             | *depth*                  | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/depth_0.png" width="120">    |
| **sref**             | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/woman.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/joker.png" width="80">      | *(empty)*                                                      | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/woman_0.png" width="120">             |
| **multiview**        | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/city.png" width="120">              | *1. Shift left; 2. Look up; 3. Zoom out.*                     | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/city_0.png" width="80">  <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/city_1.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/city_2.png" width="80">       |
| **storyboard**       | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/Hepburn.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/images/yellow_bag.png" width="80"> | *Vintage film: 1. Hepburn carrying the yellow bag…*           | <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/Hepburn_0.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/Hepburn_1.png" width="80"> <img src="https://huggingface.co/Kr1sJ/iMontage/resolve/main/results/Hepburn_2.png" width="80">     |

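The multiview and storyboard prompts in the table enumerate their requested outputs as `1. …; 2. …; 3. …`. A small sketch of splitting such a prompt into an optional preamble and its numbered steps is shown below, which can be handy for checking that the number of returned images matches the number of requested steps. The function `split_numbered_prompt` is a hypothetical utility written for illustration; it is not part of the iMontage code, and the model itself receives the prompt as plain text.

```python
import re
from typing import List, Tuple


def split_numbered_prompt(prompt: str) -> Tuple[str, List[str]]:
    """Split 'Preamble: 1. A; 2. B.' into ('Preamble:', ['A', 'B']).

    Hypothetical helper: a step marker is assumed to be an integer followed
    by a period, located at the start of the prompt or after whitespace.
    """
    parts = re.split(r"(?:^|\s)\d+\.\s*", prompt)
    preamble = parts[0].strip()  # text before the first marker (may be empty)
    # Drop the separators (';' or '.') that trail each step.
    steps = [part.strip().rstrip(";.").strip() for part in parts[1:]]
    return preamble, [step for step in steps if step]


preamble, steps = split_numbered_prompt("1. Shift left; 2. Look up; 3. Zoom out.")
print(len(steps), steps)
```

For the multiview prompt from the table this yields three steps, matching the three output images shown in that row.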
## 💖 Acknowledgment

We sincerely thank the open-source community for providing strong foundations that enabled this work.  
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:

- **HunyuanVideo-T2V**, **HunyuanVideo-I2V** – Provided base generative model designs and code.
- **FastVideo** – Contributed key components and open-source utilities that supported our development.

These contributions have greatly influenced our research and helped shape the design of **iMontage**.

---

## 📝 Citation

If you find **iMontage** useful for your research or applications, please consider starring ⭐ the repo and citing our paper:

```bibtex
@article{fu2025iMontage,
  title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation}, 
  author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
  journal={arXiv preprint arXiv:2511.20635},
  year={2025},   
}
```