| library_name: diffusers | |
| tags: | |
| - modular-diffusers | |
| - diffusers | |
| - helios | |
| - text-to-image | |
| This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework. | |
| **Pipeline Type**: HeliosAutoBlocks | |
| **Description**: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios. | |
| This pipeline uses a 4-block architecture that can be customized and extended. | |
| ## Example Usage | |
| [TODO] | |
| ## Pipeline Architecture | |
| This modular pipeline is composed of the following blocks: | |
| 1. **text_encoder** (`HeliosTextEncoderStep`) | |
| - Text Encoder step that generates text embeddings to guide the video generation | |
| 2. **vae_encoder** (`HeliosAutoVaeEncoderStep`) | |
| - Encoder step that encodes video or image inputs. This is an auto pipeline block. | |
| - *video_encoder*: `HeliosVideoVaeEncoderStep` | |
| - Video Encoder step that encodes an input video into VAE latent space, producing image_latents (first frame) and video_latents (chunked video frames) for video-to-video generation. | |
| - *image_encoder*: `HeliosImageVaeEncoderStep` | |
| - Image Encoder step that encodes an input image into VAE latent space, producing image_latents (first frame prefix) and fake_image_latents (history seed) for image-to-video generation. | |
| 3. **denoise** (`HeliosAutoCoreDenoiseStep`) | |
| - Core denoise step that selects the appropriate denoising block. | |
| - *video2video*: `HeliosV2VCoreDenoiseStep` | |
| - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation. | |
| - *image2video*: `HeliosI2VCoreDenoiseStep` | |
| - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation. | |
| - *text2video*: `HeliosCoreDenoiseStep` | |
| - Denoise block that takes encoded conditions and runs the chunk-based denoising process. | |
| 4. **decode** (`HeliosDecodeStep`) | |
| - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output. | |
| ## Model Components | |
| 1. text_encoder (`UMT5EncoderModel`) | |
| 2. tokenizer (`AutoTokenizer`) | |
| 3. guider (`ClassifierFreeGuidance`) | |
| 4. vae (`AutoencoderKLWan`) | |
| 5. video_processor (`VideoProcessor`) | |
| 6. transformer (`HeliosTransformer3DModel`) | |
| 7. scheduler (`HeliosScheduler`) | |
| ## Input/Output Specification | |
| ### Inputs | |
| **Required:** | |
| - `prompt` (`str`): The prompt to guide video generation. | |
| - `history_sizes` (`list`): Sizes of long/mid/short history buffers for temporal context. | |
| - `sigmas` (`list`): Custom sigmas for the denoising process. | |
| **Optional:** | |
| - `negative_prompt` (`str`): The prompt describing what the generated video should avoid. | |
| - `max_sequence_length` (`int`), default: `512`: Maximum sequence length for prompt encoding. | |
| - `video` (`Any`): Input video for video-to-video generation. | |
| - `height` (`int`), default: `384`: The height in pixels of the generated video. | |
| - `width` (`int`), default: `640`: The width in pixels of the generated video. | |
| - `num_latent_frames_per_chunk` (`int`), default: `9`: Number of latent frames per temporal chunk. | |
| - `generator` (`Generator`): Torch generator for deterministic generation. | |
| - `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising. Can be a single image or list of images. | |
| - `num_videos_per_prompt` (`int`), default: `1`: Number of videos to generate per prompt. | |
| - `image_latents` (`Tensor`): Image latents used to guide the video generation. Can be generated by the `vae_encoder` step. | |
| - `video_latents` (`Tensor`): Encoded video latents for V2V generation. | |
| - `image_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for image latent noise. | |
| - `image_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for image latent noise. | |
| - `video_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for video latent noise. | |
| - `video_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for video latent noise. | |
| - `num_frames` (`int`), default: `132`: Total number of video frames to generate. | |
| - `keep_first_frame` (`bool`), default: `True`: Whether to keep the first frame as a prefix in history. | |
| - `num_inference_steps` (`int`), default: `50`: The number of denoising steps. | |
| - `latents` (`Tensor`): Pre-generated noisy latents for video generation. | |
| - `timesteps` (`Tensor`): Timesteps for the denoising process. | |
| - `None` (`Any`): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`. | |
| - `attention_kwargs` (`dict`): Additional kwargs for attention processors. | |
| - `fake_image_latents` (`Tensor`): Fake image latents used as history seed for I2V generation. | |
| - `output_type` (`str`), default: `np`: Output format, one of `pil`, `np`, or `pt`. | |
| ### Outputs | |
| - `videos` (`list`): The generated videos. | |
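The `num_frames` and `num_latent_frames_per_chunk` defaults interact with the decode step's trimming. The sketch below works through that arithmetic and is not the pipeline's actual code. It assumes a chunk of `L` latent frames decodes to `(L - 1) * temporal_factor + 1` pixel frames. The value `temporal_factor=4` and the helper name `plan_chunks` are both hypothetical and are not read from this checkpoint.

```python
def plan_chunks(num_frames: int = 132,
                num_latent_frames_per_chunk: int = 9,
                temporal_factor: int = 4) -> dict:
    """Estimate chunk count and trimming for chunk-based video generation.

    Assumption: each chunk of L latent frames decodes to
    (L - 1) * temporal_factor + 1 pixel frames.
    """
    frames_per_chunk = (num_latent_frames_per_chunk - 1) * temporal_factor + 1
    # Integer ceiling division: enough chunks to cover num_frames, at least one.
    num_chunks = max(1, (num_frames + frames_per_chunk - 1) // frames_per_chunk)
    decoded_frames = num_chunks * frames_per_chunk
    return {
        "frames_per_chunk": frames_per_chunk,
        "num_chunks": num_chunks,
        "decoded_frames": decoded_frames,
        # Frames dropped when trimming to the requested length.
        "trimmed_frames": decoded_frames - num_frames,
    }


if __name__ == "__main__":
    print(plan_chunks())                # documented defaults
    print(plan_chunks(num_frames=100))  # a length that needs trimming
```

Under these assumptions the documented defaults give 4 chunks of 33 frames with nothing trimmed, while a request for 100 frames would still decode 4 chunks and drop the last 32 frames.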