---
license: mit
datasets:
- uwunish/ghibli-dataset
language:
- en
base_model:
- stabilityai/stable-diffusion-2-1-base
pipeline_tag: text-to-image
library_name: diffusers
tags:
- ghibli
- text2image
- finetune
- sd-2.1
---

# Ghibli Fine-Tuned Stable Diffusion 2.1

## Dataset

The training dataset is available at: https://huggingface.co/datasets/uwunish/ghibli-dataset

## Hyperparameters

The fine-tuning process used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| `learning_rate` | 1e-05 |
| `num_train_epochs` | 40 |
| `train_batch_size` | 2 |
| `gradient_accumulation_steps` | 2 |
| `mixed_precision` | "fp16" |
| `resolution` | 512 |
| `max_grad_norm` | 1 |
| `lr_scheduler` | "constant" |
| `lr_warmup_steps` | 0 |
| `checkpoints_total_limit` | 1 |
| `use_ema` | True |
| `use_8bit_adam` | True |
| `center_crop` | True |
| `random_flip` | True |
| `gradient_checkpointing` | True |

These settings balance training efficiency and model quality, relying on mixed-precision training, 8-bit Adam, and gradient checkpointing to keep memory usage low.

## Metrics

The fine-tuning process reached a final training loss of **0.0345**, indicating good convergence on the Ghibli-style dataset.

## Usage

### Step 1: Import Required Libraries

Begin by importing the libraries needed for the image generation pipeline.

```python
import torch
from PIL import Image
import numpy as np
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from tqdm import tqdm
```

### Step 2: Configure the Model

Set up the device and data type, then load the components of the Ghibli fine-tuned Stable Diffusion model.

```python
# Configure device and data type
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Model path
model_name = "danhtran2mind/ghibli-fine-tuned-sd-2.1"

# Load model components
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", torch_dtype=dtype).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", torch_dtype=dtype).to(device)
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", torch_dtype=dtype).to(device)
scheduler = PNDMScheduler.from_pretrained(model_name, subfolder="scheduler")
```
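If you only need end-to-end inference and do not want to drive the components manually, the same checkpoint can usually be loaded through `StableDiffusionPipeline`. The snippet below is a minimal sketch that assumes the repository ships a standard pipeline config (`model_index.json`) alongside the subfolders loaded above; the manual loop in Step 3 below remains the reference path.

```python
# Optional shortcut: load the full pipeline in one call.
# Assumes the repository contains a complete pipeline layout (model_index.json).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "danhtran2mind/ghibli-fine-tuned-sd-2.1",
    torch_dtype=dtype,
).to(device)

image = pipe(
    "a serene landscape in Ghibli style",
    num_inference_steps=50,
    guidance_scale=3.5,
    generator=torch.Generator(device=device).manual_seed(42),
).images[0]
image.save("ghibli_landscape_pipeline.png")
```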
### Step 3: Define the Image Generation Function

Use the following function to generate Ghibli-style images based on your text prompts.

```python
def generate_image(prompt, height=512, width=512, num_inference_steps=50, guidance_scale=3.5, seed=42):
    """Generate a Ghibli-style image from a text prompt."""
    # Set random seed for reproducibility
    generator = torch.Generator(device=device).manual_seed(int(seed))

    # Tokenize and encode the prompt
    text_input = tokenizer(
        [prompt],
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt"
    )
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to(device))[0].to(dtype=dtype)

    # Encode an empty prompt for classifier-free guidance
    uncond_input = tokenizer(
        [""],
        padding="max_length",
        max_length=text_input.input_ids.shape[-1],
        return_tensors="pt"
    )
    with torch.no_grad():
        uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0].to(dtype=dtype)
    text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # Initialize latent representations
    latents = torch.randn(
        (1, unet.config.in_channels, height // 8, width // 8),
        generator=generator,
        dtype=dtype,
        device=device
    )

    # Configure scheduler timesteps
    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.init_noise_sigma

    # Denoising loop
    for t in tqdm(scheduler.timesteps, desc="Generating image"):
        latent_model_input = torch.cat([latents] * 2)
        latent_model_input = scheduler.scale_model_input(latent_model_input, t)

        with torch.no_grad():
            if device.type == "cuda":
                with torch.autocast(device_type="cuda", dtype=torch.float16):
                    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            else:
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

        # Apply classifier-free guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Decode latents to image
    with torch.no_grad():
        latents = latents / vae.config.scaling_factor
        image = vae.decode(latents).sample

    # Convert to PIL Image
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    image = (image * 255).round().astype("uint8")
    return Image.fromarray(image[0])
```

### Step 4: Generate Your Image

Craft a vivid prompt and generate your Ghibli-style masterpiece.

```python
# Example prompt
prompt = "a serene landscape in Ghibli style"

# Generate the image
image = generate_image(
    prompt=prompt,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=3.5,
    seed=42
)

# Display or save the image
image.show()  # Or image.save("ghibli_landscape.png")
```

## Environment

The project was developed and tested in the following environment:

- **Python Version**: 3.11.11
- **Dependencies**:

| Library | Version |
| --- | --- |
| huggingface-hub | 0.30.2 |
| accelerate | 1.3.0 |
| bitsandbytes | 0.45.5 |
| torch | 2.5.1 |
| Pillow | 11.1.0 |
| numpy | 1.26.4 |
| transformers | 4.51.1 |
| torchvision | 0.20.1 |
| diffusers | 0.33.1 |
| gradio | Latest |

Ensure your environment matches these specifications to avoid compatibility issues.
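To confirm that your installed versions match the table above, a small helper like the one below can print them; this is an optional sketch and not part of the original project code.

```python
# Optional: print installed versions to compare against the pinned dependency table.
from importlib.metadata import version, PackageNotFoundError

packages = [
    "huggingface-hub", "accelerate", "bitsandbytes", "torch",
    "Pillow", "numpy", "transformers", "torchvision", "diffusers",
]

for pkg in packages:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```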