---
license: apache-2.0
datasets:
- xiaorui638/cc3m
- liuhaotian/LLaVA-Instruct-150K
- Xkev/LLaVA-CoT-100k
metrics:
- bleu
- accuracy
base_model:
- LiquidAI/LFM2-350M
---

# 🐍 **VIPER-L1: A Family of Small Multimodal-LLMs**
*Viper-L1 Logo (figure)*
“Fast. Compact. Vision-Language Intelligence.”
***Note:*** This model is still being improved, so we recommend fine-tuning it for your use case!

---

## 🌟 Overview

**Viper-L1** is an open-source **small multimodal large language model (Multimodal-LLM)** designed for efficient multimodal reasoning and deployment on consumer GPUs. It is built upon the [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) architecture (≈1.2B parameters in total), enabling a powerful yet lightweight foundation for **personal research, on-device applications, and internal experimentation**.

---

## 🧠 Key Features

* ⚡ **Efficient Training & Inference**
  Trained on **2× H100 GPUs** within **~2 days**, thanks to our lightweight multimodal fusion and liquid transformer design. Inference runs smoothly even on **RTX 4070** GPUs.

* 🔗 **Multimodal Connector (Sense Integration Module)**
  Inspired by human perception, Viper-L1 introduces a *connector* that fuses signals from different sensory encoders (vision, audio, etc.), enabling deeper **cross-modal alignment** and improved reasoning.

* 🧩 **Hybrid Architecture**
  Combines the **semantic strength of Transformers** with the **efficiency of Liquid Neural Networks**, resulting in a compact yet expressive multimodal model.

---

## 🚀 Progress

* ✅ **Released** – Viper-L1 model checkpoint
* 🧩 **Coming Soon** – Fully documented training and inference scripts
* 🧩 **Coming Soon** – Fully documented post-training recipes (LoRA, DPO, GRPO); a rough LoRA sketch is given below

Stay tuned for our next updates on model fine-tuning and multimodal reasoning enhancements.
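Until the post-training scripts are released, a minimal sketch of what LoRA fine-tuning could look like with the Hugging Face `peft` library is shown below. The checkpoint path, the `ViperLMForCausalLM` import (taken from the inference snippet further down), and the `target_modules` names are assumptions; inspect the model to confirm which projection layers exist before training.

```python
import torch
from peft import LoraConfig, get_peft_model

from model import ViperLMForCausalLM  # same local class used in the inference snippet below

# Load the released checkpoint (path is a placeholder).
model = ViperLMForCausalLM.from_pretrained(
    "path/to/viper-l1-checkpoint",
    torch_dtype=torch.bfloat16,
)

# Hypothetical LoRA setup: the target_modules below are a guess for the
# attention projections of the LFM2 backbone; verify the actual module
# names with `model.named_modules()` before training.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```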
---

## 🏗️ Architecture

The overall architecture is shown below:

*Viper-L1 Architecture (figure)*
**Main Components** (sketched in code below):

1. 🎨 **Vision Encoder** – Extracts compact visual embeddings
2. 🔗 **Multimodal Connector** – Fuses sensory inputs efficiently
3. 🧠 **Language Backbone (LFM2-350M-based)** – Performs semantic reasoning and response generation

> 🧪 *The current Viper-L1 (1.2B parameters) was trained on ~4 million images using 2× H100 GPUs for 2 days.*
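As a rough illustration of how these components fit together, the snippet below sketches a LLaVA-style connector and forward pass. This is a hypothetical sketch, not the actual Viper-L1 implementation: the class name, the MLP projector design, and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Hypothetical multimodal connector: an MLP that projects vision
    features into the language backbone's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(vision_features)


# Illustrative flow, assuming a `vision_encoder` (e.g. SigLIP) and a
# `language_backbone` (LFM2-based) are already instantiated:
#   vision_features = vision_encoder(pixel_values)          # patch embeddings
#   image_embeds    = connector(vision_features)            # aligned with text space
#   inputs_embeds   = merge_with_text(text_embeds, image_embeds)  # fill <image> slots
#   logits          = language_backbone(inputs_embeds=inputs_embeds)

if __name__ == "__main__":
    connector = Connector(vision_dim=768, text_dim=1024)  # dims are placeholders
    dummy_patches = torch.randn(1, 196, 768)
    print(connector(dummy_patches).shape)  # torch.Size([1, 196, 1024])
```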
---

## 📊 Benchmark Results

| Benchmark     | Task | Split | Metric   | Viper-L1 (CoT) |
|---------------|------|-------|----------|----------------|
| RealWorldQA   | VQA  | Test  | Accuracy | **33.73%**     |
| Other results | VQA  | Test  | Accuracy | Ongoing        |

**Notes.** CoT = Chain-of-Thought prompting enabled during inference. Exact settings (temperature/top-p/max tokens) can influence results; see the inference snippet below to replicate typical generation settings.

---

## 🧩 Usage

To get started with **inference**, follow the setup in the main repository:

🔗 [**Viper-VLM Repository**](https://github.com/huyquoctrinh/Viper-LM)

📜 Example inference script: [`infer_viper.sh`](https://github.com/huyquoctrinh/Viper-LM/blob/feat/viper-vlm_cot/infer_viper.sh)

Or you can use the functions below for inference:

```python
import os
import argparse
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor

from model import ViperLMForCausalLM  # your local class

IMAGE_TOKEN_ID = 64400


def build_messages(question: str, include_image: bool = True):
    # Mirror CCDataset._format_prompt()
    user_content = ("<image> " if include_image else "") + (question or "")
    return [
        {"role": "user", "content": user_content},
        # assistant turn is left empty; apply_chat_template(add_generation_prompt=True)
        # will add the assistant prefix
    ]


@torch.inference_mode()
def generate_answer(
    ckpt_dir: str,
    tokenizer_path: str,
    processor_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    # --- device / dtype ---
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    use_bf16 = (dtype.lower() == "bf16")
    use_fp16 = (dtype.lower() == "fp16")
    amp_dtype = torch.bfloat16 if use_bf16 else (torch.float16 if use_fp16 else torch.float32)

    # --- tokenizer / processor ---
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    # optional but common for generation with left context
    if not hasattr(tokenizer, "padding_side") or tokenizer.padding_side != "left":
        tokenizer.padding_side = "left"

    processor = AutoProcessor.from_pretrained(processor_path)

    # --- model ---
    model = ViperLMForCausalLM.from_pretrained(
        ckpt_dir,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()
    if getattr(model.config, "pad_token_id", None) is None:
        model.config.pad_token_id = tokenizer.pad_token_id

    # expose image token id if your forward expects it; keep it consistent with training
    image_token_id = getattr(model.config, "image_token_id", None)
    if image_token_id is None and "<image>" in tokenizer.get_vocab():
        image_token_id = tokenizer.convert_tokens_to_ids("<image>")

    # --- text input with the SAME chat template as training ---
    messages = build_messages(question=question, include_image=True)
    enc = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # adds the assistant header the model expects before generation
        tokenize=True,
        return_tensors="pt",
    )
    if isinstance(enc, torch.Tensor):
        input_ids = enc
        attention_mask = torch.ones_like(enc, dtype=torch.long)
    else:
        input_ids = enc["input_ids"]
        attention_mask = enc.get("attention_mask")
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.long)
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    # --- image preprocessing (match training) ---
    img = Image.open(image_path).convert("RGB")
    proc = processor(images=[img], return_tensors="pt")  # list, like training
    pixel_values = proc.get("pixel_values", None)
    if pixel_values is None:
        raise ValueError("Processor did not return 'pixel_values'. Check processor_path.")
    pixel_values = pixel_values.to(device)  # (1, 3, H, W)

    # --- generate ---
    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0.0,
        "temperature": max(temperature, 1e-6),
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.pad_token_id,
        # IMPORTANT: use the same argument name your model.forward saw in training;
        # rename "image_inputs" if your forward expects a different name.
        "image_inputs": pixel_values,
        "image_token_id": image_token_id,  # if your forward uses it
        "use_cache": False,
    }

    if device.type == "cuda" and (use_bf16 or use_fp16):
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            out = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                **gen_kwargs,
            )
    else:
        out = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **gen_kwargs,
        )

    # --- decode only new tokens ---
    generated = out[0]
    prompt_len = input_ids.size(1)
    new_tokens = generated[prompt_len:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return answer.strip()


if __name__ == "__main__":
    ckpt_dir = ""
    tokenizer_path = ""
    processor_path = ""
    image_path = ""
    question = ""
    device = ""

    ans = generate_answer(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        processor_path=processor_path,
        image_path=image_path,
        question=question,
        device=device,
        dtype="bf16",  # must be "bf16" or "fp16"; anything else falls back to float32
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.0,
    )
    print("\n ======Answer===== \n")
    print(ans)
```

---

## 🙏 Acknowledgements

We gratefully thank the following foundational projects for inspiring and enabling our research:

* [**Liquid Model**](https://huggingface.co/LiquidAI/LFM2-350M) – Base architecture for dynamic neural computation
* [**SigLIP**](https://huggingface.co/google/siglip2-base-patch16-naflex) – Vision encoder powering multimodal understanding

Their open-source contributions have made **Viper-L1** possible. 💚

---

## 📫 Contact

If you're interested in collaboration or research discussions:
👉 [**Contact us**](https://github.com/huyquoctrinh) or open an issue in the repository.

---