---
base_model:
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
datasets:
- lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language: en
---
(a) The vocabulary coverage proportion in the LLaVA-OneVision-1.5 Mid-Training dataset before and after concept balancing. (b) Distribution of data sources within the LLaVA-OneVision-1.5 Mid-Training dataset. (c) Distribution of data sources within the LLaVA-OneVision-1.5 Instruct dataset.
| Description | Link | Status |
|-------------|------|--------|
| LLaVA-OV-1.5-Mid-Training-85M | [🤗HF / Mid-Training 85M](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
| LLaVA-OV-1.5-Instruct | [🤗HF / Instruct-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |

## Evaluation Results

All evaluations were conducted using `lmms_eval`.

## Quick Start with HuggingFace

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

# Default: load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Evaluation

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
    --batch_size=1
```

## Quick Start Guide

### 1. 🐳 Docker (Recommended)

We strongly recommend using the Docker environment for a seamless experience. The following instructions are tailored for an A100 80GB GPU environment.

```bash
# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
cd LLaVA-OneVision-1.5

docker build -t llava_megatron:25.04 .

# Run container with -w to set the working directory directly to the mounted volume
docker run -it --gpus all \
    --ipc host --net host --privileged --cap-add IPC_LOCK \
    --ulimit memlock=-1 --ulimit stack=67108864 --rm \
    -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
    -w /workspace/LLaVA-OneVision-1.5 \
    --name "llava_megatron_container" \
    llava_megatron:25.04 /bin/bash
```

### 2. Checkpoint and Format Conversion

You have two options to get started with LLaVA-OneVision-1.5-stage-0:

#### Option 1: Download the pre-trained model from HuggingFace

Download our `LLaVA-OneVision-1.5-4B-stage0` model directly from [HuggingFace](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-stage0).
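If you prefer to script the download, here is a minimal sketch using `huggingface_hub` (assuming it is installed in the container); the local directory name is chosen only so that it matches the conversion commands below:

```python
# Illustrative only: fetch the stage-0 checkpoint with huggingface_hub
# (pip install -U huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmms-lab/LLaVA-OneVision-1.5-4B-stage0",
    local_dir="LLaVA-OneVision-1.5-4B-stage0",  # directory used by later steps
)
```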
#### Option 2: Merge the initial weights yourself

Alternatively, you can merge the initial weights from the original ViT and LLM:

```bash
python ds/merge_model.py \
    --vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
    --llm_path Qwen/Qwen3-4B-Instruct-2507 \
    --output LLaVA-OneVision-1.5-4B-stage0
```

Note: when merging weights, the adapter component is initialized with default values.

Convert the model from HuggingFace format to Megatron (mcore) format:

```bash
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 bash examples/llava_ov_1_5/convert/convert_4b_hf_to_mcore.sh \
    LLaVA-OneVision-1.5-4B-stage0 \
    LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
    1 1
```

### 3. Stage 1 Alignment Training

Download the alignment data from [LLaVA-558K-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-558K-Webdataset).

```bash
# ============================================================
# Required environment variables:
#   AIAK_TRAINING_PATH  Root directory of the AIAK-Training-LLM project
#   DATA_PATH           Directory with WebDataset shards (.tar) for pretraining
#   TOKENIZER_PATH      Hugging Face tokenizer directory
#   CHECKPOINT_PATH     Megatron-formatted checkpoint directory (e.g., mcore TP1/PP1)
#   SAVE_CKPT_PATH      Output directory for saving training checkpoints
# ============================================================
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-558K-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh
```

### 4. Stage 1.5 Mid-Training

Download our lightweight packed subset from [LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Mid-Training-Webdataset-Quick-Start-3M).

```bash
# ============================================================
# Convert the Stage 1 checkpoint to release format
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
    stage_1_alignment_llava_ov_4b/iter_0002500/ \
    stage_1_alignment_llava_ov_4b_release 1 1

# ============================================================
# Launch mid-training
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=stage_1_alignment_llava_ov_4b_release \
bash examples/llava_ov_1_5/quick_start/stage_1.5_mid_training_llava_ov_4b.sh
```

### 5. Stage 2 Instruct Training

Download the LLaVA-NeXT-780K webdataset from [LLaVA-NeXT-780K Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-780k-webdataset).

```bash
# ============================================================
# Convert the Stage 1.5 checkpoint to release format
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
    stage_1.5_mid_training_llava_ov_4b/iter_0020000/ \
    stage_1.5_mid_training_llava_ov_4b_release 1 1

# ============================================================
# Launch instruct training
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
DATA_PATH=LLaVA-NeXT-780k-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=stage_1.5_mid_training_llava_ov_4b_release \
bash examples/llava_ov_1_5/quick_start/stage_2_instruct_llava_ov_4b.sh
```
### 6. Convert the mcore Checkpoint to HuggingFace Format

```bash
AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_hf.sh \
    stage_2_instruct_llava_ov_4b/iter_0003500 \
    LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct \
    1 1

# Copy non-model files (e.g., tokenizer config) to the new directory
find LLaVA-OneVision-1.5-4B-stage0/ -type f -not -iname '*safetensors*' -exec cp {} LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct/ ';'
```

### 7. Evaluation

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch \
    --num_processes=4 --main_process_port 12399 -m lmms_eval --model=llava_onevision1_5 --batch_size=1 --tasks=mme \
    --model_args=pretrained=/workspace/LLaVA-OneVision-1.5/LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct,max_pixels=3240000
```

## Fully Reproducing Guide

> [!TIP]
> More detailed reproduction steps for the complete process will be provided after the dataset upload is completed.

### Mid-Training

To improve training efficiency, we pack the mid-training samples offline:

1. Download the [**Mid-Training-85M Dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M).
2. Pack the data into webdataset format; refer to [**Examples: Offline Packing**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/tree/main/examples_offline_packing) and [**Offline Padding-Free Data Packing**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/tree/main/examples/llava_ov_1_5/sample_packing/README.md).

### Instruct

1. Download the [**LLaVA-OneVision-1.5-Insturct-Data**](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data).
2. Convert the data into webdataset format; refer to [**Conversion for Mixed Instruction Data**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/blob/main/docs/sft_data_preprocessing.md). A minimal shard-writing sketch is shown below.
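Both workflows ultimately write one record per sample (image bytes plus annotation JSON) into `.tar` shards. The sketch below illustrates only that shard-writing step; it assumes the `webdataset` Python package and an annotation layout invented for this example, so follow the linked packing scripts for the actual record format and the padding-free packing logic expected by the training code.

```python
# Minimal sketch (not the official packing script): write (image, annotation)
# pairs into webdataset shards. Paths, keys, and extensions are illustrative.
import json
import os
import webdataset as wds


def iter_samples(root):
    """Yield (sample_id, image_path, annotation_dict) from a local dump.

    Placeholder iterator: adapt it to however the downloaded dataset is
    actually laid out on disk.
    """
    with open(os.path.join(root, "annotations.jsonl")) as f:
        for line in f:
            ann = json.loads(line)
            yield ann["id"], os.path.join(root, ann["image"]), ann


def write_shards(root, out_pattern="shards/llava-ov-%06d.tar", max_per_shard=2000):
    os.makedirs(os.path.dirname(out_pattern), exist_ok=True)
    with wds.ShardWriter(out_pattern, maxcount=max_per_shard) as sink:
        for sample_id, image_path, ann in iter_samples(root):
            with open(image_path, "rb") as img:
                sink.write({
                    "__key__": str(sample_id),                 # unique key per record
                    "jpg": img.read(),                         # raw image bytes
                    "json": json.dumps(ann).encode("utf-8"),   # conversations / caption
                })


if __name__ == "__main__":
    write_shards("LLaVA-One-Vision-1.5-Mid-Training-85M")
```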
## Roadmaps

Q4 2025 key deliverables:

1. **Ultra-efficient MoE Training**
2. **Full Video Input LLM**

## Contributors

Thanks so much to all of our amazing contributors!

fdcp, anxiangsir, yiyexy, wideyard, chengzheng345, killTheHostage, mathCrazyy, yunglechao, RobitYadda