HPML-EgoQA-Baseline

This is a finetuned version of LLaVA-OneVision-Qwen2-7B-OV for egocentric vision-language tasks.

Table of Contents

  1. Model Summary
  2. Use
  3. Limitations
  4. Training
  5. License
  6. Citation

Model Summary

This model is LLaVA-OneVision-Qwen2-7B-OV finetuned on EgoIT-99K and Ego4D data for egocentric video question answering. The base model is a 7B-parameter multimodal model built on the Qwen2 language model with a 32K-token context window, capable of understanding single images, multiple images, and videos.

  • Base Model: lmms-lab/llava-onevision-qwen2-7b-ov
  • Finetuning Dataset: EgoIT-99K and Ego4D
  • Languages: English
  • Project: HPML (High-Performance Machine Learning) Project
  • Team Members: Sunidhi Tandel, Rahil, and team
  • Institution: HPML Project

Use

Intended use

This model is finetuned on the EgoIT-99K and Ego4D datasets for egocentric vision-language understanding, particularly video question answering from a first-person perspective. It inherits the base model's ability to interact with single images, multiple images, and videos, with enhanced capability for egocentric video understanding.

Feel free to share your generations in the Community tab!

Generation

We provide a simple generation example below. For more details, refer to the LLaVA-NeXT GitHub repository.

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch

import warnings

warnings.filterwarnings("ignore")
pretrained = "sunidhitandel/hpml-egoqa-baseline"  # Finetuned model
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Pass any additional llava_model_args here if needed

model.eval()

# Example image from the LLaVA repository; replace with your own (e.g., an egocentric frame).
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]


cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
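
Since the finetuned model targets egocentric video question answering, video prompts follow the same pattern, with uniformly sampled frames passed as a video modality as in the base model card's video example. The snippet below is a minimal sketch that assumes the decord video reader is installed and reuses the tokenizer, model, image_processor, conv_template, and device set up above; the clip path egocentric_clip.mp4, the frame count, and the question are placeholders.

from decord import VideoReader, cpu
import numpy as np

def load_video(video_path, max_frames_num):
    """Uniformly sample max_frames_num frames from the clip as a (N, H, W, 3) array."""
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()

video_frames = load_video("egocentric_clip.mp4", 16)  # placeholder path and frame count
frames_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
frames_tensor = frames_tensor.to(dtype=torch.float16, device=device)
image_sizes = [(video_frames.shape[2], video_frames.shape[1])] * video_frames.shape[0]  # (width, height) per frame

question = DEFAULT_IMAGE_TOKEN + "\nWhat is the camera wearer doing in this video?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=[frames_tensor],   # one stacked-frame tensor per video
    image_sizes=image_sizes,
    modalities=["video"],     # treat the frames as a single video rather than separate images
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True))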

Training

Base Model

This model is finetuned from LLaVA-OneVision-Qwen2-7B-OV, whose architecture and training recipe are:

  • Architecture: SO400M + Qwen2
  • Pretraining Stage: LCS-558K, 1 epoch, projector
  • Mid Stage: A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
  • Final-Image Stage: A mixture of 3.6M single-image data, 1 epoch, full model
  • OneVision Stage: A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model

Finetuning

  • Base Model: lmms-lab/llava-onevision-qwen2-7b-ov
  • Finetuning Dataset: EgoIT-99K and Ego4D (egocentric video QA data)
  • Task: Egocentric video question answering
  • Precision: bfloat16
  • Method: Full fine-tuning / LoRA, depending on configuration (an illustrative LoRA sketch follows this list)
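
The list above only names the method and precision. As a rough illustration of the LoRA option, a PEFT-style adapter setup over the Qwen2 attention projections could look like the sketch below. It is illustrative only: the rank, alpha, dropout, and target modules are assumptions, not the values used to produce this checkpoint, and it reuses the model object loaded in the Generation section.

# Illustrative LoRA setup (assumed hyperparameters, not this checkpoint's actual config).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # adapter rank (assumption)
    lora_alpha=32,             # scaling factor (assumption)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen2 attention projections
)

model = get_peft_model(model, lora_config)  # wraps the LLaVA model; base weights stay frozen
model.print_trainable_parameters()          # confirm only the LoRA adapters are trainable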

Hardware & Software

Citation

If you use this finetuned model, please cite both the base model and this work:

@article{li2024llavaonevision,
      title={LLaVA-OneVision: Easy Visual Task Transfer},
      author={Li, Bo and others},
      journal={arXiv preprint arXiv:2408.03326},
      year={2024}
}

@misc{hpml-egoqa-baseline,
      title={HPML-EgoQA-Baseline: Finetuned LLaVA-OneVision for Egocentric Video QA},
      author={Tandel, Sunidhi and Rahil and HPML Project Team},
      year={2024},
      howpublished={\url{https://huggingface.co/sunidhitandel/hpml-egoqa-baseline}},
      note={HPML Project - High-Performance Machine Learning for Egocentric Vision}
}

Acknowledgments

This work is part of the HPML (High-Performance Machine Learning) Project. We thank the LLaVA-OneVision team for providing the base model and the EgoIT-99K dataset contributors.

Team Members:

  • Sunidhi Tandel
  • Rahil
  • HPML Project Team