TARA Model
TARA (Time-Aware Retrieval Adaptation) is a multimodal model for video and text understanding.
Installation & Setup
1. Install Git LFS (if not already installed)
Git LFS is required to download the model weights.
Ubuntu/Debian:
sudo apt-get install git-lfs
git lfs install
macOS:
brew install git-lfs
git lfs install
Check the installation:
git lfs version
2. Clone the Repository
git clone https://huggingface.co/bpiyush/TARA
cd TARA
This will download all model weights, which may take a few minutes depending on your connection.
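If you prefer to fetch the weights from Python instead of git, here is a minimal sketch using huggingface_hub (the local_dir name below is just an example; this route does not go through git, so Git LFS is not required):
from huggingface_hub import snapshot_download

# Download the full repository, weights included, into ./TARA
snapshot_download(repo_id="bpiyush/TARA", local_dir="TARA")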
3. Install Dependencies
- Create/activate the conda env (skip if you already have it):
conda create -n tara python=3.10 -y
conda activate tara
- Install CUDA 12.1 PyTorch wheels (adjust the index URL if you need a different CUDA/CPU build):
pip install --index-url https://download.pytorch.org/whl/cu121 \
    torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
- Install the remaining model dependencies:
pip install -r requirements.txt
- (Optional) Verify the install:
python -c "import torch, transformers; print(torch.cuda.is_available(), transformers.__version__)"
Quick Start
See the script at demo_usage.py for a quick start. You can run it:
python demo_usage.py
or use the snippet below:
import torch
from modeling_tara import TARA, read_frames_decord
model = TARA.from_pretrained(
".", # Load from current directory
device_map='auto',
torch_dtype=torch.bfloat16,
)
n_params = sum(p.numel() for p in model.model.parameters())
print(f"Number of parameters: {round(n_params/1e9, 3)}B")
# Embed a video
video_path = "./assets/folding_paper.mp4"
video_tensor = read_frames_decord(video_path, num_frames=16)  # read 16 frames with decord
video_tensor = video_tensor.unsqueeze(0)  # add a batch dimension: [1, T, C, H, W]
video_tensor = video_tensor.to(model.model.device)
with torch.no_grad():
video_emb = model.encode_vision(video_tensor).cpu().squeeze(0).float()
print(f"Video shape: {video_tensor.shape}") # torch.Size([1, 16, 3, 240, 426])
print(f"Video embedding shape: {video_emb.shape}") # torch.Size([4096])
# Embed texts
text = ['someone is folding a paper', 'cutting a paper', 'someone is folding a paper']
with torch.no_grad():
text_emb = model.encode_text(text).cpu().float()
print(f"Text embedding shape: {text_emb.shape}") # torch.Size([3, 4096])
Citation
If you use this model, please cite:
@misc{tara2024,
title={TARA: Time-Aware Retrieval Adaptation},
author={Your Name},
year={2024}
}
License
Apache 2.0