
TARA Model

TARA (Time-Aware Retrieval Adaptation) is a multimodal model for video and text understanding.

Installation & Setup

1. Install Git LFS (if not already installed)

Git LFS is required to download the model weights.

Ubuntu/Debian:

sudo apt-get install git-lfs
git lfs install

macOS:

brew install git-lfs
git lfs install

Verify the installation:

git lfs version

2. Clone the Repository

git clone https://huggingface.co/bpiyush/TARA
cd TARA

This will download all model weights (may take a few minutes depending on your connection).
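
If you prefer to clone the repository first and fetch the large weight files separately, Git LFS supports deferring the download (a standard Git LFS workflow, not specific to this repo):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bpiyush/TARA
cd TARA
git lfs pull  # download the model weights when you are ready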

3. Install Dependencies

  • Create/activate the conda env (skip if you already have it):
    conda create -n tara python=3.10 -y
    conda activate tara
    
  • Install CUDA 12.1 PyTorch wheels (adjust the index URL if you need a different CUDA/CPU build):
    pip install --index-url https://download.pytorch.org/whl/cu121 \
      torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
    
  • Install the remaining model dependencies:
    pip install -r requirements.txt
    
  • (Optional) Verify the install:
    python -c "import torch, transformers; print(torch.cuda.is_available(), transformers.__version__)"
    
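The quick-start code below imports read_frames_decord, which suggests the decord video reader is used for frame loading. If video decoding fails and decord is not already pulled in by requirements.txt, installing it manually may help (an assumption based on the function name, not a documented requirement):

pip install decord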

Quick Start

See the script demo_usage.py for a complete quick-start example. Run it with:

python demo_usage.py

Or use the snippet below:

import torch
from modeling_tara import TARA, read_frames_decord

model = TARA.from_pretrained(
    ".",  # Load from current directory
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
# Report the parameter count of the underlying model
n_params = sum(p.numel() for p in model.model.parameters())
print(f"Number of parameters: {round(n_params/1e9, 3)}B")

# Embed a video
video_path = "./assets/folding_paper.mp4"
video_tensor = read_frames_decord(video_path, num_frames=16)
video_tensor = video_tensor.unsqueeze(0)
video_tensor = video_tensor.to(model.model.device)
with torch.no_grad():
    video_emb = model.encode_vision(video_tensor).cpu().squeeze(0).float()
print(f"Video shape: {video_tensor.shape}")  # torch.Size([1, 16, 3, 240, 426])
print(f"Video embedding shape: {video_emb.shape}")  # torch.Size([4096])

# Embed a batch of texts
text = ['someone is folding a paper', 'cutting a paper', 'someone is folding a paper']
with torch.no_grad():
    text_emb = model.encode_text(text).cpu().float()
print(f"Text embedding shape: {text_emb.shape}")  # torch.Size([3, 4096])

Citation

If you use this model, please cite:

@misc{tara2024,
  title={TARA: Time-Aware Retrieval Adaptation},
  author={Your Name},
  year={2024}
}

License

Apache 2.0