🎬 Marvel Video Retrieval: LanguageBind + QLoRA

A text-to-video retrieval system fine-tuned on a Marvel video dataset using LanguageBind + QLoRA (4-bit). Given a natural language query, the model retrieves the most semantically relevant video clip from a pre-indexed corpus.


📌 Model Overview

Property              Value
Base Model            LanguageBind/LanguageBind_Video_FT
Architecture          CLIP-style dual encoder (video + text)
Fine-tuning           QLoRA (NF4 4-bit quantization + LoRA adapters)
Total Parameters      528.6M (base)
Trainable Params      LoRA adapters only
Embedding Dimension   768
Best Epoch            5
Training Platform     Kaggle (T4 × 2 GPUs)

βš™οΈ Training Details

πŸ”Ή LoRA Configuration

LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
)
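
For a sense of scale, the configuration above can be turned into rough arithmetic. The sketch below counts the parameters a rank-8 adapter adds to one linear layer and computes the LoRA scaling factor lora_alpha / r. The layer width of 1024 is a hypothetical placeholder, not a value taken from this model, and `lora_added_params` is an illustrative helper, not part of PEFT.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


r, lora_alpha = 8, 32          # values from the LoraConfig above
scaling = lora_alpha / r       # factor applied to the low-rank update B @ A

hidden = 1024                  # hypothetical layer width, for illustration only
per_layer = lora_added_params(hidden, hidden, r)
full_layer = hidden * hidden   # weights in the frozen dense layer

print(f"LoRA scaling factor: {scaling}")
print(f"Adapter params per {hidden}x{hidden} layer: {per_layer}")
print(f"Fraction of the frozen layer: {per_layer / full_layer:.4%}")
```

With four target modules per attention block, the adapter count grows linearly with the number of blocks, while the frozen 4-bit base weights stay untouched.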

🔹 Training Hyperparameters

lr = 2e-4
weight_decay = 0.01
batch_size = 4              # per device
accum_steps = 8             # effective batch size = 32
num_epochs = 10
patience = 4                # early stopping
warmup_steps = len(train_loader)  # 1 epoch warmup
scheduler = CosineAnnealingLR
quantization = "nf4_4bit"
max_text_length = 77
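
The card does not say how the one-epoch warmup is combined with CosineAnnealingLR. One common arrangement is linear warmup followed by cosine decay to zero, sketched below as a plain function. `lr_at_step` and `steps_per_epoch = 100` are hypothetical names and values chosen for illustration; only `lr`, `batch_size`, `accum_steps`, and `num_epochs` come from the settings above.

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


base_lr = 2e-4
batch_size, accum_steps = 4, 8
effective_batch = batch_size * accum_steps   # 4 * 8 = 32, as stated above

steps_per_epoch = 100                        # hypothetical; depends on the loader
num_epochs = 10
warmup_steps = steps_per_epoch               # one epoch of warmup
total_steps = steps_per_epoch * num_epochs

for step in (0, warmup_steps - 1, total_steps // 2, total_steps - 1):
    print(step, lr_at_step(step, base_lr, warmup_steps, total_steps))
```

If the scheduler is stepped once per optimizer update instead of once per batch, gradient accumulation divides the number of scheduler steps per epoch by `accum_steps`, which changes how long the warmup lasts in practice.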

🌑️ Learnable Temperature

A learnable log-temperature parameter was trained jointly with the model using contrastive (InfoNCE) loss.

  • Final temperature: ~14.38
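
To illustrate how a log-temperature enters the loss, here is a minimal framework-free sketch of a symmetric InfoNCE objective. It assumes the reported value of ~14.38 is exp(log_temperature) used as a multiplicative scale on cosine similarities, as in CLIP; the card does not confirm this. The 3×3 similarity matrix and the `info_nce` helper are made up for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(sim, log_temperature):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the cosine similarity between text i and video j;
    matched pairs sit on the diagonal.
    """
    scale = math.exp(log_temperature)        # assumed to act as a logit scale
    n = len(sim)
    logits = [[scale * s for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    text_to_video = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    video_to_text = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (text_to_video + video_to_text)


# Toy similarities: the diagonal (matched pairs) is highest in every row.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]

loss_scaled = info_nce(sim, math.log(14.38))   # scale of about 14.38
loss_unscaled = info_nce(sim, 0.0)             # scale of 1.0
print(loss_scaled, loss_unscaled)
```

When matched pairs already score highest, a larger scale sharpens the softmax and lowers the loss, which is why the parameter is usually learned jointly instead of being fixed.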

📦 Repository Structure

sejal1411/video_retrieval-v2/
│
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights
│
├── training_checkpoint/
│   ├── training_state.pt        # Full training state (optimizer, scheduler, temp)
│   └── training_summary.json    # Metrics + hyperparameters
│
├── video_embeds.npy             # Video embeddings (876, 768)
├── text_embeds.npy              # Text embeddings (876, 768)
├── metadata.csv                 # Video paths + metadata
│
├── training_curve.png
├── umap_visualization.png
├── similarity_distribution.png
└── similarity_heatmap.png

🚀 Inference

import json

import numpy as np
import torch
import torch.nn.functional as F
from languagebind import LanguageBindVideo, LanguageBindVideoProcessor
from peft import PeftModel

# Load base model + LoRA adapter
base = LanguageBindVideo.from_pretrained("LanguageBind/LanguageBind_Video_FT")
model = PeftModel.from_pretrained(base, "sejal1411/video_retrieval-v2")
processor = LanguageBindVideoProcessor.from_pretrained("sejal1411/video_retrieval-v2")

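# Load the learned log-temperature. It only rescales similarity scores,
# so it does not change the ranking computed below.
with open("training_checkpoint/training_summary.json") as f:
    cfg = json.load(f)
log_temp = torch.tensor(cfg["log_temperature"])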

model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

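# Encode the query. The processor call also takes a video argument, so a
# placeholder clip is passed; only the text embedding is used below.
query = "Daredevil fighting in a dark hallway"
processed = processor("dummy_video.mp4", [query], context_length=77, return_tensors="pt")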

with torch.no_grad():
    output = model(**{k: v.to(device) for k, v in processed.items()})
    query_embed = F.normalize(output.text_embeds, dim=-1).cpu().numpy()

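# Retrieve top-k videos by dot product (equal to cosine similarity if the
# stored video embeddings are L2-normalized, which is assumed here)
video_embeds = np.load("video_embeds.npy")
scores = (query_embed @ video_embeds.T).squeeze()
top_k = np.argsort(scores)[::-1][:5]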

print("Top-5 video indices:", top_k)

♻️ Resume Training

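import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# map_location lets the checkpoint load on a machine without the original GPUs
ckpt = torch.load("training_checkpoint/training_state.pt", map_location=device)

# `optimizer` and `scheduler` are assumed to be constructed beforehand,
# the same way as in the original training run
log_temperature = torch.nn.Parameter(ckpt["log_temperature"].to(device))
optimizer.load_state_dict(ckpt["optimizer_state"])
scheduler.load_state_dict(ckpt["scheduler_state"])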

start_epoch = ckpt["epoch"]
best_val_loss = ckpt["best_val_loss"]
train_history = ckpt["train_history"]
val_history = ckpt["val_history"]

🧰 Dataset

Custom Marvel video dataset with keyword-style text-video pairs.

Split           Ratio
Train           70%
Validation      15%
Test            ~15%
Total Samples   876
  • Random seed: 42
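
The exact splitting procedure is not described here. As a minimal sketch, a seeded shuffle followed by index slicing would produce a reproducible 70/15/15 split of the 876 samples. `split_indices` is a hypothetical helper, and flooring the train and validation sizes is an assumption.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and slice into three splits."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n)   # floored
    n_val = int(val_frac * n)       # floored
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder goes to the test split
    return train, val, test


train_idx, val_idx, test_idx = split_indices(876)
print(len(train_idx), len(val_idx), len(test_idx))
```

Under these assumptions the split sizes are 613 / 131 / 132. The test split absorbs the rounding remainder, which would account for its share being listed as approximately 15%.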

📚 Dependencies

transformers==4.35.0
tokenizers==0.14.0
peft==0.10.0
bitsandbytes
decord
torchvision==0.22.1
torchaudio==2.7.1
pytorchvideo==0.1.5
accelerate==0.21.0

📄 Citation

If you use this model, please cite:

@article{zhu2023languagebind,
  title   = {LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author  = {Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and others},
  journal = {arXiv preprint arXiv:2310.01852},
  year    = {2023}
}

🙋 Author

Sejal (@sejal1411)

