🎬 Marvel Video Retrieval — LanguageBind + QLoRA

A text-to-video retrieval system built on LanguageBind and fine-tuned with QLoRA (4-bit) on a custom Marvel video dataset. Given a natural-language query, the model ranks the clips in a pre-indexed corpus by semantic similarity and returns the best matches.

📌 Model Overview

Property	Value
Base Model	LanguageBind/LanguageBind_Video_FT
Architecture	CLIP-style dual encoder (video + text)
Fine-tuning	QLoRA (NF4 4-bit quantization + LoRA adapters)
Total Parameters	528.6M (base)
Trainable Params	LoRA adapters only
Embedding Dimension	768
Best Epoch	5
Training Platform	Kaggle (T4 × 2 GPUs)

⚙️ Training Details

🔹 LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
)
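
For a sense of scale, the adapter size implied by this configuration can be estimated by hand. The sketch below assumes the standard LoRA formulation, in which each adapted weight of shape (d_out, d_in) gains two low-rank factors and the update is scaled by lora_alpha / r. The hidden size of 1024 is a placeholder chosen for illustration, not a value taken from this model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # Factor A has shape (r, d_in) and factor B has shape (d_out, r).
    return r * (d_in + d_out)


r, lora_alpha = 8, 32
scaling = lora_alpha / r  # multiplier applied to the low-rank update B @ A

hidden = 1024  # hypothetical hidden size, for illustration only
# Four square projections per attention block: q_proj, k_proj, v_proj, out_proj.
per_block = 4 * lora_params(hidden, hidden, r)

print(f"scaling = {scaling}, adapter params per attention block = {per_block}")
```

Under these assumptions the adapters add tens of thousands of parameters per attention block, which is why only a small fraction of the 528.6M base parameters is trained.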

🔹 Training Hyperparameters

lr = 2e-4
weight_decay = 0.01
batch_size = 4              # per device
accum_steps = 8             # effective batch size = 32
num_epochs = 10
patience = 4                # early stopping
warmup_steps = len(train_loader)  # 1 epoch warmup
scheduler = CosineAnnealingLR
quantization = "nf4_4bit"
max_text_length = 77
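
The listing above names a one-epoch warmup and a cosine schedule but not how the two are combined. The sketch below shows one common arrangement, linear warmup followed by cosine decay to zero, together with the effective batch size arithmetic. The function `lr_at_step` and the value of `steps_per_epoch` are hypothetical and serve only as illustration.

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


batch_size, accum_steps = 4, 8
effective_batch = batch_size * accum_steps  # samples per optimizer update, per device

steps_per_epoch = 100  # hypothetical; the real value is len(train_loader)
num_epochs = 10
total_steps = steps_per_epoch * num_epochs

for step in (0, steps_per_epoch - 1, total_steps // 2, total_steps):
    print(step, lr_at_step(step, 2e-4, steps_per_epoch, total_steps))
```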

🌡️ Learnable Temperature

A learnable log-temperature parameter was trained jointly with the model using contrastive (InfoNCE) loss.

Final temperature: ~14.38
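
To make the role of the temperature concrete, here is a minimal NumPy sketch of a contrastive loss with a log-temperature parameter. It assumes the symmetric, CLIP-style form of InfoNCE and assumes that the reported value (~14.38) acts as a multiplicative logit scale, exp(log_temperature). Neither detail is confirmed by this card, and the function names are illustrative.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce(text_emb: np.ndarray, video_emb: np.ndarray, log_temperature: float) -> float:
    """Symmetric InfoNCE over a batch where row i of each array is a matched pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = np.exp(log_temperature) * (t @ v.T)  # scaled cosine similarities
    diag = np.arange(len(t))
    loss_t2v = -log_softmax(logits, axis=1)[diag, diag].mean()  # text -> video
    loss_v2t = -log_softmax(logits, axis=0)[diag, diag].mean()  # video -> text
    return float(0.5 * (loss_t2v + loss_v2t))


# Synthetic embeddings: each text vector is a noisy copy of its video vector.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 768))
text = video + 0.1 * rng.normal(size=(4, 768))

loss_matched = info_nce(text, video, np.log(14.38))
loss_shuffled = info_nce(text[::-1], video, np.log(14.38))  # every pair mismatched
print(loss_matched, loss_shuffled)
```

In a training loop the log-temperature would be a parameter updated by the optimizer alongside the adapters; the NumPy version shows only the forward computation.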

📦 Repository Structure

sejal1411/video_retrieval-v2/
│
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # LoRA weights
│
├── training_checkpoint/
│   ├── training_state.pt        # Full training state (optimizer, scheduler, temp)
│   └── training_summary.json    # Metrics + hyperparameters
│
├── video_embeds.npy             # Video embeddings (876, 768)
├── text_embeds.npy              # Text embeddings (876, 768)
├── metadata.csv                 # Video paths + metadata
│
├── training_curve.png
├── umap_visualization.png
├── similarity_distribution.png
└── similarity_heatmap.png

🚀 Inference

from languagebind import LanguageBindVideo, LanguageBindVideoProcessor
from peft import PeftModel
import torch, json, numpy as np
import torch.nn.functional as F

# Load base model + LoRA adapter
base = LanguageBindVideo.from_pretrained("LanguageBind/LanguageBind_Video_FT")
model = PeftModel.from_pretrained(base, "sejal1411/video_retrieval-v2")
processor = LanguageBindVideoProcessor.from_pretrained("sejal1411/video_retrieval-v2")

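# Load the learned temperature. It is not needed for ranking, because
# multiplying all scores by a positive scale leaves their order unchanged.
with open("training_checkpoint/training_summary.json") as f:
    cfg = json.load(f)
log_temp = torch.tensor(cfg["log_temperature"])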

model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode query
query = "Daredevil fighting in a dark hallway"
processed = processor("dummy_video.mp4", [query], context_length=77, return_tensors="pt")

with torch.no_grad():
    output = model(**{k: v.to(device) for k, v in processed.items()})
    query_embed = F.normalize(output.text_embeds, dim=-1).cpu().numpy()

# Retrieve top-k videos
video_embeds = np.load("video_embeds.npy")
scores = (query_embed @ video_embeds.T).squeeze()
top_k = np.argsort(scores)[::-1][:5]

print("Top-5 video indices:", top_k)
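
The same similarity ranking can be used to score retrieval quality. The sketch below computes Recall@K under the assumption that row i of the text embeddings and row i of the video embeddings describe the same clip. The card does not state this alignment, and `recall_at_k` is a hypothetical helper. Synthetic arrays stand in for the saved `.npy` files so the example runs on its own.

```python
import numpy as np


def recall_at_k(text_embeds: np.ndarray, video_embeds: np.ndarray, k: int) -> float:
    """Fraction of queries whose paired video (same row index) ranks in the top k."""
    scores = text_embeds @ video_embeds.T          # (num_queries, num_videos)
    top_k = np.argsort(-scores, axis=1)[:, :k]     # best k video indices per query
    targets = np.arange(len(text_embeds))[:, None]
    return float((top_k == targets).any(axis=1).mean())


# Synthetic stand-ins for video_embeds.npy and text_embeds.npy.
rng = np.random.default_rng(42)
video = rng.normal(size=(20, 768))
video /= np.linalg.norm(video, axis=1, keepdims=True)
text = video + 0.01 * rng.normal(size=video.shape)
text /= np.linalg.norm(text, axis=1, keepdims=True)

r1 = recall_at_k(text, video, k=1)
r5 = recall_at_k(text, video, k=5)
print(f"R@1 = {r1:.2f}, R@5 = {r5:.2f}")
```

With the real files, the two arrays would come from `np.load("text_embeds.npy")` and `np.load("video_embeds.npy")`. Since the saved arrays cover all 876 samples, held-out figures would need the test-split rows only.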

♻️ Resume Training

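import torch

# Assumes `optimizer` and `scheduler` have already been rebuilt with the
# same settings as the original run before their state is restored.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt = torch.load("training_checkpoint/training_state.pt", map_location=device)

# Restore the learnable temperature and the optimizer/scheduler state
log_temperature = torch.nn.Parameter(ckpt["log_temperature"].to(device))
optimizer.load_state_dict(ckpt["optimizer_state"])
scheduler.load_state_dict(ckpt["scheduler_state"])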

start_epoch = ckpt["epoch"]
best_val_loss = ckpt["best_val_loss"]
train_history = ckpt["train_history"]
val_history = ckpt["val_history"]

🧰 Dataset

Custom Marvel video dataset with keyword-style text-video pairs.

Split	Ratio
Train	70%
Validation	15%
Test	~15%
Total Samples	876

Random seed: 42
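
The "~15%" for the test split follows from integer rounding: 876 samples do not divide evenly into 70/15/15. The sketch below shows one way such a split could be produced. The helper `split_indices` and the use of Python's `random` module are assumptions, so the resulting index assignment will not necessarily match the one used for training.

```python
import random


def split_indices(n: int, train_frac: float = 0.70, val_frac: float = 0.15, seed: int = 42):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, slightly more than 15%
    return train, val, test


train_idx, val_idx, test_idx = split_indices(876)
print(len(train_idx), len(val_idx), len(test_idx))  # 613 131 132
```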

📚 Dependencies

transformers==4.35.0
tokenizers==0.14.0
peft==0.10.0
bitsandbytes
decord
torchvision==0.22.1
torchaudio==2.7.1
pytorchvideo==0.1.5
accelerate==0.21.0

📄 Citation

If you use this model, please cite:

@article{zhu2023languagebind,
  title   = {LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author  = {Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and others},
  journal = {arXiv preprint arXiv:2310.01852},
  year    = {2023}
}

🙋 Author

Sejal (@sejal1411)
