LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Paper: arXiv:2310.01852
A text-to-video retrieval model built by fine-tuning LanguageBind with QLoRA (4-bit) on a Marvel video dataset. Given a natural-language query, it retrieves the most semantically relevant video clip from a pre-indexed corpus.
| Property | Value |
|---|---|
| Base Model | LanguageBind/LanguageBind_Video_FT |
| Architecture | CLIP-style dual encoder (video + text) |
| Fine-tuning | QLoRA (NF4 4-bit quantization + LoRA adapters) |
| Total Parameters | 528.6M (base) |
| Trainable Params | LoRA adapters only |
| Embedding Dimension | 768 |
| Best Epoch | 5 |
| Training Platform | Kaggle (T4 × 2 GPUs) |
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
)
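To give a feel for what `r=8` and `lora_alpha=32` imply, the following minimal sketch computes the standard LoRA scaling factor (`lora_alpha / r`) and the number of parameters a single adapter adds to one linear projection. The 1024 × 1024 layer size is a hypothetical value chosen for illustration; it is not taken from the model's actual layer shapes, and the helper `lora_summary` is not part of `peft`.

```python
def lora_summary(d_in, d_out, r=8, lora_alpha=32):
    """Return (scaling, added_params, fraction_of_full) for one adapted linear layer."""
    # Standard LoRA scales the low-rank update B @ A by lora_alpha / r
    scaling = lora_alpha / r
    # A has shape (r, d_in) and B has shape (d_out, r); biases are not adapted
    added_params = r * (d_in + d_out)
    full_params = d_in * d_out
    return scaling, added_params, added_params / full_params


# Hypothetical 1024 x 1024 projection, for illustration only
scaling, added, fraction = lora_summary(1024, 1024)
print(scaling, added, fraction)  # 4.0 16384 0.015625
```

Under these assumptions each adapted projection trains about 1.6% of the weights it wraps, which is why only the adapters need to be stored in `adapter_model.safetensors`.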
lr = 2e-4
weight_decay = 0.01
batch_size = 4 # per device
accum_steps = 8 # effective batch size = 32
num_epochs = 10
patience = 4 # early stopping
warmup_steps = len(train_loader) # 1 epoch warmup
scheduler = CosineAnnealingLR
quantization = "nf4_4bit"
max_text_length = 77
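The settings above combine a one-epoch warmup with cosine annealing. How the two were chained in the original training script is not shown, so the sketch below is only one plausible reading: a linear ramp up to `lr` over `warmup_steps`, followed by a cosine decay to `min_lr`. The function name `lr_at`, the default step counts, and `min_lr` are assumptions made for illustration.

```python
import math


def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Learning rate at a given scheduler step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for s in (0, 99, 100, 550, 1000):
    print(s, lr_at(s))
```

In PyTorch the same shape can be obtained by wrapping a function like this in `torch.optim.lr_scheduler.LambdaLR`, returning the multiplier `lr_at(step) / base_lr`.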
A learnable log-temperature parameter was trained jointly with the model using contrastive (InfoNCE) loss.
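As an illustration of the objective, here is a minimal NumPy sketch of a symmetric InfoNCE loss with a log-temperature. It assumes the convention `temperature = exp(log_temperature)` with logits divided by the temperature; the training script may instead use the inverse (CLIP-style logit scale) convention. During training the log-temperature would be a `torch.nn.Parameter` updated by the optimizer; this sketch only evaluates the loss for fixed inputs.

```python
import numpy as np


def info_nce(video_embeds, text_embeds, log_temperature):
    """Symmetric InfoNCE over a batch where row i of each array is a matched pair."""
    # L2-normalize so that dot products are cosine similarities
    v = video_embeds / np.linalg.norm(video_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    # Assumed convention: logits = similarity / exp(log_temperature)
    logits = (t @ v.T) / np.exp(log_temperature)

    def cross_entropy_diag(z):
        # Numerically stable row-wise log-softmax; the target for row i is column i
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-video and video-to-text directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly separated pairs and a small temperature the loss approaches 0; when every embedding is identical it equals `log(batch_size)`, the value for a uniform guess.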
sejal1411/video_retrieval-v2/
│
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA weights
│
├── training_checkpoint/
│   ├── training_state.pt          # Full training state (optimizer, scheduler, temp)
│   └── training_summary.json      # Metrics + hyperparameters
│
├── video_embeds.npy               # Video embeddings (876, 768)
├── text_embeds.npy                # Text embeddings (876, 768)
├── metadata.csv                   # Video paths + metadata
│
├── training_curve.png
├── umap_visualization.png
├── similarity_distribution.png
└── similarity_heatmap.png
from languagebind import LanguageBindVideo, LanguageBindVideoProcessor
from peft import PeftModel
import torch, json, numpy as np
import torch.nn.functional as F
# Load base model + LoRA adapter
base = LanguageBindVideo.from_pretrained("LanguageBind/LanguageBind_Video_FT")
model = PeftModel.from_pretrained(base, "sejal1411/video_retrieval-v2")
processor = LanguageBindVideoProcessor.from_pretrained("sejal1411/video_retrieval-v2")
# Load the learned log-temperature (assumes the repo files are available in the working directory)
with open("training_checkpoint/training_summary.json") as f:
    cfg = json.load(f)
log_temp = torch.tensor(cfg["log_temperature"])
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Encode query
query = "Daredevil fighting in a dark hallway"
# "dummy_video.mp4" is a placeholder clip; only the text embedding is used below
processed = processor("dummy_video.mp4", [query], context_length=77, return_tensors="pt")
with torch.no_grad():
    output = model(**{k: v.to(device) for k, v in processed.items()})
    query_embed = F.normalize(output.text_embeds, dim=-1).cpu().numpy()
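# Retrieve top-k videos
# The dot product equals cosine similarity only if the stored video embeddings are L2-normalized
video_embeds = np.load("video_embeds.npy")  # shape (876, 768)
scores = (query_embed @ video_embeds.T).squeeze()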
top_k = np.argsort(scores)[::-1][:5]
print("Top-5 video indices:", top_k)
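The retrieval step above relies on the stored embeddings already being L2-normalized. If that is uncertain, a small helper can normalize both sides explicitly and return the scores alongside the indices. This is a sketch only; `top_k_videos` is a hypothetical name and not part of the repository.

```python
import numpy as np


def top_k_videos(query_embed, video_embeds, k=5):
    """Return (indices, cosine_scores) of the k videos most similar to the query."""
    q = np.asarray(query_embed, dtype=np.float32).reshape(-1)
    v = np.asarray(video_embeds, dtype=np.float32)
    # Normalize explicitly so the dot product is a cosine similarity
    q = q / np.linalg.norm(q)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    scores = v @ q
    # Never ask for more results than there are videos
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

The returned indices are row positions in `video_embeds.npy`; mapping them to file paths would go through `metadata.csv`, assuming its rows are stored in the same order as the embeddings.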
import torch

# Resume training: `device`, `optimizer`, and `scheduler` must already be
# constructed with the same settings as the original run
ckpt = torch.load("training_checkpoint/training_state.pt", map_location=device)
log_temperature = torch.nn.Parameter(ckpt["log_temperature"].to(device))
optimizer.load_state_dict(ckpt["optimizer_state"])
scheduler.load_state_dict(ckpt["scheduler_state"])
start_epoch = ckpt["epoch"]
best_val_loss = ckpt["best_val_loss"]
train_history = ckpt["train_history"]
val_history = ckpt["val_history"]
Custom Marvel video dataset with keyword-style text-video pairs.
| Split | Share of samples |
|---|---|
| Train | 70% |
| Validation | 15% |
| Test | ~15% |
| Total | 876 samples |
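The "~15%" for the test split is what one would expect when 876 samples do not divide evenly. The sketch below shows one way the counts could work out, assuming the train and validation sizes are rounded down and the test split takes the remainder; the actual split procedure is not documented here, so the exact counts are an assumption.

```python
def split_sizes(total, train_ratio=0.70, val_ratio=0.15):
    """Split sizes with train/val rounded down and test taking the remainder."""
    n_train = int(total * train_ratio)
    n_val = int(total * val_ratio)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(876))  # (613, 131, 132)
```

Under this assumption the test split holds 132 of 876 samples, slightly more than 15%.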
transformers==4.35.0
tokenizers==0.14.0
peft==0.10.0
bitsandbytes
decord
torchvision==0.22.1
torchaudio==2.7.1
pytorchvideo==0.1.5
accelerate==0.21.0
If you use this model, please cite:
@article{zhu2023languagebind,
title = {LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
author = {Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and others},
journal = {arXiv preprint arXiv:2310.01852},
year = {2023}
}
Sejal (@sejal1411)