Tamil-Embed-Base

A Tamil-specialized sentence embedding model fine-tuned from multilingual-e5-base (278M parameters) using Matryoshka representation learning.

Paper: "A Thousand Language Problem: Morphological Understanding in Linguistic AI"

Model Details

| Property | Value |
|---|---|
| Base model | intfloat/multilingual-e5-base |
| Parameters | 278M |
| Embedding dimensions | 768 (supports Matryoshka: 768, 512, 256, 128, 64) |
| Training data | Tamil NLI entailment pairs + Samanantar English-Tamil parallel corpus (~50K pairs) |
| Loss function | MatryoshkaLoss wrapping MultipleNegativesRankingLoss |

Training

Two-stage training pipeline:

  1. Stage 1 (NLI Warm-up): Fine-tune on Tamil NLI entailment pairs (ANLI, FEVER, LING, MNLI, WANLI) with MatryoshkaLoss wrapping MultipleNegativesRankingLoss
  2. Stage 2 (Retrieval): Fine-tune on Samanantar English-Tamil parallel corpus with hard negatives
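The MatryoshkaLoss + MultipleNegativesRankingLoss combination used in both stages can be sketched in plain NumPy. This is an illustrative reimplementation, not the actual training code; the dimension list and the `scale` temperature are assumptions matching common defaults:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss: in-batch negatives cross-entropy.

    Each anchor's matching positive sits on the diagonal of the score
    matrix; every other positive in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (B, B) scaled cosine similarities
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def matryoshka_mnrl(anchors, positives, dims=(768, 512, 256, 128, 64)):
    """MatryoshkaLoss: apply the base loss to each truncated prefix and sum,
    so every leading slice of the embedding stays useful on its own."""
    return sum(mnrl_loss(anchors[:, :d], positives[:, :d]) for d in dims)
```

During training, aligned anchor/positive pairs drive this loss toward zero at every prefix dimension simultaneously, which is what makes truncated embeddings usable later.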

MTEB Results

IndicCrosslingualSTS benchmark (Spearman correlation):

| Language pair | Score |
|---|---|
| en-hi (Hindi) | 0.640 |
| en-kn (Kannada) | 0.584 |
| en-ml (Malayalam) | 0.582 |
| en-bn (Bengali) | 0.537 |
| en-pa (Punjabi) | 0.536 |
| en-gu (Gujarati) | 0.533 |
| en-as (Assamese) | 0.512 |
| en-ta (Tamil) | 0.489 |
| en-mr (Marathi) | 0.485 |
| en-te (Telugu) | 0.468 |

Usage

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Tamil-ai/tamil-embed-base")

# E5-style models expect "query: " / "passage: " prefixes.
sentences = [
    "query: தமிழ் மொழியின் வரலாறு என்ன?",  # "What is the history of the Tamil language?"
    "passage: தமிழ் மொழி 2000 ஆண்டுகளுக்கும் மேலான வரலாற்றைக் கொண்ட செம்மொழியாகும்.",  # "Tamil is a classical language with over 2000 years of history."
    "passage: Python is a popular programming language.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute similarity: the Tamil passage should score higher than the English one.
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```

Matryoshka (variable dimensions)

```python
# Use smaller dimensions for faster search with minimal quality loss.
embeddings = model.encode(sentences)
embeddings_256 = embeddings[:, :256]  # re-normalize before cosine similarity
embeddings_128 = embeddings[:, :128]
```
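Note that truncated embeddings are generally no longer unit-length, so they should be re-normalized before cosine similarity. A small helper (our own illustration, not part of the model's API) makes this explicit:

```python
import numpy as np

def truncate_and_normalize(embeddings, dim):
    # Keep the first `dim` Matryoshka dimensions, then re-normalize each row
    # to unit length so dot products remain valid cosine similarities.
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
```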

Intended Use

  • Tamil semantic search and retrieval
  • Cross-lingual English-Tamil similarity
  • Tamil document clustering
  • RAG (Retrieval Augmented Generation) for Tamil
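For the retrieval and RAG use cases, a minimal in-memory search over precomputed passage embeddings could look like the following sketch (the `search` helper is our own illustration, not part of the model's API; embeddings would come from `model.encode` with the appropriate prefixes):

```python
import numpy as np

def search(query_emb, passage_embs, k=2):
    # Cosine-similarity search: normalize both sides, score against every
    # passage, and return the indices and scores of the top-k matches.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

For larger corpora, the same truncation trick from the Matryoshka section can shrink the index before an exact or approximate nearest-neighbor search.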

Citation

```bibtex
@misc{tamilai2026embed,
  title={A Thousand Language Problem: Morphological Understanding in Linguistic AI},
  author={Tamil-AI},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/Tamil-ai/tamil-embed-base}
}
```