---
title: SciFact Multilingual Semantic Search
emoji: "\U0001F52C"
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---

# SciFact Multilingual Semantic Search

A deployable semantic search engine over 5,183 scientific abstracts (SciFact dataset), using **ChromaDB** for vector storage and **multilingual-e5-small** for cross-lingual search in English, French, German, and Spanish.

**Live demo:** [huggingface.co/spaces/RJuro/scifact-semantic-search](https://huggingface.co/spaces/RJuro/scifact-semantic-search)

## Architecture

```
LOCAL (one-time setup)                    HF SPACES (runtime, CPU)
──────────────────────                    ────────────────────────
SciFact dataset                           Load ChromaDB from data/chroma_db/
    ↓                                     Load multilingual-e5-small
Encode 5,183 docs with                       ↓
multilingual-e5-small                     /search?q=... →
    ↓                                       encode query →
Save to ChromaDB ──── push via git ────→    ChromaDB query →
(data/chroma_db/)                           JSON results
```

**Key idea:** Corpus encoding happens once on your machine. Only query encoding runs on HF Spaces (CPU). This keeps the Space fast and cheap.

## Files

| File | Purpose |
|------|---------|
| `precompute.py` | Encodes all 5,183 SciFact docs and saves them to ChromaDB (run locally) |
| `app.py` | FastAPI server — loads ChromaDB + model, serves search API and frontend |
| `static/index.html` | Frontend — vanilla HTML/CSS/JS, no dependencies |
| `requirements.txt` | Python dependencies |
| `Dockerfile` | Container config for HF Spaces |
| `README.md` | This file (YAML header is required by HF Spaces) |

## Step-by-Step Deployment Guide

### Prerequisites

- Python 3.9+
- A free [Hugging Face](https://huggingface.co) account
- Git with [Git LFS](https://git-lfs.com/) installed (`brew install git-lfs` on macOS)

### Step 1 — Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2 — Run precompute.py (local, one-time)

This downloads the SciFact dataset, encodes all 5,183 abstracts with `intfloat/multilingual-e5-small`, and saves the vectors + metadata into a persistent ChromaDB at `data/chroma_db/`.

```bash
python precompute.py
```

This takes about 2 minutes on CPU. When it finishes, you should see:

```
ChromaDB persisted to: .../data/chroma_db
Collection 'scifact': 5183 documents
```

Verify the output:

```bash
ls data/chroma_db/
# Should show: chroma.sqlite3 and a UUID-named directory
```

### Step 3 — Test locally

```bash
uvicorn app:app --port 7860
```

Open [localhost:7860](http://localhost:7860) in your browser. Try searching:
- `effects of vaccination` (English)
- `effets de la vaccination` (French)
- `Auswirkungen der Impfung` (German)

The same English-language corpus should return relevant results regardless of query language.
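
To check the API without the browser, the endpoint can also be called from a script. The sketch below assumes the server from this step is running and that the query is passed as the `q` parameter of `/search`, as shown in the architecture diagram. The helper names are illustrative, and the exact shape of the JSON body depends on `app.py`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str) -> str:
    """Return the /search URL with the query safely URL-encoded."""
    return f"{base_url.rstrip('/')}/search?{urlencode({'q': query})}"


def search(base_url: str, query: str):
    """Call the running server and return the parsed JSON body."""
    with urlopen(build_search_url(base_url, query), timeout=30) as response:
        return json.load(response)
```

For example, `search("http://localhost:7860", "effets de la vaccination")` should return the same results the frontend displays for that query.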

### Step 4 — Create a Hugging Face Space

Go to [huggingface.co/new-space](https://huggingface.co/new-space):
- **Space name:** choose any name (e.g. `scifact-semantic-search`)
- **SDK:** Docker
- **Visibility:** Public

Or use the CLI:

```bash
pip install huggingface-hub
huggingface-cli login  # paste your HF token
python -c "
from huggingface_hub import HfApi
api = HfApi()
url = api.create_repo('YOUR-SPACE-NAME', repo_type='space', space_sdk='docker')
print(url)
"
```

### Step 5 — Push to HF Spaces

Initialize git, enable LFS (needed because `chroma.sqlite3` is ~74 MB), and push:

```bash
git init
git lfs install
git lfs track "*.sqlite3" "*.bin"
git add .
git commit -m "Initial deploy"
git branch -M main  # make sure the local branch is named main before pushing
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/YOUR-SPACE-NAME
git push origin main
```

If the push is rejected because the Space already contains the default commit that HF creates, pull and rebase first:

```bash
git pull origin main --rebase
# Resolve any conflicts in .gitattributes / README.md (keep your versions)
git add .
git rebase --continue
git push origin main
```

### Step 6 — Wait for build

HF Spaces will build the Docker image (installs PyTorch, sentence-transformers, etc.). This takes 5-10 minutes on the first deploy. Watch progress in the Space's **Logs** tab.

Once the status shows **Running**, your app is live.

## How It Works

### Embedding model

**`intfloat/multilingual-e5-small`** (118M params, 384 dimensions)

This is a compact multilingual retrieval model. Critical detail — E5 models require prefixes:
- Documents: `passage: {text}`
- Queries: `query: {text}`

Without these prefixes, retrieval quality drops significantly.
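
A minimal sketch of how the prefixes might be applied before encoding. The helper names are hypothetical and not taken from `precompute.py` or `app.py`; `model` stands for any object with a sentence-transformers style `encode(list_of_strings)` method.

```python
def as_passage(text: str) -> str:
    """Prefix a corpus document the way E5 expects."""
    return f"passage: {text}"


def as_query(text: str) -> str:
    """Prefix a user query the way E5 expects."""
    return f"query: {text}"


def encode_queries(model, texts):
    """Encode queries with the prefix applied.

    `model` is assumed to expose encode(list[str]), as
    sentence_transformers.SentenceTransformer does.
    """
    return model.encode([as_query(t) for t in texts])


def encode_passages(model, texts):
    """Encode corpus documents with the prefix applied."""
    return model.encode([as_passage(t) for t in texts])
```

Keeping the prefixing in one place makes it harder to mix up the two roles, for example by encoding a query with the `passage: ` prefix.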

### Vector database

**ChromaDB** with persistent storage and cosine distance. Documents are stored with precomputed embeddings so ChromaDB doesn't need to re-embed anything at runtime.

### Search flow

1. User types a query in any supported language
2. FastAPI encodes it with `query: {text}` prefix using the E5 model
3. ChromaDB finds the 5 nearest neighbors by cosine distance
4. Results are returned as JSON: `{rank, score, title, text}`
5. Score = 1 - cosine_distance (displayed as similarity percentage)
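
The ranking and scoring in steps 3 to 5 can be illustrated with a brute-force sketch in plain Python. This is not how ChromaDB works internally (it uses an approximate nearest-neighbor index), and the `embedding` key and function names are assumptions made for the example. It only shows how cosine distance turns into the returned `score`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(query_vec, docs, top_k=5):
    """Rank docs by cosine distance to the query and format the results.

    Each doc is a dict with hypothetical keys: title, text, embedding.
    """
    ranked = sorted(docs, key=lambda d: cosine_distance(query_vec, d["embedding"]))
    results = []
    for rank, doc in enumerate(ranked[:top_k], start=1):
        distance = cosine_distance(query_vec, doc["embedding"])
        results.append({
            "rank": rank,
            "score": round(1.0 - distance, 4),  # similarity = 1 - distance
            "title": doc["title"],
            "text": doc["text"],
        })
    return results
```

A document whose embedding points in the same direction as the query gets a score of 1.0, and an orthogonal one gets 0.0.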

### Cross-lingual search

The multilingual E5 model maps text from different languages into the same vector space. A French query about vaccination lands near English documents about vaccination — no translation needed.

## Customization Ideas

- **Different dataset:** Replace `load_scifact()` in `precompute.py` with your own corpus
- **More languages:** The model supports roughly 100 languages; add more example chips in `static/index.html`
- **More results:** Change the `top_k` parameter (default 5, max 20)
- **Reranking:** Add a cross-encoder reranker on top of the retrieval results for better precision
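
If you change the result limits, one way to keep `top_k` within bounds is a small clamp like the sketch below. The constant and function names are hypothetical, and `app.py` may enforce the default of 5 and the maximum of 20 differently, for example through FastAPI's parameter validation.

```python
DEFAULT_TOP_K = 5   # default stated in this README
MAX_TOP_K = 20      # maximum stated in this README


def clamp_top_k(requested=None):
    """Return a top_k value between 1 and MAX_TOP_K."""
    if requested is None:
        return DEFAULT_TOP_K
    return max(1, min(int(requested), MAX_TOP_K))
```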