---
title: SciFact Multilingual Semantic Search
emoji: "\U0001F52C"
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---

# SciFact Multilingual Semantic Search

A deployable semantic search engine over 5,183 scientific abstracts (SciFact dataset), using **ChromaDB** for vector storage and **multilingual-e5-small** for cross-lingual search in English, French, German, and Spanish.

**Live demo:** [huggingface.co/spaces/RJuro/scifact-semantic-search](https://huggingface.co/spaces/RJuro/scifact-semantic-search)

## Architecture

```
LOCAL (one-time setup)                    HF SPACES (runtime, CPU)
──────────────────────                    ────────────────────────
SciFact dataset                           Load ChromaDB from data/chroma_db/
        ↓                                 Load multilingual-e5-small
Encode 5,183 docs with                            ↓
multilingual-e5-small                     /search?q=... →
        ↓                                 encode query →
Save to ChromaDB ──── push via git ────→  ChromaDB query →
(data/chroma_db/)                         JSON results
```

**Key idea:** Corpus encoding happens once on your machine. Only query encoding runs on HF Spaces (CPU). This keeps the Space fast and cheap.

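
As an illustration of the local, one-time half of the diagram, here is a minimal sketch of how documents could be encoded and stored with precomputed embeddings. It is not the actual contents of `precompute.py`: the function names `build_index` and `run_precompute`, the `id`/`title`/`text` document keys, the batch size, and the `title` metadata field are all assumptions made for the example. The encoder and collection are passed in so the core loop can be tried without downloading the model.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type alias: any function mapping texts to embedding vectors.
Encoder = Callable[[List[str]], List[List[float]]]


def build_index(docs: Sequence[Dict[str, str]], encode: Encoder,
                collection, batch_size: int = 64) -> int:
    """Encode docs in batches and store text, metadata and vectors together."""
    total = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # E5 models expect the "passage: " prefix on the document side.
        vectors = encode(["passage: " + d["text"] for d in batch])
        # Passing embeddings explicitly means the store never re-embeds.
        collection.add(
            ids=[d["id"] for d in batch],
            embeddings=vectors,
            documents=[d["text"] for d in batch],
            metadatas=[{"title": d["title"]} for d in batch],  # assumed field
        )
        total += len(batch)
    return total


def run_precompute(docs: Sequence[Dict[str, str]]) -> int:
    """Assumed wiring with the real libraries (imported lazily on purpose)."""
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")
    client = chromadb.PersistentClient(path="data/chroma_db")
    collection = client.get_or_create_collection(
        "scifact", metadata={"hnsw:space": "cosine"}
    )

    def encode(texts: List[str]) -> List[List[float]]:
        return model.encode(texts, normalize_embeddings=True).tolist()

    return build_index(docs, encode, collection)
```

Because `build_index` only relies on an `add(ids=..., embeddings=..., documents=..., metadatas=...)` method, it can be exercised with a small stand-in object before running the full two-minute encode.
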
## Files

| File | Purpose |
|------|---------|
| `precompute.py` | Encodes all 5,183 SciFact docs and saves them to ChromaDB (run locally) |
| `app.py` | FastAPI server: loads ChromaDB and the model, serves the search API and frontend |
| `static/index.html` | Frontend in vanilla HTML/CSS/JS, no dependencies |
| `requirements.txt` | Python dependencies |
| `Dockerfile` | Container config for HF Spaces |
| `README.md` | This file (YAML header is required by HF Spaces) |

## Step-by-Step Deployment Guide

### Prerequisites

- Python 3.9+
- A free [Hugging Face](https://huggingface.co) account
- Git with [Git LFS](https://git-lfs.com/) installed (`brew install git-lfs` on macOS)

### Step 1: Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2: Run precompute.py (local, one-time)

This downloads the SciFact dataset, encodes all 5,183 abstracts with `intfloat/multilingual-e5-small`, and saves the vectors and metadata into a persistent ChromaDB at `data/chroma_db/`.

```bash
python precompute.py
```

This takes ~2 minutes on CPU. When it finishes you should see:

```
ChromaDB persisted to: .../data/chroma_db
Collection 'scifact': 5183 documents
```

Verify the output:

```bash
ls data/chroma_db/
# Should show: chroma.sqlite3 and a UUID-named directory
```

### Step 3: Test locally

```bash
uvicorn app:app --port 7860
```

Open [localhost:7860](http://localhost:7860) in your browser. Try searching:

- `effects of vaccination` (English)
- `effets de la vaccination` (French)
- `Auswirkungen der Impfung` (German)

The corpus is English-only, but all three queries should return relevant results regardless of query language.

### Step 4: Create a Hugging Face Space

Go to [huggingface.co/new-space](https://huggingface.co/new-space):

- **Space name:** choose any name (e.g. `scifact-semantic-search`)
- **SDK:** Docker
- **Visibility:** Public

Or use the CLI:

```bash
pip install huggingface-hub
huggingface-cli login  # paste your HF token
python -c "
from huggingface_hub import HfApi
api = HfApi()
url = api.create_repo('YOUR-SPACE-NAME', repo_type='space', space_sdk='docker')
print(url)
"
```

### Step 5: Push to HF Spaces

Initialize git, enable LFS (needed because `chroma.sqlite3` is ~74 MB), and push:

```bash
git init
git lfs install
git lfs track "*.sqlite3" "*.bin"
git add .
git commit -m "Initial deploy"
git branch -M main  # make sure the local branch is named main before pushing
git remote add origin https://huggingface.co/spaces/YOUR-USERNAME/YOUR-SPACE-NAME
git push origin main
```

If the push is rejected (HF creates a default commit), pull first:

```bash
git pull origin main --rebase
# Resolve any conflicts in .gitattributes / README.md (keep your versions)
git add .
git rebase --continue
git push origin main
```

### Step 6: Wait for build

HF Spaces will build the Docker image (installs PyTorch, sentence-transformers, etc.). This takes 5-10 minutes on the first deploy. Watch progress in the Space's **Logs** tab. Once the status shows **Running**, your app is live.

## How It Works

### Embedding model

**`intfloat/multilingual-e5-small`** (118M params, 384 dimensions)

This is a compact multilingual retrieval model. Critical detail: E5 models require prefixes.

- Documents: `passage: {text}`
- Queries: `query: {text}`

Without these prefixes, retrieval quality drops significantly.

### Vector database

**ChromaDB** with persistent storage and cosine distance. Documents are stored with precomputed embeddings, so ChromaDB doesn't need to re-embed anything at runtime.

### Search flow

1. User types a query in any supported language
2. FastAPI encodes it with the `query: {text}` prefix using the E5 model
3. ChromaDB finds the 5 nearest neighbors by cosine distance
4. Results are returned as JSON: `{rank, score, title, text}`
5. The score is `1 - cosine_distance` (displayed as a similarity percentage)

### Cross-lingual search

The multilingual E5 model maps text from different languages into the same vector space. A French query about vaccination lands near English documents about vaccination, with no translation needed.

## Customization Ideas

- **Different dataset:** Replace `load_scifact()` in `precompute.py` with your own corpus
- **More languages:** The model supports about 100 languages; add more example chips in `index.html`
- **More results:** Change the `top_k` parameter (default 5, max 20)
- **Reranking:** Add a cross-encoder reranker on top of the retrieval results for better precision
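
To make the search flow and the `top_k` limits described above concrete, here is a minimal sketch of the query-time path. It is not the code in `app.py`: the function name `search`, the constant `MAX_TOP_K`, and the `title` metadata key are assumptions for the example, and the encoder and collection are injected so the logic can be checked without the model or a real database. The `query(...)` call follows the shape of ChromaDB's collection query method.

```python
from typing import Callable, Dict, List

MAX_TOP_K = 20  # upper bound taken from the "More results" note


def search(query: str,
           encode: Callable[[List[str]], List[List[float]]],
           collection,
           top_k: int = 5) -> List[Dict]:
    """Return ranked hits shaped like {rank, score, title, text}."""
    # Keep top_k within the documented range of 1..20.
    top_k = max(1, min(top_k, MAX_TOP_K))

    # E5 models expect the "query: " prefix on the query side.
    query_vec = encode(["query: " + query])[0]

    res = collection.query(
        query_embeddings=[query_vec],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Results are nested per query; we sent one query, so take index 0.
    hits = zip(res["documents"][0], res["metadatas"][0], res["distances"][0])
    return [
        {
            "rank": rank,
            "score": 1.0 - dist,  # cosine distance -> similarity
            "title": meta.get("title", ""),  # assumed metadata key
            "text": text,
        }
        for rank, (text, meta, dist) in enumerate(hits, start=1)
    ]
```

A reranker, as suggested in the customization list, would slot in after this function: retrieve a larger `top_k`, rescore the returned `text` fields against the query, and re-sort before assigning ranks.
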