---
license: gemma
datasets:
- Helsinki-NLP/tatoeba_mt
language:
- en
- de
base_model:
- google/embeddinggemma-300m
---

# Parametric UMAP: English-German Cross-Lingual Embeddings (768D → 2D)

A Parametric UMAP model that projects 768-dimensional semantic embeddings from `google/embeddinggemma-300m` into a shared 2D cross-lingual space for English and German.

### Architecture

- **Input**: 768-dimensional embeddings (from embeddinggemma-300m)
- **Encoder**:
  - Dense(768 → 256) + ReLU
  - Dense(256 → 128) + ReLU
  - Dense(128 → 2) (linear output)
- **Output**: 2D coordinates

### Training Details

- **Base embeddings**: `google/embeddinggemma-300m`
- **Training data**: 200,000 English-German sentence pairs from Helsinki-NLP/tatoeba_mt
- **Method**: Parametric UMAP with cosine metric
- **Framework**: TensorFlow 2.14 + umap-learn 0.5.5
- **Epochs**: 10 full epochs over the UMAP graph
- **Batch size**: 1000 edges

### Primary Use Cases

- **Visualization**: Plot bilingual text data in 2D for exploration
- **Similarity analysis**: Find semantically similar texts across languages
- **Cross-lingual clustering**: Group related content in EN/DE
- **Semantic search**: Fast nearest-neighbor search in 2D space

## How to Use

### Installation

```bash
pip install sentence-transformers tensorflow umap-learn numpy matplotlib
```

### Basic Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

# Load models
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")

# Your sentences
sentences = [
    "Hello world",
    "Hallo Welt"
]

# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True
)

# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)
print(coords_2d)
# [[0.11, 7.84],
#  [0.28, 7.70]]
```

### Visualization Example

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])

# Label each point with its sentence
for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))

plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
```
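
### Nearest-Neighbor Search Example

The similarity and semantic search use cases can be sketched with plain NumPy once 2D coordinates are available. The block below is a minimal illustration, not part of the released model: the helper `nearest_neighbors` is a hypothetical name, and the coordinates are made-up placeholders standing in for real `umap_encoder.predict(...)` output. It ranks points by Euclidean distance in the 2D space, which is only a coarse proxy for similarity in the original 768D space.

```python
import numpy as np


def nearest_neighbors(coords_2d, query_index, k=3):
    """Return indices of the k points closest to coords_2d[query_index].

    Hypothetical helper for illustration; uses Euclidean distance in 2D.
    """
    coords_2d = np.asarray(coords_2d, dtype=float)
    # Distance from the query point to every point in the projection
    dists = np.linalg.norm(coords_2d - coords_2d[query_index], axis=1)
    # Exclude the query itself from its own neighbor list
    dists[query_index] = np.inf
    k = min(k, len(coords_2d) - 1)
    return np.argsort(dists)[:k]


# Placeholder coordinates (assumed values, NOT real model output)
sentences = ["Hello world", "Hallo Welt", "Good morning", "Guten Morgen"]
coords_2d = np.array([
    [0.0, 0.0],
    [0.1, 0.1],
    [5.0, 5.0],
    [5.2, 4.9],
])

# Print the closest other sentence for each sentence
for i, sent in enumerate(sentences):
    j = nearest_neighbors(coords_2d, i, k=1)[0]
    print(f"{sent!r} -> {sentences[j]!r}")
```

To try this on real data, replace the placeholder array with the `coords_2d` produced in Basic Usage. For larger collections, a spatial index would likely be faster than this brute-force scan.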