---
license: gemma
datasets:
- Helsinki-NLP/tatoeba_mt
language:
- en
- de
base_model:
- google/embeddinggemma-300m
---
# Parametric UMAP: English-German Cross-Lingual Embeddings (768D → 2D)

A Parametric UMAP model that projects 768-dimensional semantic embeddings from `google/embeddinggemma-300m` into a shared 2D cross-lingual space for English and German.

### Architecture

- **Input**: 768-dimensional embeddings (from embeddinggemma-300m)
- **Encoder**:
  - Dense(768 → 256) + ReLU
  - Dense(256 → 128) + ReLU
  - Dense(128 → 2) (linear output)
- **Output**: 2D coordinates
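
For reference, a minimal Keras sketch of this encoder stack (illustrative only; load the saved model from this repo for the actual trained weights):

```python
from tensorflow import keras

# Encoder: 768D embedding -> 2D coordinate, as listed above
encoder = keras.Sequential([
    keras.layers.Input(shape=(768,)),            # embeddinggemma-300m embeddings
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(2),                       # linear 2D output
])
encoder.summary()
```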

### Training Details

- **Base embeddings**: `google/embeddinggemma-300m`
- **Training data**: 200,000 English-German sentence pairs from Helsinki-NLP/tatoeba_mt
- **Method**: Parametric UMAP with cosine metric
- **Framework**: TensorFlow 2.14 + umap-learn 0.5.5
- **Epochs**: 10 full epochs over the UMAP graph
- **Batch size**: 1000 edges
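
The original training script is not included here, but a hedged sketch of an equivalent setup with umap-learn's `ParametricUMAP` (parameter names follow the umap-learn API; `encoder` is the Keras model sketched above, and `embeddings_768d` is assumed to be an (N, 768) array of embeddinggemma outputs) looks like this:

```python
from umap.parametric_umap import ParametricUMAP

pumap = ParametricUMAP(
    encoder=encoder,        # custom Keras encoder (768 -> 2)
    n_components=2,
    metric="cosine",        # cosine metric on the 768D embeddings
    n_training_epochs=10,   # 10 full epochs over the UMAP graph
    batch_size=1000,        # 1000 edges per batch
)
coords_2d = pumap.fit_transform(embeddings_768d)  # (N, 768) -> (N, 2)

# The trained encoder can then be saved for standalone inference:
pumap.encoder.save("encoder.keras")
```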

### Primary Use Cases

- **Visualization**: Plot bilingual text data in 2D for exploration
- **Similarity analysis**: Find semantically similar texts across languages
- **Cross-lingual clustering**: Group related content in EN/DE
- **Semantic search**: Fast nearest-neighbor search in 2D space (see the Similarity Search Example below)

## How to Use

### Installation

```bash
pip install sentence-transformers tensorflow umap-learn numpy matplotlib
```

### Basic Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

# Load the base embedding model and the trained UMAP encoder
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")  # replace with the local path to this repo's saved encoder

# Your sentences
sentences = [
    "Hello world",
    "Hallo Welt"
]

# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True
)

# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)

print(coords_2d)
# [[0.11, 7.84],
#  [0.28, 7.70]]
```

### Visualization Example

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])

for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))

plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
```
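
### Similarity Search Example

To illustrate the semantic-search use case, here is a hedged sketch of nearest-neighbor lookup over the 2D coordinates using SciPy's `cKDTree` (requires `pip install scipy`, which the model itself does not need):

```python
from scipy.spatial import cKDTree

# Build a KD-tree index over all projected points
tree = cKDTree(coords_2d)

# Query the nearest neighbors of the first sentence (k includes the query itself)
dist, idx = tree.query(coords_2d[0], k=2)
for d, i in zip(dist, idx):
    print(f"{sentences[i]!r} (distance {d:.3f})")
```

Because the index is built on 2D points rather than 768D vectors, lookups stay cheap even for large corpora, at the cost of whatever detail the projection discards.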