---
tags:
- mixture-of-experts
- moe
- transformer
- language-model
- pytorch
- conditional-computation
datasets:
- custom
pipeline_tag: text-generation
license: mit
---

# Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

## Models

This repository contains two MoE architectures:

### 1. Sparse MoE (Top-K Routing)

Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.

### 2. Dynamic MoE (Confidence-Based Routing)

Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.

## Model Details

| Parameter | Sparse MoE | Dynamic MoE |
|-----------|------------|-------------|
| Layers | 4 | 4 |
| Hidden Dim | 512 | 512 |
| FFN Dim | 2048 | 2048 |
| Attention Heads | 8 | 8 |
| Experts | 8 | 4 |
| Routing | Top-2 | τ=0.8 threshold |
| Context Length | 256 | 256 |
| Vocab Size | 10,000 | 10,000 |

## Architecture

```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
└─ RMSNorm → Multi-Head Self-Attention → Residual
└─ RMSNorm → MoE Layer → Residual

MoE Layer:
└─ Router (softmax gating)
└─ Expert Selection (Top-K or Dynamic)
└─ Weighted Expert Outputs
```

## Training

Both models were trained with:

- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Learning Rate**: 3e-4 with cosine decay
- **Warmup Steps**: 2,000
- **Weight Decay**: 0.1

### Loss Functions

**Sparse MoE:**

```
L = L_CE + α * L_balance
```

**Dynamic MoE:**

```
L = L_CE + β * L_balance + γ * L_entropy
```

Where:

- `L_CE`: Cross-entropy loss
- `L_balance`: Load balancing loss (encourages uniform expert utilization)
- `L_entropy`: Entropy regularization (encourages sparse routing)

## Usage

```python
import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt"))
```

## Files

| File | Description |
|------|-------------|
| `sparse_moe_final.pt` | Sparse MoE model weights |
| `dynamic_moe_final.pt` | Dynamic MoE model weights |
| `sparse_moe_config.json` | Sparse MoE configuration |
| `dynamic_moe_config.json` | Dynamic MoE configuration |

## Citation

```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```

## Reference

Based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652).

## License

MIT
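
## Appendix: Top-2 Routing Sketch

The Sparse MoE description and the MoE layer diagram above (router → expert selection → weighted expert outputs) can be made concrete with a small routing sketch. The code below is an illustrative stand-in, not the repository's `MoeLM` implementation; the names `Expert` and `SimpleTopKMoE` and all internals are assumptions, and only the softmax-gating/top-k/weighted-sum structure mirrors the card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Position-wise feed-forward expert (hypothetical sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SimpleTopKMoE(nn.Module):
    """Minimal top-k MoE layer: softmax router -> top-k experts -> weighted sum."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                        # (batch*seq, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)         # (batch*seq, num_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize the selected gates so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that routed to expert e in any of their top-k slots.
            token_rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if token_rows.numel() == 0:
                continue
            gate = topk_probs[token_rows, slots].unsqueeze(-1)
            out[token_rows] += gate * expert(tokens[token_rows])
        return out.reshape(batch, seq, d_model)


# Example: route a batch of 256-token sequences through the layer.
layer = SimpleTopKMoE(d_model=512, d_ff=2048, num_experts=8, top_k=2)
out = layer(torch.randn(2, 256, 512))   # -> (2, 256, 512)
```

The explicit loop over experts keeps the sketch readable; real implementations typically use batched dispatch/gather with capacity limits instead.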
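
## Appendix: Confidence-Based Routing Sketch

The Dynamic MoE's τ=0.8 rule can be read, following the referenced paper, as adding experts in decreasing probability order until their cumulative routing probability reaches the threshold. The function below sketches that reading; `dynamic_expert_selection` is a hypothetical helper, and the exact rule inside `DynamicMOELM` may differ.

```python
import torch
import torch.nn.functional as F


def dynamic_expert_selection(router_logits: torch.Tensor, tau: float = 0.8):
    """For each token, keep the smallest set of experts (in decreasing
    probability order) whose cumulative routing probability reaches tau.

    router_logits: (num_tokens, num_experts)
    Returns a boolean selection mask and the renormalized gate weights.
    """
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep rank j iff the cumulative mass of ranks 0..j-1 is still below tau.
    keep_sorted = cumulative < tau
    keep_sorted[..., 1:] = keep_sorted[..., :-1].clone()
    keep_sorted[..., 0] = True  # always keep at least the top expert
    # Scatter the per-rank decision back to the original expert positions.
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted.float()).bool()
    gates = probs * mask
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return mask, gates


# A confident ("easy") token stops after one expert; a flat ("hard")
# distribution pulls in all four experts before reaching tau.
logits = torch.tensor([[4.0, 1.0, 0.5, 0.2],
                       [1.0, 0.9, 0.8, 0.7]])
mask, gates = dynamic_expert_selection(logits, tau=0.8)
```

The selected experts would then be combined exactly as in the top-k sketch, weighted by the renormalized gates.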
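
## Appendix: Auxiliary Loss Sketch

The training objectives above name `L_balance` and `L_entropy` without defining them. A common choice, sketched here under assumptions, is a Switch-Transformer-style load-balancing term and the mean per-token routing entropy; the exact formulations used in this repository are not specified on this card.

```python
import torch


def load_balance_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    """Assumed Switch-style balancing term:
    num_experts * sum_e( fraction_of_tokens_routed_to_e * mean_router_prob_for_e ).

    router_probs: (num_tokens, num_experts) softmax router outputs
    expert_mask:  (num_tokens, num_experts) 0/1 routing decisions
    """
    num_experts = router_probs.shape[-1]
    tokens_per_expert = expert_mask.float().mean(dim=0)   # f_e
    prob_per_expert = router_probs.mean(dim=0)            # P_e
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)


def entropy_loss(router_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean per-token routing entropy; minimizing it pushes the router
    toward sparse, confident decisions (assumed formulation)."""
    return -(router_probs * (router_probs + eps).log()).sum(dim=-1).mean()
```

In the combined objectives above, these terms would be scaled by α (or β) and γ respectively and added to the cross-entropy loss.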
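
## Appendix: Text Generation Sketch

The Usage section loads the checkpoints but does not show inference. The loop below is a generic greedy-decoding sketch; it assumes the models take a `(batch, seq)` tensor of token ids and return `(batch, seq, vocab_size)` logits, which this card does not confirm, and tokenization is out of scope.

```python
import torch


@torch.no_grad()
def greedy_generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 50,
                    context_length: int = 256) -> torch.Tensor:
    """Greedy decoding sketch. Assumes model(input_ids) -> (batch, seq, vocab) logits."""
    model.eval()
    ids = prompt_ids
    for _ in range(max_new_tokens):
        # Crop to the model's context window before each forward pass.
        logits = model(ids[:, -context_length:])
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids


# Hypothetical usage with the loaded sparse model and already-tokenized prompt ids:
# out_ids = greedy_generate(sparse_model, torch.tensor([[1, 42, 7]]), max_new_tokens=20)
```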