---
tags:
- mixture-of-experts
- moe
- transformer
- language-model
- pytorch
- conditional-computation
datasets:
- custom
pipeline_tag: text-generation
license: mit
---

# Mixture-of-Experts Language Models

A PyTorch implementation exploring conditional computation in Transformers through Mixture-of-Experts (MoE).

## Models

This repository contains two MoE architectures:

### 1. Sparse MoE (Top-K Routing)

Routes each token to a fixed number of experts (k=2), increasing model capacity without proportionally increasing compute.

### 2. Dynamic MoE (Confidence-Based Routing)

Dynamically adjusts the number of experts per token based on routing confidence: "easy" tokens use fewer experts, "hard" tokens use more.

## Model Details

| Parameter | Sparse MoE | Dynamic MoE |
|-----------|------------|-------------|
| Layers | 4 | 4 |
| Hidden Dim | 512 | 512 |
| FFN Dim | 2048 | 2048 |
| Attention Heads | 8 | 8 |
| Experts | 8 | 4 |
| Routing | Top-2 | τ=0.8 threshold |
| Context Length | 256 | 256 |
| Vocab Size | 10,000 | 10,000 |

## Architecture

```
Input → Embedding → [Transformer Block × N] → RMSNorm → Linear → Output

Transformer Block:
└─ RMSNorm → Multi-Head Self-Attention → Residual
└─ RMSNorm → MoE Layer → Residual

MoE Layer:
└─ Router (softmax gating)
└─ Expert Selection (Top-K or Dynamic)
└─ Weighted Expert Outputs
```

## Training

Both models were trained with:

- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Learning Rate**: 3e-4 with cosine decay
- **Warmup Steps**: 2,000
- **Weight Decay**: 0.1

### Loss Functions

**Sparse MoE:**

```
L = L_CE + α * L_balance
```

**Dynamic MoE:**

```
L = L_CE + β * L_balance + γ * L_entropy
```

Where:

- `L_CE`: Cross-entropy loss
- `L_balance`: Load balancing loss (encourages uniform expert utilization)
- `L_entropy`: Entropy regularization (encourages sparse routing)

## Usage

```python
import torch
from moe.moelm import MoeLM, DynamicMOELM

# Load Sparse MoE
sparse_model = MoeLM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=8,
    top_k=2
)
sparse_model.load_state_dict(torch.load("sparse_moe_final.pt"))

# Load Dynamic MoE
dynamic_model = DynamicMOELM(
    vocab_size=10000,
    num_layers=4,
    context_length=256,
    d_model=512,
    d_ff=2048,
    num_heads=8,
    num_experts=4,
    confidence_threshold=0.8
)
dynamic_model.load_state_dict(torch.load("dynamic_moe_final.pt"))
```

## Files

| File | Description |
|------|-------------|
| `sparse_moe_final.pt` | Sparse MoE model weights |
| `dynamic_moe_final.pt` | Dynamic MoE model weights |
| `sparse_moe_config.json` | Sparse MoE configuration |
| `dynamic_moe_config.json` | Dynamic MoE configuration |

## Citation

```bibtex
@misc{moe-lm-2024,
  title={Mixture-of-Experts Language Model},
  author={Chaitanya},
  year={2024},
  url={https://github.com/chaitanya/transformers-and-MOE}
}
```

## Reference

Based on ["Harder Tasks Need More Experts: Dynamic Routing in MoE Models"](https://arxiv.org/abs/2403.07652).

## License

MIT
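
## Appendix: Top-2 Routing Sketch

The Sparse MoE description and the MoE layer diagram above (router → expert selection → weighted expert outputs) can be made concrete with a small routing sketch. The code below is an illustrative stand-in, not the repository's `MoeLM` implementation; the names `Expert` and `SimpleTopKMoE` and all internals are assumptions, and only the softmax-gating/top-k/weighted-sum structure mirrors the card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Position-wise feed-forward expert (hypothetical sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SimpleTopKMoE(nn.Module):
    """Minimal top-k MoE layer: softmax router -> top-k experts -> weighted sum."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                        # (batch*seq, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)         # (batch*seq, num_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize the selected gates so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that routed to expert e in any of their top-k slots.
            token_rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if token_rows.numel() == 0:
                continue
            gate = topk_probs[token_rows, slots].unsqueeze(-1)
            out[token_rows] += gate * expert(tokens[token_rows])
        return out.reshape(batch, seq, d_model)


# Example: route a batch of 256-token sequences through the layer.
layer = SimpleTopKMoE(d_model=512, d_ff=2048, num_experts=8, top_k=2)
out = layer(torch.randn(2, 256, 512))   # -> (2, 256, 512)
```

The explicit loop over experts keeps the sketch readable; real implementations typically use batched dispatch/gather with capacity limits instead.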
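
## Appendix: Confidence-Based Routing Sketch

The Dynamic MoE's τ=0.8 rule can be read, following the referenced paper, as adding experts in decreasing probability order until their cumulative routing probability reaches the threshold. The function below sketches that reading; `dynamic_expert_selection` is a hypothetical helper, and the exact rule inside `DynamicMOELM` may differ.

```python
import torch
import torch.nn.functional as F


def dynamic_expert_selection(router_logits: torch.Tensor, tau: float = 0.8):
    """For each token, keep the smallest set of experts (in decreasing
    probability order) whose cumulative routing probability reaches tau.

    router_logits: (num_tokens, num_experts)
    Returns a boolean selection mask and the renormalized gate weights.
    """
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep rank j iff the cumulative mass of ranks 0..j-1 is still below tau.
    keep_sorted = cumulative < tau
    keep_sorted[..., 1:] = keep_sorted[..., :-1].clone()
    keep_sorted[..., 0] = True  # always keep at least the top expert
    # Scatter the per-rank decision back to the original expert positions.
    mask = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted.float()).bool()
    gates = probs * mask
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return mask, gates


# A confident ("easy") token stops after one expert; a flat ("hard")
# distribution pulls in all four experts before reaching tau.
logits = torch.tensor([[4.0, 1.0, 0.5, 0.2],
                       [1.0, 0.9, 0.8, 0.7]])
mask, gates = dynamic_expert_selection(logits, tau=0.8)
```

The selected experts would then be combined exactly as in the top-k sketch, weighted by the renormalized gates.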
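
## Appendix: Auxiliary Loss Sketch

The training objectives above name `L_balance` and `L_entropy` without defining them. A common choice, sketched here under assumptions, is a Switch-Transformer-style load-balancing term and the mean per-token routing entropy; the exact formulations used in this repository are not specified on this card.

```python
import torch


def load_balance_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    """Assumed Switch-style balancing term:
    num_experts * sum_e( fraction_of_tokens_routed_to_e * mean_router_prob_for_e ).

    router_probs: (num_tokens, num_experts) softmax router outputs
    expert_mask:  (num_tokens, num_experts) 0/1 routing decisions
    """
    num_experts = router_probs.shape[-1]
    tokens_per_expert = expert_mask.float().mean(dim=0)   # f_e
    prob_per_expert = router_probs.mean(dim=0)            # P_e
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)


def entropy_loss(router_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean per-token routing entropy; minimizing it pushes the router
    toward sparse, confident decisions (assumed formulation)."""
    return -(router_probs * (router_probs + eps).log()).sum(dim=-1).mean()
```

In the combined objectives above, these terms would be scaled by α (or β) and γ respectively and added to the cross-entropy loss.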
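
## Appendix: Text Generation Sketch

The Usage section loads the checkpoints but does not show inference. The loop below is a generic greedy-decoding sketch; it assumes the models take a `(batch, seq)` tensor of token ids and return `(batch, seq, vocab_size)` logits, which this card does not confirm, and tokenization is out of scope.

```python
import torch


@torch.no_grad()
def greedy_generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 50,
                    context_length: int = 256) -> torch.Tensor:
    """Greedy decoding sketch. Assumes model(input_ids) -> (batch, seq, vocab) logits."""
    model.eval()
    ids = prompt_ids
    for _ in range(max_new_tokens):
        # Crop to the model's context window before each forward pass.
        logits = model(ids[:, -context_length:])
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids


# Hypothetical usage with the loaded sparse model and already-tokenized prompt ids:
# out_ids = greedy_generate(sparse_model, torch.tensor([[1, 42, 7]]), max_new_tokens=20)
```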