Related paper: [Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics](https://arxiv.org/abs/2502.13785)
Fine-tuned version of NVIDIA NV-CodonFM-Encodon-80M-v1 for predicting mRNA stability (half-life) from coding sequences.
| Property | Value |
|---|---|
| Base Model | nvidia/NV-CodonFM-Encodon-80M-v1 |
| Parameters | 77.9M (80M class) |
| Architecture | BERT-style Transformer encoder with Rotary Position Embeddings (RoPE) |
| Tokenization | Codon-level (69-token vocabulary: 64 codons + 5 special tokens) |
| Max Length | 2,046 codons (6,138 nucleotides) |
| Task | Regression: predict an mRNA stability score |
| Input | mRNA/DNA coding sequence |
| Output | Continuous stability score (higher = more stable) |
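Codon-level tokenization can be sketched in a few lines. The special-token names and vocabulary ordering below are assumptions for illustration; the released `codon_vocab.json` defines the actual mapping:

```python
from itertools import product

# 5 special tokens + 64 codons = 69-token vocabulary
# (special-token names are assumed, not taken from the real tokenizer)
SPECIALS = ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>"]
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CODONS)}

def tokenize(seq: str) -> list[int]:
    """Split an mRNA (or DNA) coding sequence into codons and map to token ids."""
    seq = seq.upper().replace("T", "U")  # accept DNA input as well
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in codons]

print(len(VOCAB))             # 69
print(tokenize("AUGGCAGCC"))  # three codon ids
```

Because the tokenizer works on codons rather than nucleotides, the 2,046-codon limit corresponds to 6,138 nt of coding sequence.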
- Encoder: 6 Transformer layers, hidden_size=1024, 8 attention heads, 4096 FFN dim
- Position: Rotary Position Embeddings (RoPE, θ=10,000)
- Pretraining: masked language modeling (MLM) on >130M coding sequences from NCBI RefSeq
- Fine-tuning: regression head (mean-pool → Dense → Tanh → Dense → scalar)
- Strategy: freeze the first 4 of 6 layers; train the last 2 plus the regression head
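The regression head described above (mean-pool → Dense → Tanh → Dense → scalar) might look like the following in PyTorch. This is a sketch under the stated architecture, not the training script's exact module; the mask-aware mean pooling is an assumption about how padding is handled:

```python
import torch
import torch.nn as nn

class StabilityHead(nn.Module):
    """Mean-pooled -> Dense -> Tanh -> Dense -> scalar regression head (sketch)."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Mask-aware mean pooling over codon positions
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.out(torch.tanh(self.dense(pooled))).squeeze(-1)

head = StabilityHead()
h = torch.randn(2, 10, 1024)             # [batch, codons, hidden]
m = torch.ones(2, 10, dtype=torch.long)  # attention mask
print(head(h, m).shape)                  # torch.Size([2]), one score per sequence
```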
| Dataset | Samples | Description |
|---|---|---|
| mogam-ai/CDS-BART-mRNA-stability | 41,063 | iCodon vertebrate mRNA stability (human, mouse, frog, fish) |
| GleghornLab/mrna_stability_other | 65,356 | Additional multi-species mRNA stability data |
| Combined | 106,419 | Train: 74,519 / Val: 16,010 / Test: 15,890 |
Training hyperparameters, based on Helix-mRNA and BEACON:
| Parameter | Value |
|---|---|
| Optimizer | AdamW (backbone lr=5e-5, head lr=5e-4) |
| Weight Decay | 0.01 |
| Scheduler | Cosine with 100-step warmup |
| Epochs | 20 |
| Batch Size | 16 × 2 gradient accumulation = 32 effective |
| Max Length | 1024 codons |
| Precision | FP16 mixed precision |
| Frozen Layers | First 4 of 6 (embeddings + layers 0-3) |
| Trainable | Layers 4-5 + regression head (~26.3M params) |
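The two-tier learning rate (backbone 5e-5, head 5e-4) and the cosine schedule with 100-step warmup can be set up with standard PyTorch parameter groups. The modules below are stand-ins and the total step count is an assumption:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-ins for the real backbone and regression head
backbone = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 1)

optimizer = AdamW(
    [
        {"params": backbone.parameters(), "lr": 5e-5},  # backbone lr
        {"params": head.parameters(), "lr": 5e-4},      # head lr (10x higher)
    ],
    weight_decay=0.01,
)

warmup, total = 100, 10_000  # total training steps: assumed value

def lr_lambda(step: int) -> float:
    """Linear warmup for `warmup` steps, then cosine decay to zero."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)  # same schedule scales both groups
```

Freezing would then amount to setting `requires_grad = False` on the embedding and the first four encoder layers before building the optimizer, so only ~26.3M parameters receive updates.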
| Model | Spearman ρ |
|---|---|
| CodonBERT | 0.35 |
| XE | 0.50 |
| Helix-mRNA | 0.52 |
| HELM | 0.53 |
```bash
pip install -r requirements.txt
```
```python
from inference import CodonFMStabilityPredictor

# Load the fine-tuned model
predictor = CodonFMStabilityPredictor.from_hub("Imranyai/CodonFM-80M-mRNA-stability")

# Predict stability
result = predictor.predict("AUGGCAGCCGAGACUCGGAACGUGGCCGGAGCAGAGGCCCCACCG...")
print(f"Stability score: {result['stability_score']:.4f}")
print(f"Sequence: {result['num_codons']} codons ({result['sequence_length_nt']} nt)")
```
```python
sequences = [
    "AUGGCAGCCGAGACUCGG...",
    "AUGACAAUCGGUCAGACAAUG...",
    "AUGGGGUCUUCAUCAUCAUC...",
]
results = predictor.predict_batch(sequences, batch_size=32)
for r in results:
    print(f"Score: {r['stability_score']:.4f} ({r['num_codons']} codons)")
```
```bash
# Single sequence
python inference.py --sequence "AUGGCAGCCGAGACUCGG..."

# FASTA file
python inference.py --fasta input.fasta --output predictions.csv

# CSV file
python inference.py --csv data.csv --seq_column mRNA_seq --output results.csv

# Extract embeddings
python inference.py --fasta input.fasta --mode embeddings --output embeddings.npy

# Zero-shot MLM stability proxy (base model)
python inference.py --sequence "AUGGCAGCC..." --mode base_mlm
```
```python
embeddings = predictor.get_embeddings(sequences, batch_size=32)
# Shape: [N, 1024]; use for clustering, classification, etc.
```
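As one example of downstream use, the pooled embeddings can be clustered directly with scikit-learn. The random array below is a stand-in for `predictor.get_embeddings(...)` output, which has the same `[N, 1024]` shape:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for predictor.get_embeddings(...): N sequences, 1024-d pooled vectors
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024)).astype(np.float32)

# Group sequences by embedding similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_.shape)  # (100,) -- one cluster id per sequence
```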
Run the full CodonFM/CodonBERT benchmark suite (5 tasks):
```bash
# Benchmark with base CodonFM model
python benchmark.py --mode base

# Benchmark with fine-tuned model
python benchmark.py --mode finetuned --model_repo Imranyai/CodonFM-80M-mRNA-stability

# Specific tasks only
python benchmark.py --mode base --tasks stability mrfp vaccine

# With GPU
python benchmark.py --mode base --device cuda --batch_size 64
```
| Task | Dataset | Samples | Metric | Description |
|---|---|---|---|---|
| `stability` | iCodon (CodonBERT) | 65,356 | Spearman ρ | mRNA half-life prediction |
| `mrfp` | mRFP Expression | 1,459 | Spearman ρ | Protein expression in *E. coli* |
| `vaccine` | CoV Vaccine Degradation | 2,400 | Spearman ρ | SARS-CoV-2 mRNA vaccine degradation |
| `riboswitch` | Tc-Riboswitches | 355 | Spearman ρ | Tetracycline riboswitch activity |
| `mlos` | MLOS Flu Vaccine | 167 | Spearman ρ | Flu vaccine antigen expression |
Evaluation method: frozen embeddings → RandomForest regression (matching the CodonFM evaluation methodology).
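The frozen-embedding evaluation amounts to fitting a RandomForest regressor on fixed feature vectors and scoring with Spearman correlation. The sketch below uses synthetic stand-ins for the embeddings and stability labels; hyperparameters are assumptions, not the benchmark script's exact settings:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for frozen CodonFM embeddings (X) and stability labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # toy signal in one feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the regressor on frozen features; the backbone itself is never updated
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rho, _ = spearmanr(y_te, rf.predict(X_te))
print(f"Spearman rho: {rho:.3f}")
```

Because the metric is rank-based, it rewards correct ordering of stabilities rather than exact regression of the score values.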
Full dataset documentation: DATASETS.md
```bash
# Download and audit all datasets
python data_setup.py --all

# Just download training data
python data_setup.py --training

# Preprocess, deduplicate, and export clean CSVs
python data_setup.py --preprocess --export ./processed_data

# Show the 69-token codon vocabulary
python data_setup.py --vocab
```
Key findings from the data audit are documented in DATASETS.md.
```bash
# Install dependencies
pip install -r requirements.txt

# Run training (GPU recommended)
python train_codonfm_stability.py

# Environment variables for customization:
LEARNING_RATE=5e-5 \
NUM_EPOCHS=20 \
BATCH_SIZE=16 \
FREEZE_LAYERS=4 \
MAX_LENGTH=1024 \
HUB_MODEL_ID=your-name/your-model \
python train_codonfm_stability.py
```
```
├── README.md                   # This file
├── DATASETS.md                 # Comprehensive dataset documentation
├── requirements.txt            # Python dependencies
├── data_setup.py               # Dataset download, preprocessing & audit
├── train_codonfm_stability.py  # Fine-tuning script
├── inference.py                # Inference API + CLI
├── benchmark.py                # Benchmark suite (5 tasks)
├── config.json                 # Model configuration (after training)
├── codon_vocab.json            # Codon tokenizer vocabulary (after training)
└── pytorch_model.bin           # Fine-tuned model weights (after training)
```
```bibtex
@article{diez2022icodon,
  title={iCodon customizes gene expression based on the codon composition},
  author={Diez, Michay and others},
  journal={Scientific Reports},
  volume={12},
  pages={12126},
  year={2022}
}

@article{li2024codonbert,
  title={CodonBERT large language model for mRNA vaccines},
  author={Li, Sizhen and others},
  journal={Genome Research},
  volume={34},
  number={7},
  pages={1027--1035},
  year={2024}
}
```
This model is governed by the NVIDIA Open Model License Agreement (inherited from the base model).