Related paper: [Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics](https://arxiv.org/abs/2502.13785)
Fine-tuned version of NVIDIA NV-CodonFM-Encodon-80M-v1 for predicting mRNA stability (half-life) from coding sequences.
| Property | Value |
|---|---|
| Base Model | nvidia/NV-CodonFM-Encodon-80M-v1 |
| Parameters | 77.9M (80M class) |
| Architecture | BERT-style Transformer encoder with Rotary Position Embeddings (RoPE) |
| Tokenization | Codon-level (69-token vocabulary: 64 codons + 5 special tokens) |
| Max Length | 2,046 codons (6,138 nucleotides) |
| Task | Regression: predict an mRNA stability score |
| Input | mRNA/DNA coding sequence |
| Output | Continuous stability score (higher = more stable) |
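Codon-level tokenization can be sketched in a few lines. The special-token names and vocabulary ordering below are assumptions for illustration; the released `codon_vocab.json` defines the actual mapping:

```python
from itertools import product

# 5 special tokens + 64 codons = 69-token vocabulary
# (special-token names are assumed, not taken from the real tokenizer)
SPECIALS = ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>"]
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CODONS)}

def tokenize(seq: str) -> list[int]:
    """Split an mRNA (or DNA) coding sequence into codons and map to token ids."""
    seq = seq.upper().replace("T", "U")  # accept DNA input as well
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [VOCAB.get(c, VOCAB["<unk>"]) for c in codons]

print(len(VOCAB))             # 69
print(tokenize("AUGGCAGCC"))  # three codon ids
```

Because the tokenizer works on codons rather than nucleotides, the 2,046-codon limit corresponds to 6,138 nt of coding sequence.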
- Encoder: 6 Transformer layers, hidden_size=1024, 8 attention heads, 4096 FFN dim
- Position: Rotary Position Embeddings (RoPE, θ=10,000)
- Pretraining: masked language modeling (MLM) on >130M coding sequences from NCBI RefSeq
- Fine-tuning: regression head (mean-pool → Dense → Tanh → Dense → scalar)
- Strategy: freeze the first 4 of 6 layers; train the last 2 plus the regression head
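The regression head described above (mean-pool → Dense → Tanh → Dense → scalar) might look like the following in PyTorch. This is a sketch under the stated architecture, not the training script's exact module; the mask-aware mean pooling is an assumption about how padding is handled:

```python
import torch
import torch.nn as nn

class StabilityHead(nn.Module):
    """Mean-pooled -> Dense -> Tanh -> Dense -> scalar regression head (sketch)."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Mask-aware mean pooling over codon positions
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.out(torch.tanh(self.dense(pooled))).squeeze(-1)

head = StabilityHead()
h = torch.randn(2, 10, 1024)             # [batch, codons, hidden]
m = torch.ones(2, 10, dtype=torch.long)  # attention mask
print(head(h, m).shape)                  # torch.Size([2]), one score per sequence
```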
| Dataset | Samples | Description |
|---|---|---|
| mogam-ai/CDS-BART-mRNA-stability | 41,063 | iCodon vertebrate mRNA stability (human, mouse, frog, fish) |
| GleghornLab/mrna_stability_other | 65,356 | Additional multi-species mRNA stability data |
| Combined | 106,419 | Train: 74,519 / Val: 16,010 / Test: 15,890 |
Training hyperparameters, based on Helix-mRNA and BEACON:
| Parameter | Value |
|---|---|
| Optimizer | AdamW (backbone lr=5e-5, head lr=5e-4) |
| Weight Decay | 0.01 |
| Scheduler | Cosine with 100-step warmup |
| Epochs | 20 |
| Batch Size | 16 × 2 gradient accumulation = 32 effective |
| Max Length | 1024 codons |
| Precision | FP16 mixed precision |
| Frozen Layers | First 4 of 6 (embeddings + layers 0-3) |
| Trainable | Layers 4-5 + regression head (~26.3M params) |
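The two-tier learning rate (backbone 5e-5, head 5e-4) and the cosine schedule with 100-step warmup can be set up with standard PyTorch parameter groups. The modules below are stand-ins and the total step count is an assumption:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-ins for the real backbone and regression head
backbone = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 1)

optimizer = AdamW(
    [
        {"params": backbone.parameters(), "lr": 5e-5},  # backbone lr
        {"params": head.parameters(), "lr": 5e-4},      # head lr (10x higher)
    ],
    weight_decay=0.01,
)

warmup, total = 100, 10_000  # total training steps: assumed value

def lr_lambda(step: int) -> float:
    """Linear warmup for `warmup` steps, then cosine decay to zero."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)  # same schedule scales both groups
```

Freezing would then amount to setting `requires_grad = False` on the embedding and the first four encoder layers before building the optimizer, so only ~26.3M parameters receive updates.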
| Model | Spearman ρ |
|---|---|
| CodonBERT | 0.35 |
| XE | 0.50 |
| Helix-mRNA | 0.52 |
| HELM | 0.53 |
```bash
pip install -r requirements.txt
```
```python
from inference import CodonFMStabilityPredictor

# Load the fine-tuned model
predictor = CodonFMStabilityPredictor.from_hub("Imranyai/CodonFM-80M-mRNA-stability")

# Predict stability
result = predictor.predict("AUGGCAGCCGAGACUCGGAACGUGGCCGGAGCAGAGGCCCCACCG...")
print(f"Stability score: {result['stability_score']:.4f}")
print(f"Sequence: {result['num_codons']} codons ({result['sequence_length_nt']} nt)")
```
```python
sequences = [
    "AUGGCAGCCGAGACUCGG...",
    "AUGACAAUCGGUCAGACAAUG...",
    "AUGGGGUCUUCAUCAUCAUC...",
]
results = predictor.predict_batch(sequences, batch_size=32)
for r in results:
    print(f"Score: {r['stability_score']:.4f} ({r['num_codons']} codons)")
```
```bash
# Single sequence
python inference.py --sequence "AUGGCAGCCGAGACUCGG..."

# FASTA file
python inference.py --fasta input.fasta --output predictions.csv

# CSV file
python inference.py --csv data.csv --seq_column mRNA_seq --output results.csv

# Extract embeddings
python inference.py --fasta input.fasta --mode embeddings --output embeddings.npy

# Zero-shot MLM stability proxy (base model)
python inference.py --sequence "AUGGCAGCC..." --mode base_mlm
```
```python
embeddings = predictor.get_embeddings(sequences, batch_size=32)
# Shape: [N, 1024]; use for clustering, classification, etc.
```
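As one example of downstream use, the pooled embeddings can be clustered directly with scikit-learn. The random array below is a stand-in for `predictor.get_embeddings(...)` output, which has the same `[N, 1024]` shape:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for predictor.get_embeddings(...): N sequences, 1024-d pooled vectors
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024)).astype(np.float32)

# Group sequences by embedding similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_.shape)  # (100,) -- one cluster id per sequence
```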
Run the full CodonFM/CodonBERT benchmark suite (5 tasks):
```bash
# Benchmark with base CodonFM model
python benchmark.py --mode base

# Benchmark with fine-tuned model
python benchmark.py --mode finetuned --model_repo Imranyai/CodonFM-80M-mRNA-stability

# Specific tasks only
python benchmark.py --mode base --tasks stability mrfp vaccine

# With GPU
python benchmark.py --mode base --device cuda --batch_size 64
```
| Task | Dataset | Samples | Metric | Description |
|---|---|---|---|---|
| `stability` | iCodon (CodonBERT) | 65,356 | Spearman ρ | mRNA half-life prediction |
| `mrfp` | mRFP Expression | 1,459 | Spearman ρ | Protein expression in *E. coli* |
| `vaccine` | CoV Vaccine Degradation | 2,400 | Spearman ρ | SARS-CoV-2 mRNA vaccine degradation |
| `riboswitch` | Tc-Riboswitches | 355 | Spearman ρ | Tetracycline riboswitch activity |
| `mlos` | MLOS Flu Vaccine | 167 | Spearman ρ | Flu vaccine antigen expression |
Evaluation method: frozen embeddings → RandomForest regression (matching the CodonFM evaluation methodology).
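The frozen-embedding evaluation amounts to fitting a RandomForest regressor on fixed feature vectors and scoring with Spearman correlation. The sketch below uses synthetic stand-ins for the embeddings and stability labels; hyperparameters are assumptions, not the benchmark script's exact settings:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for frozen CodonFM embeddings (X) and stability labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # toy signal in one feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the regressor on frozen features; the backbone itself is never updated
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rho, _ = spearmanr(y_te, rf.predict(X_te))
print(f"Spearman rho: {rho:.3f}")
```

Because the metric is rank-based, it rewards correct ordering of stabilities rather than exact regression of the score values.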
Full dataset documentation: DATASETS.md
```bash
# Download and audit all datasets
python data_setup.py --all

# Just download training data
python data_setup.py --training

# Preprocess, deduplicate, and export clean CSVs
python data_setup.py --preprocess --export ./processed_data

# Show the 69-token codon vocabulary
python data_setup.py --vocab
```
Key findings from the data audit are documented in DATASETS.md.
```bash
# Install dependencies
pip install -r requirements.txt

# Run training (GPU recommended)
python train_codonfm_stability.py

# Environment variables for customization:
LEARNING_RATE=5e-5 \
NUM_EPOCHS=20 \
BATCH_SIZE=16 \
FREEZE_LAYERS=4 \
MAX_LENGTH=1024 \
HUB_MODEL_ID=your-name/your-model \
python train_codonfm_stability.py
```
```
├── README.md                   # This file
├── DATASETS.md                 # Comprehensive dataset documentation
├── requirements.txt            # Python dependencies
├── data_setup.py               # Dataset download, preprocessing & audit
├── train_codonfm_stability.py  # Fine-tuning script
├── inference.py                # Inference API + CLI
├── benchmark.py                # Benchmark suite (5 tasks)
├── config.json                 # Model configuration (after training)
├── codon_vocab.json            # Codon tokenizer vocabulary (after training)
└── pytorch_model.bin           # Fine-tuned model weights (after training)
```
```bibtex
@article{diez2022icodon,
  title={iCodon customizes gene expression based on the codon composition},
  author={Diez, Michay and others},
  journal={Scientific Reports},
  volume={12},
  pages={12126},
  year={2022}
}

@article{li2024codonbert,
  title={CodonBERT large language model for mRNA vaccines},
  author={Li, Sizhen and others},
  journal={Genome Research},
  volume={34},
  number={7},
  pages={1027--1035},
  year={2024}
}
```
This model is governed by the NVIDIA Open Model License Agreement (inherited from the base model).