
# 🧬 CodonFM-80M: Fine-tuned for mRNA Stability Prediction

Fine-tuned version of NVIDIA's NV-CodonFM-Encodon-80M-v1 for predicting mRNA stability (half-life) from coding sequences.

## Model Overview

| Property | Value |
|---|---|
| Base Model | nvidia/NV-CodonFM-Encodon-80M-v1 |
| Parameters | 80M class (77.9M total) |
| Architecture | BERT-style Transformer with Rotary Position Embeddings (RoPE) |
| Tokenization | Codon-level (69-token vocab: 64 codons + 5 special tokens) |
| Max Length | 2,046 codons (~6,138 nucleotides) |
| Task | Regression: predict an mRNA stability score |
| Input | mRNA/DNA coding sequence |
| Output | Continuous stability score (higher = more stable) |
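The codon-level tokenization can be sketched as follows. The special-token names and ID ordering here are illustrative (the real mapping ships in `codon_vocab.json`), but the 64 + 5 = 69 vocabulary size matches the table above.

```python
# Illustrative codon tokenizer: 64 codons + 5 special tokens = 69-token vocab.
# Token names and ID order are assumptions, not the checkpoint's actual mapping.
from itertools import product

SPECIAL = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]       # 5 special tokens
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]       # 64 codons
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + CODONS)}     # 69 entries

def tokenize_codons(seq: str) -> list:
    """Split a coding sequence into codons and map each to a vocab ID."""
    seq = seq.upper().replace("T", "U")          # accept DNA input as well
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be divisible by 3")
    unk = VOCAB["[UNK]"]
    return [VOCAB.get(seq[i:i + 3], unk) for i in range(0, len(seq), 3)]

print(len(VOCAB), tokenize_codons("AUGGCAGCC"))  # 69-token vocab, 3 codon IDs
```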

## Architecture Details

- **Encoder:** 6 Transformer layers, hidden_size=1024, 8 attention heads, 4096 FFN
- **Position:** Rotary Position Embeddings (RoPE, θ = 10,000)
- **Pretraining:** Masked Language Modeling (MLM) on >130M coding sequences from NCBI RefSeq
- **Fine-tuning:** Regression head (mean-pooled → Dense → Tanh → Dense → scalar)
- **Strategy:** Freeze the first 4 of 6 layers; unfreeze the last 2 plus the regression head
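A minimal sketch of the regression head described above, assuming masked mean-pooling over codon positions; module names are illustrative, not the checkpoint's.

```python
# Sketch of a mean-pool -> Dense -> Tanh -> Dense -> scalar regression head.
# Layer names are assumptions for illustration only.
import torch
import torch.nn as nn

class StabilityRegressionHead(nn.Module):
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.Tanh()
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, attention_mask):
        # Masked mean-pooling over codon positions, ignoring padding
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.out(self.act(self.dense(pooled))).squeeze(-1)

head = StabilityRegressionHead()
scores = head(torch.randn(2, 10, 1024), torch.ones(2, 10, dtype=torch.long))
print(scores.shape)  # torch.Size([2]) -- one scalar per sequence
```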

## Training

### Datasets

| Dataset | Samples | Description |
|---|---|---|
| mogam-ai/CDS-BART-mRNA-stability | 41,063 | iCodon vertebrate mRNA stability (human, mouse, frog, fish) |
| GleghornLab/mrna_stability_other | 65,356 | Additional multi-species mRNA stability data |
| **Combined** | 106,419 | Train: 74,519 / Val: 16,010 / Test: 15,890 |

### Training Recipe

The recipe follows Helix-mRNA and BEACON:

| Parameter | Value |
|---|---|
| Optimizer | AdamW (backbone lr=5e-5, head lr=5e-4) |
| Weight Decay | 0.01 |
| Scheduler | Cosine with 100-step warmup |
| Epochs | 20 |
| Batch Size | 16 × 2 grad_accum = 32 effective |
| Max Length | 1,024 codons |
| Precision | FP16 mixed precision |
| Frozen Layers | First 4 of 6 (embeddings + layers 0-3) |
| Trainable | Layers 4-5 + regression head (~26.3M params) |
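The freeze-and-split-learning-rate setup above can be sketched as follows, using a toy stand-in module; the real attribute paths in `train_codonfm_stability.py` will differ.

```python
# Sketch of the recipe: freeze embeddings + layers 0-3, then give the
# remaining backbone layers lr=5e-5 and the head lr=5e-4 via AdamW param
# groups. TinyEncoder is a toy stand-in, not the real model.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the 6-layer backbone plus regression head."""
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(69, 32)
        self.layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
        self.head = nn.Linear(32, 1)

def build_optimizer(model: TinyEncoder, freeze_layers: int = 4):
    for p in model.embeddings.parameters():
        p.requires_grad = False
    for layer in model.layers[:freeze_layers]:     # freeze first 4 of 6
        for p in layer.parameters():
            p.requires_grad = False
    backbone = [p for l in model.layers[freeze_layers:] for p in l.parameters()]
    return torch.optim.AdamW(
        [{"params": backbone, "lr": 5e-5},
         {"params": model.head.parameters(), "lr": 5e-4}],
        weight_decay=0.01,
    )

opt = build_optimizer(TinyEncoder())
print([g["lr"] for g in opt.param_groups])  # [5e-05, 0.0005]
```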

## Literature Comparison (Spearman ρ on mRNA Stability)

| Model | Spearman ρ |
|---|---|
| CodonBERT | 0.35 |
| XE | 0.50 |
| Helix-mRNA | 0.52 |
| HELM | 0.53 |

## 🚀 Quick Start

### Installation

```bash
pip install -r requirements.txt
```

### Inference: Single Sequence

```python
from inference import CodonFMStabilityPredictor

# Load the fine-tuned model
predictor = CodonFMStabilityPredictor.from_hub("Imranyai/CodonFM-80M-mRNA-stability")

# Predict stability
result = predictor.predict("AUGGCAGCCGAGACUCGGAACGUGGCCGGAGCAGAGGCCCCACCG...")
print(f"Stability score: {result['stability_score']:.4f}")
print(f"Sequence: {result['num_codons']} codons ({result['sequence_length_nt']} nt)")
```

### Inference: Batch Prediction

```python
sequences = [
    "AUGGCAGCCGAGACUCGG...",
    "AUGACAAUCGGUCAGACAAUG...",
    "AUGGGGUCUUCAUCAUCAUC...",
]
results = predictor.predict_batch(sequences, batch_size=32)
for r in results:
    print(f"Score: {r['stability_score']:.4f} ({r['num_codons']} codons)")
```

### Inference: Command Line

```bash
# Single sequence
python inference.py --sequence "AUGGCAGCCGAGACUCGG..."

# FASTA file
python inference.py --fasta input.fasta --output predictions.csv

# CSV file
python inference.py --csv data.csv --seq_column mRNA_seq --output results.csv

# Extract embeddings
python inference.py --fasta input.fasta --mode embeddings --output embeddings.npy

# Zero-shot MLM stability proxy (base model)
python inference.py --sequence "AUGGCAGCC..." --mode base_mlm
```

### Extract Embeddings (for downstream tasks)

```python
embeddings = predictor.get_embeddings(sequences, batch_size=32)
# Shape: [N, 1024]; use for clustering, classification, etc.
```
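For example, the embeddings can feed a clustering step directly; here random vectors stand in for `predictor.get_embeddings(...)` output.

```python
# Downstream use of [N, 1024] sequence embeddings: k-means clustering.
# Random vectors stand in for real predictor.get_embeddings(...) output.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024))     # stand-in for real embeddings

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print(labels.shape, sorted(set(labels)))  # (100,) [0, 1, 2]
```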

## 📊 Benchmarking

Run the full CodonFM/CodonBERT benchmark suite (5 tasks):

```bash
# Benchmark with the base CodonFM model
python benchmark.py --mode base

# Benchmark with the fine-tuned model
python benchmark.py --mode finetuned --model_repo Imranyai/CodonFM-80M-mRNA-stability

# Specific tasks only
python benchmark.py --mode base --tasks stability mrfp vaccine

# With GPU
python benchmark.py --mode base --device cuda --batch_size 64
```

### Benchmark Tasks

| Task | Dataset | Samples | Metric | Description |
|---|---|---|---|---|
| stability | iCodon (CodonBERT) | 65,356 | Spearman ρ | mRNA half-life prediction |
| mrfp | mRFP Expression | 1,459 | Spearman ρ | Protein expression in *E. coli* |
| vaccine | CoV Vaccine Degradation | 2,400 | Spearman ρ | SARS-CoV-2 mRNA vaccine degradation |
| riboswitch | Tc-Riboswitches | 355 | Spearman ρ | Tetracycline riboswitch activity |
| mlos | MLOS Flu Vaccine | 167 | Spearman ρ | Flu vaccine antigen expression |

**Evaluation method:** frozen embeddings → RandomForest regression, matching the CodonFM evaluation methodology.
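A minimal sketch of that protocol, with synthetic data standing in for the frozen embeddings and stability labels:

```python
# Frozen-embeddings evaluation sketch: fit a RandomForest on pooled
# embeddings, score held-out predictions with Spearman rho.
# Synthetic X/y stand in for real embeddings and stability labels.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                 # stand-in embeddings
y = X[:, 0] + 0.1 * rng.normal(size=200)       # stand-in stability labels

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:150], y[:150])
rho, _ = spearmanr(model.predict(X[150:]), y[150:])
print(f"Spearman rho: {rho:.2f}")
```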

## 📋 Dataset Setup & Preprocessing

Full dataset documentation: [DATASETS.md](DATASETS.md)

```bash
# Download and audit all datasets
python data_setup.py --all

# Download only the training data
python data_setup.py --training

# Preprocess, deduplicate, and export clean CSVs
python data_setup.py --preprocess --export ./processed_data

# Show the 69-token codon vocabulary
python data_setup.py --vocab
```

**Key findings from the data audit:**

- Training data: 65,356 samples from multi-species mRNA stability profiles (z-normalized half-life)
- The mogam-ai dataset is a complete subset of the GleghornLab dataset, so there is no need to combine both
- All sequences use the RNA alphabet (A, U, G, C), all lengths are divisible by 3, mean length ~447 codons
- Benchmark datasets auto-download from the CodonBERT GitHub repository (5 tasks, 355–65K samples each)
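The sequence-level checks behind those findings can be reproduced in a few lines; `audit_sequence` is a hypothetical helper, not part of `data_setup.py`.

```python
# Hypothetical per-sequence audit mirroring the findings above:
# RNA alphabet only, and CDS length divisible by 3.
def audit_sequence(seq: str) -> dict:
    seq = seq.upper()
    return {
        "rna_alphabet": set(seq) <= set("AUGC"),
        "divisible_by_3": len(seq) % 3 == 0,
        "num_codons": len(seq) // 3,
    }

print(audit_sequence("AUGGCAGCCGAG"))
# {'rna_alphabet': True, 'divisible_by_3': True, 'num_codons': 4}
```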

## 🔬 Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Run training (GPU recommended)
python train_codonfm_stability.py

# Environment variables for customization:
LEARNING_RATE=5e-5 \
NUM_EPOCHS=20 \
BATCH_SIZE=16 \
FREEZE_LAYERS=4 \
MAX_LENGTH=1024 \
HUB_MODEL_ID=your-name/your-model \
python train_codonfm_stability.py
```

## Repository Contents

```
├── README.md                         # This file
├── DATASETS.md                       # Comprehensive dataset documentation
├── requirements.txt                  # Python dependencies
├── data_setup.py                     # Dataset download, preprocessing & audit
├── train_codonfm_stability.py        # Fine-tuning script
├── inference.py                      # Inference API + CLI
├── benchmark.py                      # Benchmark suite (5 tasks)
├── config.json                       # Model configuration (after training)
├── codon_vocab.json                  # Codon tokenizer vocabulary (after training)
└── pytorch_model.bin                 # Fine-tuned model weights (after training)
```

## Citation

```bibtex
@article{diez2022icodon,
  title={iCodon customizes gene expression based on the codon composition},
  author={Diez, Michay and others},
  journal={Scientific Reports},
  volume={12},
  pages={12126},
  year={2022}
}

@article{li2024codonbert,
  title={CodonBERT large language model for mRNA vaccines},
  author={Li, Sizhen and others},
  journal={Genome Research},
  volume={34},
  number={7},
  pages={1027--1035},
  year={2024}
}
```

## License

This model is governed by the NVIDIA Open Model License Agreement (inherited from the base model).
