license: cc-by-nc-nd-4.0
PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🔬
This is the repository for PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction, a collection of machine-learning predictors of canonical and non-canonical peptide properties from sequence and SMILES representations. 🧬 PeptiVerse 🔬 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
Table of Contents
- Quick Start
- Installation
- Repository Structure
- Training Data Collection
- Best Model List
- Usage
- Property Interpretations
- Model Architecture
- Troubleshooting
- Citation
Quick Start
# Clone repository
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse
# Install dependencies
pip install -r requirements.txt
# Run inference
python inference.py
Installation
Minimal Setup
- Basic start-up environment (Transformer and XGBoost models only):
pip install -r requirements.txt
Full Setup
- Access to the trained SVM and ElasticNet models additionally requires RAPIDS cuML; installation instructions are available on their official GitHub page (CUDA-capable GPU required).
- Optional: a pre-compiled Singularity/Apptainer environment (7.52 GB) with everything you need is available on Google Drive (a CUDA-capable GPU is still required to load the cuML models).
# test
apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"
# run inference (see below)
apptainer exec peptiverse.sif python inference.py
Repository Structure
This repo contains the large files backing PeptiVerse, an interactive app for peptide property prediction. Paper link.
PeptiVerse/
├── training_data_cleaned/    # Processed datasets with embeddings
│   └── <property>/           # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/     # Trained model weights
│   └── <property>/
│       ├── cnn_wt/           # CNN architectures
│       ├── mlp_wt/           # MLP architectures
│       └── xgb_wt/           # XGBoost models
├── tokenizer/                # PeptideCLM tokenizer
├── training_data/            # Raw training data
├── inference.py              # Main prediction interface
├── best_models.txt           # Model selection manifest
└── requirements.txt          # Python dependencies
Training Data Collection
Classification datasets (sample counts per class label):

| Property | Amino Acid Sequences (0) | Amino Acid Sequences (1) | SMILES Sequences (0) | SMILES Sequences (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | - | - |
| Permeability (Penetrance) | 1162 | 1162 | - | - |
| Toxicity | - | - | 5518 | 5518 |

Regression datasets (total N):

| Property | Amino Acid Sequences (N) | SMILES Sequences (N) |
|---|---|---|
| Permeability (PAMPA) | - | 6869 |
| Permeability (Caco-2) | - | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |
Best Model List
Full model set (cuML-enabled)
| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | SVM | Transformer | Classifier | 0.2521 | 0.4343 |
| Non-Fouling | MLP | ENET | Classifier | 0.57 | 0.6969 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | SVM | - | Classifier | 0.5493 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | Transformer | XGB | Regression | - | - |
Note: unpooled indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.
Minimal deployable model set (no cuML)
| Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | XGB | Transformer | Classifier | 0.2801 | 0.4343 |
| Non-Fouling | MLP | XGB | Classifier | 0.57 | 0.3982 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | XGB | - | Classifier | 0.4301 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | xgb_wt_log | xgb_smiles | Regression | - | - |
Note: models selected as SVM or ENET in the full set are replaced with XGB here, since those models are not supported in the deployment environment without a cuML setup. xgb_wt_log indicates that half-life values were log-transformed during training.
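These thresholds convert raw classifier scores into labels. A minimal sketch of that final step, assuming scores at or above the threshold map to the positive class (the apply_threshold helper below is hypothetical; inference.py applies the manifest thresholds automatically):

# Hypothetical helper: convert a classifier score into a 0/1 label using the
# per-property thresholds above (assumed convention: score >= threshold -> 1).
def apply_threshold(score: float, threshold: float) -> int:
    return int(score >= threshold)

print(apply_threshold(0.31, 0.2801))  # 1 -> flagged hemolytic (sequence XGB threshold)
print(apply_threshold(0.20, 0.2801))  # 0 -> predicted non-hemolytic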
Usage
Local Application Hosting
- Host the PeptiVerse UI locally with your own resources.
# Configure models in best_models.txt
git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
python app.py
Dataset integration
- All properties are provided with raw data, split-ready CSVs, and Hugging Face datasets.
- Selectively download only the data you need with huggingface-cli:
# download only training_data_cleaned/, skip weights/artifacts,
# and make real copies instead of symlinks
huggingface-cli download ChatterjeeLab/PeptiVerse \
  --include "training_data_cleaned/**" \
  --exclude "**/*.pt" "**/*.joblib" \
  --local-dir PeptiVerse_partial \
  --local-dir-use-symlinks False
- Or in Python:
from huggingface_hub import snapshot_download
local_dir = snapshot_download(
repo_id="ChatterjeeLab/PeptiVerse",
allow_patterns=["training_data_cleaned/**"], # only this folder
ignore_patterns=["**/*.pt", "**/*.joblib"], # skip weights/artifacts
local_dir="PeptiVerse_partial",
local_dir_use_symlinks=False, # make real copies
)
print("Downloaded to:", local_dir)
- Usage of the Hugging Face datasets (with pre-computed embeddings and splits)
- All embedding datasets are saved via DatasetDict.save_to_disk and are loadable with:

from datasets import load_from_disk

ds = load_from_disk(PATH)
train_ds = ds["train"]
val_ds = ds["val"]
- A) Sequence-based (ESM-2 embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item: sequence: str; label: int (classification) or float (regression); embedding: float32[H] (H=1280 for ESM-2 650M)
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item: sequence: str; label: int (classification) or float (regression); embedding: float16[L, H] (nested lists); attention_mask: int8[L]; length: int (=L)
- B) SMILES-based (PeptideCLM embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item: sequence: str (SMILES); label: int (classification) or float (regression); embedding: float32[H]
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item: sequence: str (SMILES); label: int (classification) or float (regression); embedding: float16[L, H] (nested lists); attention_mask: int8[L]; length: int (=L)
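For the sequence case, a minimal sketch of how such a pooled vector can be reproduced with the public facebook/esm2_t33_650M_UR50D checkpoint; this illustrates the mean-pooling recipe described above under stated assumptions, not necessarily the exact pipeline used here:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D").eval()

batch = tok("GIVEQCCTSICSLYQLENYCN", return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state[0]  # [L, 1280] token embeddings

mask = batch["attention_mask"][0].bool().clone()
mask[0] = False                     # exclude CLS
mask[mask.nonzero().max()] = False  # exclude EOS (last non-pad token)
pooled = hidden[mask].mean(dim=0)   # float32[1280] pooled embedding
print(pooled.shape)                 # torch.Size([1280])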
Quick Inference By Property Per Model
from inference import PeptiVersePredictor
pred = PeptiVersePredictor(
manifest_path="best_models.txt", # best model list
classifier_weight_root=".", # repo root (where training_classifiers/ lives)
device="cuda", # or "cpu"
)
# mode: smiles (SMILES-based models) / wt (Sequence-based models)
# property keys (with some level of name normalization)
# hemolysis
# nf (Non-Fouling)
# solubility
# permeability_penetrance
# toxicity
# permeability_pampa
# permeability_caco2
# halflife
# binding_affinity
seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"
# Hemolysis
out = pred.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)
# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}
out = pred.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)
# Non-fouling (key is nf)
out = pred.predict_property("nf", mode="wt", input_str=seq)
print(out)
out = pred.predict_property("nf", mode="smiles", input_str=smiles)
print(out)
# Solubility (Sequence-only)
out = pred.predict_property("solubility", mode="wt", input_str=seq)
print(out)
# Permeability (Penetrance) (Sequence-only)
out = pred.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)
# Toxicity (SMILES-only)
out = pred.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)
# Permeability (PAMPA) (SMILES regression)
out = pred.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)
# {"property":"permeability_pampa","mode":"smiles","score":value}
# Permeability (Caco-2) (SMILES regression)
out = pred.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)
# Half-life (sequence-based + SMILES regression)
out = pred.predict_property("halflife", mode="wt", input_str=seq)
print(out)
out = pred.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)
# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..." # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"
out = pred.predict_binding_affinity(
mode="wt",
target_seq=protein,
binder_str=peptide_seq,
)
print(out)
# {
# "property":"binding_affinity",
# "mode":"wt",
# "affinity": float,
# "class_by_threshold": "High (β₯9)" / "Moderate (7-9)" / "Low (<7)",
# "class_by_logits": same buckets,
# "binding_model": "pooled" or "unpooled",
# }
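To score several properties in one pass, the property keys above can simply be looped over. A short sketch reusing pred, seq, and smiles from the snippets above:

# Sweep the sequence-mode properties listed earlier
for prop in ["hemolysis", "nf", "solubility", "permeability_penetrance", "halflife"]:
    print(prop, pred.predict_property(prop, mode="wt", input_str=seq))

# Sweep the SMILES-mode properties
for prop in ["hemolysis", "nf", "toxicity", "permeability_pampa",
             "permeability_caco2", "halflife"]:
    print(prop, pred.predict_property(prop, mode="smiles", input_str=smiles))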
Property Interpretations
You can also find the same descriptions in the paper or in the PeptiVerse app's Documentation tab.
🩸 Hemolysis Prediction
HC50 is the concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and all others non-hemolytic, yielding a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
Output interpretation:
- Score close to 1.0 = high probability of red blood cell membrane disruption
- Score close to 0.0 = non-hemolytic
💧 Solubility Prediction
Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.
Output interpretation:
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble
🎯 Non-Fouling Prediction
Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
Output interpretation:
- Score close to 1.0 = non-fouling
- Score close to 0.0 = fouling
🪣 Permeability Prediction
Predicts membrane permeability on a log P scale.
Output interpretation:
- Higher values = more permeable (values above -6.0 indicate permeability)
- Penetrance predictions are classifications in the [0, 1] range; scores closer to 1 indicate higher permeability.
⏱️ Half-Life Prediction
Interpretation: Predicted values reflect relative peptide stability, in units of hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
⚠️ Toxicity Prediction
Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
🔗 Binding Affinity Prediction
Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
Interpretation:
- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
- Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
- Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
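Assuming the score behaves as -log10 of the dissociation constant, which is consistent with the buckets and the tenfold-per-unit note above, a score converts to an approximate K as follows:

# Interpret a predicted affinity score as -log10(K) and recover K in molar units.
def score_to_kd(score: float) -> float:
    return 10 ** (-score)

print(score_to_kd(9.0))  # 1e-09 M (1 nM): tight binder
print(score_to_kd(8.0))  # 1e-08 M: one unit lower, tenfold weaker (7-9 bucket)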
Model Architecture
- Sequence Embeddings: ESM-2 650M (amino acid sequences) / PeptideCLM (SMILES). Foundation-model embeddings are kept frozen.
- XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- CNN/Transformer Model: One-dimensional convolutional networks and self-attention transformers operating on unpooled embeddings to capture local sequence patterns.
- Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
- SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- Others: SVM and Elastic Net models were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
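As an illustration of the cross-attention idea behind the binding model (a sketch under assumed layer sizes and pooling, not the released architecture), peptide tokens can attend to target-protein tokens before a regression head:

import torch
import torch.nn as nn

class CrossAttentionBindingHead(nn.Module):
    """Illustrative only: peptide tokens (queries) attend to protein tokens
    (keys/values); the attended peptide representation is pooled and regressed."""
    def __init__(self, dim: int = 1280, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.regressor = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, peptide_tok: torch.Tensor, protein_tok: torch.Tensor) -> torch.Tensor:
        # peptide_tok: [B, Lp, H] unpooled peptide embeddings
        # protein_tok: [B, Lt, H] unpooled target-protein embeddings
        attended, _ = self.cross_attn(peptide_tok, protein_tok, protein_tok)
        return self.regressor(attended.mean(dim=1)).squeeze(-1)  # [B] affinity scores

head = CrossAttentionBindingHead()
pep, prot = torch.randn(2, 21, 1280), torch.randn(2, 120, 1280)
print(head(pep, prot).shape)  # torch.Size([2])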
Troubleshooting
LFS Download Issues
If downloaded files appear as SHA pointer stubs instead of actual content, re-download them directly:
huggingface-cli download ChatterjeeLab/PeptiVerse \
training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
--local-dir . \
--local-dir-use-symlinks False
TODOs
There is currently a bug when loading the transformer half-life model; a fix is forthcoming.
Citation
If you find this repository helpful for your publications, please consider citing our paper:
@article{zhang2025peptiverse,
author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
year = {2026},
doi = {10.64898/2025.12.31.697180},
URL = {https://doi.org/10.64898/2025.12.31.697180},
journal = {bioRxiv}
}
To use this repository, you agree to abide by the CC-BY-NC-ND-4.0 license.
