license: cc-by-nc-nd-4.0
PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🔬
This is the repository for PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction, a collection of machine-learning predictors of canonical and non-canonical peptide properties from sequence and SMILES representations. 🧬 PeptiVerse 🔬 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
Table of Contents
- Quick Start
- Installation
- Repository Structure
- Training Data Collection
- Best Model List
- Usage
- Property Interpretations
- Model Architecture
- Troubleshooting
- Citation
Quick Start
# Clone repository
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse
# Install dependencies
pip install -r requirements.txt
# Run inference
python inference.py
Installation
Minimal Setup
- Basic start-up environment (Transformer and XGBoost models only):
pip install -r requirements.txt
Full Setup
- Access to the trained SVM and ElasticNet models additionally requires RAPIDS cuML; installation instructions are available on their official GitHub page (CUDA-capable GPU required).
- Optional: a pre-compiled Singularity/Apptainer environment (7.52 GB) with everything you need is available on Google Drive (a CUDA-capable GPU is still required to load the cuML models).
# test
apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"
# run inference (see below)
apptainer exec peptiverse.sif python inference.py
Repository Structure
This repo contains the large files backing PeptiVerse, an interactive app for peptide property prediction. Paper link.
PeptiVerse/
├── training_data_cleaned/    # Processed datasets with embeddings
│   └── <property>/           # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/     # Trained model weights
│   └── <property>/
│       ├── cnn_wt/           # CNN architectures
│       ├── mlp_wt/           # MLP architectures
│       └── xgb_wt/           # XGBoost models
├── tokenizer/                # PeptideCLM tokenizer
├── training_data/            # Raw training data
├── inference.py              # Main prediction interface
├── best_models.txt           # Model selection manifest
└── requirements.txt          # Python dependencies
Training Data Collection
Classification datasets (sample counts per class label):

| Property | Amino Acid Sequences (0) | Amino Acid Sequences (1) | SMILES Sequences (0) | SMILES Sequences (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | - | - |
| Permeability (Penetrance) | 1162 | 1162 | - | - |
| Toxicity | - | - | 5518 | 5518 |

Regression datasets (total N):

| Property | Amino Acid Sequences (N) | SMILES Sequences (N) |
|---|---|---|
| Permeability (PAMPA) | - | 6869 |
| Permeability (Caco-2) | - | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |
Best Model List
Full model set (cuML-enabled)
| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | SVM | Transformer | Classifier | 0.2521 | 0.4343 |
| Non-Fouling | MLP | ENET | Classifier | 0.57 | 0.6969 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | SVM | - | Classifier | 0.5493 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | Transformer | XGB | Regression | - | - |
Note: unpooled indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.
Minimal deployable model set (no cuML)
| Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | XGB | Transformer | Classifier | 0.2801 | 0.4343 |
| Non-Fouling | MLP | XGB | Classifier | 0.57 | 0.3982 |
| Solubility | CNN | - | Classifier | 0.377 | - |
| Permeability (Penetrance) | XGB | - | Classifier | 0.4301 | - |
| Toxicity | - | Transformer | Classifier | - | 0.3401 |
| Binding Affinity | unpooled | unpooled | Regression | - | - |
| Permeability (PAMPA) | - | CNN | Regression | - | - |
| Permeability (Caco-2) | - | SVR | Regression | - | - |
| Half-life | xgb_wt_log | xgb_smiles | Regression | - | - |
Note: models selected as SVM or ENET in the full set are replaced with XGB here, since those models are not supported in the deployment environment without a cuML setup. xgb_wt_log indicates that half-life values were log-transformed during training.
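These thresholds convert raw classifier scores into labels. A minimal sketch of that final step, assuming scores at or above the threshold map to the positive class (the apply_threshold helper below is hypothetical; inference.py applies the manifest thresholds automatically):

# Hypothetical helper: convert a classifier score into a 0/1 label using the
# per-property thresholds above (assumed convention: score >= threshold -> 1).
def apply_threshold(score: float, threshold: float) -> int:
    return int(score >= threshold)

print(apply_threshold(0.31, 0.2801))  # 1 -> flagged hemolytic (sequence XGB threshold)
print(apply_threshold(0.20, 0.2801))  # 0 -> predicted non-hemolytic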
Usage
Local Application Hosting
- Host the PeptiVerse UI locally with your own resources.
# Configure models in best_models.txt
git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
python app.py
Dataset integration
- All properties are provided with raw data, split-ready CSVs, and Hugging Face datasets.
- Selectively download only the data you need with huggingface-cli:
# download only training_data_cleaned/, skip weights/artifacts,
# and make real copies instead of symlinks
huggingface-cli download ChatterjeeLab/PeptiVerse \
  --include "training_data_cleaned/**" \
  --exclude "**/*.pt" "**/*.joblib" \
  --local-dir PeptiVerse_partial \
  --local-dir-use-symlinks False
- Or in Python:
from huggingface_hub import snapshot_download
local_dir = snapshot_download(
repo_id="ChatterjeeLab/PeptiVerse",
allow_patterns=["training_data_cleaned/**"], # only this folder
ignore_patterns=["**/*.pt", "**/*.joblib"], # skip weights/artifacts
local_dir="PeptiVerse_partial",
local_dir_use_symlinks=False, # make real copies
)
print("Downloaded to:", local_dir)
- Usage of the Hugging Face datasets (with pre-computed embeddings and splits)
- All embedding datasets are saved via DatasetDict.save_to_disk and are loadable with:

from datasets import load_from_disk

ds = load_from_disk(PATH)
train_ds = ds["train"]
val_ds = ds["val"]
- A) Sequence-based (ESM-2 embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item: sequence: str; label: int (classification) or float (regression); embedding: float32[H] (H=1280 for ESM-2 650M)
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item: sequence: str; label: int (classification) or float (regression); embedding: float16[L, H] (nested lists); attention_mask: int8[L]; length: int (=L)
- B) SMILES-based (PeptideCLM embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item: sequence: str (SMILES); label: int (classification) or float (regression); embedding: float32[H]
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item: sequence: str (SMILES); label: int (classification) or float (regression); embedding: float16[L, H] (nested lists); attention_mask: int8[L]; length: int (=L)
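For the sequence case, a minimal sketch of how such a pooled vector can be reproduced with the public facebook/esm2_t33_650M_UR50D checkpoint; this illustrates the mean-pooling recipe described above under stated assumptions, not necessarily the exact pipeline used here:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D").eval()

batch = tok("GIVEQCCTSICSLYQLENYCN", return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state[0]  # [L, 1280] token embeddings

mask = batch["attention_mask"][0].bool().clone()
mask[0] = False                     # exclude CLS
mask[mask.nonzero().max()] = False  # exclude EOS (last non-pad token)
pooled = hidden[mask].mean(dim=0)   # float32[1280] pooled embedding
print(pooled.shape)                 # torch.Size([1280])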
Quick Inference By Property Per Model
from inference import PeptiVersePredictor
pred = PeptiVersePredictor(
manifest_path="best_models.txt", # best model list
classifier_weight_root=".", # repo root (where training_classifiers/ lives)
device="cuda", # or "cpu"
)
# mode: smiles (SMILES-based models) / wt (Sequence-based models)
# property keys (with some level of name normalization)
# hemolysis
# nf (Non-Fouling)
# solubility
# permeability_penetrance
# toxicity
# permeability_pampa
# permeability_caco2
# halflife
# binding_affinity
seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"
# Hemolysis
out = pred.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)
# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}
out = pred.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)
# Non-fouling (key is nf)
out = pred.predict_property("nf", mode="wt", input_str=seq)
print(out)
out = pred.predict_property("nf", mode="smiles", input_str=smiles)
print(out)
# Solubility (Sequence-only)
out = pred.predict_property("solubility", mode="wt", input_str=seq)
print(out)
# Permeability (Penetrance) (Sequence-only)
out = pred.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)
# Toxicity (SMILES-only)
out = pred.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)
# Permeability (PAMPA) (SMILES regression)
out = pred.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)
# {"property":"permeability_pampa","mode":"smiles","score":value}
# Permeability (Caco-2) (SMILES regression)
out = pred.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)
# Half-life (sequence-based + SMILES regression)
out = pred.predict_property("halflife", mode="wt", input_str=seq)
print(out)
out = pred.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)
# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..." # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"
out = pred.predict_binding_affinity(
mode="wt",
target_seq=protein,
binder_str=peptide_seq,
)
print(out)
# {
# "property":"binding_affinity",
# "mode":"wt",
# "affinity": float,
# "class_by_threshold": "High (β₯9)" / "Moderate (7-9)" / "Low (<7)",
# "class_by_logits": same buckets,
# "binding_model": "pooled" or "unpooled",
# }
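To score several properties in one pass, the property keys above can simply be looped over. A short sketch reusing pred, seq, and smiles from the snippets above:

# Sweep the sequence-mode properties listed earlier
for prop in ["hemolysis", "nf", "solubility", "permeability_penetrance", "halflife"]:
    print(prop, pred.predict_property(prop, mode="wt", input_str=seq))

# Sweep the SMILES-mode properties
for prop in ["hemolysis", "nf", "toxicity", "permeability_pampa",
             "permeability_caco2", "halflife"]:
    print(prop, pred.predict_property(prop, mode="smiles", input_str=smiles))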
Property Interpretations
You can also find the same descriptions in the paper or in the PeptiVerse app's Documentation tab.
🩸 Hemolysis Prediction
HC50 is the concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and all others non-hemolytic, yielding a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
Output interpretation:
- Score close to 1.0 = high probability of red blood cell membrane disruption
- Score close to 0.0 = non-hemolytic
💧 Solubility Prediction
Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.
Output interpretation:
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble
🎯 Non-Fouling Prediction
Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
Output interpretation:
- Score close to 1.0 = non-fouling
- Score close to 0.0 = fouling
🪣 Permeability Prediction
Predicts membrane permeability on a log P scale.
Output interpretation:
- Higher values = more permeable (values above -6.0 indicate permeability)
- Penetrance predictions are classifications in the [0, 1] range; scores closer to 1 indicate higher permeability.
⏱️ Half-Life Prediction
Interpretation: Predicted values reflect relative peptide stability, in units of hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
⚠️ Toxicity Prediction
Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
🔗 Binding Affinity Prediction
Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
Interpretation:
- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
- Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
- Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
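Assuming the score behaves as -log10 of the dissociation constant, which is consistent with the buckets and the tenfold-per-unit note above, a score converts to an approximate K as follows:

# Interpret a predicted affinity score as -log10(K) and recover K in molar units.
def score_to_kd(score: float) -> float:
    return 10 ** (-score)

print(score_to_kd(9.0))  # 1e-09 M (1 nM): tight binder
print(score_to_kd(8.0))  # 1e-08 M: one unit lower, tenfold weaker (7-9 bucket)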
Model Architecture
- Sequence Embeddings: ESM-2 650M (amino acid sequences) / PeptideCLM (SMILES). Foundation-model embeddings are kept frozen.
- XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- CNN/Transformer Model: One-dimensional convolutional networks and self-attention transformers operating on unpooled embeddings to capture local sequence patterns.
- Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
- SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- Others: SVM and Elastic Net models were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
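As an illustration of the cross-attention idea behind the binding model (a sketch under assumed layer sizes and pooling, not the released architecture), peptide tokens can attend to target-protein tokens before a regression head:

import torch
import torch.nn as nn

class CrossAttentionBindingHead(nn.Module):
    """Illustrative only: peptide tokens (queries) attend to protein tokens
    (keys/values); the attended peptide representation is pooled and regressed."""
    def __init__(self, dim: int = 1280, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.regressor = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, peptide_tok: torch.Tensor, protein_tok: torch.Tensor) -> torch.Tensor:
        # peptide_tok: [B, Lp, H] unpooled peptide embeddings
        # protein_tok: [B, Lt, H] unpooled target-protein embeddings
        attended, _ = self.cross_attn(peptide_tok, protein_tok, protein_tok)
        return self.regressor(attended.mean(dim=1)).squeeze(-1)  # [B] affinity scores

head = CrossAttentionBindingHead()
pep, prot = torch.randn(2, 21, 1280), torch.randn(2, 120, 1280)
print(head(pep, prot).shape)  # torch.Size([2])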
Troubleshooting
LFS Download Issues
If downloaded files appear as SHA pointer stubs instead of actual content, re-download them directly:
huggingface-cli download ChatterjeeLab/PeptiVerse \
training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
--local-dir . \
--local-dir-use-symlinks False
TODOs
There is currently a bug when loading the transformer half-life model; a fix is forthcoming.
Citation
If you find this repository helpful for your publications, please consider citing our paper:
@article{zhang2025peptiverse,
author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
year = {2026},
doi = {10.64898/2025.12.31.697180},
URL = {https://doi.org/10.64898/2025.12.31.697180},
journal = {bioRxiv}
}
To use this repository, you agree to abide by the CC-BY-NC-ND-4.0 license.
