The official code repository for InstaNovo. This repo contains the code for training and inference of InstaNovo and InstaNovo+. InstaNovo is a transformer neural network with the ability to translate fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). InstaNovo+, inspired by human intuition, is a multinomial diffusion model that further improves performance by iterative refinement of predicted sequences.
Links:
- Publication in Nature Machine Intelligence: InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments
- InstaNovo blog: https://instanovo.ai/
- Documentation: https://instadeepai.github.io/InstaNovo/
InstaNovo is available as a HuggingFace Space at hf.co/spaces/InstaDeepAI/InstaNovo for quick testing and evaluation. You can upload your own spectra files in `.mgf`, `.mzml`, or `.mzxml` format and run de novo predictions. The results will be displayed in a table, and you can download the predictions as a CSV file. The HuggingFace Space is powered by the InstaNovo model and the InstaNovo+ model for iterative refinement.
To use the InstaNovo Python package with its command line interface, install the module via `pip`:
pip install instanovo
If you have access to an NVIDIA GPU, you can install InstaNovo with the GPU version of PyTorch (recommended):
pip install "instanovo[cu124]"
If you are on macOS, you can install the CPU-only version of PyTorch:
pip install "instanovo[cpu]"
InstaNovo provides a comprehensive command line interface (CLI) for both prediction and training tasks.
To get help and see the available commands:
instanovo --help
To see the version of InstaNovo, InstaNovo+ and some of the dependencies:
instanovo version
To get help about the prediction command line options:
instanovo predict --help
The default is to run predictions first with the transformer-based InstaNovo model, and then further improve performance by iteratively refining the predicted sequences with the diffusion-based InstaNovo+ model.
instanovo predict --data-path ./sample_data/spectra.mgf --output-path predictions.csv
Which results in the following output:
scan_number,precursor_mz,precursor_charge,experiment_name,spectrum_id,diffusion_predictions_tokenised,diffusion_predictions,diffusion_log_probabilities,transformer_predictions,transformer_predictions_tokenised,transformer_log_probabilities,transformer_token_log_probabilities
0,451.25348,2,spectra,spectra:0,"['A', 'L', 'P', 'Y', 'T', 'P', 'K', 'K']",ALPYTPKK,-0.03160184621810913,LAHYNKK,"L, A, H, Y, N, K, K",-424.5889587402344,"[-0.5959059000015259, -0.0059959776699543, -0.01749008148908615, -0.03598890081048012, -0.48958998918533325, -1.5242897272109985, -0.656516432762146]"
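The CSV can be explored with standard tooling. A minimal sketch using pandas (column names follow the example output above; a small excerpt is inlined here so the snippet is self-contained, but in practice you would read the file you passed to `--output-path`):

```python
import io

import pandas as pd

# In practice: df = pd.read_csv("predictions.csv")
# A two-row excerpt in the same shape is inlined for illustration.
csv_text = (
    "spectrum_id,transformer_predictions,diffusion_predictions,diffusion_log_probabilities\n"
    "spectra:0,LAHYNKK,ALPYTPKK,-0.0316\n"
    "spectra:1,PEPTLDE,PEPTLDE,-2.5000\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# The diffusion columns hold the refined (final) predictions;
# sort by sequence log probability to see the most confident hits first.
best = df.sort_values("diffusion_log_probabilities", ascending=False)
print(best[["spectrum_id", "diffusion_predictions"]].to_string(index=False))
```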
To evaluate InstaNovo performance on an annotated dataset:
instanovo predict --evaluation --data-path ./sample_data/spectra.mgf --output-path predictions.csv
Which results in the following output:
scan_number,precursor_mz,precursor_charge,experiment_name,spectrum_id,diffusion_predictions_tokenised,diffusion_predictions,diffusion_log_probabilities,targets,transformer_predictions,transformer_predictions_tokenised,transformer_log_probabilities,transformer_token_log_probabilities
0,451.25348,2,spectra,spectra:0,"['L', 'A', 'H', 'Y', 'N', 'K', 'K']",LAHYNKK,-0.06637095659971237,IAHYNKR,LAHYNKK,"L, A, H, Y, N, K, K",-424.5889587402344,"[-0.5959059000015259, -0.0059959776699543, -0.01749008148908615, -0.03598890081048012, -0.48958998918533325, -1.5242897272109985, -0.656516432762146]"
Note that the `--evaluation` flag includes the `targets` column in the output, which contains the ground truth peptide sequence. Metrics will be calculated and displayed in the console.
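As a rough illustration of the kind of metric reported, sequence-level accuracy can be recomputed from the evaluation CSV itself. This is a sketch, not InstaNovo's own metric code; treating leucine and isoleucine as interchangeable is a common de novo sequencing convention and an assumption here:

```python
# Sketch: recompute sequence-level accuracy from prediction/target pairs.
def matches(pred: str, target: str) -> bool:
    # Leucine and isoleucine are isobaric, so treat them as interchangeable.
    return pred.replace("I", "L") == target.replace("I", "L")

rows = [
    ("LAHYNKK", "IAHYNKR"),  # from the example output above: near miss (K vs R)
    ("IAHYNKR", "LAHYNKR"),  # hypothetical pair: exact match up to I/L
]
accuracy = sum(matches(p, t) for p, t in rows) / len(rows)
print(f"sequence-level accuracy: {accuracy:.2f}")
```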
The configuration files for inference may be found in the `instanovo/configs/inference/` folder. By default, the `default.yaml` file is used.
InstaNovo uses command line arguments for commonly used parameters:
- `--data-path` - Path to the dataset to be evaluated. Allows `.mgf`, `.mzml`, `.mzxml`, `.ipc`, or a directory. Glob notation is supported, e.g. `./experiment/*.mgf`.
- `--output-path` - Path to the output CSV file.
- `--instanovo-model` - Model to use for InstaNovo. Either a model ID (currently supported: `instanovo-v1.1.0`) or a path to an InstaNovo checkpoint file (`.ckpt` format).
- `--instanovo-plus-model` - Model to use for InstaNovo+. Either a model ID (currently supported: `instanovoplus-v1.1.0-alpha`) or a path to an InstaNovo+ checkpoint file (`.ckpt` format).
- `--denovo` - Run de novo predictions. To evaluate the model on annotated data, use the `--evaluation` flag instead.
- `--with-refinement` - Use InstaNovo+ for iterative refinement of InstaNovo predictions. Defaults to `True`; use the `--no-refinement` flag to disable refinement.
To override the configuration values in the config files, you can use command line arguments. For example, by default beam search with one beam is used. If you want to use beam search with 5 beams, you can use the following command:
instanovo predict --data-path ./sample_data/spectra.mgf --output-path predictions.csv num_beams=5
Note the lack of a `--` prefix before `num_beams`: you are overriding the value of a key defined in the config file.
Output description
When `output_path` is specified, a CSV file will be generated containing predictions for all the input spectra. The model will attempt to generate a peptide for every MS2 spectrum regardless of confidence. We recommend filtering the output using the `log_probabilities` and `delta_mass_ppm` columns.
Column | Description | Data Type | Notes |
---|---|---|---|
scan_number | Scan number of the MS/MS spectrum | Integer | Unique identifier from the input file |
precursor_mz | Precursor m/z (mass-to-charge ratio) | Float | The observed m/z of the precursor ion |
precursor_charge | Precursor charge state | Integer | Charge state of the precursor ion |
experiment_name | Experiment name derived from input filename | String | Based on the input file name (mgf, mzml, or mzxml) |
spectrum_id | Unique spectrum identifier | String | Combination of experiment name and scan number (e.g., yeast:17738 ) |
targets | Target peptide sequence | String | Ground truth peptide sequence (if available) |
predictions | Predicted peptide sequences | String | Model-predicted peptide sequence |
predictions_tokenised | Predicted peptide sequence tokenized by amino acids | List[String] | Each amino acid token separated by commas |
log_probabilities | Log probability of the entire predicted sequence | Float | Natural logarithm of the sequence confidence, can be converted to probability with np.exp(log_probabilities). |
token_log_probabilities | Log probability of each token in the predicted sequence | List[Float] | Natural logarithm of the sequence confidence per amino acid |
delta_mass_ppm | Mass difference between precursor and predicted peptide in ppm | Float | Mass deviation in parts per million |
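Following the filtering recommendation above, a minimal sketch with pandas and NumPy (the rows and thresholds are illustrative, not official defaults):

```python
import numpy as np
import pandas as pd

# Toy rows with the columns described above; thresholds are illustrative.
df = pd.DataFrame({
    "predictions":       ["ALPYTPKK", "PEPTLDE", "GVSREELQR"],
    "log_probabilities": [-0.03,      -4.20,     -0.50],
    "delta_mass_ppm":    [1.2,        3.0,       45.0],
})

# Convert the sequence log probability into a confidence in [0, 1],
# then keep only confident, mass-consistent predictions.
df["confidence"] = np.exp(df["log_probabilities"])
filtered = df[(df["confidence"] >= 0.5) & (df["delta_mass_ppm"].abs() <= 20)]
```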
InstaNovo 1.1.0 includes a new model, `instanovo-v1.1.0.ckpt`, trained on a larger dataset with more PTMs.
Note: The InstaNovo Extended 1.0.0 training data misrepresented Cysteine as unmodified for the majority of the training data. Please update to the latest version of the model.
Training Datasets
- ProteomeTools Part I (PXD004732), Part II (PXD010595), and Part III (PXD021013), referred to as the all-confidence ProteomeTools (`AC-PT`) dataset in our paper
- An additional PRIDE dataset with more modifications: PXD000666, PXD000867, PXD001839, PXD003155, PXD004364, PXD004612, PXD005230, PXD006692, PXD011360, PXD011536, PXD013543, PXD015928, PXD016793, PXD017671, PXD019431, PXD019852, PXD026910, PXD027772
- MassIVE-KB v1
- An additional phosphorylation dataset (not yet publicly released)
Natively Supported Modifications
Amino Acid | Single Letter | Modification | Mass Delta (Da) | Unimod ID |
---|---|---|---|---|
Methionine | M | Oxidation | +15.9949 | [UNIMOD:35] |
Cysteine | C | Carbamidomethylation | +57.0215 | [UNIMOD:4] |
Asparagine, Glutamine | N, Q | Deamidation | +0.9840 | [UNIMOD:7] |
Serine, Threonine, Tyrosine | S, T, Y | Phosphorylation | +79.9663 | [UNIMOD:21] |
N-terminal | - | Ammonia Loss | -17.0265 | [UNIMOD:385] |
N-terminal | - | Carbamylation | +43.0058 | [UNIMOD:5] |
N-terminal | - | Acetylation | +42.0106 | [UNIMOD:1] |
See residue configuration under instanovo/configs/residues/extended.yaml
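Modifications appear inline in sequences as UNIMOD tags, e.g. `M[UNIMOD:35]` for oxidised methionine. A small sketch for stripping or counting such tags with the standard library (the tag syntax follows the table above; the helper name is ours):

```python
import re

# Matches UNIMOD tags such as [UNIMOD:35] embedded in a peptide sequence.
UNIMOD_RE = re.compile(r"\[UNIMOD:\d+\]")

def strip_mods(sequence: str) -> str:
    """Return the bare amino-acid sequence with all UNIMOD tags removed."""
    return UNIMOD_RE.sub("", sequence)

seq = "DTFNTSSTSN[UNIMOD:7]STSSSSSNSK"
print(strip_mods(seq))              # bare sequence without modification tags
print(len(UNIMOD_RE.findall(seq)))  # number of modifications in the sequence
```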
Data to train on may be provided in any format supported by the SpectrumDataFrame. See the section on data conversion for preferred formatting.
To train the auto-regressive transformer model InstaNovo using the config file instanovo/configs/instanovo.yaml, you can use the following command:
instanovo transformer train --help
To update the InstaNovo model config, modify the config file under instanovo/configs/model/instanovo_base.yaml
To train the diffusion model InstaNovo+ using the config file instanovo/configs/instanovoplus.yaml, you can use the following command:
instanovo diffusion train --help
To update the InstaNovo+ model config, modify the config file under instanovo/configs/model/instanovoplus_base.yaml
If you want to run predictions with only InstaNovo, you can use the following command:
instanovo transformer predict --help
If you want to run predictions with only InstaNovo+, you can use the following command:
instanovo diffusion predict --help
You can first run predictions with InstaNovo
instanovo transformer predict --data-path ./sample_data/spectra.mgf --output-path instanovo_predictions.csv
and then use the predictions as input for InstaNovo+:
instanovo diffusion predict --data-path ./sample_data/spectra.mgf --output-path instanovo_plus_predictions.csv instanovo_predictions_path=instanovo_predictions.csv
InstaNovo introduces a spectrum data class: SpectrumDataFrame. This class acts as an interface between many common formats used for storing mass spectrometry data, including `.mgf`, `.mzml`, `.mzxml`, and `.csv`. It also supports reading directly from HuggingFace, Pandas, and Polars.
When using InstaNovo, these formats are natively supported and automatically converted to the internal SpectrumDataFrame used by InstaNovo for training and inference. Any data path may be specified using glob notation. For example, you could use the following command to get de novo predictions from all the files in the folder `./experiment`:
instanovo predict --data_path=./experiment/*.mgf
Alternatively, a list of files may be specified in the inference config.
The SpectrumDataFrame also allows much larger datasets to be loaded lazily. To do this, the data is loaded and stored as `.parquet` files in a temporary directory. Alternatively, the data may be saved permanently in the native `.parquet` format for optimal loading.
Example usage:
Converting mgf files to the native format:
from instanovo.utils import SpectrumDataFrame
# Convert mgf files to native parquet:
sdf = SpectrumDataFrame.load("/path/to/data.mgf", lazy=False, is_annotated=True)
sdf.save("path/to/parquet/folder", partition="train", chunk_size=1e6)
Loading the native format in shuffle mode:
# Load a native parquet dataset:
sdf = SpectrumDataFrame.load("path/to/parquet/folder", partition="train", shuffle=True, lazy=True, is_annotated=True)
Using the loaded SpectrumDataFrame in a PyTorch DataLoader:
from instanovo.transformer.dataset import SpectrumDataset
from torch.utils.data import DataLoader
ds = SpectrumDataset(sdf)
# Note: shuffling and workers are handled by the SpectrumDataFrame
dl = DataLoader(
ds,
collate_fn=SpectrumDataset.collate_batch,
shuffle=False,
num_workers=0,
)
Some more examples using the SpectrumDataFrame:
sdf = SpectrumDataFrame.load("/path/to/experiment/*.mzml", lazy=True)
# Remove rows with a precursor charge greater than 2:
sdf.filter_rows(lambda row: row["precursor_charge"] <= 2)
# Sample a subset of the data:
sdf.sample_subset(fraction=0.5, seed=42)
# Convert to pandas
df = sdf.to_pandas() # Returns a pd.DataFrame
# Convert to polars LazyFrame
lazy_df = sdf.to_polars(return_lazy=True) # Returns a pl.LazyFrame
# Save as an `.mgf` file
sdf.write_mgf("path/to/output.mgf")
SpectrumDataFrame Features:
- The SpectrumDataFrame supports lazy loading with asynchronous prefetching, mitigating wait times between files.
- Filtering and sampling may be performed non-destructively on file loading.
- A two-fold shuffling strategy is introduced to optimise sampling during training (shuffling files and shuffling within files).
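The two-fold shuffling idea can be illustrated in a few lines of plain Python. This is a sketch of the strategy, not the SpectrumDataFrame implementation:

```python
import random

# Sketch of two-fold shuffling: shuffle the file order, then shuffle spectra
# within each file, so no global shuffle of the full dataset is required.
def two_fold_shuffle(files: dict[str, list[int]], seed: int = 42) -> list[int]:
    rng = random.Random(seed)
    order = list(files)
    rng.shuffle(order)        # fold 1: shuffle across files
    out: list[int] = []
    for name in order:
        rows = files[name][:]
        rng.shuffle(rows)     # fold 2: shuffle within each file
        out.extend(rows)
    return out

# Hypothetical dataset: two parquet files holding three spectrum indices each.
spectra = {"a.parquet": [0, 1, 2], "b.parquet": [3, 4, 5]}
shuffled = two_fold_shuffle(spectra)
```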
To use your own datasets, you simply need to tabulate your data in either Pandas or Polars with the following schema. The dataset is tabular, where each row corresponds to a labelled MS2 spectrum:
- `sequence` (string) - The target peptide sequence, including post-translational modifications
- `modified_sequence` (string) [legacy] - The target peptide sequence, including post-translational modifications
- `precursor_mz` (float64) - The mass-to-charge of the precursor (from MS1)
- `precursor_charge` (int64) - The charge of the precursor (from MS1)
- `mz_array` (list[float64]) - The mass-to-charge values of the MS2 spectrum
- `intensity_array` (list[float32]) - The intensity values of the MS2 spectrum
For example, the DataFrame for the nine species benchmark dataset (introduced in Tran et al. 2017) looks as follows:
sequence | precursor_mz | precursor_charge | mz_array | intensity_array | |
---|---|---|---|---|---|
0 | GRVEGMEAR | 335.502 | 3 | [102.05527 104.052956 113.07079 ...] | [ 767.38837 2324.8787 598.8512 ...] |
1 | IGEYK | 305.165 | 2 | [107.07023 110.071236 111.11693 ...] | [ 1055.4957 2251.3171 35508.96 ...] |
2 | GVSREEIQR | 358.528 | 3 | [103.039444 109.59844 112.08704 ...] | [801.19995 460.65268 808.3431 ...] |
3 | SSYHADEQVNEASK | 522.234 | 3 | [101.07095 102.0552 110.07163 ...] | [ 989.45154 2332.653 1170.6191 ...] |
4 | DTFNTSSTSN[UNIMOD:7]STSSSSSNSK | 676.282 | 3 | [119.82458 120.08073 120.2038 ...] | [ 487.86942 4806.1377 516.8846 ...] |
For de novo prediction, the `sequence` column is not required.
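A minimal example of tabulating your own data with this schema in pandas, using the `IGEYK` row from the benchmark table above (how the resulting DataFrame is then handed to SpectrumDataFrame should be checked against the API docs):

```python
import pandas as pd

# One labelled MS2 spectrum tabulated with the schema above.
df = pd.DataFrame({
    "sequence":         ["IGEYK"],
    "precursor_mz":     [305.165],
    "precursor_charge": [2],
    "mz_array":         [[107.07023, 110.071236, 111.11693]],
    "intensity_array":  [[1055.4957, 2251.3171, 35508.96]],
})
```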
We also provide a conversion script for converting to native SpectrumDataFrame (sdf) format:
instanovo convert --help
This project is set up to use uv to manage Python and dependencies. First, be sure you have uv installed on your system.
On Linux and macOS:
curl -LsSf https://astral.sh/uv/install.sh | sh
On Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Note: InstaNovo is built for Python >=3.10, <3.13 and tested on Linux.
Then fork this repo (having your own fork will make it easier to contribute) and clone it.
git clone https://github.com/YOUR-USERNAME/InstaNovo.git
cd InstaNovo
Activate the virtual environment:
source .venv/bin/activate
And install the dependencies. If you don't have access to a GPU, you can install the CPU-only version of PyTorch:
uv sync --extra cpu
uv run pre-commit install
If you do have access to an NVIDIA GPU, you can install the GPU version of PyTorch (recommended):
uv sync --extra cu124
uv run pre-commit install
Both approaches above also install the development dependencies. If you also want to install the documentation dependencies, you can do so with:
uv sync --extra cu124 --group docs
To upgrade all packages to the latest versions, you can run:
uv lock --upgrade
uv sync --extra cu124
InstaNovo uses `pytest` for testing. To run the tests, use the following commands:
uv run instanovo/scripts/get_zenodo_record.py # Download the test data
python -m pytest --cov-report=html --cov --random-order --verbose .
To see the coverage report, run:
python -m coverage report -m
To view the coverage report in a browser, run:
python -m http.server --directory ./coverage
and navigate to http://0.0.0.0:8000/ in your browser.
InstaNovo uses pre-commit hooks to ensure code quality. To run the linters, you can use the following command:
pre-commit run --all-files
To build the documentation locally, you can use the following commands:
uv sync --extra cu124 --group docs
git config --global --add safe.directory "$(dirname "$(pwd)")"
rm -rf docs/reference
python ./docs/gen_ref_nav.py
mkdocs build --verbose --site-dir docs_public
mkdocs serve
If you have a `pip` or `conda` based workflow and want to generate a `requirements.txt` file, you can use the following command:
uv export --format requirements-txt > requirements.txt
To set the Python interpreter in VSCode, open the Command Palette (`Ctrl+Shift+P`), search for `Python: Select Interpreter`, and select `./.venv/bin/python`.
Code is licensed under the Apache License, Version 2.0 (see LICENSE)
The model checkpoints are licensed under Creative Commons Non-Commercial (CC BY-NC-SA 4.0)
If you use InstaNovo in your research, please cite the following paper:
@article{eloff_kalogeropoulos_2025_instanovo,
title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
proteomics experiments},
author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
year = 2025,
month = {Mar},
day = 31,
journal = {Nature Machine Intelligence},
doi = {10.1038/s42256-025-01019-5},
issn = {2522-5839},
url = {https://doi.org/10.1038/s42256-025-01019-5}
}
Big thanks to Pathmanaban Ramasamy, Tine Claeys, and Lennart Martens of the CompOmics research group for providing us with additional phosphorylation training data.