Unifying the state-of-the-art fragment ion intensity prediction models to improve experiment agnostic protein quantification
Topic
Accurate prediction of fragment ion intensities is a central component of modern proteomics workflows, directly impacting peptide identification, spectrum matching, and quantitative analyses in mass spectrometry (MS). Prosit, a GRU-based deep learning model, has set a benchmark in this domain by enabling highly accurate prediction of MS/MS fragment ion intensities. Building on this success, a Transformer-based Prosit architecture has recently been developed in our lab.
Over the past years, multiple large-scale proteomics datasets have been collected and used to train independent Prosit models, each tailored to specific instrument types, ion types, or post-translational modifications (PTMs). While these specialized models achieve strong performance within their respective domains, they are limited in their ability to generalize across heterogeneous data sources.
Furthermore, recent studies have shown that single-cell proteomics data differs from bulk data, exhibiting overall lower ion intensities and a reduced presence of annotated peaks. In a previous project, we demonstrated that incorporating the sum of y and b fragment ion intensities as an additional model input can improve Prosit’s prediction accuracy, particularly for low-intensity and single-cell–like spectra.
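A minimal sketch of how such a summed-intensity input could be derived from one annotated spectrum. The annotation format ("b2+1"), the function names, and the log scaling are illustrative assumptions, not the feature definition used in the previous project.

```python
import math
from typing import Dict


def summed_by_intensity(peaks: Dict[str, float]) -> float:
    """Sum the raw intensities of all annotated b and y fragment ions.

    `peaks` maps a hypothetical annotation string (ion type, position,
    charge, e.g. "y3+1") to the raw intensity of the matched peak.
    """
    return sum(value for name, value in peaks.items() if name[0] in ("b", "y"))


def intensity_feature(peaks: Dict[str, float]) -> float:
    """Compress the summed intensity with log10(1 + x).

    Assumption: log scaling keeps the wide dynamic range of bulk and
    single-cell spectra in a range that suits a neural network input.
    """
    return math.log10(1.0 + summed_by_intensity(peaks))


# Toy spectrum: three b/y peaks and one a ion that is ignored by the sum.
example_peaks = {"b2+1": 1200.0, "y3+1": 5400.0, "y4+2": 300.0, "a2+1": 80.0}
example_sum = summed_by_intensity(example_peaks)
example_feature = intensity_feature(example_peaks)
```

The resulting scalar would be passed to the model alongside the sequence, precursor charge, and collision energy.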
This project aims to develop a unified, “all-in-one” Prosit Transformer model (Figure 1) trained on a diverse collection of proteomics datasets. The integrated training data will consist of:
The PROSPECT PTMs dataset
Data containing an expanded collection of fragment ion types beyond classical b and y ions
timsTOF-based data
An external dataset containing the lactylation PTM acquired on two different instrument types
Additional publicly available data
By integrating these complementary data sources, the resulting model will be PTM-aware, capable of predicting fragment ion intensities across multiple instrument types, and flexible with respect to different ion types.
The central challenge of this project lies in harmonizing heterogeneous datasets with varying fragmentation behaviors, ion definitions, and experimental biases. Addressing these challenges will involve careful data processing and standardization, model design choices, and evaluation strategies to ensure robust and generalizable performance.
Aim
Development of a Unified Transformer Architecture
Implement and refine a Transformer-based Prosit model capable of handling multi-modal inputs (modified sequences, collision energy, etc.)
Design a flexible output head to accommodate an expanded vocabulary of fragment ions (e.g., a, b, c, x, y, z ions, and neutral losses).
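One way to think about a flexible output head is as a fixed index over (ion type, position, charge) triples, combined with a per-peptide mask for fragments that cannot exist. The sketch below is an illustration under stated assumptions (at most 29 fragment positions, fragment charges up to 3, no neutral losses); all names are hypothetical and not taken from the Prosit code base.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

IonKey = Tuple[str, int, int]  # (ion type, fragment position, fragment charge)

ION_TYPES = ["a", "b", "c", "x", "y", "z"]
MAX_POSITION = 29  # assumption: peptides of up to 30 residues
MAX_CHARGE = 3     # assumption: fragment charges 1 to 3


def build_ion_index(
    ion_types: Sequence[str],
    max_position: int = MAX_POSITION,
    max_charge: int = MAX_CHARGE,
) -> Dict[IonKey, int]:
    """Assign each (ion type, position, charge) triple one output slot."""
    index: Dict[IonKey, int] = {}
    for pos, ion, charge in product(
        range(1, max_position + 1), ion_types, range(1, max_charge + 1)
    ):
        index[(ion, pos, charge)] = len(index)
    return index


def valid_mask(
    index: Dict[IonKey, int], peptide_length: int, precursor_charge: int
) -> List[bool]:
    """Mark the slots that are physically possible for one peptide.

    A fragment needs a position shorter than the peptide and a charge
    no larger than the precursor charge.
    """
    mask = [False] * len(index)
    for (_, pos, charge), slot in index.items():
        mask[slot] = pos < peptide_length and charge <= precursor_charge
    return mask


full_index = build_ion_index(ION_TYPES)
by_index = build_ion_index(["b", "y"])
mask = valid_mask(full_index, peptide_length=7, precursor_charge=2)
```

With only b and y ions this layout yields 174 slots, matching the output size of the original Prosit model; the six-ion vocabulary yields 522. Neutral losses could be added as further ion-type labels in the same scheme.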
Data Harmonization and Standardization
Integrate diverse datasets including PROSPECT PTM, timsTOF, MultiFrag, lactylation, and other datasets into a unified training pipeline.
Develop robust preprocessing workflows to normalize intensities and standardize PTM encodings across different experimental setups.
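The two preprocessing steps named above can be sketched as follows. Base-peak normalization is one common choice for making intensities comparable across instruments, and the alias table is a hypothetical example of how dataset-specific PTM spellings might be mapped to one notation (here bracketed UNIMOD accessions); the actual datasets may use different spellings.

```python
from typing import Dict, List

# Hypothetical alias table: dataset-specific spellings -> UNIMOD notation.
# UNIMOD:35 is oxidation, UNIMOD:4 is carbamidomethylation.
PTM_ALIASES: Dict[str, str] = {
    "M(ox)": "M[UNIMOD:35]",
    "M[Oxidation]": "M[UNIMOD:35]",
    "C(cam)": "C[UNIMOD:4]",
    "C[Carbamidomethyl]": "C[UNIMOD:4]",
}


def standardize_ptms(sequence: str, aliases: Dict[str, str] = PTM_ALIASES) -> str:
    """Rewrite every known PTM alias in a modified sequence string."""
    for alias, canonical in aliases.items():
        sequence = sequence.replace(alias, canonical)
    return sequence


def normalize_base_peak(intensities: List[float]) -> List[float]:
    """Scale intensities so that the most intense peak equals 1.0.

    An all-zero spectrum is returned unchanged to avoid division by zero.
    """
    base_peak = max(intensities, default=0.0)
    if base_peak <= 0.0:
        return list(intensities)
    return [value / base_peak for value in intensities]


standardized = standardize_ptms("AM(ox)C(cam)K")
normalized = normalize_base_peak([50.0, 200.0, 100.0])
```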
Performance Evaluation and Generalization
Evaluate the "all-in-one" model against specialized state-of-the-art models.
Assess the model's ability to generalize to unseen PTMs and to different mass spectrometry fragmentation methods (HCD, CID, ECD, ETciD, EID, UVPD).
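Comparisons of this kind need a per-spectrum similarity score. The normalized spectral angle reported in the original Prosit publication is a natural candidate; whether it will be the primary metric in this project is an assumption. A minimal sketch:

```python
import math
from typing import Sequence


def spectral_angle(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Normalized spectral angle between two intensity vectors.

    Returns 1.0 for identical direction and 0.0 for orthogonal vectors.
    Both vectors must list the same fragment ions in the same order.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # convention chosen here for empty spectra
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


same = spectral_angle([0.2, 1.0, 0.5], [0.2, 1.0, 0.5])
orthogonal = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Computing this score for the unified model and for each specialized model on the same held-out spectra gives directly comparable distributions.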
Impact on Downstream Quantification
Investigate how improved prediction accuracy translates to better protein quantification and peptide-spectrum match (PSM) rescoring.
General Schedule
Phase 1: Methods, Tools, Techniques
The first phase consists of a series of seminars and lectures in which you will learn the basics of various topics necessary for the project. There will be a mix of presentations by team members of our research group, practical sessions where applicable, and short presentations prepared by the participants:
Kickoff Seminar: Introduction to the course structure, organizational aspects, and overall project goals. Students will get an overview of the project and possible focus areas corresponding to the aims.
Proteomics Beyond Bulk: An introduction to general MS-based proteomics, single-cell proteomics (SCP) technologies, experimental design, and computational approaches for peptide identification and quantification.
Hands-on Data Processing and Deep Learning for Proteomics: Practical session on handling large-scale Parquet proteomics datasets and encoding peptide sequences for DL models. There will also be an introduction to DLOmix, a Python framework for deep learning in proteomics, and to PROSPECT (PROteometools SPECTrum compendium), a collection of large, annotated datasets. We will also dive deep into the evolution of Prosit, from GRUs to Transformers, and the importance of fragment ion intensity in rescoring.
Working with git as a team: This lecture will provide you with a project management system for working on larger coding projects. We will cover the concepts of issues, branches, pull requests, and the review process.
Additional Topics: If you want to go deeper, we are open to holding an additional seminar on a topic of your choice. You decide.
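As a preview of the hands-on session on encoding peptide sequences for DL models, the sketch below turns modified peptide strings into fixed-length integer vectors. The bracketed UNIMOD notation, the vocabulary construction, and the padding length are illustrative assumptions and do not reproduce the DLOmix API.

```python
import re
from typing import Dict, Iterable, List

# One token = one residue letter, optionally followed by a bracketed PTM.
TOKEN_PATTERN = re.compile(r"[A-Z](?:\[UNIMOD:\d+\])?")
PAD_ID = 0  # index 0 is reserved for padding


def tokenize(sequence: str) -> List[str]:
    """Split a modified peptide string into residue-level tokens."""
    return TOKEN_PATTERN.findall(sequence)


def build_vocab(sequences: Iterable[str]) -> Dict[str, int]:
    """Assign an integer id (starting at 1) to every token seen."""
    tokens = sorted({tok for seq in sequences for tok in tokenize(seq)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def encode(sequence: str, vocab: Dict[str, int], max_len: int = 30) -> List[int]:
    """Map tokens to ids and right-pad with PAD_ID up to max_len."""
    ids = [vocab[tok] for tok in tokenize(sequence)]
    if len(ids) > max_len:
        raise ValueError(f"peptide longer than {max_len} residues")
    return ids + [PAD_ID] * (max_len - len(ids))


tokens = tokenize("AM[UNIMOD:35]K")
vocab = build_vocab(["AM[UNIMOD:35]K", "MK"])
encoded = encode("AM[UNIMOD:35]K", vocab, max_len=5)
```

Treating a modified residue as its own token is only one option; an alternative is to encode the residue and the modification in separate input channels.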
Phase 2: Research project planning
In the second phase, we want you to prepare a detailed project plan. At the end of this phase, you will present your plan and discuss it with us. We will assist you during the planning of your project and provide you with feedback to ensure that you are able to bring your project to a successful end. Most importantly, you should discuss the following points:
Requirement Analysis: Definition of research questions, pipeline extensions, and QC metrics to be implemented
Organization: Milestones, task distribution, and time planning. Students are encouraged to use project management tools. Communication will be conducted via Slack
Phase 3: Implementation and Research
This is the main phase of your project. You will implement, integrate, and test your work according to your plan. We will hold weekly meetings to discuss your progress.
Semester Work: Students are expected to work throughout the semester. On-site work is encouraged but not mandatory; virtual participation is possible
Full-Time Block: Depending on progress, an optional intensive block of two to three weeks may be scheduled. The specific time and requirements will be discussed with you
Submission: Deliverables include implemented code, documentation, benchmarking results, and a written report. Students will present their work in a final presentation
Skills Gained
Expertise in Transformer-based deep learning architectures for biological sequences
Experience in large-scale data engineering and harmonization of heterogeneous scientific data
In-depth knowledge of Mass Spectrometry principles and PTM biology
Competence in benchmarking AI models against state-of-the-art biological baselines
Organisation
Programming language: Python (PyTorch, TensorFlow, Pandas, NumPy)
Must-have skills
- Intermediate programming skills in Python
- Solid understanding of Deep Learning
- Experience with data science libraries
Good-to-have skills
- Familiarity with MS proteomics data
- Experience with GPU computing and model training at scale
Supervisors
- Victor Giurcoiu - primary
- Jesse Angelis - primary
- Mathias Wilhelm - secondary
Grading
- Presentation [max. 30 minutes, whole team]
- Report [~10 pages, including all sections]
- Track record of individual contributions must be supplied
Team size: This project is designed for a team of 3-4 people.
Location: OG-L 19, Maximus-von-Imhof-Forum 3, 85354 Freising
Submission
- A Git repository with code and documentation
- A comprehensive report detailing the project
Material
All materials are made available in TUM Moodle.
Literature
Gessulat, S., Schmidt, T., Zolg, D. P., Samaras, P., Schnatbaum, K., Zerweck, J., Knaute, T., Rechenberger, J., Delanghe, B., Huhmer, A., Reimer, U., Ehrlich, H., Aiche, S., Kuster, B., & Wilhelm, M. (2019). Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nature Methods, 16(6), 509–518. https://doi.org/10.1038/s41592-019-0426-7
Gabriel, W., Zolg, D. P., Giurcoiu, V., Shouman, O., Prokofeva, P., Seefried, F., Bayer, F. P., Lautenbacher, L., Soleymaniniya, A., Schnatbaum, K., Zerweck, J., Knaute, T., Delanghe, B., Huhmer, A., Wenschuh, H., Reimer, U., Médard, G., Kuster, B., Wilhelm, M., . . . Wilhelm, M. (2025). Learning the Unseen: Data-Augmented Deep Learning for PTM Discovery with Prosit-PTM. bioRxiv (Cold Spring Harbor Laboratory). https://doi.org/10.1101/2025.11.07.687302
Steen, H., & Mann, M. (2004). The abc’s (and xyz’s) of peptide sequencing. Nature Reviews Molecular Cell Biology, 5(9), 699–711. https://doi.org/10.1038/nrm1468
Picciani, M., Gabriel, W., Giurcoiu, V., Shouman, O., Hamood, F., Lautenbacher, L., Jensen, C. B., Müller, J., Kalhor, M., Soleymaniniya, A., Kuster, B., The, M., & Wilhelm, M. (2023). Oktoberfest: Open‐source spectral library generation and rescoring pipeline based on Prosit. PROTEOMICS, 24(8), e2300112. https://doi.org/10.1002/pmic.202300112
Angelis, J., Schröder, E. A., Xiao, Z., Gabriel, W., & Wilhelm, M. (2025). Peptide Property Prediction for mass spectrometry using AI: An introduction to state of the art models. PROTEOMICS, 25(9–10), e202400398. https://doi.org/10.1002/pmic.202400398
