Description:

Feature extraction is a crucial task in proteomics, as it reveals underlying mechanisms such as a drug's mode of action, which is important for drug response prediction. Traditionally, autoencoders have been used for feature extraction by compressing high-dimensional data (e.g. protein expression) into a condensed latent space (embedding) with a small set of coefficients. Although this has been successful in many applications, such latent spaces are inherently difficult to interpret, because it is generally unclear what the hidden layers of an encoder/decoder network have learned. To make the latent space more interpretable, one needs to be able to annotate its coefficients with specific biological functions. This was attempted by a tool called VEGA. Instead of providing all possible connections between individual neurons and forcing the network to learn which connections are important (unsupervised approach), it adds prior information to the decoder network in the form of a biologically meaningful structure (supervised approach). VEGA was tested on scRNA-seq data; our main focus is to explore transferring this approach to proteomics data. Ultimately, we hope that annotated embeddings give us a better understanding of how and why a drug leads to a specific response.
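To make the idea of a structured decoder more concrete, here is a minimal sketch in plain Python. It assumes that the prior knowledge is encoded as a binary mask over the decoder weights, so that each latent coefficient can only influence the features of one annotated module. This is one common way to impose such structure and is not taken from the VEGA code itself. The function names, the module names and the module memberships are illustrative assumptions.

```python
import random

def build_mask(features, modules):
    """Binary mask (modules x features): 1.0 where a feature belongs to a module."""
    index = {name: j for j, name in enumerate(features)}
    mask = [[0.0] * len(features) for _ in modules]
    for i, members in enumerate(modules.values()):
        for name in members:
            if name in index:
                mask[i][index[name]] = 1.0
    return mask

def masked_decode(z, weights, mask):
    """Linear decoder in which every weight outside the mask is forced to zero."""
    n_features = len(mask[0])
    return [
        sum(z[i] * weights[i][j] * mask[i][j] for i in range(len(z)))
        for j in range(n_features)
    ]

# Illustrative (hypothetical) annotation: two modules over five proteins.
features = ["EGFR", "ERBB2", "CDK4", "CDK6", "RB1"]
modules = {
    "RTK_signaling": ["EGFR", "ERBB2"],
    "cell_cycle": ["CDK4", "CDK6", "RB1"],
}

mask = build_mask(features, modules)
random.seed(0)
# Random weights stand in for weights that would normally be learned.
weights = [[random.gauss(0.0, 1.0) for _ in features] for _ in modules]

# Activating only the first latent coefficient affects only its own module.
z = [1.0, 0.0]
x_hat = masked_decode(z, weights, mask)
```

Because the mask ties each latent coefficient to a named module, the coefficient can be read as the activity of that module, which is the kind of annotation the project aims for.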

Tasks:

  1. Reproduce the results of VEGA
  2. Apply the model to proteomics data and evaluate how well it works
  3. Master's thesis only: explore different models by also adding structure to the encoder, and test whether additional prior information (e.g. GO or KEGG annotations) helps.

Prerequisites: 

  • Experience with Python is required
  • Experience with one of tensorflow/keras/pytorch is recommended
  • An understanding of biological data, specifically scRNA-seq / proteomics data, is recommended

Contact:

Mario & Mathias

Description:

Prosit is our deep learning architecture for peptide property prediction. It is trained on data acquired in the context of the ProteomeTools project or stored in ProteomicsDB. The goal of this project is to extend the capabilities of Prosit by increasing the number of properties it is able to predict, circumventing its current limitations (e.g. peptide lengths restricted to 7-30), or increasing the accuracy of its predictions. Multiple projects are possible in this context, ranging from smaller tests (Bachelor's theses and internships) to a full Master's thesis. A very interesting avenue is the prediction of spectra ab initio, where training no longer relies on the annotation of spectra.

Tasks:

  1. Introductory exploration/understanding of fragmentation spectra and Prosit
  2. Explore different avenues to see which one you are most comfortable working on.
  3. Train two models: one for retention time and one for fragment intensities

Prerequisites: 

  • Experience with Python is required
  • Experience with keras/tensorflow is a plus
  • No biological background is necessary 

Contact:

Wassim & Mathias

 

Description:

Although one would expect the abundance of a peptide to be the deciding factor for its observed intensity in a mass spectrometry-based experiment, the two often correlate poorly. This is due to many confounding factors that distort the observed intensity and affect each peptide differently. This hampers the use of mass spectrometry to quantify differences across peptides and impairs our ability to compare protein expression values with each other. This is particularly aggravating for targeted mass spectrometry, where only a preselected subset of peptides is monitored. To alleviate the problem of selecting peptides that do not represent their protein’s intensity well, different approaches have been devised to predict their “flyability” (response factor).

The goal of this project is to extend Prosit to enable the prediction of flyability. For learning, we will be able to make use of the data stored in ProteomicsDB.

Tasks:

  1. Introductory exploration/understanding of fragmentation spectra and Prosit
  2. Determine which peptide features to use for accurate prediction 
  3. Train a model for flyability prediction.

Prerequisites: 

  • Experience with Python is required
  • Experience with keras/tensorflow is a plus
  • No biological background is necessary 

Contact:

Ludwig

Description:

Mouse models are the first stage of preclinical trials. Mice are infected with the disease under study and then treated with the candidate drugs at different concentrations. Depending on the results, a drug candidate then proceeds to the next phase. Mice have many tissues in common with humans, but the two species still differ in many ways, for example in size and, more importantly, in their genomes. Researchers have defined homologous genes between the two species, which allows inter-species comparisons. The goal of this project is to compare different omics types across species.
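As a rough illustration of such an inter-species comparison, the sketch below pairs expression values through a mouse-to-human gene mapping and computes a Pearson correlation for one tissue. It is an assumption about how a first analysis could look, not a prescribed method. The expression values are made up, the tiny mapping is only an example, and the function names are hypothetical.

```python
import math

def pair_by_ortholog(mouse_expr, human_expr, orthologs):
    """Keep only genes that are quantified in both species and mapped to each other."""
    pairs = []
    for mouse_gene, human_gene in orthologs.items():
        if mouse_gene in mouse_expr and human_gene in human_expr:
            pairs.append(
                (mouse_gene, human_gene, mouse_expr[mouse_gene], human_expr[human_gene])
            )
    return pairs

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Example mapping (mouse symbol -> human symbol).
orthologs = {"Alb": "ALB", "Actb": "ACTB", "Gapdh": "GAPDH", "Trp53": "TP53"}

# Made-up log10 expression values for one tissue in each species.
mouse_liver = {"Alb": 9.1, "Actb": 8.0, "Gapdh": 8.4, "Cyp2c29": 7.2}
human_liver = {"ALB": 9.4, "ACTB": 7.8, "GAPDH": 8.1, "TP53": 4.9}

pairs = pair_by_ortholog(mouse_liver, human_liver, orthologs)
r = pearson([p[2] for p in pairs], [p[3] for p in pairs])
print(f"{len(pairs)} shared genes, Pearson r = {r:.2f}")
```

Genes without a partner in the other species (here Cyp2c29 and TP53) drop out of the comparison, so the number of shared genes is worth reporting alongside any correlation.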

Tasks:

  1. Introductory exploration/understanding of different omics-types.
  2. Compare protein expression values across species for the same tissue.
  3. Compare other omics-types.
  4. Present results in a detailed and comprehensive way, both in tables and in plots.

Prerequisites: 

  • Experience in R or Python
  • Basic understanding of biology is a plus

Contact:

   Ludwig & Mathias

 

Description:

Metabolomics, the large-scale study of metabolites, is quickly gaining popularity as yet another layer of biochemical information. These small molecules can give us information about the current state of a cell, and their interactions with proteins are fundamental to biological pathways. As these small molecules do not have the same regular chain structure as RNA and proteins, they have traditionally been much harder to analyze, and their coverage has therefore been rather low until recently. Many efforts have since been made to integrate proteomics and metabolomics, though the integration has been rather superficial and inconclusive regarding the question of whether we gain more information by combining the two. The goal of this project is to integrate metabolomics data into ProteomicsDB and evaluate the benefits of including this data.

Tasks:

  1. Introductory exploration/understanding of SAP HANA.
  2. Explore different metabolomics data.
  3. Integrate metabolomics data into ProteomicsDB.
  4. Explore different analysis tools for evaluating the benefits of integrating metabolomics data.

Prerequisites: 

  • Experience in a programming language (e.g. Python)
  • Experience with SQL.
  • Basic understanding of biology is a plus

Contact:

   Ludwig & Mathias

 

Description:

 

Endogenous peptides are crucial in medical diagnosis and therapy, and their detection can facilitate the identification of biomarkers in the circulating proteome for diagnostic tool development. However, untargeted identification of these peptides in body fluids, such as plasma, faces challenges in sample preparation, data acquisition, and analysis. Despite efforts to enhance endogenous peptide detection, a standardized analysis pipeline for peptidome mining is currently lacking. As part of this project, you will work with peptidome data obtained from a pilot stroke cohort in a case-control study. Your primary task will involve developing a package/visualization tool.

Tasks:

  1. Introductory exploration/understanding of the cohort data.
  2. Develop a visualization tool to facilitate peptidome mining.
  3. Enable comparisons between different groups.
  4. Localize peptides within proteins.

Prerequisites: 

  • Experience with R or Python is required.
  • Basic understanding of biology is a plus.

Contact:

   Mathias & Chien-Yun Lee

 

Description:

The development of big projects like ProteomicsDB needs the assurance that new additions or extensions to the platform's functionality do not break the previous working state. Continuous integration (CI) can reveal such errors. Proper (unit) tests have to be defined and run upon each new code addition or git branch merge. Tests already exist for many functionalities in ProteomicsDB, with Jenkins as the CI agent that triggers specific pipelines on certain events. Currently, when a merge request is created against the develop branch of ProteomicsDB, Jenkins triggers an automatic build of the minimal development version to make sure that all dependencies are met. The same pipeline checks for compliance with a coding style. The goal of this project is to design and implement separate Jenkins pipelines for testing some or all functionalities.

Tasks:

  1. Introductory exploration/understanding of Git and CI (Jenkins).
  2. Implement separate Jenkins pipelines for testing functionalities.
  3. Build a minimal ProteomicsDB from specific commits and create a release upon merging into the master branch.
  4. Add unit tests to increase code coverage.

Prerequisites: 

  • Experience in programming.
  • Experience in CI and Git is a plus.

Contact:

   Wassim & Mathias

 

Description:

Over the course of the last years, we have developed multiple internal R Shiny apps that guide and assist wet-lab scientists in the interrogation and analysis of chemical proteomics datasets. These cover, for example, binding curve fitting, visualization, classification, and selectivity calculation. The goal of the project is to integrate and extend our current repertoire of apps for handling chemical proteomics data into a single “workbench”.

Tasks:

  1. Introductory exploration/understanding of the R Shiny apps developed so far.
  2. Integrate and adjust the current apps into one app.
  3. Develop a fully functional tool usable by all chemical proteomics scientists.

Prerequisites: 

  • Experience in R.
  • Biological/Chemical knowledge is not required

Contact:

   Mathias

 

Description:

Match-between-runs (MBR) is an effective method to reduce missing values in experiments consisting of multiple runs. The method uses the fact that even though not all peptides are subjected to fragmentation, the quantitative information for these peptides is still recorded in the MS1 spectra. We can thus transfer identifications to these so-called MS1 features, provided we have identified a highly similar MS1 feature in a different run. We have a much more extreme version of this problem in our large-scale resource, ProteomicsDB, where a highly heterogeneous set of experiments and runs is represented. The goal of this project is to investigate the viability of applying MBR on the repository scale.
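The transfer step can be sketched in a few lines of Python. The sketch assumes that "highly similar" means agreement in m/z and retention time within fixed tolerances, and that the runs have already been retention-time aligned. The class, the function name and the tolerance values are illustrative assumptions. Published MBR implementations add alignment and error control on top of this.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Feature:
    mz: float                      # mass-to-charge ratio of the MS1 feature
    rt: float                      # retention time in minutes
    intensity: float
    peptide: Optional[str] = None  # None if the feature was not identified

def match_between_runs(identified: List[Feature], unidentified: List[Feature],
                       ppm_tol: float = 10.0, rt_tol: float = 0.5) -> List[Feature]:
    """Transfer peptide labels to unidentified features from the closest
    identified feature of another run within the m/z and RT tolerances."""
    results = []
    for feat in unidentified:
        best, best_dist = None, None
        for ref in identified:
            if ref.peptide is None:
                continue
            ppm = abs(feat.mz - ref.mz) / ref.mz * 1e6
            drt = abs(feat.rt - ref.rt)
            if ppm <= ppm_tol and drt <= rt_tol:
                # Scale both deviations by their tolerance to rank candidates.
                dist = ppm / ppm_tol + drt / rt_tol
                if best_dist is None or dist < best_dist:
                    best, best_dist = ref, dist
        label = best.peptide if best is not None else None
        results.append(Feature(feat.mz, feat.rt, feat.intensity, label))
    return results

# Made-up features: run A has identifications, run B only MS1 features.
run_a = [Feature(500.2500, 30.1, 1.0e6, "PEPTIDEK"),
         Feature(620.3000, 45.0, 5.0e5, "SAMPLER")]
run_b = [Feature(500.2502, 30.3, 8.0e5),
         Feature(700.4000, 12.0, 2.0e5)]

matched = match_between_runs(run_a, run_b)
for feature in matched:
    print(feature.mz, feature.rt, feature.peptide)
```

On the repository scale, runs come from very different instruments and gradients, so fixed tolerances like these are unlikely to hold without alignment and a reliability estimate for each transfer.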

Tasks:

  1. Introductory exploration/understanding of different algorithms used to apply MBR.
  2. Apply algorithms on ProteomicsDB datasets.
  3. Explore the reliability of the transferred identifications using statistical models.
  4. Explore different analysis tools for evaluating the benefits of integrating metabolomics data.

Prerequisites: 

  • Experience in a programming language (e.g. Python)
  • Experience with SQL.
  • Basic understanding of biology is a plus

Contact:

   Wassim & Mathias

 

Description:

 

The life sciences are increasingly driven by data and thus require more and more computational resources. This is due not only to the necessity of finding, acquiring, integrating and processing large data volumes, but also to the application of analytical procedures such as machine learning in order to obtain new insights into biological processes. ProteomicsDB not only serves as a mechanism by which data can be shared with the scientific community, it also provides multiple data analytics tools that enable users to explore the data from a number of interesting angles. Most gratifyingly, the drug selectivity information displayed in ProteomicsDB is now used by the molecular tumor board of the Comprehensive Cancer Center Munich to aid in clinical decision making, clearly demonstrating the very high potential value of proteomic data and informatics in the area of personalized medicine. The goal of this project is to increase the number of tools and services to improve data exploration, hypothesis building, and validation.

Tasks:

  1. Introductory exploration/understanding of ProteomicsDB.
  2. Explore different avenues to expand current analytics tools.
  3. Develop and integrate an analytics tool in ProteomicsDB.

Prerequisites: 

  • Experience with JavaScript is required.
  • Experience with SQL is a plus.
  • No biological background is necessary.

Contact:

   Ludwig & Mathias

 

Feel free to contact us directly, also if you want to propose your own topic.