When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
Transmembrane transport proteins play a crucial role in essential cellular processes by facilitating the movement of substrates across biological membranes. Substrate specificity is a key feature of these proteins, as it determines which substances they selectively bind and transport. Wet lab methods for determining substrate specificities, such as protein-ligand binding assays and transporter uptake assays, are often expensive, time-consuming, and impractical for large-scale analyses. To address these limitations, computational approaches, particularly machine learning (ML) techniques, offer promising alternatives for substrate specificity prediction.
This research introduces novel computational methods for predicting the substrates transported by a given protein sequence. While certain transport protein groups are well-characterized, others lack experimentally annotated sequences, leading to imbalanced datasets in which minority classes are underrepresented. To address this, few-shot, one-shot, and zero-shot learning approaches are employed to preserve the data distribution. These models, based primarily on metric learning and open classification, utilize protein language models (PLMs). Additionally, a zero-shot learning model is introduced, leveraging both PLMs and large language models (LLMs) to tackle the complex task of substrate prediction for de novo proteins.
The manuscript first introduces an automatic pipeline to build machine learning-ready datasets for specific substrate groups, integrating ChEBI and Gene Ontology databases. This pipeline addresses the lack of protein sequence datasets labeled with their interacting substrates. Preliminary studies demonstrate the reliability of transformer-based PLMs, adapted from natural language processing, in this domain. The research then expands into three key projects: TooT-Open-ICAT, which predicts inorganic ions transported by transmembrane proteins using open-world classification; TooT-Triplet-SPEC, which predicts specific substrates through metric learning and triplet sampling; and TooTranslator, which extends substrate specificity prediction to de novo proteins by shifting from a classification task to a regression approach.
The introduced low-shot learning models advance substrate specificity prediction by extending the state of the art (SOTA) from 11 general substrate classes to 93 specific substrates, with the TooTranslator model achieving a Matthews correlation coefficient (MCC) of 0.92. This research also shows improved sample-to-true-label distances for classes unseen during training, a promising result for de novo prediction in future studies.
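For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a multi-class MCC can be computed from true and predicted labels. It is not the evaluation code used in the thesis; the function name `multiclass_mcc` and the substrate labels in the usage example are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    Illustrative sketch only (hypothetical helper, not the thesis code).
    Returns a value in [-1, 1]; 1 means perfect agreement.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)                                       # total samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    t_counts = Counter(y_true)                            # occurrences of each true class
    p_counts = Counter(y_pred)                            # occurrences of each predicted class
    labels = set(t_counts) | set(p_counts)

    # Covariance-style terms of the multi-class MCC.
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())

    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0 when the score is undefined
    # (for example, when every prediction is the same class).
    return cov_tp / denom if denom else 0.0


# Hypothetical substrate labels, for illustration only.
true_labels = ["sodium", "potassium", "calcium", "sodium"]
pred_labels = ["sodium", "potassium", "calcium", "sodium"]
score = multiclass_mcc(true_labels, pred_labels)
```

Unlike plain accuracy, MCC accounts for how often each class occurs among both the true and the predicted labels, which makes it a common choice for imbalanced datasets such as those described in this abstract.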