When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
This thesis, titled “Membrane Protein Classification with Protein Language Models”, investigates the application of advanced computational techniques to improve the accuracy of membrane protein prediction. The research focuses on improving protein informatics, particularly for membrane proteins, which are crucial for cellular functions and pharmacological targeting but remain challenging to characterize due to their complex structures.
The study employs Protein Language Models (PLMs), which adapt techniques from natural language processing to protein sequences, including ProtBERT, ProtT5, ESM-1b, ESM-2, and Ankh, combined with various classifiers. It uses three datasets: DS-M from the TooT-M project for membrane proteins, DS-T from Mishra et al. for transporters, and DS-C from the DeepIon project for ion channels and transporters. These PLMs were pretrained on extensive datasets such as UniRef50 (40 million proteins) and BFD (2 billion proteins).
The research comprises four interconnected projects. The first project demonstrates that fine-tuning ProtBERT-BFD outperforms frozen representations in membrane protein prediction, and that combining PLM representations with logistic regression surpasses previous methods of membrane protein classification. The second project integrates Convolutional Neural Networks (CNNs) with PLMs, enhancing the classification of transporter and ion channel proteins. The third project evaluates six PLMs, identifying ESM-1b as the top performer across most tasks, particularly in generalizing to new datasets. The fourth project incorporates secondary structure information into PLMs, showing limited universal improvement but improving the precision of membrane protein classification.
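As a rough illustration of the frozen-representation approach mentioned for the first project, the sketch below pools per-residue vectors into one fixed-length vector per protein and fits a logistic regression on top. It is not the thesis implementation. The encoder `embed_residues` is a hypothetical stand-in that returns one-hot vectors rather than real PLM embeddings, mean pooling is an assumed pooling choice, the four sequences are invented toy data, and the classifier is a plain gradient-descent logistic regression written out so the example has no dependencies.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed_residues(sequence):
    """Hypothetical stand-in for a frozen PLM encoder.

    A real PLM would return a contextual embedding per residue;
    here each residue is simply one-hot encoded (an assumption).
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence:
        vec = [0.0] * len(AMINO_ACIDS)
        vec[index[aa]] = 1.0
        vectors.append(vec)
    return vectors


def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(v[d] for v in residue_vectors) / n for d in range(dim)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, lr=1.0, epochs=500):
    """Batch gradient descent on the logistic loss; the encoder stays frozen."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (membrane) if the predicted probability is at least 0.5."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0


# Invented toy sequences: 1 = membrane-like (hydrophobic), 0 = soluble-like.
toy = [
    ("LLVVIIALLFVLIA", 1),
    ("AILLVFFLLIVGAL", 1),
    ("DEKRDESKKEDRSE", 0),
    ("KKDDEERSTQNEKD", 0),
]
X = [mean_pool(embed_residues(seq)) for seq, _ in toy]
y = [label for _, label in toy]

w, b = fit_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
```

In this arrangement only the classifier weights `w` and `b` are learned. Fine-tuning, by contrast, would also update the encoder's own parameters, which is the comparison the first project reports.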
This research introduces novel methodological approaches, including the pioneering use of PLMs for membrane protein prediction, the integration of secondary structure information into PLMs, and the first comprehensive analysis of PLMs on membrane proteins across various classifiers, dataset balances, and floating-point precision settings. The study addresses the challenge of limited annotated data through transfer learning strategies, though it faced limitations including computational resource constraints and data-related issues.
The findings contribute to bioinformatics by providing insights into PLM applications for protein classification, with potential implications for drug discovery and personalized medicine. Future research directions include exploring larger datasets, refining methods for integrating secondary structure information, and improving model interpretability and transparency.