Abstract
Self-supervised learning, a major paradigm within machine learning and deep learning, has disrupted many fields: computer vision, speech processing, natural language processing, and more. Automatic speaker verification (ASV), one of the most convenient means of biometric recognition, uses a speaker's voiceprint to verify their identity. Most deep speaker embedding models, however, are trained in a fully supervised manner and require large speaker-labeled datasets, and well-annotated datasets can be expensive and time-consuming to obtain. Moreover, as with any biometric system, spoofing is a major threat to the security offered by ASV, and good audio spoofing detection performance requires access to large datasets covering a wide variety of spoofing attacks.
This doctoral thesis, embodied by several published research papers, summarizes the main research topics and questions and discusses the theoretical and technical points related to my PhD research. It includes details of my published work, highlights of the obtained results, the challenges and limitations encountered, and suggestions for future work to improve the current results and to address the questions and problems that remain open.
The objective of this thesis is to build efficient self-supervised speech processing systems, mainly for audio automatic speaker verification and for its subfield of audio deepfake detection (also known as audio anti-spoofing), with the aim of developing countermeasures that protect ASV systems from spoofing attacks.
To improve the generalization and robustness of self-supervised ASV systems, this thesis investigates and compares a large variety of models and architectures and explores different self-supervised learning methods for speaker verification. This includes recent state-of-the-art self-supervised objectives for training our embedding networks, as well as several regularization techniques and metric learning loss functions to boost generalization and performance.
Additionally, to reduce the need for large annotated datasets during training, this thesis conducts a thorough, large-scale qualitative study of current clustering objectives for speaker clustering and analyses their influence on the performance and behaviour of the downstream speaker verification systems. It develops multiple variants of a novel state-of-the-art general-purpose clustering algorithm and proposes several frameworks and label-noise-robust loss functions to mitigate the negative effect of noisy labels on generalization during training. Notably, we shed light on several previously overlooked qualitative aspects of clustering-based pseudo-labels that can impact downstream performance. Moreover, training with annotations generated by several variants of our proposed clustering methods leads to considerable improvements and state-of-the-art self-supervised speaker verification performance. The thesis further introduces various novel audio augmentations that mimic audio spoofing attacks, reducing the need for spoofed audio during training and improving the performance of current supervised systems. We also investigate alternatives to conventional acoustic input features to improve the robustness of spoofing countermeasure systems.
Extensive experiments on benchmark datasets and against benchmark methods demonstrate the effectiveness of the proposed methods and frameworks.
We conclude the thesis by discussing the limitations of our work and by outlining possible future research directions.