PhD Oral Exam - Behshid Shayesteh, Information Systems Engineering

Machine Learning for Fault Prediction in Clouds

Date & time

Thursday, July 18, 2024
10 a.m. – 1 p.m.

Cost

This event is free

Organization

School of Graduate Studies

Contact

Nadeem Butt

Where

Engineering, Computer Science and Visual Arts Integrated Complex
1515 St. Catherine W.
Room 3.309

Accessible location

Yes

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

The widespread adoption of cloud computing has led to a significant increase in the size and complexity of data centers, which in turn raises the likelihood of faults. Faults can negatively impact the performance, availability, and reliability of cloud services and result in significant maintenance costs and loss of revenue for cloud service providers. Fault prediction in cloud environments is therefore critical for ensuring the performance, availability, and reliability of cloud services. Machine Learning (ML) techniques are increasingly used for this purpose because they can recognize and predict patterns that may indicate potential faults. While predicting faults with ML enables a proactive approach to fault prevention in clouds, building accurate prediction models that maintain their performance in dynamic cloud environments is not an easy task. One problem is concept drift, where changes in the data distribution of the cloud performance metrics used to train these models cause the models' performance to degrade over time. Similarly, feature drift, which refers to changes in the relevance of the features used to train the fault prediction model, can also degrade model performance over time. Additionally, the accuracy of ML models is influenced by several data-related parameters, which must be selected carefully to achieve high model performance.

This thesis focuses mainly on the challenges of employing ML models to predict faults, as well as the application performance degradation caused by these faults, in cloud environments. First, we propose a concept drift adaptation algorithm for fault prediction in cloud environments using Reinforcement Learning (RL). This algorithm takes the cloud operator's requirements into account and uses RL to select the drift adaptation method and the adaptation data size that best fulfill those requirements. Second, we propose a feature drift adaptation solution that adapts the model to feature drifts while predicting application performance degradation in cloud environments. This solution consists of a feature drift detector, which detects feature drifts by monitoring both the performance of the prediction model and the feature importance, and a feature drift adaptor, which measures the drift severity to adapt the prediction model. Finally, we propose a multi-objective optimization algorithm that automates the selection of the training data size, the data sampling interval, the input window, and the prediction horizon for training an ML model that predicts application performance degradation in clouds. This algorithm maximizes the performance of the prediction model while minimizing the resource consumption of data collection and storage.
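
As a rough illustration of the trade-off behind the last contribution, the sketch below filters a set of candidate data configurations down to those that are Pareto-optimal with respect to prediction performance (higher is better) and resource cost (lower is better). The abstract does not describe the thesis's actual optimization algorithm, so this is not that algorithm: the `Candidate` fields, the `dominates` and `pareto_front` helpers, and any scores or costs passed in are hypothetical and serve only to show what optimizing two competing objectives can look like.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One hypothetical data configuration and its (assumed) measured outcomes."""
    train_size: int         # number of training samples
    sampling_interval: int  # seconds between collected metric samples
    input_window: int       # number of past samples fed to the model
    horizon: int            # number of steps ahead being predicted
    score: float            # prediction performance, higher is better
    cost: float             # data collection/storage cost, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

A caller would evaluate each candidate configuration (for example by training and validating a model), pass the results to `pareto_front`, and then pick one of the remaining configurations according to the operator's priorities.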
