Cloud computing enables ubiquitous, on-demand network access to a shared pool of configurable computing resources with minimal management effort from the user. It has evolved into a key computing paradigm that supports a wide variety of applications, such as e-commerce, social networks, high-performance computing, mission-critical applications, and the Internet of Things (IoT). Ensuring the quality of service of applications deployed in inherently complex and fault-prone cloud environments is of utmost concern to service providers and end users. Machine learning-based fault management solutions enable proactive identification and mitigation of faults in cloud environments to attain the desired reliability, but they require labeled cloud metrics data for training and evaluation. Moreover, the highly dynamic nature of cloud environments gives rise to emerging data distributions, so cloud metrics data drawn from the evolving distribution must be labeled frequently for model adaptation. In this thesis, we study the problem of data labeling for fault detection in cloud environments, paying close attention to the phenomenon of evolving cloud metric
data distributions. More specifically, we propose a test suite-based active learning framework for automated labeling of cloud metrics data with the corresponding cloud system state while accounting for emerging fault patterns and data or concept drifts. We implemented our solution on a cloud testbed and introduced various emerging data distribution scenarios to evaluate the proposed framework's labeling efficacy over known and emerging data distributions. According to our evaluation results, the proposed framework achieves about 41% higher weighted F1-score and 34% higher average Area Under One-vs-Rest Receiver Operating Characteristic curves (OvR ROC AUC score) than a system without any adaptation for emerging data distributions.