When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
The increasing complexity and scale of modern IT infrastructures necessitate innovative strategies to maintain efficiency, reliability, and cost-effectiveness. Large-scale industrial systems require precise capacity planning to manage fluctuating demands, prevent downtime, and operate within optimal cost parameters. However, traditional capacity planning methods often fall short in today’s dynamic environments. This dissertation introduces an agentic approach to AIOps (Artificial Intelligence for IT Operations) aimed at enhancing the maintenance and operational stability of large-scale systems, with a focus on capacity planning scenarios.
Effective capacity planning is essential for stable system operations. Overprovisioning leads to resource waste, while underprovisioning can cause failures and diminished performance. By utilizing load testing data and advanced machine learning (ML) models, we propose a blueprint process that optimizes system capacity planning. Integrating ML into this process enhances predictive capabilities, enabling proactive resource scaling, reducing costs, and increasing system resilience.
A significant challenge in optimizing this process is that traditional load testing is inefficient and time-consuming. Existing methodologies often require substantial manual effort and considerable time to simulate large-scale workloads. To address this, we propose a framework that streamlines load testing through automation and early stopping rules based on spike detection techniques for system Key Performance Indicators (KPIs). By leveraging a system’s ability to predict KPI spikes, we can dynamically adjust capacity as needed.
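As an illustration only, an early stopping rule driven by KPI spike detection could look like the following minimal sketch. It is not the dissertation's method: the rolling z-score test, the function name `run_until_spike`, and the parameters `window`, `threshold`, and `min_sigma` are all assumptions introduced here for the example.

```python
from collections import deque
from statistics import mean, stdev


def run_until_spike(kpi_stream, window=5, threshold=3.0, min_sigma=1.0):
    """Consume KPI samples and return the step at which the test should stop.

    Hypothetical rule (an assumption, not taken from the dissertation):
    a sample is a spike when it exceeds the mean of the previous `window`
    samples by more than `threshold` standard deviations. `min_sigma` is a
    floor on the standard deviation so that a perfectly flat history does
    not turn every small change into a spike.

    Returns the zero-based step of the first spike, or None if the stream
    ends without one.
    """
    history = deque(maxlen=window)
    for step, value in enumerate(kpi_stream):
        if len(history) == window:
            mu = mean(history)
            sigma = max(stdev(history), min_sigma)
            if value - mu > threshold * sigma:
                # Stop the load test here instead of running it to the end.
                return step
        history.append(value)
    return None


# Made-up response-time samples (ms) from a hypothetical load test.
latencies_ms = [100, 102, 99, 101, 100, 103, 101, 250, 260]
stop_step = run_until_spike(latencies_ms)
```

With these made-up samples the rule stops at step 7, the first sample that jumps well above the recent baseline, so the remaining load steps would not need to be executed.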
Lastly, we aim to integrate these processes into tools used by LLM (Large Language Model) agents within an AIOps system. These tools will act as intermediaries for monitoring and maintaining large-scale systems. This integration will establish a fully managed architecture, where AIOps agents enhance the IT operations team’s ability to perform proactive maintenance, respond to new incidents, autonomously monitor system health, predict potential issues, and implement proactive measures to maintain optimal performance.
This dissertation presents a novel approach to enhancing efficiency in large-scale systems by combining automation and load testing improvements with machine learning and LLM agents. By developing a comprehensive, scalable framework, this research seeks to reduce operational overhead and establish a new standard for IT system management and load testing practices within the Software Development Life Cycle (SDLC) in industrial settings.