Software testing is a critical part of ensuring software quality. Its primary objective is to verify that software meets its specified requirements, operates as intended, and is free of defects. Testing spans both functional and non-functional dimensions and is typically conducted at several levels, such as unit testing, integration testing, system testing, and acceptance testing.
During continuous integration, development activities are often halted upon a test failure, necessitating further investigation and debugging. Ideally, tests should produce consistent results: developers and testers expect the same outcome every time a test is executed on the same version of the software. However, some tests exhibit non-deterministic behavior and are commonly known as “flaky tests”. These tests send ambiguous signals to developers, eroding trust in the test suite and making it difficult to determine whether the software itself is flawed. Because flaky tests are hard to reproduce and debug, they hinder the seamless integration of code changes.
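For illustration, the sketch below is a minimal, hypothetical Python test exhibiting one common source of such non-determinism, an “async wait”, where the test races against a background thread; all names are invented for the example:

```python
import random
import threading
import time
import unittest


class FlakyExampleTest(unittest.TestCase):
    """Hypothetical test whose outcome depends on thread scheduling."""

    def test_async_increment(self):
        counter = {"value": 0}

        def worker():
            # Simulated asynchronous work with a variable delay.
            time.sleep(random.uniform(0.005, 0.02))
            counter["value"] += 1

        threading.Thread(target=worker).start()

        # Waiting a fixed 10 ms is sometimes shorter than the worker's
        # delay, so the assertion passes on some runs and fails on others,
        # even though neither the code under test nor the test has changed.
        time.sleep(0.01)
        self.assertEqual(counter["value"], 1)


if __name__ == "__main__":
    unittest.main()
```

The proper fix here would be to synchronize on the worker's completion (e.g., join the thread) rather than to sleep for a fixed interval.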
Despite being a recognized phenomenon for decades, test flakiness has only recently received sustained academic attention, with studies exploring its root causes and proposing strategies for prevention, detection, and mitigation. In this context, this dissertation aims to advance the research in two directions. First, we focus on predicting the lifetime of a flaky test, a question that has so far been left unaddressed in flaky-test research. Second, we question the effectiveness of previous approaches in discerning flaky failures from legitimate failures in the context of a large-scale industrial project. To ensure the reliability of our findings, we use the results of the Chromium build process as the dataset for both studies.
In our investigation of the historical patterns of flaky tests in Chromium, we found that 40% of flaky tests remain unresolved, while 38% are fixed within the first 15 days after their introduction. We then developed a predictive model for identifying tests that are resolved quickly. Our model achieved a precision of 73% and a Matthews Correlation Coefficient (MCC) of approximately 0.39 in predicting the lifetime class of flaky tests.
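For reference, MCC is the standard binary-classification metric computed from the four cells of the confusion matrix (true positives TP, true negatives TN, false positives FP, and false negatives FN):

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]

MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 is no better than chance, and -1 indicates total disagreement; unlike precision alone, it remains informative on imbalanced datasets.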
In our second contribution, we analyzed Chromium’s continuous integration and investigated the scale of flakiness in the project. Furthermore, we applied state-of-the-art flaky-test prediction methods to our dataset to predict flaky failures. We discovered that current vocabulary-based flaky-test detection approaches misclassify 78% of legitimate failures as flaky failures when applied to the Chromium dataset. The results also revealed that the source code of tests is not a sufficient indicator for predicting flaky failures, and that execution-related features must be incorporated to achieve better performance.
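To make the evaluated setup concrete, the sketch below shows the general shape of a vocabulary-based detector in the style of prior work: the test’s source code is reduced to a bag of tokens and a standard classifier is trained on them. This is a simplified, hypothetical pipeline; the corpus, labels, and tokenization are placeholders, not the actual Chromium data or the exact models from the studies we replicate.

```python
# Minimal sketch of a vocabulary-based flaky-test classifier: token counts
# from test source code feed a standard classifier. Illustrative only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder corpus: source code of tests, labeled flaky (1) or not (0).
test_sources = [
    "def test_fetch(): resp = fetch(url); assert resp.ok",
    "def test_sum(): assert add(2, 2) == 4",
]
labels = [1, 0]

# The token pattern keeps identifiers; sub-token splitting (camelCase etc.)
# is often applied in the literature but omitted here for brevity.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(test_sources, labels)

# Predict whether an unseen test looks flaky from its vocabulary alone.
print(model.predict(["def test_poll(): sleep(1); assert job.done"]))
```

Because such a model sees only the static text of a test, it assigns the same label to every failure of that test regardless of what happened at run time, which is consistent with our finding that execution-related features are needed to separate flaky from legitimate failures.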