When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
Database-backed applications rely on a database management system (DBMS), such as MySQL, for persistent data storage. Database accesses play a central role in these applications and are crucial to their maintenance and quality. Developers build database-backed applications that access relational databases using object-oriented programming languages such as Java, Python, C#, PHP, and C++. Because object-oriented programming follows a different paradigm from the relational model, developers use various technologies to ease database access by abstracting persistent data as objects. Specifically, developers often rely on two main access technologies: (i) executing a Structured Query Language (SQL) query and manually converting the results to objects; and (ii) using Object-Relational Mapping (ORM) frameworks, which automatically generate SQL queries and convert the results to objects based on object-to-database mapping configurations. However, each technology may pose its own database access challenges. Moreover, the abstraction introduced by ORM frameworks can make database access problems harder to debug. An ORM framework generates SQL queries automatically based on its configuration (e.g., the relationships among object types) and the ORM APIs that are invoked. As a result, developers do not have direct control over how the SQL queries are generated. If a database access issue is associated with a problematic generated SQL query, developers may have difficulty determining how and where that query is generated in the application code. Motivated by the importance and challenges of database access, in this thesis we first conduct an empirical study of database access bugs in seven large-scale open-source Java applications that use relational database management systems.
Specifically, by manually examining bug reports and commit histories spanning 5 to 16 years, we derive characteristics of database access issues, such as their categories, root causes, impact, and occurrence, across different popular database access technologies. Our empirical study provides motivation and guidelines for future research to help avoid, detect, and test for database access bugs in database-backed applications. To assist developers in debugging database access problems, we propose an approach for locating the origin (i.e., the control flow path containing a sequence of method calls) that generates a given SQL query. The approach achieves state-of-the-art localization accuracy and improves Top@5 accuracy by 225% and 333% over the baseline approach when using SQL session logs and individual query logs, respectively. We also find that our approach can help developers locate database access issues that generate problematic SQL queries (i.e., slow SQL queries and database deadlocks). In conclusion, this thesis uncovers the root causes of database access issues and demonstrates that combining static analysis and information retrieval techniques can help developers debug database access issues associated with problematic SQL queries. It also opens the door to future research on assisting the development of database access code and automatically generating tests for it, to improve the quality of database-backed applications.
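The two access styles contrasted in the abstract can be illustrated with a small sketch. It is not taken from the thesis: the `Customer` class, the `customer` table, and the `ToyMapper` class are hypothetical names invented for illustration, and `ToyMapper` is a deliberately tiny stand-in for a real ORM framework rather than the API of any actual library. The sketch uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Customer:  # hypothetical domain object
    id: int
    name: str


def make_db():
    """Create an in-memory database with a hypothetical `customer` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, name) VALUES (?, ?)",
        [(1, "ada"), (2, "grace")],
    )
    return conn


# Style (i): the developer writes the SQL and converts the row by hand.
# The query text is visible at the call site, so it is easy to trace.
def find_customer_raw(conn, customer_id):
    row = conn.execute(
        "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return Customer(id=row[0], name=row[1]) if row else None


# Style (ii): a toy mapper (an assumption, not a real ORM) that derives
# the SQL from the class definition. The caller never sees the query text.
class ToyMapper:
    def __init__(self, conn):
        self.conn = conn
        self.log = []  # generated SQL, similar in spirit to a query log

    def get(self, cls, pk):
        columns = ", ".join(f.name for f in fields(cls))
        table = cls.__name__.lower()
        sql = f"SELECT {columns} FROM {table} WHERE id = ?"
        self.log.append(sql)
        row = self.conn.execute(sql, (pk,)).fetchone()
        return cls(*row) if row else None
```

With the second style, a call such as `ToyMapper(conn).get(Customer, 2)` contains no SQL text at all; the query appears only in `log`. Mapping a logged query back to the code path that produced it is the kind of problem the localization approach in the abstract addresses.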