Thesis defences

Automated Data Preparation Using Graph Neural Networks


Date & time
Tuesday, August 27, 2024
9 a.m. – 11 a.m.
Speaker(s)

Niki Monjazeb

Cost

This event is free

Organization

Department of Computer Science and Software Engineering

Where

ER Building
2155 Guy St.
Room Zoom

Wheelchair accessible

Yes

Abstract

Data preparation is a time-consuming part of data scientists’ work. Automating this work will improve the quality of machine learning results and free data scientists to focus on the machine learning task at hand. My research presents a system that automates this process by learning from the data preparation steps taken by others working on similar datasets. To automate data cleaning and transformation, datasets and their corresponding notebooks were extracted from Kaggle, and their information was abstracted before being uploaded into a knowledge graph. Graph Neural Network (GNN) models were trained on those knowledge graphs, and the most commonly used cleaning and transformation operations for similar datasets were inferred. These operations are offered to the user as recommendations that they can apply to their dataset using the corresponding APIs. These recommendations outperformed their state-of-the-art counterparts in terms of time, memory consumption, and accuracy. To detect similarity inclusion dependencies (sIND), knowledge graphs were created from datasets in the Prague Relational Learning Repository (Motl & Schulte, 2024). From those knowledge graphs, the columns deemed to have an inclusion dependency were studied until features leading to this dependency were observed. These features were used to create a model that predicts the sIND between columns. The resulting model correctly predicted more sIND pairs, in less time, than its competitor. This holistic platform can easily be integrated into any Data Science Pipeline (DSP) to facilitate the data preparation process for data scientists.


© Concordia University