
Seminar: Holistic and Extensible Business Intelligence and Big Data Cleaning


Dr. Jaroslaw Szlichta (University of Toronto)

Tuesday, February 11, 2014, 10:30AM-12PM, EV 2.260

Abstract

Understanding the semantics of data is important both for optimizing business intelligence queries and for analyzing data quality. In this talk, we will present our holistic and extensible business intelligence and data cleaning techniques, which help to improve data analysis and data quality, and we will outline future directions in light of the big data era.

As business intelligence applications have become more complex and data volumes have grown, the analytic queries needed to support these applications have grown more complex as well. This increasing complexity raises performance issues and numerous challenges for query optimization. We introduced order dependencies (ODs) in data management systems. (ODs capture monotonicity properties in the data.) Our main goal is to investigate the inference problem for ODs, both in theory and in practice. We have developed query optimization techniques that use ODs for business intelligence queries over data warehouses, and we have implemented these techniques in the IBM DB2 engine. We have shown how ODs can be used to improve the performance of real and benchmark analysis queries, providing an average 50% speed-up.
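To make the monotonicity idea concrete, here is a minimal sketch (not the DB2 implementation) of what it means for one attribute to order another. The attribute names and the pairwise check are invented for illustration: an OD lhs orders rhs holds when any row with a smaller-or-equal lhs value also has a smaller-or-equal rhs value, which is why an optimizer can reuse a sort on lhs in place of a sort on rhs.

```python
# Hypothetical illustration of an order dependency (OD).
# "lhs orders rhs" means: for every pair of rows, a smaller-or-equal
# lhs value implies a smaller-or-equal rhs value.

def od_holds(rows, lhs, rhs):
    """Pairwise check (O(n^2), for clarity) that `lhs` orders `rhs`."""
    return all(a[rhs] <= b[rhs]
               for a in rows
               for b in rows
               if a[lhs] <= b[lhs])

# Invented date-dimension rows: ordering by date_key also orders year,
# so a sort on date_key could substitute for a sort on year.
dates = [
    {"date_key": 20140211, "year": 2014},
    {"date_key": 20131231, "year": 2013},
    {"date_key": 20140101, "year": 2014},
]

print(od_holds(dates, "date_key", "year"))   # -> True
print(od_holds(dates, "year", "date_key"))   # -> False: equal years, differing keys
```

Note that the reverse direction fails: two rows share the year 2014 but have different date keys, so a sort on year does not determine the order of date keys.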

Poor data quality is a barrier to effective, high-quality decision making based on data. Current data cleaning techniques apply mostly to traditional enterprise data rather than to big data, which is not only larger but also more dynamic and heterogeneous. Declarative data cleaning encodes data semantics as constraints (rules); errors arise when the data violates the constraints. It has emerged as an effective tool for both assessing and improving the quality of data. Recently, unified approaches that repair errors in both the data and the constraints have been proposed. However, both data-only and unified approaches are by and large static: they apply cleaning to a single snapshot of the data and constraints. We have proposed a continuous data cleaning framework that can be applied to dynamic data. Our approach permits both the data and its semantics to evolve, and suggests repairs based on accumulated statistical evidence. We built a classifier that predicts the type of repair needed to resolve an inconsistency (a data repair, a constraint repair, or a hybrid of both) and learns from past user repair preferences to recommend more accurate repairs in the future.
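The constraint-violation step of declarative cleaning can be sketched in a few lines. This is an invented example, not the framework described in the talk: the constraint is a functional dependency (zip determines city), and a group of rows that maps one zip to several city values is an inconsistency that some repair must resolve.

```python
from collections import defaultdict

# Hypothetical declarative-cleaning sketch: the constraint is the
# functional dependency zip -> city (attribute names are invented).

def fd_violations(rows, lhs, rhs):
    """Group rows by `lhs`; a group with multiple `rhs` values violates lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

people = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},        # inconsistent spelling
    {"zip": "90210", "city": "Beverly Hills"},
]

violations = fd_violations(people, "zip", "city")
print(violations)  # one violation: zip '10001' maps to two city spellings
```

In a continuous-cleaning setting, accumulated evidence would decide how to resolve such a violation: if "Newyork" appears rarely, a data repair (fix the spelling) is likely; if one zip code genuinely spans two cities in many rows, a constraint repair (relax the rule) may be the better suggestion.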

Bio

Jarek Szlichta is a Postdoctoral Fellow at the University of Toronto, working with Professor Renée Miller. His research concerns big data, business intelligence, data analytics, information integration, heterogeneous computing, systems, web search and machine learning. He received his doctoral degree from York University; during his doctoral studies, he held a three-year fellowship at the IBM Centre for Advanced Studies in Toronto. His research at IBM focused on the optimization of queries for business intelligence, with an emphasis on order dependencies. He is a recipient of the IBM Research Student-of-the-Year award (2012) "for having insights and perspective that has significantly contributed to IBM in a matter of great importance". Previously, he worked at Comarch Research & Development on designing and implementing the OCEAN GenRap system, an innovative data analytics reporting solution; this work received the prestigious CeBIT Business Award (2007). For a list of publications, please visit Jarek’s website.





© Concordia University