I’ve been a member of DDCM since 2006, where my research focuses primarily on data quality. My early (PhD) research dealt with duplicate detection and data fusion; my current research addresses data quality in a more general sense. A selection of my current and previous projects can be found below.
I’m the principal researcher and one of the main developers of ledc, short for lightweight engine for data quality control. This framework brings together most of our research on data quality. It covers the entire data quality pipeline: discovery of rules, localization of errors, and generation of repairs. The framework is open source and can be found on GitLab.
The news tracker platform offers a different view on news quality by monitoring edits made to news articles. The main idea of this platform is to use machine learning techniques to develop models that automatically categorize edits. Such models can be used to estimate the number of errors corrected during a given period of time or in relation to a given event. The models are publicly available on GitHub.
Data quality can be measured in many different ways, but few approaches allow for quantification of differences between datasets. In this project, we hypothesize that the cost of using data allows for such quantification. We develop an experimental setting in which different versions of the same database can be compared in terms of their ease of use.
Merging duplicate data in a relational database can be a difficult task in the presence of integrity constraints. This project produced two algorithms. The first learns specificity relations from duplicate data with high accuracy. The second guides the propagation of merge operations throughout a relational database. Both algorithms are available in the open-source tool sqlmerge.
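To illustrate why merges need to be propagated, here is a minimal sketch using Python's built-in sqlite3 module. It is not the sqlmerge algorithm: the `customer` and `orders` tables and the `merge_customers` helper are hypothetical, chosen only to show that a duplicate row cannot simply be deleted while a foreign key still refers to it.

```python
import sqlite3

def merge_customers(conn, keep_id, drop_id):
    """Merge the duplicate customer drop_id into keep_id (hypothetical schema)."""
    with conn:  # run both statements in one transaction
        # Propagate the merge: re-point referencing rows first ...
        conn.execute(
            "UPDATE orders SET customer_id = ? WHERE customer_id = ?",
            (keep_id, drop_id),
        )
        # ... so that deleting the duplicate no longer violates the foreign key.
        conn.execute("DELETE FROM customer WHERE id = ?", (drop_id,))

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );
    INSERT INTO customer VALUES (1, 'J. Smith'), (2, 'John Smith');
    INSERT INTO orders VALUES (10, 1), (11, 2);
""")

merge_customers(conn, keep_id=1, drop_id=2)
```

In this sketch the order of the two statements matters: with foreign keys enforced, deleting customer 2 before re-pointing order 11 would raise an `sqlite3.IntegrityError`.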
To extract data from documents (e.g., XML documents, HTML pages), one can rely on XPath queries. In this project, we proposed a method to learn generic XPath queries from a few examples provided by annotators. The technique combines alignment of individual XPath queries with six refinement strategies. Source code is currently not available, but the main results can be found here.
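As a rough illustration of what aligning individual XPath queries can look like, the sketch below generalizes a few absolute paths step by step. It is a simplification written for this page, not the published method: the `parse` and `generalize` helpers are hypothetical, the sketch only handles paths of equal depth with optional positional indices, and it includes none of the six refinement strategies.

```python
import re

# One location step: an element name with an optional positional index, e.g. "li[3]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")

def parse(xpath):
    """Split an absolute XPath such as /a/b[2]/c into (name, index) steps."""
    steps = []
    for raw in xpath.strip("/").split("/"):
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"unsupported step: {raw!r}")
        name, index = match.group(1), match.group(2)
        steps.append((name, int(index) if index else None))
    return steps

def generalize(xpaths):
    """Align the example paths step by step and keep only what they agree on."""
    parsed = [parse(x) for x in xpaths]
    if len({len(p) for p in parsed}) != 1:
        raise ValueError("this sketch only aligns paths of equal depth")
    result = []
    for column in zip(*parsed):
        names = {name for name, _ in column}
        indices = {index for _, index in column}
        # Disagreeing names become a wildcard; disagreeing indices are dropped.
        name = names.pop() if len(names) == 1 else "*"
        index = indices.pop() if len(indices) == 1 else None
        result.append(name if index is None else f"{name}[{index}]")
    return "/" + "/".join(result)

examples = [
    "/html/body/div[2]/ul/li[1]/a",
    "/html/body/div[2]/ul/li[3]/a",
]
generic = generalize(examples)
```

Here the two annotated examples differ only in the position of the `li` step, so `generic` is `/html/body/div[2]/ul/li/a`, a query that matches the links in every list item rather than just the two that were annotated.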
Over the years, our team has played a supporting role in research projects with a strong need for sound data management. Here are some of the projects in which we cooperated.