About

I’ve been a member of DDCM since 2006, conducting research with a primary focus on data quality. My early (PhD) research was focussed on duplicate detection and data fusion. My current research deals with data quality in a more general sense. A selection of my current and previous projects can be found below.

Current projects

ledc

I’m the principal researcher and one of the main developers of ledc, short for lightweight engine for data quality control. This framework brings together most of our research on data quality. It covers the entire data quality pipeline, from discovery of rules through localization of errors to generation of repairs. The framework is open source and can be found on GitLab.

Previous projects

News tracking

The news tracker platform offers a different view on news quality by monitoring edits made to news articles. The main idea of this platform is to use machine learning techniques to develop models for automated categorization of edits. Such models can be used to estimate the number of errors corrected during certain periods of time or in relation to certain events. The models are publicly available on GitHub.

Cost-based analysis of data quality

Data quality can be measured in many different ways, but few approaches allow for quantification of differences between datasets. In this project, we hypothesize that the cost of using data allows for such quantification. We develop an experimental setting in which different versions of the same database can be compared in terms of their ease of use.

Fusion of multi-valued and hierarchical data

Merging duplicate data in a relational database can be a difficult task in the presence of integrity constraints. This project produced two algorithms. The first learns specificity relations from duplicate data with high accuracy. The second guides the propagation of merge operations throughout a relational database. Both algorithms are available in the open-source tool sqlmerge.

Data wrappers with XPath

To extract data from documents (e.g., XML documents, HTML pages) one can rely on XPath queries. In this project, we proposed a method to learn generic XPath queries from a few examples provided by annotators. The technique combines alignment of individual XPath queries with six refinement strategies. Source code is currently not available, but the main results can be found here.
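For readers unfamiliar with XPath, the sketch below shows what a simple data-extraction query looks like. It only illustrates running hand-written queries. It is not the learning method from this project, and the sample document and its element names are made up for the example. It uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration only.
DOC = """<catalog>
  <book id="b1"><title>Data Quality</title><price>30</price></book>
  <book id="b2"><title>Data Fusion</title><price>45</price></book>
</catalog>"""

root = ET.fromstring(DOC)

# Select the text of every <title> that is a child of a <book>.
titles = [t.text for t in root.findall("./book/title")]

# Select the id of each <book> whose <title> child has the given text.
fusion_ids = [b.get("id") for b in root.findall("./book[title='Data Fusion']")]

print(titles)
print(fusion_ids)
```

A wrapper in this sense is a query like "./book/title" that keeps working across many documents with the same layout, which is why generic queries are preferable to ones tied to a single example.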

Support projects

Over the years, our team has played a supporting role in research projects with a strong need for sound data management. Here are some of the projects in which we cooperated.

The Disbiome database is a popular data repository that contains experimental findings on the correlation between diseases and changes in the microbiome.
In the CryptoDrug project, we composed a temporal database containing transactions made on several cryptomarkets.
The Alkamid database provides an overview of plant-occurring N-alkylamides and their physicochemical properties.
The Database of Byzantine Book Epigrams (DBBE) offers both textual and contextual data of book epigrams from medieval Greek manuscripts dating up to the fifteenth century.