Research

Digital information sources are growing at a tremendous pace. Data sets that are too large or too complex to be handled efficiently by traditional information management systems are often referred to as 'Big Data'. More and more organizations are investing considerable effort in collecting, organizing and managing their data efficiently. DDCM research aims to support these efforts by investigating and developing new technologies for coping with the many challenges that stem from 'Big Data' and from the inherently heterogeneous and imperfect or uncertain character of information.

Current research aims at developing new techniques for improving data quality, integrating data from heterogeneous sources, providing better access to databases, documents and multimedia archives, and supporting data analysis and decision making. Research topics include, among others:

Data quality

Striving for good data quality in an information system is of utmost importance because data quality is propagated to the results of all data querying, information retrieval and data analysis tasks performed on the data.

In general, data quality depends on many factors, including validity, reliability, objectivity, integrity, generalizability, relevance and utility. DDCM investigates (semi-)automatic processes for improving data quality.

An important research branch focuses on the development of novel techniques for coreference detection in data sources. This means that, given a description of an object, the technique should check whether this object description, or a similar one, is already stored in the data source and, if so, measure the extent to which the object descriptions are similar.

Examples of applications studied by the group are

  • searching for buildings with similar characteristics in real estate applications,
  • comparing ante- and post-mortem ear photographs in victim identification,
  • searching for similar recordings in a sound archive, and
  • searching for similar web pages.
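
As a rough illustration of what such a coreference check might look like, the sketch below compares two object descriptions attribute by attribute and aggregates the per-attribute similarities into a single degree between 0 and 1; the attribute names, weights and string comparison are purely illustrative and do not represent the group's actual algorithms.

    # Minimal coreference-scoring sketch: every shared attribute is compared
    # with a simple string similarity and the results are combined into a
    # weighted average. Attribute names and weights are invented for the example.
    from difflib import SequenceMatcher

    def string_similarity(a: str, b: str) -> float:
        """Normalised similarity of two strings in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def coreference_degree(obj1: dict, obj2: dict, weights: dict) -> float:
        """Weighted average of per-attribute similarities over shared attributes."""
        total, weight_sum = 0.0, 0.0
        for attr, weight in weights.items():
            if attr in obj1 and attr in obj2:
                total += weight * string_similarity(str(obj1[attr]), str(obj2[attr]))
                weight_sum += weight
        return total / weight_sum if weight_sum else 0.0

    building_a = {"street": "Sint-Pietersnieuwstraat", "number": "41", "type": "office"}
    building_b = {"street": "St.-Pietersnieuwstraat", "number": "41", "type": "offices"}
    print(coreference_degree(building_a, building_b,
                             {"street": 0.5, "number": 0.3, "type": 0.2}))

The resulting degree can be compared against a threshold or, as described further below, used directly to rank candidate pairs by their (un)certainty.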

Another research branch studies the (semi-)automatic merging (or fusion) of data that have been identified as coreferent. Merging techniques are important for the removal of duplicates (or quasi-duplicates) from data collections.
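
A minimal sketch of such a fusion step, assuming the two records have already been identified as coreferent, is given below; keeping both values in case of a conflict is just one possible policy and serves only to illustrate the idea.

    # Illustrative fusion of two coreferent records: missing values are filled
    # in from the other record, conflicting values are kept together for review.
    def fuse_records(rec1: dict, rec2: dict) -> dict:
        merged = {}
        for key in rec1.keys() | rec2.keys():
            v1, v2 = rec1.get(key), rec2.get(key)
            if v1 is None or v1 == v2:
                merged[key] = v2
            elif v2 is None:
                merged[key] = v1
            else:
                merged[key] = {v1, v2}   # conflict: keep both candidate values
        return merged

    print(fuse_records({"name": "J. Smith", "city": None},
                       {"name": "John Smith", "city": "Ghent"}))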

Our techniques are characterized by 

  • flexible comparison operators that allow the comparison process to be fine-tuned in accordance with human expertise,
  • the measurement of the (un)certainty about the coreference of two objects, which expresses to what extent it is possible that both objects are coreferent or not,
  • the ability to rank the results based on their (un)certainty measurement,
  • additional feedback to the user, which makes it possible to trace why two objects are (or are not) considered coreferent, and
  • the ability to explicitly handle imperfect information which might be caused by missing data, imprecision, uncertainty or inconsistency.

Data integration

In recent years, numerous digital documents, databases, webpages, sensor data and many other heterogeneous data sources (together often referred to as big data) have become available through the WWW and information clouds. Efficiently using and exploiting all this information in querying, retrieval and analysis tasks is a real IT challenge. In the data integration part of our research, we investigate how structured data (databases), semi-structured data (webpages) and unstructured data (digitized text and multimedia) can be (temporarily) integrated into a single, consistent structure that is better suited for further querying, retrieval or analysis.

One aspect of our research focuses on the development of a unique software framework for the integration, cleaning and analysis of data contained in multiple data sources.
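
The sketch below gives a simplified impression of such an integration step: records from a structured source, a semi-structured source and an unstructured source are all mapped onto one common target structure. The Person structure, the source formats and the extraction rule are hypothetical and only serve to illustrate the principle.

    # Toy integration example: three heterogeneous sources are mapped onto a
    # single, consistent target structure that can then be queried or analysed.
    import json
    import re
    from dataclasses import dataclass

    @dataclass
    class Person:
        name: str
        city: str

    def from_db_row(row: tuple) -> Person:              # structured source
        return Person(name=row[0], city=row[1])

    def from_json(document: str) -> Person:             # semi-structured source
        data = json.loads(document)
        return Person(name=data["fullName"], city=data["address"]["city"])

    def from_text(text: str) -> Person:                 # unstructured source
        match = re.search(r"(?P<name>[A-Z][\w. ]+) lives in (?P<city>[A-Z]\w+)", text)
        return Person(name=match.group("name"), city=match.group("city"))

    people = [
        from_db_row(("An Peeters", "Ghent")),
        from_json('{"fullName": "Jan Janssens", "address": {"city": "Bruges"}}'),
        from_text("John Smith lives in Antwerp."),
    ]
    print(people)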

Handling of imperfect information

Most information systems are specifically designed for managing perfect information. However, real-world information is often imperfect. Human communication often does not rely on exact numbers and facts, so a lot of the useful information available in emails, documents, memos, etc. is inherently imperfect. Data imperfection can be caused by imprecision, vagueness, uncertainty, incompleteness or inconsistency.

In our research we investigate techniques, based on soft computing, that allow data imperfections to be modelled and handled as adequately as possible. Both the structural aspects and the behavioral aspects (operators) of information modelling are studied. In doing so, we aim to register the available information as adequately as possible, without causing information loss. This leads to semantically richer answer sets and analysis results: soft computing makes it possible to enrich the results with information on their associated satisfaction level and/or uncertainty level. As such, the real value of the information is expressed more clearly, which is a definite added value for users and decision makers.
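
As a small example of how soft computing can represent an imperfect value, the sketch below models the imprecise statement 'the person is about 40 years old' as a triangular possibility distribution; the chosen boundaries are arbitrary and only illustrate the principle.

    # An ill-known value is modelled by a possibility distribution instead of a
    # single crisp number; every candidate value gets a degree in [0, 1].
    def triangular(x: float, low: float, peak: float, high: float) -> float:
        """Possibility degree of x under a triangular distribution."""
        if x <= low or x >= high:
            return 0.0
        if x <= peak:
            return (x - low) / (peak - low)
        return (high - x) / (high - peak)

    def about_forty(age: float) -> float:
        return triangular(age, 35, 40, 45)

    print(about_forty(38), about_forty(40), about_forty(47))   # 0.6 1.0 0.0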

Special research attention goes to the handling of spatial and temporal information. Indeed, space and time are special characteristics (dimensions) that apply in many cases when considering the context of a piece of information. Many information systems, such as Business Intelligence tools and Geographical Information Systems, are specifically designed to handle spatio-temporal information. Enriching such systems with facilities for handling imperfect spatio-temporal data is an important research theme for the group.

Content-based management of multimedia

The combination of soft computing techniques and database technology introduces new methodologies for extracting and discovering information and knowledge from multimedia systems. This results in more comprehensive discovered knowledge and enhances the capability of information systems to handle real-world data.

Multimedia systems are computer-delivered electronic systems that allow the user to control, combine, and/or manipulate different types of media, such as text, sound, video, computer graphics, and animation.

The growth of multimedia databases and data storage has caused a corresponding growth in the need to analyze and exploit them. Traditionally, multimedia data sources are mainly queried indirectly via their associated stored metadata, which contain descriptive information about these media sources. The main interest of our research is to extract more meaningful and useful information by exploring the content and structure of the media files themselves.
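
The sketch below illustrates the basic idea of content-based comparison on a deliberately simple level: an image, represented here as a flat list of grey values, is reduced to a coarse histogram feature vector, and two feature vectors are compared with cosine similarity. Real feature extraction pipelines are of course far richer (colour, texture, audio fingerprints, shot detection, ...).

    import math

    def histogram_features(pixels: list[int], bins: int = 8) -> list[float]:
        """Reduce grey values (0-255) to a normalised histogram feature vector."""
        counts = [0] * bins
        for p in pixels:
            counts[min(p * bins // 256, bins - 1)] += 1
        total = len(pixels) or 1
        return [c / total for c in counts]

    def cosine_similarity(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    image_a = histogram_features([10, 12, 200, 220, 30] * 100)
    image_b = histogram_features([11, 14, 210, 215, 25] * 100)
    print(cosine_similarity(image_a, image_b))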

Important topics in this field are

  • feature extraction from multimedia documents such as text, audio, images and video,
  • multimedia document indexing,
  • multimedia document classification and clustering, and
  • flexible, content-based search algorithms.

Document management

A considerable part of our research is related to the management and analysis of digital documents, with an emphasis on the development of metasearch engines for text, XML and HTML documents. A metasearch engine searches over multiple data sources and then combines the results into a single list.
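
A very small sketch of the result-combination step is given below: the scores of each source are min-max normalised and then summed per document (a CombSUM-style fusion); the engine names and scores are invented, and this is only one of several possible fusion strategies.

    # Merge ranked result lists from several search engines into a single list.
    def normalise(results: dict[str, float]) -> dict[str, float]:
        lo, hi = min(results.values()), max(results.values())
        span = (hi - lo) or 1.0
        return {doc: (score - lo) / span for doc, score in results.items()}

    def combsum(*result_lists: dict[str, float]) -> list[tuple[str, float]]:
        fused: dict[str, float] = {}
        for results in result_lists:
            for doc, score in normalise(results).items():
                fused[doc] = fused.get(doc, 0.0) + score
        return sorted(fused.items(), key=lambda item: item[1], reverse=True)

    engine_a = {"doc1": 12.0, "doc2": 7.5, "doc3": 3.0}
    engine_b = {"doc2": 0.9, "doc4": 0.4}
    print(combsum(engine_a, engine_b))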

Important topics in this field are

  • the identification of corresponding parts in different documents (schema mapping),
  • data integration (applied to the search results),
  • multiple document summarization,
  • the handling of sensor data, and
  • the handling of unstructured documents.

Another aspect of the research focuses on the development of a software framework for metasearch.

Flexible querying and information retrieval from documents

Flexible querying and information retrieval techniques are used to improve access to and human interaction with information systems. They aim to make it easier for users to find what they are looking for. The research of the group concentrates on the study and development of such techniques within the context of database management systems and search engines.

Research on flexible querying investigates how query formulations can be relaxed in order to better reflect what the user is looking for. In many cases the user can only approximately specify what she/he is searching for. For example, one can search for all paintings that were painted around 1914 (thereby also partially accepting paintings painted in 1912, 1913, 1915 or 1916). Using traditional querying techniques it is not possible to assign a partial satisfaction degree to database entries that only partially satisfy the query criteria; hence the need for more advanced techniques. Among these techniques we study:

Flexible query criteria
As illustrated above, users sometimes need to describe their query criteria approximately. It might be the case that an approximate description denotes exactly what they are looking for, e.g. young employees with high salaries, or there might be a need to relax (or strengthen) a criterion in order to avoid an empty (or overcrowded) query result. With traditional techniques this can only be achieved by querying for a range of values (e.g. salary larger than 2.500 Euro), hereby eliminating the values falling just outside the boundaries of the range (e.g. 2.499 Euro), which is probably not the best solution. A better approach is to treat query satisfaction not as an all-or-nothing concept (as with traditional techniques), but as a matter of degree, e.g. the larger the salary, the higher the satisfaction degree. In this way, the otherwise strict boundaries are relaxed. Another type of flexible criterion takes advantage of similarities defined among concepts. For instance, when a user searches for a light grey car, she/he might, to a lesser extent, also be interested in white cars.
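
The sketch below shows how such graded criteria can be expressed: 'painted around 1914' and 'high salary' each return a satisfaction degree between 0 and 1 instead of a yes/no answer; the boundary values are only an example of how a criterion could be modelled.

    def around_1914(year: int) -> float:
        """Full satisfaction for 1914, decreasing linearly to 0 outside 1911-1917."""
        if 1911 < year < 1914:
            return (year - 1911) / 3
        if year == 1914:
            return 1.0
        if 1914 < year < 1917:
            return (1917 - year) / 3
        return 0.0

    def high_salary(salary: float) -> float:
        """0 below 2000 Euro, 1 above 3000 Euro, linear in between."""
        return min(max((salary - 2000) / 1000, 0.0), 1.0)

    print(around_1914(1913), around_1914(1916))   # 0.67 and 0.33
    print(high_salary(2499), high_salary(2501))   # both close to 0.5: no sharp cut-off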

Weighted query criteria
It is often the case that users do not find all selection criteria equally important. E.g. the user may find it more important that an employee has a high salary than that the employee is young. Weights with values between 0 and 1, attached to the query criteria, can be used to express the respective importance of the different criteria, where a weight of 1 denotes 'fully important' and a weight of 0 denotes 'not important at all'. In a more advanced application, weights can also be used to model flexible quantifiers. In traditional queries only the universal quantifier (all) and the existential quantifier (at least one) are supported, through the use of the logical connectives 'AND' and 'OR' between query criteria. However, in human reasoning other quantifiers like most, at least three, a few, etc. are used. Supporting such quantifiers in a query language brings that language closer to human reasoning. Another aspect in this context is the handling of mandatory vs. optional criteria and sufficient vs. optional criteria, where satisfaction of an optional condition should lead to a bonus (and dissatisfaction of an optional condition to a penalty) in the computation of the overall query satisfaction.
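
A compact sketch of both ideas follows: importance weights are combined with the satisfaction degrees through a classical weighted conjunction, and a soft 'most' quantifier is approximated by the average satisfaction over all criteria (a deliberate simplification of quantifier-based aggregation).

    def weighted_min(satisfactions: list[float], weights: list[float]) -> float:
        """Weighted conjunction: each criterion contributes max(1 - weight, satisfaction)."""
        return min(max(1.0 - w, s) for s, w in zip(satisfactions, weights))

    def most(satisfactions: list[float]) -> float:
        """Degree to which 'most' criteria are satisfied (simple averaged form)."""
        return sum(satisfactions) / len(satisfactions)

    # a high salary (weight 1.0) matters more than being young (weight 0.4)
    print(weighted_min([0.9, 0.3], [1.0, 0.4]))   # 0.6
    print(most([1.0, 0.8, 0.9, 0.2]))             # 0.725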

Bipolar query criteria
An advanced aspect of querying concerns the handling of the so-called heterogeneous bipolar character of human questioning. This expresses that humans, when searching for objects, often express which objects they want (positive objects) and which objects they do not want (negative objects), but also tacitly consider a collection of objects about which they are indifferent. Indifferent objects might be acceptable or not, but they are not mentioned in the search criteria. For example, if somebody is looking for a car, she/he can specify that a black car is wanted and a white car is not wanted. However, because there are numerous car colors (some of them not even known to the user), nothing is specified about e.g. dark blue cars, which might also be acceptable in this case. In traditional query systems, cars with colors that are not specified in the query would be neglected during query evaluation. In a commercial setting, however, the customer might not be interested in any of the available black cars (because other preferences are not satisfied), while an available dark blue car (neglected by the system) would be acceptable. Properly handling such situations is the motivation for our research on bipolarity.
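
The tiny sketch below shows the essence of bipolar evaluation: every candidate receives both a degree of being wanted and a degree of being rejected, so colours that are not mentioned remain indifferent instead of being discarded; the colours and degrees are illustrative.

    wanted   = {"black": 1.0}
    rejected = {"white": 1.0}

    def evaluate(colour: str) -> tuple[float, float]:
        """Return (degree wanted, degree rejected) for a candidate colour."""
        return wanted.get(colour, 0.0), rejected.get(colour, 0.0)

    for colour in ("black", "white", "dark blue"):
        pos, neg = evaluate(colour)
        print(colour, "-> wanted:", pos, "rejected:", neg)
    # 'dark blue' scores (0.0, 0.0): not asked for, but not excluded either,
    # so it can still be proposed when the preferred cars are unavailable.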

Information retrieval is related to flexible querying, but aims to inform the user about the existence (or non-existence) and whereabouts of documents relating to a request. In information retrieval we want to find and rank the documents that (partially) match the request. In both flexible querying and information retrieval, we aim to use flexible criteria that support natural language constructs. Another important topic in the field of information retrieval is enhancing the indexing of text documents. A key point is to cope adequately with the relevance of the different user preferences.

Crowdsourcing

With social media like Facebook, LinkedIn, Twitter and Google+, huge online communities have become available. Social media offer new information sources which, in their turn, bring along novel challenges. In this part of our research we investigate how crowdsourced information can be used efficiently for extending existing data sources with complementary data, and hence for improving information processing tasks like database querying, information retrieval and data analysis. Among others, we study:

  • how to estimate the value (in terms of confidence) of crowdsourced data?
  • how to extract clusters of similar opinions from crowdsourced data?
  • how to find people with a similar user profile, based on the answers they provide to inquiries and opinion polls (a small sketch follows this list)?
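
As an illustration of the last question, the sketch below represents users by their answers to shared poll questions and compares them by their agreement ratio; the question identifiers and answers are fictitious, and far more refined similarity measures are conceivable.

    def profile_similarity(answers_a: dict[str, str], answers_b: dict[str, str]) -> float:
        """Fraction of shared questions on which two users gave the same answer."""
        shared = answers_a.keys() & answers_b.keys()
        if not shared:
            return 0.0
        agreements = sum(answers_a[q] == answers_b[q] for q in shared)
        return agreements / len(shared)

    alice = {"q1": "yes", "q2": "no", "q3": "maybe"}
    bob   = {"q1": "yes", "q2": "yes", "q3": "maybe"}
    print(profile_similarity(alice, bob))   # 0.67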

Decision support

Decision support systems help decision makers in selecting the best case/situation/action out of a list of candidates. In our group we study multi-criteria decision analysis systems. In such systems, the user describes her/his preferences as criteria that must be (partially) satisfied. These criteria are organized in a tree structure, such that each criterion can be subdivided into sub-criteria. For example, when selecting a car, sub-criteria of the overall selection criterion might be performance, safety and environmental issues. Criteria that have no sub-criteria are represented by the leaf nodes of the tree structure and are called elementary criteria. They represent conditions that are evaluated for each candidate case under consideration. For example, an elementary criterion related to a car's safety is the presence of airbags for the driver and front passenger; this criterion can be evaluated for each candidate car. For a single candidate case, the evaluation of all elementary criteria results in elementary suitability degrees for that case. The next step is the combination (aggregation) of all elementary suitability degrees into an overall suitability degree for the case. Aggregation should be done in such a way that it reflects the hierarchical structure of the criteria and adequately reflects the way the decision maker reasons in her/his decision making, considering, e.g., questions like 'Are all criteria equally important?', 'Are all criteria mandatory?', 'Should all criteria be satisfied (to the same extent)?', etc.
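
A strongly simplified sketch of such a criteria tree is given below: two elementary criteria are evaluated for a candidate car, and the resulting elementary suitability degrees are aggregated bottom-up with weighted averages; the criteria, weights and value ranges are invented, and real aggregation structures are considerably richer.

    def airbags(car: dict) -> float:
        """Elementary criterion: driver and front-passenger airbags present."""
        return 1.0 if car.get("airbags", 0) >= 2 else 0.0

    def fuel_economy(car: dict) -> float:
        """Elementary criterion: consumption in l/100 km, lower is better."""
        return min(max((8.0 - car["consumption"]) / 4.0, 0.0), 1.0)

    def aggregate(children: list[tuple[float, float]]) -> float:
        """Weighted average of (suitability, weight) pairs of the child criteria."""
        return sum(s * w for s, w in children) / sum(w for _, w in children)

    def overall_suitability(car: dict) -> float:
        safety      = aggregate([(airbags(car), 1.0)])
        environment = aggregate([(fuel_economy(car), 1.0)])
        return aggregate([(safety, 0.6), (environment, 0.4)])

    print(overall_suitability({"airbags": 4, "consumption": 6.0}))   # 0.8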

Research topics studied by the group are, among others:

  • advanced aggregation structures for criteria,
  • balancing cost vs. suitability,
  • group decision making (where multiple decision makers are involved), and
  • handling uncertainty (for those cases where not all data is available).

To support the research mentioned above, the research group conducts both fundamental research (in uncertainty modelling and (fuzzy) logic) and applied research in search of new technologies. Besides this, the group has also built up profound theoretical and practical expertise in information modelling, the design and implementation of databases, and the digitizing and archiving of (multimedia) information.