Courses in the current semester

Data Warehouses, Business Intelligence, Data Mining

  • Wednesday, 9:15 am-10:45 am, Online meeting
  • Lectures are provided as videos (in Stud.IP)

Open topics for Bachelor and Master theses

Master thesis topics are available in the fields of NoSQL data, schema management, and schema evolution. Please make an appointment via email (meike.klettke@uni-rostock.de).

Duplicate Detection and Elimination in NoSQL Databases – Master thesis

Introduction:

Duplicate detection and elimination is a well-studied field in relational databases. It is one of the data preprocessing steps. Duplicates are records that either have an identical representation (that is, identical values in all attributes) or are sufficiently similar to each other.

Duplicate elimination can be carried out within a single database. It is also relevant when datasets from different databases are integrated. In this case, the integrated dataset can contain duplicates that have to be detected and cleaned. In the cleaning process, duplicate tuples have to be combined into one.

Task:

The main task of this thesis is to study the different approaches to data deduplication (blockwise, multi-step, ...) and to extend them to JSON documents (with the JSON object model). The approaches available for relational databases shall be implemented for NoSQL databases (JSON documents). The implementation has to take the JSON characteristics into account (arrays, nested objects, properties), and the duplicate detection has to consider the structural information (the keys in JSON) as well as the values. To be applicable to different domains, the duplicate detection approach shall be parameterized, so that the user can specify, for example, the similarity threshold and the concrete similarity function.
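
As an illustration of what such a parameterized approach could look like, here is a minimal Python sketch. It is not a prescribed design: the names DedupConfig, find_duplicates, and exact_match_ratio are hypothetical, the similarity function only compares top-level properties, and the blockwise step simply groups documents by a user-supplied key so that only documents within the same block are compared.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Any, Callable, Hashable

Document = dict[str, Any]


def exact_match_ratio(a: Document, b: Document) -> float:
    """Share of top-level keys (from either document) that carry equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


@dataclass
class DedupConfig:
    """User-facing parameters (hypothetical names, for illustration only)."""
    threshold: float = 0.8
    similarity: Callable[[Document, Document], float] = exact_match_ratio
    # Default: every document lands in the same block (no blocking).
    blocking_key: Callable[[Document], Hashable] = lambda doc: None


def find_duplicates(docs: list[Document], config: DedupConfig) -> list[tuple[int, int, float]]:
    """Return (index_a, index_b, score) for all pairs at or above the threshold."""
    # Blockwise step: group document indices by blocking key.
    blocks: dict[Hashable, list[int]] = {}
    for index, doc in enumerate(docs):
        blocks.setdefault(config.blocking_key(doc), []).append(index)

    # Compare only documents that share a block.
    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            score = config.similarity(docs[i], docs[j])
            if score >= config.threshold:
                pairs.append((i, j, score))
    return pairs


docs = [
    {"name": "Ann Lee", "city": "Rostock", "zip": "18055"},
    {"name": "Ann Lee", "city": "Rostock", "zip": "18059"},
    {"name": "Bob Ray", "city": "Berlin", "zip": "10115"},
]
config = DedupConfig(threshold=0.6, blocking_key=lambda d: d.get("city"))
pairs = find_duplicates(docs, config)
print(pairs)  # the two Rostock documents are reported as one candidate pair
```

Because the threshold, the similarity function, and the blocking key are passed in rather than hard-coded, the same loop could be reused for a different domain by supplying other values.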

Subtasks:

  • Literature study on duplicate elimination in relational databases
  • Literature study on JSON documents and implicit JSON structures
  • Development of duplicate detection for JSON documents
    • Comparison of the key:value pairs of two JSON documents
    • Based on this: comparison of the complete trees of JSON documents; development of a similarity function for trees
    • Definition of a threshold for finding similar documents (duplicates)
    • Combination of duplicates into one JSON document
    • Evaluation with a duplicate elimination benchmark (translated into JSON beforehand)
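
The development subtasks above can be sketched in Python as follows. This is only one possible starting point under stated assumptions, not the expected solution: the function names flatten, similarity, and merge are hypothetical, the tree similarity is a plain Jaccard index over (path, value) pairs, array positions are ignored, and on conflicting values the first document wins when duplicates are combined.

```python
from typing import Any


def flatten(node: Any, path: str = "") -> set[tuple[str, str]]:
    """Turn a JSON value into a set of (path, value) pairs.

    Nested objects extend the path with their key; array elements share
    one path (their position is ignored, which is an assumption).
    """
    if isinstance(node, dict):
        pairs = set()
        for key, value in node.items():
            pairs |= flatten(value, f"{path}/{key}")
        return pairs
    if isinstance(node, list):
        pairs = set()
        for item in node:
            pairs |= flatten(item, f"{path}[]")
        return pairs
    return {(path, repr(node))}  # repr keeps 1 and "1" apart


def similarity(a: Any, b: Any) -> float:
    """Jaccard index over the (path, value) pairs of both document trees."""
    pairs_a, pairs_b = flatten(a), flatten(b)
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def merge(a: Any, b: Any) -> Any:
    """Combine two duplicates into one document; on conflicts `a` wins."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merge(a[key], value) if key in a else value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return a + [item for item in b if item not in a]
    return a


THRESHOLD = 0.5  # illustrative value; meant to be chosen by the user

doc_a = {"name": "Ann Lee", "address": {"city": "Rostock"}, "tags": ["db", "json"]}
doc_b = {"name": "Ann Lee", "address": {"city": "Rostock", "zip": "18059"}, "tags": ["json"]}

score = similarity(doc_a, doc_b)  # 3 shared pairs out of 5 distinct pairs
if score >= THRESHOLD:
    combined = merge(doc_a, doc_b)
    print(score, combined)
```

Comparing keys and values together as path/value pairs means that two documents with the same values under different keys are not treated as similar, which reflects the requirement to consider structure as well as content.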

Implementation:

  • Python
  • Based on MongoDB datasets

Literature: