Courses in the current semester
Data Warehouses, Business Intelligence, Data Mining
- Wednesday, 9:15 am-10:45 am, Online meeting
- Lectures are provided as videos (in Stud.IP)
Open topics for Bachelor's and Master's theses
Master's theses are available in the fields of NoSQL data, schema management, and schema evolution. Please make an appointment via email (meike.klettkeuni-rostockde).
Duplicate Detection and Elimination in NoSQL Databases – Master thesis
Duplicate detection and elimination is a well-studied field in relational databases and belongs to the data preprocessing steps. Duplicates are data records that either have an identical representation (that is, identical values in all attributes) or are sufficiently similar to each other.
Duplicate elimination can be performed within a single database. It is also relevant when datasets from different databases are integrated: the integrated data may contain duplicates that have to be detected and cleaned. In the cleaning process, duplicate tuples are combined into one.
The main task of this thesis is to study the different approaches to data deduplication (blockwise, multi-step, ...) and to extend them to JSON documents (using the JSON object model). The approaches available for relational databases shall be implemented for NoSQL databases (JSON documents). The implementation has to take the JSON characteristics into account (arrays, nested objects, properties), and the duplicate detection has to consider the structural information (the keys in JSON) as well as the values. To be applicable to different domains, the duplicate detection approach shall be parameterized so that the user can specify, among other things, the similarity threshold and the concrete similarity function.
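A minimal sketch of what such a parameterized detector could look like, not a prescribed design: each document is flattened into a set of (path, value) pairs so that keys and values both enter the comparison, and the similarity function and threshold are passed in as parameters. The names `flatten`, `jaccard`, and `find_duplicates`, the path notation, and the default threshold of 0.8 are assumptions made for illustration. The pairwise loop is quadratic; the blocking and multi-step approaches mentioned above exist to avoid exactly that.

```python
import json
from itertools import combinations
from typing import Any, Callable, Iterator


def flatten(doc: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, value) pairs so that structure (keys) and values are both compared."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(doc, list):
        for item in doc:
            # Array positions are dropped here, so element order is ignored (an assumption).
            yield from flatten(item, f"{path}[]")
    else:
        # Serialize the leaf so that the string "1" and the number 1 stay distinguishable.
        yield (path, json.dumps(doc))


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two feature sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(
    docs: list[Any],
    similarity: Callable[[set, set], float] = jaccard,
    threshold: float = 0.8,
) -> list[tuple[int, int]]:
    """Return index pairs of documents whose similarity reaches the threshold."""
    features = [set(flatten(doc)) for doc in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if similarity(features[i], features[j]) >= threshold
    ]
```

Calling `find_duplicates(docs, threshold=0.5)` or passing a different `similarity` callable changes the behavior without touching the comparison logic, which is the kind of parameterization the task description asks for.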
- Literature study on duplicate elimination in relational databases
- Literature study on JSON documents and implicit JSON structures
- Development of duplicate detection for JSON documents
- Comparison of the key:value pairs of two JSON documents
- Based on this: comparison of the complete trees of JSON documents; development of a similarity function for trees
- Definition of a threshold for finding similar documents (duplicates)
- Combination of duplicates into one JSON document
- Evaluation with the duplicate elimination benchmark (translated into JSON beforehand)
- based on MongoDB datasets
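The comparison, threshold, and merge steps listed above could be sketched as follows. This is only an illustration under stated assumptions, not the method the thesis is expected to produce: `tree_similarity`, `is_duplicate`, and `merge` are hypothetical names, string leaves are compared with `difflib.SequenceMatcher` from the Python standard library, array elements are matched greedily, and conflicts during merging are resolved by keeping the value of the first document.

```python
from difflib import SequenceMatcher
from typing import Any


def leaf_similarity(a: Any, b: Any) -> float:
    """Strings are compared approximately; all other leaves must be equal and of the same type."""
    if isinstance(a, str) and isinstance(b, str):
        return SequenceMatcher(None, a, b).ratio()
    return 1.0 if type(a) is type(b) and a == b else 0.0


def tree_similarity(a: Any, b: Any) -> float:
    """Recursive similarity in [0, 1] over the trees of two JSON values."""
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        if not keys:
            return 1.0
        # A key that is present in only one object contributes 0 to the average.
        shared = sum(tree_similarity(a[k], b[k]) for k in keys if k in a and k in b)
        return shared / len(keys)
    if isinstance(a, list) and isinstance(b, list):
        if not a and not b:
            return 1.0
        if not a or not b:
            return 0.0
        shorter, longer = sorted((a, b), key=len)
        # Greedy best match per element; this is a simplification, not an optimal assignment.
        best = [max(tree_similarity(x, y) for y in longer) for x in shorter]
        return sum(best) / len(longer)
    if isinstance(a, (dict, list)) or isinstance(b, (dict, list)):
        return 0.0  # structural mismatch, e.g. object versus array or leaf
    return leaf_similarity(a, b)


def is_duplicate(a: Any, b: Any, threshold: float = 0.8) -> bool:
    """Two documents count as duplicates if their similarity reaches the threshold."""
    return tree_similarity(a, b) >= threshold


def merge(a: Any, b: Any) -> Any:
    """Combine two duplicates into one document; on conflicting leaves the first one wins."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merge(a[key], value) if key in a else value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return a + [item for item in b if item not in a]
    return a
```

Which conflict policy is appropriate (first value, most recent value, longest value, or keeping both) depends on the domain and would itself be a candidate for a user-defined parameter.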
- Sven Puhlmann, Melanie Weis, Felix Naumann: XML Duplicate Detection Using Sorted Neighborhoods. EDBT 2006: 773-791
- Melanie Weis, Felix Naumann: Detecting Duplicate Objects in XML Documents. IQIS 2004: 10-19
- Uwe Draisbach, Peter Christen, Felix Naumann: Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection. J. Data and Information Quality 12(1): 3:1-3:30 (2020)
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19(1) (2007)
- Benchmarks: hpi.de/naumann/projects/repeatability/duplicate-detection/a-duplicate-detection-benchmark-for-xml-data.html