Project: NoSQL Schema Evolution and Big Data Migration at Scale (Darwin)

  • together with Prof. Dr.-Ing. Stefanie Scherzinger, OTH Regensburg, and Prof. Dr. Uta Störl, Darmstadt University of Applied Sciences; funded by the Deutsche Forschungsgemeinschaft (DFG) for 3 years

Project Description

This project centers on schema evolution and scalable data migration in NoSQL data stores.  

Over the last decade, NoSQL databases have become state-of-the-art components in the software architecture stack. These systems are designed to manage data at scale. Agile developers in particular appreciate their schema flexibility, which lets them build software incrementally without having to declare a fixed schema up front.

Even though the data store itself may be schema-free, the application code expects the persisted entities to adhere to a certain structure (or schema). With each new release of the application, any legacy data already persisted in the production data store must be migrated to match the schema expected by the latest application release.

Based on the schema evolution operations that map between successive schema versions, we can derive the data migration operations for a specific NoSQL data store.
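To illustrate how data migration operations can follow from schema evolution operations, here is a minimal sketch in Python. The operation names (Add, Rename, Delete), the migrate helper, and the sample entity are assumptions made for this example; they are not taken from the Darwin schema evolution language, and a real data store would apply such operations through its own API.

```python
from dataclasses import dataclass
from typing import Any

Entity = dict[str, Any]


@dataclass(frozen=True)
class Add:
    """Hypothetical operation: add a property with a default value."""
    prop: str
    default: Any

    def apply(self, entity: Entity) -> Entity:
        # Keep an existing value if the property is already present.
        return {**entity, self.prop: entity.get(self.prop, self.default)}


@dataclass(frozen=True)
class Rename:
    """Hypothetical operation: rename a property."""
    old: str
    new: str

    def apply(self, entity: Entity) -> Entity:
        if self.old not in entity:
            return dict(entity)
        result = {k: v for k, v in entity.items() if k != self.old}
        result[self.new] = entity[self.old]
        return result


@dataclass(frozen=True)
class Delete:
    """Hypothetical operation: remove a property."""
    prop: str

    def apply(self, entity: Entity) -> Entity:
        return {k: v for k, v in entity.items() if k != self.prop}


def migrate(entity: Entity, operations: list) -> Entity:
    """Apply the schema evolution operations in order to one legacy entity."""
    for operation in operations:
        entity = operation.apply(entity)
    return entity


# A legacy entity as written by an earlier application release.
legacy = {"name": "Ada", "mail": "ada@example.org", "fax": "n/a"}

# Operations that map the old schema version to the new one.
operations = [Rename("mail", "email"), Delete("fax"), Add("newsletter", False)]

migrated = migrate(legacy, operations)
print(migrated)
```

Each operation returns a new dictionary instead of modifying its input, so the legacy entity stays available for comparison or rollback in this sketch.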

There are various degrees of freedom for realizing schema evolution, ranging from a complete migration of all legacy entities (eager migration) to a migration of only those legacy entities that are actually accessed by the application (lazy migration). Hybrid migration strategies are conceivable as well.

Based on an empirical study on the co-evolution of the application code and its implied schema in open source code repositories, we identify practically relevant migration strategies. Further, we intend to design new variations, assess the tradeoffs involved, and examine the use cases in which they are best applied.

To support software engineers in their daily work, we propose a data migration advisor. Based on a cost model that we develop, this advisor suggests project-specific data migration strategies, as well as their parametrization.

Our proposal makes the following contributions:

  1. Conducting an empirical study on the co-evolution of application code and NoSQL data instances in open source code repositories.
  2. Grounding efficient NoSQL data migration strategies in theoretical foundations.
  3. Assessing data migration strategies using our cost model.
  4. Providing an expressive schema evolution language that supports practically relevant schema changes.
  5. Implementing these concepts within Darwin, our prototype of a system-independent schema management component.
  6. Designing and contributing a NoSQL data migration advisor.
  7. Developing and contributing a representative NoSQL schema evolution benchmark.
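The difference between the eager and lazy migration strategies described above can be sketched with a small in-memory store. This is only an illustration under stated assumptions: the VersionedStore class, the "_v" version field, and the step counter are hypothetical and do not reflect the Darwin implementation or its cost model.

```python
from typing import Any, Callable

Entity = dict[str, Any]
Migration = Callable[[Entity], Entity]


class VersionedStore:
    """Hypothetical in-memory store that tags each entity with a schema version."""

    def __init__(self, migrations: dict[int, Migration]):
        # migrations[v] upgrades an entity from schema version v to v + 1.
        self.migrations = migrations
        self.latest = max(migrations) + 1 if migrations else 1
        self.data: dict[str, Entity] = {}
        self.migration_steps = 0  # crude stand-in for migration cost

    def put(self, key: str, entity: Entity, version: int) -> None:
        self.data[key] = {**entity, "_v": version}

    def _upgrade(self, entity: Entity) -> Entity:
        # Apply one migration per version step until the entity is current.
        while entity["_v"] < self.latest:
            version = entity["_v"]
            entity = {**self.migrations[version](entity), "_v": version + 1}
            self.migration_steps += 1
        return entity

    def migrate_eagerly(self) -> None:
        """Eager migration: upgrade every legacy entity at release time."""
        for key in list(self.data):
            self.data[key] = self._upgrade(self.data[key])

    def get_lazily(self, key: str) -> Entity:
        """Lazy migration: upgrade an entity only when it is accessed."""
        self.data[key] = self._upgrade(self.data[key])
        return self.data[key]


def add_verified_flag(entity: Entity) -> Entity:
    # Example migration from schema version 1 to version 2.
    return {**entity, "email_verified": False}


def build_store() -> VersionedStore:
    store = VersionedStore({1: add_verified_flag})
    for key in ("u1", "u2", "u3"):
        store.put(key, {"name": key}, version=1)
    return store


eager = build_store()
eager.migrate_eagerly()

lazy = build_store()
accessed = lazy.get_lazily("u1")

print("eager steps:", eager.migration_steps)
print("lazy steps:", lazy.migration_steps)
```

In this sketch the eager store pays for all three entities up front, while the lazy store migrates only the entity that was read and leaves the others at their old version. A hybrid strategy would sit between the two, for example by migrating frequently accessed entities eagerly.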

A primary goal of this project is to make our research insights available in the form of open source tools. To date, no reference architecture for NoSQL schema evolution is available. Thus, we are confident that we can make a valuable contribution with this research project.

Online Tool for Generating JSON Documents

As part of his master's thesis in the Darwin project, Felix Beuster developed an online tool for generating JSON test datasets. This tool can produce JSON documents with sample values for a predefined structure. A special feature is the generation of heterogeneous JSON documents (with different structures within each dataset).

The data generator is available here.
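The idea of heterogeneous test data can be sketched as follows. This is not the code of the online tool: the template format, the generate_documents function, and the presence probabilities are assumptions chosen for illustration. Each property has a value generator and a probability of being present, so documents in the same dataset can differ in structure.

```python
import json
import random
from typing import Any, Callable

# Hypothetical template: property name -> (value generator, presence probability).
Template = dict[str, tuple[Callable[[random.Random], Any], float]]


def generate_documents(template: Template, count: int, seed: int = 0) -> list[dict[str, Any]]:
    """Generate `count` documents; optional properties appear only sometimes."""
    rng = random.Random(seed)  # seeded so that a dataset can be reproduced
    documents = []
    for _ in range(count):
        document = {}
        for prop, (make_value, presence) in template.items():
            # rng.random() is in [0, 1), so presence 1.0 means "always present".
            if rng.random() < presence:
                document[prop] = make_value(rng)
        documents.append(document)
    return documents


template: Template = {
    "id": (lambda rng: rng.randint(1, 10_000), 1.0),
    "name": (lambda rng: rng.choice(["Ada", "Grace", "Edsger"]), 1.0),
    "email": (lambda rng: f"user{rng.randint(1, 99)}@example.org", 0.6),
    "tags": (lambda rng: rng.sample(["a", "b", "c"], k=2), 0.3),
}

documents = generate_documents(template, count=5, seed=42)
print(json.dumps(documents, indent=2))
```

Lowering a presence probability makes the dataset more heterogeneous, which is useful when testing how a migration handles entities that lack a property.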

 

Research Fields and Topics

Research Fields

  • Data Warehouses
  • Data Integration
  • NoSQL Databases
  • XML and Databases
  • Recommender Systems

Finalized Projects

  • Octopus-TX: Flexibility in ETL Processes for Data Warehouse Applications
  • CodeX: Conceptual Design and Evolution of XML Applications