The new world of data modelling
Data aggregation in the world of Big Data is changing the way companies deploy products. Building a product on a non-relational database requires a shift in how you model your data. Implementing relational data models ‘as-is’ on NoSQL will severely hurt your application's performance. NoSQL data modelling relies on techniques such as data flattening, aggregation and inverted indexes, which run counter to the relational paradigm, to achieve performance and scalability.
This blog series offers a few guidelines for transitioning to a NoSQL-backed application. We’ll take an application whose data model would be rather simple on top of a relational database and model it for a key-value store, namely Riak. Specific to Riak, we’ll look at how MapReduce queries can be used to run analytics (aggregates, or ‘counts’ if you will) on the data managed by our sample application. We will then compare the performance of this approach with an implementation that uses only basic operations such as ‘fetch’. I’ll also expose some of the downsides of working with MapReduce and try to point out where to draw the line when employing Riak’s MapReduce.
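To make the ‘counts’ idea concrete, here is a minimal sketch of the map and reduce steps in plain Erlang, run locally over a list. It does not call Riak: the module name, the function names and the `{Blogger, Site}` tuple are hypothetical stand-ins for stored values, and Riak's real phase functions take extra arguments and operate on stored objects.

```erlang
-module(mr_count_sketch).
-export([map_site/1, reduce_sum/1, run/1]).

%% Map phase: emit one {Site, 1} pair per article.
%% The {Blogger, Site} tuple is a hypothetical stand-in for a stored value.
map_site({_Blogger, Site}) ->
    [{Site, 1}].

%% Reduce phase: sum the counts per site.
%% The output has the same shape as the input, so the function
%% can be applied again to its own results.
reduce_sum(Pairs) ->
    Totals = lists:foldl(
               fun({Site, N}, Acc) ->
                       maps:update_with(Site, fun(Old) -> Old + N end, N, Acc)
               end, #{}, Pairs),
    lists:sort(maps:to_list(Totals)).

%% Run both phases locally over a list of articles.
run(Articles) ->
    reduce_sum(lists:flatmap(fun map_site/1, Articles)).
```

Calling `mr_count_sketch:run/1` with two articles on `<<"reddit">>` and one on `<<"slashdot">>` yields `[{<<"reddit">>,2},{<<"slashdot">>,1}]`.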
Basic knowledge of Riak and associated NoSQL concepts is helpful but not required. Code samples are provided in Erlang. You can learn more about Riak here.
To provide practical examples, let’s consider the following application: “blog watch”.
Our application’s goal is to collect data about bloggers: the articles they have written, the sites those articles live on and the topics the articles cover. Data is brought into the application by a process that periodically scans target blogging sites for new articles.
So, a user robert78 might have posted 123 articles in total – 25 on reddit, 75 on boingboing and 23 on slashdot. The topics covered are ‘distributed systems’, ‘weather forecast’ and ‘best pencil ever made’.
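As an illustration, the numbers above could be held as precomputed counters, so that each statistic is a single lookup by key. This is only a sketch under assumptions: the module name, the function names and the `{Blogger, Site}` key layout are hypothetical, and an Erlang map stands in for the key-value store.

```erlang
-module(blog_watch_counts).
-export([sample/0, site_count/3, total_for/2]).

%% Hypothetical flattened store: {Blogger, Site} -> number of articles.
%% The figures are the ones from the robert78 example.
sample() ->
    #{{<<"robert78">>, <<"reddit">>} => 25,
      {<<"robert78">>, <<"boingboing">>} => 75,
      {<<"robert78">>, <<"slashdot">>} => 23}.

%% A single lookup by key; returns 0 when no counter exists.
site_count(Blogger, Site, Store) ->
    maps:get({Blogger, Site}, Store, 0).

%% Total number of articles across all sites for one blogger.
total_for(Blogger, Store) ->
    maps:fold(fun({B, _Site}, N, Acc) when B =:= Blogger -> Acc + N;
                 (_Key, _N, Acc) -> Acc
              end, 0, Store).
```

With this layout, `blog_watch_counts:total_for(<<"robert78">>, blog_watch_counts:sample())` returns 123, matching the total in the example.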
Concept One: Map the entity relationships
We’re interested in presenting statistics (mainly counts) derived from the data we have gathered.
A non-functional requirement for the system is that it needs to be highly available.
Below is a diagram (fig. 1) depicting how our entities relate to each other.
fig. 1 – relationships among the entities manipulated by our sample application
In the next post, we will address relational data modelling and compare it with the NoSQL model.