Introduction to Data Aggregation with NoSQL Databases: Blog Series Part I

The new world of data modelling

Data aggregation in the world of Big Data is changing the way companies deploy products. Building a product on top of a non-relational database requires a shift in how you model your data. Implementing a relational data model ‘as-is’ with NoSQL is likely to hurt your application’s performance badly. NoSQL data modeling achieves performance and scalability through techniques such as data flattening, aggregation and inverted indexes, all of which run against the relational paradigm.

This blog series offers a few guidelines for transitioning to a NoSQL-backed application. We’ll take an application whose data model would be rather simple if it were implemented on top of a relational database, and try to model it for a key-value store, namely Riak. Specific to Riak, we’ll look at how MapReduce queries can be used to run analytics (aggregates, or ‘counts’ if you will) on the data our sample application manages. We will then compare this approach, from a performance perspective, with an implementation based only on basic operations such as ‘fetch’. I’ll also expose some of the downsides of working with MapReduce and try to point out where to draw the line when employing Riak’s MapReduce.

Basic knowledge of Riak and associated NoSQL concepts is preferred but not required. Code samples are provided in Erlang. You can learn more about Riak here.

Sample application

To provide practical examples, let’s consider the following application: “blog watch”.

Our application’s goal is to collect data about bloggers: the articles they have written, the sites those articles live on and the topics of the articles. Data is brought into the application by a process that periodically scans target blogging sites for new articles.

So, a user robert78 might have posted 123 articles in total – 25 on reddit, 75 on boingboing and 23 on slashdot. The topics covered are ‘distributed systems’, ‘weather forecast’ and ‘best pencil ever made’.

Concept One: Map the entity relationships

We’re interested in deriving statistics (mainly counts) from the data we have gathered:

  • how many articles were posted by a blogger
  • how many articles were posted by a blogger on a certain topic
  • how many topics are covered by a blogger
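
To make the three counts concrete, here is a minimal sketch that computes them over a plain in-memory list. It is not Riak code and not the implementation this series builds later: the module name, the function names, the {Blogger, Site, Topic} tuple layout and the blogger "alice" are all assumptions made purely for illustration.

```erlang
-module(blog_watch_stats).
-export([sample_articles/0, posts_by/2, posts_by_on_topic/3, topics_covered_by/2]).

%% Hypothetical sample data: one {Blogger, Site, Topic} tuple per article.
%% The tuple layout and the blogger "alice" are illustration-only assumptions.
sample_articles() ->
    [{<<"robert78">>, <<"reddit">>,     <<"distributed systems">>},
     {<<"robert78">>, <<"boingboing">>, <<"weather forecast">>},
     {<<"robert78">>, <<"slashdot">>,   <<"distributed systems">>},
     {<<"alice">>,    <<"reddit">>,     <<"best pencil ever made">>}].

%% How many articles were posted by a blogger.
posts_by(Blogger, Articles) ->
    length([ok || {B, _Site, _Topic} <- Articles, B =:= Blogger]).

%% How many articles were posted by a blogger on a certain topic.
posts_by_on_topic(Blogger, Topic, Articles) ->
    length([ok || {B, _Site, T} <- Articles, B =:= Blogger, T =:= Topic]).

%% How many distinct topics are covered by a blogger.
topics_covered_by(Blogger, Articles) ->
    Topics = [T || {B, _Site, T} <- Articles, B =:= Blogger],
    %% lists:usort/1 sorts the list and drops duplicates.
    length(lists:usort(Topics)).
```

With this sample data, posts_by(<<"robert78">>, Articles) gives 3, posts_by_on_topic(<<"robert78">>, <<"distributed systems">>, Articles) gives 2 and topics_covered_by(<<"robert78">>, Articles) gives 2. Every call scans the whole list, which is fine for a toy data set but is exactly the kind of cost that has to be thought through once the data lives in a distributed key-value store.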

A non-functional requirement for the system is that it needs to be highly available.

Below is a diagram (fig. 1) depicting how our entities relate to each other.

fig. 1 – relation among entities manipulated by our sample application

In the next post, we will address relational data modeling and compare it to the NoSQL model.

Robert Cristian

Director of Advanced Technology

Cristian Robert is the Director of the Advanced Technology Group for 3Pillar Global’s Romanian branch. In this position Robert is responsible for driving R&D efforts and advancing the technical expertise of 3Pillar Global. His main interests include software architecture design, reactive systems and functional programming.
