Introduction to Data Aggregation with NoSQL Databases: Blog Series Part II

Introduction: Relational data modeling vs. NoSQL

In the previous blog we discussed a sample blog watch application. In this installment we will model that application’s data in more detail. The relations among the entities we previously presented are extremely easy to implement on top of a relational database: a couple of join operations would get us the number of blogs a specific blogger has posted on a site. But as the amount of data the application manages grows, the queries that compute those counts will get slower.

In the ideal case you’d get a constant response time for a count. In real life, a very small increase in response time for a large increase in data is acceptable.

In the case of a relational database you can get there by flattening and partitioning the data, which we’re not going to discuss here; this approach, however, pushes extra overhead onto the client application.

The mechanisms that let you scale as the amount of data increases are provided out of the box by most NoSQL databases, which makes them a strong competitor to a relational database for our application. The problem of getting close-to-constant response times for the count queries still holds, but with a NoSQL database it may be easier to tackle, because the path to a solution is largely drawn by the model we choose for the data. The database does all the data partitioning for you and optimizes access to those partitions; you also get availability and partition tolerance for free. So let’s see how we can build our application and do some counting on top of Riak.

Concept Two: Design your model in such a manner that you don’t have to ‘search’ to get to the objects of interest

Depicted below is a first shot at structuring our data using a key-value store.

fig. 1 – a first shot at structuring the data in a key-value store

As you can see from the diagram above, we structured our data so that the entities in the blogs bucket link to all the other objects in the system. A blog object holds the links – keys – to its associated blogger, topic and blogging site objects, which reside in their respective bloggers, topics and blogging sites buckets.
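As a quick sketch of what a blog object could look like under this model – using the official Riak Python client, with bucket and field names that are assumptions for illustration:

import riak

# Connect to a local Riak node (host and port are the client defaults).
client = riak.RiakClient(protocol='pbc', pb_port=8087)

blogs = client.bucket('blogs')

# A blog object links to its related objects by holding their keys.
blog = blogs.new('blog_0001', data={
    'title': 'My first post',
    'blogger_key': 'robert78',   # key into the 'bloggers' bucket
    'topic_key': 'nosql',        # key into the 'topics' bucket
    'site_key': 'blogspot',      # key into the 'blogging sites' bucket
})
blog.store()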

An algorithm to look for the number of blogs posted by user robert78 on blogspot would be as follows:

from all the objects in the blogs bucket filter in the ones that match robert78 and blogspot
return their count

Note how we must check each object in the blogs bucket to come up with a result. It is a filtering operation on ALL objects in the blogs bucket. You’ll notice that this will have to happen with each algorithm we come up with to answer the questions we described as the goal of our application.
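To make the cost concrete, here is what that filtering could look like in code – a sketch only, reusing the client and the assumed field names from above; note that the key-listing call is exactly the expensive step:

def count_blogs_naive(client, blogger_key, site_key):
    """Count blogs by walking EVERY object in the blogs bucket."""
    blogs = client.bucket('blogs')
    count = 0
    # get_keys() enumerates all keys in the bucket -- an operation that
    # has to examine every key in the system and that Riak's own docs
    # warn against using in production.
    for key in blogs.get_keys():
        data = blogs.get(key).data
        if data['blogger_key'] == blogger_key and data['site_key'] == site_key:
            count += 1
    return count

count_blogs_naive(client, 'robert78', 'blogspot')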

Walking over all objects in a bucket is a very expensive operation and is not recommended (see http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/#Inputs). The reasons are not discussed here, but know that an object belongs to a bucket only because it is tagged as such; to identify all the objects that belong to a bucket, every key in the system would have to be examined.

Riak does provide features to help with searching, but they are less efficient than going directly to the objects that interest us. Thus, to efficiently get to the objects needed for a computation, make it a goal not to search for them. Instead, design your model so that the algorithm that returns the objects needed for a computation always knows where the objects to visit are – in the case of Riak, this translates to knowing the keys of the objects of interest, or at least being able to filter them in from a limited list of candidates.

The main tools for achieving this goal are:

  • data flattening
  • encoding some information about the object in its key (sketched below)
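A minimal sketch of the second tool: if a scan object’s key encodes the date the scan was performed on, the key can be computed rather than searched for (the 'scan_YYYY-MM-DD' format is an assumption for illustration):

from datetime import date

def scan_key(day):
    """Derive a scan object's key from the date of the scan.

    Because the key is computable, no search is needed to locate
    the scan object for a given day.
    """
    return 'scan_' + day.isoformat()

scan_key(date(2012, 3, 15))   # -> 'scan_2012-03-15'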

This train of thought illustrates an important point to keep in mind when designing data models for NoSQL: think of your model in terms of how you’re going to query the data.

Concept Three: Implement a Scans Bucket

Let’s try to improve our model so that, instead of searching for objects and filtering on all the objects in a bucket, we know the keys of the objects we’re interested in.

fig. 2 – model that uses data flattening to link between entities


Note that we introduced a new bucket, scans. The scans bucket holds information about the blogging sites we crawl (say, daily) for new blog posts. Each scan object holds a list of tuples identifying the site scanned, the key of a new blog post discovered, the key of the blogger that posted it, and the key of the blog post’s topic. For the scans performed the next day we create a new scan object; a scan object’s key encodes the date the scan was performed on.
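Concretely, creating the scan object for a given day could look like this – a sketch that follows the description above; the exact field names and tuple layout are assumptions (tuples are stored as JSON arrays):

scans = client.bucket('scans')

# One scan object per day; its key encodes the date of the scan.
scan = scans.new('scan_2012-03-15', data={
    'entries': [
        # [site key, new blog post key, blogger key, topic key]
        ['blogspot', 'blog_0001', 'robert78', 'nosql'],
        ['blogspot', 'blog_0002', 'maria_k', 'big-data'],
    ],
})
scan.store()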

We also modified the structure of the other objects to hold a list of scan keys. The relation is this: if, for example, blogger robert78 posted a blog on the 15th of March 2012, we add the key of the scan performed on that day to his list of scans, meaning “there is at least one reference to user robert78 in the scans performed on the 15th of March 2012”. The same relation holds for the other entities in the system.
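The corresponding update to a blogger object would append the day’s scan key to its list, roughly as follows (field names are again assumptions):

bloggers = client.bucket('bloggers')

# Record that robert78 is referenced in the scan of 2012-03-15.
blogger = bloggers.get('robert78')
scan_keys = blogger.data.get('scan_keys', [])
if 'scan_2012-03-15' not in scan_keys:
    scan_keys.append('scan_2012-03-15')
    blogger.data['scan_keys'] = scan_keys
    blogger.store()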

With this new model, the algorithm to look for the number of blogs posted by user robert78 on blogspot would be:

get the list of all scan keys where robert78 is referenced;
fetch the values of all the scan objects in the above list;
from each scan’s value (list of tuples) filter in the tuples that reference both robert78 and blogspot and
return their count
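Put together, the four steps translate roughly to the following – a sketch reusing the assumed field names from the sketches above:

def count_blogs(client, blogger_key, site_key):
    """Count blogs by a blogger on a site, visiting only known keys."""
    bloggers = client.bucket('bloggers')
    scans = client.bucket('scans')

    # 1. get the list of all scan keys where the blogger is referenced
    scan_keys = bloggers.get(blogger_key).data.get('scan_keys', [])

    count = 0
    for key in scan_keys:
        # 2. fetch the scan object's value directly by its key
        entries = scans.get(key).data['entries']
        # 3. filter in the tuples that reference both the blogger and the site
        count += sum(1 for site, _post, blogger, _topic in entries
                     if site == site_key and blogger == blogger_key)
    # 4. return their count
    return count

count_blogs(client, 'robert78', 'blogspot')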

Although our algorithm gained in complexity – it has more steps – we no longer have to go over all the elements in a bucket to get to our targeted objects. In terms of the cost of database operations, the new algorithm is much lighter than scanning all objects in a bucket – consider having millions of items in there; that’s what scalability is all about.

The final blog in this series will address the key concepts of map-reduce operations and serializing object values, which both add to the scalability of the solution.

Robert Cristian

Director of Advanced Technology

Robert Cristian is the Director of the Advanced Technology Group for 3Pillar Global’s Romanian branch. In this position, Robert is responsible for driving R&D efforts and advancing the technical expertise of 3Pillar Global. His main interests include software architecture design, reactive systems, and functional programming.
