Introduction: Relational data modeling vs. NoSql
In the previous blog we discussed a sample blog watch application. In this blog we will continue to model its data in more detail. The relations among entities we previously presented are extremely easy to implement on top of a relational database: a couple of join operations would get us the count of blogs posted on a site by a specific blogger. But as the data managed by our system grows, those count queries will get slower.
In the ideal case you’d get a constant response time for a count. In real life, getting a very small increase in response times for a large increase in data is acceptable.
In the case of a relational database you can get there by flattening and partitioning the data, which we’re not going to discuss here; this approach, however, pushes extra complexity onto the client application.
The mechanisms that allow you to scale as the amount of data increases are provided out-of-the-box by most NoSql databases, which makes them strong competitors to a relational database for our application. The problem of getting close to constant response times for the count queries still holds. But with a NoSql database the problem might be easier to tackle, as the path to a solution is mostly drawn by the model we choose for the data. The database will do all the data partitioning for you and will optimize access to these partitions. You also get availability and partition tolerance for free. So let’s see how we can build our application and do some counting on top of Riak.
Concept Two: Design your model in such a manner that you don’t have to ‘search’ to get to the objects of interest
Depicted below is a first shot at structuring our data using a key-value store.
As you can see from the above diagram, we structured our data so that the entities in the blogs bucket link to all other objects in the system. A blog object holds the links – keys – to its associated blogger, topic and blogging site objects, residing in their respective bloggers, topics and blogging sites buckets.
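The layout above can be sketched with plain Python dicts standing in for key-value buckets. The bucket names come from the text; the object keys and field names are illustrative assumptions, not the Riak client API.

```python
# Entity buckets; each dict key plays the role of a Riak object key.
bloggers = {"robert78": {"name": "Robert"}}
topics = {"politics": {"label": "Politics"}}
blogging_sites = {"blogspot": {"url": "http://blogspot.com"}}

# Each blog object links to the other entities by storing their keys.
blogs = {
    "blog-0001": {
        "title": "On elections",
        "blogger_key": "robert78",
        "topic_key": "politics",
        "site_key": "blogspot",
    }
}
```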
An algorithm to look for the number of blogs posted by user robert78 on blogspot would be as follows:
from all the objects in the blogs bucket, filter in the ones that match robert78 and blogspot
return their count
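The two steps above can be sketched as follows, again using a dict as a stand-in for the blogs bucket (keys and field names are assumptions made for illustration). Note the loop touches every object in the bucket:

```python
def count_blogs(blogs, blogger_key, site_key):
    """Walk EVERY object in the blogs bucket and count the matches.

    This is the full-bucket filter the text warns about: its cost
    grows linearly with the total number of blog objects stored.
    """
    return sum(
        1
        for blog in blogs.values()
        if blog["blogger_key"] == blogger_key and blog["site_key"] == site_key
    )

# Illustrative data.
blogs = {
    "blog-0001": {"blogger_key": "robert78", "site_key": "blogspot"},
    "blog-0002": {"blogger_key": "robert78", "site_key": "wordpress"},
    "blog-0003": {"blogger_key": "alice01", "site_key": "blogspot"},
}

print(count_blogs(blogs, "robert78", "blogspot"))  # 1
```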
Note how we must check each object in the blogs bucket to come up with a result. It is a filtering operation on ALL objects in the blogs bucket. You’ll notice that this will have to happen with each algorithm that we come up with to answer the questions we described as being the goal of our application.
Walking over all objects in a bucket is a very expensive operation (for reasons not discussed here in full: an object belongs to a bucket only because it is tagged as such, so identifying all the objects that belong to a bucket requires examining all the keys in the system) and is not recommended (see http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/#Inputs).
Riak does provide features to help with searching, but they are less efficient than knowing how to go directly to the objects that interest us. Thus, to get to the objects of interest for a computation efficiently, you should make it a goal not to have to search for them. Instead, design your model in such a manner that the algorithm that returns the objects needed for a computation always knows where the objects to visit are – in the case of Riak, this translates to knowing the keys of the objects of interest OR at least being able to filter them in from a limited list of candidates.
The main tools for achieving the goal stated above are described in the concepts that follow.
The above train of thought illustrates an important point to keep in mind when designing data models to work with NoSql: think of your model in terms of how you’re going to query the data.
Concept Three: Implement a Scans Bucket
Let’s try to improve our model so that instead of searching for objects and filtering on all the objects in the bucket we know the keys of the objects we’re interested in.
fig. 2 – model that uses data flattening to link between entities
Note we introduced a new bucket, scans. The scans bucket will hold information about the blogging sites we crawl (say, daily) for new blog posts. Scan objects hold a list of tuples identifying the site scanned, the key of the new blog post discovered, the key of the blogger who posted it and the key of the post’s topic. For the next day’s scans we create a new scan object. The keys of the scan objects encode the date on which the scans were performed.
We also modified the structure of the other objects to hold a list of scan keys. The relation: if, say, blogger robert78 posted a blog on the 15th of March 2012, then we add the key of that day’s scan to his list of scans, meaning “there is at least one reference to user robert78 in the scans performed on the 15th of March 2012”. The same relation holds for the other entities in the system.
With this new model, the algorithm to look for the number of blogs posted by user robert78 on blogspot would be:
get the list of all scan keys where robert78 is referenced;
fetch all the scan objects’ values from the above list;
from each scan’s value (list of tuples) filter in the tuples that reference blogspot and
return their count
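The four steps above can be sketched as follows, once more with dicts standing in for buckets. The tuple layout (site key, blog key, blogger key, topic key), the date-encoding scan keys and the per-blogger list of scan keys come from the text; the concrete names are illustrative assumptions.

```python
# Scans bucket: keys encode the scan date; values are lists of
# (site_key, blog_key, blogger_key, topic_key) tuples.
scans = {
    "scan-2012-03-15": [
        ("blogspot", "blog-0001", "robert78", "politics"),
        ("wordpress", "blog-0002", "robert78", "sports"),
    ],
    "scan-2012-03-16": [
        ("blogspot", "blog-0003", "robert78", "politics"),
        ("blogspot", "blog-0004", "alice01", "sports"),
    ],
}

# Each blogger object keeps the keys of the scans that reference it.
bloggers = {
    "robert78": {"scan_keys": ["scan-2012-03-15", "scan-2012-03-16"]},
    "alice01": {"scan_keys": ["scan-2012-03-16"]},
}

def count_blogs(bloggers, scans, blogger_key, site_key):
    """Follow known keys instead of walking the whole blogs bucket."""
    total = 0
    for scan_key in bloggers[blogger_key]["scan_keys"]:      # step 1
        for site, blog, blogger, topic in scans[scan_key]:   # steps 2-3
            if blogger == blogger_key and site == site_key:
                total += 1
    return total                                             # step 4

print(count_blogs(bloggers, scans, "robert78", "blogspot"))  # 2
```

Only the scan objects referenced by robert78’s scan-key list are ever fetched; the cost now depends on how many scans mention the blogger, not on the total number of blogs in the system.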
Although our algorithm gained in complexity – it has more steps – we no longer have to go over all the elements in a bucket to get to our targeted objects. In terms of database operation cost, the new algorithm is much lighter than scanning all objects in a bucket – consider having millions of items in there; that is what scalability is all about.
The final blog in this series will address the key concepts of map-reduce operations and serializing object values, which both add to the scalability of the solution.