A run through Riak's MapReduce - Part 1 of 3
5:23 pm, Apr 26, 2012
Introduction to Riak
Riak is a key-value store, an implementation of the specification in Amazon's Dynamo paper. It is written mainly in Erlang, a functional, concurrency-oriented programming language initially developed by Ericsson, with some parts built in C and C++ (storage backends such as Bitcask and LevelDB). Since Riak is not a classical relational database, querying it is different from the SQL queries we are all familiar with. NoSQL databases like Riak offer multiple ways to access data, and each way has its own benefits depending on the conditions of usage, so it is important to understand how these access mechanisms work.
Some of the usual ways to access data in Riak are:
- key lookup (the recommended way of getting data out; see the sketch after this list)
- secondary indexes (introduced in Riak 1.0)
- link walking
- Riak Search (full-text search with Lucene-style query syntax and a Solr-like API)
- MapReduce (the topic of this post and the rest of this series)
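As an illustration of the first access pattern, here is a minimal sketch of a key lookup from the Erlang shell of a Riak node (e.g. via riak attach), assuming the Erlang client API of the 1.0 era; the bucket and key names are invented for the example:

    %% Attach to the local Riak node and fetch one value by key.
    {ok, Client} = riak:local_client(),

    %% Fetch the object under bucket <<"artists">>, key <<"theline">>
    %% (hypothetical names), asking that R = 2 replicas answer.
    {ok, Object} = Client:get(<<"artists">>, <<"theline">>, 2),

    %% Extract the stored value from the object.
    Value = riak_object:get_value(Object).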
What is MapReduce and how is it used with Riak?
Note: the current stable version of Riak at the time this post was written was 1.0.3, and this blog series uses that version as a baseline.
MapReduce’s purpose is to distribute a query across multiple processes to reap the benefits of a distributed system. Riak’s MapReduce has two steps. The first is the Map step, which breaks a large chunk of work into smaller pieces and acts on each one of them. The second is the Reduce step, which combines the many results from the Map step into the single desired result.
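To make the two steps concrete, here is a hedged sketch of a small MapReduce job run through the Erlang client, using two phase functions that ship with Riak in the riak_kv_mapreduce module; the bucket and keys are invented for the example:

    %% Run a small MapReduce job through the local Erlang client.
    {ok, Client} = riak:local_client(),

    %% Inputs: the exact bucket/key pairs the Map step should run on.
    Inputs = [{<<"orders">>, <<"order1">>},
              {<<"orders">>, <<"order2">>},
              {<<"orders">>, <<"order3">>}],

    %% Map phase: extract each object's value on the node storing it.
    %% Reduce phase: count the results produced by the Map phase.
    Query = [{map,    {modfun, riak_kv_mapreduce, map_object_value},    none, false},
             {reduce, {modfun, riak_kv_mapreduce, reduce_count_inputs}, none, true}],

    {ok, Result} = Client:mapred(Inputs, Query).

The false/true flag at the end of each phase controls whether that phase's output is included in the final result.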
Before we dive deeper into the mechanisms of Riak’s MapReduce, we want to explain something unique about Riak. Riak is organized around a concept called the ‘bucket’. A bucket is a logical separation of entities (key-value pairs), and it comes with configurable properties. Physically, there is no bucket structure in Riak: a key is effectively addressed as hash(bucket_name + key_name), not as bucket/hash(key_name). A Riak bucket should therefore not be thought of as a relational database table. Armed with this knowledge about buckets, here are some implications we need to note.
- Typically, in a production environment it is not recommended to list all the keys in a bucket. If we make such a request, Riak has to walk the entire keyspace and pick out only the keys that belong to the specified bucket, which makes the operation highly resource intensive (a sketch of what this looks like follows after this list).
- Another implication is that a bucket cannot be dropped in one operation the way a relational table can. Since there is no physical bucket structure, deleting all data in a bucket means deleting every key in it individually.
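To make the first implication concrete, this is roughly what the discouraged operations look like from the Erlang client (a sketch assuming the 1.0-era client API; the bucket name is invented):

    %% WARNING: list_keys walks the entire keyspace and filters by
    %% bucket, so it is expensive and discouraged in production.
    {ok, Client} = riak:local_client(),
    {ok, Keys} = Client:list_keys(<<"orders">>),

    %% "Deleting a bucket" likewise means deleting each key one by one
    %% (RW = 2 replicas must acknowledge each delete).
    [Client:delete(<<"orders">>, Key, 2) || Key <- Keys].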
Why is understanding buckets important in Riak? Because, when running a MapReduce query in Riak, it is imperative to know beforehand the input keys that the map phase will run on. That way Riak does not have to scan the full dataset to find the needed input keys (a full scan is possible, but, again, not recommended in a production environment).
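In client terms the difference looks roughly like this: an explicit list of bucket/key pairs keeps the job bounded, while naming only the bucket forces the full scan described above. This is a sketch assuming the 1.0-era Erlang client, where mapred_bucket is the whole-bucket variant:

    {ok, Client} = riak:local_client(),
    Query = [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}],

    %% Preferred: the map phase runs only over keys we already know.
    {ok, Bounded} = Client:mapred([{<<"orders">>, <<"order1">>},
                                   {<<"orders">>, <<"order2">>}], Query),

    %% Possible, but discouraged in production: using the whole bucket
    %% as input means Riak must first find every key in it.
    {ok, Costly} = Client:mapred_bucket(<<"orders">>, Query).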
One important thing to mention here is that Riak’s MapReduce does not serve the same purpose as Hadoop’s batch processing. By default, a Riak MapReduce job has a timeout of 60 seconds (which can be raised or lowered), because it is designed for low-latency jobs/queries, e.g. finishing a MapReduce job within a single HTTP request.
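For example, the Erlang client's mapred call accepts an explicit timeout in milliseconds as a third argument, so a job that needs more than the default can be given its own budget (again a sketch against the 1.0-era client API):

    %% The same kind of job, with a 90-second timeout instead of the
    %% 60-second default.
    {ok, Client} = riak:local_client(),
    Inputs = [{<<"orders">>, <<"order1">>}],
    Query  = [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}],
    {ok, Result} = Client:mapred(Inputs, Query, 90000).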