A Run Through Riak's MapReduce - Blog Part 2 of 3
11:26 am, May 1, 2012
The Anatomy of the Riak MapReduce
In the previous blog we discussed some MapReduce basics in Riak and the concept of a “bucket”, which acts as a logical container for key-value pairs. Let us dive a little deeper into Riak’s MapReduce in this blog.
Riak’s MapReduce queries have two components:
- a list of inputs
- a list of ‘phases’.
Each element of the input list is a bucket-key pair. Each element of the phases list describes a map, a reduce or a link function: details on where to find that function; the body of the function; static data passed to the function; and a flag that indicates whether the result of the phase is included in the final result.
One important characteristic of Riak’s MapReduce is that you can combine the map and reduce phases however you like, in whichever order you like (e.g. map phase -> map phase -> reduce phase -> reduce phase -> etc.), as opposed to other NoSQL solutions out there which restrict that to a specific order (map, followed by reduce).
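To make the two components concrete, here is a minimal sketch of how inputs and phases might be written as Erlang terms for the Riak Erlang Client. The module name `mr_query_sketch` is hypothetical, and the single map phase uses `riak_kv_mapreduce:map_object_value`, one of the functions that ships with Riak, purely for illustration.

```erlang
-module(mr_query_sketch).
-export([inputs/0, phases/0]).

%% Inputs: a list of {Bucket, Key} pairs, both given as binaries.
inputs() ->
    [{<<"tweets">>, <<"1212123">>},
     {<<"users">>,  <<"alinpopa">>}].

%% Phases: each element is {Type, FunctionSpec, StaticArg, Keep}.
phases() ->
    [{map,
      {modfun, riak_kv_mapreduce, map_object_value}, % where to find the function
      none,                                          % static data passed to it
      true}].                                        % keep this phase's result
```

Both functions only build data; nothing is sent to Riak until the terms are handed to the client.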
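As a rough illustration of chaining phases, the sketch below lines up one map phase followed by two reduce phases and keeps only the last result. The module name `mr_chain_sketch` and the helper `kept_phase_indexes/1` are hypothetical; the three `riak_kv_mapreduce` functions are ones that ship with Riak, chosen here only as an example.

```erlang
-module(mr_chain_sketch).
-export([phases/0, kept_phase_indexes/1]).

%% map -> reduce -> reduce: extract values, de-duplicate them, sort them.
%% Only the last phase has its keep flag set to true.
phases() ->
    [{map,    {modfun, riak_kv_mapreduce, map_object_value}, none, false},
     {reduce, {modfun, riak_kv_mapreduce, reduce_set_union}, none, false},
     {reduce, {modfun, riak_kv_mapreduce, reduce_sort},      none, true}].

%% Zero-based positions of the phases whose keep flag is true.
kept_phase_indexes(Phases) ->
    Indexed = lists:zip(lists:seq(0, length(Phases) - 1), Phases),
    [I || {I, {_Type, _Fun, _Arg, true}} <- Indexed].
```

Reordering the list, or adding another map phase at the front, is all it takes to change the pipeline.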
For more information on MapReduce, see the original Google paper.
To demonstrate usage, we’re going to use a Twitter-like data model, which keeps the examples simple and is easy to visualize.
We’ll have the following buckets: “tweets”, where we’ll persist the tweets, and “users”, where we’ll persist the user profile information.
Key: 1212123 (bucket: “tweets”)
Key: “alinpopa” (bucket: “users”)
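For readers who want to try this out, here is a hedged sketch of how two such entries might be stored with the Riak Erlang Client. The JSON field names and values are purely hypothetical placeholders, as is the module name `mr_store_sketch`; only the bucket names and keys come from the model above.

```erlang
-module(mr_store_sketch).
-export([records/0, store/1]).

%% {Bucket, Key, Value} triples. The JSON bodies are made-up placeholders.
records() ->
    [{<<"tweets">>, <<"1212123">>,
      <<"{\"user\":\"alinpopa\",\"text\":\"hello riak\"}">>},
     {<<"users">>, <<"alinpopa">>,
      <<"{\"screen_name\":\"alinpopa\"}">>}].

%% Store every record through an already connected client process.
store(Pid) ->
    lists:foreach(
      fun({Bucket, Key, Value}) ->
              Obj = riakc_obj:new(Bucket, Key, Value, "application/json"),
              ok = riakc_pb_socket:put(Pid, Obj)
      end,
      records()).
```

Here `Pid` is assumed to come from `riakc_pb_socket:start_link/2`.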
We will now look at ways to retrieve information from these two buckets, using MapReduce. As I’ve mentioned, there are multiple ways that you can work with MapReduce in Riak, so I’m going to walk you through this.
Using Erlang ProtocolBuffers Client
Assuming that your application is written in Erlang, and you want to have access to Riak, then one option would be to use the existing Riak Erlang Client. This client uses Riak’s ProtocolBuffers API for sending/retrieving data.
Let’s try a simple fetch of all tweets from the database (of course, you wouldn’t do something like that in your day-to-day work or on production data):
Having the following returned value:
Let us look at the above code in detail:
- Line 4: Connects to Riak using the Protocol Buffers port
- Line 7: Displays the result to stdout
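The steps described above (connect, run the query, print the result) can be sketched as follows. This is a minimal example under assumptions, not the post's own listing, so its line numbers need not match the bullets: the module name `mr_fetch_sketch` is hypothetical, and the host `127.0.0.1` and port `8087` assume a local node with the default Protocol Buffers port.

```erlang
-module(mr_fetch_sketch).
-export([inputs/0, query/0, run/0]).

%% Inputs as a list of {Bucket, Key} tuples, one per tweet to fetch.
inputs() ->
    [{<<"tweets">>, <<"1212123">>}].

%% A single map phase that returns each object's value and keeps the result.
query() ->
    [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}].

run() ->
    %% Connect to Riak over the Protocol Buffers port (assumed local default).
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    {ok, Results} = riakc_pb_socket:mapred(Pid, inputs(), query()),
    %% Display the result on stdout.
    io:format("~p~n", [Results]),
    riakc_pb_socket:stop(Pid).
```

Calling `mr_fetch_sketch:run().` from a shell with the client on the code path should print a list of `{PhaseIndex, Values}` tuples, one per kept phase.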
Also, if you really, really, really want to get all the values from a bucket, you can do that as follows (don’t try this on a production system, please!):
Notice that, instead of specifying a list of tuples, we’re using a binary (<<"tweets">>) that names the bucket to query. In our final blog, we’ll explore the differences between using Java and Erlang.
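The bucket-wide variant described above might look like the sketch below, where the input is the bucket name as a binary rather than a list of tuples. As before, the module name `mr_bucket_sketch` is hypothetical and the host and port assume a local node with default settings. Riak has to walk every key in the bucket for this, which is why it is best kept away from production.

```erlang
-module(mr_bucket_sketch).
-export([inputs/0, query/0, run/0]).

%% The whole bucket is the input: a binary instead of a list of tuples.
inputs() ->
    <<"tweets">>.

%% Same single map phase: return every object's value and keep the result.
query() ->
    [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}].

run() ->
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
    {ok, Results} = riakc_pb_socket:mapred(Pid, inputs(), query()),
    io:format("~p~n", [Results]),
    riakc_pb_socket:stop(Pid).
```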