Introduction to Data Aggregation with NoSql Databases: Blog Series Part iii

Concept Four: For map-reduce operations, take advantage of ‘data locality’

In the context of a distributed system data locality describes how it is more expensive to transfer data to the node that knows how to process it rather than taking the computation to the node that holds the data.

For map-reduce operations Riak provides data locality for map operations – ie. all mapping will be performed on the nodes that hold the objects subjected to the map phases. Contrast this to reduce phases, which will be always run on the ‘coordinating node’ for the map-reduce operation.  For more details on how map-reduce operations work with Riak, read on Basho’s Riak map-reduce page:

  • To minimize the cost of a map-reduce operation on a Riak cluster, try to resume the map-reduce chaining to a single reduce – if not eliminate it at all (will talk about this in a little bit).
  • If you do want to chain in a reduce operation, try to keep the computing that is done in this phase to a minimum and feed it the least amount of input possible.
  • To get back to not having a reduce phase at all – map operations can be configured to return their results to the client invoking the map-reduce operation; thus you should evaluate if having a reduce operation in the map-reduce chain will be more effective if is to be performed on a node in the Riak cluster – or – on the API client itself.

Note that in the context of our application one can tune the amount of filtering you do on the blog objects with the amount of a scan’s object size to the expense of the number of get operations you’ll perform on the scans bucket. For example, if you decide to store a month’s worth of scanning information in one scan bucket, you’ll do approximately 30 times more filtering and 30 times less fetching from that bucket. Always time your queries as you go along with the project. Figure out a good balance from the metrics in the non-functional requirements – number of users, expected growth rate etc.

Concept Five: Serialize object values in a format that can be processed by code that lives on cluster nodes.

As mentioned above, getting to the objects that are subject to a computation can be done by using Riak’s capability of chaining map operations. If you choose to do so there’s a good chance that with every map operation you’ll have to take a peek at the objects’ values involved. Map operations are performed on the nodes that hold the data – that means that the nodes must know at least how to deserialize the data stored in an object.

Out-of-the-box, Riak nodes speak Erlang and JavaScript, so it is only natural that you’ll think about storing your object’s data in either Erlang binary terms or JSON. JavaScript can’t understand Erlang binary terms but Riak’s Erlang libraries can understand JSON, so JSON is common ground for both. Thus the choice that will suit most applications would be to store an object’s data in JSON format as most applications sitting on top of Riak won’t be written in Erlang.

Processing object data with Erlang will bring you a performance increase over working with JavaScript, as JavaScript is a layer sitting on top of Riak’s Erlang code and every JavaScript object you’ll manipulate will eventually get translated to an Erlang term.

If for some reason neither Erlang binary terms or JSON fit your needs you’ll most likely have to transfer in each object’s value to the API client for processing with a suitable library OR deploy on every Riak node Erlang or JavaScript code that will know to interpret your object’s data format.


Hopefully this blog series will prove useful in starting with NoSQL in terms of data modeling and data aggregation. The code provided should also shed some light on doing map-reduce queries with Riak’s Erlang client.

Robert Cristian

Robert Cristian

Director of Advanced Technology

Cristian Robert is the Director of the Advanced Technology Group for 3Pillar Global’s Romanian branch. In this position Robert is responsible for driving R&D efforts and advancing the technical expertise of 3Pillar Global. His main interests include software architecture design, reactive systems and functional programming.

Leave a Reply

Related Posts

How to Manage the “Need for Speed” Without Sacri... The pace of innovation today is faster than it has ever been. Customers are much more active and vocal thanks to social and mobile channels, and the c...
Determining the First Release The first thing you release needs to put the solution to your customer's most important problem in their hands. Deciding what the most important probl...
The Art of Building Rapid (and Valuable) Proofs of Concept Clients and stakeholders want results. They want assurances that their investment is well spent and they're building the right product. The software d...
Are You Doing Stuff or Creating Value? You can put a bunch of stickies on the wall, create tons of JIRA tickets, and commit lots of code, but are you creating value? Is the work your produc...
Costovation – Giving Your Customers Exactly What They ... On this episode of The Innovation Engine podcast, we delve into “cost-ovation,” or innovation that gives your customers exactly what they want – and n...