August 5, 2013

Big Data and Machine Learning: Building a Recommendation Engine

In the previous post on machine learning, we looked at how machine learning techniques apply to recommendation engines and gave an overview of the collaborative filtering (CF) algorithms implemented in Apache Mahout. In this post, we’ll discuss how to build a recommendation engine using Mahout.

Let us take the example of a movie rating application that allows users to rate movies and suggests other movies that they might like. The following could be a data set in which some users have rated some movies on a scale of 1 to 5 (5 being the highest). The empty cells denote that the user has not rated the movie.

Figure 1: Movie rating app

If you had to recommend one or more movies from the table to Alice, how would you do it? She really likes Rambo and Rocky, so she likes either Sylvester Stallone or action movies, and you would probably suggest Thor or Training Day. She also seems to like romantic movies somewhat, so you might suggest Before Sunset as well.

A CF recommender does not have any of this contextual information, so it relies purely on the existing relationships between users and movies. For example, Eddie also rates the same movies highly, so there is a significant chance that Alice will like other movies that Eddie has rated highly. What movies would a CF recommender likely recommend to Alice? We’ll answer this question by examining the major components in the context of Mahout – the Data Model, a notion of Similarity, the User Neighborhood and the Recommender.

Data Model

The data model is a representation of the data set in Figure 1. Mahout only accepts numeric identifiers for users and items, so we need to assign a unique numeric ID to each user and each item. Each filled cell in the table translates to a comma-separated tuple of user ID, item ID and preference value. Translating the given data set, this is the data model for Mahout:

Data Model

In Mahout, the DataModel interface represents the data model. There are implementations for loading this data from a file, a relational database or MongoDB, or you could add your own implementation. To load the data from a file, you would place each tuple on its own line.
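To make the one-tuple-per-line format concrete, here is a minimal sketch in plain Python (not Mahout's API) that parses such lines into a nested dictionary. The function name `load_preferences` and the sample IDs and ratings are hypothetical; they are not the values from the table in this post.

```python
from collections import defaultdict


def load_preferences(lines):
    """Parse 'userID,itemID,preference' lines into {user_id: {item_id: rating}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        user_id, item_id, value = line.split(",")
        prefs[int(user_id)][int(item_id)] = float(value)
    return dict(prefs)


# Hypothetical tuples for illustration only.
sample_lines = [
    "1,101,5.0",
    "1,102,3.0",
    "2,101,4.0",
    "2,103,2.5",
]
model = load_preferences(sample_lines)
```

Keying by user first makes it cheap to fetch everything one user has rated, which is the lookup a user-based recommender needs most often.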


Similarity

Given the data model, the recommendation engine must build a measure of similarity. For example, in the given data model, it would seem that Alice and Eddie have similar tastes, since they both seem to like Rocky and Rambo and like Garden State about as much.

Recall from the earlier post that a CF recommender can compute the similarity either between users or between items. Mahout represents similarity between users with the UserSimilarity interface and similarity between items with the ItemSimilarity interface. Mahout implements several algorithms that define measures of similarity, such as Pearson Correlation, Spearman Correlation and Log Likelihood.

Regardless of the algorithm used, the inputs to the algorithm are two users (or two items), the items they have expressed preferences for (or the users who prefer these items) and the preference values; the output is a measure of the similarity between the two users (or items).
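As an illustration of those inputs and outputs, the sketch below computes a Pearson correlation between two users over the items both have rated. It is plain Python rather than Mahout's implementation; the function name `pearson_similarity`, the sample ratings, and the choice to return `None` when the correlation is undefined are all assumptions made for this example.

```python
from math import sqrt


def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over co-rated items; each argument is {item_id: rating}.

    Returns a value in [-1, 1], or None when it is undefined
    (fewer than two co-rated items, or no variance in either user's ratings).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return covariance / sqrt(var_x * var_y)


# Hypothetical ratings: user_b tracks user_a one point lower, user_c is the reverse.
user_a = {101: 5.0, 102: 3.0, 103: 4.0}
user_b = {101: 4.0, 102: 2.0, 103: 3.0, 104: 5.0}
user_c = {101: 1.0, 102: 5.0, 103: 3.0}
```

Note that only co-rated items contribute: item 104 is ignored when comparing `user_a` and `user_b`, because `user_a` has not rated it.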

User Neighborhood

For user-based recommenders, once the similarity between users has been computed, it can become computationally very expensive to examine every item for which the other users have expressed some preference in order to make a recommendation. This is often unnecessary, as dissimilar users will not have an impact on the final result. To limit the number of other users considered, a neighborhood of similar users is used.

In Mahout, this is expressed by the UserNeighborhood interface, and an implementation can either consider the nearest N users or use a similarity threshold. Determining the ideal neighborhood size is a matter of experimentation. If it is too small, it is likely to omit similar users; if it is too big, you get the same recommendations for a higher computing price. Building the neighborhood requires both the Data Model and the (User) Similarity components.
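The two neighborhood strategies can be sketched as follows, assuming the similarities between the target user and every other user have already been computed. This is plain Python, not Mahout's API, and the function names and similarity values are hypothetical.

```python
def nearest_n_neighborhood(similarities, n):
    """Return the n most similar user IDs, most similar first.

    similarities: {other_user_id: similarity to the target user}
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user_id for user_id, _ in ranked[:n]]


def threshold_neighborhood(similarities, threshold):
    """Return every user ID whose similarity is at least the threshold."""
    return sorted(u for u, s in similarities.items() if s >= threshold)


# Hypothetical similarities of users 2..5 to the target user.
sims = {2: 0.9, 3: -0.4, 4: 0.6, 5: 0.1}
top_two = nearest_n_neighborhood(sims, 2)
above_half = threshold_neighborhood(sims, 0.5)
```

The nearest-N variant gives a fixed, predictable amount of work per recommendation, while the threshold variant lets the neighborhood grow or shrink with how many genuinely similar users exist.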

There is no such thing as an “ItemNeighborhood” because the item-based recommender already limits the items to the ones for which the user (asking for the recommendation) has expressed a preference.


Recommender

The recommender is the application-facing component that is responsible for making the recommendations. Mahout supports both user-based recommenders and item-based recommenders via the UserBasedRecommender and ItemBasedRecommender interfaces respectively.

User-based Recommender

A user-based recommender finds other users similar to a given user and recommends the top preferred items of the other users for which the given user has not expressed any preference.
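One common way to turn this description into numbers is to estimate the target user's preference for each unrated item as a similarity-weighted average of the neighbors' ratings. The sketch below does that in plain Python; it is an illustration under that assumption, not a description of how Mahout's recommenders estimate preferences internally, and the function name and data are hypothetical.

```python
def recommend_for_user(user_id, prefs, neighbor_sims, how_many):
    """Rank items the user has not rated by a similarity-weighted average.

    prefs:         {user_id: {item_id: rating}}
    neighbor_sims: {neighbor_id: similarity to user_id}
    Returns up to how_many (item_id, estimated_rating) pairs, best first.
    """
    already_rated = prefs.get(user_id, {})
    weighted_sum = {}
    sim_total = {}
    for neighbor, sim in neighbor_sims.items():
        if sim <= 0:
            continue  # assumption: ignore neighbors that are not positively similar
        for item, rating in prefs.get(neighbor, {}).items():
            if item in already_rated:
                continue
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            sim_total[item] = sim_total.get(item, 0.0) + sim
    estimates = {item: weighted_sum[item] / sim_total[item] for item in weighted_sum}
    # Highest estimate first; ties broken by item ID for a stable order.
    ranked = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:how_many]


# Hypothetical data: user 1 is the target, users 2 and 3 are the neighborhood.
prefs = {
    1: {101: 5.0, 102: 4.0},
    2: {101: 5.0, 103: 4.0, 104: 2.0},
    3: {102: 4.0, 103: 2.0},
}
neighbor_sims = {2: 0.75, 3: 0.25}
recommendations = recommend_for_user(1, prefs, neighbor_sims, 2)
```

Here item 103 is estimated at (0.75 × 4.0 + 0.25 × 2.0) / (0.75 + 0.25) = 3.5, and item 104, rated only by user 2, at 2.0.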

Component Interaction

Mahout provides two implementations for a user-based recommender – a generic recommender and a Boolean preferences recommender. The latter is used when the data set does not contain preference values; instead, every (user ID, item ID) pair that is present signifies a preference of 1.0 and every pair that is not present signifies a preference of 0.
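A Boolean data set of this kind can be pictured as a set of (user ID, item ID) pairs. The sketch below, in plain Python with hypothetical function names and IDs, shows how the presence or absence of a pair maps to a preference of 1.0 or 0.

```python
def load_boolean_preferences(lines):
    """Parse 'userID,itemID' lines into a set of (user_id, item_id) pairs."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        user_id, item_id = line.split(",")[:2]
        pairs.add((int(user_id), int(item_id)))
    return pairs


def preference(pairs, user_id, item_id):
    """1.0 if the pair is present in the data set, otherwise 0.0."""
    return 1.0 if (user_id, item_id) in pairs else 0.0


# Hypothetical pairs, e.g. "user 1 watched movie 101".
boolean_model = load_boolean_preferences(["1,101", "1,102", "2,101"])
```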

Item-based Recommender

An item-based recommender takes into account the items for which a given user has expressed some preference and recommends the other items that are most similar to them. As mentioned above, item-based recommenders do not need to compute a neighborhood, but they use all the other components. Item-based recommenders may be used when the number of items is low compared to the number of users, in which case they could provide a significant performance advantage.
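The item-based approach can be sketched in the same style, assuming item-to-item similarities have already been computed. Each candidate item is scored by a similarity-weighted average of the ratings the user gave to the items they already rated. This is plain Python for illustration, not Mahout's implementation, and the names and values are hypothetical.

```python
def recommend_items(user_prefs, item_sims, how_many):
    """Rank unrated items by similarity to the items the user has rated.

    user_prefs: {item_id: rating} for the target user
    item_sims:  {(item_a, item_b): similarity}, stored once per unordered pair
    Returns up to how_many (item_id, estimated_rating) pairs, best first.
    """
    def sim(a, b):
        # Look the pair up in either order; unknown pairs count as 0.
        return item_sims.get((a, b), item_sims.get((b, a), 0.0))

    all_items = {item for pair in item_sims for item in pair}
    candidates = all_items - set(user_prefs)
    estimates = {}
    for candidate in candidates:
        weighted_sum = 0.0
        sim_total = 0.0
        for rated_item, rating in user_prefs.items():
            s = sim(candidate, rated_item)
            if s <= 0:
                continue  # assumption: only positively similar items contribute
            weighted_sum += s * rating
            sim_total += s
        if sim_total > 0:
            estimates[candidate] = weighted_sum / sim_total
    ranked = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:how_many]


# Hypothetical ratings and item-item similarities.
user_prefs = {101: 5.0, 102: 3.0}
item_sims = {
    (101, 103): 0.5,
    (102, 103): 0.5,
    (101, 104): 0.25,
    (102, 104): 0.75,
}
item_recommendations = recommend_items(user_prefs, item_sims, 2)
```

The inner loop only visits items the user has already rated, which is why no separate neighborhood step is needed.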

Mahout provides a number of item-based implementations – a generic recommender, Boolean preferences recommender, Slope one, SVD and KNN.

CF Recommendations for Dataset

We can now run a CF recommender for the dataset and see the results we get. For the computations, a generic user-based recommender has been used, with Pearson Correlation similarity and a neighborhood of 3 users.

CF Recommendations for Dataset

If you go back to what we recommended to Alice with our contextual information (Thor / Training Day, Before Sunset), the CF recommender came eerily close to our thought process!

In practice, a recommendation engine has to deal with much larger datasets and real-time information as well. In the next post we’ll cover a very interesting idea where automated sensors monitor the users’ fitness activities and health products are suggested to them using fitness trends as a measure of similarity between users.