Big Data and Machine Learning: Building a Recommendation Engine

In the previous post on machine learning, we looked at how machine learning techniques apply to recommendation engines and gave an overview of the collaborative filtering (CF) algorithms implemented in Apache Mahout. In this post, we’ll discuss how to build a recommendation engine using Mahout.

Let us take the example of a movie rating application that allows users to rate movies and suggests other movies they might like. The following data set shows how some users have rated some movies on a scale of 1 to 5 (5 being the highest). An empty cell denotes that the user has not rated the movie.

Figure 1: Movie rating app data set (user ratings of movies on a scale of 1 to 5)

If you had to recommend one or more movies from the table to Alice, how would you do it? She really likes Rambo and Rocky, so she likes either Sylvester Stallone or action movies, and you would probably suggest Thor or Training Day. She also seems to like romantic movies to some extent, so you might suggest Before Sunset as well.

A CF recommender does not have any of this contextual information, so it relies purely on the existing relationships between users and movies. For example, Eddie also rates the same movies highly, so there is a significant chance that Alice will like other movies that Eddie rated highly. What movies would a CF recommender likely recommend to Alice? We’ll answer this question by examining the major components in the context of Mahout – the Data Model, a notion of Similarity, the User Neighborhood and the Recommender.

Data Model

The data model is a representation of the data set in Figure 1. Mahout only accepts numeric identifiers for users and items, so we need to assign a unique numeric ID to each user and each item. Each filled cell in the table translates to a comma-separated tuple of user ID, item ID and preference value. Translating the given data set, this is the data model for Mahout:

Figure: Data model (one user ID, item ID, preference value tuple per rating)

In Mahout, the DataModel interface represents the data model. There are implementations for loading this data from a file, a relational database or MongoDB, or you can add your own implementation. To load the data from a file, place each tuple on its own line.
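
To make the file format concrete, here is a minimal sketch in plain Python (not Mahout's API) that parses such tuples into an in-memory structure. The IDs and ratings in `SAMPLE` and the function name `load_preferences` are hypothetical and are not taken from the data set above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: user ID, item ID, preference value.
# These values are illustrative only, not the article's data set.
SAMPLE = """\
1,101,5.0
1,102,4.0
2,101,4.5
2,103,3.0
"""

def load_preferences(lines):
    """Parse 'userID,itemID,preference' rows into {user_id: {item_id: preference}}."""
    prefs = defaultdict(dict)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, item_id, value = int(row[0]), int(row[1]), float(row[2])
        prefs[user_id][item_id] = value
    return dict(prefs)

prefs = load_preferences(io.StringIO(SAMPLE))
print(prefs)
```

Keying the outer dictionary by user ID makes it cheap to look up everything one user has rated, which is what the similarity and recommender steps need most often.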

Similarity

Given the data model, the recommendation engine must build a measure of similarity. For example, in the given data model, it seems that Alice and Eddie have similar tastes, since they both like Rocky and Rambo and rate Garden State about the same.

Recall from the earlier post that a CF recommender can compute the similarity either between users or between items. Mahout represents similarity between users with the UserSimilarity interface and similarity between items with the ItemSimilarity interface. Mahout implements several algorithms that define measures of similarity, such as Pearson Correlation, Spearman Correlation and Log Likelihood.

Regardless of the algorithm used, the inputs are two users (or two items), the items they have expressed preferences for (or the users who have expressed preferences for these items) and the preference values; the output is a measure of the similarity between the two users (or items).
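
As an illustration of those inputs and outputs, the following plain-Python sketch computes a Pearson correlation between two users over the items both have rated. It is not Mahout's implementation; the function name and the ratings are hypothetical, and returning `None` for undefined cases is a choice made for this sketch.

```python
from math import sqrt

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users have rated.

    Returns None when the correlation is undefined: fewer than two
    shared items, or zero variance in either user's shared ratings.
    """
    shared = sorted(set(prefs_a) & set(prefs_b))
    n = len(shared)
    if n < 2:
        return None
    xs = [prefs_a[i] for i in shared]
    ys = [prefs_b[i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings keyed by item ID (not the article's table).
alice = {101: 5.0, 102: 4.0, 103: 2.0}
eddie = {101: 4.0, 102: 3.0, 103: 1.0, 104: 5.0}
frank = {101: 1.0, 102: 2.0, 103: 5.0}

print(pearson_similarity(alice, eddie))  # close to +1: same rating pattern
print(pearson_similarity(alice, frank))  # strongly negative: opposite pattern
```

Note that Pearson correlation compares rating patterns rather than absolute values, so a user who consistently rates one point lower than another still counts as highly similar.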

User Neighborhood

For user-based recommenders, once the similarity between users has been computed, it can become computationally very expensive to examine every item for which the other users have expressed a preference in order to make a recommendation. This is often unnecessary, as dissimilar users will not have an impact on the final result. To limit the number of other users considered, a neighborhood of similar users is used.

In Mahout, this is expressed by the UserNeighborhood interface, and the implementation can either consider the nearest N users or use a similarity threshold. Determining the ideal neighborhood size is a matter of experimentation. If it is too small, it is likely to omit similar users; if it is too big, you get the same recommendations at a higher computing cost. Building the neighborhood requires both the Data Model and the (User) Similarity components.
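
The two neighborhood strategies can be sketched in plain Python as follows. This is only an illustration of the idea, not Mahout's code: the function names are hypothetical, the similarity scores are made up and assumed to be precomputed, and user names are used instead of numeric IDs purely for readability.

```python
def nearest_n_neighborhood(user, similarities, n):
    """Return up to n users most similar to `user`, most similar first."""
    scored = [(other, s) for (a, other), s in similarities.items() if a == user]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:n]]

def threshold_neighborhood(user, similarities, threshold):
    """Return every user whose similarity to `user` is at least `threshold`."""
    return sorted(
        other
        for (a, other), s in similarities.items()
        if a == user and s >= threshold
    )

# Hypothetical, precomputed similarity scores keyed by (user, other user).
similarities = {
    ("alice", "bob"): 0.10,
    ("alice", "carol"): -0.40,
    ("alice", "eddie"): 0.95,
    ("alice", "frank"): 0.60,
}

print(nearest_n_neighborhood("alice", similarities, 2))
print(threshold_neighborhood("alice", similarities, 0.5))
```

A nearest-N neighborhood always returns a fixed number of users even if some are only weakly similar, while a threshold neighborhood varies in size but guarantees a minimum similarity.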

There is no such thing as an “ItemNeighborhood”. This is because the item-based recommender already limits the items to the ones for which the user asking for the recommendation has expressed a preference.

Recommender

The recommender is the application-facing component responsible for making the recommendations. Mahout supports both user-based and item-based recommenders via the UserBasedRecommender and ItemBasedRecommender interfaces, respectively.

User-based Recommender

A user-based recommender finds other users similar to a given user and recommends the top preferred items of the other users for which the given user has not expressed any preference.

Figure: Component interaction

Mahout provides two implementations of a user-based recommender – a generic recommender and a Boolean preferences recommender. The latter is used when the data set does not contain preference values; instead, every (user ID, item ID) pair that is present signifies a preference of 1.0, and every pair that is not present signifies a preference of 0.
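
To show how the components come together, here is a minimal plain-Python sketch of a user-based recommender. It assumes one common approach, estimating each unseen item as the similarity-weighted average of the neighbors' ratings; it is not a reproduction of Mahout's recommender, and the function name, IDs, ratings and similarity scores are all hypothetical.

```python
def recommend_user_based(user, prefs, neighbors, how_many=2):
    """Rank items the user has not rated by the similarity-weighted
    average of the neighbors' ratings.

    prefs:     {user_id: {item_id: rating}}
    neighbors: {neighbor_id: similarity to `user`} (assumed precomputed)
    """
    seen = prefs[user]
    totals, weights = {}, {}
    for other, sim in neighbors.items():
        if sim <= 0:
            continue  # ignore dissimilar users
        for item, rating in prefs[other].items():
            if item in seen:
                continue  # only recommend items the user has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    estimates = {item: totals[item] / weights[item] for item in totals}
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:how_many]

# Hypothetical data: users 1-3, items 101-104 (not the article's table).
prefs = {
    1: {101: 5.0, 102: 5.0},
    2: {101: 5.0, 102: 4.0, 103: 4.0, 104: 2.0},
    3: {101: 4.0, 103: 5.0, 104: 3.0},
}
neighbors = {2: 0.9, 3: 0.6}  # similarity of users 2 and 3 to user 1

recs = recommend_user_based(1, prefs, neighbors)
print(recs)
```

With Boolean preferences, every rating in this sketch would be 1.0, so a weighted average would no longer separate the candidates and the ranking would have to rely on something else, such as how many neighbors (and how similar) prefer each item.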

Item-based Recommender

An item-based recommender takes into account the items for which a given user has expressed some preference and recommends the other items most similar to them. As mentioned above, item-based recommenders do not need to compute a neighborhood, but they use all the other components. Item-based recommenders may be used when the number of items is low compared to the number of users, in which case they could provide a significant performance advantage.
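
The item-based idea can be sketched the same way. The plain-Python example below scores each unrated item using only its similarity to the items the user has already rated, which is why no neighborhood is needed. It is an illustration under assumed inputs, not Mahout's implementation: the function name, IDs, ratings and item-to-item similarity scores are hypothetical.

```python
def recommend_item_based(user_prefs, item_similarity, all_items, how_many=2):
    """Rank unrated items by the similarity-weighted average of the
    user's own ratings on the items they have already rated.

    user_prefs:      {item_id: rating} for the user asking for recommendations
    item_similarity: {frozenset({item_a, item_b}): similarity} (assumed precomputed)
    """
    estimates = {}
    for candidate in all_items:
        if candidate in user_prefs:
            continue  # already rated, nothing to recommend
        total = weight = 0.0
        for rated_item, rating in user_prefs.items():
            sim = item_similarity.get(frozenset((candidate, rated_item)), 0.0)
            if sim > 0:
                total += sim * rating
                weight += sim
        if weight > 0:
            estimates[candidate] = total / weight
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:how_many]

# Hypothetical data: the user loved item 101 and disliked item 102.
user_prefs = {101: 5.0, 102: 2.0}
item_similarity = {
    frozenset((101, 103)): 0.9,
    frozenset((102, 103)): 0.1,
    frozenset((101, 104)): 0.2,
    frozenset((102, 104)): 0.8,
}

recs = recommend_item_based(user_prefs, item_similarity, [101, 102, 103, 104])
print(recs)
```

Because item-to-item similarities depend only on the catalogue and its ratings, they can be computed ahead of time, which is one reason this approach can be cheaper when there are far fewer items than users.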

Mahout provides a number of item-based implementations – a generic recommender, a Boolean preferences recommender, Slope One, SVD and KNN.

CF Recommendations for Dataset

We can now run a CF recommender for the dataset and see the results we get. For the computations, a generic user-based recommender has been used, with Pearson Correlation similarity and a neighborhood of 3 users.

Figure: CF recommendations for the dataset

If you go back to what we recommended to Alice with our contextual information (Thor / Training Day, Before Sunset), the CF recommender came eerily close to our thought process!

In practice, a recommendation engine has to deal with much larger datasets and real-time information as well. In the next post we’ll cover a very interesting idea where automated sensors monitor the users’ fitness activities and health products are suggested to them using fitness trends as a measure of similarity between users.

Sayantam Dey

Senior Director Engineering

Sayantam Dey is the Senior Director of Engineering at 3Pillar Global, working out of our office in Noida, India. He has been with 3Pillar for ten years, delivering enterprise products and building frameworks for accelerated software development and testing in various technologies. His current areas of interest are data analytics, messaging systems and cloud services. He has authored the ‘Spring Integration AWS’ open source project and contributes to other open source projects such as SocialAuth and SocialAuth Android.

One Response to “Big Data and Machine Learning: Building a Recommendation Engine”
  1. Mallika on

    It would be great if you could provide links to the ‘older post’/’previous blog’ that you mention multiple times in this blog. I would have loved to read the older, relevant blogs but could not find it with ease (thanks to the multipage index)… Please make it easy for your readers to find content that you are referring to..At least if you really care about your blogs being read.
