May 16, 2013

Big Data and Machine Learning: An Introduction to Machine Learning

This blog post gives you a whirlwind tour of machine learning techniques applied to recommender engines and explains why we’ve chosen Apache Mahout for our research.

Machine learning is a branch of artificial intelligence (AI) focused on the study of systems that can learn from data. Over the years, it has seen many applications in recommender systems, search engines, stock market analysis, speech recognition and information retrieval. The main characteristics of such applications are the need to analyze large sets of historical data and make predictions on unseen data.

Recommender systems are of special interest to organizations that want to analyze user choices and personalize content for other users. For example, if you view the detail page of any book on Amazon, you’ll see what other users interested in the same book viewed and purchased. This is of immense benefit to Amazon, which gets to showcase and cross-sell its products while personalizing the experience to each user’s interests.

User choices on different items vary, but they do follow patterns. People tend to like things that are similar to other things they like, and they tend to like the things that similar people like. In fact, these are the two broad categories of machine learning algorithms for computing recommendations: item-based and user-based. Both work by analyzing the preferences of a given set of users for a given set of items; item-based recommender engines compute the similarity of items, while user-based recommender engines compute the similarity of users. If an item-based recommender engine is presented with a user and a set of her preferences for some items, it will find other items similar to the ones she prefers and recommend the closest matches. If a user-based recommender engine is given the same inputs, it will find the most similar other users and recommend the items those users prefer most.
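To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not Mahout code; the toy ratings, the function names and the choice of cosine similarity with a similarity-weighted average are all assumptions made purely for illustration.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}.
PREFS = {
    "alice": {"book_a": 5.0, "book_b": 4.0, "book_c": 1.0},
    "bob": {"book_a": 4.0, "book_b": 5.0, "book_d": 4.0},
    "carol": {"book_a": 1.0, "book_c": 5.0, "book_e": 3.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def invert(prefs):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    by_item = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item


def rank(totals, weights, top_n):
    """Sort items by similarity-weighted average rating, best first."""
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


def recommend_user_based(prefs, user, top_n=3):
    """Compare users: weight each neighbour's ratings by user similarity."""
    totals, weights = {}, {}
    for other, ratings in prefs.items():
        if other == user:
            continue
        sim = cosine(prefs[user], ratings)
        if sim <= 0.0:
            continue
        for item, rating in ratings.items():
            if item not in prefs[user]:  # only recommend unseen items
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return rank(totals, weights, top_n)


def recommend_item_based(prefs, user, top_n=3):
    """Compare items: weight the user's own ratings by item similarity."""
    by_item = invert(prefs)
    totals, weights = {}, {}
    for candidate, candidate_raters in by_item.items():
        if candidate in prefs[user]:
            continue
        for rated_item, rating in prefs[user].items():
            sim = cosine(candidate_raters, by_item[rated_item])
            if sim > 0.0:
                totals[candidate] = totals.get(candidate, 0.0) + sim * rating
                weights[candidate] = weights.get(candidate, 0.0) + sim
    return rank(totals, weights, top_n)


if __name__ == "__main__":
    print(recommend_user_based(PREFS, "alice"))
    print(recommend_item_based(PREFS, "alice"))
```

Both functions read the same preference data. The only difference is whether similarity is computed between the rows (users) or the columns (items) of the user-item matrix.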

These are examples of collaborative filtering, which produces recommendations based only on knowledge of users’ preferences for items. These algorithms have no knowledge of the properties of the users or the items. This can be beneficial, as the same algorithm can be used to recommend any kind of item: music, books, movies or flowers. Other techniques, called content-based recommendation techniques, do account for the properties of users and items. Content-based techniques often complement collaborative filtering when the domain model is well understood, but because they depend on domain-specific attributes, there is no way to codify them into a generic framework.

However, because collaborative filtering does not depend on domain attributes, it can very well be coded into a generic framework. In this space, Apache Mahout is an open source machine learning library for collaborative filtering, clustering and classification. Mahout provides both item-based and user-based collaborative filtering algorithms. These algorithms can be run in memory or on Apache Hadoop to scale to very large datasets. Mahout can work with preferences that carry a numeric value signifying the strength of the preference, and also with datasets where this indicator is absent and only the association between a user and an item is known; the first case calls for a generic recommender, the second for a Boolean recommender. Mahout also provides a means to inject content-based recommendation techniques by allowing custom algorithms to boost or even omit some recommended items. Finally, Mahout is equipped with tools to measure the recall and precision of your algorithms, as well as program counters, making it well suited to quickly determining the best fit for your dataset and problem domain.
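Three of the ideas above can be sketched in a few lines of Python: similarity on Boolean (value-free) preferences, boosting or omitting recommended items, and measuring precision and recall. This is not Mahout's API; the function names, parameters and sample data are hypothetical, and the Tanimoto coefficient is just one common choice of similarity for Boolean data.

```python
def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient between two sets of item IDs.

    Suitable for Boolean preferences, where we only know that a user
    interacted with an item and not how strongly she liked it.
    """
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rescore(scored_items, boost=None, exclude=None):
    """Apply domain knowledge to collaborative filtering output.

    scored_items: {item: score} as produced by some recommender.
    boost:        {item: multiplier} for items to promote or demote.
    exclude:      items to omit entirely (e.g. out of stock).
    Returns item IDs ordered best first.
    """
    boost = boost or {}
    exclude = set(exclude or ())
    adjusted = {
        item: score * boost.get(item, 1.0)
        for item, score in scored_items.items()
        if item not in exclude
    }
    return sorted(adjusted, key=lambda item: (-adjusted[item], item))


def precision_recall_at_n(recommended, relevant, n):
    """Precision and recall of the top-n recommendations.

    recommended: ranked list of item IDs, best first.
    relevant:    items the user is actually known to like (held out).
    """
    top = list(recommended)[:n]
    relevant = set(relevant)
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))
    ranked = rescore({"d": 4.0, "e": 3.0, "f": 2.0},
                     boost={"e": 2.0}, exclude={"f"})
    print(ranked)
    print(precision_recall_at_n(ranked, relevant={"d", "x"}, n=2))
```

In practice, you would hold back some of each user's known preferences, generate recommendations from the rest, and compare the two with a function like `precision_recall_at_n` to choose between candidate algorithms.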

For the reasons mentioned above, we picked Mahout for our research on building a recommender engine that employs both collaborative filtering algorithms and content-based modification of recommender results. That will be the topic of our next blog post!