Big Data and Machine Learning: An Introduction to Machine Learning

This blog post gives you a whirlwind tour of machine learning techniques applied to recommender engines and explains why we’ve chosen Apache Mahout for our research.

Machine learning is a branch of artificial intelligence (AI) focused on the study of systems that can learn from data. Over the years, it has seen many applications in recommender systems, search engines, stock market analysis, speech recognition and information retrieval. The main characteristics of such applications are the need to analyze large sets of historical data and make predictions on unseen data.

Recommender systems are of special interest to organizations that want to analyze user choices and personalize content for other users. For example, if you browse the details of any book on Amazon, you’ll see what other users interested in the same book viewed and purchased. This is of immense benefit to Amazon: it can showcase and cross-sell its products while personalizing the experience to each user’s interests.

User choices on different items vary, but they do follow patterns. People tend to like things that are similar to other things they like, and they tend to like the things that similar people like. These two observations correspond to the two broad categories of machine learning algorithms for computing recommendations: item-based and user-based. Both work by analyzing the preferences of a given set of users for a given set of items; item-based recommender engines compute the similarity of items, while user-based recommender engines compute the similarity of users. If an item-based recommender engine is presented with a user and a set of her preferences for some items, it will find other items similar to the preferred items and recommend the closest matches. If a user-based recommender engine is given the same inputs, it will find the most similar other users and recommend the items those users prefer most.
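The difference between the two approaches can be sketched in a few lines of plain Python. This is a minimal illustration, not Mahout code: the preference data, the function names and the choice of cosine similarity are all assumptions made for the example, and a real engine would offer several similarity measures and neighborhood strategies.

```python
from math import sqrt

# Hypothetical preference data: user -> {item: rating on a 1-5 scale}.
PREFS = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob":   {"book_a": 4.0, "book_b": 3.0, "book_c": 5.0, "book_d": 4.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def transpose(prefs):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    by_item = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            by_item.setdefault(item, {})[user] = rating
    return by_item


def recommend_user_based(prefs, user):
    """Score unseen items by other users' ratings, weighted by user similarity."""
    totals, weights = {}, {}
    for other, ratings in prefs.items():
        if other == user:
            continue
        sim = cosine(prefs[user], ratings)
        if sim <= 0:
            continue
        for item, rating in ratings.items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def recommend_item_based(prefs, user):
    """Score unseen items by their similarity to items the user already rated."""
    by_item = transpose(prefs)
    totals, weights = {}, {}
    for candidate, raters in by_item.items():
        if candidate in prefs[user]:
            continue
        for liked, rating in prefs[user].items():
            sim = cosine(raters, by_item[liked])
            if sim <= 0:
                continue
            totals[candidate] = totals.get(candidate, 0.0) + sim * rating
            weights[candidate] = weights.get(candidate, 0.0) + sim
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `recommend_user_based(PREFS, "alice")` and `recommend_item_based(PREFS, "alice")` both suggest `book_d`, the only item Alice has not rated, but they arrive at different scores: the first averages what similar users thought of it, the second averages Alice's own ratings of the items most similar to it.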

These are examples of collaborative filtering, which produces recommendations based only on knowledge of users’ preferences for items. These algorithms have no knowledge of the properties of the users or the items. This can be beneficial, as the same algorithm can be used for recommending any kind of item: music, books, movies or flowers. Other techniques, called content-based recommendation techniques, do account for the properties of users and items. Content-based techniques often complement collaborative filtering when the domain model is well understood, but because they depend on domain-specific attributes, there is no way to codify them into a generic framework.

Because collaborative filtering does not depend on domain attributes, however, it can readily be coded into a generic framework. Apache Mahout is one such framework: an open source machine learning library for collaborative filtering, clustering and classification. Mahout provides both item-based and user-based collaborative filtering algorithms. These algorithms can be run in memory or on Apache Hadoop to scale to very large datasets. Mahout can work with datasets in which numeric values signify the strength of a user’s preference for an item, and also with datasets in which this preference indicator is absent; the first case calls for a generic recommender, the second for a Boolean recommender. Mahout also provides a means to inject content-based recommendation techniques by allowing custom algorithms to boost or even omit some recommended items. Finally, Mahout is equipped with tools to measure the recall and precision of your algorithms, as well as program counters, which makes it well suited to quickly determining the best fit for your dataset and problem domain.
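Three of the ideas in the paragraph above can be sketched in plain Python. None of this is Mahout's API; the function names and sample data are hypothetical. The sketch assumes the Tanimoto (Jaccard) coefficient as the similarity measure for Boolean preferences, a simple multiply-or-drop step as the content-based adjustment, and precision and recall computed over the top N recommendations against a held-out set of relevant items.

```python
def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: size of the overlap over size of the union.

    Suits Boolean data, where we only know which items a user chose,
    not how strongly the user preferred them.
    """
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rescore(scored_items, boosts=None, excluded=None):
    """Apply content-based adjustments to collaborative filtering output.

    scored_items: list of (item, score) pairs.
    boosts: hypothetical item -> multiplier map (for example, promote new titles).
    excluded: items to omit entirely (for example, out-of-stock titles).
    """
    boosts = boosts or {}
    excluded = set(excluded or ())
    adjusted = [
        (item, score * boosts.get(item, 1.0))
        for item, score in scored_items
        if item not in excluded
    ]
    return sorted(adjusted, key=lambda kv: kv[1], reverse=True)


def precision_recall_at_n(recommended, relevant, n):
    """Precision and recall of the top-n recommendations.

    precision = hits / number of items recommended
    recall    = hits / number of relevant items
    """
    top_n = list(recommended)[:n]
    relevant = set(relevant)
    hits = sum(1 for item in top_n if item in relevant)
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, two users who chose `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct items, so their Tanimoto similarity is 0.5. Measuring precision and recall before and after a rescoring step is one way to check whether a content-based adjustment actually helps on a given dataset.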

For the reasons mentioned above, we picked Mahout for our research on building a recommender engine that employs both collaborative filtering algorithms and content-based modification of recommender results. That will be the topic of our next blog post!

Sayantam Dey

Senior Director Engineering

Sayantam Dey is the Senior Director of Engineering at 3Pillar Global, working out of our office in Noida, India. He has been with 3Pillar for ten years, delivering enterprise products and building frameworks for accelerated software development and testing in various technologies. His current areas of interest are data analytics, messaging systems and cloud services. He has authored the ‘Spring Integration AWS’ open source project and contributes to other open source projects such as SocialAuth and SocialAuth Android.
