Big Data and Machine Learning: An Introduction to Machine Learning

This blog post will give you a whirlwind tour of machine learning techniques applied to recommender engines and explain why we’ve chosen Apache Mahout for our research.

Machine learning is a branch of artificial intelligence (AI) focused on the study of systems that can learn from data. Over the years, it has seen many applications in recommender systems, search engines, stock market analysis, speech recognition and information retrieval. The main characteristics of such applications are the need to analyze large sets of historical data and make predictions on unseen data.

Recommender systems are of special interest to organizations that want to analyze user choices and personalize content for other users. For example, if you browse the details of any book on Amazon, you’ll be shown what other users interested in the same book viewed and purchased. This is of immense benefit to Amazon, which can showcase and cross-sell its products while personalizing the experience to each user’s interests.

User choices on different items vary, but they do follow patterns. People tend to like things that are similar to other things they like, and they tend to like the things that similar people like. In fact, these correspond to the two broad categories of machine learning algorithms for computing recommendations: item-based and user-based. Both work by analyzing the preferences of a given set of users for a given set of items; item-based recommender engines compute the similarity of items, while user-based recommender engines compute the similarity of users. If an item-based recommender engine is presented with a user and a set of her preferences for some items, it will find other items similar to the preferred items and recommend the closest matches. If a user-based recommender engine is given the same inputs, it will find the most similar other users and recommend the items those users prefer most.
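To make the distinction concrete, here is a minimal sketch of both approaches in plain Python. It is not Mahout code: the toy ratings, the function names, and the choice of cosine similarity with a similarity-weighted average are all assumptions made for illustration, and a real engine would offer several similarity measures and neighbourhood strategies.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}
PREFS = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob":   {"book_a": 4.0, "book_b": 3.0, "book_c": 5.0, "book_d": 4.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_e": 5.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend_user_based(prefs, user, k=1):
    """Rank unseen items by the ratings of the k users most similar to `user`."""
    others = [(cosine(prefs[user], prefs[o]), o) for o in prefs if o != user]
    neighbours = sorted(others, reverse=True)[:k]
    totals, weights = {}, {}
    for sim, other in neighbours:
        if sim <= 0:
            continue
        for item, rating in prefs[other].items():
            if item in prefs[user]:
                continue  # only recommend items the user has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores, key=scores.get, reverse=True)


def item_vectors(prefs):
    """Transpose user -> item ratings into item -> user ratings."""
    items = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            items.setdefault(item, {})[user] = rating
    return items


def recommend_item_based(prefs, user):
    """Rank unseen items by their similarity to the items `user` already rated."""
    items = item_vectors(prefs)
    scores = {}
    for candidate in items:
        if candidate in prefs[user]:
            continue
        sims = {liked: cosine(items[candidate], items[liked]) for liked in prefs[user]}
        total_sim = sum(sims.values())
        if total_sim > 0:
            weighted = sum(sims[liked] * prefs[user][liked] for liked in prefs[user])
            scores[candidate] = weighted / total_sim
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(recommend_user_based(PREFS, "alice", k=1))
    print(recommend_item_based(PREFS, "alice"))
```

Note that neither function looks at what a book is about; both rely only on the ratings, which is the point the next paragraph makes.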

These are examples of collaborative filtering, which produces recommendations based only on knowledge of users’ preferences for items. These algorithms have no knowledge of the properties of the users or the items. This can be beneficial, as the same algorithm can be used to recommend any kind of item: music, books, movies or flowers. Other techniques, called content-based recommendation techniques, do account for the properties of users and items. Content-based techniques often complement collaborative filtering when the domain model is well understood, but because they depend on that domain model, they cannot readily be codified into a generic framework.

Because collaborative filtering does not depend on domain attributes, however, it can readily be coded into a generic framework. In this space, Apache Mahout is an open source machine learning library for collaborative filtering, clustering and classification. Mahout provides both item-based and user-based collaborative filtering algorithms. These algorithms can be run in memory or on Apache Hadoop for scaling to very large datasets. Mahout can work with datasets in which each user-item pair carries a numeric value signifying the strength of preference, and also with datasets in which this preference indicator is absent; the first case calls for a generic recommender, the second for a Boolean recommender. Mahout also provides a means to inject content-based recommendation techniques by allowing custom algorithms to boost or even omit some recommended items. Mahout is equipped with tools to measure the recall and precision of your algorithms and program counters, making it well suited to quickly determining the best fit for your dataset and problem domain.
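The three ideas in the paragraph above (Boolean preferences, boosting or omitting results, and precision/recall evaluation) can be sketched in a few lines of plain Python. This is an illustration only and does not use Mahout's API: the function names, the Tanimoto (Jaccard) coefficient as the similarity for Boolean data, and the multiplicative boost are assumptions chosen for the example.

```python
def tanimoto(a, b):
    """Similarity for Boolean preferences: shared items divided by all items seen."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rescore(scored, boost=None, omit=None):
    """Apply domain rules to collaborative-filtering scores.

    `scored` maps item -> score, `boost` maps item -> multiplier,
    and `omit` is a set of items to drop from the results entirely.
    """
    boost = boost or {}
    omit = omit or set()
    adjusted = {
        item: score * boost.get(item, 1.0)
        for item, score in scored.items()
        if item not in omit
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


def precision_recall(recommended, relevant):
    """Precision: share of recommendations that were relevant.
    Recall: share of relevant items that were recommended."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Two users who only "viewed" items, with no rating attached.
    print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))
    ranked = rescore({"x": 0.9, "y": 0.8, "z": 0.7}, boost={"z": 1.5}, omit={"x"})
    print(ranked)
    print(precision_recall(ranked, ["y", "q"]))
```

In practice, the relevant items used for evaluation would be held out from the training data, so the recommender is scored on preferences it has not seen.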

For the reasons mentioned above, we picked Mahout for our research on building a recommender engine that employs both collaborative filtering algorithms and content-based modification of recommender results. That will be the topic of our next blog post!

Sayantam Dey


Senior Director Engineering

Sayantam Dey is the Senior Director of Engineering at 3Pillar Global, working out of our office in Noida, India. He has been with 3Pillar for ten years, delivering enterprise products and building frameworks for accelerated software development and testing in various technologies. His current areas of interest are data analytics, messaging systems and cloud services. He has authored the ‘Spring Integration AWS’ open source project and contributes to other open source projects such as SocialAuth and SocialAuth Android.

