Document Classification Using Multinomial Naive Bayes Classifier

Document classification is a classical machine learning problem. Given a set of documents that have already been labeled with categories, the task is to automatically assign a new document to one of those categories. In this blog, I will walk through a machine learning technique for doing this.

We have an existing set of documents (D1-D5) that are categorized into Auto, Sports, and Computer.

Document # | Content             | Category
D1         | Saturn Dealer’s Car | Auto
D2         | Toyota Car Tercel   | Auto
D3         | Baseball Game Play  | Sports
D4         | Pulled Muscle Game  | Sports
D5         | Colored GIFs Root   | Computer

Now the task is to categorize the new documents D6 and D7 into Auto, Sports, or Computer.

Document # | Content           | Category
D6         | Home Runs Game    | ?
D7         | Car Engine Noises | ?

In machine learning, the given set of documents used to train the probabilistic model is called the training set.
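For the examples that follow, the training set can be represented with a minimal Python structure (a sketch; the variable names are my own) pairing each document's text with its label:

```python
# Toy training set from the table above: (document text, category) pairs.
training_set = [
    ("Saturn Dealers Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]

# New documents to classify.
new_documents = ["Home Runs Game", "Car Engine Noises"]
```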

The problem can be solved with the classification techniques of machine learning. scikit-learn offers several classifiers that can be tried out (typically combined into a Pipeline with a vectorizer), including:

  • BernoulliNB
  • MultinomialNB
  • NearestCentroid
  • SGDClassifier
  • LinearSVC
  • RandomForestClassifier
  • KNeighborsClassifier
  • PassiveAggressiveClassifier
  • Perceptron
  • RidgeClassifier

Feel free to try out these algorithms for yourself; I found Multinomial Naive Bayes to be one of the most effective algorithms for this purpose.
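The classifier names above come from scikit-learn. As a quick sketch of the whole workflow (assuming scikit-learn is installed; the variable names are my own), the toy documents can be classified in a few lines:

```python
# Sketch: classify the toy documents with scikit-learn's MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["Saturn Dealers Car", "Toyota Car Tercel",
              "Baseball Game Play", "Pulled Muscle Game",
              "Colored GIFs Root"]
train_labels = ["Auto", "Auto", "Sports", "Sports", "Computer"]

vectorizer = CountVectorizer()          # bag-of-words counts
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB(alpha=1.0)        # alpha=1.0 is Laplace (add-one) smoothing
model.fit(X_train, train_labels)

X_new = vectorizer.transform(["Home Runs Game", "Car Engine Noises"])
print(model.predict(X_new))             # expected: Sports, Auto
```

Note that words unseen in training (Home, Runs, Engine, Noises) are simply dropped by the vectorizer, so the predictions rest on the known words Game and Car.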

In this blog, I will also provide an application of Multinomial Naive Bayes. I recommend going through the following topics to build a strong foundation in these concepts:

  1. Conditional Probability
  2. Bayes Theorem
  3. Naive Bayes Classifier
  4. Multinomial Naive Bayes Classifier

Applying Multinomial Naive Bayes Classification

Step 1

Calculate the prior probabilities. A prior is the probability that a document belongs to a specific category, estimated from the training set:

P(Category) = (number of documents classified into the category) / (total number of documents)

P(Auto) = (number of documents classified into Auto) / (total number of documents) = 2/5 = 0.4

P(Sports) = 2/5 = 0.4

P(Computer) = 1/5 = 0.2
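Step 1 can be checked with a few lines of standard-library Python (a sketch; the variable names are my own):

```python
from collections import Counter

# Labels of the five training documents D1-D5.
labels = ["Auto", "Auto", "Sports", "Sports", "Computer"]

# Prior = fraction of training documents in each category.
priors = {cat: n / len(labels) for cat, n in Counter(labels).items()}
print(priors)  # {'Auto': 0.4, 'Sports': 0.4, 'Computer': 0.2}
```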

Step 2

Calculate the likelihoods. The likelihood is the conditional probability of a word occurring in a document, given that the document belongs to a particular category. Laplace (add-one) smoothing is applied so that unseen words never produce a zero probability:

P(Word|Category) = (number of occurrences of the word in all documents of the category + 1) / (total number of words in all documents of the category + total number of unique words across all documents)

P(Saturn|Auto) = (number of occurrences of the word "Saturn" in all documents in "Auto" + 1) / (total number of words in all documents in "Auto" + total number of unique words across all documents)

= (1+1)/(6+13) = 2/19 = 0.105263158

The tables below provide conditional probabilities for each word in Auto, Sports, and Computer.

Auto

Word     | Occurrences in Auto | Total Words in Auto | Conditional Probability | Unique Words in All Documents
Saturn   | 1 | 6 | 0.105263158 | 13
Dealers  | 1 | 6 | 0.105263158 | 13
Car      | 2 | 6 | 0.157894737 | 13
Toyota   | 1 | 6 | 0.105263158 | 13
Tercel   | 1 | 6 | 0.105263158 | 13
Baseball | 0 | 6 | 0.052631579 | 13
Game     | 0 | 6 | 0.052631579 | 13
Play     | 0 | 6 | 0.052631579 | 13
Pulled   | 0 | 6 | 0.052631579 | 13
Muscle   | 0 | 6 | 0.052631579 | 13
Colored  | 0 | 6 | 0.052631579 | 13
GIFs     | 0 | 6 | 0.052631579 | 13
Root     | 0 | 6 | 0.052631579 | 13
Home     | 0 | 6 | 0.052631579 | 13
Runs     | 0 | 6 | 0.052631579 | 13
Engine   | 0 | 6 | 0.052631579 | 13
Noises   | 0 | 6 | 0.052631579 | 13

Sports

Word     | Occurrences in Sports | Total Words in Sports | Conditional Probability | Unique Words in All Documents
Saturn   | 0 | 6 | 0.052631579 | 13
Dealers  | 0 | 6 | 0.052631579 | 13
Car      | 0 | 6 | 0.052631579 | 13
Toyota   | 0 | 6 | 0.052631579 | 13
Tercel   | 0 | 6 | 0.052631579 | 13
Baseball | 1 | 6 | 0.105263158 | 13
Game     | 2 | 6 | 0.157894737 | 13
Play     | 1 | 6 | 0.105263158 | 13
Pulled   | 1 | 6 | 0.105263158 | 13
Muscle   | 1 | 6 | 0.105263158 | 13
Colored  | 0 | 6 | 0.052631579 | 13
GIFs     | 0 | 6 | 0.052631579 | 13
Root     | 0 | 6 | 0.052631579 | 13
Home     | 0 | 6 | 0.052631579 | 13
Runs     | 0 | 6 | 0.052631579 | 13
Engine   | 0 | 6 | 0.052631579 | 13
Noises   | 0 | 6 | 0.052631579 | 13

Computer

Word     | Occurrences in Computer | Total Words in Computer | Conditional Probability | Unique Words in All Documents
Saturn   | 0 | 3 | 0.0625 | 13
Dealers  | 0 | 3 | 0.0625 | 13
Car      | 0 | 3 | 0.0625 | 13
Toyota   | 0 | 3 | 0.0625 | 13
Tercel   | 0 | 3 | 0.0625 | 13
Baseball | 0 | 3 | 0.0625 | 13
Game     | 0 | 3 | 0.0625 | 13
Play     | 0 | 3 | 0.0625 | 13
Pulled   | 0 | 3 | 0.0625 | 13
Muscle   | 0 | 3 | 0.0625 | 13
Colored  | 1 | 3 | 0.125  | 13
GIFs     | 1 | 3 | 0.125  | 13
Root     | 1 | 3 | 0.125  | 13
Home     | 0 | 3 | 0.0625 | 13
Runs     | 0 | 3 | 0.0625 | 13
Engine   | 0 | 3 | 0.0625 | 13
Noises   | 0 | 3 | 0.0625 | 13
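The smoothed likelihoods in the three tables above can be reproduced with standard-library Python (a sketch; the function and variable names are my own). Note that the vocabulary of 13 unique words is built from the training documents only:

```python
from collections import Counter

training_set = [
    ("Saturn Dealers Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]

# Word counts per category, plus the shared training vocabulary.
word_counts = {}
for text, category in training_set:
    word_counts.setdefault(category, Counter()).update(text.lower().split())

vocabulary = {w for counter in word_counts.values() for w in counter}  # 13 words

def likelihood(word, category):
    """Laplace-smoothed P(word | category)."""
    counts = word_counts[category]
    return (counts[word.lower()] + 1) / (sum(counts.values()) + len(vocabulary))

print(round(likelihood("Saturn", "Auto"), 9))  # 2/19 -> 0.105263158
print(round(likelihood("Game", "Sports"), 9))  # 3/19 -> 0.157894737
```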

Step 3

Calculate the (unnormalized) posterior for each category:

P(Category|Document) ∝ P(Category) * P(Word1|Category) * P(Word2|Category) * P(Word3|Category)

For D6 ("Home Runs Game"):

P(Auto|D6) = P(Auto) * P(Home|Auto) * P(Runs|Auto) * P(Game|Auto)

= (0.4) * (0.052631579) * (0.052631579) * (0.052631579)

= 0.00005831754

P(Sports|D6) = 0.000174953

P(Computer|D6) = 0.00004882813

The most probable category for D6 is Sports, because it has the highest posterior probability of the three categories.

For D7 ("Car Engine Noises"):

P(Auto|D7) = 0.00017495262

P(Sports|D7) = 0.0000583175

P(Computer|D7) = 0.00004882813

The most probable category for D7 is Auto, because it has the highest posterior probability of the three categories.
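Putting the three steps together, a compact standard-library sketch (the names are my own) reproduces the numbers above and classifies both new documents:

```python
from collections import Counter

training_set = [
    ("Saturn Dealers Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]

# Step 1: priors from label frequencies.
labels = [category for _, category in training_set]
priors = {cat: n / len(labels) for cat, n in Counter(labels).items()}

# Step 2: per-category word counts and the shared vocabulary.
word_counts = {}
for text, category in training_set:
    word_counts.setdefault(category, Counter()).update(text.lower().split())
vocabulary = {w for counter in word_counts.values() for w in counter}

# Step 3: unnormalized posterior P(category | document).
def posterior(document, category):
    counts = word_counts[category]
    denom = sum(counts.values()) + len(vocabulary)  # Laplace smoothing
    score = priors[category]
    for word in document.lower().split():
        score *= (counts[word] + 1) / denom
    return score

def classify(document):
    return max(word_counts, key=lambda cat: posterior(document, cat))

print(classify("Home Runs Game"))     # Sports
print(classify("Car Engine Noises"))  # Auto
```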

The Multinomial Naive Bayes technique is pretty effective for document classification.

Before concluding, I would recommend exploring Python packages such as scikit-learn, which provide great resources for learning classification techniques along with implementations of several classification algorithms.

I hope you enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!

Manoj Bisht

Architect

Manoj Bisht is working as an Architect in the Advanced Technology Group at 3Pillar Global. Manoj has 13 years of software design and development experience. He has software architecture experience in many areas such as n-tier, EAI/B2B integration, SOA architecture, and Cloud Computing. He is an AWS Certified Solution Architect. Manoj also has extensive experience working in Retail, E-commerce, CMS, and Media domains. Manoj is a post graduate from Delhi University, India. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.
