Document Classification Using Multinomial Naive Bayes Classifier

Document classification is a classical machine learning problem. Given a set of documents that are already categorized/labeled into existing categories, the task is to automatically assign a new document to one of those categories. In this blog, I will walk through a machine learning technique for doing this.

We have an existing set of documents (D1-D5) that are categorized into Auto, Sports, and Computer.
Document # | Content            | Category
D1         | Saturn Dealers Car | Auto
D2         | Toyota Car Tercel  | Auto
D3         | Baseball Game Play | Sports
D4         | Pulled Muscle Game | Sports
D5         | Colored GIFs Root  | Computer

Now the task is to categorize the new documents, D6 and D7, into Auto, Sports, or Computer.

Document # | Content           | Category
D6         | Home Runs Game    | ?
D7         | Car Engine Noises | ?

In machine learning, the given set of documents used to train the probabilistic model is called the training set.

The problem can be solved by the classification technique of machine learning. There are several algorithms that can be tried out (the names below are scikit-learn classes), including:

  • BernoulliNB
  • MultinomialNB
  • NearestCentroid
  • SGDClassifier
  • LinearSVC
  • RandomForestClassifier
  • KNeighborsClassifier
  • PassiveAggressiveClassifier
  • Perceptron
  • RidgeClassifier

Note that scikit-learn's Pipeline, which often appears alongside these, is a utility for chaining preprocessing and classification steps rather than a classifier itself.

Feel free to try out these algorithms for yourself; I found Multinomial Naive Bayes to be one of the most effective algorithms for this purpose.

In this blog, I will walk through an application of Multinomial Naive Bayes. I recommend reviewing the following topics to build a strong foundation for this concept.

  1. Conditional Probability
  2. Bayes Theorem
  3. Naive Bayes Classifier
  4. Multinomial Naive Bayes Classifier
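As a quick refresher on Bayes' theorem (topic 2 above), here is a tiny numeric example in Python. The probabilities are made up purely for illustration and are not the data from this post:

```python
# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a = 0.4          # prior P(A)
p_b_given_a = 0.5  # likelihood P(B | A)
p_b = 0.25         # evidence P(B)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.8
```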

Applying Multinomial Naive Bayes Classification

Step 1

Calculate the prior probabilities. These are the probabilities of a document belonging to each category, estimated from the given set of documents.

P(Category) = (Number of documents classified into the category) / (Total number of documents)

P(Auto) = (Number of documents classified into Auto) / (Total number of documents) = 2/5 = 0.4

P(Sports) = 2/5 = 0.4

P(Computer) = 1/5 = 0.2
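Step 1 can be sketched in a few lines of plain Python; the label list mirrors documents D1-D5 from the table above:

```python
# Prior probabilities computed from the training labels.
from collections import Counter

labels = ["Auto", "Auto", "Sports", "Sports", "Computer"]  # D1..D5
counts = Counter(labels)
priors = {category: n / len(labels) for category, n in counts.items()}
print(priors)  # {'Auto': 0.4, 'Sports': 0.4, 'Computer': 0.2}
```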

Step 2

Calculate the likelihoods. The likelihood is the conditional probability of a word occurring in a document given that the document belongs to a particular category. With Laplace (add-one) smoothing:

P(Word | Category) = (Number of occurrences of the word across all documents in the category + 1) / (Total number of words across all documents in the category + Total number of unique words in all the documents)

P(Saturn | Auto) = (Number of occurrences of the word "Saturn" across the documents in Auto + 1) / (Total number of words in the Auto documents + Total number of unique words in all the documents)

= (1 + 1) / (6 + 13) = 2/19 ≈ 0.105263158
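The smoothed likelihood formula can be checked directly; this is a minimal sketch, with the Auto word list taken from documents D1 and D2 above:

```python
# Laplace-smoothed likelihood P(word | category) from Step 2.
auto_words = "Saturn Dealers Car Toyota Car Tercel".split()  # 6 words in Auto
vocab_size = 13  # unique words across the training documents D1-D5

def likelihood(word, category_words, vocab_size):
    # (occurrences of word in category + 1) / (words in category + unique words)
    return (category_words.count(word) + 1) / (len(category_words) + vocab_size)

print(round(likelihood("Saturn", auto_words, vocab_size), 9))  # 0.105263158
print(round(likelihood("Car", auto_words, vocab_size), 9))     # 0.157894737
```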

The tables below provide conditional probabilities for each word in Auto, Sports, and Computer.

Auto

Word     | Occurrences in Auto | Total Words in Auto | P(Word | Auto) | Unique Words in All Documents
Saturn   | 1 | 6 | 0.105263158 | 13
Dealers  | 1 | 6 | 0.105263158 | 13
Car      | 2 | 6 | 0.157894737 | 13
Toyota   | 1 | 6 | 0.105263158 | 13
Tercel   | 1 | 6 | 0.105263158 | 13
Baseball | 0 | 6 | 0.052631579 | 13
Game     | 0 | 6 | 0.052631579 | 13
Play     | 0 | 6 | 0.052631579 | 13
Pulled   | 0 | 6 | 0.052631579 | 13
Muscle   | 0 | 6 | 0.052631579 | 13
Colored  | 0 | 6 | 0.052631579 | 13
GIFs     | 0 | 6 | 0.052631579 | 13
Root     | 0 | 6 | 0.052631579 | 13
Home     | 0 | 6 | 0.052631579 | 13
Runs     | 0 | 6 | 0.052631579 | 13
Engine   | 0 | 6 | 0.052631579 | 13
Noises   | 0 | 6 | 0.052631579 | 13

Sports

Word     | Occurrences in Sports | Total Words in Sports | P(Word | Sports) | Unique Words in All Documents
Saturn   | 0 | 6 | 0.052631579 | 13
Dealers  | 0 | 6 | 0.052631579 | 13
Car      | 0 | 6 | 0.052631579 | 13
Toyota   | 0 | 6 | 0.052631579 | 13
Tercel   | 0 | 6 | 0.052631579 | 13
Baseball | 1 | 6 | 0.105263158 | 13
Game     | 2 | 6 | 0.157894737 | 13
Play     | 1 | 6 | 0.105263158 | 13
Pulled   | 1 | 6 | 0.105263158 | 13
Muscle   | 1 | 6 | 0.105263158 | 13
Colored  | 0 | 6 | 0.052631579 | 13
GIFs     | 0 | 6 | 0.052631579 | 13
Root     | 0 | 6 | 0.052631579 | 13
Home     | 0 | 6 | 0.052631579 | 13
Runs     | 0 | 6 | 0.052631579 | 13
Engine   | 0 | 6 | 0.052631579 | 13
Noises   | 0 | 6 | 0.052631579 | 13

(Colored, GIFs, and Root appear only in D5, which is in Computer, so their Sports counts are 0.)

Computer

Word     | Occurrences in Computer | Total Words in Computer | P(Word | Computer) | Unique Words in All Documents
Saturn   | 0 | 3 | 0.0625 | 13
Dealers  | 0 | 3 | 0.0625 | 13
Car      | 0 | 3 | 0.0625 | 13
Toyota   | 0 | 3 | 0.0625 | 13
Tercel   | 0 | 3 | 0.0625 | 13
Baseball | 0 | 3 | 0.0625 | 13
Game     | 0 | 3 | 0.0625 | 13
Play     | 0 | 3 | 0.0625 | 13
Pulled   | 0 | 3 | 0.0625 | 13
Muscle   | 0 | 3 | 0.0625 | 13
Colored  | 1 | 3 | 0.125  | 13
GIFs     | 1 | 3 | 0.125  | 13
Root     | 1 | 3 | 0.125  | 13
Home     | 0 | 3 | 0.0625 | 13
Runs     | 0 | 3 | 0.0625 | 13
Engine   | 0 | 3 | 0.0625 | 13
Noises   | 0 | 3 | 0.0625 | 13

Step 3

Calculate P(Category | Document) ∝ P(Category) × P(Word1 | Category) × P(Word2 | Category) × P(Word3 | Category). This is proportional rather than equal because the common denominator P(Document) is dropped; it is the same for every category, so it does not affect the ranking.

P(Auto | D6) = P(Auto) × P(Home | Auto) × P(Runs | Auto) × P(Game | Auto)

= (0.4) × (0.052631579) × (0.052631579) × (0.052631579)

= 0.0000583175

P(Sports | D6) = (0.4) × (0.052631579) × (0.052631579) × (0.157894737) = 0.000174953

P(Computer | D6) = (0.2) × (0.0625) × (0.0625) × (0.0625) = 0.0000488281

The most probable category for D6 is Sports, because it has the highest score of the three.

P(Auto | D7) = (0.4) × (0.157894737) × (0.052631579) × (0.052631579) = 0.000174953

P(Sports | D7) = 0.0000583175

P(Computer | D7) = 0.0000488281

The most probable category for D7 is Auto, because it has the highest score of the three.
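Steps 1 through 3 can be combined into a short end-to-end sketch in plain Python. The training data comes from the tables in this post; the function names are my own:

```python
# Multinomial Naive Bayes with Laplace smoothing, as worked through above.
train = {
    "Auto": ["Saturn Dealers Car", "Toyota Car Tercel"],
    "Sports": ["Baseball Game Play", "Pulled Muscle Game"],
    "Computer": ["Colored GIFs Root"],
}
total_docs = sum(len(docs) for docs in train.values())
vocab = {w for docs in train.values() for doc in docs for w in doc.split()}

def score(document, category):
    # prior * product of smoothed likelihoods for each word in the document
    category_words = " ".join(train[category]).split()
    result = len(train[category]) / total_docs
    for word in document.split():
        result *= (category_words.count(word) + 1) / (len(category_words) + len(vocab))
    return result

def classify(document):
    return max(train, key=lambda category: score(document, category))

print(classify("Home Runs Game"))     # Sports
print(classify("Car Engine Noises"))  # Auto
```

For longer documents, multiplying many small probabilities underflows, so a production implementation would sum log-probabilities instead; the argmax is unchanged.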

The Multinomial Naive Bayes technique is pretty effective for document classification.

Before concluding, I would recommend exploring Python packages such as scikit-learn, which provide great resources for learning classification techniques along with implementations of several classification algorithms.

I hope you enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!

Manoj Bisht

Architect

Manoj Bisht is working as an Architect in the Advanced Technology Group at 3Pillar Global. Manoj has 13 years of software design and development experience. He has software architecture experience in many areas such as n-tier, EAI/B2B integration, SOA architecture, and Cloud Computing. He is an AWS Certified Solution Architect. Manoj also has extensive experience working in Retail, E-commerce, CMS, and Media domains. Manoj is a post graduate from Delhi University, India. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.
