# Document Classification Using Multinomial Naive Bayes Classifier

Document classification is a classical machine learning problem. Given a set of documents that have already been labeled with categories, the task is to automatically assign a new document to one of those existing categories. In this blog, I will walk through a machine learning technique for doing this.

We have an existing set of documents (D1-D5) that are categorized into Auto, Sports, and Computer.

| Document # | Content | Category |
| --- | --- | --- |
| D1 | Saturn Dealer’s Car | Auto |
| D2 | Toyota Car Tercel | Auto |
| D3 | Baseball Game Play | Sports |
| D4 | Pulled Muscle Game | Sports |
| D5 | Colored GIFs Root | Computer |

Now the task is to categorize the new D6 and D7 into Auto, Sports, or Computer.

| Document # | Content | Category |
| --- | --- | --- |
| D6 | Home Runs Game | ? |
| D7 | Car Engine Noises | ? |

In machine learning, the given set of documents used to train the probabilistic model is called the training set.

This is a classification problem in machine learning. Several algorithms can be tried out, including the following (listed by their scikit-learn class names):

• BernoulliNB
• MultinomialNB
• NearestCentroid
• SGDClassifier
• LinearSVC
• RandomForestClassifier
• KNeighborsClassifier
• PassiveAggressiveClassifier
• Perceptron
• RidgeClassifier

Feel free to try out these algorithms for yourself; I found Multinomial Naive Bayes to be one of the most effective algorithms for this purpose.

In this blog, I will also provide an application of Multinomial Naive Bayes. I recommend going through the following topics to build a strong foundation of this concept.

### Applying Multinomial Naive Bayes Classification

Step 1

Calculate the prior probabilities. The prior probability of a category is the probability that a document belongs to that category, estimated from the training set.

P(Category) = (No. of documents classified into the category) divided by (Total number of documents)

P(Auto) = (No. of documents classified into Auto) divided by (Total number of documents) = 2/5 = 0.4

P(Sports) = 2/5 = 0.4

P(Computer) = 1/5 = 0.2
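The prior calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the names `training_set` and `prior_probabilities` are made up for this example.

```python
from collections import Counter

# Training set from the table above: (content, category) pairs.
training_set = [
    ("Saturn Dealer's Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]


def prior_probabilities(documents):
    """Return {category: share of documents labeled with that category}."""
    counts = Counter(category for _, category in documents)
    total = len(documents)
    return {category: count / total for category, count in counts.items()}


priors = prior_probabilities(training_set)
print(priors)  # {'Auto': 0.4, 'Sports': 0.4, 'Computer': 0.2}
```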

Step 2

Calculate the likelihood. The likelihood is the conditional probability of a word occurring in a document, given that the document belongs to a particular category.

P(Word/Category) = (Number of occurrences of the word in all the documents from a category + 1) divided by (All the words in every document from a category + Total number of unique words in all the documents)

The +1 in the numerator and the unique-word count in the denominator are add-one (Laplace) smoothing. Without them, a word that never appears in a category would have a likelihood of zero and would force the whole product in Step 3 to zero.

P(Saturn/Auto) = (Number of occurrences of the word “SATURN” in all the documents in “AUTO” + 1) divided by (All the words in every document from “AUTO” + Total number of unique words in all the documents)

= (1+1)/(6+13) = 2/19 = 0.105263158
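The same calculation can be sketched in Python. The tokenizer here is an assumption (lowercase, split on whitespace), and `tokenize`, `vocabulary`, `word_counts`, and `likelihood` are illustrative names rather than part of any library. The vocabulary is simply the set of unique words across all training documents, which gives the 13 used in the denominator.

```python
from collections import Counter

training_set = [
    ("Saturn Dealer's Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


# Vocabulary: every unique word seen anywhere in the training set.
vocabulary = {word for content, _ in training_set for word in tokenize(content)}

# Per-category word counts, e.g. word_counts["Auto"]["car"] == 2.
word_counts = {}
for content, category in training_set:
    word_counts.setdefault(category, Counter()).update(tokenize(content))


def likelihood(word, category):
    """P(word/category) with add-one (Laplace) smoothing."""
    count = word_counts[category][word.lower()]  # 0 if the word never occurs
    total_words = sum(word_counts[category].values())
    return (count + 1) / (total_words + len(vocabulary))


print(len(vocabulary))               # 13
print(likelihood("Saturn", "Auto"))  # 2/19, about 0.105263
print(likelihood("Game", "Auto"))    # 1/19, about 0.052632
```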

The tables below provide conditional probabilities for each word in Auto, Sports, and Computer.

### Auto

| Word | # of Occurrences of Word in Auto | Total Words in Auto | Conditional Probability of Given Word in Auto | # of Total Unique Words in All Documents |
| --- | --- | --- | --- | --- |
| Saturn | 1 | 6 | 0.105263158 | 13 |
| Dealers | 1 | 6 | 0.105263158 | 13 |
| Car | 2 | 6 | 0.157894737 | 13 |
| Toyota | 1 | 6 | 0.105263158 | 13 |
| Tercel | 1 | 6 | 0.105263158 | 13 |
| Baseball | 0 | 6 | 0.052631579 | 13 |
| Game | 0 | 6 | 0.052631579 | 13 |
| Play | 0 | 6 | 0.052631579 | 13 |
| Pulled | 0 | 6 | 0.052631579 | 13 |
| Muscle | 0 | 6 | 0.052631579 | 13 |
| Colored | 0 | 6 | 0.052631579 | 13 |
| GIFs | 0 | 6 | 0.052631579 | 13 |
| Root | 0 | 6 | 0.052631579 | 13 |
| Home | 0 | 6 | 0.052631579 | 13 |
| Runs | 0 | 6 | 0.052631579 | 13 |
| Engine | 0 | 6 | 0.052631579 | 13 |
| Noises | 0 | 6 | 0.052631579 | 13 |

### Sports

| Word | # of Occurrences of Word in Sports | Total Words in Sports | Conditional Probability of Given Word in Sports | # of Total Unique Words in All Documents |
| --- | --- | --- | --- | --- |
| Saturn | 0 | 6 | 0.052631579 | 13 |
| Dealers | 0 | 6 | 0.052631579 | 13 |
| Car | 0 | 6 | 0.052631579 | 13 |
| Toyota | 0 | 6 | 0.052631579 | 13 |
| Tercel | 0 | 6 | 0.052631579 | 13 |
| Baseball | 1 | 6 | 0.105263158 | 13 |
| Game | 2 | 6 | 0.157894737 | 13 |
| Play | 1 | 6 | 0.105263158 | 13 |
| Pulled | 1 | 6 | 0.105263158 | 13 |
| Muscle | 1 | 6 | 0.105263158 | 13 |
| Colored | 0 | 6 | 0.052631579 | 13 |
| GIFs | 0 | 6 | 0.052631579 | 13 |
| Root | 0 | 6 | 0.052631579 | 13 |
| Home | 0 | 6 | 0.052631579 | 13 |
| Runs | 0 | 6 | 0.052631579 | 13 |
| Engine | 0 | 6 | 0.052631579 | 13 |
| Noises | 0 | 6 | 0.052631579 | 13 |

### Computer

| Word | # of Occurrences of Word in Computer | Total Words in Computer | Conditional Probability of Given Word in Computer | # of Total Unique Words in All Documents |
| --- | --- | --- | --- | --- |
| Saturn | 0 | 3 | 0.0625 | 13 |
| Dealers | 0 | 3 | 0.0625 | 13 |
| Car | 0 | 3 | 0.0625 | 13 |
| Toyota | 0 | 3 | 0.0625 | 13 |
| Tercel | 0 | 3 | 0.0625 | 13 |
| Baseball | 0 | 3 | 0.0625 | 13 |
| Game | 0 | 3 | 0.0625 | 13 |
| Play | 0 | 3 | 0.0625 | 13 |
| Pulled | 0 | 3 | 0.0625 | 13 |
| Muscle | 0 | 3 | 0.0625 | 13 |
| Colored | 1 | 3 | 0.125 | 13 |
| GIFs | 1 | 3 | 0.125 | 13 |
| Root | 1 | 3 | 0.125 | 13 |
| Home | 0 | 3 | 0.0625 | 13 |
| Runs | 0 | 3 | 0.0625 | 13 |
| Engine | 0 | 3 | 0.0625 | 13 |
| Noises | 0 | 3 | 0.0625 | 13 |
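One quick way to sanity-check tables like these is to confirm that, within each category, the smoothed probabilities of the 13 training-vocabulary words sum to exactly 1. The sketch below does this with exact fractions. It assumes a simple lowercase, whitespace tokenizer, and `smoothed_table` is a hypothetical helper name. The extra words from D6 and D7 (Home, Runs, Engine, Noises) are not part of the 13-word vocabulary and are left out of the sum.

```python
from collections import Counter
from fractions import Fraction

training_set = [
    ("Saturn Dealer's Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]

# The 13 unique words that appear in the training documents.
vocabulary = {w for content, _ in training_set for w in content.lower().split()}


def smoothed_table(category):
    """Map each vocabulary word to its add-one smoothed probability in `category`."""
    counts = Counter(
        w
        for content, label in training_set
        if label == category
        for w in content.lower().split()
    )
    denominator = sum(counts.values()) + len(vocabulary)
    return {w: Fraction(counts[w] + 1, denominator) for w in vocabulary}


for category in ("Auto", "Sports", "Computer"):
    table = smoothed_table(category)
    print(category, sum(table.values()))  # each category sums to 1
```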

Step 3

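For each category, multiply the prior by the likelihood of every word in the document:

P(Category/Document) = P(Category) * P(Word1/Category) * P(Word2/Category) * P(Word3/Category)

Strictly speaking, this product is proportional to the posterior probability rather than equal to it, because the normalizing term P(Document) is left out. That term is the same for every category, so it does not change which category scores highest.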

P(Auto/D6) = P(Auto) * P(Home/Auto) * P(Runs/Auto) * P(Game/Auto)

= (0.4) * (0.052631579) * (0.052631579) * (0.052631579)

= 0.00005831754

P(Sports/D6) = 0.000174953

P(Computer/D6) = 0.00004882813
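Written out with the table values for the three words of D6 (Home, Runs, Game), the Sports and Computer scores work out as follows. The fractions 1/19, 3/19, and 1/16 are the exact forms of the rounded table entries 0.052631579, 0.157894737, and 0.0625.

$$
\begin{aligned}
P(\text{Sports}/D6) &= 0.4 \times \frac{1}{19} \times \frac{1}{19} \times \frac{3}{19} = \frac{1.2}{6859} \approx 0.000174953 \\
P(\text{Computer}/D6) &= 0.2 \times \frac{1}{16} \times \frac{1}{16} \times \frac{1}{16} = \frac{0.2}{4096} \approx 0.0000488281
\end{aligned}
$$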

D6 is therefore classified as Sports, because Sports has the highest score of the three categories.

P(Auto/D7) = 0.00017495262

P(Sports/D7) = 0.0000583175

P(Computer/D7) = 0.00004882813

D7 is therefore classified as Auto, because Auto has the highest score of the three categories.
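Putting the three steps together, the whole procedure can be sketched as a small pure-Python classifier. This is an illustration under stated assumptions, not a production implementation: the tokenizer is a plain lowercase split, and `train`, `scores`, and `classify` are hypothetical names chosen for this example. Words that never appear in training (such as Home or Engine) get a count of 0 and the smoothed likelihood, matching the tables above.

```python
from collections import Counter

training_set = [
    ("Saturn Dealer's Car", "Auto"),
    ("Toyota Car Tercel", "Auto"),
    ("Baseball Game Play", "Sports"),
    ("Pulled Muscle Game", "Sports"),
    ("Colored GIFs Root", "Computer"),
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train(documents):
    """Collect priors, per-category word counts, and the vocabulary."""
    doc_counts = Counter(label for _, label in documents)
    word_counts = {label: Counter() for label in doc_counts}
    for content, label in documents:
        word_counts[label].update(tokenize(content))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    priors = {label: n / len(documents) for label, n in doc_counts.items()}
    return priors, word_counts, vocabulary


def scores(text, model):
    """Unnormalized score per category: prior times smoothed word likelihoods."""
    priors, word_counts, vocabulary = model
    result = {}
    for label, prior in priors.items():
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        score = prior
        for word in tokenize(text):
            score *= (word_counts[label][word] + 1) / denominator
        result[label] = score
    return result


def classify(text, model):
    """Return the category with the highest score."""
    category_scores = scores(text, model)
    return max(category_scores, key=category_scores.get)


model = train(training_set)
for name, text in [("D6", "Home Runs Game"), ("D7", "Car Engine Noises")]:
    print(name, classify(text, model), scores(text, model))
```

Note that library implementations may treat unseen words differently. For example, scikit-learn's `CountVectorizer` drops words it did not see during fitting, so they would not contribute to the score at all.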

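As this example shows, Multinomial Naive Bayes classifies both new documents sensibly from only five training documents, which is part of why it is a popular baseline for document classification.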

Before concluding, I would recommend exploring the following Python packages, which provide great resources for learning classification techniques, along with implementations of several classification algorithms.

#### Manoj Bisht

##### Senior Architect

Manoj Bisht is the Senior Architect at 3Pillar Global, working out of our office in Noida, India. He has expertise in building and working with high-performance teams delivering cutting-edge enterprise products. He is also a keen researcher and dives deep into trending technologies. His current areas of interest are data science, cloud services, and microservice/serverless design and architecture. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.

##### 12 Responses to “Document Classification Using Multinomial Naive Bayes Classifier”
1. hardi thakor on

hello sir, my question is how to find unique words in Naive Bayes algorithm

2. Patricia Ramos on

Hi sir! May I ask when did you publish this article? My team and I are working on a paper about spam and we will be using Multinomial Naive Bayes Classifier. We would like to cite your article on our paper. Thank you very much.

3. dibakar on

= (1+1)/(6+3)

Instead of 3 it should be 13 I think.

4. Rupam on

Hi Sir,

I have a huge data set having around 360 categories (e.g. Agriculture, Animation, AI, Banking, Security, etc.) to be predicted based on the description of a company. Using MULTINOMIAL BAYES I’m getting very low accuracy. Is there any other algo which works on such type of dataset?

5. Shekhar pandey on

Sir, can you kindly tell me your research paper name…So that I can read it. Thank you.

6. Faizan Saeed on

Great article. Pointed out a small typo in the line

= (1+1)/(6+3) = 2/19 = 0.105263158

it should be 13 instead of 3 in the above line because 6+3 = 9 and 6+13 = 19