Spell Check and Autocorrect with Conditional Probability

A client of mine wanted to reduce the time it took for his vendors to upload an inventory list to his system. The system currently matches the product name with known product names in the system database, and for the names that match exactly, the inventory items are automatically updated. For the items that did not match, the user needs to search through the system database and choose the matching inventory item.

In the domain in which the client operates, the product names are not standardized and there is no accepted SKU ID for the products. The automatic matching also has a bunch of heuristics that cover the most commonly occurring mismatches, but they only apply to a subset of the mismatched products. In this blog, I will elaborate upon an approach that uses conditional probability to solve the mismatch in these product names without resorting to heuristics.

Let’s assume we have the following known names in the inventory:

Correct Product Names
Carbon Dioxide
Carbon Monoxide
Carbone Lorraine

If a user provides a name like “Carbin,” we will calculate the most likely correct work using Bayes Theorem:

P(C/W) = Probability(correctly named product/wrongly named product)

P(W/C) = Probability(wrongly named product/correctly named product)

P(C) = Probably(correctly named product)

P(W) = Probability(wrongly named product)

P(C/W) = P(W/C) * P(C) / P(W)

We can ignore the division factor P(W) as it will be constant for all the probability calculation and will have no impact when we choose the highest probability.

P(C) = [Count of correctly named product] / [Total number of individual names in product]

eg.
Probability(Carbon) = [Count of Carbon] / [Total number of individual names in product] = 2/7
Probability(Dioxide) = 1/7

Now, how do we calculate P(W/C)? We will construct all possible permutations of the words for a wrongly provided product name using the concept of edit distance. The edit distance is the minimum number of insertions, deletions, and substitutions required to transform one string into the other.

Let us understand it with an example. The wrongly provided word in need of correction is “Carbin.” So we prepare a set of all possible alterations to “Carbin” as shown below:

  1. Delete a letter, eg. arbin, crbin, cabin, etc.
  2. Replace a letter, eg. aarbin, barbin, etc.
  3. Add a letter, eg. acarbin, bcarbin, etc.
  4. Switch two letters around, eg. acrbin, crabin, cabrin, etc.

We choose all the words from the above set that are known product names. These words are Carbon and Carbone. In probabilistic terms, the probability of all the unknown product names would be 0 and they can’t be selected.

Prob(unknown word/Carbon)=0 and Prob(known word/Carbon)=1

Finally, calculate the conditional probabilities for Gruaud, Gruaudi

Prob(Carbon/Carbin) = Prob(Carbin/Carbon) * Prob(Carbon) = 1 * 2/7 = 0.28

Prob (Carbone/Carbin) = Prob(Carbin/Carbone) * Prob(Carbone) = 1 * 1/7 = 0.14

The higher probabilistic word is “Carbon.”

This article provides further reading on this subject: https://web.stanford.edu/class/cs124/lec/spelling.pdf

I hope you have enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!

Manoj Bisht

Manoj Bisht

Senior Architect

Manoj Bisht is the Senior Architect at 3Pillar Global, working out of our office in Noida, India. He has expertise in building and working with high performance team delivering cutting edge enterprise products. He is also a keen researcher and dive deeps into trending technologies. His current areas of interest are data science, cloud services and micro service/ serverless design and architecture. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.

Leave a Reply

Related Posts

The 3 Keys to Building Products That Drive Retention –... I had the privilege of being invited to speak at the Wearable Technology Show in Santa Clara this week, where I gave a bit of a reprisal of a talk I d...
High Availability and Automatic Failover in Hadoop Hadoop in Brief Hadoop is one of the most popular sets of big data processing technologies/frameworks in use today. From Adobe and eBay to Facebook a...
3Pillar CEO David DeWolf Quoted in Enterprise Mobility Excha... David DeWolf, Founder and CEO of 3Pillar Global, was recently quoted in a report by Enterprise Mobility Exchange on the necessity of understanding and...
How the Right Tech Stack Fuels Innovation – The Innova... On this episode of The Innovation Engine podcast, we take a look at how choosing the right tech stack can fuel innovation in your company. We'll talk ...
The Road to AWS re:Invent 2018 – Weekly Predictions, P... For the last two weeks, I’ve been making predictions of what might be announced at AWS’ upcoming re:Invent conference. In week 1, I made some guesses ...