Spell Check and Autocorrect with Conditional Probability

A client of mine wanted to reduce the time it took for his vendors to upload an inventory list to his system. The system currently matches the product name with known product names in the system database, and for the names that match exactly, the inventory items are automatically updated. For the items that did not match, the user needs to search through the system database and choose the matching inventory item.

In the domain in which the client operates, the product names are not standardized and there is no accepted SKU ID for the products. The automatic matching also has a bunch of heuristics that cover the most commonly occurring mismatches, but they only apply to a subset of the mismatched products. In this blog, I will elaborate upon an approach that uses conditional probability to solve the mismatch in these product names without resorting to heuristics.

Let’s assume we have the following known names in the inventory:

Correct Product Names
Carbon Dioxide
Carbon Monoxide
Carbone Lorraine

If a user provides a name like “Carbin,” we will calculate the most likely correct work using Bayes Theorem:

P(C/W) = Probability(correctly named product/wrongly named product)

P(W/C) = Probability(wrongly named product/correctly named product)

P(C) = Probably(correctly named product)

P(W) = Probability(wrongly named product)

P(C/W) = P(W/C) * P(C) / P(W)

We can ignore the division factor P(W) as it will be constant for all the probability calculation and will have no impact when we choose the highest probability.

P(C) = [Count of correctly named product] / [Total number of individual names in product]

eg.
Probability(Carbon) = [Count of Carbon] / [Total number of individual names in product] = 2/7
Probability(Dioxide) = 1/7

Now, how do we calculate P(W/C)? We will construct all possible permutations of the words for a wrongly provided product name using the concept of edit distance. The edit distance is the minimum number of insertions, deletions, and substitutions required to transform one string into the other.

Let us understand it with an example. The wrongly provided word in need of correction is “Carbin.” So we prepare a set of all possible alterations to “Carbin” as shown below:

  1. Delete a letter, eg. arbin, crbin, cabin, etc.
  2. Replace a letter, eg. aarbin, barbin, etc.
  3. Add a letter, eg. acarbin, bcarbin, etc.
  4. Switch two letters around, eg. acrbin, crabin, cabrin, etc.

We choose all the words from the above set that are known product names. These words are Carbon and Carbone. In probabilistic terms, the probability of all the unknown product names would be 0 and they can’t be selected.

Prob(unknown word/Carbon)=0 and Prob(known word/Carbon)=1

Finally, calculate the conditional probabilities for Gruaud, Gruaudi

Prob(Carbon/Carbin) = Prob(Carbin/Carbon) * Prob(Carbon) = 1 * 2/7 = 0.28

Prob (Carbone/Carbin) = Prob(Carbin/Carbone) * Prob(Carbone) = 1 * 1/7 = 0.14

The higher probabilistic word is “Carbon.”

This article provides further reading on this subject: https://web.stanford.edu/class/cs124/lec/spelling.pdf

I hope you have enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!

Manoj Bisht

Manoj Bisht

Senior Architect

Manoj Bisht is the Senior Architect at 3Pillar Global, working out of our office in Noida, India. He has expertise in building and working with high performance team delivering cutting edge enterprise products. He is also a keen researcher and dive deeps into trending technologies. His current areas of interest are data science, cloud services and micro service/ serverless design and architecture. He loves to spend his spare time playing games and also likes traveling to new places with family and friends.

Leave a Reply

Related Posts

How to Manage the “Need for Speed” Without Sacri... The pace of innovation today is faster than it has ever been. Customers are much more active and vocal thanks to social and mobile channels, and the c...
Determining the First Release The first thing you release needs to put the solution to your customer's most important problem in their hands. Deciding what the most important probl...
The Art of Building Rapid (and Valuable) Proofs of Concept Clients and stakeholders want results. They want assurances that their investment is well spent and they're building the right product. The software d...
Are You Doing Stuff or Creating Value? You can put a bunch of stickies on the wall, create tons of JIRA tickets, and commit lots of code, but are you creating value? Is the work your produc...
Costovation – Giving Your Customers Exactly What They ... On this episode of The Innovation Engine podcast, we delve into “cost-ovation,” or innovation that gives your customers exactly what they want – and n...