August 16, 2016
Spell Check and Autocorrect with Conditional Probability
A client of mine wanted to reduce the time it took for his vendors to upload an inventory list to his system. The system currently matches the product name with known product names in the system database, and for the names that match exactly, the inventory items are automatically updated. For the items that did not match, the user needs to search through the system database and choose the matching inventory item.
In the domain in which the client operates, the product names are not standardized and there is no accepted SKU ID for the products. The automatic matching also has a bunch of heuristics that cover the most commonly occurring mismatches, but they only apply to a subset of the mismatched products. In this blog, I will elaborate upon an approach that uses conditional probability to solve the mismatch in these product names without resorting to heuristics.
Let's assume we have the following known names in the inventory:
|Correct Product Names|
If a user provides a name like "Carbin," we will calculate the most likely correct work using Bayes Theorem:
P(C/W) = Probability(correctly named product/wrongly named product)
P(W/C) = Probability(wrongly named product/correctly named product)
P(C) = Probably(correctly named product)
P(W) = Probability(wrongly named product)
P(C/W) = P(W/C) * P(C) / P(W)
We can ignore the division factor P(W) as it will be constant for all the probability calculation and will have no impact when we choose the highest probability.
P(C) = [Count of correctly named product] / [Total number of individual names in product]
Probability(Carbon) = [Count of Carbon] / [Total number of individual names in product] = 2/7
Probability(Dioxide) = 1/7
Now, how do we calculate P(W/C)? We will construct all possible permutations of the words for a wrongly provided product name using the concept of edit distance. The edit distance is the minimum number of insertions, deletions, and substitutions required to transform one string into the other.
Let us understand it with an example. The wrongly provided word in need of correction is “Carbin.” So we prepare a set of all possible alterations to “Carbin” as shown below:
- Delete a letter, eg. arbin, crbin, cabin, etc.
- Replace a letter, eg. aarbin, barbin, etc.
- Add a letter, eg. acarbin, bcarbin, etc.
- Switch two letters around, eg. acrbin, crabin, cabrin, etc.
We choose all the words from the above set that are known product names. These words are Carbon and Carbone. In probabilistic terms, the probability of all the unknown product names would be 0 and they can’t be selected.
Prob(unknown word/Carbon)=0 and Prob(known word/Carbon)=1
Finally, calculate the conditional probabilities for Gruaud, Gruaudi
Prob(Carbon/Carbin) = Prob(Carbin/Carbon) * Prob(Carbon) = 1 * 2/7 = 0.28
Prob (Carbone/Carbin) = Prob(Carbin/Carbone) * Prob(Carbone) = 1 * 1/7 = 0.14
The higher probabilistic word is "Carbon."
This article provides further reading on this subject: https://web.stanford.edu/class/cs124/lec/spelling.pdf
I hope you have enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!