Measures of Semantic Similarity

Semantic similarity can be understood as the answer to the question “how closely is word A related to word B?” Determining semantic similarity comes up often in Natural Language Processing applications. In this blog, I will elaborate on some well-known algorithms and their key characteristics.

Path Length

Path Length is a score denoting the count of edges on the shortest path between two words. The shorter the path between two words/senses in a thesaurus hierarchy graph, the more similar they are. A thesaurus hierarchy graph is a tree drawn from broader categories of words down to narrower ones. For example, dime and nickel can be two child nodes of coin, and man and woman can be two child nodes of human. It is a simple edge-counting scheme that produces a score:

Simpath (c1, c2) = number of edges in the shortest path between c1 and c2
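
To make the edge counting concrete, here is a minimal sketch in Python that computes the score with a breadth-first search over a small hand-built hierarchy (the node names and edges are illustrative only, not drawn from any real thesaurus):

    from collections import deque

    # Toy hierarchy as an undirected adjacency list, loosely following the
    # coin/human example above.
    edges = {
        "entity": ["coin", "human"],
        "coin": ["entity", "dime", "nickel"],
        "dime": ["coin"],
        "nickel": ["coin"],
        "human": ["entity", "man", "woman"],
        "man": ["human"],
        "woman": ["human"],
    }

    def path_length(c1, c2):
        """Number of edges on the shortest path between c1 and c2 (BFS)."""
        queue, seen = deque([(c1, 0)]), {c1}
        while queue:
            node, dist = queue.popleft()
            if node == c2:
                return dist
            for neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, dist + 1))
        return None  # the two concepts are not connected

    print(path_length("dime", "nickel"))  # 2 (dime -> coin -> nickel)
    print(path_length("dime", "woman"))   # 4 (dime -> coin -> entity -> human -> woman)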

Key Characteristics

  • It is very simple
  • It is a path-based measure
  • The score provided is discrete and not normalized
  • It requires tagged data and is heavily dependent on the quality of the graph
  • It assumes a uniform cost; there is no weight on the graph edges

Leacock-Chodorow

This is a score denoting the count of edges between two words/senses, with log smoothing applied. It is essentially the path-length measure normalized by the maximum depth of the taxonomy and log-smoothed, so it shares the same characteristics except that the score is continuous rather than discrete.

SimLC (c1, c2) = -log(pathlen(c1, c2) / (2 * D)), where D is the maximum depth of the taxonomy
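
As a minimal sketch, the formula can be written directly in Python; the path length and maximum depth below are toy values, not taken from any real taxonomy:

    import math

    def lch_similarity(path_len, max_depth):
        """Leacock-Chodorow: -log(pathlen / (2 * D)), with D the maximum taxonomy depth."""
        return -math.log(path_len / (2.0 * max_depth))

    # Example with toy values: a 2-edge path in a taxonomy of maximum depth 2.
    print(lch_similarity(path_len=2, max_depth=2))  # -log(2/4) ~ 0.69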

Key Characteristics

  • Simple
  • Continuous
  • Requires tagged data and is dependent on the quality of the graph
  • Assumes a uniform cost; there is no specific weight on the graph edges

Wu & Palmer

This is a score that takes into account the position of concepts c1 and c2 in the taxonomy relative to the position of their Least Common Subsumer, LCS(c1, c2). Like other path-based measures, it assumes that the similarity between two concepts is a function of path length and depth.

The Least Common Subsumer of two nodes, v and w, in a tree or directed acyclic graph (DAG) T is the lowest (i.e. deepest) node that has both v and w as descendants, where we define each node to be a descendant of itself (so if v has a direct connection from w, w is the lowest common ancestor).

Simwup (c1, c2) = (2 * dep(LCS(c1, c2))) / (len(c1, c2) + 2 * dep(LCS(c1, c2)))

LCS(c1, c2) = Lowest node in hierarchy that is a hypernym of c1, c2.
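
Here is a minimal sketch of the Wu & Palmer score over the same kind of toy hierarchy used earlier; the depth values (root at depth 1) and parent links are assumptions made purely for illustration:

    # Toy hierarchy: depths (root "entity" at depth 1) and parent links.
    depth = {"entity": 1, "coin": 2, "human": 2,
             "dime": 3, "nickel": 3, "man": 3, "woman": 3}
    parent = {"coin": "entity", "human": "entity", "dime": "coin",
              "nickel": "coin", "man": "human", "woman": "human"}

    def ancestors(c):
        """c together with all of its ancestors (each node is its own descendant)."""
        result = {c}
        while c in parent:
            c = parent[c]
            result.add(c)
        return result

    def lcs(c1, c2):
        """Least Common Subsumer: the deepest node that subsumes both c1 and c2."""
        return max(ancestors(c1) & ancestors(c2), key=lambda c: depth[c])

    def path_len(c1, c2):
        """In a tree, len(c1, c2) = depth(c1) + depth(c2) - 2 * depth(LCS(c1, c2))."""
        return depth[c1] + depth[c2] - 2 * depth[lcs(c1, c2)]

    def wup_similarity(c1, c2):
        """Wu & Palmer: 2 * dep(LCS) / (len(c1, c2) + 2 * dep(LCS))."""
        d = depth[lcs(c1, c2)]
        return (2.0 * d) / (path_len(c1, c2) + 2.0 * d)

    print(wup_similarity("dime", "nickel"))  # 4 / (2 + 4) ~ 0.67
    print(wup_similarity("dime", "woman"))   # 2 / (4 + 2) ~ 0.33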

Key Characteristics

  • Continuous and normalized
  • The score can never be zero
  • Heavily dependent on the quality of the graph
  • No distinction between similarity/relatedness

Resnik Similarity

This is a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer.

Information content is derived from the frequency counts of concepts as found in a corpus of text. The count associated with a concept is incremented each time that concept is observed in the corpus, as are the counts of its ancestor concepts in the WordNet hierarchy (for nouns and verbs). Information content can only be computed for nouns and verbs in WordNet, since these are the only parts of speech whose concepts are organized in hierarchies.

SimResnik (c1, c2) = IC(LCS(c1, c2))

LCS(c1, c2) = Lowest node in hierarchy that is a hypernym of c1, c2.

IC(c) = -log P(c), where P(c) is the probability of encountering an instance of concept c in the corpus
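
As a minimal sketch, NLTK's WordNet interface exposes this measure directly; it assumes the nltk package is installed and the wordnet and wordnet_ic corpora have been downloaded:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    # Information content pre-computed from the Brown corpus (ships with wordnet_ic).
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    # Resnik similarity: IC of the Least Common Subsumer of the two senses.
    print(dog.res_similarity(cat, brown_ic))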

Key Characteristics

  • Value will always be greater than or equal to zero
  • Refines path-based approach using normalizations based on hierarchy depth
  • Relies on structure of thesaurus
  • Dependent on information content; the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created
  • IC-based similarity results are better than path-based

Lin Similarity

This is a score using both the amount of information needed to state the commonality between the two concepts and the information needed to fully describe these terms.

SimLin = 2 * IC(LCS(c1, c2)) / (IC(c1) + IC(c2))
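
A minimal sketch of the formula in terms of pre-computed information content values (the IC numbers below are illustrative placeholders, not taken from any corpus); NLTK's lin_similarity provides the same measure over WordNet:

    def lin_similarity(ic_c1, ic_c2, ic_lcs):
        """Lin: 2 * IC(LCS(c1, c2)) / (IC(c1) + IC(c2))."""
        return (2.0 * ic_lcs) / (ic_c1 + ic_c2)

    # Illustrative IC values only (IC(c) = -log P(c) estimated from some corpus).
    print(lin_similarity(ic_c1=7.91, ic_c2=8.33, ic_lcs=6.20))  # ~0.76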

Key Characteristics

  • Refines path-based approach using normalizations based on hierarchy depth
  • Relies on structure of thesaurus
  • Dependent on information content; the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created
  • IC-based similarity results are better than path-based

Jiang-Conrath Distance

This is a score using both the amount of information needed to state the commonality between the two concepts and the information needed to fully describe these terms. It is similar to Lin Similarity.

SimJCN = 1/distJC(c1, c2)

distJC(c1, c2) = 2 * log P(LCS(c1, c2)) – (log P(c1) + log P(c2))
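
A minimal sketch of the distance and its inverse, written in terms of information content (equivalent to the log form above, since IC(c) = -log P(c)); the IC values are illustrative placeholders, and the guard for the zero-distance case is called out in the characteristics below:

    def jcn_distance(ic_c1, ic_c2, ic_lcs):
        """Jiang-Conrath distance: IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2))."""
        return ic_c1 + ic_c2 - 2.0 * ic_lcs

    def jcn_similarity(ic_c1, ic_c2, ic_lcs):
        """Inverse of the distance, with a guard for the dist = 0 case."""
        dist = jcn_distance(ic_c1, ic_c2, ic_lcs)
        if dist == 0:
            return float('inf')  # identical information content; usually capped in practice
        return 1.0 / dist

    # Illustrative IC values only.
    print(jcn_similarity(ic_c1=7.91, ic_c2=8.33, ic_lcs=6.20))  # 1 / 3.84 ~ 0.26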

Key Characteristics

  • Refines path-based approach using normalizations based on hierarchy depth
  • Relies on structure of thesaurus
  • Dependent on information content; the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created
  • IC-based similarity results are better than path-based
  • Care must be taken to handle distJC = 0 scenario

I hope you enjoyed reading this. If you have any questions or queries, please leave a comment below. I greatly appreciate your feedback!
