machine-learning Natural Language Processing Text Matching or Similarity


One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching includes automatic spelling correction, data de-duplication and genome analysis etc. A number of text matching techniques are available depending upon the requirement. So lets have;Levenshtein Distance

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Following is the implementation for efficient memory computations.

def levenshtein(s1,s2): 
 if len(s1) > len(s2):
    s1,s2 = s2,s1 
distances = range(len(s1) + 1) 

for index2,char2 in enumerate(s2):
    newDistances = [index2+1]
    for index1,char1 in enumerate(s1):
        if char1 == char2:
             newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1]))) 
             distances = newDistances 

             return distances[-1]