When working with the data from the Web, it often contains noise: mistyping, missing words, shortenings, excessive punctuation, and others. In addition, the same data coming from two different sources can slightly differ in its representation. Comparing noisy strings using the == operator will generate False, even if the two strings clearly represent the same data point, as in:
if "San Francisco, CA" == "San Francisco, Calif.": ...
In the above example, the data point is the city of San Francisco, in the state of California, in the United States. However, the first source used a more common USPS state code CA, while the second source used a less common, but still frequently seen in news articles, Associated Press state abbreviation Calif.
Because of this difference, the above comparison will result in a False. However, we, developers, would like that in such cases our program continues as if both strings were the same.
To solve the problem of noisy string comparison, we, at semanti.ca frequently use fuzzy string matching algorithms. One of the most popular algorithms is the one that computes Levenshtein Distance.
Levenshtein Distance (LD) is a measure of dissimilarity between two strings. In this blog post, we will refer to the two stings as the source string \((s)\) and the target string \((t)\). The Levenshtein Distance is given by the minimal number of character deletions, insertions, and substitutions required to transform \(s\) into \(t\).
Let's look at an example.
If \(s\) is "car" and \(t\) is "car", then \(LD(s,t) = 0\), because no transformations are needed: the strings are already identical.
If \(s\) is "car" and \(t\) is "bar", then \(LD(s,t) = 1\), because one substitution (substitute "c" with "b") is required to transform \(s\) into \(t\).
The greater the Levenshtein distance, the more dissimilar the two strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who developed the algorithm in 1965. In some literature, Levenshtein distance is also called edit distance.
The Levenshtein Distance has application in:
- Spelling correction
- Plagiarism detection
- Joining databases
- Deduplication of database entities
The algorithm is very simple:
For every \(i\) ranging from \(0\) to \(n\), and for every \(j\) ranging from \(0\) to \(m\) define cost:
- if \(s[i]\) equals \(t[j]\), then set \(cost(i,j) = 0\);
- if \(s[i]\) doesn't equal \(t[j]\), set \(cost(i,j) = 1\).
- Set \(n\) to be the length of \(s\);
- Set \(m\) to be the length of \(t\);
- If \(n = 0\), return \(m\) and exit;
- If \(m = 0\), return n and exit;
- Construct a matrix \(D\) containing \(0..m\) rows and \(0..n\) columns.
- Initialize the first row of D: set \(D[0,i] = i\) for \(i\) ranging from \(0\) to \(n\);
- Initialize the first column of D: set \(D[j,0] = j\) for \(j\) ranging from \(0\) to \(m\).
For every \(i\) and for every \(j\), set cell \(D[i,j]\) of the matrix \(D\) equal to the minimum of:
- The cell immediately above plus \(1\): \(D[i-1,j] + 1\);
- The cell immediately to the left plus \(1\): \(D[i,j-1] + 1\);
- The cell diagonally above and to the left plus the cost: \(D[i-1,j-1] + cost[i-1,j-1]\).
After the step 3 is complete, the Levenshtein Distance is found in cell \(D[n,m]\).
Let's use the above algorithm and calculate the Levenshtein Distance between words "friend" and "freinds".
Steps 1 and 2
Step 3, \(i = 1\) and \(j\) spans from \(1\) to \(7\):
Step 3, \(i = 2\) and \(j\) spans from \(1\) to \(7\):
Step 3, \(i = 3\) and \(j\) spans from \(1\) to \(7\):
Step 3, \(i = 4\) and \(j\) spans from \(1\) to \(7\):
Step 3, \(i = 5\) and \(j\) spans from \(1\) to \(7\):
Step 3, \(i = 6\) and \(j\) spans from \(1\) to \(7\):
The Levenshtein Distance is in the lower right corner of the matrix \(D\), i.e. 3. This corresponds to our intuitive realization that "friend" can be transformed into "freinds" by substituting "i" for "e", "e" for "i" and adding "s" (two substitutions and one insertion gives 3 changes).
Levenshtein Distance in Python
def LD(s, t): # Step 1 n = len(s) m = len(t) if n == 0: return m if m == 0: return n; D = [[0 for j in range(m + 1)] for i in range(n + 1)] # Step 2 for i in range(0, n + 1): D[i] = i for j in range(0, m + 1): D[j] = j # Step 3 for i in range(1, n + 1): s_i = s[i - 1] for j in range(1, m + 1): t_j = t[j - 1] if s_i == t_j: cost = 0 else: cost = 1 D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + cost) #Step 4 return D[n][m]
A reference implementation of Levenshtein Distance in Python is FuzzyWuzzy.
The java-string-similarity Java package implements various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity, and others.
Some useful implementations of string distance, including Levenshtein distance, are included in Lucene (and then in Solr / ElasticSearch).
editdistance is a reference implementation of the Levenshtein Distance algorithm in C++.
golang-levenshtein is an implementation of the Levenshtein Distance algorithm in Go.
In Ruby, we recommend using fuzzy_match.
If you know about another implementation of the Levenshtein Distance or any other fuzzy string matching algorithm (for the languages we didn't cover, or a faster one than those we covered), let us know.
Thanks to Ra'ad Siraj for corrections.
Found a mistyping or an inconsistency in the text? Let us know and we will improve it.