Jul
29
2008

## ColdFusion Levenshtein Distance: String comparison and highlighting

ColdFusion, Technology
This is a fun project I put out there a while back. I recently went through and optimized the performance a bit so I could officially blog it. It is an implementation of the Levenshtein Distance Algorithm in CFScript that I based off of a C# version written by Siderite Zackwehdex. Finding the "distance" between two strings is a means of comparing two strings to see how similar they both are. This can be done by finding the Longest Common String or LCS. It is as much a brain bender as it can be occasionally useful.The basic gist of the concept is this: Iterate over two strings making a note of how many characters were inserted, deleted, or transposed from one string to the other. When a difference is found, "bookmark" where you are and start looking ahead in each string to see if the strings are going to start matching up again down the line. How far down the line you look is controlled by in an input called maxOffset. The LCS is the number of characters between the two strings which were identical. The "distance" of the two strings is simply the average string length minus the longest common string. The similarity of the two strings can be expressed as a percentage given 1 minus the strings' distance divided the length of the longest string. Enough theory-- let's look at an example:
I love to go ride my go-kart. (29 chars)

I love to ride my go-cart outside. (34 chars)

There are 25 characters in the two strings that are identical. This is our LCS.

String 1 is 4 characters different than string 2, and string 2 has 9 characters different than string 1.

So on average, we will say 6.5 characters would need to be changed to make the strings identical. This is our distance. (4 + 9 = 13 / 2 = 6.5)

Divide that distance by the length of our longest string and we can come to the conclusion that the strings are an 81% match. (1 - (6.5 / 34) = .8088 = 81%) Here's another example:

I love to ride my go-cart outside. (34 chars)

- There is a deletion of the word "go "
- There is a substitution of "c" for "k" in go-cart
- There is an addition of the word "outside"

There are 25 characters in the two strings that are identical. This is our LCS.

String 1 is 4 characters different than string 2, and string 2 has 9 characters different than string 1.

So on average, we will say 6.5 characters would need to be changed to make the strings identical. This is our distance. (4 + 9 = 13 / 2 = 6.5)

Divide that distance by the length of our longest string and we can come to the conclusion that the strings are an 81% match. (1 - (6.5 / 34) = .8088 = 81%) Here's another example:

The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains Lorum Ipsum, yadda yadda. Lorum Ipsum, yadda yadda. La La La La Luke, I am your father. |
The rain in Madrid stays totally on the plains The rain in Spain stays mainly on the plains The rain in Barcelona stays entirely in the air Lorum Ipsum, Yabba dabba doo. Whatcha eatin? Nutin' Honey. Da Da Da Duke, I am your father. |

- Roughly 67 characters are different between the two strings.
- The Longest Common String (LCS) is 180.
- The strings are a 73% match.

- If both strings are empty, I short circuit and return a 0 for distance and LCS. Similarity is 100%.
- If either string is empty, I short circuit and return the length of the non-empty string as the distance. The LCS and similarity is 0.
- To better detect differences at the start of the string I check the first three characters. That way a matching first letter wouldn't be confused in two strings like "top hat" and "the hat"
- When looking ahead in the strings trying to reconcile a difference I search for the distance of the maxOffset until I find THREE contiguous matching characters. This is to try and eliminate false positives.
- My function will highlight the differences between the strings by wrapping the deviations in the HTML tag of your choice. Default is <span style="background: yellow;"></span>

[code] <cfscript> /* StringSimilarity Brad Wood [email protected] May 2007 Code adopted from Siderite Zackwehdex's Blog http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html Parameters: s1: First string to be compared s2: Second string to be compared maxOffset: Average number of characters that s1 will deviate from s2 at any given point. This is used to control how far ahead the function looks to try and find the end of a peice of inserted text. Play with it to suit. */ function stringSimilarity(s1,s2,maxOffset) { var c = 0; var offset1 = 0; var offset2 = 0; var lcs = 0; // These two strings will contain the "highlighted" version var _s1 = createObject("java","java.lang.StringBuffer").init(javacast("int",len(s1)*3)); var _s2 = createObject("java","java.lang.StringBuffer").init(javacast("int",len(s2)*3)); // These chaactes will surround differences in the strings // (Inserted into _s1 and _s2) var h1 = "<span style=""background: yellow;"">"; var h2 = "</span>"; var return_struct = structNew(); // If both strings are empty if (not len(trim(s1)) and not len(trim(s2))) { return_struct.lcs = 0; return_struct.similarity = 1; return_struct.distance = 0; return_struct.s1 = ""; return_struct.s2 = ""; return return_struct; } // If s2 is empty, but s1 isn't if (len(trim(s1)) and not len(trim(s2))) { return_struct.lcs = 0; return_struct.similarity = 0; return_struct.distance = len(s1); return_struct.s1 = h1 & s1 & h2; return_struct.s2 = ""; return return_struct; } // If s1 is empty, but s2 isn't else if (len(trim(s2)) and not len(trim(s1))) { return_struct.lcs = 0; return_struct.similarity = 0; return_struct.distance = len(s2); return_struct.s1 = ""; return_struct.s2 = h1 & s2 & h2; return return_struct; } // Examine the strings, one character at a time, anding at the shortest string // The offset adjusts for extra characters in either string. while ((c + offset1 lt len(s1)) and (c + offset2 lt len(s2))) { // Pull the next charactes out of s1 anbd s2 next_s1 = mid(s1,c + offset1+1,iif(not c,3,1)); // First time through check the first three next_s2 = mid(s2,c + offset2+1,iif(not c,3,1)); // First time through check the first three // If they are equal if (compare(next_s1,next_s2) eq 0) { // Our longeset Common String just got one bigger lcs = lcs + 1; // Append the characters onto the "highlighted" version _s1.append(left(next_s1,1)); _s2.append(left(next_s2,1)); } // The next two charactes did not match // Now we will go into a sub-loop while we attempt to // find our place again. We will only search as long as // our maxOffset allows us to. else { // Don't reset the offsets, just back them up so you // have a point of reference old_offset1 = offset1; old_offset2 = offset2; _s1_deviation = ""; _s2_deviation = ""; // Loop for as long as allowed by our offset // to see if we can match up again for (i = 0; i lt maxOffset; i=i+1) { next_s1 = mid(s1,c + offset1 + i+1,3); // Increments each time through. len_next_s1 = len(next_s1); bookmarked_s1 = mid(s1,c + offset1+1,3); // stays the same next_s2 = mid(s2,c + offset2 + i+1,3); // Increments each time through. len_next_s2 = len(next_s2); bookmarked_s2 = mid(s2,c + offset2+1,3); // stays the same // If we reached the end of both of the strings if(not len_next_s1 and not len_next_s2) { // Quit break; } // These variables keep track of how far we have deviated in the // string while trying to find our match again. _s1_deviation = _s1_deviation & left(next_s1,1); _s2_deviation = _s2_deviation & left(next_s2,1); // It looks like s1 has a match down the line which fits // where we left off in s2 if (compare(next_s1,bookmarked_s2) eq 0) { // s1 is now offset THIS far from s2 offset1 = offset1+i; // Our longeset Common String just got bigger lcs = lcs + 1; // Now that we match again, break to the main loop break; } // It looks like s2 has a match down the line which fits // where we left off in s1 if (compare(next_s2,bookmarked_s1) eq 0) { // s2 is now offset THIS far from s1 offset2 = offset2+i; // Our longeset Common String just got bigger lcs = lcs + 1; // Now that we match again, break to the main loop break; } } //This is the number of inserted characters were found added_offset1 = offset1 - old_offset1; added_offset2 = offset2 - old_offset2; // We reached our maxoffset and couldn't match up the strings if(added_offset1 eq 0 and added_offset2 eq 0) { _s1.append(h1 & left(_s1_deviation,added_offset1+1) & h2); _s2.append(h1 & left(_s2_deviation,added_offset2+1) & h2); } // s2 had extra characters else if(added_offset1 eq 0 and added_offset2 gt 0) { _s1.append(left(_s1_deviation,1)); _s2.append(h1 & left(_s2_deviation,added_offset2) & h2 & right(_s2_deviation,1)); } // s1 had extra characters else if(added_offset1 gt 0 and added_offset2 eq 0) { _s1.append(h1 & left(_s1_deviation,added_offset1) & h2 & right(_s1_deviation,1)); _s2.append(left(_s2_deviation,1)); } } c=c+1; } // Anything left at the end of s1 is extra if(c + offset1 lt len(s1)) { _s1.append(h1 & right(s1,len(s1)-(c + offset1)) & h2); } // Anything left at the end of s2 is extra if(c + offset2 lt len(s2)) { _s2.append(h1 & right(s2,len(s2)-(c + offset2)) & h2); } // Distance is the average string length minus the longest common string distance = (len(s1) + len(s2))/2 - lcs; // Whcih string was longest? maxLen = iif(len(s1) gt len(s2),de(len(s1)),de(len(s2))); // Similarity is the distance divided by the max length similarity = iif(maxLen eq 0,1,1-(distance/maxLen)); // Return what we found. return_struct.lcs = lcs; return_struct.similarity = similarity; return_struct.distance = distance; return_struct.s1 = _s1.toString(); // "highlighted" version return_struct.s2 = _s2.toString(); // "highlighted" version return return_struct; } </cfscript> [/code]

[code]<cfset string1 = "The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains Lorum Ipsum, yadda yadda. Lorum Ipsum, yadda yadda. La La La La Luke, I am your father."> <cfset string2 = "The rain in Madrid stays totally on the plains The rain in Spain stays mainly on the plains The rain in Barcelona stays entirely in the air Lorum Ipsum, Yabba dabba doo. Whatcha eatin? Nutin' Honey. Da Da Da Duke, I am your father."> <cfset comparison_result = stringSimilarity(string1,string2,10)> <cfoutput> Roughly #comparison_result.distance# characters are different between the two strings.<br> The strings are a #numberformat(comparison_result.similarity*100)#% match.<br> The Longest Common String is #comparison_result.lcs#.<br> <br> <table border="1" cellpadding="10" cellspacing="0"> <tr> <td> #replacenocase(comparison_result.s1,chr(10),"<br>","all")# </td> <td> #replacenocase(comparison_result.s2,chr(10),"<br>","all")# </td> <tr> </table> </cfoutput>[/code]

Comments are currently closed

Adrian Lynchsaidat 08/01/2008 02:45:24 PM

I was about to add a feature to a project that needs to tell the user that two pieces of text aren't different enough (for SEO purposes) and this looks like it'll work a treat.

Many thanks.

Brad Woodsaidat 08/01/2008 09:27:56 PM

Micahsaidat 09/09/2008 12:32:51 AM

George Jemptysaidat 11/05/2008 11:19:01 PM

Toppersaidat 12/16/2008 04:39:18 AM

I need to correct an error in some software that has led to a database being out of touch with another reference database - don't ask about the bad DB design.

If this works, I'll owe you a cupcake!

Brad Woodsaidat 12/16/2008 03:23:39 PM

Seriously though, if you need to compare two databases, Redgate software makes some very kick butt tools for that. They will show you line by line comparisons of stored procs and stuff, but when it comes to the differences in data I think it just tells you they aren't the same.

Hal Helmssaidat 01/08/2009 10:27:43 AM

Sideritesaidat 05/07/2009 02:22:01 PM

Andy Belleniesaidat 03/29/2010 06:46:30 AM

Nice script! .. .but you need to var the return_stuct too.

/ Andy

www.cfwheels.org

Brad Woodsaidat 03/30/2010 01:00:26 AM

I updated the post and the download.

Kerrsaidat 12/13/2010 08:44:05 AM

HPsaidat 03/10/2011 08:36:06 PM

Anyone knows how to compare two strings, then returns the diff.

For example:

s1 = 'GIT-04 (INT) AMT DR'

s2 = 'GIT-04 (INT)'

it will returns 'AMT DR'

I spent hours using Javascript and Coldfusion, but there is no luck.

Any advise, anyone?

Nebusaidat 04/20/2011 07:19:31 AM

177 distance = (len(s1) + len(s2))/2 - lcs;

178 // Whcih string was longest?

179 maxLen = iif(len(s1) gt len(s2),de(len(s1)),de(len(s2)));

180 // Similarity is the distance divided by the max length

181 similarity = iif(maxLen eq 0,1,1-(distance/maxLen));

Should/could be

177 var distance = (len(s1) + len(s2))/2 - lcs;

178 // Whcih string was longest?

179 var maxLen = iif(len(s1) gt len(s2),de(len(s1)),de(len(s2)));

180 // Similarity is the distance divided by the max length

181 var similarity = iif(maxLen eq 0,1,1-(distance/maxLen));

You might also be interested in http://cfdiff.googlecode.com/ . Keep up the good work.

Aiming Xusaidat 07/27/2011 02:58:37 PM

Valeriy Nenovsaidat 08/03/2011 08:41:16 PM

Val

public string GetFixedLengthString(string input, int length)

{

input = input ?? string.Empty;

input = input.Length > length ? input.Substring(0, length) : input;

return string.Format("{0,-" + length + "}", input);

}

// define the resulting variables

private int return_struct_lcs;

private float return_struct_similarity;

private int return_struct_distance;

private string return_struct_s1;

private string return_struct_s2;

/*

StringSimilarity

Brad Wood

[email protected]

May 2007

Code adopted from Siderite Zackwehdex's Blog

http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html

Parameters:

s1: First string to be compared

s2: Second string to be compared

maxOffset: Average number of characters that s1 will deviate from s2 at any given point.

This is used to control how far ahead the function looks to try and find the

end of a peice of inserted text. Play with it to suit.

*/

/// <summary>

/// Val converted manually form the Cold Fusion code at

/// http://www.codersrevolution.com/index.cfm/2008/7/29/ColdFusion-Levenshtein-Distance-String-comparison-and-highlighting

/// </summary>

/// <param name="s1"></param>

/// <param name="s2"></param>

/// <param name="maxOffset"></param>

public void stringSimilarity(string s1, string s2, int maxOffset)

{

int c = 0;

int offset1 = 0;

int offset2 = 0;

int lcs = 0;

// These two strings will contain the "highlighted" version

// was: string _s1 = createObject("java","java.lang.StringBuffer").init(javacast("int",s1.Length*3));

// Was: string _s2 = createObject("java","java.lang.StringBuffer").init(javacast("int",s2.Length*3));

string _s1 = ""; // Val

string _s2 = ""; // Val

_s1 = GetFixedLengthString(_s1, s1.Length * 3); // Val

_s2 = GetFixedLengthString(_s2, s2.Length * 3); // Val

// These charactes will surround differences in the strings

// (Inserted into _s1 and _s2)

//was: string h1 = "<span style="/"background: yellow;"/">";

//was: string h2 = "</span>";

string h1 = "<font color=red>";

string h2 = "</font>";

// was: var return_struct = structNew();

s1 = s1.Trim();

s2 = s2.Trim();

// If both strings are empty

if (String.IsNullOrEmpty(s1) && String.IsNullOrEmpty(s2)) // was: (!s1.Length && !s2.Length)

{

return_struct_lcs = 0;

return_struct_similarity = 1;

return_struct_distance = 0;

return_struct_s1 = "";

return_struct_s2 = "";

//return return_struct;

}

// If s2 is empty, but s1 isn't

if (!String.IsNullOrEmpty(s1) && String.IsNullOrEmpty(s2))// was:(s1.Length && !s2.Length)

{

return_struct_lcs = 0;

return_struct_similarity = 0;

return_struct_distance = s1.Length;

return_struct_s1 = h1 + s1 + h2;

return_struct_s2 = "";

//return return_struct;

}

// If s1 is empty, but s2 isn't

else if (String.IsNullOrEmpty(s1) && !String.IsNullOrEmpty(s2))// was:(s2.Length && !s1.Length)

{

return_struct_lcs = 0;

return_struct_similarity = 0;

return_struct_distance = s2.Length;

return_struct_s1 = "";

return_struct_s2 = h1 + s2 + h2;

//return return_struct;

}

// Examine the strings, one character at a time, anding at the shortest string

// The offset adjusts for extra characters in either string.

while ((c + offset1 < s1.Length)

&& (c + offset2 < s2.Length))

{

// Pull the next charactes out of s1 and s2

//was: string next_s1 = mid(s1,c + offset1+1,iif(!c,3,1)); // First time through check the first three

//was: string next_s2 = mid(s2,c + offset2+1,iif(!c,3,1)); // First time through check the first three

int iif; // Val

if (c == 0) iif = 3; else iif = 1; // Val

string next_s1 = s1.Substring(c + offset1 + 1, iif); // First time through check the first three

string next_s2 = s2.Substring(c + offset2 + 1, iif); // First time through check the first three

// If they are equal

if (next_s1 == next_s2) //was:(compare(next_s1,next_s2) == 0)

{

// Our longeset Common String just got one bigger

lcs = lcs + 1;

// Append the characters onto the "highlighted" version

// was: _s1.append(left(next_s1,1));

// was: _s2.append(left(next_s2,1));

_s1 = _s1 + next_s1.Substring(0,1);

_s2 = _s2 + next_s2.Substring(0,1);

}

// The next two charactes did not match

// Now we will go into a sub-loop while we attempt to

// find our place again. We will only search as long as

// our maxOffset allows us to.

else

{

// Don't reset the offsets, just back them up so you

// have a point of reference

int old_offset1 = offset1;

int old_offset2 = offset2;

string _s1_deviation = "";

string _s2_deviation = "";

// Loop for as long as allowed by our offset

// to see if we can match up again

for (int i = 0; i < maxOffset; i++)

{

//was: next_s1 = mid(s1,c + offset1 + i+1,3); // Increments each time through.

next_s1 = s1.Substring(c + offset1 + i + 1, 3); // Increments each time through.

int len_next_s1 = next_s1.Length;

// was: string bookmarked_s1 = mid(s1,c + offset1+1,3); // stays the same

string bookmarked_s1 = s1.Substring(c + offset1 + 1,3);

//was: next_s2 = mid(s2,c + offset2 + i+1,3); // Increments each time through.

next_s2 = s2.Substring(c + offset2 + i + 1, 3); // Increments each time through.

int len_next_s2 = next_s2.Length;

//was: string bookmarked_s2 = mid(s2,c + offset2+1,3); // stays the same

string bookmarked_s2 = s2.Substring(c + offset2 + 1, 3); // stays the same

// If we reached the end of both of the strings

if ((len_next_s1 == 0) && (len_next_s2 == 0)) // was: (!len_next_s1 && !len_next_s2)

{

// Quit

break;

}

// These variables keep track of how far we have deviated in the

// string while trying to find our match again.

_s1_deviation = _s1_deviation + next_s1.Substring(0, 1); // was: left(next_s1,1);

_s2_deviation = _s2_deviation + next_s2.Substring(0, 1); // was; left(next_s2,1);

// It looks like s1 has a match down the line which fits

// where we left off in s2

if (next_s1 == bookmarked_s2)// was: (compare(next_s1,bookmarked_s2) == 0)

{

// s1 is now offset THIS far from s2

offset1 = offset1+i;

// Our longeset Common String just got bigger

lcs = lcs + 1;

// Now that we match again, break to the main loop

break;

}

// It looks like s2 has a match down the line which fits

// where we left off in s1

if (next_s2 == bookmarked_s1) // was: (compare(next_s2,bookmarked_s1) == 0)

{

// s2 is now offset THIS far from s1

offset2 = offset2+i;

// Our longeset Common String just got bigger

lcs = lcs + 1;

// Now that we match again, break to the main loop

break;

}

}

//This is the number of inserted characters were found

int added_offset1 = offset1 - old_offset1;

int added_offset2 = offset2 - old_offset2;

// We reached our maxoffset and couldn't match up the strings

if(added_offset1 == 0 && added_offset2 == 0)

{

// was: _s1.append(h1 + left(_s1_deviation,added_offset1+1) + h2);

// was: _s2.append(h1 + left(_s2_deviation,added_offset2+1) + h2);

_s1 = _s1 + h1 + _s1_deviation.Substring(0, added_offset1 + 1) + h2;

_s2 = _s2 + h1 + _s2_deviation.Substring(0, added_offset2 + 1) + h2;

}

// s2 had extra characters

else if(added_offset1 == 0 && added_offset2 > 0)

{

// was: _s1.append(left(_s1_deviation,1));

// was: _s2.append(h1 + left(_s2_deviation,added_offset2) + h2 + right(_s2_deviation,1));

_s1 = _s1 + _s1_deviation.Substring(0, 1);

_s2 = _s2 + h1 + _s2_deviation.Substring(0, added_offset2) + h2 + _s2_deviation.Substring(_s2_deviation.Length - 1, 1); // ?? Length -1

}

// s1 had extra characters

else if(added_offset1 > 0 && added_offset2 == 0)

{

// was: _s1.append(h1 + left(_s1_deviation,added_offset1) + h2 + right(_s1_deviation,1));

// was: _s2.append(left(_s2_deviation,1));

_s1 = _s1 + h1 + _s1_deviation.Substring(0, added_offset1) + h2 + _s1_deviation.Substring(_s1_deviation.Length - 1, 1);

_s2 = _s2 + _s2_deviation.Substring(0, 1);

}

}

c++;

// was: c=c+1;

}

// Anything left at the end of s1 is extra

if(c + offset1 < s1.Length)

{

// was: _s1.append(h1 + right(s1,s1.Length-(c + offset1)) + h2);

_s1 = _s1 + h1 + s1.Substring(s1.Length - 1, s1.Length - (c + offset1)) + h2;

}

// Anything left at the end of s2 is extra

if(c + offset2 < s2.Length)

{

// was: _s2.append(h1 + right(s2,s2.Length-(c + offset2)) + h2);

_s2 = _s2 + h1 + s2.Substring(s2.Length - 1, s2.Length - (c + offset2)) + h2;

}

// Distance is the average string length minus the longest common string

int distance = (s1.Length + s2.Length)/2 - lcs;

// Which string was longest?

// was: int maxLen = iif(s1.Length > s2.Length,de(s1.Length),de(s2.Length));

int maxLen;

if(s1.Length > s2.Length) maxLen = s1.Length;

else maxLen = s2.Length;

// Similarity is the distance divided by the max length

// was: similarity = iif(maxLen eq 0,1,1-(distance/maxLen));

float similarity;

if(maxLen == 0) similarity =1;

else similarity = 1-(distance/maxLen);

// Return what we found.

return_struct_lcs = lcs;

return_struct_similarity = similarity;

return_struct_distance = distance;

return_struct_s1 = _s1; // "highlighted" version

return_struct_s2 = _s2; // "highlighted" version

//return return_struct;

}

Brad Woodsaidat 08/03/2011 09:14:23 PM

Sideritesaidat 08/08/2011 01:57:25 AM

Samsaidat 01/31/2012 10:56:42 AM