I want to show you a pretty simple algorithm: an implementation of the Damerau–Levenshtein edit distance, ported from the pseudocode on Wikipedia (note that Min() is LINQ's Enumerable.Min, so you'll need a using System.Linq directive):
public static int EditDistance(string string1, string string2)
{
    var s1Length = string1.Length;
    var s2Length = string2.Length;
    var matrix = new int[s1Length + 1, s2Length + 1];

    for (int i = 0; i <= s1Length; i++)
        matrix[i, 0] = i;
    for (int j = 0; j <= s2Length; j++)
        matrix[0, j] = j;

    for (int i = 1; i <= s1Length; i++)
    {
        for (int j = 1; j <= s2Length; j++)
        {
            int cost = (string2[j - 1] == string1[i - 1]) ? 0 : 1;

            matrix[i, j] = (new[]
            {
                matrix[i - 1, j] + 1,        // deletion
                matrix[i, j - 1] + 1,        // insertion
                matrix[i - 1, j - 1] + cost  // substitution
            }).Min();

            // transposition of two adjacent characters
            if ((i > 1) && (j > 1) &&
                (string1[i - 1] == string2[j - 2]) &&
                (string1[i - 2] == string2[j - 1]))
            {
                matrix[i, j] = Math.Min(matrix[i, j], matrix[i - 2, j - 2] + cost);
            }
        }
    }

    return matrix[s1Length, s2Length];
}
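Before wiring it up to anything, a couple of quick sanity checks (my own examples, not from the Wikipedia article) show what it returns:

// "kitten" -> "sitting": substitute k->s, substitute e->i, insert g
Console.WriteLine(EditDistance("kitten", "sitting")); // 3

// Unlike plain Levenshtein, a transposition counts as a single edit
Console.WriteLine(EditDistance("abcd", "abdc")); // 1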
And I plan to run it against a word list, which I load up like this:
var words = new List<string>();
using (var streamReader = new StreamReader("english-words.95"))
{
    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        words.Add(line);
    }
}
Then I need to run a shorter list of words (words2) against this list and get the edit distance for every pair:
var result = new List<int>();
foreach (string word1 in words)
{
    foreach (string word2 in words2)
    {
        result.Add(EditDistance(word1, word2));
    }
}
Interestingly enough, this process takes quite a while, especially when I have a few hundred thousand words in my word list. Go figure.
But since I am using .NET 4.0 (like any normal obsessive developer), I might have the great idea to leverage the parallel libraries included in .NET 4.0, such as System.Threading.Tasks.Parallel.ForEach… phew, that was a long name:
var result = new List<int>();
Parallel.ForEach(words, word1 =>
{
    foreach (string word2 in words2)
    {
        result.Add(EditDistance(word1, word2));
    }
});
Thankfully I was logging the number of items in my result list, and I noticed an anomaly: running single-threaded I got back more items than when running in parallel. I got the performance increase I wanted, but my result was now wrong. Oops! I knew this was because I used a List<T>, which is not thread-safe. So I just leveraged another .NET 4.0 feature, the ConcurrentBag<T>:
var result = new ConcurrentBag<int>(); // from System.Collections.Concurrent
Parallel.ForEach(words, word1 =>
{
    foreach (string word2 in words2)
    {
        result.Add(EditDistance(word1, word2));
    }
});
This is great, but unfortunately I had to know in advance that a plain List<T> would produce a wrong result. No exceptions are thrown; there is just a silent race condition that corrupts the output.
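To see just how silent this failure is, here is a minimal sketch (my own illustration, not from the code above) that hammers a List<int> from multiple threads:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class RaceDemo
{
    static void Main()
    {
        var list = new List<int>();

        // List<T>.Add is not atomic; concurrent calls can lose writes
        // (or occasionally throw), so the final count is unpredictable.
        Parallel.For(0, 1000000, i => list.Add(i));

        Console.WriteLine(list.Count); // usually prints fewer than 1000000
    }
}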
But how can we avoid this? One way would be to approach the problem from a different perspective. What if we tried to solve the original problem with LINQ? Our solution would need to take all the elements from the first list, and apply all the elements from the second list to each of them. Hmm, sounds a bit like a cross join to me. This can easily be solved in LINQ with the SelectMany method:
var result = words
    .SelectMany(word1 => words2.Select(word2 => EditDistance(word1, word2)));
Neato, and if we wanted to write this in the built-in query syntax, it looks like this:
var result = from word1 in words
             from word2 in words2
             select EditDistance(word1, word2);
And that actually looks more natural (one of the few times, in my opinion, where the query syntax wins). Now that we have solved it with LINQ using a more functional approach, we no longer have to worry about thread safety, because we aren't mutating any shared structures. To leverage PLINQ, all we have to do is call AsParallel() on the main list of words:
var result = words.AsParallel()
    .SelectMany(word1 => words2.Select(word2 => EditDistance(word1, word2)));
Or like this:
var result = from word1 in words.AsParallel()
             from word2 in words2
             select EditDistance(word1, word2);
And we have multi-threaded execution! There are still a few things to worry about, such as ordering: by default, a parallel query makes no guarantee that results come back in source order. If we need that guarantee, PLINQ lets us ask for it like this:
var result = words.AsParallel().AsOrdered()
    .SelectMany(word1 => words2.Select(word2 => EditDistance(word1, word2)));
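If you want to convince yourself that AsOrdered() restores the sequential ordering, one quick (and admittedly unscientific) check is to compare the two results. This is my own sketch, reusing words, words2, and EditDistance from above:

var sequential = words
    .SelectMany(word1 => words2.Select(word2 => EditDistance(word1, word2)))
    .ToList();

var parallelOrdered = words.AsParallel().AsOrdered()
    .SelectMany(word1 => words2.Select(word2 => EditDistance(word1, word2)))
    .ToList();

// With AsOrdered() the two sequences should match element for element;
// without it, the parallel results may come back in any order.
Console.WriteLine(sequential.SequenceEqual(parallelOrdered)); // True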
And there you have it: without all that much effort, just by leveraging some simple LINQ queries, we have created a nicely scalable piece of parallel code. I hope you found this interesting, and I hope that you’ll let me know if you get a chance to use any of this in your applications!
I’m creating a series on LINQ over at TekPub; go check it out!
I had some inquiries as to the performance of this on my machine, and so here are the relative measurements:
foreach loop – 37 seconds
Parallel.ForEach – 31 seconds (this could be further optimized)
LINQ – 39 seconds
LINQ with AsParallel() – 23 seconds
LINQ with AsParallel() and AsOrdered() – 25 seconds
The Parallel.ForEach number came from the naive implementation that used the ConcurrentBag to achieve thread safety. Unfortunately that isn’t as efficient as we could make it. It could be optimized by using another overload of Parallel.ForEach that gives each thread its own local list and then combines them at the end, as sketched below. That would give us much better performance.
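For the curious, here is a minimal, unbenchmarked sketch of that approach (my own illustration, with gate as a hypothetical lock object): the localInit delegate creates one List<int> per thread, the body appends to that thread-local list with no synchronization, and the localFinally delegate merges each thread’s list into the shared result under a lock, once per thread instead of once per item:

var result = new List<int>();
var gate = new object();

Parallel.ForEach(
    words,
    () => new List<int>(),            // localInit: one private list per thread
    (word1, loopState, localList) =>  // body: touches no shared state
    {
        foreach (string word2 in words2)
            localList.Add(EditDistance(word1, word2));
        return localList;
    },
    localList =>                      // localFinally: merge once per thread
    {
        lock (gate)
            result.AddRange(localList);
    });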
Great post — thanks. What sort of performance improvements did you see moving to the parallel implementation?
Very interesting Justin. It would have been really cool if you compared the execution times with and without using any parallel extensions and posted them as part of the post.
How significant was the increase in performance?
@Danimal @Rob I have posted the performance numbers that I saw. Unfortunately I only have a dual core proc to try it on.
Thanks Justin. It’s pretty cool that with a pretty easy coding change you could get an impressive performance improvement.
Great post, thanks. This was especially interesting for me, as I wrote a parallel processing framework for the purpose of running Levenshtein in .NET 2.0! I will experiment to see if this is any faster, but man, it would have saved me a lot of work back then.
I assume you were using Levenshtein only for the purpose of demo-ing this new framework, but if you actually are trying to optimize, you should use this version of Levenshtein. It’s 2.5 times faster – http://webreflection.blogspot.com/2009/02/levenshtein-algorithm-revisited-25.html
@Sam Thanks! And yes, I just ported the code from Wikipedia directly. If I have to use it in production, I’ll most certainly check out your implementation. Actually, I’ll probably check it out for fun anyways. 🙂
Hello, thanks for the great blog post.
Under what licence have you released this code you wrote that is now listed on this wiki entry?
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
MIT/GPL like jquery ? http://jquery.org/license
Thanks!
Jeff