Four Functions for Finding Fuzzy String Matches in C# Extensions

by Tyler Jensen 27. May 2011 21:31

How similar are these two strings? Does string X sound like string Y? Could they be duplicates? Is the difference between string N and string M just a typo? There are many scenarios in enterprise software development where the answers to these questions can be highly significant.

In my search to answer such questions with code, the most helpful resource I’ve ever found was presented by George M. Brown on www.codeguru.com who implemented four well known and powerful fuzzy string matching algorithms in VBA for Access a few years ago. In an effort to convert these algorithms to C#, I found two alternatives that saved me some time (see below).

The four algorithms, with requisite Wikipedia links, are:

Each of the algorithms have been implemented here as extensions to the string and string array. Before we get to the code, let’s take a look at some results. Here are two very simplistic tests. The algorithms are not combined in any way. You can experiment with them and create your own secret sauce.

Name Matching for (mispelled deliberately): "Jensn"
The first test result set presents the raw output of the algorithms on a mispelled surname (mine) against a list of other surnames. I’ve highlighted the best score.

Dice Coefficient for Jensn:
    .00000 against Adams
    .46154 against Benson
    .00000 against Geralds
    .37500 against Johannson
    .42857 against Johnson
    .76923 against Jensen
    .30769 against Jordon
    .30769 against Madsen
    .00000 against Stratford
    .14286 against Wilkins

Levenshtein Edit Distance for Jensn:
    4 against Adams
    2 against Benson
    5 against Geralds
    5 against Johannson
    3 against Johnson
    1 against Jensen
    4 against Jordon
    4 against Madsen
    8 against Stratford
    6 against Wilkins

Longest Common Subsequence for Jensn:
    .04000, s against Adams
    .33333, ensn against Benson
    .05714, es against Geralds
    .08889, jnsn against Johannson
    .17143, jnsn against Johnson
    .56667, jensn against Jensen
    .06667, jn against Jordon
    .13333, en against Madsen
    .02222, s against Stratford
    .11429, ns against Wilkins

Double Metaphone for Jensn: ANSN
    ATMS metaphone for Adams
    PNSN metaphone for Benson
    JRLT metaphone for Geralds
    AHNS metaphone for Johannson
    ANSN metaphone for Johnson
    ANSN metaphone for Jensen
    ARTN metaphone for Jordon
    MTSN metaphone for Madsen
    STTR metaphone for Stratford
    FLKN metaphone for Wilkins

Address Matching for "2130 South Fort Union Blvd."
The second test is the same code run against multiple addresses trying to match a given address. Let’s see how it turned out.

Dice Coefficient for 2130 South Fort Union Blvd.:
    .16000 against 2689 East Milkin Ave.
    .10000 against 85 Morrison
    .27273 against 2350 North Main
    .07843 against 567 West Center Street
    .66667 against 2130 Fort Union Boulevard
    .61538 against 2310 S. Ft. Union Blvd.
    .42553 against 98 West Fort Union
    .12245 against Rural Route 2 Box 29
    .05000 against PO Box 3487
    .04444 against 3 Harvard Square

Levenshtein Edit Distance for 2130 South Fort Union Blvd.:
    18 against 2689 East Milkin Ave.
    22 against 85 Morrison
    18 against 2350 North Main
    22 against 567 West Center Street
    11 against 2130 Fort Union Boulevard
    8 against 2310 S. Ft. Union Blvd.
    14 against 98 West Fort Union
    19 against Rural Route 2 Box 29
    22 against PO Box 3487
    22 against 3 Harvard Square

Longest Common Subsequence for 2130 South Fort Union Blvd.:
    .02116, 2 st in v. against 2689 East Milkin Ave.
    .02020,  son against 85 Morrison
    .04444, 230 oth in against 2350 North Main
    .01010,  st t  against 567 West Center Street
    .25481, 2130 fort union blvd against 2130 Fort Union Boulevard
   .25765, 230 s ft union blvd. against 2310 S. Ft. Union Blvd.
    .25514,  st fort union against 98 West Fort Union
    .02593,  out  o  against Rural Route 2 Box 29
    .01347, o o  against PO Box 3487
    .01389, 3 hrvd against 3 Harvard Square

Double Metaphone for 2130 South Fort Union Blvd.: STFR
    STML metaphone for 2689 East Milkin Ave.
    MRSN metaphone for 85 Morrison
    NRTM metaphone for 2350 North Main
    SSNT metaphone for 567 West Center Street
    FRTN metaphone for 2130 Fort Union Boulevard
    SFTN metaphone for 2310 S. Ft. Union Blvd.
    STFR metaphone for 98 West Fort Union
    RRLR metaphone for Rural Route 2 Box 29
    PPKS metaphone for PO Box 3487
    RFRT metaphone for 3 Harvard Square

As you can see, the double metaphone algorithm may not be as useful on its own as the other algorithms. But when you put them together in creative ways, you’ll get a very powerful result.

Here’s how easy the algorithms, as extension methods, are to use.

private static void NameMatching()
{
	//name matching
	string input = "Jensn";
	string[] surnames = new string[] { 
		"Adams",
		"Benson",
		"Geralds",
		"Johannson",
		"Johnson",
		"Jensen",
		"Jordon",
		"Madsen",
		"Stratford",
		"Wilkins"
		};

	Console.WriteLine("Dice Coefficient for Jensn:");
	foreach (var name in surnames)
	{
		double dice = input.DiceCoefficient(name);
		Console.WriteLine("\t{0} against {1}", 
			dice.ToString("###,###.00000"), name);
	}

	Console.WriteLine();
	Console.WriteLine("Levenshtein Edit Distance for Jensn:");
	foreach (var name in surnames)
	{
		int leven = input.LevenshteinDistance(name);
		Console.WriteLine("\t{0} against {1}", leven, name);
	}

	Console.WriteLine();
	Console.WriteLine("Longest Common Subsequence for Jensn:");
	foreach (var name in surnames)
	{
		var lcs = input.LongestCommonSubsequence(name);
		Console.WriteLine("\t{0}, {1} against {2}", 
			lcs.Item2.ToString("###,###.00000"), lcs.Item1, name);
	}

	Console.WriteLine();
	string mp = input.ToDoubleMetaphone();
	Console.WriteLine("Double Metaphone for Jensn: {0}", mp);
	foreach (var name in surnames)
	{
		string nameMp = name.ToDoubleMetaphone();
		Console.WriteLine("\t{0} metaphone for {1}", nameMp, name);
	}
}

Source References
In researching and finding these algorithms, I poured over what seemed like hundreds of articles, blogs and open source resources. In the end, I settled on three primary sources and made certain modifications to suite my own needs. Here they are.

Additional references:

I recommend you spend time with these sources and doing your own research. Undoubtedly you will come up with even better algorithms. When you do, please share them with us here.

Download the complete code here: FuzzyStrings.zip (15.27 KB)

Tags:

Code

Try Catch Nothing Worst Code Ever

by Tyler Jensen 5. May 2011 13:50

Tracking down a problem in a commercial .NET library today, I found this little gem (identifying code in the try removed):

try
{
   //any code that could fail in any way here
}
catch (Exception)
{
}

Not only should this construct not compile, any attempt to do so should cause your computer to explode.

Tags:

Code | Principles

Extend IEnumerable<T> with ForEach<T> and Once<T> Methods

by Tyler Jensen 29. April 2011 03:41

A few days ago, a fellow programmer asked me a common question. How do you print odd numbers between one integer value and another. It's a common question and the quickest solution that came to mind is the common for loop with a simple if and the mod operator.

Later I began toying with the question and in combination wanted to experiment with an alternative to the limited ForEach extension method on IList<T>. What you see below is the result of a little tinkering and tweaking along with some fluent goodness.

I considered naming my custom extesion method ForEach but decided to use something different to remind myself that it is not your grandfather's ForEach extension method. So I named it ForEvery and in the middle of playing with that, I ended up creating the Once extension method on IEnumerable<T>.

class Program
{
   static void Main(string[] args)
   {
      PrintOdds(1, 100);
      Console.ReadLine();
   }

   static void PrintOdds(int x, int y)
   {
      Console.WriteLine("Odds with traditional for and if");
      for (int i = x; i <= y; i++)
      {
         if (i % 2 != 0) Console.Write("{0}, ", i);
      }

      Console.WriteLine();
      Console.WriteLine();
      Console.WriteLine("Odds with ToList and ForEach with if");
      Enumerable.Range(x, y - x + 1)
         .ToList()
         .ForEach(i =>
         {
            if (i % 2 != 0) Console.Write("{0}, ", i);
         });

      Console.WriteLine();
      Console.WriteLine();
      Console.WriteLine("Odds with Where then ToList and ForEach");

      Enumerable.Range(x, y - x + 1).Where(n => n % 2 != 0)
         .ToList().ForEach(i =>
         {
            Console.Write("{0}, ", i);
         });

      Console.WriteLine();
      Console.WriteLine();
      Console.WriteLine("Odds with Where and custom ForEach");

      Enumerable.Range(x, y - x + 1)
         .Where(n => n % 2 != 0)
         .ForEach(i =>
         {
            Console.Write("{0}, ", i);
         });

      Console.WriteLine();
      Console.WriteLine();
      Console.WriteLine("Odds with custom ForEvery fluent");

      Enumerable.Range(x, y - x + 1)
         .ForEvery(n => n % 2 != 0, i => 
         { 
            Console.Write("{0}, ", i);
         })
         .Once(n => n.Count() > 99, i =>
         {
            Console.WriteLine();
            Console.WriteLine("Once fluent");
            Console.WriteLine();
            Console.WriteLine("Evens with ForEvery fluent from {0} to {1}", i.Min(), i.Max());
         })
         .ForEvery(n => n % 2 == 0, i =>
         {
            Console.Write("{0}, ", i);
         });
   }
}

public static class EnumerableExtensions
{
   public static void ForEach<T>(this IEnumerable<T> collection, Action<T> action)
   {
      foreach (var item in collection)
      {
         action(item);
      }
   }

   public static IEnumerable<T> Once<T>(this IEnumerable<T> collection, 
      Func<IEnumerable<T>, bool> predicate, Action<IEnumerable<T>> action)
   {
      if (predicate(collection)) action(collection);
      return collection;
   }

   public static IEnumerable<T> ForEvery<T>(this IEnumerable<T> collection, 
      Func<T, bool> predicate, Action<T> action)
   {
      foreach (var item in collection)
      {
         if (predicate(item)) action(item);
      }
      return collection;
   }
}

Drop the code into a console app and run it. If you like these little extensions, I'd love to hear from you.

Tags:

Code

Deep Null Coalescing in C# with Dis.OrDat<T>

by Tyler Jensen 11. January 2011 02:43

The other day, a friend and I were reviewing some code like this and decided we deeply disliked it.

// var line1 = person.Address.Lines.Line1 ?? string.Empty;
// throws NullReferenceException: 
//    {"Object reference not set to an instance of an object."}

// The ugly alternative

var line1 = person.Address == null
        ? "n/a"
        : person.Address.Lines == null
        ? "n/a"
        : person.Address.Lines.Line1;

If you have ever written code like that or even worse, with the if (obj != null) statements, then you can certainly relate to our lack of love for the constructs required to avoid the dratted NullReferenceException.

So we experimented with a solution and nearly got there. Tonight I finished that up and the new Dis class is born with the OrDat<T> method so you can now write this instead.

var line2 = Dis.OrDat<string>(() => person.Address.Lines.Line2, "n/a");

Here’s the full example and if you’re patient enough to scroll down, you’ll also find the full implementation of the Dis.OrDat class and method.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace DeepNullCoalescence
{
  class Program
  {
    static void Main(string[] args)
    {
      var person = new Person();

      // var line1 = person.Address.Lines.Line1 ?? string.Empty;
      // throws NullReferenceException: 
      //    {"Object reference not set to an instance of an object."}
      
      // The ugly alternative
      var line1 = person.Address == null
              ? "n/a"
              : person.Address.Lines == null
              ? "n/a"
              : person.Address.Lines.Line1;

      // A cooler alternative
      var line2 = Dis.OrDat<string>(() => person.Address.Lines.Line2, "n/a");

      Console.WriteLine(line1);
      Console.WriteLine(line2);
      Console.ReadLine();
    }
  }

  internal class Person
  {
    public Address Address { get; set; }
  }

  internal class Address
  {
    public AddressLines Lines { get; set; }
    public string State { get; set; }
  }

  internal class AddressLines
  {
    public string Line1 { get; set; }
    public string Line2 { get; set; }
  }
}

And for the patient reader, here is the actual magic.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Linq.Expressions;

namespace DeepNullCoalescence
{
  public static class Dis
  {
    public static T OrDat<T>(Expression<Func<T>> expr, T dat)
    {
      try
      {
        var func = expr.Compile();
        var result = func.Invoke();
        return result ?? dat; //now we can coalesce
      }
      catch (NullReferenceException)
      {
        return dat;
      }
    }
  }
}

Tags:

Code

.NET Task Factory and Dangers of Iteration

by Tyler Jensen 30. October 2010 13:58

I had a simple small array of objects. I wanted to send each of them into a method that would fire off and manage a long running process on each of them in a new thread pool thread.

So I remebered the coolness of the new .NET 4.0 Task and it's attendant Factory. Seemed simple enough but I quickly learned a lesson I should have already known.

I've illustrated in code my first two failures and the loop that finally got it right. Let me know what you think. Undoubtedly there is even a better way.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace TaskFactoryExample
{
  public class ServerManager
  {
    IServerLib[] serverLibs;
    List<Task> taskList = new List<Task>();
    List<CancellationTokenSource> cancelTokens = new List<CancellationTokenSource>();

    public ServerManager(IServerLib[] servers)
    {
      serverLibs = servers;
    }

    // FIRST FAIL: only the last lib in the iteration gets sent to all ManageStart calls
    internal void Start_FirstFailed(string[] args)
    {
      foreach (var lib in serverLibs)
      {
        var tokenSource = new CancellationTokenSource();
        taskList.Add(Task.Factory.StartNew(() => { ManageStart(lib, args); }, tokenSource.Token));
        cancelTokens.Add(tokenSource);
      }
    }

    // SECOND FAIL: i is incremented finally to serverLibs.Length before ManageStart is called
	// resulting in an index out of range exception
    internal void Start_SecondFailed(string[] args)
    {
      for (int i = 0; i < serverLibs.Length; i++)
      {
        var tokenSource = new CancellationTokenSource();
        taskList.Add(Task.Factory.StartNew(() => { ManageStart(serverLibs[i], args); }, tokenSource.Token));
        cancelTokens.Add(tokenSource);
      }
    }

    // finally got it right - get a local reference to the item in the array so ManageStart 
    // is fed the correct serverLib object
    internal void Start(string[] args)
    {
      for (int i = 0; i < serverLibs.Length; i++ )
      {
        var serverLib = serverLibs[i];
        var tokenSource = new CancellationTokenSource();
        taskList.Add(Task.Factory.StartNew(() => { ManageStart(serverLib, args); }, tokenSource.Token));
        cancelTokens.Add(tokenSource);
      }
    }

    private void ManageStart(IServerLib lib, string[] args)
    {
      try
      {
        //code redacted for brevity
        //start long running or ongoing process with lib on threadpool thread
        lib.Start();
      }
      catch (Exception e)
      {
        //TODO: log general exception catcher
        throw; //leave in for testing
      }
    }

    internal void Stop()
    {
      try
      {
        foreach (var lib in serverLibs) lib.Stop();
        foreach (var tokenSource in cancelTokens) tokenSource.Cancel();
        foreach (var t in taskList) if (t.IsCompleted) t.Dispose();
      }
      catch (Exception e)
      {
        //TODO: log general exception catcher
        throw; //leave in for testing
      }
    }
  }
}

Tags:

Code

Me...

Tyler Jensen

Tyler Jensen
.NET Developer and Architect

Month List

Other Stuff