Java fuzzy String matching with names

Durandal picture Durandal · Jan 11, 2014 · Viewed 24.9k times · Source

I've got a stand-alone CSV data loading process that I coded in Java that has to use some fuzzy string matching. It's definitely not ideal, but I don't have much choice. I am matching using a first and last name and I cache all the possibilities at the start of a run. After finding the match, I then need that person object multiple places during the run. I used guava's Objects.hashCode() to create a hash out of the firstname and lastname.

The caching mechanism looks like this:

Map<Integer,PersonDO> personCache = Maps.newHashMap();
for(PersonDO p: dao.getPeople()) {
    personCache.put(Objects.hashCode(p.getFirstName(),p.getLastName()), p);
}

Most of the time I get hits on firstname+lastname, but when it misses I'm falling back using Apache's StringUtils.getLevenshteinDistance() to try and match it. This is how the matching logic flow goes:

    person = personCache.get(Objects.hashCode(firstNameFromCSV,lastNameFromCSV));
    if(person == null) {//fallback to fuzzy matching
        person = findClosetMatch(firstNameFromCSV+lastNameFromCSV);

    }

This is the findClosetMatch() method:

private PersonDO findClosetMatch(String name) {
    int min = 15;//initial value
    int testVal=0;
    PersonDO matchedPerson = null;
    for(PersonDO person: personCache.values()) {
        testVal = StringUtils.getLevenshteinDistance(name,person.getFirstName()+person.getLastName());
        if( testVal < min ) {
            min = testVal;
            matchedPerson = person;
        }
    }
    if(matchedPerson == null) {
        throw new Exception("Unable to find person: " + name) 
    }
    return matchedPerson;
}

This works fine with simple spelling errors, typos and shortened names(ie Mike->Michael), but when I'm completely missing one of the incoming names in the cache I end up returning a false positive match. To help keep this from happening I set the min value in findClosetMatch() to 15(ie no more than 15 chars off); it works most of the time but I've still had a few mis-matches occur: Mike Thompson hits on Mike Thomas etc.

Outside of figuring out a way to get a primary key into the file being loaded, does anyone see a way to improve this process? Any other matching algorithms that could help here?

Answer

Nicole picture Nicole · Jan 13, 2014

As I look at this problem I notice a couple key facts to base some improvements on:

Facts and observations

  1. Max iterations of 1000.
  2. 15 for Levenshtein distance sounds really high to me.
  3. You know, by observing the data empirically, what your fuzzy matching should look like (there are many cases for fuzzy matching and each depends on why the data is bad).
  4. By building this API-like you could plug in many algorithms, including your own and others like Soundex, instead of depending on just one.

Requirements

I've interpreted your problem as requiring the following two things:

  1. You have PersonDO objects that you want to look up via a key that is based on the name. It sounds like you want to do this because you need a pre-existing PersonDO of which one exists per unique name, and the same name may come up more than once in your loop/workflow.
  2. You need "fuzzy matching" because the incoming data is not pure. For the purposes of this algorithm we'll assume that if a name "matches", it should always use the same PersonDO (in other words, a person's unique identifier is their name, which is obviously not the case in real life, but seems to work for you here).

Implementation

Next, let's look at making some improvements on your code:

1. Cleanup: unnecessary hashcode manipulation.

You don't need to generate hash codes yourself. This confuses the issue a bit.

You're simply generating a hashcode for the combination of the firstname + lastname. This is exactly what HashMap would do if you gave it the concatenated string as a key. So, just do that (and add a space, just in case we want to reverse parse out first/last from the key later).

Map<String, PersonDO> personCache = Maps.newHashMap();

public String getPersonKey(String first, String last) {
  return first + " " + last;
}

...
// Initialization code
for(PersonDO p: dao.getPeople()) {
    personCache.put(getPersonKey(p.getFirstName(), p.getLastName()), p);
}

2. Cleanup: Build a retrieval function to perform the lookup.

Since we've changed the key in the map we need to change the lookup function. We'll build this like a mini-API. If we always knew the key exactly (i.e. unique IDs) we would of course just use Map.get. So we'll start with that, but since we know we will need to add fuzzy matching we'll add a wrapper where this can happen:

public PersonDO findPersonDO(String searchFirst, String searchLast) {
  return personCache.get(getPersonKey(searchFirst, searchLast));
}

3. Build a fuzzy matching algorithm yourself using scoring.

Note that since you are using Guava, I've used a few conveniences here (Ordering, ImmutableList, Doubles, etc.).

First, we want to preserve the work we do to figure out how close a match is. Do this with a POJO:

class Match {
   private PersonDO candidate;
   private double score; // 0 - definitely not, 1.0 - perfect match

   // Add candidate/score constructor here
   // Add getters for candidate/score here

   public static final Ordering<Match> SCORE_ORDER =
       new Ordering<Match>() {
     @Override
     public int compare(Match left, Match right) {
       return Doubles.compare(left.score, right.score);
     }
   };
}

Next, we create a method for scoring a generic name. We should score first and last names separately, because it reduces noise. For instance, we don't care if the firstname matches any part of the last name — unless your firstname might accidentally be in the lastname field or vice versa, which you should account for intentionally and not accidentally (we'll address this later).

Note that we no longer need a "max levenshtein distance". This is because we normalize them to length, and we will pick the closest match later. 15 character adds/edits/deletes seems very high, and since we've minimized the blank first/last name problem by scoring names separately, we could probably now pick a max of 3-4 if you wanted (scoring anything else as a 0).

// Typos on first letter are much more rare.  Max score 0.3
public static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;

public double scoreName(String searchName, String candidateName) {
  if (searchName.equals(candidateName)) return 1.0

  int editDistance = StringUtils.getLevenshteinDistance(
      searchName, candidateName);

  // Normalize for length:
  double score =
      (candidateName.length() - editDistance) / candidateName.length();

  // Artificially reduce the score if the first letters don't match
  if (searchName.charAt(0) != candidateName.charAt(0)) {
    score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
  }

  // Try Soundex or other matching here.  Remember that you don't want
  // to go above 1.0, so you may want to create a second score and
  // return the higher.

  return Math.max(0.0, Math.min(score, 1.0));
}

As noted above, you could plug-in third party or other word-matching algorithms and gain from the shared knowledge of all of them.

Now, we go through the whole list and score every name. Note that I've added a spot for "tweaks". Tweaks may include:

  • Reversal: If the PersonDO is "Benjamin Franklin", but the CSV sheet may contain "Franklin, Benjamin", then you will want to correct for reversed names. In this case you will probably want to add a method checkForReversal that would score the name in reverse and take that score if it is significantly higher. If it matched in reverse exactly, you would give it a 1.0 score.
  • Abbreviations: You may want to give the score a bonus bump if either first/last names matches identically and the other is fully contained in the candidate (or vice versa). This might indicate an abbreviation, like "Samantha/Sam".
  • Common nicknames: You could add a set of known nicknames ("Robert -> Bob, Rob, Bobby, Robby") and then score the search name against all of them and take the highest score. If it matches any of these, you would probably give it a 1.0 score.

As you can see building this as a series of API's gives us logical locations to easily tweak this to our heart's content.

On with the alogrithm:

public static final double MIN_SCORE = 0.3;

public List<Match> findMatches(String searchFirst, String searchLast) {
  List<Match> results = new ArrayList<Match>();

  // Keep in mind that this doesn't scale well.
  // With only 1000 names that's not even a concern a little bit, but
  // thinking ahead, here are two ideas if you need to:
  // - Keep a map of firstnames.  Each entry should be a map of last names.
  //   Then, only iterate through last names if the firstname score is high
  //   enough.
  // - Score each unique first or last name only once and cache the score.
  for(PersonDO person: personCache.values()) {
    // Some of my own ideas follow, you can tweak based on your
    // knowledge of the data)

    // No reason to deal with the combined name, that just makes things
    // more fuzzy (like your problem of too-high scores when one name
    // is completely missing).
    // So, score each name individually.

    double scoreFirst = scoreName(searchFirst, person.getFirstName());
    double scoreLast = scoreName(searchLast, person.getLastName());

    double score = (scoreFirst + scoreLast)/2.0;

    // Add tweaks or alternate scores here.  If you do alternates, in most
    // cases you'll probably want to take the highest, but you may want to
    // average them if it makes more sense.

    if (score > MIN_SCORE) {
      results.add(new Match(person, score));
    }
  }

  return ImmutableList.copyOf(results);
}

Now we modify your findClosestMatch to get just the highest one from all matches (throws NoSuchElementException if none in list).

Possible tweaks:

  • You may want to check if multiple names scored very close, and either report the runners-up (see below), or skip the row for manual choice later.
  • You may want to report how many other matches there were (if you have a very tight scoring algorithm).

Code:

public Match findClosestMatch(String searchFirst, String searchLast) {
  List<Match> matches = findMatch(searchFirst, searchLast);

  // Tweak here

  return Match.SCORE_ORDER.max(list);
}

.. and then modify our original getter:

public PersonDO findPersonDO(String searchFirst, String searchLast) {
  PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
  if (person == null) {
    Match match = findClosestMatch(searchFirst, searchLast);
    // Do something here, based on score.
    person = match.getCandidate();
  }
  return person;
}

4. Report "fuzziness" differently.

Finally, you'll notice that findClosestMatch doesn't just return a person, it returns a Match — This is so that we can modify the program to treat fuzzy matches differently from exact matches.

Some things you probably want to do with this:

  • Report guesses: Save all names that matched based on fuzziness into a list so that you can report those and they can be audited later.
  • Validate first: You may want to add a control to turn on and off whether it actually uses the fuzzy matches or just reports them so that you can massage the data before it comes in.
  • Data defenesiveness: You may want to qualify any edits made on a fuzzy match as "uncertain". For example, you could disallow any "major edits" to a Person record if the match was fuzzy.

Conclusion

As you can see it's not too much code to do this yourself. It's doubtful there is ever going to be a library that will predict names as well as you can knowing the data yourself.

Building this in pieces as I've done in the example above will allow you to iterate and tweak easily and even plug in third-party libraries to improve your scoring instead of depending on them entirely -- faults and all.