I've got a stand-alone CSV data loading process that I coded in Java that has to use some fuzzy string matching. It's definitely not ideal, but I don't have much choice. I am matching using a first and last name and I cache all the possibilities at the start of a run. After finding the match, I then need that person object multiple places during the run. I used guava's Objects.hashCode()
to create a hash out of the firstname and lastname.
The caching mechanism looks like this:
Map<Integer,PersonDO> personCache = Maps.newHashMap();
for(PersonDO p: dao.getPeople()) {
personCache.put(Objects.hashCode(p.getFirstName(),p.getLastName()), p);
}
Most of the time I get hits on firstname+lastname, but when it misses I'm falling back using Apache's StringUtils.getLevenshteinDistance()
to try and match it. This is how the matching logic flow goes:
person = personCache.get(Objects.hashCode(firstNameFromCSV,lastNameFromCSV));
if(person == null) {//fallback to fuzzy matching
person = findClosetMatch(firstNameFromCSV+lastNameFromCSV);
}
This is the findClosetMatch()
method:
private PersonDO findClosetMatch(String name) {
int min = 15;//initial value
int testVal=0;
PersonDO matchedPerson = null;
for(PersonDO person: personCache.values()) {
testVal = StringUtils.getLevenshteinDistance(name,person.getFirstName()+person.getLastName());
if( testVal < min ) {
min = testVal;
matchedPerson = person;
}
}
if(matchedPerson == null) {
throw new Exception("Unable to find person: " + name)
}
return matchedPerson;
}
This works fine with simple spelling errors, typos and shortened names(ie Mike->Michael), but when I'm completely missing one of the incoming names in the cache I end up returning a false positive match. To help keep this from happening I set the min value in findClosetMatch()
to 15(ie no more than 15 chars off); it works most of the time but I've still had a few mis-matches occur: Mike Thompson
hits on Mike Thomas
etc.
Outside of figuring out a way to get a primary key into the file being loaded, does anyone see a way to improve this process? Any other matching algorithms that could help here?
As I look at this problem I notice a couple key facts to base some improvements on:
I've interpreted your problem as requiring the following two things:
PersonDO
objects that you want to look up via a key that is based on the name. It sounds like you want to do this because you need a pre-existing PersonDO
of which one exists per unique name, and the same name may come up more than once in your loop/workflow.PersonDO
(in other words, a person's unique identifier is their name, which is obviously not the case in real life, but seems to work for you here).Next, let's look at making some improvements on your code:
1. Cleanup: unnecessary hashcode manipulation.
You don't need to generate hash codes yourself. This confuses the issue a bit.
You're simply generating a hashcode for the combination of the firstname + lastname. This is exactly what HashMap
would do if you gave it the concatenated string as a key. So, just do that (and add a space, just in case we want to reverse parse out first/last from the key later).
Map<String, PersonDO> personCache = Maps.newHashMap();
public String getPersonKey(String first, String last) {
return first + " " + last;
}
...
// Initialization code
for(PersonDO p: dao.getPeople()) {
personCache.put(getPersonKey(p.getFirstName(), p.getLastName()), p);
}
2. Cleanup: Build a retrieval function to perform the lookup.
Since we've changed the key in the map we need to change the lookup function. We'll build this like a mini-API. If we always knew the key exactly (i.e. unique IDs) we would of course just use Map.get
. So we'll start with that, but since we know we will need to add fuzzy matching we'll add a wrapper where this can happen:
public PersonDO findPersonDO(String searchFirst, String searchLast) {
return personCache.get(getPersonKey(searchFirst, searchLast));
}
3. Build a fuzzy matching algorithm yourself using scoring.
Note that since you are using Guava, I've used a few conveniences here (Ordering
, ImmutableList
, Doubles
, etc.).
First, we want to preserve the work we do to figure out how close a match is. Do this with a POJO:
class Match {
private PersonDO candidate;
private double score; // 0 - definitely not, 1.0 - perfect match
// Add candidate/score constructor here
// Add getters for candidate/score here
public static final Ordering<Match> SCORE_ORDER =
new Ordering<Match>() {
@Override
public int compare(Match left, Match right) {
return Doubles.compare(left.score, right.score);
}
};
}
Next, we create a method for scoring a generic name. We should score first and last names separately, because it reduces noise. For instance, we don't care if the firstname matches any part of the last name — unless your firstname might accidentally be in the lastname field or vice versa, which you should account for intentionally and not accidentally (we'll address this later).
Note that we no longer need a "max levenshtein distance". This is because we normalize them to length, and we will pick the closest match later. 15 character adds/edits/deletes seems very high, and since we've minimized the blank first/last name problem by scoring names separately, we could probably now pick a max of 3-4 if you wanted (scoring anything else as a 0).
// Typos on first letter are much more rare. Max score 0.3
public static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;
public double scoreName(String searchName, String candidateName) {
if (searchName.equals(candidateName)) return 1.0
int editDistance = StringUtils.getLevenshteinDistance(
searchName, candidateName);
// Normalize for length:
double score =
(candidateName.length() - editDistance) / candidateName.length();
// Artificially reduce the score if the first letters don't match
if (searchName.charAt(0) != candidateName.charAt(0)) {
score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
}
// Try Soundex or other matching here. Remember that you don't want
// to go above 1.0, so you may want to create a second score and
// return the higher.
return Math.max(0.0, Math.min(score, 1.0));
}
As noted above, you could plug-in third party or other word-matching algorithms and gain from the shared knowledge of all of them.
Now, we go through the whole list and score every name. Note that I've added a spot for "tweaks". Tweaks may include:
checkForReversal
that would score the name in reverse and take that score if it is significantly higher. If it matched in reverse exactly, you would give it a 1.0 score.As you can see building this as a series of API's gives us logical locations to easily tweak this to our heart's content.
On with the alogrithm:
public static final double MIN_SCORE = 0.3;
public List<Match> findMatches(String searchFirst, String searchLast) {
List<Match> results = new ArrayList<Match>();
// Keep in mind that this doesn't scale well.
// With only 1000 names that's not even a concern a little bit, but
// thinking ahead, here are two ideas if you need to:
// - Keep a map of firstnames. Each entry should be a map of last names.
// Then, only iterate through last names if the firstname score is high
// enough.
// - Score each unique first or last name only once and cache the score.
for(PersonDO person: personCache.values()) {
// Some of my own ideas follow, you can tweak based on your
// knowledge of the data)
// No reason to deal with the combined name, that just makes things
// more fuzzy (like your problem of too-high scores when one name
// is completely missing).
// So, score each name individually.
double scoreFirst = scoreName(searchFirst, person.getFirstName());
double scoreLast = scoreName(searchLast, person.getLastName());
double score = (scoreFirst + scoreLast)/2.0;
// Add tweaks or alternate scores here. If you do alternates, in most
// cases you'll probably want to take the highest, but you may want to
// average them if it makes more sense.
if (score > MIN_SCORE) {
results.add(new Match(person, score));
}
}
return ImmutableList.copyOf(results);
}
Now we modify your findClosestMatch to get just the highest one from all matches (throws NoSuchElementException
if none in list).
Possible tweaks:
Code:
public Match findClosestMatch(String searchFirst, String searchLast) {
List<Match> matches = findMatch(searchFirst, searchLast);
// Tweak here
return Match.SCORE_ORDER.max(list);
}
.. and then modify our original getter:
public PersonDO findPersonDO(String searchFirst, String searchLast) {
PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
if (person == null) {
Match match = findClosestMatch(searchFirst, searchLast);
// Do something here, based on score.
person = match.getCandidate();
}
return person;
}
4. Report "fuzziness" differently.
Finally, you'll notice that findClosestMatch
doesn't just return a person, it returns a Match
— This is so that we can modify the program to treat fuzzy matches differently from exact matches.
Some things you probably want to do with this:
As you can see it's not too much code to do this yourself. It's doubtful there is ever going to be a library that will predict names as well as you can knowing the data yourself.
Building this in pieces as I've done in the example above will allow you to iterate and tweak easily and even plug in third-party libraries to improve your scoring instead of depending on them entirely -- faults and all.