I have addresses in two separate tables that differ slightly, and I need to be able to match them. For example, the same address can be entered in multiple ways:
Although this example is simple, you can imagine the situation in more complex scenarios. I am trying to develop a simple algorithm that reduces the above addresses to the same key so they can be matched.
For example, the key might be "11TEST": the first two characters of 110, the first two of Test, and the first two of the street variant. A full match key would also include the first 5 digits of the ZIP code, so in the above example the full key might look like "11TEST44680".
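To make the idea concrete, here is a rough sketch of the key I have in mind, written in Python. The function name `match_key` and the small `SUFFIXES` table are placeholders I made up for illustration; a real suffix table would be far larger, and this version assumes the address starts with a house number and ends with the street type (no unit or apartment after it).

```python
import re

# Hypothetical, deliberately tiny table mapping street-type variants to one
# canonical abbreviation. A real table would need many more entries.
SUFFIXES = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AV", "AVE": "AV", "AV": "AV",
    "ROAD": "RD", "RD": "RD",
    "DRIVE": "DR", "DR": "DR",
}

def match_key(address, zipcode=None):
    """Build a short match key such as '11TEST' or '11TEST44680'.

    Returns None when the address does not start with a house number.
    """
    # Upper-case, turn punctuation into spaces, and split into tokens.
    tokens = re.sub(r"[^A-Z0-9 ]", " ", address.upper()).split()
    if len(tokens) < 2 or not tokens[0][0].isdigit():
        return None

    number = tokens[0]
    rest = tokens[1:]

    # Treat the last token as the street type only if it is a known variant
    # and at least one token remains for the street name.
    suffix = ""
    if len(rest) > 1 and rest[-1] in SUFFIXES:
        suffix = SUFFIXES[rest.pop()]

    street = "".join(rest)
    key = number[:2] + street[:2] + suffix[:2]

    if zipcode is not None:
        # Keep only digits so that ZIP+4 values reduce to the first 5 digits.
        key += re.sub(r"\D", "", str(zipcode))[:5]
    return key
```

With this sketch, `match_key("110 Test St.")` and `match_key("110 Test Street")` both give `"11TEST"`, and `match_key("110 Test St", "44680-1234")` gives `"11TEST44680"`. Such a short key will also produce false matches (for example, 110 and 115 share the same first two digits), so it is probably better used to pick candidates than to confirm a match.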
I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.
We are only concerned with US addresses. In fact, we are only looking at addresses from 250 ZIP codes in Ohio and Michigan. We also do not have access to any postal software, although we would be open to ideas for cost-effective solutions (it would essentially be a one-time use). Please keep in mind that this is an initial dump of data from a government source. Suggestions on how users can clean it are helpful as I build out the application, but I would love to have the best starting point I possibly can by matching addresses as well as possible.
I'm working on a similar algorithm as we speak; it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a three-field plaintext format [whoever thought that was a good idea should be shot IMHO], so handling rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, and spelling mistakes is no small or simple task.
Spelling mistakes alone were no small feat - especially in countries that use French names. Matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes, with or without a period or hyphen, to the larger part of a name causes no end of performance issues - especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?
One place to start your search is the Levenshtein distance algorithm, which I've found really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.
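For reference, here is a minimal sketch of Levenshtein distance in Python, using the standard two-row dynamic programming formulation. It is not the code from my own project, and the function name `levenshtein` is just an illustrative choice. The result is the number of single-character insertions, deletions and substitutions needed to turn one string into the other.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3. One way to use it for addresses is to treat two street names as the same when the distance is small relative to their length, though the right threshold is something you would have to tune against your own data.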
I would be really interested in collaborating with anyone who is currently developing tools to do this; perhaps we can help each other toward a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, and having someone else working on the same problem to bounce ideas off would be really helpful.
Cheers - [ben at afsinc dot ca]