Is there a Java parser that can parse addresses like this?

Dave · Apr 13, 2012 · Viewed 14.6k times

I'm using Java 6. I'm looking for an automated way to parse addresses. I'm not concerned with whether the addresses exist. The best thing I have found is JGeocoder (v 0.4.1), but JGeocoder is unable to parse addresses like this:

16th Street Theater, Berwyn Cultural Center,  6420 16th St.

Does anyone know of a free Java address parser that is up to the challenge? By "parse" I mean the ability to distinguish street, city, state, postal code, and potentially the venue name (the above venue name is "16th Street Theater, Berwyn Cultural Center").

Answer

Matt · Apr 13, 2012

Update: This topic is more exhaustively covered in this StackOverflow question.


I work for SmartyStreets where we parse and process addresses, and we have an answer. This is what we call "SLAP" or Single-Line Address Parsing (or Processing). The formal term is Named Entity Recognition (NER).

I'm not an expert on Java libraries, but I do know that any in-house implementations will not live up to expectations. Here are some common reasons the people I've helped have had difficulty:

  • Google / Yahoo! / Bing Maps web services do not allow automated queries and do not verify accuracy of the parsed address.

  • In-house code can only make a best guess without any knowledge of existing addresses (a database) or other official sources. I know you want a library that can do this in-house, but at best you can make a guess...

  • By the way, regular expressions are not the answer. The best regex I've seen to parse addresses was dynamically generated over hundreds of lines of code and several classes. It was a mess, and was only correct for types of addresses you'd expect, not all the valid (US) formats there actually are.
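To make the "best guess" point concrete, here is a minimal sketch of the kind of in-house heuristic described above, written to stay Java 6 compatible. Everything in it is an assumption for illustration: the class name `NaiveAddressSplitter`, the `split` method, and the rule itself (the first comma-separated segment that starts with digits followed by whitespace is the street line, and everything before it is the venue). It is not how JGeocoder or any address service works, and it ignores city, state, and postal code entirely.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of JGeocoder or any library.
class NaiveAddressSplitter {

    // Assumption: a street line begins with a house number (digits) followed by whitespace.
    // "6420 16th St." matches; "16th Street Theater" does not, because "16" is followed by "th".
    private static final Pattern STREET_LINE = Pattern.compile("^\\d+\\s+\\S.*$");

    /**
     * Returns a two-element array {venue, street}.
     * Either element is an empty string when the heuristic finds nothing.
     * Segments after the street line (city, state, ZIP) are ignored by this sketch.
     */
    static String[] split(String input) {
        String[] parts = input.split(",");
        StringBuilder venue = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.length() == 0) {
                continue; // skip empty segments such as those left by ",,"
            }
            if (STREET_LINE.matcher(part).matches()) {
                // First segment that looks like "<number> <something>" is guessed to be the street.
                return new String[] { venue.toString(), part };
            }
            if (venue.length() > 0) {
                venue.append(", ");
            }
            venue.append(part);
        }
        // No segment looked like a street line.
        return new String[] { venue.toString(), "" };
    }

    public static void main(String[] args) {
        String[] r = split("16th Street Theater, Berwyn Cultural Center,  6420 16th St.");
        System.out.println("venue:  " + r[0]);
        System.out.println("street: " + r[1]);
    }
}
```

On the address from the question this guess happens to work: the venue comes out as "16th Street Theater, Berwyn Cultural Center" and the street as "6420 16th St.". It breaks as soon as the input leaves the expected shape. A made-up venue such as "9 Lives Saloon, 123 Main St" is split with "9 Lives Saloon" as the street and no venue at all, because the venue name itself looks like a house number followed by a street. Without a reference database there is no way for the code to tell the difference.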

This is an incredibly complex task... unless you have the right tools. One of our services is called LiveAddress API, and it's similar to Google Maps in that it parses addresses and geocodes them, but goes a step further by being CASS-Certified and returning only valid addresses, almost no matter the input format.

I encourage you to do some research of your own, but this is probably the most effective and reliable method.