How does the Google geocoder work?

rburhum picture rburhum · Jun 12, 2012 · Viewed 7.1k times · Source

I am curious as to how the Google geocoder works.

I have been studying some implementations of open source geocoders like geocommons' geocoder or PostGIS's new Tiger Geocoder. This is roughly what I know so far (to hopefully prove that I have been doing my homework) :

I realize that at the core of the open source geocoders, there are three main elements.

1.- An address normalizer that takes an arbitrary string and normalizes it (taking the example from here):

normalize_address('address string');

e.g.: SELECT naddy.* FROM normalize_address('29645 7th Street SW Federal Way 98023') AS naddy;

 address | predirabbrev |      streetname       | streettypeabbrev | postdirabbrev | internal | location | stateabbrev |  zip  | parsed
 ---------+-------------+-----------------------+------------------+---------------+----------+----------+-------------+-------+--------
 29645 |               | 7th Street SW Federal  | Way              |               |          |          |             | 98023 | 

and:

2.- A geocoder that does some magical fuzzy matching for names where the core algorithm is the Levenshtein Distance.

A good example is the one from the Wikipedia article where it calculates the Levenshtein distance between the words kitten and sitting (the distance is 3 since that is the number of edits required to change one string into the other):

kitten → sitten (substitution of 's' for 'k')
sitten → sittin (substitution of 'i' for 'e')
sittin → sitting (insertion of 'g' at the end).

3.- Some interpolation of the street segments at the end to guess where the house is. I downloaded a chunk of the free Census Tiger street dataset to create this example.

street interpolation example

In the example above, the street segment of interest (Schaeffer Hills Dr) has a from node that starts at 300 (so 300 Schaeffer Hills Dr) and a to node that ends on 400 (400 Schaeffer Hills Drv). If I matched to this Schaeffer Hills Drv, and the request was for street 310, then the algorithm would just interpolate to it (traverse 10% of it) to where my green arrow is.

This is what the Open Source geocoder tools do. Nevertheless, Google is clearly smarter than that and uses all kinds of non-traditional hints.

How so?

For example, I can type 680 Mission st (no city, state, county, anything at all). Most of the standard address normalizers would blow up because they would find too many matches. But since I am in SF, I am guessing google uses my ip to get some geoip-like information, does some expanding bounding as a hint with some fuzzy search, and right away finds the closest segment that matches and tells me that's my answer (which is correct!).

I am looking for answers that can shed some more light into how the Google geocoder works besides the techniques that I described above.

Update:

OK, so far we have two kinds of hints listed

  • Geoip as hints
  • Area of Interest Bounding Box (see Paul's example).
  • Others?

Answer

Paul Ramsey picture Paul Ramsey · Jun 12, 2012

One of the things you can find by poking at the black box is that the Google geocoder isn't totally sensitive to the order of the tokens (there's no enforced street/city/state/country expectation, though it does better when you do follow that). Which says to me that they might be dumping everything into some kind of full text search and then seeing what comes back. Or perhaps not. Try searching "sault saint marie adams 200" and "sault saint marie 200 adams".

With respect to your Mission example, that's a great one, as you can see the map hint coming into play directly:

Query with map window over Europe: European results.

Query with map window over Europe: European results

Query with map window over North America: American results.

Query with map window over North America: American results