I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other.
I was looking at using Levenshtein distance to give me a number measuring similarity, and treating pairs below a certain threshold as duplicates.
However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and Levenshtein would consider those strings to be very different. So then I was thinking I could convert all strings to a 'standard address form', e.g. 'South' becomes 's', 'Fifth' becomes '5th', etc.
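For example, PHP's built-in levenshtein() puts those two spellings miles apart:

    <?php
    // Same address, two different spellings -- the raw edit distance is huge.
    echo levenshtein('S 5th St', 'South Fifth Street'); // prints 11, way above any sane "duplicate" threshold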
Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it?
I'm working with PHP/MySQL, so I have the limitations inherent in that stack.
I think your second idea is better than using Levenshtein distance. If you compare addresses for similarity, two different people who live near each other might accidentally "cheat" one another out of their prize. If I live at "S. 4th St." but my neighbor at "S. 5th St." has already signed up, those two addresses would look far too similar by Levenshtein distance.
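To put a number on that, PHP's built-in levenshtein() rates the two neighbors as nearly identical:

    <?php
    // Two genuinely different addresses that differ by a single character.
    echo levenshtein('S. 4th St.', 'S. 5th St.'); // prints 1 -- any similarity threshold would merge these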
You could reduce (but probably not eliminate) a lot of potential cheating by running addresses through a synonym normalizer. Before you check for equality, just convert
North -> N.
East -> E.
...
First -> 1st
Second -> 2nd
Third -> 3rd
...
Street -> St.
Avenue -> Ave.
The longer the list of synonyms you come up with, the better it will be at catching matches. It will be a little bit slower at processing, but addresses are tiny.
This is similar to converting strings to all lower (or upper) case before comparing them. (Which I also recommend, naturally.)
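Here's a rough sketch of what that could look like in PHP. normalize_address() is just an illustrative name, and the synonym map would obviously need to be much longer in practice:

    <?php
    // Rough sketch of a synonym normalizer -- the map below is nowhere near complete.
    function normalize_address($address)
    {
        // Lowercase first so the synonym map only needs one casing.
        $address = strtolower($address);

        // Turn punctuation ("S." vs "S") into spaces.
        $address = preg_replace('/[^a-z0-9\s]/', ' ', $address);

        $synonyms = array(
            'north'  => 'n',
            'south'  => 's',
            'east'   => 'e',
            'west'   => 'w',
            'first'  => '1st',
            'second' => '2nd',
            'third'  => '3rd',
            'fourth' => '4th',
            'fifth'  => '5th',
            'street' => 'st',
            'avenue' => 'ave',
        );

        // Replace whole words only, so e.g. "Northfield" is left alone.
        foreach ($synonyms as $long => $short) {
            $address = preg_replace('/\b' . $long . '\b/', $short, $address);
        }

        // Collapse the runs of whitespace left over from the punctuation strip.
        return trim(preg_replace('/\s+/', ' ', $address));
    }

    // Both spellings collapse to the same canonical form, "s 5th st":
    var_dump(normalize_address('South Fifth Street') === normalize_address('S. 5th St.')); // bool(true)

Then you can store the normalized form in its own column in MySQL and make the duplicate check a plain equality test (or a unique index) instead of a fuzzy comparison.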