Real world typo statistics?

Tal Weiss · Aug 5, 2010 · Viewed 7.8k times · Source

Where can I find some real world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:

  1. typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.
  2. Spelling - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).
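As a rough illustration of that combination (a sketch only, using the jellyfish library; note jellyfish ships single Metaphone, which stands in for Double Metaphone here, and the dictionary is a toy one):

```python
import jellyfish

# Toy dictionary standing in for the real lists of destinations, dates, etc.
DICTIONARY = ["Chicago", "Hello", "Saturday", "Thursday", "Tuesday"]

def match(token, max_edit_distance=2):
    """Collect dictionary words that match by edit distance (typos) or by sound (spelling)."""
    token_key = jellyfish.metaphone(token)
    candidates = []
    for word in DICTIONARY:
        distance = jellyfish.damerau_levenshtein_distance(token.lower(), word.lower())
        if distance <= max_edit_distance:
            candidates.append((word, distance, "typo"))
        elif jellyfish.metaphone(word) == token_key:
            candidates.append((word, distance, "phonetic"))
    return sorted(candidates, key=lambda c: c[1])

print(match("Satudray", max_edit_distance=1))  # the transposition keeps it close to "Saturday"
print(match("Shikago", max_edit_distance=1))   # too far by edits; the phonetic branch is what can catch it
```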

I want to focus on Damerau-Levenshtein (or simply edit distance). The textbook implementations always use a weight of 1 for deletions, insertions, substitutions, and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / "real-world probabilities".

Examples:

  • I'm sure the likelihood of typing "Helllo" when meaning "Hello" is greater than that of "Helzlo", yet both are 1 edit distance away.
  • "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
  • Unicode transliterations: What is the "real" distance between "München" and "Munchen"?

What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?

Even Norvig's very cool spell corrector uses non-weighted edit distance.

BTW- I'm sure the weights need to be functions and not simple floats (per the above examples)...
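To make that concrete, here is a sketch of Damerau-Levenshtein where each operation's cost is a function of the characters involved. The QWERTY adjacency table is a tiny illustrative subset, and the cost values are exactly the kind of guesses I'd like to replace with learned ones:

```python
# Tiny illustrative subset of QWERTY key adjacency.
QWERTY_NEIGHBORS = {
    "g": set("ftyhvb"), "h": set("gyujnb"), "q": set("wa"),
    "e": set("wrsd"), "u": set("yihj"),
}

def sub_cost(a, b):
    """Substituting an adjacent key is cheaper than a random one."""
    if a == b:
        return 0.0
    return 0.5 if b in QWERTY_NEIGHBORS.get(a, set()) else 1.0

def ins_cost(c):
    return 1.0

def del_cost(c):
    return 1.0

def transpose_cost(a, b):
    return 0.7  # swapping adjacent characters is a common typo

def weighted_damerau_levenshtein(s, t):
    """Dynamic-programming edit distance with per-character cost functions."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(s[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j] + del_cost(s[i - 1]),                 # deletion
                d[i][j - 1] + ins_cost(t[j - 1]),                 # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),   # substitution
            )
            # Restricted transposition (optimal string alignment variant).
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                best = min(best, d[i - 2][j - 2] + transpose_cost(s[i - 2], s[i - 1]))
            d[i][j] = best
    return d[m][n]

print(weighted_damerau_levenshtein("hello", "gello"))  # 0.5: adjacent-key substitution
print(weighted_damerau_levenshtein("hello", "qello"))  # 1.0: distant-key substitution
```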

I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...

Should I just guess them?

EDIT - trying to answer user questions:

  • My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
  • I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
  • Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. when the input is 1 edit distance away from only 1 word in the dictionary. I have many other hand-tuned rules, e.g. accept a Double Metaphone fix only if it is no more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real-world input.
  • "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
  • Today, when there are 2 dictionary entries at the same distance from the input, I try to apply various statistics to better guess which one the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran) - see the sketch after this list.
  • The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
  • Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
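As an illustration of that tie-breaking step (the frequency table below is made up; real priors would come from my query logs):

```python
import jellyfish

# Assumed, hand-filled priors standing in for real search-frequency statistics.
SEARCH_FREQUENCY = {"Paris": 50000, "Pariz": 3, "Thursday": 9000, "Tuesday": 8000}

def best_candidate(token, candidates):
    """Rank candidates by edit distance first, then by prior frequency."""
    def key(word):
        distance = jellyfish.damerau_levenshtein_distance(token.lower(), word.lower())
        return (distance, -SEARCH_FREQUENCY.get(word, 0))
    return min(candidates, key=key)

print(best_candidate("Tursday", ["Thursday", "Tuesday"]))  # both are 1 edit away; the prior picks Thursday
```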

Answer

tszming · Aug 7, 2010

A possible source for real-world typo statistics would be Wikipedia's complete edit history.

http://download.wikimedia.org/

Also, you might be interested in AWB's RegExTypoFix:

http://en.wikipedia.org/wiki/Wikipedia:AWB/T
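Once you have (typo, correction) pairs mined from a source like that, deriving weights could be as simple as counting which single-character edits actually occur. A sketch, with toy pairs standing in for real mined data:

```python
from collections import Counter

# Toy (typo, correction) pairs; in practice these would be mined from the edit history.
PAIRS = [("helllo", "hello"), ("satudray", "saturday"), ("tursday", "thursday"),
         ("gello", "hello"), ("recieve", "receive")]

def classify_edit(typo, correct):
    """Identify the single edit separating the pair, if there is exactly one."""
    if len(typo) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(typo, correct)) if a != b]
        if len(diffs) == 1:
            i = diffs[0]
            return ("sub", (typo[i], correct[i]))
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typo[diffs[0]] == correct[diffs[1]]
                and typo[diffs[1]] == correct[diffs[0]]):
            return ("transpose", (correct[diffs[0]], correct[diffs[1]]))
    if len(typo) == len(correct) + 1:          # one extra character -> insertion
        for i in range(len(typo)):
            if typo[:i] + typo[i + 1:] == correct:
                return ("ins", typo[i])
    if len(typo) == len(correct) - 1:          # one missing character -> deletion
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == typo:
                return ("del", correct[i])
    return None

counts = Counter(e for e in (classify_edit(t, c) for t, c in PAIRS) if e)
# Convert counts to costs: the more often a confusion occurs, the cheaper the edit.
weights = {edit: 1.0 / (1.0 + count) for edit, count in counts.items()}
print(weights)
```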