How do spell checkers work?

dicroce · Dec 6, 2008

I need to implement a spell checker in C. Basically, I need all the standard operations: I need to be able to spell check a block of text, make word suggestions, and dynamically add new words to the index.

I'd kind of like to write this myself, though I really don't know where to begin.

Answer

e.James · Dec 6, 2008

Read up on Tree Traversal. The basic concept is as follows:

  1. Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
  2. Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter of the English alphabet), plus a flag to indicate whether or not the current node is the end of a valid word.
  3. Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
  4. If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and testing each resulting candidate against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
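Steps 2 and 3 can be sketched in C roughly as follows. This is a minimal sketch rather than a tuned implementation: the names `TrieNode`, `trie_new_node`, `trie_insert`, `trie_contains` and `trie_free` are made up for illustration, and it assumes ASCII words containing only the letters a-z (case is folded, anything else is rejected). Since `trie_insert` can be called at any time, the same function should also cover adding new words dynamically.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

/* One node per letter position (illustrative names, not a standard API). */
typedef struct TrieNode {
    struct TrieNode *child[ALPHABET_SIZE]; /* one link per letter a-z */
    bool is_word;                          /* "valid end of word" flag */
} TrieNode;

/* Allocate an empty node. Relies on calloc's zero fill giving NULL links
 * and a false flag, which holds on mainstream platforms. */
TrieNode *trie_new_node(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Map an ASCII letter to 0..25, or return -1 for any other character. */
static int letter_index(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    return -1;
}

/* Add a word. Returns false for empty or non-alphabetic input,
 * or if an allocation fails. */
bool trie_insert(TrieNode *root, const char *word) {
    if (*word == '\0') return false;
    TrieNode *node = root;
    for (const char *p = word; *p != '\0'; ++p) {
        int i = letter_index(*p);
        if (i < 0) return false;
        if (node->child[i] == NULL) {
            node->child[i] = trie_new_node();
            if (node->child[i] == NULL) return false;
        }
        node = node->child[i];
    }
    node->is_word = true;
    return true;
}

/* Follow the word letter by letter. The lookup fails if a link is
 * missing, or if the final node is not flagged as the end of a word. */
bool trie_contains(const TrieNode *root, const char *word) {
    const TrieNode *node = root;
    for (const char *p = word; *p != '\0'; ++p) {
        int i = letter_index(*p);
        if (i < 0 || node->child[i] == NULL) return false;
        node = node->child[i];
    }
    return node->is_word;
}

/* Release a node and everything below it. */
void trie_free(TrieNode *node) {
    if (node == NULL) return;
    for (int i = 0; i < ALPHABET_SIZE; ++i) trie_free(node->child[i]);
    free(node);
}
```

Lookup cost depends only on the length of the word being checked, not on the size of the dictionary, which is the main reason to prefer this over scanning a word list. The trade-off is memory: every node carries 26 pointers even when most are unused.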

A really short example:

Dictionary:

apex apple appoint appointed

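update: Thank you to Curt Sampson for pointing out the connection to the Patricia Tree. Strictly speaking, with one letter per node as described here, this data structure is a plain trie (prefix tree); a Patricia (radix) tree is the compressed variant that merges chains of single-child nodes.

Tree: (* indicates valid end of word)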

A -> P -> E -> X*
      \-> P -> L -> E*
           \-> O -> I -> N -> T* -> E -> D*

Document:

apple appint ape

Results:

  • "apple" will be found in the tree, so it is considered correct.
  • "appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
  • "ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.
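Getting the words out of the document in the first place (the loop in step 3) could look something like this. It is only a sketch: `next_word` is a hypothetical helper, it treats any run of ASCII letters as a word, and it ignores apostrophes, hyphens and non-English letters, which a real checker would need to handle.

```c
#include <stdbool.h>
#include <stddef.h>

/* True for ASCII letters only (an assumption of this sketch). */
static bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * Copy the next run of letters from *cursor into buf and advance *cursor
 * past it. The copy is NUL-terminated and truncated to cap - 1 characters.
 * Returns false when no more words remain (or when cap is 0).
 */
bool next_word(const char **cursor, char *buf, size_t cap) {
    const char *p = *cursor;

    /* Skip spaces, punctuation, digits and anything else between words. */
    while (*p != '\0' && !is_letter(*p)) ++p;
    if (*p == '\0' || cap == 0) {
        *cursor = p;
        return false;
    }

    /* Consume the whole word even if it does not fit in buf. */
    size_t n = 0;
    while (is_letter(*p)) {
        if (n + 1 < cap) buf[n++] = *p;
        ++p;
    }
    buf[n] = '\0';
    *cursor = p;
    return true;
}
```

Called in a loop over the example document, this would yield "apple", "appint" and "ape" in turn, and each of those can then be checked against the tree.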

edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of single-character changes (insertions, deletions or substitutions) needed to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
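For illustration, here is one way the distance could be computed in C, using the standard dynamic-programming approach with a single row of storage. The function names `levenshtein` and `closest_word` are my own, and `closest_word` simply scans a small array of candidates, which is an assumption made to keep the example short; a real checker would want to avoid comparing against the entire dictionary.

```c
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c) {
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between a and b.
 * Returns (size_t)-1 if memory cannot be allocated. */
size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL) return (size_t)-1;

    /* Turning the empty string into the first j letters of b takes j inserts. */
    for (size_t j = 0; j <= lb; ++j) row[j] = j;

    for (size_t i = 1; i <= la; ++i) {
        size_t diag = row[0]; /* previous row, previous column */
        row[0] = i;           /* deleting the first i letters of a */
        for (size_t j = 1; j <= lb; ++j) {
            size_t above = row[j]; /* previous row, same column */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = min3(above + 1,      /* delete a[i-1] */
                          row[j - 1] + 1, /* insert b[j-1] */
                          diag + cost);   /* substitute, or keep a match */
            diag = above;
        }
    }

    size_t result = row[lb];
    free(row);
    return result;
}

/* Return the candidate with the smallest distance to word.
 * Ties go to the earliest entry; returns NULL if count is 0. */
const char *closest_word(const char *word,
                         const char *const *candidates, size_t count) {
    const char *best = NULL;
    size_t best_dist = (size_t)-1;
    for (size_t k = 0; k < count; ++k) {
        size_t d = levenshtein(word, candidates[k]);
        if (d < best_dist) {
            best_dist = d;
            best = candidates[k];
        }
    }
    return best;
}
```

With the four-word example dictionary, "appint" is one insertion away from "appoint" and "ape" is one insertion away from "apex", so those would be the top suggestions.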