Algorithm using soundex() or metaphone() to create Mad Gab style phrases

Jason McCreary picture Jason McCreary · Mar 19, 2012 · Viewed 7.1k times · Source

I'm attempting to create an algorithm that will suggest Mad Gab style phrases.

The input is a set of phrases. I also have a set of keywords that I'd like to use when possible. Currently, my solution is simply brute force:

  • loop over phrases (character by character)
    • if keyword is found
      • store keyword and branch (recursion)
    • increment character count

However, the problems I am running into are:

  • Account for compound keywords, e.g. "catches" can be "catches", "cat" + "cheeses"
  • Allow literal terms - "the", "and", "one", "two", "three".
  • How to suggest terms that are not keywords. i.e. fall back on something like the system dictionary when keywords or literals can not be found.
  • Skip phrase segments. Right now it just does one pass through. But consider the case where the phrase starts with something unmatched but a few characters later contains matches.

I am most familiar with PHP and MySQL. However, I am open to another technology if it provides a better solution.

I am also interested in any additional suggestions. Particularly ways to use the second parameter of metaphone() to make harder suggestions.

Answer

Jonathan Barlow picture Jonathan Barlow · Mar 28, 2012

Perhaps start with a syllable division algorithm on the phrase bank. You can use even a simple resource that teaches children to divide syllables to create your rough divider method:

http://www.ewsdonline.org/education/components/scrapbook/default.php?sectiondetailid=7584

If you want a more technical, completely accurate way, there was a Ph.D. dissertation about how to do it:

http://www.tug.org/docs/liang/

Then turn each syllable into a phonetic representation using either something you roll yourself or metaphone(). You can use a similar site that explains vowel sound rules. These will only be generalizations. You will process vowels separately from consonants if you roll your own. Metaphone just uses consonants, which is fine, but not as cool as if you also took into account vowels.

Vowels: http://www.eslgold.com/pronunciation/english_vowel_sounds.html Consonants: http://usefulenglish.ru/phonetics/english-consonant-sounds

Then, you have a dictionary of English words for your word bank. There are many open-source dictionaries available that you could stick into a MySQL table.

Start with the first syllable and look for a random word in the dictionary that matches the soundex test. If you can't find one (this will generally only find one syllable words) add the additional syllable and search again.

Example:

"Logical consequence"

A. Syllable split

"lo gi cal con se quence"

B. Vowel Sounds applied

"lah gee cahl con see quince"

C. Consonant Sounds applied

"lah jee kahl kon see quinse"

D. Soundtext test (one syllable soundex -obviously too easy to guess, but it proves the concept)

"Law Gee Call Con Sea Quints"

Soundex strcmp's return a number. So if you like, you could get the soundex values of everything in your word bank in advance. Then you can quickly run the strcmp.

An example of a Soundex MySQL comparison is:

select strcmp(soundex('lah'), soundex('law'));

I think using the MySQL soundex is easier for you than the PHP soundex test if you're wanting a random result from a big database and you've already captured the soundex value in a field in your dictionary table.

My suggestion may be inefficient, but optimization is a different question.

Update:

I didn't mean to imply that my solution would only yield one syllable words. I used one syllable as the example, but if you took two of the syllables together, you'd get multi-syllable matches. In fact, you could probably just start by jamming all the syllables together and running soundex in mysql. If you find an answer, great. But then you can roll off syllables until you get the longest match you can. Then you're left with the end of the phrase and can take those together and run a match. I think that's the essence of the solution below from the other contributor, but I think you need to avoid jamming all the letters together without spaces. In English, you'd lose information that way. Think of a phrase beginning with a "th" sound. If you jam the phrase together, you lose which "th" sound is needed. "Theremin" (the instrument) has a different "th" sound than "There, a man".