How to read values from numbers written as words?

Evgeny picture Evgeny · Sep 16, 2008 · Viewed 11.5k times · Source

As we all know numbers can be written either in numerics, or called by their names. While there are a lot of examples to be found that convert 123 into one hundred twenty three, I could not find good examples of how to convert it the other way around.

Some of the caveats:

  1. cardinal/nominal or ordinal: "one" and "first"
  2. common spelling mistakes: "forty"/"fourty"
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
  5. colloquialisms: "thirty-something"
  6. fractions: 'one third', 'two fifths'
  7. common names: 'a dozen', 'half'

And there are probably more caveats possible that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.

What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?

PS: My final parser should actually understand 3 different languages, English, Russian and Hebrew. And maybe at a later stage more languages will be added. Hebrew also has male/female numbers, like "one man" and "one woman" have a different "one" — "ehad" and "ahat". Russian also has some of its own complexities.

Google does a great job at this. For example:

http://www.google.com/search?q=two+thousand+and+one+hundred+plus+five+dozen+and+four+fifths+in+decimal

(the reverse is also possible http://www.google.com/search?q=999999999999+in+english)

Answer

MarkusQ picture MarkusQ · Mar 17, 2009

I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.

Working with English for example, you need a dictionary that maps words to values in the obvious way:

"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000

...and so forth

The algorithm is just:

total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null:       v
        when prior > v:     prior+v
        else                prior*v
        else
    if w in {thousand,million,billion,trillion...}
        total <- total + prior
        prior <- null
total = total + prior unless prior is null

For example, this progresses as follows:

total    prior      v     unconsumed string
    0      _              four score and seven 
                    4     score and seven 
    0      4              
                   20     and seven 
    0     80      
                    _     seven 
    0     80      
                    7 
    0     87      
   87

total    prior      v     unconsumed string
    0        _            two million four hundred twelve thousand eight hundred seven
                    2     million four hundred twelve thousand eight hundred seven
    0        2
                  1000000 four hundred twelve thousand eight hundred seven
2000000      _
                    4     hundred twelve thousand eight hundred seven
2000000      4
                    100   twelve thousand eight hundred seven
2000000    400
                    12    thousand eight hundred seven
2000000    412
                    1000  eight hundred seven
2000000  412000
                    1000  eight hundred seven
2412000     _
                      8   hundred seven
2412000     8
                     100  seven
2412000   800
                     7
2412000   807
2412807

And so on. I'm not saying it's perfect, but for a quick and dirty it does quite well.


Addressing your specific list on edit:

  1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
  2. english/british: "fourty"/"forty" -- ditto
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
  5. colloqialisms: "thirty-something" -- works
  6. fragments: 'one third', 'two fifths' -- uh, not yet...
  7. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"

Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.