Well, I was too late on the answer for this question, but I was working a little test scenario that seems to have worked very well for me. I used a (simple, but ugly, and large) regular expression to locate all the words for me. The expression is as follows:
(?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
(?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
(?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
(?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
(?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
(?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
(?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
(?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
(?:billion|billionth)))
Shown here with line breaks for formatting purposes..
Anyways, my method was to execute this RegEx with a library like PCRE, and then read back the named matches. And it worked on all of the different examples listed in this question, minus the "One Half", types, as I didn't add them in, but as you can see, it wouldn't be hard to do so. This addresses a lot of issues. For example, it addresses the following items in the original question and other answers:
- cardinal/nominal or ordinal: "one" and "first"
- common spelling mistakes: "forty"/"fourty" (Note that it does not EXPLICITLY address this, that would be something you'd want to do before you passed the string to this parser. This parser sees this example as "FOUR"...)
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
- colloqialisms: "thirty-something" (This also is not TOTALLY addressed, as what IS "something"? Well, this code finds this number as simply "30").**
Now, rather than store this monster of a regular expression in your source, I was considering building this RegEx at runtime, using something like the following:
char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
"thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "", "", "", "", "twelfth" };
char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth" };
and so on...
The easy part here is we are only storing the words that matter. In the case of SIXTH, you'll notice that there isn't an entry for it, because it's just it's normal number with TH tacked on... But ones like TWELVE need different attention.
Ok, so now we have the code to build our (ugly) RegEx, now we just execute it on our number strings.
One thing I would recommend, is to filter, or eat the word "AND". It's not necessary, and only leads to other issues.
So, what you are going to want to do is setup a function that passes the named matches for "Magnitude" into a function that looks at all the possible magnitude values, and multiplies your current result by that value of magnitude. Then, you create a function that looks at the "Value" named matches, and returns an int (or whatever you are using), based on the value discovered there.
All VALUE matches are ADDED to your result, while magnitutde matches multiply the result by the mag value. So, Two Hundred Fifty Thousand becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250000...
Just for fun, I wrote a vbScript version of this and it worked great with all the examples provided. Now, it doesn't support named matches, so I had to work a little harder getting the correct result, but I got it. Bottom line is, if it's a "VALUE" match, add it your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1000000, 1000000000, etc... This will provide you with some pretty amazing results, and all you have to do to adjust for things like "one half" is add them to your RegEx, put in a code marker for them, and handle them.
Well, I hope this post helps SOMEONE out there. If anyone want, I can post by vbScript pseudo code that I used to test this with, however, it's not pretty code, and NOT production code.
If I may.. What is the final language this will be written in? C++, or something like a scripted language? Greg Hewgill's source will go a long way in helping understand how all of this comes together.
Let me know if I can be of any other help. Sorry, I only know English/American, so I can't help you with the other languages.