views:

308

answers:

4

I'm trying to create a regex to recognize English numerals, such as one, nineteen, twenty, one hundred and twenty two, et cetera, all the way to the millions. I want to reuse some parts of the regular expression, so the regex is being constructed by parts, like so:

// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|for|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera

I was wondering if anyone has already done the same (and would like to share), as these regexes are quite long and it's possible that they have something that they shouldn't, or something that I may be missing. Also, I want them to be as efficient as possible so I'm looking forward for any optimization tips. I'm using the Java regex engine, but any regex flavour is acceptable.

+7  A: 

See Perl's Lingua::EN::Words2Nums and Lingua::EN::FindNumber.

In particular, the source code for Lingua::EN::FindNumber contains:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;

subject to Perl's Artistic License.

You can use Regex::PreSuf to automatically factor out common pre- and suffixes:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";

Output:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9

I am afraid it gets harder after this ;-)

Sinan Ünür
I did, but they seem to be used for converting regular cardinal numbers to English, like (from the example in FindNumber's source code) 84 to "fourscore and seven" (archaic!). They do have a numify method which seems to use a regex to convert text to numbers (which I don't need), but the method has this funky obscure line (to me) `$text =~ s/$number_re/words2nums($1). ($1 =~ m{(\s+)$} ? $1 :"")/eg;`.
JG
No, exactly the opposite. The regex is used to find numbers in English (including phrases like "fourscore and seven" and convert them to numbers. The regex in `Lingua::EN::FindNumber` is optimized by factoring out common prefixes and suffixes. The **funky obscure** lines invokes `words2nums` for each number phrase found, reinserting any trailing whitespace gobbled up by `words2nums`.
Sinan Ünür
Yes, I figured that out while writing the comment, and explained it in the second sentence. "The regex in Lingua::EN::FindNumber is optimized by factoring out **common prefixes and suffixes**." Thank you very much for this explanation; now it actually makes sense : ). Do you have any clue if that was created by hand or generated automatically ? Also, I don't want to convert the words to numbers, just want to recognize them.
JG
@JG Generated automatically: http://search.cpan.org/perldoc/Regex::PreSuf
Sinan Ünür
If ever a regex cried out for treatment with the 'x' modifier, that's it.
Jonathan Leffler
A: 

What are you trying to do and what algorithm are you using?

I'm not familiar with the Java regex engine, but other regex engines I've used (Perl, awk) allow you to capture matches. For example, if you're trying to match:

one million one hundred thousand one hundred one

You could have a regex that could identify the million, anything before it (i.e. 'one'), and everything after it (i.e. 'one hundred thousand one hundred one'). The things before would be captured for additional matching (using your hundreds, tens and ones regexes), and the things after it would be captured for additional matching (using your thousands regex, hundreds regex, tens regex or ones regex).

This algorithm is naturally recursive, and wouldn't be too hard to implement. Without knowing the specifics of what you're trying to accomplish or the details of Java's regex engine I can't suggest more.

Rob Jones
I'm just trying to recognize English numerals in text. For instance, in the sentence "There are six billion persons in the world, plus around seven hundred and seventy seven millions.", I would like to recognize "six billion" and "seven hundred and seventy seven millions".
JG
+2  A: 

Perl has a number of modules that produce optimized regexes (which mostly only use standard features, so should be usable in Java) using different techniques. You can see examples of Regexp::Assemble, Regexp::List, Regexp::Optimizer, and Regex::PreSuf output in http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015. Starting in perl 5.10, perl itself usually optimizes lists of |'d exact strings into a trie.

ysth
A: 

Regex really is a bad way to do this. Personally I'd just create a small map of all the known words and search in that fashion. (Search each word, when you find a match, determine if words next to it match, and continue thusly until you have the number).

Noon Silk
You'll end up doing almost the same as the regex engine, except if you're using something like Aho-Corasick's algorithm
JG