ansaurus

Question

Answer 1

+7 A:

See Perl's Lingua::EN::Words2Nums and Lingua::EN::FindNumber.

In particular, the source code for Lingua::EN::FindNumber contains:

# This is from Lingua::EN::Words2Nums, after being thrown through
# Regex::PreSuf
my $numbers =
    qr/((?:b(?:akers?dozen|illi(?:ard|on))|centillion|d(?:ecilli(?:ard|on)|ozen|u(?:o(?:decilli(?:ard|on)|vigintillion)|vigintillion))|e(?:ight(?:een|ieth|[yh])?|leven(?:ty(?:first|one))?|s)|f(?:i(?:ft(?:een|ieth|[yh])|rst|ve)|o(?:rt(?:ieth|y)|ur(?:t(?:ieth|[yh]))?))|g(?:oogol(?:plex)?|ross)|hundred|mi(?:l(?:ion|li(?:ard|on))|nus)|n(?:aught|egative|in(?:et(?:ieth|y)|t(?:een|[yh])|e)|o(?:nilli(?:ard|on)|ught|vem(?:dec|vigint)illion))|o(?:ct(?:illi(?:ard|on)|o(?:dec|vigint)illion)|ne)|qu(?:a(?:drilli(?:ard|on)|ttuor(?:decilli(?:ard|on)|vigintillion))|in(?:decilli(?:ard|on)|tilli(?:ard|on)|vigintillion))|s(?:core|e(?:cond|pt(?:en(?:dec|vigint)illion|illi(?:ard|on))|ven(?:t(?:ieth|y))?|x(?:decillion|tilli(?:ard|on)|vigintillion))|ix(?:t(?:ieth|y))?)|t(?:ee?n|h(?:ir(?:t(?:een|ieth|y)|d)|ousand|ree)|r(?:e(?:decilli(?:ard|on)|vigintillion)|i(?:gintillion|lli(?:ard|on)))|w(?:e(?:l(?:fth|ve)|nt(?:ieth|y))|o)|h)|un(?:decilli(?:ard|on)|vigintillion)|vigintillion|zero|s))/i;

subject to Perl's Artistic License.

You can use Regex::PreSuf to automatically factor out common pre- and suffixes:

#!/usr/bin/perl

use strict;
use warnings;

use Regex::PreSuf;

my %singledigit = (
    one    => 1,
    two    => 2,
    three  => 3,
    four   => 4,
    five   => 5,
    six    => 6,
    seven  => 7,
    eight  => 8,
    nine   => 9,
);

my $singledigit = presuf(keys %singledigit);

print $singledigit, "\n";

my $text = "one two three four five six seven eight nine";

$text =~ s/($singledigit)/$singledigit{$1}/g;

print $text, "\n";

Output:

C:\Temp> cvb
(?:eight|f(?:ive|our)|nine|one|s(?:even|ix)|t(?:hree|wo))
1 2 3 4 5 6 7 8 9

I am afraid it gets harder after this ;-)

Sinan Ünür 2009-08-13 03:26:47

I did, but they seem to be used for converting regular cardinal numbers to English, like (from the example in FindNumber's source code) 84 to "fourscore and seven" (archaic!). They do have a numify method which seems to use a regex to convert text to numbers (which I don't need), but the method has this funky obscure line (to me) `$text =~ s/$number_re/words2nums($1). ($1 =~ m{(\s+)$} ? $1 :"")/eg;`.

JG 2009-08-13 03:47:09

No, exactly the opposite. The regex is used to find numbers in English (including phrases like "fourscore and seven" and convert them to numbers. The regex in `Lingua::EN::FindNumber` is optimized by factoring out common prefixes and suffixes. The **funky obscure** lines invokes `words2nums` for each number phrase found, reinserting any trailing whitespace gobbled up by `words2nums`.

Sinan Ünür 2009-08-13 03:57:52

Yes, I figured that out while writing the comment, and explained it in the second sentence. "The regex in Lingua::EN::FindNumber is optimized by factoring out **common prefixes and suffixes**." Thank you very much for this explanation; now it actually makes sense : ). Do you have any clue if that was created by hand or generated automatically ? Also, I don't want to convert the words to numbers, just want to recognize them.

JG 2009-08-13 04:12:12

@JG Generated automatically: http://search.cpan.org/perldoc/Regex::PreSuf

Sinan Ünür 2009-08-13 04:18:29

If ever a regex cried out for treatment with the 'x' modifier, that's it.

Jonathan Leffler 2009-08-13 05:28:01

Answer 2

A:

What are you trying to do and what algorithm are you using?

I'm not familiar with the Java regex engine, but other regex engines I've used (Perl, awk) allow you to capture matches. For example, if you're trying to match:

one million one hundred thousand one hundred one

You could have a regex that could identify the million, anything before it (i.e. 'one'), and everything after it (i.e. 'one hundred thousand one hundred one'). The things before would be captured for additional matching (using your hundreds, tens and ones regexes), and the things after it would be captured for additional matching (using your thousands regex, hundreds regex, tens regex or ones regex).

This algorithm is naturally recursive, and wouldn't be too hard to implement. Without knowing the specifics of what you're trying to accomplish or the details of Java's regex engine I can't suggest more.

Rob Jones 2009-08-13 03:34:21

I'm just trying to recognize English numerals in text. For instance, in the sentence "There are six billion persons in the world, plus around seven hundred and seventy seven millions.", I would like to recognize "six billion" and "seven hundred and seventy seven millions".

JG 2009-08-13 03:55:31

Answer 3

+2 A:

Perl has a number of modules that produce optimized regexes (which mostly only use standard features, so should be usable in Java) using different techniques. You can see examples of Regexp::Assemble, Regexp::List, Regexp::Optimizer, and Regex::PreSuf output in http://groups.google.com/group/perl.perl5.porters/msg/132877aee7542015. Starting in perl 5.10, perl itself usually optimizes lists of |'d exact strings into a trie.

ysth 2009-08-13 05:18:21

Answer 4

A:

Regex really is a bad way to do this. Personally I'd just create a small map of all the known words and search in that fashion. (Search each word, when you find a match, determine if words next to it match, and continue thusly until you have the number).

Noon Silk 2009-08-13 05:22:13

You'll end up doing almost the same as the regex engine, except if you're using something like Aho-Corasick's algorithm

JG 2009-08-13 14:52:41

ansaurus

tags:

views:

answers:

Scalable Regex for English Numerals

related questions