Hello there, I am trying to convert numerical values written as words into integers. For example, "iPhone has two hundred and thirty thousand seven hundred and eighty three apps" would become "iPhone has 230783 apps".

Before I start coding, I would like to know whether a function or library for this conversion already exists.

+1  A: 

The PEAR Numbers_Words package is probably a good start: http://pear.php.net/package-info.php?package=Numbers_Words

Jani Hartikainen
Thanks Jani. This package looks interesting, though it does the reverse of what I need, i.e. it converts numbers to words. It would be useful in future projects.
+2  A: 

There are lots of pages discussing the conversion from numbers to words. Not so many for the reverse direction. The best I could find was some pseudo-code on Ask Yahoo. See http://answers.yahoo.com/question/index?qid=20090216103754AAONnDz for a nice algorithm:

Well, overall you are doing two things: finding tokens (words that translate to numbers) and applying grammar. In short, you are building a parser for a very limited language.

The tokens you would need are:

POWER: thousand, million, billion
HUNDRED: hundred
TEN: twenty, thirty... ninety
UNIT: one, two, three, ... nine
SPECIAL: ten, eleven, twelve, ... nineteen

(Drop any "and"s, as they are meaningless. Break hyphenated words into two tokens; that is, "sixty-five" should be processed as "sixty" "five".)
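
As a rough illustration of the tokenizing step, here is a minimal PHP sketch. The function name classify_tokens and the [class, value] pair layout are assumptions made for this example and are not part of the quoted algorithm; only the five token classes come from the list above.

```php
<?php
// Hypothetical helper: turns a phrase into a list of [token class, value] pairs.
function classify_tokens(string $text): array
{
    $classes = [
        'UNIT'    => ['one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
                      'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9],
        'SPECIAL' => ['ten' => 10, 'eleven' => 11, 'twelve' => 12, 'thirteen' => 13,
                      'fourteen' => 14, 'fifteen' => 15, 'sixteen' => 16,
                      'seventeen' => 17, 'eighteen' => 18, 'nineteen' => 19],
        'TEN'     => ['twenty' => 20, 'thirty' => 30, 'forty' => 40, 'fifty' => 50,
                      'sixty' => 60, 'seventy' => 70, 'eighty' => 80, 'ninety' => 90],
        'HUNDRED' => ['hundred' => 100],
        'POWER'   => ['thousand' => 1000, 'million' => 1000000, 'billion' => 1000000000],
    ];

    // Lower-case, then split on whitespace, commas and hyphens,
    // so "sixty-five" becomes the two words "sixty" and "five".
    $words = preg_split('/[\s,-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($words as $word) {
        if ($word === 'and') {
            continue; // "and" carries no numeric meaning
        }
        foreach ($classes as $class => $map) {
            if (array_key_exists($word, $map)) {
                $tokens[] = [$class, $map[$word]];
                continue 2; // classified, move on to the next word
            }
        }
        throw new InvalidArgumentException("Not a number word: $word");
    }
    return $tokens;
}
```

For "sixty-five thousand and two" this returns TEN 60, UNIT 5, POWER 1000, UNIT 2. The sketch assumes the input contains only number words; anything else raises an exception.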

Once you've tokenized your string, move from RIGHT TO LEFT.

  1. Grab all the tokens from the RIGHT until you hit a POWER or the whole string.

  2. Parse the tokens after the stop point for these patterns:

    SPECIAL
    TEN
    UNIT
    TEN UNIT
    UNIT HUNDRED
    UNIT HUNDRED SPECIAL
    UNIT HUNDRED TEN
    UNIT HUNDRED UNIT
    UNIT HUNDRED TEN UNIT

    (This assumes that "seventeen hundred" is not allowed in this grammar)

    This gives you the last three digits of your number.

  3. If you stopped at the whole string you are done.

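  4. If you stopped at a POWER, start again at step 1 with the tokens to its left, until you reach a higher POWER or the whole string. The three digits you get from that pass are multiplied by the POWER you stopped at and added to the running total.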
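
For illustration, the right-to-left steps above might be sketched in PHP roughly as follows. The function names (number_word_values, group_value, words_to_int) are hypothetical, and this version does not check each group against the nine patterns in step 2; it only computes the value, so malformed input is not rejected.

```php
<?php
// Hypothetical helper: values for every word below "thousand".
function number_word_values(): array
{
    return [
        'one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
        'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9, 'ten' => 10,
        'eleven' => 11, 'twelve' => 12, 'thirteen' => 13, 'fourteen' => 14,
        'fifteen' => 15, 'sixteen' => 16, 'seventeen' => 17, 'eighteen' => 18,
        'nineteen' => 19, 'twenty' => 20, 'thirty' => 30, 'forty' => 40,
        'fifty' => 50, 'sixty' => 60, 'seventy' => 70, 'eighty' => 80, 'ninety' => 90,
    ];
}

// Value of one group of tokens (at most three digits), given in reading order.
function group_value(array $group): int
{
    $small = number_word_values();
    $value = 0;
    foreach ($group as $word) {
        if ($word === 'hundred') {
            $value *= 100;              // UNIT HUNDRED ...
        } elseif (array_key_exists($word, $small)) {
            $value += $small[$word];    // UNIT, TEN or SPECIAL
        } else {
            throw new InvalidArgumentException("Not a number word: $word");
        }
    }
    return $value;
}

function words_to_int(string $text): int
{
    $powers = ['thousand' => 1000, 'million' => 1000000, 'billion' => 1000000000];

    // Tokenize: lower-case, split on spaces, commas and hyphens, drop "and".
    $tokens = preg_split('/[\s,-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array_values(array_diff($tokens, ['and']));

    $total = 0;
    $multiplier = 1;   // the rightmost group is not scaled
    $group = [];
    for ($i = count($tokens) - 1; $i >= 0; $i--) {   // walk from RIGHT to LEFT
        $word = $tokens[$i];
        if (array_key_exists($word, $powers)) {
            // Hit a POWER: close the group collected so far, then scale the next one.
            $total += group_value($group) * $multiplier;
            $group = [];
            $multiplier = $powers[$word];
        } else {
            array_unshift($group, $word);            // keep reading order inside the group
        }
    }
    return $total + group_value($group) * $multiplier;
}

echo words_to_int('two hundred and thirty thousand seven hundred and eighty three'), "\n"; // 230783
```

Values in the billions need a 64-bit PHP build, since the result is a plain int here rather than a BCMath string.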

John Kugelman
Thank you John! This algo is exactly what I was looking for. I was trying to parse it from left to right, but this looks better. Appreciate your help!
+1 John - Your answers are always great.
alex
Why are we processing tokens from the right?
joebert
@joebert 'cause it's easier to code :)
Csaba Kétszeri
+1  A: 

I haven't tested this too extensively, I more or less just worked on it until I saw what I expected in the output, but it seems to work, and parses from left-to-right.

<?php

$str = 'twelve billion people know iPhone has two hundred and thirty thousand, seven hundred and eighty-three apps as well as over one million units sold';

// usort() callback: longest strings first, so longer phrases are replaced before shorter ones.
function strlen_sort($a, $b)
{
    return strlen($b) - strlen($a);
}

$keys = array(
    'one' => '1', 'two' => '2', 'three' => '3', 'four' => '4', 'five' => '5', 'six' => '6', 'seven' => '7', 'eight' => '8', 'nine' => '9',
    'ten' => '10', 'eleven' => '11', 'twelve' => '12', 'thirteen' => '13', 'fourteen' => '14', 'fifteen' => '15', 'sixteen' => '16', 'seventeen' => '17', 'eighteen' => '18', 'nineteen' => '19',
    'twenty' => '20', 'thirty' => '30', 'forty' => '40', 'fifty' => '50', 'sixty' => '60', 'seventy' => '70', 'eighty' => '80', 'ninety' => '90',
    'hundred' => '100', 'thousand' => '1000', 'million' => '1000000', 'billion' => '1000000000'
);


preg_match_all('#((?:^|and|,| |-)*(\b' . implode('\b|\b', array_keys($keys)) . '\b))+#i', $str, $tokens);
//print_r($tokens); exit;
$tokens = $tokens[0];
usort($tokens, 'strlen_sort');

foreach($tokens as $token)
{
    $token = trim(strtolower($token));
    // Match each number word on its own; separators ("and", commas, spaces,
    // hyphens) are simply skipped between matches.
    preg_match_all('#\b(?:' . implode('|', array_keys($keys)) . ')\b#', $token, $words);
    $words = $words[0];
    //print_r($words);
    $num = '0'; $total = '0';
    foreach($words as $word)
    {
        $val = $keys[$word];
        //echo "$val\n";
        if(bccomp($val, '100') == -1)
        {
            // Units, teens and tens are added to the current group.
            $num = bcadd($num, $val);
            continue;
        }
        else if(bccomp($val, '100') == 0)
        {
            // "hundred" multiplies the current group.
            $num = bcmul($num, $val);
            continue;
        }
        // thousand and above: scale the group, add it to the total, start a new group.
        $num = bcmul($num, $val);
        $total = bcadd($total, $num);
        $num = '0';
    }
    $total = bcadd($total, $num);
    echo "$total:$token\n";
    $str = preg_replace("#\b$token\b#i", number_format($total), $str);
}
echo "\n$str\n";

?>
joebert
Found one flaw, it misses common mixtures of numbers and words such as "2 million".
joebert
It will also mess with certain wordings for dates, e.g. "I was born in nineteen eighty one".
joebert
Thank you very much Joebert for the code! I'll try to improve on it. I have set up a test set of 10000 random number words (generated with Numbers_Words), and currently the accuracy of decoding words to numbers is 75%. Correct: "forty five thousand five hundred and fifty four" becomes 45554. Incorrect: "fifty one thousand five hundred and eighty six" becomes 586.
Just realized the issue. Something funny happens when the first key, i.e. 'one', is matched. If you put 'quadrillion' => '1000000000000000' before 'one' instead, it works with 100% accuracy.
Also, include 'lakh' => '100000' and 'crore' => '10000000' in $keys. They are more common terms than million in South Asian countries.
That makes sense. I have a filesize formatter that works similarly. I must have been in a rush and forgot to put the largest numbers first in the check.
joebert
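
The 586 result reported above is consistent with how alternation binds in a regular expression: when a separator prefix such as (?:and|,| |-)* is concatenated in front of an ungrouped list like \bone\b|\btwo\b|..., the prefix belongs to the first alternative only. A phrase such as "fifty one" can then come back as a single match that is not a key of the lookup array. This is a guess at the cause based on the behaviour described in the comments, not something confirmed in the thread. A minimal sketch with a shortened, hypothetical word list:

```php
<?php
// Shortened, hypothetical word list; a real one would hold every number word.
$words = ['one', 'two', 'fifty', 'thousand'];
$alternation = '\b' . implode('\b|\b', $words) . '\b';

// Prefix concatenated in front of an ungrouped alternation:
// "(?:and|,| |-)*" ends up belonging to the first alternative ("one") only.
$ungrouped = '#(?:(?:and|,| |-)*' . $alternation . ')+#';

// One number word per match; separators are simply skipped between matches.
$perWord = '#\b(?:' . implode('|', $words) . ')\b#';

preg_match_all($ungrouped, 'fifty one thousand', $a);
preg_match_all($perWord, 'fifty one thousand', $b);

print_r($a[0]); // "fifty one" comes back as a single match, followed by "thousand"
print_r($b[0]); // "fifty", "one", "thousand"
```

This would also explain why moving a rarely used word such as 'quadrillion' to the front of the array appears to help: the prefix then attaches to a word that almost never occurs, so every other word is matched on its own.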