I need to extract the quantity and unit from strings like these:

1 tbsp
1tbsp 
300ml
300 ml
10grams
10 g

The quantity will always be a number, followed by an optional space and then the unit. There may be 15 to 20 different units, which can come from a list that we define (perhaps an array).

The solution can be in either JavaScript or PHP, as I need to split the strings before storing them in a database, i.e. the quantity and the unit need to be stored separately.

Thanks

EDIT: Sorry, to be clear: each new line represents a new string. That is, a string would contain only 10g OR 300ml, so we just need to split one quantity and one unit at a time.

+4  A: 

Okay, what you can do is create an array of allowed units, use array_map to apply preg_quote to each unit (so that any characters in a unit that are special in a regular expression get escaped), and then construct a regular expression:

$units = array("tbsp", "ml", "g", "grams"); // add whatever other units are allowed
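// pass "/" to preg_quote so the pattern delimiter is escaped too
$pattern = '/^(\d+)\s*(' . join("|", array_map(function ($unit) { return preg_quote($unit, "/"); }, $units)) . ')$/';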

The $pattern will thus become something like /^(\d+)\s*(tbsp|ml|g|grams)$/, and then you can use it to detect things that look like units in your string:

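$matches = array();
// assuming you have an array of measurement strings...
foreach ($measurement_strings as $measurement)
{
  // trim() drops stray whitespace, such as the trailing space in "1tbsp "
  if (!preg_match($pattern, trim($measurement), $matches))
  {
    continue; // not a recognised quantity/unit string
  }
  list(, $quantity, $unit) = $matches;
  // ...
}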

Because the pattern defines two capturing groups, for the quantity and unit respectively, you can then extract those out of the match and do what you want with them.
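
Since the question also allows JavaScript, the same whitelist idea can be sketched there as well. This is only a minimal sketch under assumptions: the unit list is the example one from above, and the names escapeRegExp and parseMeasurement are hypothetical helpers introduced for illustration, not part of any library.

```javascript
// Example unit list (an assumption); extend it with whatever units are allowed.
const units = ["tbsp", "ml", "g", "grams"];

// Escape regex metacharacters so each unit is matched literally
// (plays the role of preg_quote in the PHP version).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds /^(\d+)\s*(tbsp|ml|g|grams)$/ from the list.
const pattern = new RegExp(
  "^(\\d+)\\s*(" + units.map(escapeRegExp).join("|") + ")$"
);

// Returns { quantity, unit }, or null when the string does not match.
function parseMeasurement(measurement) {
  const match = pattern.exec(measurement.trim());
  if (match === null) {
    return null;
  }
  return { quantity: Number(match[1]), unit: match[2] };
}

console.log(parseMeasurement("300 ml"));  // { quantity: 300, unit: 'ml' }
console.log(parseMeasurement("10grams")); // { quantity: 10, unit: 'grams' }
console.log(parseMeasurement("5 cups"));  // null ("cups" is not in the list)
```

Because the pattern is anchored at both ends, "10grams" still resolves to "grams" even though "g" comes first in the alternation: the shorter alternative fails at the end anchor and the engine backtracks to the longer one.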

(I've updated my answer, based on the question update that each line is a separate string).

Daniel Vandersluis
I think OP was looking for more with this, like how to use that pattern to extract an array of matches.
hookedonwinter
@hookedonwinter I've edited my answer to that extent.
Daniel Vandersluis
@David awesome! I love the ability to add units on the fly. I think there's an error in the regex? The pattern in your code is different than the pattern in your explanation: `$/` vs `/$`. Trying to get it working in my IDE, but awesome so far.
hookedonwinter
@hookedonwinter Oops, yeah, that was a typo. The `^` and `$` are start and end of line anchors, and the slashes are beginning and end of pattern characters that the preg_* functions need.
Daniel Vandersluis
+3  A: 

Regex:

/(\d+)\s*(\D+)/

Code:

preg_match_all('/(\d+)\s*(\D+)/', $ingredients, $m);

$quantities = $m[1];
$units = array_map('trim', $m[2]);

$quantities and $units are:

Array
(
    [0] => 1
    [1] => 1
    [2] => 300
    [3] => 300
    [4] => 10
    [5] => 10
)
Array
(
    [0] => tbsp
    [1] => tbsp
    [2] => ml
    [3] => ml
    [4] => grams
    [5] => g
)

See: http://ideone.com/MSH8t

If you use this, you don't need to have a list of units ready. But it assumes your units contain no numeric characters and your quantities consist of digits only.
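
For comparison, here is a rough JavaScript sketch of the same list-free approach, assuming the six sample strings from the question arrive as one newline-separated block. The variable names are made up for illustration.

```javascript
// The sample strings from the question, joined into one block of text.
const ingredients = "1 tbsp\n1tbsp \n300ml\n300 ml\n10grams\n10 g";

const quantities = [];
const unitNames = [];

// \D+ also swallows spaces and newlines, so each unit is trimmed afterwards.
for (const match of ingredients.matchAll(/(\d+)\s*(\D+)/g)) {
  quantities.push(match[1]);
  unitNames.push(match[2].trim());
}

console.log(quantities); // [ '1', '1', '300', '300', '10', '10' ]
console.log(unitNames);  // [ 'tbsp', 'tbsp', 'ml', 'ml', 'grams', 'g' ]
```

Note that the quantities come back as strings here, just as they do in the PHP output above.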

quantumSoup
+2  A: 

Maybe something simple is enough, like this:

^([0-9]+)\s*([a-zA-Z]+)\s*$
Jarek Waliszko
Those start and end anchors there make it useless to match multiple lines
quantumSoup
Basically, you're right, but it also depends on the implementation. In C# you can specify RegexOptions.Multiline, and the pattern then works against multiple lines. For example, new Regex(@"^([0-9]+)\s*([a-zA-Z]+)\s*$", RegexOptions.Multiline) is equivalent to new Regex(@"([0-9]+)\s*([a-zA-Z]+)\s*")
Jarek Waliszko
@quantum: the OP has updated the question to say the strings will be processed individually, not as a block of multiline text, so the anchors shouldn't be a problem.
Alan Moore
@Jarek: are you saying a multiline regex with anchors is the same as a non-multiline regex *without* anchors? That isn't right. Multiline `^([0-9]+)\s*([a-zA-Z]+)\s*$` is equivalent to `(?<=\A|\n)([0-9]+)\s*([a-zA-Z]+)\s*(?=\n|\Z)`
Alan Moore
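
The difference discussed in these comments can be checked with a small experiment. The sketch below uses JavaScript's `m` flag as a stand-in for RegexOptions.Multiline, on the assumption that the two behave alike for this input; the sample text is made up so that its first line has a stray leading character.

```javascript
// First line starts with a stray "x", second line is a clean measurement.
const text = "x10 g\n300ml";

// Anchored pattern with the multiline flag: ^ and $ match at line boundaries.
const anchoredMultiline = /^([0-9]+)\s*([a-zA-Z]+)\s*$/gm;
// Same pattern with the anchors removed and no multiline flag.
const unanchored = /([0-9]+)\s*([a-zA-Z]+)\s*/g;

const anchoredMatches = [...text.matchAll(anchoredMultiline)].map(m => m[0].trim());
const unanchoredMatches = [...text.matchAll(unanchored)].map(m => m[0].trim());

console.log(anchoredMatches);   // [ '300ml' ]
console.log(unanchoredMatches); // [ '10 g', '300ml' ]
```

The anchored multiline pattern rejects the first line because the quantity does not start the line, while the unanchored pattern happily matches "10 g" in the middle of it, so the two are not interchangeable.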