I have HTML that contains the weight of an item.

<div><b>Item Weight  (0.51 lbs in Warehouse 3)</b></div>

I need a regex to get the weight and unit of measure.

So in the above html, I need:

0.51

and

lbs

I am using Java and already have a helper method; I just need to get the regex down now!

 String regexPattern = "";

String result = "";

Pattern p = Pattern.compile(regexPattern);
Matcher m = p.matcher(text);

if(m.find())
    result = m.group(1).trim();
A: 

What about:

((?:\d+\.)?\d+ \w{3})
Rubens Farias
That will only work for lbs. If he wants to capture that part of the data, I'm guessing there are probably other units of measurement (e.g. kg).
Brian McKenna
Not crazy about the fact that it requires the decimal point, or that the unit is required to be 3 characters long.
danben
Great! Can't you just grab all the data? Maybe there is no decimal. Or does that work without a decimal too?
mrblah
I understand you both, but I've learned not to assume things I don't know; that expression matches the OP's sample data.
Rubens Farias
@mrblah, it now supports input without a decimal.
Rubens Farias
+3  A: 

This should do the trick

(\d*\.?\d+)\s?(\w+)

The first capture group will be the weight and the second will be the unit of measure.
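A minimal sketch of how the two groups could be read in Java, assuming the asker's helper is extended to return both values. The class name `WeightGroups` and the method `extract` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeightGroups {
    // The pattern from this answer, escaped for a Java string literal.
    // Group 1 is the number, group 2 is the unit.
    private static final Pattern WEIGHT =
            Pattern.compile("(\\d*\\.?\\d+)\\s?(\\w+)");

    // Hypothetical helper: returns {weight, unit}, or null if nothing matches.
    static String[] extract(String text) {
        Matcher m = WEIGHT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight  (0.51 lbs in Warehouse 3)</b></div>";
        String[] r = extract(html);
        System.out.println(r[0] + " " + r[1]); // prints: 0.51 lbs
    }
}
```

On a full page, `find()` stops at the first place where a number is followed by word characters, so unrelated text such as `100px` earlier in the markup would win. Narrowing the input to the right element first avoids that.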

cory.m.smith
Will this work for single digit weights?
Roman Stolper
@Roman - No, you'd need to change the first \d+ to \d*.
Steve Wortham
Good call Steve!
cory.m.smith
A: 

Why use regex? Since you always rely on some sort of format, you can also assume that the last brackets are the weight and location and that the weight and unit of measure is always formatted like that, e.g. with spaces.

@Test
public void testParseWeight() throws Exception {
    String input = "<div><b>Item Weight  (0.51 lbs in Warehouse 3)</b></div>";
    int startPos = input.lastIndexOf('(');
    int space = input.indexOf(' ', startPos);
    String weight = input.substring(startPos + 1, space);
    String uom = input.substring(space + 1, input.indexOf(' ', space + 1));
    Number parse = NumberFormat.getNumberInstance(Locale.US).parse(weight);
    assertEquals(0.51d, parse.doubleValue(), 0.0d);
    assertEquals("lbs", uom);
}
mhaller
well I do have the entire HTML, that was just a snippet!
mrblah
I assume you are able to identify the element that contains the weight. Otherwise, if you're using regex for HTML parsing, you will fail.
mhaller
+1  A: 

This is what I came up with:

\((?<Weight>\d*\.?\d+)\s(?<Unit>\w+)

This will return the weight in group "Weight" and the unit of measure in group "Unit". And this will work with or without a decimal.

There are a couple of assumptions I made:

  • The weight must be listed immediately after an opening parenthesis.
  • There must be a space between the weight and the unit of measure.

If those assumptions aren't always accurate then the regular expression will need some more tweaking.
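A rough sketch of the named-group version in Java, assuming Java 7 or later (where `(?<name>...)` is supported). The class name `NamedWeightGroups` and the method `extract` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedWeightGroups {
    // Same pattern as above, escaped for a Java string literal.
    private static final Pattern WEIGHT =
            Pattern.compile("\\((?<Weight>\\d*\\.?\\d+)\\s(?<Unit>\\w+)");

    // Hypothetical helper: returns {weight, unit}, or null if there is no match.
    static String[] extract(String text) {
        Matcher m = WEIGHT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("Weight"), m.group("Unit") };
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight  (0.51 lbs in Warehouse 3)</b></div>";
        String[] r = extract(html);
        System.out.println(r[0] + " / " + r[1]); // prints: 0.51 / lbs

        // The number does not directly follow "(", so the first assumption
        // is violated and nothing matches.
        System.out.println(extract("Item Weight (approx. 0.51 lbs)") == null); // prints: true
    }
}
```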

Steve Wortham
A: 

You shouldn't use regexp for HTML. A better approach would be to use a parser (like NekoHTML) with XPath (through Jaxen, for example).

Valentin Rocher
He's not parsing HTML. He's extracting a number in a string, which happens to be in HTML. The reflex "regex and HTML bad" response is too strong around here.
McPherrinM
A: 

Will "Weight" always be in the string? If so, a better regex would be:

Weight.*?(\d+(?:\.\d+)?)\s+(\w+)

I assume this is valid in Java regex, as it works in Perl. The above assumes weights < 1 will be formatted as 0.X. If they can begin with a decimal point, use this:

Weight.*?(\d*\.?\d+)\s+(\w+)
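A small sketch comparing the two forms in Java on a weight written without a leading zero. The leading-decimal variant is written here with the dot escaped and the parentheses balanced, which is an assumption about what was intended. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeightAnchored {
    // First form: the number must start with a digit.
    static final Pattern LEADING_DIGIT =
            Pattern.compile("Weight.*?(\\d+(?:\\.\\d+)?)\\s+(\\w+)");

    // Second form: the number may start with a decimal point.
    static final Pattern LEADING_POINT_OK =
            Pattern.compile("Weight.*?(\\d*\\.?\\d+)\\s+(\\w+)");

    // Hypothetical helper: returns the captured number, or null if no match.
    static String number(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Item Weight  (.5 lbs in Warehouse 3)";
        // The first form skips the point and captures only the digit.
        System.out.println(number(LEADING_DIGIT, text));    // prints: 5
        System.out.println(number(LEADING_POINT_OK, text)); // prints: .5
    }
}
```

Both forms still capture `0.51` from the sample in the question; they differ only when the leading zero is missing.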

Jeff B
+1  A: 

If you know the units beforehand, specifying a list of units may give better results:

([\d.]+)\s+(lbs?|oz|g|kg) 
Jimmy
A: 

I think the pattern you want is:

(\d*\.?\d+)\s*(lbs?|kg)

This will get the numbers right, and, as Jimmy pointed out, you should anchor it with the actual units to restrict your matches to measures of weight (or whatever other measures you care about).
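To illustrate why the unit list helps, here is a hedged sketch in Java that runs this pattern next to a looser one ending in `\w+` on text containing an unrelated number. The sample text, the `ANY_WORD` comparison pattern, and the class and method names are all made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnitRestricted {
    // Looser comparison pattern: any word may follow the number.
    static final Pattern ANY_WORD =
            Pattern.compile("(\\d*\\.?\\d+)\\s*(\\w+)");

    // The pattern from this answer: only known weight units are accepted.
    static final Pattern KNOWN_UNITS =
            Pattern.compile("(\\d*\\.?\\d+)\\s*(lbs?|kg)");

    // Hypothetical helper: returns "number unit" for the first match, or null.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    public static void main(String[] args) {
        String text = "Ships in 3 days. Item Weight (2 kg in Warehouse 3)";
        System.out.println(firstMatch(ANY_WORD, text));    // prints: 3 days
        System.out.println(firstMatch(KNOWN_UNITS, text)); // prints: 2 kg
    }
}
```

Note that `lbs?` also matches the start of a longer word; appending `\b` after the unit group is one possible tightening, though that goes beyond the pattern given here.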

brianary