Hi guys - I'm kind of a newbie at regular expressions, so I'd appreciate a bit of peer feedback on this one. It will be heavily used on my site, so any weird edge cases could totally wreak havoc. The idea is to type in the amount of an ingredient in a recipe, in whole units or fractions. Due to my autocomplete mechanism, just a number is valid too (since it'll pop up a dropdown). These lines are valid:

1
1/2
1 1/2
4 cups
4 1/2 cups
10 3/4 cups sliced

The numeric part of the line should be its own group so I can parse that with my fraction parser. Everything after the numeric part should be a second group. At first, I tried this:

^\s*(\d+|\d+\/\d+|\d+\s*\d+\/\d+)\s*(.*)$

This almost works, but "1 1/2 cups" gets parsed as (1) (1/2 cups) instead of (1 1/2) (cups). After scratching my head a bit, I determined this was because of the ordering of my "OR" clause: the alternatives are tried left to right, so (1) satisfies the leading \d+ and (.*) takes everything else. So I changed this to:

^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*(.*)$

This almost works, but allows weirdness such as "1 1/2/4 cups" or "1/2 3 cups". So I decided to enforce a letter as the first character after a valid numeric expression:

^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*($|[a-z].*)$
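For reference, here's a quick sketch of how the first and the last pattern behave as JavaScript literals. The `groups` helper is just a made-up name for this check, not something from my site, and I'm assuming the `i` flag throughout:

```javascript
// First attempt: the bare \d+ alternative is tried first and wins.
const firstTry = /^\s*(\d+|\d+\/\d+|\d+\s*\d+\/\d+)\s*(.*)$/i;
// Last version: \d+ is tried last, and the tail must be empty or start with a letter.
const finalTry = /^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*($|[a-z].*)$/i;

// Hypothetical helper: returns [numericPart, rest], or null when the line is rejected.
function groups(re, line) {
  const m = re.exec(line);
  return m ? [m[1], m[2]] : null;
}

console.log(groups(firstTry, "1 1/2 cups"));         // ["1", "1/2 cups"]  (the ordering problem)
console.log(groups(finalTry, "1 1/2 cups"));         // ["1 1/2", "cups"]
console.log(groups(finalTry, "10 3/4 cups sliced")); // ["10 3/4", "cups sliced"]
console.log(groups(finalTry, "1"));                  // ["1", ""]
console.log(groups(finalTry, "1 1/2/4 cups"));       // null
console.log(groups(finalTry, "1/2 3 cups"));         // null
```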

Note I'm running this in case-insensitive mode. Here are my questions:

  1. Can the expression be improved? I kinda don't like the "OR" list for number, fraction, compound fraction, but I couldn't think of another way to allow all three.

  2. It would be extra nice if I could return a group for each word after the numeric component, for example a group for (10 3/4), a group for (cups) and a group for (sliced). There can be any number of words after. Is this possible?

Thanks!

+2  A: 

Well, it appears to me that you don't need OR conditions at all (but see below).

For the numeric bit, you could get away with:

\d+(\s+\d+/\d+)

which would handle all those fractional values.

I would still keep your decimal separate with an OR clause since it's likely to complicate things. So I think you could probably get away with something like:

^\s*((\d+\s)?(\d+/\d+)?|\d+(\.\d+)?)\s*([a-z].*)?$
 |   |                  |           |  |
 |   |                  |           |  +--- start of alpha section.
 |   |                  |           +------ optional white space.
 |   |                  +------------------ decimal (nn[.nn])
 |   +------------------------------------- fractional ([nn ][nn/nn])
 +----------------------------------------- optional starting space.

although that allows for an empty fractional amount so you may be better off with what you've got (whole, fractional and decimal in separate OR clauses).
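To make the empty-amount caveat concrete, here's a rough JavaScript sketch. The names are mine, and the `strict` pattern is only one assumed way of spelling out "separate OR clauses", so treat it as a starting point rather than a tested solution. Note that the `/` has to be escaped inside a JavaScript regex literal, and the `i` flag is needed for the `[a-z]` part to accept capitals:

```javascript
// The combined pattern from above, written as a JavaScript literal.
const combined = /^\s*((\d+\s)?(\d+\/\d+)?|\d+(\.\d+)?)\s*([a-z].*)?$/i;
// Assumed stricter variant: compound, fraction, decimal and whole as separate alternatives.
const strict = /^\s*(\d+\s+\d+\/\d+|\d+\/\d+|\d+\.\d+|\d+)\s*([a-z].*)?$/i;

// Group 1 is the amount; null means the line was rejected.
const amount = (re, line) => {
  const m = re.exec(line);
  return m ? m[1] : null;
};

console.log(amount(combined, "1 1/2 cups")); // "1 1/2"
console.log(amount(combined, "cups"));       // ""  (matched with no amount at all)
console.log(amount(strict, "cups"));         // null
console.log(amount(strict, "1 1/2 cups"));   // "1 1/2"
console.log(amount(strict, "1.5 cups"));     // "1.5"
```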

I prefer the ([a-z].*)?$ construct to ($|[a-z].*)$ myself, but that may just be an aversion on my part to having multiple end-of-line markers in my RE :-)


But, in all honesty, I think you may be trying to swat a fly with a thermo-nuclear warhead here.

Do you really need to restrict what gets entered? I've seen recipes that call for a pinch of salt and a handful of sultanas. I personally think you may be being too restrictive in what you'll allow. I would have a free-form field for quantity and a drop-down for food-type (actually I would probably just allow free-form for the lot unless I was offering the ability to search for recipes based on what's in the fridge).

paxdiablo
Maybe we're using different parsers, but that doesn't match any of my examples above.. But I think I see what you're trying to do with the question mark..
Mike
@Mike, I'm not as au fait with the Javascript RE engine as I would like but I'd hoped the descriptive bits were getting across the idea.
paxdiablo
Yup, looking at your expression I think it should work too, but for some reason it does not :) I'm using RegExTester.com to test things.
Mike
As for your second point, why I don't just allow free form amounts, my entire site revolves around the ability to graph relationships between recipes and convert across forms of ingredients (how many oz of cheese is 3/4 cup shredded, etc). You can do things like put in the ingredients and amounts you have and how many recipes you want, and it will tell you the most efficient set of recipes you can make with that. For this reason, ingredients are /highly/ normalized. Yes, sucks from a UI point of view but that's my challenge, to make it as easy as possible.
Mike
@Mike, that's not a bad idea, it would be useful in one other way. The number of times I've cursed cookbook writers for telling me to measure out 10 fl oz of something and I go "WTH is that in ml?" and I have to go hunting for a conversion table. You _may_ find it useful to store everything in metric (or imperial, whatever you prefer) and allow the user to choose their presentation units. That way, even though the DB might say 10 fl oz, all the user sees is the measure they know. Nothing to do with your question of course but I'd pay money for that feature.
paxdiablo
@paxdiablo: +1 for the extended answer and the clean solution with the `\d+(\s+\d+/\d+)`
WoLpH
I have a few of these features already. First off, it determines the best unit to express the amount in as you change the serving sizes. If you want to make 8,000 servings of cookies, you'll be needing like 25 gallons of milk instead of 400 cups. I also have some mouse "hover-over" conversions between standard and metric, but will be improving this hopefully. You can read a bit on the project at http://blog.kitchenpc.com if you're interested.
Mike
+1  A: 

I believe that this regex should do what you want:

/^\s*(\d+ \d+\/\d+|\d+\/\d+|\d+)\s*(.*)/

For matching the specific words you should just do a split on whitespace after the parsing. There are some things you don't want to do with regexes ;)
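A rough sketch of that two-step approach in JavaScript, assuming the pattern above; `parseIngredient` and the shape of what it returns are made-up names for illustration. As far as I know, a repeated capture group in JavaScript only keeps its last match, which is why splitting afterwards is easier than trying to get one group per word:

```javascript
// Step 1: the regex separates the amount from everything after it.
const AMOUNT_RE = /^\s*(\d+ \d+\/\d+|\d+\/\d+|\d+)\s*(.*)/;

// Hypothetical helper: returns { amount, words }, or null if the line has no amount.
function parseIngredient(line) {
  const m = AMOUNT_RE.exec(line);
  if (!m) return null;
  // Step 2: split the remainder on whitespace; filter(Boolean) drops the
  // empty string produced when nothing follows the amount.
  const words = m[2].trim().split(/\s+/).filter(Boolean);
  return { amount: m[1], words };
}

console.log(parseIngredient("10 3/4 cups sliced")); // amount "10 3/4", words ["cups", "sliced"]
console.log(parseIngredient("1"));                  // amount "1", words []
console.log(parseIngredient("cups"));               // null
```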

WoLpH
Yup that works, only no decimal support.. and I changed (.*) to ([a-z].*) to get rid of things like 1/2/ cups..
Mike
Actually probably ($|[a-z].*) is even better, since I don't want to require anything after the numeric part.
Mike
Ah yes. If you want decimal support then `[\d.]+` should be used instead. It is difficult to keep it fully contained in one regex if you want to add complex rules though.
WoLpH
Based on the above comment, I've decided ([a-z].*)? is better than my way :)
Mike