I'm trying to build an alternative data entry mechanism, in which the user expresses some sort of command that I'll parse. Rather than go into the details of the vocabulary I would be using, here's an example of what I'm trying to accomplish, with apologies to Rex Harrison.

Given the following sentences:

the rain in spain falls on the plain

in spain on the plain falls the rain

on the meadow the snow falls in london

in pseudo regex:

(the (?<weather>\w+)) (in (?<city>\w+)) (falls) (on the (?<topography>\w+))

In short, I need to harvest the weather, city and topography out of the sentence using regex.

How do I express a set of captures that can occur in the input in any order?

+2  A: 

First off, this looks like a problem that begs for a natural language parser.

But if you really want a regex solution, you'll have to pick out each pattern separately, either by using 3 regexes or by alternating them with pipes, e.g.:

(the (?<weather>\w+))|(in (?<city>\w+))|(on the (?<topography>\w+))

Running the above against any of your sample sentences, you'll get 3 matches, each of which will have one of its three groups set.
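As a rough illustration of the alternation approach, here is a minimal Python sketch. It assumes Python's `(?P<name>...)` syntax instead of the .NET-style `(?<name>...)`, drops the outer unnamed groups, and adds `\b` word boundaries that are not in the pattern above; `harvest` is a hypothetical helper name.

```python
import re

# Alternation of the three phrases; each match sets exactly one named group.
# Assumptions: Python named-group syntax and the \b anchors are additions.
PATTERN = re.compile(
    r"\bon the (?P<topography>\w+)|\bthe (?P<weather>\w+)|\bin (?P<city>\w+)"
)

def harvest(sentence):
    """Collect whichever named groups were set across all matches."""
    found = {}
    for match in PATTERN.finditer(sentence):
        for name, value in match.groupdict().items():
            if value is not None:
                found[name] = value
    return found

if __name__ == "__main__":
    for text in (
        "the rain in spain falls on the plain",
        "in spain on the plain falls the rain",
        "on the meadow the snow falls in london",
    ):
        print(harvest(text))
```

Note that this version accepts missing or repeated phrases without complaint; a repeated phrase simply overwrites the earlier value.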

Max Shawabkeh
+2  A: 
^(?:on the (?<area>\w+)() ?|the (?<weather>\w+)() ?|in (?<location>\w+)() ?|falls() ){4}\1\2\3\4$

will match a sentence that contains each of the elements exactly once in any order. That's what the empty parentheses are for - each one has to take part in the match so the final \1\2\3\4 can match.

The named backreferences will contain the variable elements.
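For anyone trying this outside .NET, a possible Python adaptation is sketched below. It assumes Python's `re` module, where named groups are written `(?P<name>...)` and are numbered together with the unnamed ones, so the four empty groups become groups 2, 4, 6 and 7. `parse` is a hypothetical helper name.

```python
import re

# Python port of the pattern above (an adaptation, not the original .NET form).
# The empty groups () are numbered 2, 4, 6 and 7 here, because Python counts
# named and unnamed groups in one sequence.
ANY_ORDER = re.compile(
    r"^(?:on the (?P<area>\w+)() ?"
    r"|the (?P<weather>\w+)() ?"
    r"|in (?P<location>\w+)() ?"
    r"|falls() ){4}\2\4\6\7$"
)

def parse(sentence):
    """Return the named captures, or None if any element is missing or repeated."""
    match = ANY_ORDER.match(sentence)
    return match.groupdict() if match else None

if __name__ == "__main__":
    print(parse("the rain in spain falls on the plain"))
    print(parse("on the meadow the snow falls in london"))
    print(parse("the rain the snow falls on the plain"))  # duplicate phrase
```

The trailing backreferences only succeed when every empty group took part in one of the four repetitions, which is what rules out a sentence that repeats one phrase and omits another.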

Tim Pietzcker
+1: That's ingenious! Quick note though: some engines (e.g. Python's) will count the named captures with the numbered ones, so you might need to use `\2\4\6\7` in such a case.
Max Shawabkeh
That's interesting. RegexBuddy doesn't take this fact into account. In .NET, the above regex should work, though. It's probably generally a bad idea to mix named and unnamed capturing groups; I did it here to illustrate a point better.
Tim Pietzcker
@Tim Playing around with this some more: I expected to be able to use {1,4}\1\2\3\4 to capture at least 1 and at most 4 of the tokens, but that didn't seem to work; it's 4 or nothing. How would I match a partial "sentence"?
Ralph Shillington
The `\1\2\3\4` are used to make sure that each group participates once in the match. If you remove that bit, it should work - but then you have no protection against duplicates ("on the plain on the train"). If that is a problem, you're out of luck with regexes (which aren't really the right tool for this, as Max already noted).
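One way to get partial sentences while still rejecting duplicates is to move the duplicate check out of the regex and into code. The following is only a sketch of that idea in Python, not something tested against the .NET engine; `TOKEN`, `parse_partial` and the `verb` group name are hypothetical.

```python
import re

# One phrase per alternative; each alternative has exactly one named group.
TOKEN = re.compile(
    r"on the (?P<area>\w+)|the (?P<weather>\w+)|in (?P<location>\w+)|(?P<verb>falls)"
)

def parse_partial(sentence):
    """Accept 1 to 4 phrases in any order; return None on duplicates or stray text."""
    found = {}
    pos = 0
    while pos < len(sentence):
        match = TOKEN.match(sentence, pos)
        if match is None:
            return None  # text that is not one of the known phrases
        name = match.lastgroup
        if name in found:
            return None  # the same kind of phrase appeared twice
        found[name] = match.group(name)
        pos = match.end()
        if sentence.startswith(" ", pos):
            pos += 1  # skip the single space between phrases
    return found or None

if __name__ == "__main__":
    print(parse_partial("in spain the rain"))
    print(parse_partial("on the plain on the train"))
```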
Tim Pietzcker
@Tim, @Max: if there is a better tool for this that is accessible from .NET 3.5, then I'm game to give it a try. I picked regex because it's the closest I've got.
Ralph Shillington