Hey all,
I'm using tr1::regex to try to extract some matches from a string. An example string could be

asdf werq "one two three" asdf

And I would want to get out of that:

asdf  
werq  
one two three  
asdf  

With stuff in quotes grouped together, so I'm trying to use the regex \"(.+?)\"|([^\\s]+). The code I'm using is:

cmatch res;
regex reg("\"(.+?)\"|([^\\s]+)", regex_constants::icase);
regex_search("asdf werq \"one two three\" asdf", res, reg);

cout << res.size() << endl;
for (unsigned int i = 0; i < res.size(); ++i) {
    cout << res[i] << endl;
}

but that outputs

3
asdf

asdf

What am I doing wrong?

A: 

You may want to try the following regex instead:

(?<=")[^"]*(?=")|[^"\s]\S*

When quoted, it of course needs to be escaped:

"(?<=\")[^\"]*(?=\")|[^\"\\s]\\S*"

Btw, the code you used matches only the first word in the target string, because regex_search stops at the first match it finds; to get every match you have to iterate over the string. The 3 items you are getting in the result are (1) the entire match, (2) the first capture, which is empty, and (3) the second capture, which is the part that actually matched.
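A minimal sketch of that behaviour, written against C++11 std::regex rather than tr1 (an assumption; the TR1 interface should be nearly identical). The helper name first_match_groups is made up for illustration. It calls regex_search once and collects whatever ends up in the match results:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the entries regex_search reports for the
// FIRST match only (index 0 = whole match, 1 and 2 = the two captures).
std::vector<std::string> first_match_groups(const std::string& text) {
    static const std::regex reg("\"(.+?)\"|([^\\s]+)");
    std::smatch res;
    std::vector<std::string> groups;
    if (std::regex_search(text, res, reg)) {
        for (std::size_t i = 0; i < res.size(); ++i) {
            // An unmatched capture converts to an empty string.
            groups.push_back(res[i].str());
        }
    }
    return groups;
}
```

For the string from the question this gives three entries: "asdf", an empty string, and "asdf" again, which is the output shown in the question.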

Roy Sharon
Actually when I use that regex, when I run the program, this is outputted to the console: "This application has requested the Runtime to terminate it in an unusual way. Please contact..." blah blah blah, and it crashes.
Thomas T.
I don't have a working environment where I can check this, but I've tested the regex with both Java and C#, and it didn't crash in either. Please use syntax_option_type=extended to make sure it follows the standard syntax for extended regexes. (BTW, I made a small fix to the first part of the regex to prevent it from capturing a space after the end of a quoted word.)
Roy Sharon
You're invited to play with the regex here: http://www.myregextester.com/?r=a9e366fd
Roy Sharon
I changed it to regex reg("(?<=\")[^\"]*(?=\")|[^\"\\s]\\S*", regex_constants::syntax_option_type::extended); and I got that error again :( Any idea why?
Thomas T.
When I wrap it in a try/catch block, it catches the exception "Unknown exception". EDIT: If I change the catch from catch(exception e) to catch(const regex_error&), I get "regular expression error".
Thomas T.
Weird error: no matter what regex I use, the regex_constants::syntax_option_type::extended always makes it crash. If I take that out and remove the (?<=\") from your regex, it doesn't crash.
Thomas T.
Confirmed, regex_constants::extended makes all regexes crash. Without any second argument, I can use your regex if I take out the lookbehind, but that yields incorrect results.
Thomas T.
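For reference, a small sketch of checking a pattern up front by catching the exception by const reference. It uses C++11 std::regex instead of tr1 (an assumption), and pattern_compiles is a hypothetical helper name. The default ECMAScript grammar has lookahead but no lookbehind, so the constructor is expected to throw regex_error for a pattern containing (?<=...):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the regex constructor accepts the pattern,
// false if it throws std::regex_error (e.what() would hold an
// implementation-defined message such as "regular expression error").
bool pattern_compiles(const std::string& pattern) {
    try {
        std::regex reg(pattern);  // default grammar: ECMAScript
        (void)reg;                // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling it with the lookbehind version of the pattern should return false, while the same pattern without the lookbehind part compiles.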
I understand. Okay, there's another option, which is slightly more complicated but doesn't require a lookbehind assertion. I will add it in a few minutes to the solution above.
Roy Sharon
If I use "\"(.+?)\"|([^\\s]+)" and for (std::tr1::sregex_token_iterator i(str.begin(), str.end(), reg); i != end; ++i), iterator style, it works fine except when it matches something in quotes, it includes the quotes for some reason.
Thomas T.
Yeah, because you get the entire match, which includes the quotes. See the explanation in the other solution I've posted.
Roy Sharon
BTW, `\"[^\"]*\"` is better than `\".*?\"` because it (1) also matches \n inside the quotes, and (2) is somewhat faster. Also, `[^\\s]` is equivalent to `\\S`.
Roy Sharon
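A quick sketch of the newline point, assuming C++11 std::regex with its default ECMAScript grammar, where `.` does not match a line break. The two function names are made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helpers comparing the two ways of matching a quoted string.
bool quoted_with_dot(const std::string& text) {
    static const std::regex reg("\".*?\"");    // '.' stops at a line break
    return std::regex_search(text, reg);
}

bool quoted_with_class(const std::string& text) {
    static const std::regex reg("\"[^\"]*\""); // anything except a quote
    return std::regex_search(text, reg);
}
```

Both find "one two", but only the character-class version finds a quoted string that contains \n.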
A: 

It appears that your regex engine does not support lookbehind assertions. To avoid using lookbehinds, you can try the following:

"([^"]*)"|(\S+)

or quoted:

"\"([^\"]*)\"|(\\S+)"

This regex will work, but each match will have two captures, one of which will be empty (either the first -- in case of a non-quoted word, or the second -- in case of a quoted string).

To be able to use this you need to iterate over all matches, and for each match use the non-empty capture.

I don't know enough about TR1 to say exactly how one iterates over all matches, but if I'm not mistaken, res.size() will always be 3.

For example, for the string asdf "one two three" werq the first match will be:

res[0] = "asdf"              // the entire match
res[1] = ""                  // the first capture
res[2] = "asdf"              // the second capture

The second match will be:

res[0] = "\"one two three\"" // the entire match including leading/trailing quotes
res[1] = "one two three"     // the first capture
res[2] = ""                  // the second capture

and the third match will be:

res[0] = "werq"              // the entire match
res[1] = ""                  // the first capture
res[2] = "werq"              // the second capture
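One way to do that iteration, sketched with C++11 std::sregex_iterator (an assumption on my part; std::tr1::sregex_iterator should behave the same way). The function name split_tokens is hypothetical. Each step of the iterator yields a full match object, so you can pick whichever capture took part in the match:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace, keeping quoted runs together
// and dropping the surrounding quotes.
std::vector<std::string> split_tokens(const std::string& text) {
    static const std::regex reg("\"([^\"]*)\"|(\\S+)");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), reg), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Test .matched rather than emptiness, so that an empty quoted
        // string ("") still produces an (empty) token.
        tokens.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return tokens;
}
```

For the string asdf "one two three" werq this gives asdf, one two three, and werq, in that order.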

HTH.

Roy Sharon
How would I pick out the captures like you described if I'm using an iterator? The iterator is used like this: for (std::tr1::sregex_token_iterator i(str.begin(), str.end(), reg); i != end; ++i) { cout << *i; } As far as I can see, you don't get a choice of whether you get the entire match, the first capture, or the second capture.
Thomas T.
What about the following: `for (std::tr1::sregex_iterator i(str.begin(), str.end(), reg); i != end; ++i) { cout << ((*i)[1].matched ? (*i)[1] : (*i)[2]); }`? I cannot check if this compiles, let alone runs, but the idea is to use sregex_iterator rather than sregex_token_iterator, so that `*i` is a match object with an indexing operator, which should give you the captures; you then print whichever capture actually matched.
Roy Sharon
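If you would rather keep sregex_token_iterator, it can be told which captures to return by passing a list of capture indices. A rough sketch, assuming C++11 std::regex (the TR1 version should offer the same constructor, but I have not checked it); tokens_via_token_iterator is a made-up name:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: asks the token iterator for captures 1 and 2 of
// every match and keeps only the one that took part in the match.
std::vector<std::string> tokens_via_token_iterator(const std::string& text) {
    static const std::regex reg("\"([^\"]*)\"|(\\S+)");
    const std::vector<int> groups = {1, 2};
    std::vector<std::string> tokens;
    for (std::sregex_token_iterator it(text.begin(), text.end(), reg, groups),
         end; it != end; ++it) {
        // For each match the iterator yields capture 1, then capture 2;
        // exactly one of them has matched == true.
        if (it->matched) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

With the string from the question this yields asdf, werq, one two three, and asdf, without the surrounding quotes.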