I just started using Boost::regex today and am quite a novice in regular expressions too. I have been using "The Regulator" and Expresso to test my regex and am satisfied with what I see there, but the same regex does not do what I want when I transfer it to Boost. Any pointers to help me find a solution would be most welcome. As a side question, are there any tools that would help me test my regex against boost.regex?

using namespace boost;
using namespace std;

vector<string> tokenizer::to_vector_int(const string s)
{
    regex re("\\d*");
    vector<string> vs;
    cmatch matches;
    if( regex_match(s.c_str(), matches, re) ) {
     MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
     for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
      string match(matches[i].first, matches[i].second);
      vs.push_back(match);
     }
    }
    return vs;
}

void _uttokenizer::test_to_vector_int() 
{
    vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
    for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
    CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}

Update (Thanks to Dav for helping me clarify my question): I was hoping to get a vector with 2 strings in them => "0" and "1". I instead never get a successful regex_match() (regex_match() always returns false) so the vector is always empty.

Thanks '1800 INFORMATION' for your suggestions. The to_vector_int() method now looks like this, but it goes into a never-ending loop (I took the code you gave and modified it to make it compilable) and finds "0", "", "", "" and so on. It never finds the "1".

vector<string> tokenizer::to_vector_int(const string s)
{
    regex re("(\\d*)");
    vector<string> vs;

    cmatch matches;

    char * loc = const_cast<char *>(s.c_str());
    while( regex_search(loc, matches, re) ) {
     vs.push_back(string(matches[0].first, matches[0].second));
     loc = const_cast<char *>(matches.suffix().str().c_str());
    }

    return vs;
}

In all honesty, I still don't think I have understood the basics of searching for a pattern and getting the matches. Are there any tutorials with examples that explain this?

+5  A: 

The basic problem is that you are using regex_match when you should be using regex_search:

The algorithms regex_search and regex_match make use of match_results to report what matched; the difference between these algorithms is that regex_match will only find matches that consume all of the input text, where as regex_search will search for a match anywhere within the text being matched.

From the boost documentation. Change it to use regex_search and it will work.
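To make the difference concrete, here is a minimal sketch. It assumes std::regex (C++11) rather than Boost so that it compiles without extra libraries; Boost.Regex provides algorithms with the same names in the boost namespace. The two function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Sketch using std::regex; the function names are hypothetical.

// regex_match: succeeds only if the pattern consumes the ENTIRE input.
bool whole_input_is_digits(const std::string& s)
{
    static const std::regex re("\\d+");
    return std::regex_match(s, re);
}

// regex_search: succeeds if the pattern matches ANYWHERE in the input.
bool input_contains_digits(const std::string& s)
{
    static const std::regex re("\\d+");
    return std::regex_search(s, re);
}
```

With the input from the question, "0<br/>1", the first function returns false because of the "<br/>" in the middle, while the second returns true.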

Also, it looks like you are not capturing the matches. Try changing the regex to this:

regex re("(\\d*)");

Or, maybe you need to be calling regex_search repeatedly:

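const char *where = s.c_str();            // c_str() returns a const char*
while (regex_search(where, matches, re))  // search from the current position, not from the start
{
  // matches[0] is the text just found; store or process it here
  where = matches.suffix().first;         // continue right after this match
}
// Note: with \d* an empty match never advances `where`; use \d+ instead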

This is because you only have one capture in your regex, so each call reports a single number.
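The repeated search can also be written with a regex iterator, which does the advancing for you. This is only a sketch: it assumes std::regex (Boost.Regex has a boost::sregex_iterator as well), the function name collect_numbers is hypothetical, and it uses \d+ so that empty matches are never produced.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of digits found in `s`, in order.
std::vector<std::string> collect_numbers(const std::string& s)
{
    static const std::regex re("\\d+");
    std::vector<std::string> out;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
        out.push_back(it->str());  // it->str() is the whole match (group 0)
    }
    return out;
}
```

Called with "0<br/>1", this should give a vector holding "0" and "1", which is what the question was after.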

Alternatively, change your regex, if you know the basic structure of the data:

regex re("(\\d+).*?(\\d+)");

This would match two numbers within the search string.
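As a sketch of how the two captures would be read back (again assuming std::regex in place of Boost, with a made-up function name), sub-match 1 and sub-match 2 hold the two numbers, while sub-match 0 is the whole matched text:

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: on success, fills `out` with the first two numbers in `s`.
bool first_two_numbers(const std::string& s,
                       std::pair<std::string, std::string>& out)
{
    static const std::regex re("(\\d+).*?(\\d+)");
    std::smatch m;
    if (!std::regex_search(s, m, re)) {
        return false;          // fewer than two numbers were found
    }
    out.first = m[1].str();    // first capture group
    out.second = m[2].str();   // second capture group
    return true;
}
```

This only fits input known to contain exactly this shape; for an arbitrary count of numbers, a loop or iterator is the better tool.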

Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.
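The empty match is easy to observe directly. The sketch below assumes std::regex and a hypothetical helper, first_match, which reports where the first match starts and how long it is:

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {position, length} of the first match of
// `pattern` in `s`, or {-1, -1} when there is no match.
std::pair<long, long> first_match(const std::string& pattern,
                                  const std::string& s)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(s, m, re)) {
        return std::make_pair(-1L, -1L);
    }
    return std::make_pair(static_cast<long>(m.position(0)),
                          static_cast<long>(m.length(0)));
}
```

On "<br/>1", the pattern \d* matches at position 0 with length 0 (an empty match in front of the "<"), whereas \d+ skips ahead and matches the "1" at position 5 with length 1. A search loop that restarts at the end of an empty match therefore never moves forward.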

1800 INFORMATION
Awesome. Thanks 1800 INFORMATION. I didn't realize how much of a noob I was in boost.regex. (In my defense, both "The Regulator" and Expresso give me positive results in response to "Match", so I homed in on a similarly named method in boost.regex.) I guess I didn't fathom the significance of the difference between regex_match and regex_search till you pointed it out. Thanks again. I wonder if there is any way to reduce my "reputation score" even further to display my noobness :).
ossandcad
I tested your suggestion 1800 INFORMATION to replace regex_match with regex_search and now I get two strings: "0" and " ". I don't seem to get the 2nd one as "1". Any suggestions on what I could still be missing?
ossandcad
It looks like you aren't capturing the strings you match; try putting () around the expression.
1800 INFORMATION
Thanks 1800 INFORMATION. Unfortunately I cannot use the suggestion for `regex re("(\\d+).*?(\\d+)");` because the basic structure is a bunch of integers (zero or more of them) separated by various punctuation, <br/>, "\r\n" and so on. I have updated my question with the latest code version and have described a problem I faced with your suggestion. Any help would be doubly appreciated.
ossandcad
I suspect your current regular expression is wrong - \d* will match zero or more digits - the empty string "" is included in the subset of zero digits which is why it is stopping there - you should change it to \d+
1800 INFORMATION
I'm trying to use regex_match because I have constraints on my input that let me write simple REs that should consume all the input. However, it seems to always fail, no matter what the inputs are. For example, an RE of "a" fails against a string of "a". What's up with that?
Ben Collins
I can't get it working with regex_search either. Here's the code I'm testing with: http://gist.github.com/186124
Ben Collins
Comments aren't really much good for this kind of thing; you should really ask a new question.
1800 INFORMATION