I am trying to find out how many regex matches are in a string. I'm using an iterator to iterate over the matches, and an integer to record how many there were.

long int before = GetTickCount();
string text;

boost::regex re("^(\\d{5})\\s(\\d{8})\\s(.*)\\s(.*)\\s(.*)\\s(\\d{8})\\s(.{1})$");
char * buffer;
long length;
long count;
ifstream f;


f.open("c:\\temp\\test.txt", ios::in | ios::ate);
length = f.tellg();
f.seekg(0, ios::beg);

buffer = new char[length];

f.read(buffer, length);
f.close();

text = buffer;
boost::sregex_token_iterator itr(text.begin(), text.end(), re, 0);
boost::sregex_token_iterator end;

count = 0;
for(; itr != end; ++itr)
{
    count++;
}

long int after = GetTickCount();
cout << "Found " << count << " matches in " << (after-before) << " ms." << endl;

In my example, count always ends up as 1, even if I put code in the for loop to show the matches (and there are plenty). Why is that? What am I doing wrong?

Edit

TEST INPUT:

12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N

OUTPUT (without matches):

Found 1 matches in 16 ms.

If I change the for loop to this:

count = 0;
for(; itr != end; ++itr)
{
    string match(itr->first, itr->second);
    cout << match << endl;
    count++;
}

I get this as output:

12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
12345 12345678 SOME NAME SOMETHING 88888888 N
Found 1 matches in 47 ms.
A: 

Can you paste the input and also the output?

If count returns 1, that means there is only one match in your string text.

dirkgently
Yeah, but like I've said, if I put code in the loop to print the matches, more than one is printed AND count is still 1. I've also changed the last parameter in the sregex_token_iterator to -1 and count changes to 2.
scottm
@dirkgently, updated with test results
scottm
+11  A: 

Heh. Your problem is your regex. Change each (.*) to (.*?) (assuming that's supported). You think you're seeing each line being matched, but in fact you're seeing the entire text being matched because your pattern is greedy.

To see the issue illustrated, change the debug output in your loop to:

cout << "[" << match << "]" << endl;
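The effect of greediness on the match count can be sketched in isolation. This is only an illustration under stated assumptions: it uses std::regex instead of Boost.Regex, and a made-up single-line input with ';' as the record separator so that '.' can run across records. The names count_matches, sample, greedy_pattern and lazy_pattern are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: counts matches with the same kind of loop as in the question.
long count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::sregex_iterator itr(text.begin(), text.end(), re);
    std::sregex_iterator end;  // default-constructed iterator marks the end
    long count = 0;
    for (; itr != end; ++itr)
        ++count;
    return count;
}

// Made-up sample: three records on one line, separated by ';'.
const std::string sample =
    "12345 SOME NAME N;12345 SOME NAME N;12345 SOME NAME N";

// Greedy: the first (.*) grabs as much as it can and gives back only what is
// needed, so the whole string is consumed as a single match.
const std::string greedy_pattern = R"(\d{5} (.*) (.*) N)";

// Lazy: each (.*?) stops as early as possible, so every record matches on its own.
const std::string lazy_pattern = R"(\d{5} (.*?) (.*?) N)";
```

With this sample, count_matches(sample, greedy_pattern) gives 1 and count_matches(sample, lazy_pattern) gives 3.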
chaos
ya.. just figured that out too. :P good catch.
sfossen
tricky, tricky. That's right.
scottm
A: 

Don't know much about boost, but does (end - itr) work?
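As far as I know, regex iterators are forward iterators, so subtracting them is not available, but std::distance does the same job by stepping through the range. A minimal sketch, assuming std::sregex_iterator rather than the Boost iterator from the question (the same call should work there too, but that is not verified here); count_with_distance is a hypothetical name.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts matches by measuring the iterator range
// instead of incrementing a counter in a hand-written loop.
std::ptrdiff_t count_with_distance(const std::string& text, const std::regex& re)
{
    std::sregex_iterator first(text.begin(), text.end(), re);
    std::sregex_iterator last;  // default-constructed: end of sequence
    // std::distance walks a forward-iterator range one match at a time.
    return std::distance(first, last);
}
```

Note that this only replaces the counting loop. It reports the same number the loop would, so a pattern that swallows the whole text still yields 1.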

Sean
A: 

Since you're saying that even when you output the results, the count is still one, you might look at a couple things to help diagnose it:

  • Try outputting count each loop iteration and see what happens. If this only outputs once, then the loop is only running once, and what you thought were multiple matches were really one big long match.
  • If that works, try using another variable name entirely: it's possible that you are getting some scope shadowing where you have declared more than one count variable.

If that loop is executing multiple times, then the problem is not in how you are using boost. No matter what you are doing, boost does not have the ability to modify a variable that you don't pass to it. (Of course if you are passing count in to boost somewhere, then that's another possibility.)

In all likelihood, the first (.*) you have is matching everything up until nearly the end of the input (newlines included). Try replacing those with ([^ ]*) (anything but a space, so the matching stops when it finds a space).
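A rough sketch of the ([^ ]*) idea, under a few assumptions of mine: it uses std::regex instead of Boost.Regex, and it matches one line at a time with std::getline, which sidesteps any question of how '.', '^' and '$' treat newlines. The helper name count_matching_lines is made up.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: counts the lines of `text` that match the record layout.
long count_matching_lines(const std::string& text)
{
    // [^ ]* cannot cross a space, so each field ends at the next separator.
    static const std::regex re(
        R"((\d{5})\s(\d{8})\s([^ ]*)\s([^ ]*)\s([^ ]*)\s(\d{8})\s(.))");
    std::istringstream in(text);
    std::string line;
    long count = 0;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // tolerate Windows line endings
        if (std::regex_match(line, re))  // the whole line must match
            ++count;
    }
    return count;
}
```

Because std::regex_match has to cover the entire line, the ^ and $ anchors from the original pattern are not needed here.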

Eclipse
You are correct, but why would the '.' be matching newlines? The docs say "anything but newline"
scottm
They might not be the newlines you think they are? Perhaps they are only `\r` or `\n` and boost is only handling `\r\n`? I don't know enough about boost::regex to say. Although I seem to recall that you had to specify not matching newlines in a constructor somewhere.
Eclipse