Hi guys, I have this C++ program (actually it's just a snippet):

#include <iostream>
#include <pcre.h>
#include <string>

using namespace std;

int main(){    
    string pattern = "<a\\s+href\\s*=\\s*\"([^\"]+)\"",
           html = "<html>\n"
                  "<body>\n"
                  "<a href=\"example_link_1\"/>\n"
                  "<a href=\"example_link_2\"/>\n"
                  "<a href=\"example_link_3\"/>\n"
                  "</body>\n"
                  "</html>";
    int            i, ccount, rc,
                *offsets,
                 eoffset;
    const char  *error;
    pcre         *compiled;

    compiled = pcre_compile( pattern.c_str(), PCRE_CASELESS | PCRE_MULTILINE, &error, &eoffset, 0 );
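    if( !compiled ){
        // report what PCRE complained about and signal failure to the caller
        cerr << "Error compiling the regexp at offset " << eoffset << ": " << error << endl;
        return 1;
    }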

    rc = pcre_fullinfo( compiled, 0, PCRE_INFO_CAPTURECOUNT, &ccount );

    offsets = new int[ 3 * (ccount + 1) ];

    rc = pcre_exec( compiled, 0, html.c_str(), html.length(), 0, 0, offsets, 3 * (ccount + 1) );

    if( rc >= 0 ){
        for( i = 1; i < rc; ++i ){
            cout << "Match : " << html.substr( offsets[2*i], offsets[2*i+1] - offsets[2*i] ) << endl;
        }
    }
    else{
        cout << "Sorry, no matches!" << endl;
    }

    delete [] offsets;

    return 0;
}

As you can see, I'm trying to match HTML links inside a buffer with the given regular expression (the \\s is \s escaped for C/C++ strings). But even though the buffer contains 3 links and the regexp is compiled with the PCRE_CASELESS and PCRE_MULTILINE flags, I match only one element:

Match : example_link_1

Note: I start the loop from index 1 because the pcre library returns the whole matched string (not the captured group itself) as the first element, and the captured groups follow.

What's wrong with this code? I think the regexp itself is correct (I tried it in PHP, for instance).
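To make the note about index 1 concrete, here is a minimal sketch of how the offsets array is laid out after a successful pcre_exec call: pair i holds the start and end of group i, with group 0 being the whole match. The helper name group_text and the hard-coded offsets in the usage note are assumptions for illustration only; they are not part of the pcre API.

```cpp
#include <string>

// Hypothetical helper: returns the text of group i given a PCRE-style
// offsets array, where offsets[2*i] is the start of group i and
// offsets[2*i + 1] is one past its end. Group 0 is the whole match.
std::string group_text(const std::string& subject, const int* offsets, int i) {
    const int start = offsets[2 * i];
    const int end = offsets[2 * i + 1];
    if (start < 0 || end < start) {
        return std::string();  // a group that did not take part in the match is reported as -1
    }
    return subject.substr(start, end - start);
}
```

With the subject `<a href="x">` and offsets {0, 11, 9, 10}, group 0 is `<a href="x"` and group 1 is `x`, which is why the loop in the snippet starts at 1.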

+1  A: 

Well, it's not supposed to return all matches. Just think about it: you ask for the capture count, which here is 1 (PCRE_INFO_CAPTURECOUNT counts only the capturing subpatterns; the whole match is not included, which is why you allocate for ccount + 1). How would you expect pcre_fullinfo to know how many matches there are in a string you've never passed to it? And you don't expect pcre_exec to return three matches in that array, do you? What if you had three thousand?

It's been a while since I dealt with the PCRE API, but I think you need to loop and match against the rest of the string again, for example by passing the end of the previous match as the start offset to pcre_exec.
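A minimal sketch of that loop follows. So that it compiles with the standard library alone, it uses std::regex instead of pcre, which is a substitution and not the asker's code; the function name find_links is hypothetical. With pcre the shape should be the same: call pcre_exec repeatedly, each time passing offsets[1] from the previous call (the end of the whole match) as the start offset, until it returns PCRE_ERROR_NOMATCH.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group 1 (the href value) of every
// match by searching again from the end of the previous match.
std::vector<std::string> find_links(const std::string& html) {
    // Same pattern as in the question, compiled case-insensitively.
    static const std::regex link_re(R"re(<a\s+href\s*=\s*"([^"]+)")re",
                                    std::regex::icase);
    std::vector<std::string> links;
    std::smatch m;
    std::string::const_iterator start = html.cbegin();
    while (std::regex_search(start, html.cend(), m, link_re)) {
        links.push_back(m[1].str());
        // Resume right after the whole match. The pattern can never match
        // an empty string, so the search position always advances.
        start = m[0].second;
    }
    return links;
}
```

Calling find_links on the html buffer from the question should yield all three link values rather than only the first one.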

Michael Krelin - hacker
I searched for this "loop" and found the solution, thanks! :)
Simone Margaritelli
you're welcome ;-)
Michael Krelin - hacker