Hello,

I'm familiar with Regex itself, but whenever I try to find any examples or documentation to use regex with Unix computers, I just get tutorials on how to write regex or how to use the .NET specific libraries available for Windows. I've been searching for a while and I can't find any good tutorials on C++ regex on Unix machines.

What I'm trying to do:

Parse a string using regex by breaking it up and then reading the different subgroups. To make a PHP analogy, something like preg_match that returns all $matches.

A: 

You are looking for regcomp, regexec and regfree.

One thing to be careful about is that POSIX regular expressions actually implement two different languages: basic (the default) and extended (pass the flag REG_EXTENDED in the call to regcomp). If you are coming from the PHP world, the extended language is closer to what you are used to.

R Samuel Klatchko
same comment as @epatel
Stanislav Palatnik
+5  A: 

Look up the documentation for TR1 regexes or (almost equivalently) boost regex. Both work quite nicely on various Unix systems. The TR1 regex classes have been accepted into C++ 0x, so though they're not exactly part of the standard yet, they will be reasonably soon.

Edit: To break a string into subgroups, you can use an sregex_token_iterator. You can specify either what you want matched as tokens, or what you want matched as separators. Here's a quickie demo of both:

#include <iterator>
#include <regex>
#include <string>
#include <iostream>

int main() { 

    std::string line;

    std::cout << "Please enter some words: " << std::flush;
    std::getline(std::cin, line);

    std::tr1::regex r("[ .,:;\\t\\n]+");
    std::tr1::regex w("[A-Za-z]+");

    std::cout << "Matching words:\n";
    std::copy(std::tr1::sregex_token_iterator(line.begin(), line.end(), w),
        std::tr1::sregex_token_iterator(), 
        std::ostream_iterator<std::string>(std::cout, "\n"));

    std::cout << "\nMatching separators:\n";
    std::copy(std::tr1::sregex_token_iterator(line.begin(), line.end(), r, -1), 
        std::tr1::sregex_token_iterator(), 
        std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}

If you give it input like this: "This is some 999 text", the result is like this:

Matching words:
This
is
some
text

Matching separators:
This
is
some
999
text
Jerry Coffin
He can also use Boost Xpressive (http://www.boost.org/doc/libs/1_42_0/doc/html/xpressive.html) which will get him compile-time error checking of his regular expressions. I doubt that will ever become standard though :)
Manuel
This one is the most ideal imo. But I actually ran into it before and the server that I need to deploy to doesn't support this. :/
Stanislav Palatnik
@Manuel: Comment markdown syntax sucks sometimes, doesn't it? Also you're using 1.38?! Use `/release/` in boost URLs for the latest release version.
Roger Pate
@Roger - Thanks, fixed. Strangely, the link to 1.38 was the first result on Google.
Manuel
What I got out of regcomp, regexec is that they return 0 if a match is found. I need to return all the subgroups also.
Stanislav Palatnik
@Manuel: Yes, Xpressive may be able to do the job. Depending on the details of what he wants, Boost Spirit::lex might also. Unless his needs are fairly specialized, however, the normal RE package is probably the first choice.
Jerry Coffin
@Jerry Coffin - agreed, I just thought that Xpressive deserved at least a mention :)
Manuel
@Manuel: I tend to agree -- I probably should have mentioned at least a few more of the (admittedly many) possibilities.
Jerry Coffin
A: 

For perl-compatible regular expressions (pcre/preg), I'd suggest boost.regex.

Nicolás
A: 

My best bet would be boost::regex.

Nikolai N Fetissov
A: 

Try pcre. And pcrepp.

Michael Krelin - hacker
+10  A: 

Consider using Boost.Regex.

An example (from the website):

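#include <boost/regex.hpp>
#include <string>

// True when s is four groups of four digits separated by '-' or ' '.
bool validate_card_format(const std::string& s)
{
   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
   return regex_match(s, e);
}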

Another example:

// match any format with the regular expression:
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

// Strip the separators: "1234-5678-9012-3456" becomes "1234567890123456".
std::string machine_readable_card_number(const std::string& s)
{
   return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
}

// Insert '-' between the four groups.
std::string human_readable_card_number(const std::string& s)
{
   return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
}
0xfe
A: 

Feel free to have a look at this small color grep tool I wrote.

At github

It uses regcomp, regexec and regfree that R Samuel Klatchko refers to.

epatel
Do you have any examples of returning the subgroups and manipulating them?
Stanislav Palatnik
@Stanislav Palatnik Think that is handled on (around) line 95
epatel
A: 

I use "GNU regex": http://www.gnu.org/s/libc/manual/html_node/Regular-Expressions.html

It works well, but I can't find a clear solution for UTF-8 regexps.

Regards

opal