tags:

views:

535

answers:

7

I'm curious, does anybody know a good way to do regular expression matching in C? The only way I can think of is through Flex. Is this the only way or is there a better way?

Thanks!

+12  A: 

Uh ... The best way is to

#include <regex.h>

That is the POSIX standard API for regular expressions.

For non-POSIX systems, rolling your own is one alternative, a basic regex engine is not too hard to accomplish. I'm sure there are off-the-shelf solutions too, I haven't neded one.

Come to think of it, I think glib has one.

unwind
A: 

Perhaps this article will help? It shows you how to use the regex functions defined in regex.h.

Yes, isn't the posix regex just wonderful they have it all, so I wondered why are they not in C? Well, I found the answer and I got happy, they are there, and heres how to use them

Andrew Hare
+10  A: 

Depending on what dialect you're looking for and what platform you're on:

  • POSIX regular expressions are likely to be in the standard C library - see <regex.h> and regcomp, regerror, regexec, regfree.
  • libPCRE provides most of perl's extended regex features
  • glib has GRegex
  • Henry Spencer's Tcl Regex library.

Henry Spencer has also done another regex library, used by the current versions of TCL and PostgreSQL. This one is interesting because it is a hybrid NFA/DFA implementation.

Regexes that accept the same string multiple ways (like a*a?) inherently require an NFA. The usual implementation models the NFA as a DFA using backtracking, which can be as much as O(2^length(input)) for particularly degenerate pattern. But, the simple recursive implementation can be extended with capturing, backreferences, callbacks to code, and all the usual "extra" features that many languages offer besides testing for matches.

The "NFA" approach tracks multiple current states and updates all of them with each incoming character (See Ross Cox's writeup of Thompson NFA regexes for more explanation). This approach is O(input.length*pattern.length), which is faster - immensely so in the worst cases - but cannot perform backrefs or captures since it doesn't keep track of how it got to a state.

Spencer's approach is a hybrid, compiling some portions of a pattern to the NFA approach, and using backtracking only where necessary for captures that were actually requested. This is often a substantial win (see TCL's position in the regex-dna shootout).

puetzk
SUrely NFA-style regex interpretation uses backtracking, whereas DFA-style do not?
Vatine
Regexes which can accept the same string multiple ways inherently need an NFA. The usual implementation as a DFA+backtracking is O(2^n). The "NFA" approach keeps multiple current states, and updates all of them with each incoming character (O(n*m). See http://swtch.com/~rsc/regexp/regexp1.html
puetzk
+2  A: 

Hi,

Look this: http://www.pcre.org/

It is a VERY nice lib!

A: 

Boost REGEX

KitsuneYMG
Isn't Boost C++ only?
RossFabricant
+1  A: 

A completely different approach is to try a Parsing Expression Grammar (PEG). A PEG comes at the pattern matching problem from the point of view of a parser, and can even take advantage of multiple rules that form a complete grammar. That makes it possible to write expressions that match balanced parenthesis, which are otherwise quite difficult to express in most regexp dialects.

Although PEGs are relatively new, there should be a few implementations out there that are usable from C.

The PEG implementation I've personally used is LPeg. It is neatly bound to Lua, and coincidentally was written by one of Lua's principle authors, Roberto Ierusalimschy. It provides a complete PEG implementation, and also includes an adapter that translates a regexp into the equivalent PEG for execution.

Linking the Lua core to a C program just to get access to LPeg might sound like overkill, but it really wouldn't be that difficult to do, even if you had no plan to use Lua for other purposes.

RBerteig