I have a command line application that needs to support arguments of the following form:

  1. all: return everything
  2. search: return the first match to search
  3. all*search: return everything matching search
  4. X*search: return the first X matches to search
  5. search#Y: return the Yth match to search

Where search can be either a single keyword or a space-separated list of keywords delimited by single quotes. Keywords are sequences of one or more letters and digits - nothing else.

A few examples might be:

  1. 2*foo
  2. bar#8
  3. all*'foo bar'

This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.

What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?

It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.

Thanks!

+2  A: 

I wouldn't recommend a full lex/yacc parser just for this. What you've described fits a simple regular expression:

    ((all|[0-9]+)\*)?('[A-Za-z0-9 ]+'|[A-Za-z0-9]+)(#[0-9]+)?

If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need - most likely captures 2, 3, and 4 (the count, the search term, and the #Y suffix).

If I understood what you mean, you will also want to check that captures 1 and 4 are not both non-empty, since the grammar never combines an X* prefix with a #Y suffix.

If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
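
For illustration, here is a minimal sketch of the capture-extraction step using C++11 <regex> (any capture-aware engine such as Boost.Regex or PCRE would work the same way); the example input and variable names are mine, not anything from the question:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // The pattern from above; capture 2 = count, 3 = search term, 4 = "#Y".
        const std::regex re(
            R"(((all|[0-9]+)\*)?('[A-Za-z0-9 ]+'|[A-Za-z0-9]+)(#[0-9]+)?)");

        const std::string arg = "2*'foo bar'";
        std::smatch m;
        if (std::regex_match(arg, m, re)) {
            // The grammar never combines an X* prefix with a #Y suffix.
            if (m[1].matched && m[4].matched) {
                std::cerr << "cannot combine X* and #Y\n";
                return 1;
            }
            // Note: a bare "all" matches as a search term here and would
            // need special-casing by the caller.
            std::cout << "count:  " << (m[2].matched ? m[2].str() : "1") << '\n'
                      << "search: " << m[3].str() << '\n'
                      << "index:  " << (m[4].matched ? m[4].str().substr(1) : "-")
                      << '\n';
        }
        return 0;
    }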

Even without regex, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I suspect you could put together something even more efficient than a regular expression.
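
As a rough sketch of what such a hand-written function might look like (the Query struct and parseArg name are made up for illustration; keyword-character validation is omitted for brevity):

    #include <cstdlib>
    #include <string>

    struct Query {
        std::string search;  // keyword or quoted keyword list (quotes stripped)
        long count;          // -1 means "all"
        long index;          // 0 means "no #Y suffix"
    };

    bool parseArg(const std::string& arg, Query& q) {
        std::string s = arg;
        q.count = 1;
        q.index = 0;

        // Optional "all*" or "X*" prefix.
        std::string::size_type star = s.find('*');
        if (star != std::string::npos) {
            std::string head = s.substr(0, star);
            q.count = (head == "all") ? -1 : std::strtol(head.c_str(), NULL, 10);
            if (q.count == 0) return false;  // neither "all" nor a valid number
            s.erase(0, star + 1);
        } else if (s == "all") {
            q.count = -1;  // bare "all": return everything
            s.clear();
        }

        // Optional "#Y" suffix; the grammar never combines it with "X*".
        std::string::size_type hash = s.rfind('#');
        if (hash != std::string::npos) {
            if (star != std::string::npos) return false;
            q.index = std::strtol(s.c_str() + hash + 1, NULL, 10);
            if (q.index <= 0) return false;
            s.erase(hash);
        }

        // Whatever remains is the search term; strip surrounding single quotes.
        if (s.size() >= 2 && s[0] == '\'' && s[s.size() - 1] == '\'')
            s = s.substr(1, s.size() - 2);
        q.search = s;
        return true;
    }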

Remo.D
This pretty well confirms what I was thinking (using regex). Agreed that I could write raw string ops to fit the bill more efficiently than pulling PCRE into the fray - but upon reflection, the net gain there is likely not worth my time debugging. Thanks for the insight!
Chris
A: 

The answer mostly depends on the balance between how much code you want to write and how many libraries you are willing to depend on. If your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex, which comes with all Linux/Unix flavors.
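
A quick sketch of the same pattern through POSIX <regex.h>, assuming the regex suggested in the other answer (the example input and output formatting are mine):

    #include <regex.h>
    #include <cstdio>

    int main() {
        // Anchored ERE version of the pattern; regexec() searches rather
        // than matches, so the ^...$ anchors are required.
        const char* pattern =
            "^((all|[0-9]+)\\*)?('[A-Za-z0-9 ]+'|[A-Za-z0-9]+)(#[0-9]+)?$";

        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED) != 0)
            return 1;

        regmatch_t m[5];  // whole match + 4 captures
        const char* arg = "bar#8";
        if (regexec(&re, arg, 5, m, 0) == 0 && m[4].rm_so != -1) {
            // Capture 4 is "#Y"; skip the '#' when printing.
            std::printf("index: %.*s\n",
                        (int)(m[4].rm_eo - m[4].rm_so - 1), arg + m[4].rm_so + 1);
        }
        regfree(&re);
        return 0;
    }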

OR

If you just need those specific forms, I would use the string tokenizer (strtok): split on '*' and on '#', then handle each case, as in the sketch below.
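
A rough sketch of that approach (my own example input; note that strtok modifies its argument in place, so it needs a writable buffer):

    #include <cstdio>
    #include <cstring>

    int main() {
        char arg[] = "2*foo";  // strtok needs a modifiable buffer

        char* head = std::strtok(arg, "*");
        char* rest = std::strtok(NULL, "*");

        if (rest != NULL) {
            // "all*search" or "X*search"
            std::printf("count: %s, search: %s\n", head, rest);
        } else {
            // No '*': either "all", "search" or "search#Y"
            char* term = std::strtok(head, "#");
            char* idx = std::strtok(NULL, "#");
            std::printf("search: %s, index: %s\n", term, idx ? idx : "-");
        }
        return 0;
    }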

Ofir
While I decided not to use strtok, I did manage to find a fairly straightforward solution without a regex library.
Chris
A: 

In this case the strtok approach would be better, since the number of command forms to be parsed is small.

anijhaw