views:

282

answers:

4

I want to parse a search string similar to that provided by Gmail using Perl. An example input would be "tag:thing by:{user1 user2} {-tag:a by:user3}". I want to put it into a tree structure, such as

{and => [
    "tag:thing",
    {or => [
       "by:user1",
       "by:user2",
    ]},
    {or => [
       {not => "tag:a"},
       "by:user3",
    ]},
}

The general rules are:

  1. Tokens separated by space default to the AND operator.
  2. Tokens in braces are alternative options (OR). The braces can go before or after the field specifier. i.e. "by:{user1 user2}" and "{by:user1 by:user2}" are equivalent.
  3. Tokens prefixed with a hyphen are excluded.

These elements can also be combined and nested: e.g. "{by:user5 -{tag:k by:user3}} etc".

I'm thinking of writing a context-free grammar to represent these rules, and then parsing it into the tree. Is this unnecessary? (Is this possible using simple regexps?)

What modules are recommended for doing parsing context-free grammars?

(Eventually this will be used to generate an database query with DBIx::Class.)

+1  A: 

Regex doesn't do nested things (like parenthesis) very well. By the time you get your regex counting parenthesis and capturing correctly, you could probably have a decent CFG parser. CFGs can logically guarantee correct parsing, while with a regex solution you're leaving a lot up to the magic. I can't recommend any Perl CFG libaries, but coding one sounds very cathartic.

Rob Elliott
Thanks, this is a convincing argument for using CFGs. I really also need a recommendation on which module to use though.
A. Murka
Never mind, I've found that Parse::RecDescent seems to be often recommended. It works well.
A. Murka
Perl 5.10 does nested things quite well. That doesn't mean it's the right solution though :)
brian d foy
A: 

If your query isn't tree structured, then regexes will do the job for you.

For example:

my $search = "tag:thing by:{user1 user2} {-tag:a by:user3}"
my @tokens = split /(?![^{]*})\s+/, $search;
foreach (@tokens) {
    my $or = s/[{}]//g; # OR mode
    my ($default_field_specifier) = /(\w+):/;
}


Even if your query is tree structured, regexes can make recursive parsing much more pleasant:

$_ = "by:{user1 z:{user2 3} } x {-tag:a by:user3} zz";
pos($_) = 0;
scan_query("");

sub scan_query {
    my $default_specifier = shift;
    while (/\G\s*((?:[-\w:]+)|(?={))({)?/gc) {
        scan_query($1), next if $2;
        my $query_token = $default_specifier . $1;
    }
    /\G\s*\}/gc;
}

Regexes are awesome :)!

Inshallah
The braces and operators can also be nested. Edited question to reflect this.
A. Murka
A: 

YAPP might do what you want. You can use it to generate and then use a LALR(1) Parsing Automaton.

Nick
A: 

Parse::Recdescent can generate parsers for this sort of thing. You probably need some experience with parsers to use it effectively though.

Walter H