tags:

views:

82

answers:

1

I’m trying to build a grammar that interprets user-entered text, search-engine style. It will support the AND, OR, NOT and ANDNOT Boolean operators. I have pretty much everything working, but I want to add a rule that two adjacent keywords outside of a quoted string implicitly are treated as in an AND clause. For example:

cheese and crackers = cheese AND crackers

(up and down) or (left and right) = (up AND down) OR (left AND right)

cat dog “potbelly pig” = cat AND dog AND “potbelly pig”

I’m having trouble with the last one, and I’m hoping somebody can point me in the right direction. Here’s my *.g file thus far, and please be nice, my ANTLR experience spans less than a work day:

grammar SearchEngine;

options { language = CSharp2; output = AST; }

@lexer::namespace { Demo.SearchEngine }
@parser::namespace { Demo.SearchEngine }

LPARENTHESIS : '(';
RPARENTHESIS : ')';

AND    : ('A'|'a')('N'|'n')('D'|'d');
OR     : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT    : ('N'|'n')('O'|'o')('T'|'t');

fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9');
fragment QUOTE     : ('"');
fragment SPACE     : (' '|'\n'|'\r'|'\t'|'\u000C');

WS     : (SPACE) { $channel=HIDDEN; };
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD   : (CHARACTER)+;

startExpression  : andExpression;
andExpression    : andnotExpression (AND^ andnotExpression)*;
andnotExpression : orExpression (ANDNOT^ orExpression)*;
orExpression     : notExpression (OR^ notExpression)*;
notExpression    : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
+1  A: 

Since your AND-rule has the optional AND-keyword, you should create an imaginary AND-token and use a rewrite-rule to "inject" that token in your tree. In this case, you can't make use of ANTLR's short-hand ^ root-operator. You'll have to use the -> rewrite operator.

Your andExpression should look like:

andExpression
  :  (andnotExpression        -> andnotExpression)
     (AND? a=andnotExpression -> ^(AndNode $andExpression $a))* 
  ;

A detailed description of this (perhaps cryptic) notation is given in Chapter 7, section Rewrite Rules in Subrules, page 173-174 of The Definitive ANTLR Reference by Terence Parr.

I ran a quick test to see if the grammar produces the proper AST with the new andExpression rule. After parsing the string cat dog "potbelly and pig" and FOO, the generated parser produced the following AST:

alt text

Note that the AndNode and Root are imaginary tokens.

If you want to know how to create the AST picture above, see this thread: http://stackoverflow.com/questions/2856612/visualizing-an-ast-created-with-antlr-in-a-net-environment

EDIT

When parsing both one two three and (one two) three, the following AST is created:

alt text

And when parsing (one two) OR three, the following AST is created:

alt text

which seems to be the proper way in all cases.

Bart Kiers
Thanks! That did the trick. My follow-up concern (now removed, and what your edit was directed at) was due to a bug in some enclosing C# code.
@highbeammeup, glad to hear that. And you're welcome!
Bart Kiers