Hello, I need to parse a search query with a "Google-like" syntax (but simpler, since I don't need parentheses, operator nesting and such). An example string might be:

TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b

So, simply put, I need to recognize tokens that look like a TAG (e.g. "color", "name", "age") followed by : and then by a single "word" or a list of comma-separated words. I tried some regexes, but if a user makes a mistake with the syntax (like typing an extra comma, or forgetting a value after a tag, as in color: shape:) the parsing fails. I don't really know if this is my fault (I'm far from being an expert with regex) or if going with a parser generator like ANTLR would be a better choice. Anyway, I'm open to any kind of suggestion (I'm coding in Java; I know the language has nothing to do with it, but maybe there are some tools that may help).

Thanks for your suggestions...

A: 

You might want to check out the Lucene QueryParser; you may be able to use it for your needs. It uses a JavaCC-generated parser.

JavaCC

Lucene QueryParser

slappybag
+1  A: 

Given a string like "TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b"

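import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the tag; group 2 is a single word or a comma-separated list of words.
Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:,?\\w+)*)" );

Matcher m = tokens.matcher( myString );
while( m.find() ) {
    System.out.println( "tag:" + m.group(1) + "  value:" + m.group(2) );
}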

That catches all of your cases and makes sure there is a certain well-formedness. Let me know if there is something I'm missing from your question.

Edit 1: To cover your other case you could do something like:

Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)(?=\\s+[a-zA-Z0-9]+:)|([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)" );

And then check for groups 3 and 4 also.
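
As a hedged illustration of what "check for groups 3 and 4 also" could look like: the sketch below uses the Edit 1 pattern unchanged and reads the tag and value list from groups 1 and 2 when the first alternative matched, or from groups 3 and 4 otherwise. The class name GroupCheckDemo, the method parse, and the choice to split the value list on runs of spaces and commas are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCheckDemo {
    // Same pattern as in Edit 1. The first alternative only matches when
    // another tag follows; the second alternative has no such requirement.
    private static final Pattern TOKENS = Pattern.compile(
        "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)(?=\\s+[a-zA-Z0-9]+:)"
        + "|([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)");

    // Hypothetical helper: returns tag -> values, in input order.
    public static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = TOKENS.matcher(query);
        while (m.find()) {
            // Groups 1/2 are null when the second alternative matched.
            boolean firstAlternative = m.group(1) != null;
            String tag = firstAlternative ? m.group(1) : m.group(3);
            String values = firstAlternative ? m.group(2) : m.group(4);
            // Assumption: values are separated by any mix of spaces and commas.
            result.put(tag, new ArrayList<>(Arrays.asList(values.split("[ ,]+"))));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("TAG1: 1, 4 TAG2: 123 TAG3: a,45,44,b"));
    }
}
```

With this shape the caller never has to know which alternative matched; it only sees the tag and its list of values.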

Still, this regex is getting overly ambitious... though I'm not convinced a full-blown parser would make your life that much easier in this case.

An alternative is to break it down one level at a time (which is what a parser would do anyway):

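Pattern main = Pattern.compile( "([a-zA-Z0-9]+):" );
Matcher m = main.matcher(myString);
int lastStart = -1;  // -1 = no tag seen yet (0 is a valid start index for the first tag)
while( m.find() ) {
    if( lastStart >= 0 ) {
        // Everything from the previous tag up to this tag is one token.
        processToken( myString.substring(lastStart, m.start()) );
    }
    lastStart = m.start();
}
if( lastStart >= 0 ) {
    processToken( myString.substring(lastStart) );  // the last token runs to the end
}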

Or something like that. It's similar to forcing an & sort of separator, but it takes advantage of the implicit separation that is already in your token syntax.
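
To make the level-by-level idea concrete, here is a minimal sketch, assuming processToken receives one "TAG: values" chunk and that values are separated by commas and/or whitespace. TagSplitDemo, parse, and the body of processToken are hypothetical; the answer above leaves processToken undefined. In this sketch, empty values (an extra comma, or a tag with nothing after it) are skipped rather than treated as errors, and any text before the first tag is ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagSplitDemo {
    private static final Pattern TAG = Pattern.compile("([a-zA-Z0-9]+):");

    // Level 1: cut the query into chunks, each one starting at a tag.
    public static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = TAG.matcher(query);
        int lastStart = -1; // -1 = no tag seen yet (a tag may start at index 0)
        while (m.find()) {
            if (lastStart >= 0) {
                processToken(query.substring(lastStart, m.start()), result);
            }
            lastStart = m.start();
        }
        if (lastStart >= 0) {
            processToken(query.substring(lastStart), result);
        }
        return result;
    }

    // Level 2 (assumed behaviour): split one "TAG: v1, v2" chunk into tag and values.
    private static void processToken(String token, Map<String, List<String>> result) {
        int colon = token.indexOf(':'); // always present: the chunk starts at a tag match
        String tag = token.substring(0, colon);
        List<String> values = new ArrayList<>();
        for (String value : token.substring(colon + 1).split("[,\\s]+")) {
            if (!value.isEmpty()) { // tolerate stray commas and missing values
                values.add(value);
            }
        }
        result.put(tag, values);
    }

    public static void main(String[] args) {
        System.out.println(parse("color: red,, blue shape: size: 1, 4"));
    }
}
```

A tag with no value ends up mapped to an empty list, so the caller can decide whether to report it or ignore it. One limitation of this approach: a value that itself contains a colon would be mistaken for a new tag.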

PSpeed
A: 

Thanks for your answers. PSpeed, the problem with your regexp is that if a user puts an extra space in the comma-separated list (e.g. "TAG1: 1, 4") the match fails. Sorry, maybe I didn't explain it very well.

Anyway, since I can change the syntax, I decided a separator would make everything easier and came up with the following regex for it.

String testString = "TAG1: a,b,c & TAG2: dddd, dddd &   TAG3: 123";
Pattern pattern = Pattern.compile("(?:\\s+|^)([A-Z]+:)\\s*(,*\\s*\\w+\\s*,*)+\\s*(?:$|&)");

But seeing as it fails on simple mistakes (what happens if the user forgets a &?), I'm starting to doubt whether regex is the right tool for this task...

Sili
Updated my answer.
PSpeed