ansaurus

Question

Parsing search query

Answer 1

A:

You might want to check out the Lucene QueryParser; you may be able to use it as-is for your needs. It is built on a JavaCC-generated parser.

JavaCC

Lucene QueryParser

slappybag 2009-12-04 11:27:54

Answer 2

+1 A:

Given a string like "TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b"

Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:,?\\w+)*)" );

Matcher m = tokens.matcher( myString );
while( m.find() ) {
    System.out.println( "tag:" + m.group(1) + "  value:" + m.group(2) );
}

That catches all of your cases and makes sure there is a certain well-formedness. Let me know if there is something I'm missing from your question.
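In case it helps, here is a minimal, self-contained sketch that collects the matches instead of printing them. The class name `TagQueryParser` and the map-based return type are my own assumptions, not something from the question; only the pattern is the one shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the Map-based result are assumptions.
class TagQueryParser {
    // Same pattern as above: a tag, a colon, then comma-separated word values.
    private static final Pattern TOKENS =
            Pattern.compile("([a-zA-Z0-9]+):\\s*(\\w+(?:,?\\w+)*)");

    // Returns tag -> values, in the order the tags appear in the query.
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = TOKENS.matcher(query);
        while (m.find()) {
            List<String> values = Arrays.asList(m.group(2).split(","));
            // Merge the values if the same tag shows up more than once.
            result.computeIfAbsent(m.group(1), k -> new ArrayList<>()).addAll(values);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {TAG1=[a, b, c], TAG2=[123], TAG3=[a, 45, 44, b]}
        System.out.println(parse("TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b"));
    }
}
```

A `LinkedHashMap` is used only so the tags come back in input order; a plain `HashMap` would work if order does not matter.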

Edit 1: To cover your other case you could do something like:

Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)(?=\\s+[a-zA-Z0-9]+:)|([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)" );

And then check for groups 3 and 4 also.
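Checking both pairs of groups could look roughly like the sketch below. The class name `AlternationGroups` and the `tag=value` string format are assumptions made for illustration; the pattern is the one above, just assembled from a shared constant. It relies on `Matcher.group(n)` returning `null` for a group that did not take part in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer above.
class AlternationGroups {
    private static final String TAG_VALUE =
            "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)";
    // First alternative: a tag followed by another tag (groups 1 and 2).
    // Second alternative: the last tag in the string (groups 3 and 4).
    private static final Pattern TOKENS =
            Pattern.compile(TAG_VALUE + "(?=\\s+[a-zA-Z0-9]+:)|" + TAG_VALUE);

    // Returns one "tag=value" string per match, in input order.
    static List<String> parse(String query) {
        List<String> pairs = new ArrayList<>();
        Matcher m = TOKENS.matcher(query);
        while (m.find()) {
            // Groups of the alternative that did not match are null.
            boolean firstAlternative = m.group(1) != null;
            String tag = firstAlternative ? m.group(1) : m.group(3);
            String value = firstAlternative ? m.group(2) : m.group(4);
            pairs.add(tag + "=" + value);
        }
        return pairs;
    }
}
```

For the input "TAG1: 1, 4 TAG2: x" this yields the pairs "TAG1=1, 4" and "TAG2=x": the first comes from groups 1 and 2, the second from groups 3 and 4.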

Still, this regex is getting overly ambitious... though I'm not convinced a full-up parser would make your life that much easier in this case.

An alternative is to break it down one level at a time (which is what a parser would do anyway):

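Pattern main = Pattern.compile( "([a-zA-Z0-9]+):" );
Matcher m = main.matcher(myString);
// -1 means "no tag seen yet"; 0 cannot be the sentinel because the
// first tag usually starts at index 0 and would otherwise be skipped.
int lastStart = -1;
while( m.find() ) {
    if( lastStart >= 0 ) {
        // Everything from the previous tag up to this tag is one token.
        processToken( myString.substring(lastStart, m.start()) );
    }
    lastStart = m.start();
}
// The last token runs to the end of the string (the whole string if no tag was found).
processToken( myString.substring(Math.max(lastStart, 0)) );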

Or something like that. It's similar to forcing a separator such as &, but it takes advantage of the implicit separation that your tag syntax already provides.
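For completeness, processToken is left undefined above. One possible shape for it, purely as a sketch: the class name `TokenProcessor`, the list return type, and the choice to accept commas and spaces as value separators are all assumptions on my part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical implementation of processToken; the answer leaves it undefined.
class TokenProcessor {
    // Takes one chunk such as "TAG1: 1, 4 " and returns [tag, value1, value2, ...].
    static List<String> processToken(String token) {
        int colon = token.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No tag in token: " + token);
        }
        List<String> parts = new ArrayList<>();
        parts.add(token.substring(0, colon).trim());
        String values = token.substring(colon + 1).trim();
        if (!values.isEmpty()) {
            // Values may be separated by commas, spaces, or both.
            parts.addAll(Arrays.asList(values.split("[ ,]+")));
        }
        return parts;
    }
}
```

Because each chunk holds exactly one tag, plain string operations are enough at this level, and an extra space after a comma no longer matters.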

PSpeed 2009-12-04 14:49:42

Answer 3

A:

Thanks for your answers. PSpeed, the problem with your regex is that if a user puts an extra space in the comma-separated list (e.g. "TAG1: 1, 4"), the match fails. Sorry, maybe I didn't explain that very well.

Anyway, since I can change the syntax, I decided a separator would make everything easier and came up with the following regex for it.

String testString = "TAG1: a,b,c & TAG2: dddd, dddd &   TAG3: 123";
Pattern pattern = Pattern.compile("(?:\\s+|^)([A-Z0-9]+:)\\s*(,*\\s*\\w+\\s*,*)+\\s*(?:$|&)");

But seeing as it fails on simple mistakes (what happens if the user forgets a &?), I'm starting to doubt whether regex is the right tool for this task...

Sili 2009-12-04 15:38:36

Updated my answer.

PSpeed 2009-12-04 17:33:41
