For example, I need to create something like Google's search query parser, to parse expressions such as:

flying hiking or swiming -"walking in boots " author:hamish author:reid

or

house in new york priced over $500000 with a swimming pool

How would I even go about starting to build something like this? Any good resources?

C#-relevant, please (if possible).

  • Edit: this is something that I should somehow be able to translate into a SQL query.
A: 

I think you should just do some string processing. There is no smart way of doing this.

So replace "OR" with your own or operator (e.g. ||). As far as I know, there is no library for this.

I suggest you go with regexes.
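As a rough illustration of the regex route, here is a minimal sketch that tokenizes the first example query. The class name QueryTokenizer, the token labels (TERM, NOT, OR) and the exact pattern are assumptions made for this example, not an established query syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTokenizer
{
    // One token is: an optional +/- sign, an optional "field:" prefix,
    // then either a quoted phrase or a run of non-space, non-quote characters.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<sign>[+-])?(?:(?<field>\w+):)?(?:""(?<phrase>[^""]*)""|(?<word>[^\s""]+))");

    // Returns tokens rendered as "KIND:text", e.g. "NOT:walking in boots".
    public static List<string> Tokenize(string query)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            bool quoted = m.Groups["phrase"].Success;
            string text = quoted ? m.Groups["phrase"].Value.Trim() : m.Groups["word"].Value;

            string kind = m.Groups["sign"].Value == "-" ? "NOT" : "TERM";
            if (m.Groups["field"].Success)
            {
                // Field-scoped term such as author:hamish.
                kind += "[" + m.Groups["field"].Value + "]";
            }
            else if (!quoted && !m.Groups["sign"].Success &&
                     string.Equals(text, "or", StringComparison.OrdinalIgnoreCase))
            {
                // A bare, unquoted "or" is treated as an operator, not a search term.
                kind = "OR";
            }
            tokens.Add(kind + ":" + text);
        }
        return tokens;
    }
}
```

Calling QueryTokenizer.Tokenize on the first example query yields seven tokens, with the quoted phrase kept together as a single negated token. What each token means in SQL still has to be decided separately.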

Henri
+1  A: 

How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them, I'd suggest going with simple string processing (regexes) too.

But if you have more than that, you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use, as you can express your grammar in near-BNF form directly in code).

andyp
There are potentially hundreds of keywords; however, not all are required at once.
b0x0rz
That's not an easy problem to solve, then, as you have to assign a 'meaning' to each of those hundreds of keywords. And I wonder what your database schema might look like?
andyp
+1  A: 

The Lucene/NLucene project has functionality for boolean queries and some other query formats as well. I don't know whether it is possible to add your own extensions, like author in your case, but it might be worthwhile to check it out.

PHeiberg
+1  A: 

There are a few ways of doing it. Here are two of them:

  • Parsing using a grammar (useful for a complex language)
  • Parsing using regular expressions and basic string manipulation (for a simpler language)

Judging by your example, the language is very basic, so splitting the string on the keywords may be the best solution.

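// Split on the known keywords; this assumes each keyword appears exactly once, in this order.
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new[] { " in ", " priced over ", " with a " },
                                 StringSplitOptions.None);
string type = values[0];        // "house"
string area = values[1];        // "new york"
string price = values[2];       // "$500000"
string accessories = values[3]; // "swimming pool"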

However, some issues may arise: how do you verify that the sentence is in the expected form? What happens if some of the keywords can appear as part of the values?
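One possible way to tackle the first question is to match the whole sentence against an anchored regular expression instead of splitting it, so that a sentence in the wrong form is rejected rather than producing a short array. The sketch below is only an assumption about what "the expected form" means (all three keywords, in order, with a numeric price); the class name ListingQuery and its TryParse method are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ListingQuery
{
    // Anchored pattern: the sentence must contain all three keywords, in order.
    private static readonly Regex Form = new Regex(
        @"^(?<type>.+?) in (?<area>.+?) priced over \$?(?<price>[0-9]+) with a (?<extras>.+)$",
        RegexOptions.IgnoreCase);

    // Returns false (instead of throwing) when the sentence is not in the expected form.
    public static bool TryParse(string sentence, out string type, out string area,
                                out decimal price, out string extras)
    {
        type = area = extras = string.Empty;
        price = 0m;

        Match m = Form.Match(sentence.Trim());
        if (!m.Success)
            return false;

        // Reject prices that are too large to fit in a decimal.
        if (!decimal.TryParse(m.Groups["price"].Value, NumberStyles.None,
                              CultureInfo.InvariantCulture, out price))
            return false;

        type = m.Groups["type"].Value;
        area = m.Groups["area"].Value;
        extras = m.Groups["extras"].Value;
        return true;
    }
}
```

This does not answer the second question: because the groups are lazy, a value that itself contains " in " is cut at the first occurrence of the keyword. That ambiguity is the kind of thing a proper grammar has to resolve.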

If you run into these issues, there are libraries you can use to parse the input according to a defined grammar. Two such libraries that work with .NET are ANTLR and GOLD Parser; both are free. The main challenge is defining the grammar.

Elisha
Liking GOLD best so far.
b0x0rz
+1  A: 

A grammar would work very well for the second example you gave, but the first (keyword/command strings in any order) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do some initial processing to handle quoted regions before the split (for example, replacing spaces within quoted regions with a rare/unused character).
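The quoted-region preprocessing could look roughly like this. It is a sketch under assumptions: the placeholder character (U+001F) is simply one that a user is unlikely to type, the quotes themselves are dropped, and the class name QuotedSplitter is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedSplitter
{
    // Assumed placeholder: U+001F (unit separator), unlikely to appear in user input.
    private const char Placeholder = '\u001F';

    public static string[] Split(string query)
    {
        // Replace spaces inside "..." with the placeholder and drop the quotes,
        // so that a quoted phrase survives the split as a single element.
        string shielded = Regex.Replace(query, "\"([^\"]*)\"",
            m => m.Groups[1].Value.Trim().Replace(' ', Placeholder));

        string[] parts = shielded.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Restore the original spaces now that the split is done.
        for (int i = 0; i < parts.Length; i++)
            parts[i] = parts[i].Replace(Placeholder, ' ');

        return parts;
    }
}
```

For the first example query this gives seven elements, with -walking in boots kept as one element. An unbalanced quote is left untouched by the regex, so the caller would still need to decide how to treat it.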

The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking for them.

The +/- keywords are also easy to find and add to the SQL query as AND/AND NOT clauses.
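To make the last two points concrete, here is a hedged sketch that walks an already-split array and builds a parameterized WHERE clause. The column names Body and Author, the @pN parameter naming, and the decision to combine repeated author: commands with IN are all assumptions for illustration; the real schema is not given in the question. Tokens equal to "or" are assumed to have been dealt with beforehand.

```csharp
using System;
using System.Collections.Generic;

public static class WhereBuilder
{
    private const string AuthorPrefix = "author:";

    // Fills 'parameters' with values and returns the matching WHERE clause text.
    // Body and Author are hypothetical column names.
    public static string Build(IEnumerable<string> tokens, IDictionary<string, object> parameters)
    {
        var clauses = new List<string>();
        var authors = new List<string>(); // parameter names of non-negated author: commands

        foreach (string raw in tokens)
        {
            bool negate = raw.StartsWith("-", StringComparison.Ordinal);
            bool signed = negate || raw.StartsWith("+", StringComparison.Ordinal);
            string token = signed ? raw.Substring(1) : raw;
            if (token.Length == 0)
                continue;

            string name = "@p" + parameters.Count;
            if (token.StartsWith(AuthorPrefix, StringComparison.OrdinalIgnoreCase))
            {
                parameters[name] = token.Substring(AuthorPrefix.Length);
                if (negate) clauses.Add("NOT (Author = " + name + ")");
                else authors.Add(name);
            }
            else
            {
                // Plain keyword: substring match, negated for "-" tokens.
                parameters[name] = "%" + token + "%";
                clauses.Add(negate ? "NOT (Body LIKE " + name + ")" : "Body LIKE " + name);
            }
        }

        if (authors.Count > 0)
            clauses.Add("Author IN (" + string.Join(", ", authors) + ")");

        return string.Join(" AND ", clauses);
    }
}
```

Passing the values as parameters rather than concatenating them into the SQL text avoids injection problems. The sketch does not escape LIKE wildcards (% and _) that appear in the user's input.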

The only place you might run into issues is with the "or", since you'll have to define how it is handled. What if there are multiple "or"s? The order of keywords in the array is the same as in the query, though, so that won't be an issue.
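One possible definition, offered only as an assumption: "or" joins the terms immediately on either side of it, chains of "or" form a single group, and the groups are then ANDed together. Under that rule, grouping the split array might be sketched like this (OrGrouper is a made-up name):

```csharp
using System;
using System.Collections.Generic;

public static class OrGrouper
{
    // Groups tokens so that "x or y or z" becomes one group [x, y, z];
    // every other token forms a group of its own. Groups are meant to be ANDed,
    // and the members of a group ORed.
    public static List<List<string>> Group(IList<string> tokens)
    {
        var groups = new List<List<string>>();
        bool joinNext = false;

        foreach (string token in tokens)
        {
            if (string.Equals(token, "or", StringComparison.OrdinalIgnoreCase))
            {
                // A leading "or" has nothing to attach to and is ignored.
                joinNext = groups.Count > 0;
                continue;
            }

            if (joinNext)
                groups[groups.Count - 1].Add(token);
            else
                groups.Add(new List<string> { token });

            joinNext = false;
        }
        return groups;
    }
}
```

With this rule, flying hiking or swiming becomes flying AND (hiking OR swiming). A trailing "or" is simply dropped. Other definitions are equally defensible, which is why the behaviour needs to be decided up front.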

AverageAdam