I'm trying to build a search similar to Google's, where an exact-match phrase is enclosed in double quotes.

Let's use the following phrase for an example

"phrase search" single terms [different phrase]

Currently if I use the following code

        Dim searchTermsArray As String() = searchTerms.Split(New String() {" ", ",", ";"}, StringSplitOptions.RemoveEmptyEntries)

        For Each entry In searchTermsArray
            Response.Write(entry & "<br>")
        Next

my output is

"phrase
search"
single
terms
[different
phrase]

but what I really need is to build a key value pair

phrase search     |  table1  
single            |  table1  
terms             |  table1  
different phrase  |  table2

where table1 is a table with general info, and table2 is a table of "tags" similar to those on Stack Overflow.

Can anybody point me in the right direction on how to properly capture the input?

A: 

Regex is your friend. See this question

Pete Amundson
A: 

What you are trying to do is not trivial. Implementing a search "similar to Google's" goes far beyond parsing the search string.

I'd suggest not reinventing the wheel and instead using a production-ready solution such as Apache Lucene.NET or Apache Solr. Both handle query parsing as well as full-text search.

But if you only need to parse this kind of string, you should really consider the solution Pete pointed to.

Ihor Kaharlichenko
definitely recommend lucene.net if you really want search-ability akin to "googling"
Pete Amundson
search is a very minor part of my application. Basically the site is an events listing site, and I need to be able to search for key words (like band or venue names - including names with spaces), but also filter by tags if they use square braces.
rockinthesixstring
alright... after researching Lucene.NET, it looks like the right solution for me. Thanks for the direction.
rockinthesixstring
A: 

I would go for regular expressions

  1. Extract and remove all matches of the pattern "\"[^\"]+\"" ("phrase search")
  2. Extract and remove all matches of "\[[^\]]+\]" ([different phrase])
  3. Split the rest by " "
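
The three steps above can be sketched roughly as follows. This is written in Python rather than the question's VB.NET, purely as an illustration; the function name `parse_with_regex` is made up here, and `table1`/`table2` are the placeholder names from the question. It uses negated character classes (`[^"]+` and `[^\]]+`) so that each match stops at the nearest closing delimiter, and it also splits on the `,` and `;` separators from the question's `Split` call.

```python
import re

# Step 1 pattern: text between a pair of double quotes.
QUOTED = re.compile(r'"([^"]+)"')
# Step 2 pattern: text between square brackets.
TAGGED = re.compile(r'\[([^\]]+)\]')
# Step 3 separators: whitespace, comma, semicolon (as in the question's Split call).
SEPARATORS = re.compile(r'[\s,;]+')


def parse_with_regex(search_terms):
    """Return (term, table) pairs; 'table1'/'table2' are the question's placeholders."""
    # Step 1: collect quoted phrases, then blank them out of the input.
    pairs = [(p.strip(), "table1") for p in QUOTED.findall(search_terms)]
    rest = QUOTED.sub(" ", search_terms)

    # Step 2: collect bracketed tags from what is left, then blank them out too.
    tags = [(t.strip(), "table2") for t in TAGGED.findall(rest)]
    rest = TAGGED.sub(" ", rest)

    # Step 3: whatever remains is split into single terms.
    pairs += [(w, "table1") for w in SEPARATORS.split(rest) if w]
    return pairs + tags


print(parse_with_regex('"phrase search" single terms [different phrase]'))
```

Note that the result is grouped by kind (phrases, single terms, tags) rather than by position in the input, and that an unbalanced quote or bracket is simply left inside the neighbouring single term.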
Dave
A: 

Depending on how fancy you plan on getting, you might consider the search grammar/implementation that's included with Irony.

http://irony.codeplex.com/

David Lively
A: 

Search string parsing is a non-regular problem. That means that while a regular expression can get deceptively close, it won't take you all the way there without using proprietary extensions, building an unmaintainable mess of an expression, leaving nasty edge cases open that don't work how you'd like, or some combination of the three.

Instead, there are three correct ways to handle this:

  1. Use a third-party solution like Lucene.
  2. Build a grammar via something like ANTLR.
  3. Build your own state machine.

For a problem of this level (and assuming that search is core enough to what you're doing that you really want to implement it yourself), I'd probably go with option 3. This makes more sense when you realize that regular expressions are themselves instructions for how to set up state machines. All you're doing is building that right into your code. This should also let you tune performance and features without adding a larger lexer component to your code.

For an example of how you might do this take a look at my answer to this question:
http://stackoverflow.com/questions/1544721/reading-csv-files-in-c/1544743#1544743
What I would do is build a state machine to parse the string character by character. This will be the easiest way to implement a fully-correct solution, and should also result in the fastest code.
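
As a rough illustration of option 3 applied to the question's input, here is a minimal character-by-character state machine. It is sketched in Python rather than VB.NET and is not taken from the linked CSV answer; the function name `parse_with_state_machine`, the three state names, and the choice to treat an unterminated quote or bracket as closed at the end of input are all assumptions made for this example. `table1` and `table2` are the placeholder names from the question.

```python
def parse_with_state_machine(search_terms):
    """Return (term, table) pairs in input order, scanning one character at a time."""
    pairs = []
    buffer = []
    state = "term"  # one of: "term", "phrase" (inside quotes), "tag" (inside brackets)

    def flush(table):
        # Emit the buffered text (if any) for the given table and reset the buffer.
        text = "".join(buffer).strip()
        if text:
            pairs.append((text, table))
        buffer.clear()

    for ch in search_terms:
        if state == "phrase":
            if ch == '"':            # closing quote ends the phrase
                flush("table1")
                state = "term"
            else:
                buffer.append(ch)    # separators are kept inside a phrase
        elif state == "tag":
            if ch == "]":            # closing bracket ends the tag
                flush("table2")
                state = "term"
            else:
                buffer.append(ch)
        elif ch == '"':              # opening quote: finish any pending term first
            flush("table1")
            state = "phrase"
        elif ch == "[":              # opening bracket: finish any pending term first
            flush("table1")
            state = "tag"
        elif ch in " ,;\t":          # separators from the question's Split call, plus tab
            flush("table1")
        else:
            buffer.append(ch)

    # Assumption: an unterminated phrase or tag is treated as closed at end of input.
    flush("table2" if state == "tag" else "table1")
    return pairs


print(parse_with_state_machine('"phrase search" single terms [different phrase]'))
```

Because every transition is explicit, handling for things like escaped quotes or a different policy for unbalanced delimiters can be added as extra branches without touching the rest.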

Joel Coehoorn