views:

53

answers:

2

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.

What we need:

  1. Column search
    • ability to search by column
    • we flag which columns in a table are searchable
  2. Keep some relation between db column and data
    • we provide advanced filtering on the results
    • facilitates (amazon style) filtering
    • filter provided by grouping of results and allowing user to filter them via a checkbox
    • this is a great feature, users like it very much
  3. Partial Word Match
    • we have a lot of unique identifiers (product id, etc).
    • the unique id's can have sub parts with meaning (location, etc)
    • or only a portion may be available (when the user is searching)
    • or (by a decidedly poor design decision) there may be white space in the id
    • this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
    • using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
    • didn't test on Oracle
    • however searches against both types of db are very fast
    • We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
      • this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
      • both provide speedups in read-only join situations (like a search combing company and product)
    • A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
    • Full Text Search is not the best in this area (slow and inconsistent matching)
    • partial matching (see "Partial Word Match")

Nice to have:

  1. Search database in real time
    • skip the indexing skip, this is not a hard requirement
  2. Spelling suggestion

What we don't need:

  1. We don't need to index documents
    • at this point searching our data sources are the most important thing
    • even when we do search documents, we will be looking for partial word matching, etc
  2. Ranking
    • Our own simple ranking algorithm has proven much better than an FTS equivalent.
    • Users understand it, we understand it, it's almost always relevant.
  3. Stemming
    • Just don't need to get [run|ran|running]
  4. Advanced search operators
    • phrase matching, or/and, etc
    • according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
      • most users are using simple search phrases
      • very few use advanced searches (when it's available)
      • also in Information Architecture 3rd edition Page 185
      • "few users take advantage of them [advanced search functions]"
      • http://oreilly.com/catalog/9780596000356
      • our Amazon like filtering allows better filtering anyway (via user testing)
  5. Full Text Search
    • We've found that results don't always "make sense" to the user
    • Searching with FTS is hard to tune (which set of operators match the users expectations)
    • Advanced search operators are a no go
    • we don't need them because
    • users don't understand them
    • Performance has been very close (+/1) to the char index functions
    • but the results are sometimes just "weird"

The question: Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?

I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.

Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?

You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."

A: 

I recommend looking into Solr, I believe it will meet you needs:

http://lucene.apache.org/solr/

Any advice on how to do partial word matching with solr?
Chris Weber
I believe it accepts wildcards with ? or * it has been a while since I last looked at it though.
A: 

For an off-she-shelf solution: Have you checked out the Google Search Appliance?

Quote from the Google Mini/GSA site:

... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.

And of course it indexes everything else in the Googly manner you'd expect it to.

Paul Sasik
Very interesting. Any info on how the GSA searches databases?
Chris Weber
Paul Sasik