I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.

For instance, this particular search would run against a table of at least 750k records containing product SKUs, sizing, material type, create date, etc.

Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, the sizing, etc., and get hits on any or all relevant results.

Currently, if I run a 'LIKE' comparison query it seems to take ages (OK, a few seconds, but still), and that is too long - at times a user sits there waiting up to 10 seconds for queries and page loads.

Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison.

So this is a multi-part question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plugin/module for ColdFusion that will give me speedy, advanced search capability?

+3  A: 

If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).

Build an index table mapping the unique ID of records in your main table to a set of words (one word per row) from the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to track word count as well.
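A minimal sketch of what such an index table might look like in SQL (all names here are illustrative, not from the answer):

    CREATE TABLE ProductWordIndex (
        ProductId   INT          NOT NULL,            -- unique ID from the main table
        Word        NVARCHAR(50) NOT NULL,            -- one word per row
        SourceField VARCHAR(30)  NOT NULL,            -- optional: field of origin
        WordCount   INT          NOT NULL DEFAULT 1   -- optional: for relevance ranking
    );

    -- The index that makes word lookups fast
    CREATE NONCLUSTERED INDEX IX_ProductWordIndex_Word ON ProductWordIndex (Word);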

Populate the index table with either a trigger (using string splitting) or from your app - the latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
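A hedged sketch of the app-driven approach - the application splits the text and calls a stored procedure once per word (procedure and names are hypothetical):

    CREATE PROCEDURE AddProductWord
        @ProductId   INT,
        @Word        NVARCHAR(50),
        @SourceField VARCHAR(30)
    AS
        -- One call per word; the app has already split the text
        INSERT INTO ProductWordIndex (ProductId, Word, SourceField)
        VALUES (@ProductId, @Word, @SourceField);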

This will drastically speed up textual search, since it no longer uses "LIKE" and can use the indexes on the index table (no pun intended) without interfering with the indexes on SKU and the like on the main table.
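The search then becomes an exact match against the indexed Word column rather than a LIKE scan; for example, a two-word search might look like this (table names and search terms are illustrative):

    -- Products whose indexed words include both 'steel' and 'boxed'
    SELECT p.*
    FROM Products AS p
    WHERE p.ProductId IN (SELECT ProductId FROM ProductWordIndex WHERE Word = 'steel')
      AND p.ProductId IN (SELECT ProductId FROM ProductWordIndex WHERE Word = 'boxed');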

Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing, etc.). Any field searched as a range (sizing or date) is a good candidate for a clustered index, as long as records are inserted in approximate order of that field's increase, or you don't care as much about insert/update speed.
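In SQL Server terms, that might look like the following sketch (names are illustrative, and it assumes the primary key is nonclustered, since a table can have only one clustered index):

    CREATE NONCLUSTERED INDEX IX_Products_Sku    ON Products (Sku);
    CREATE NONCLUSTERED INDEX IX_Products_Sizing ON Products (Sizing);

    -- Range-searched field: storing rows in date order speeds up date-range scans
    CREATE CLUSTERED INDEX IX_Products_CreateDate ON Products (CreateDate);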

For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.

Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" among the text fields. If so, I assume the values are "yes"/"no" or some other very limited set. In that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but an improvement nonetheless.
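An illustrative sketch of that encoding (column name and codes are assumed):

    -- 1 = yes, 0 = no; the app handles en/de-coding for display
    ALTER TABLE Products ADD IsBoxed TINYINT NOT NULL DEFAULT 0;

    -- Search by the numeric code instead of matching text
    SELECT * FROM Products WHERE IsBoxed = 1;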

DVK
Yes, the searchable data is all text; it is mainly a large set of product-related data - as I mentioned: sizing, material, product SKU, part number, is it boxed, etc. Currently I have an index on the table for part numbers, as most of our customers search by part number, but searches on other criteria come back slow, as I have mostly been attacking this with LIKE operators and wildcards for matching partials. I have a nightly re-index of the database (as data is constantly added - a few thousand records a day), so the table(s) are indexed.
Jakub
I updated the answer to clarify that "indexing" means building a table called "index", as opposed to (or rather in addition to) building DB table indexes, in case that was unclear in the original wording.
DVK
Also added a point re: using numeric codes for fields that are essentially "enum"s - i.e., that have a very small set of valid text values.
DVK
A: 

Because SQL Server is where your data lives, that is where your search performance issue will be. Make sure you have indexes on the columns you are searching on. Note that a LIKE with a leading wildcard cannot use an index:

    SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'

But it can use an index if you write it like this:

    SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'

The key here is to keep as many of the leading characters as possible free of wildcards.
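For completeness, the prefix form can only seek if the searched column is actually indexed; a minimal sketch (index name is illustrative):

    -- With this index in place, LIKE 'FR%' becomes an index range seek
    CREATE NONCLUSTERED INDEX IX_TABLEX_last_name ON TABLEX (last_name);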

Here is a link to a site with some general tips. http://blogs.techrepublic.com.com/datacenter/?p=173

StarShip3000
Hmm... good point about LIKE being able to use the index when matching from the start of the string, but LIKE is still way slower than an exact equality search (and the OP's description doesn't suggest the searches would tend to match the start of the text, at least to me)
DVK
Sorry, your example does not really apply to the scope of this question. You have shown a basic 'LIKE comparison', which I mentioned I want to get away from because it is costly.
Jakub
Yes, but my point was that if you use LIKE in the proper way it may not be costly. You will need to test to verify. To me that is the smallest change to test and possibly fix your "slow issue".
StarShip3000
+1  A: 

If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're building some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup, you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding; it will not be dependent on your database, or even on ColdFusion. It will also be quite fast and familiar to customers.

I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions, or group, it may not be the solution for you (although you can configure a permission service for Google to check with).

ryber
You've hit the nail on the head: it is an eCommerce site. However, the site is structured with role-based permissions, which raises the concern of 'how' to spider those results, as we may give some users access to see stock while others would not see it. I am not too familiar with how Google integration works (buying a 1U Google spider server, etc.). I always assumed it was mainly for file repos and static content, whereas most of our data is stored in the DB.
Jakub
+3  A: 

You could try indexing your db records with a Verity (or Solr, if CF9) search.

I'm not sure it would be faster, and whether it is even worth trying depends a lot on how often you update the records you need to search. If you update them rarely, you could run a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver and will certainly offset any possible gains in search speed.

I've never indexed a database via Verity, but I have indexed large collections of PDFs, Word docs, etc., and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.

Ben Doom
Problem is, we update them fairly frequently - not just the product data, but also purchasing data (order history) and account information, which syncs from our ERP platform to the eCommerce app.
Jakub
That's not a problem - have your update process kick off a re-indexing of the data; or do it on a schedule, like overnight every night. If you have ColdFusion, you really should make use of Verity, Lucene, or Solr. They are incredible at what they do, and included, so take advantage of them!
Adam Tuttle
+1  A: 

I've done this using SQL Server's full text indexes. It requires very few application changes and no changes to the database schema except for the addition of the full text index.

First, add the full text index to the table, including all of the columns the search should run against. I'd also recommend having the index update automatically; this shouldn't be a problem unless your SQL Server is already highly taxed.
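A sketch of that setup in SQL Server (catalog, table, and column names are illustrative; KEY INDEX must name a unique index on the table):

    CREATE FULLTEXT CATALOG ProductFTCatalog;

    CREATE FULLTEXT INDEX ON dbo.Products (Sku, Sizing, MaterialType)
        KEY INDEX PK_Products           -- unique index the full text engine keys on
        ON ProductFTCatalog
        WITH CHANGE_TRACKING AUTO;      -- keep the index updated automatically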

Second, to do the actual search, convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using a Split method) and then building a search string formatted as:

"Word1*" AND "Word2*" AND "Word3*"

The double-quotes are critical; they tell the full text index where the words begin and end.

Next, to actually execute the full text search, use the ContainsTable command in your query:

    SELECT *
    FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')

This will return two columns:

  • Key - The value of the column identified as the unique key of the full text index
  • Rank - A relative rank for the match (0 - 1000, with a higher rank meaning a better match).
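Joining those columns back to the source table returns the full rows ordered by relevance; a sketch building on the example above (the BugId key column is assumed):

    SELECT b.*, ft.[RANK]
    FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"') AS ft
        INNER JOIN Bugs AS b ON b.BugId = ft.[KEY]
    ORDER BY ft.[RANK] DESC;   -- best matches first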

I've used approaches similar to this many times and I've had good luck with it.

Jeff Siver