I'm trying to work out the best way to design a storage system for fast text searching.

  • There will be a different file format for each customer
  • These files are XML, and the field names and attributes are not standard, and do not follow a schema
  • The customer has an option to choose certain fields to be searchable
  • There could be 100,000 records per file, per customer.

I'm processing these files and generating a table based on the columns specified by the client configuration.

What type of database schema would you choose: SQL, flat files, or some other technology?

There are going to be a lot of rows to search, and I don't know the best way to go about this.

Create a table called SearchColumns:

  Id
  CustomerId
  DisplayValue

Create a table called SearchRecords:

  Id
  SearchColumnId
  SearchText
With this scenario, the SearchRecords table is going to get very large very fast, and because SearchText is going to be varchar(200), LIKE queries are going to be incredibly slow.

I've also considered full-text search on the SearchRecords table, but in testing on a sample table I'm not getting the results I would expect.
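For reference, a typical full-text setup on that table looks roughly like this (assuming SQL Server; PK_SearchRecords stands in for whatever the primary key index is actually named):

    CREATE FULLTEXT CATALOG SearchCatalog;

    CREATE FULLTEXT INDEX ON SearchRecords (SearchText)
        KEY INDEX PK_SearchRecords
        ON SearchCatalog;

    -- CONTAINS can use the full-text index; LIKE '%term%' cannot
    SELECT Id, SearchColumnId, SearchText
    FROM   SearchRecords
    WHERE  CONTAINS(SearchText, '"smith"');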

I've also considered separate databases per customer. This will help with table size in the short term, but after months or years the tables will still grow large and queries will slow down again.

What would you do to make a FAST searchable table, that will potentially hold millions of records?

Edit: information regarding the data I'm storing:

I'm pulling values such as FullName, Address, and Account Numbers from the XML file. These fields are pretty small and would most likely never exceed 200 characters.

+1  A: 

I'm not sure I understand the question. Do you have a record-storage schema selected and need to know how best to get stuff in it, or do you need the storage schema as well? Are you planning on parsing the XML into columns that are nText, or just loading the XML file, tags and all, into nText columns?

Generally speaking, go for a narrow, deep table over a wide, shallow table if you're looking for performance. Narrow tables usually require fewer indices to speed searching on the most common columns, and those indices allow the engine to break the search into parallelizable chunks. Most engines are also smart enough to prioritize "cheap" filter conditions over "expensive" ones; the LIKE clause, if present, will almost surely be evaluated last in a compound WHERE clause, so if you can provide any other information to narrow the search, especially on indexed columns, you can speed up the general performance of your query.
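As an illustration (borrowing the table names from the question; the literal 42 and the search term are just placeholders), an index on the cheap filter column lets the engine cut the candidate rows down before the LIKE is evaluated:

    -- Index the inexpensive, selective column
    CREATE INDEX IX_SearchRecords_SearchColumnId
        ON SearchRecords (SearchColumnId);

    -- The equality predicate narrows the rows via the index;
    -- the LIKE is then applied only to that smaller set.
    SELECT Id, SearchText
    FROM   SearchRecords
    WHERE  SearchColumnId = 42
      AND  SearchText LIKE '%smith%';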

You may consider (I can't believe I'm about to recommend this) a key-question-answer schema for the main element data (the text between the opening and closing tags of each element). For any case in which even part of the schema definition is standardized, a traditional statically-defined table will be easier to work with on practically all counts. But if you don't know the structure of your data beyond the fact that it's XML, a static approach will require some sort of mapping between a particular file's metadata and a table of generic fields; in that case, key-question-answer combines the two for better query performance.

Whatever information uniquely identifies a particular record (and/or data you need to search very quickly to narrow result sets inexpensively) is your key, the element name is your question, and the value is your answer. This supports a very flexible data-naming standard. Because the data is XML, relevant data can also be stored as attributes of an element (part of the opening tag), so you may need similar but simpler tables for searchable attribute data, or you can normalize attribute data into the main table based on some well-known mapping. Having these very narrow, row-per-column lines also lets you easily move non-searched columns into an "archive" table; you probably still need to keep the data around in case the customer wants to start searching on a column, but if you currently don't search on it, you can get it out of the table doing the heavy lifting, which will drastically reduce query times.
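A rough sketch of that layout, with an archive table for the element names nobody currently searches on (all names and sizes here are illustrative):

    CREATE TABLE SearchEntries (
        RecordKey    varchar(50)  NOT NULL,  -- the "key": whatever uniquely identifies the source record
        ElementName  varchar(100) NOT NULL,  -- the "question": tag or attribute name
        ElementValue varchar(200) NULL,      -- the "answer": the extracted text
        CONSTRAINT PK_SearchEntries PRIMARY KEY (RecordKey, ElementName)
    );

    -- Same shape, holding the element names that are not currently searchable
    CREATE TABLE SearchEntriesArchive (
        RecordKey    varchar(50)  NOT NULL,
        ElementName  varchar(100) NOT NULL,
        ElementValue varchar(200) NULL
    );

    -- When a column stops being searchable, move its rows out of the hot table
    INSERT INTO SearchEntriesArchive (RecordKey, ElementName, ElementValue)
        SELECT RecordKey, ElementName, ElementValue
        FROM   SearchEntries
        WHERE  ElementName = 'MiddleName';

    DELETE FROM SearchEntries
    WHERE  ElementName = 'MiddleName';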

If you're looking for approximate values of a CLOB field, you simply aren't going to beat a LIKE query. Yes, it will be slow on very large text values; the only way to help that is to split that text up in a manner that won't cause false non-matches (where LIKE won't find a match across split boundaries), and I don't think you'll find a universal method of doing that; you must know something about what you're storing, such as that it's in paragraphs and a match will never cross paragraph boundaries anyway.

When all is said and done, I think you'll find that, regardless of data size, most SQL RDBMSes work rather well on just about any intelligent schema when given enough processor muscle. Searching on an index is logarithmic in nature as opposed to linear, and so a good indexing schema will help the engine break down the search space considerably.

KeithS
There is only going to be one column in the database called "SearchText". This isn't going to be XML data, but rather data extracted from an XML field. I hope this helps clarify things a bit.
Michael G
A: 

Some of my thoughts; maybe they can help.

1- As performance is critical in your case, denormalize your table and try to minimize joins.

2- Partition your table.

3- Identify the columns that are frequently used and create indexes on them (a quick sketch of these points is below).
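A minimal sketch of those ideas, assuming SQL Server and partitioning by CustomerId (the boundary values and object names are placeholders):

    -- 1: denormalized, so CustomerId lives directly on the record row and no join is needed
    -- 2: partition by customer so each search touches fewer pages
    CREATE PARTITION FUNCTION pfCustomer (int)
        AS RANGE LEFT FOR VALUES (1000, 2000, 3000);

    CREATE PARTITION SCHEME psCustomer
        AS PARTITION pfCustomer ALL TO ([PRIMARY]);

    CREATE TABLE SearchRecordsPartitioned (
        Id             int          NOT NULL,
        CustomerId     int          NOT NULL,
        SearchColumnId int          NOT NULL,
        SearchText     varchar(200) NULL
    ) ON psCustomer (CustomerId);

    -- 3: index the columns that are searched most often
    CREATE INDEX IX_SearchRecordsPartitioned_Customer_Column
        ON SearchRecordsPartitioned (CustomerId, SearchColumnId);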

saurabh