views: 445
answers: 7

Hello,

Let's say my application creates, stores and retrieves a very large number of entries (tens of millions). Each entry holds a variable amount of data: some entries have only a few bytes (say, an ID and a title), while others may carry megabytes of supplementary data. The basic structure of every entry is the same, and it is in XML format.

Entries are created and edited arbitrarily (edits will most likely append data rather than rewrite it).

Does it make sense to store the entries as separate files in a file system, keeping the necessary indexes in a DB, versus saving everything in the DB?

+3  A: 

It really depends on how you're going to use it. Databases can handle more entries in a table than most people think, especially with proper indexing. On the other hand, if you aren't going to be making use of the functionality that a relational database provides, there might not be much reason to use it.

Ok, enough generalizing. Given that a database eventually boils down to "files on disk" anyway, I wouldn't worry too much about what "the right thing to do" is. If the primary purpose of the database is just to efficiently retrieve these files, I think it would be perfectly fine to keep the DB entries small and look up file paths instead of actual data - especially since your file system should be pretty efficient at retrieving data given a specific location.
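As a rough sketch of that layout (table and column names here are hypothetical, and any SQL database would do), the DB row carries only the small, indexable fields plus the file's path, while the file system holds the payload:

    -- Sketch: small fields live in the DB; the XML payload lives on disk.
    -- Table and column names are made up for illustration.
    CREATE TABLE entries (
        id        BIGINT        PRIMARY KEY,
        title     VARCHAR(255)  NOT NULL,
        file_path VARCHAR(1024) NOT NULL  -- location of the entry's XML file
    );

    -- Retrieval is an indexed lookup for the path, then a plain file read:
    SELECT file_path FROM entries WHERE id = 12345;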

In case you're interested, this is actually a common data storage pattern for search engines - the index will store the indexed data and a pointer to the stored data on disk, rather than storing everything in the index.

danben
+1  A: 

I would definitely store the data on the file system and a hash of the path in the DB.
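One way to read that (a sketch only; the names are hypothetical, and md5() here is PostgreSQL's built-in, so substitute your DB's equivalent) is to keep the hash in the DB and derive the on-disk location from it:

    -- Sketch: the DB keeps a hash, and the file sits on disk under a
    -- name derived from that hash. All names here are hypothetical.
    CREATE TABLE entry_files (
        id        BIGINT   PRIMARY KEY,
        path_hash CHAR(32) NOT NULL UNIQUE  -- e.g. md5 of the logical path
    );

    -- The file itself might then live at something like
    --   /data/entries/<first two hash chars>/<hash>.xml
    INSERT INTO entry_files (id, path_hash)
    VALUES (1, md5('/entries/1.xml'));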

Alix Axel
+1  A: 

Well, depending on your budget, MS SQL Server has what's called a "Primary XML Index" that can be created even on unstructured data. This allows you to write XQuery to search within the XML column, and the database will assist you.
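A minimal sketch of what that looks like (table, column, and element names here are hypothetical; exist() is the XQuery method SQL Server exposes on xml columns):

    -- Sketch for SQL Server; all names are hypothetical.
    CREATE TABLE dbo.Entries (
        EntryId INT IDENTITY PRIMARY KEY,
        Data    XML NOT NULL
    );

    -- The primary XML index shreds the XML so XQuery predicates can use it.
    CREATE PRIMARY XML INDEX PXML_Entries_Data ON dbo.Entries (Data);

    -- A search the index can assist with:
    SELECT EntryId
    FROM dbo.Entries
    WHERE Data.exist('//entry/title[text() = "Some title"]') = 1;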

If there is any coherency at all in the data, or if it can be placed into a schema, then you may see a benefit from this.

Might I also recommend that if you have large amounts of binary data, such as images, you strip these out and place them somewhere else, such as a file system. Or, if you use SQL Server 2008, there is an attribute called FILESTREAM (cheers @marc_s) which allows you to index, store and secure all the files you write, and to retrieve them via NTFS APIs (i.e. fast block transfer), while still keeping them as columns in the database.
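The column declaration looks roughly like this (a sketch: it assumes the database already has a FILESTREAM filegroup configured, and all names are hypothetical):

    -- Sketch for SQL Server 2008: a FILESTREAM column needs a ROWGUIDCOL
    -- column on the table, and the database needs a FILESTREAM filegroup
    -- (ALTER DATABASE ... ADD FILEGROUP ... CONTAINS FILESTREAM) set up
    -- beforehand. Names are hypothetical.
    CREATE TABLE dbo.EntryBlobs (
        BlobId  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        EntryId INT NOT NULL,
        Payload VARBINARY(MAX) FILESTREAM NULL  -- stored as a file on NTFS
    );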

Having the database in between might give you a good layer of abstraction and scaling if your application puts large demands on searching through the XML data, which means you don't have to build that search layer yourself.

Just my 2c.

Spence
The SQL Server 2008 data attribute is actually called **FILESTREAM**. It's not really a type per se - it's an attribute you can add to a `VARBINARY(MAX)` column.
marc_s
A: 

At work I often have to accumulate large sets of XML documents for later analysis. Normally this is done by sticking them into a directory, and the analysis is done by grep (or a bespoke Java program with all its XML factory/builder/wrapper/API paraphernalia).

One slow day I thought I'd try putting it in PostgreSQL. There are two features that I wanted to try out:

  • Automatic compression of large data when appropriate (TOAST).
  • Indexing using an expression.

Regarding the first feature, the DB size came out at less than half the size of the raw files. A full-text search, i.e. a table scan using WHERE data::TEXT LIKE '%pattern%', was actually quicker than running grep on the files. When you are dealing with a few GB of XML, this alone makes the DB worthwhile.

The second feature, indexing, is a bit more work to maintain. There were a few particular elements I guessed would be good to index. An expression index on xpath('//tradeHeader/tradeId/text()', data) works, but the expression can be a pain to duplicate in each query. I found it easier to add ordinary columns for some fields and use insert/update triggers to keep them in sync.
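Both approaches look roughly like this (a sketch, assuming a hypothetical table docs with an xml column named data; xpath() returns xml[], so the value has to be unwrapped and cast to text before it is indexable):

    -- Sketch for PostgreSQL; table, index, and trigger names are hypothetical.
    -- Approach 1: an expression index directly on the xpath result.
    CREATE INDEX idx_docs_trade_id
        ON docs (((xpath('//tradeHeader/tradeId/text()', data))[1]::text));

    -- Approach 2: an ordinary column kept in sync by a trigger.
    ALTER TABLE docs ADD COLUMN trade_id text;

    CREATE FUNCTION docs_sync_trade_id() RETURNS trigger AS $$
    BEGIN
        NEW.trade_id :=
            (xpath('//tradeHeader/tradeId/text()', NEW.data))[1]::text;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_docs_sync_trade_id
        BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE docs_sync_trade_id();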

Edmund
How about, besides keeping the XML / media files in the FS, having tables that hold only the searchable text content?
SODA
@Logistetica: I'm not quite sure what you mean. Do you mean putting the main file in the FS and just the metadata in the DB? (With a field saying what the filename is.) I think this is what people generally do. I haven't much experience with it myself.
Edmund
+1  A: 

A couple of considerations:

  • transaction management;
  • backup and recovery.

These are generally easier to manage with a database than with a file system. Probably the hardest part is synchronising a file system backup with a database's roll-forward (redo) logging. The more transactional your application, the more these factors matter.

It appears from your question that you are not intending to make use of normal database functionality (relational integrity, joins). In that case, you should give strong consideration to a third option: store your data in the file system and, instead of a database, use a file-based text retrieval engine like Solr (or Lucene), Sphinx, Autonomy, etc.

APC
A: 

It depends on how you are going to use the data, as a previous response says.

The data in a database can be used to support a lot of different kinds of queries, and feed the results to reports, forms, OLAP engines and lots of other kinds of tools. Appropriate indexing can speed up searches dramatically.

If you know SQL, and if the database is well designed, coming up with queries is easier, quicker, and less error-prone than doing the equivalent with files. But, as others have noted, you can plug your XML data into a SQL database without first decomposing it into relational tables.

Designing a good multipurpose schema is harder than most beginners think it is. There's a lot to learn, and it isn't just about how to manipulate one tool or another. And a bad multipurpose schema can be even harder to work with than files.

If you decide to go with a database, be prepared to make a significant investment. And make sure you are going to get the benefits of that investment.

Walter Mitty
+1  A: 

I would use HDFS (Hadoop Distributed File System) to store the data. The main idea is that you get high availability, scalability and replication. Queries from your application can be written as MapReduce jobs, and the main fields can be stored in a distributed index on top of Hadoop using Katta.

Try googling for these technologies.

Algorist