views: 119 · answers: 3

I have a very large dataset, with each item roughly 1 kB in size. The data needs to be queried rapidly by many applications distributed over a network. The dataset currently has more than a million items, and will eventually hold 500 million+ of these 1 kB data chunks.

What would be the best method of storing this dataset (it needs to allow adding more items and reading them rapidly, but already-added data is never modified)? Would a MySQL DB using the binary blob format be appropriate?

Or should each of these be stored as files on a file system?

edit: the number is 1 million items now, but needs to be able to scale to well over 500 million items easily.

A: 

That's one GB of data. What are you going to use the database for?

That's definitely just a file; read it into RAM when starting up.

Scaling to 500 million is easy; that just takes some more machines. Depending on the precise application characteristics, you might be able to normalize or compress the data in RAM.

You might be able to keep things on disk and use a database, but that seriously limits your scalability in terms of simultaneous access. You get about 50 disk accesses/sec from a disk, so just count how many disks you need.
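The arithmetic above can be sketched directly. The 50 IOPS per disk figure and the 5,000 requests/sec workload are illustrative assumptions, not numbers from the question:

```python
# Back-of-envelope sizing for the scenario above. The IOPS figure (50/disk)
# comes from the answer; the request rate is an illustrative assumption.

def disks_needed(requests_per_sec, iops_per_disk=50):
    """Number of disks required to serve a given random-read rate."""
    return -(-requests_per_sec // iops_per_disk)  # ceiling division

def ram_needed_gb(items, item_size_bytes=1000):
    """RAM required to hold the whole dataset in memory (decimal GB)."""
    return items * item_size_bytes / 1e9

print(disks_needed(5000))          # 100 disks at 50 IOPS each
print(ram_needed_gb(1_000_000))    # 1.0 GB for 1 million 1 kB items
print(ram_needed_gb(500_000_000))  # 500.0 GB for the scaled-up dataset
```

This is why the RAM route looks attractive at 1 million items: the whole dataset fits in one machine's memory, while serving the same load from spinning disks takes a sizable array.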

Stephan Eggermont
I forgot to mention, it's currently 1 million, but would also need to scale up to 500 million or so easily.
The Unknown
A: 

If you need to retrieve saved data, then storing it in files is certainly not a good idea.

MySQL is a good choice. But make sure you have the right indexes set.

Regarding the binary blob: it depends on what you plan to store. Give us more details.
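For illustration, the blob-table-with-an-index approach looks roughly like this. This is a sketch using Python's built-in sqlite3 in place of MySQL so it is self-contained; the schema and queries carry over:

```python
import sqlite3

# Sketch of the blob table discussed above. sqlite3 stands in for MySQL
# here to keep the example self-contained; the schema carries over.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "  id INTEGER PRIMARY KEY,"  # the index used for rapid lookups
    "  data BLOB NOT NULL"       # the ~1 kB binary chunk
    ")"
)

# Insert-only workload: items are added but never modified.
conn.execute("INSERT INTO items (id, data) VALUES (?, ?)",
             (42, b"\x00" * 1024))
conn.commit()

# A primary-key lookup is an index seek, not a table scan.
(blob,) = conn.execute("SELECT data FROM items WHERE id = ?",
                       (42,)).fetchone()
print(len(blob))  # 1024
```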

NinethSense
The data itself is a 2D array of bytes, and the dimensions vary slightly (some may be 28x30 and others might be 30x35, etc.).
The Unknown
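The varying dimensions are easy to handle inside a single blob by prefixing each chunk with its width and height. A minimal sketch; the 4-byte header format is my own assumption, not anything from the thread:

```python
import struct

# Pack a variable-size 2D byte array into one blob. The header (two
# little-endian unsigned shorts for height and width) is an assumption.
def pack(rows):
    h, w = len(rows), len(rows[0])
    return struct.pack("<HH", h, w) + b"".join(bytes(r) for r in rows)

def unpack(blob):
    h, w = struct.unpack_from("<HH", blob)
    body = blob[4:]
    return [list(body[i * w:(i + 1) * w]) for i in range(h)]

grid = [[x for x in range(30)] for _ in range(28)]  # a 28x30 item
blob = pack(grid)
assert unpack(blob) == grid
print(len(blob))  # 4 + 28*30 = 844 bytes
```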
-1 Nothing wrong with files. It is what databases use.
Stephan Eggermont
>> Nothing wrong with files. If you have frequent retrievals, it will surely have performance issues. Read his problem carefully. >> The data needs to be queried rapidly by many applications distributed over a network.
NinethSense
Nonsense. Databases are just as bad at reading a disk as file systems are. And the data should be in RAM if you want access from multiple applications to actually work.
Stephan Eggermont
Databases have their own benefits. We can make use of the file system too... if it is used properly. I recommend a database anyway!
NinethSense
A: 

Since there is no need to index anything inside the object, I would have to say a filesystem is probably your best bet, not a relational database. Since there's only a unique ID and a blob, there really isn't any structure here, so there's no value in putting it in a database.

You could use a web server to provide access to the repository, and then a caching solution like nginx with memcached to keep it all in memory and scale out using load balancing.

And if you run into further performance issues, you can bypass the filesystem and roll your own store like Facebook did with their photo system. This removes unnecessary I/O operations, such as pulling unneeded metadata (like security information) from the file system.
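A write-once file store keyed by item ID might look like the sketch below. Fanning out into subdirectories keeps any one directory from accumulating millions of entries; the two-level hex layout is an arbitrary choice of mine, not something from the answer:

```python
import hashlib
import os
import tempfile

# Sketch of a write-once blob store on the filesystem. Directory fan-out
# (two hex levels, an assumption) avoids huge single directories.
class BlobStore:
    def __init__(self, root):
        self.root = root

    def _path(self, item_id):
        digest = hashlib.sha1(str(item_id).encode()).hexdigest()
        return os.path.join(self.root, digest[:2], digest[2:4], digest)

    def put(self, item_id, data):
        path = self._path(item_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:  # items are never modified once written
            f.write(data)

    def get(self, item_id):
        with open(self._path(item_id), "rb") as f:
            return f.read()

store = BlobStore(tempfile.mkdtemp())
store.put(42, b"\x01" * 1024)
print(len(store.get(42)))  # 1024
```

In practice you would put nginx (or another web server) in front of a tree like this, which is what makes the memcached-and-load-balancing layer straightforward to add later.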

NeuroScr