views: 692
answers: 10

Hi everyone,

I am looking for a fast solution (fast as in raw performance, not as in a quick fix) for persisting and retrieving tens of millions of small (around 1 KB) binary objects. Each object should have a unique ID for retrieval (preferably a GUID or SHA). Additional requirements are that it should be usable from .NET and that it shouldn't require additional software installation.

Currently, I am using an SQLite database with a single table for this job, but I want to get rid of the overhead of processing simple SQL statements like SELECT data FROM store WHERE id = ?.

I've also tested direct filesystem persistence under NTFS, but performance degrades very quickly once it reaches half a million objects.

Give me your best, hackers of stackoverflow :D

P.S. By the way, objects never need to be deleted, and the insertion rate is very, very low. In fact, every time an object changes a new version is stored and the previous version remains. This is actually a requirement to support time-traveling.

Just adding some additional information to this thread:

To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem http://arxiv.org/abs/cs.DB/0701168

A: 

I think the database query is your best bet.

The whole structure of a database is tuned to just this kind of case, and parsing and optimizing the simple query is fairly insignificant.

You might be able to cook up a scheme where you store all the objects in one big blob directly on the filesystem, then open a memory-mapped file view on it and index the object IDs with an offset into the blob, but I doubt you'd see much better performance than the DB, since this is essentially what it does already.
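For illustration, a rough (untested) C# sketch of that memory-mapped idea, assuming the blobs have already been packed into one file and a Guid -> (offset, length) index was built separately while writing it; all names here are placeholders:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Sketch only: assumes the data file already holds all blobs back to back
    // and that the Guid -> (offset, length) index was built while writing it.
    class MappedBlobStore : IDisposable
    {
        private readonly MemoryMappedFile _map;
        private readonly Dictionary<Guid, (long Offset, int Length)> _index;

        public MappedBlobStore(string path, Dictionary<Guid, (long Offset, int Length)> index)
        {
            _map = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
            _index = index;
        }

        public byte[] Read(Guid id)
        {
            var (offset, length) = _index[id];
            using (var view = _map.CreateViewAccessor(offset, length, MemoryMappedFileAccess.Read))
            {
                var buffer = new byte[length];
                view.ReadArray(0, buffer, 0, length);   // copy the blob out of the mapped view
                return buffer;
            }
        }

        public void Dispose() => _map.Dispose();
    }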

codekaizen
I'm not so sure. If it's just a matter of simple lookup and retrieval, using the filesystem might make more sense, as long as no single directory has too many files inside it.
Nosredna
A: 

Store a separate index (another file) of [Guid -> file number + offset in file]. Use a binary search for retrieval, and move to file n+1 whenever file n reaches a certain size. Each row in the index file is only 24 bytes (fixed size: GUID + file number + offset; split files at 4 GB), and keeping it sorted is fast (insertion sort is fine at such a low insertion rate).

Edit: You have very simple requirements that are straightforward to optimize. A carefully constructed system like this should outperform the database, especially if you are careful about block reads of the data and asynchronous IO. Database queries will always have the overhead of parsing.

Edit 2: If you need it safe as well (always a good idea), take a look here for a description of how the concept of file system transactions can help you bullet-proof things.
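A rough C# sketch of the fixed-width index record and binary-search lookup described above (untested; the record layout, the byte order used for sorting, and all names are assumptions):

    using System;
    using System.IO;

    // Sketch: the index file is a sorted array of fixed 24-byte records --
    // 16-byte GUID + 4-byte file number + 4-byte offset (data files split at 4 GB).
    // The records must be sorted by the same byte order that Compare() uses below.
    class BlobIndex
    {
        private const int RecordSize = 24;
        private readonly byte[] _index;   // kept in memory here for simplicity

        public BlobIndex(string indexPath)
        {
            _index = File.ReadAllBytes(indexPath);
        }

        // Returns (file number, offset) or null if the id is not present.
        public (uint FileNumber, uint Offset)? Find(Guid id)
        {
            byte[] key = id.ToByteArray();
            int lo = 0, hi = _index.Length / RecordSize - 1;
            while (lo <= hi)
            {
                int mid = (lo + hi) / 2;
                int cmp = Compare(key, mid * RecordSize);
                if (cmp == 0)
                    return (BitConverter.ToUInt32(_index, mid * RecordSize + 16),
                            BitConverter.ToUInt32(_index, mid * RecordSize + 20));
                if (cmp < 0) hi = mid - 1; else lo = mid + 1;
            }
            return null;
        }

        // Byte-wise comparison of the 16-byte key against the record starting at byteOffset.
        private int Compare(byte[] key, int byteOffset)
        {
            for (int i = 0; i < 16; i++)
            {
                int d = key[i].CompareTo(_index[byteOffset + i]);
                if (d != 0) return d;
            }
            return 0;
        }
    }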

280Z28
Directly accessing big files that way seems to be begging for consistency issues when powering off and such. I would really want to offload that kind of problem to the underlying infrastructure. Good idea, nevertheless.
Hugo S Ferreira
Take a look at File System Transactions (my edit). The linked API is new to Vista, but the concepts can be implemented in code for XP if you need to.
280Z28
I will, thank you.
Hugo S Ferreira
+9  A: 

You may be able to lessen the performance problems of NTFS by breaking the object's GUID identifier up into pieces and using them as directory names. That way, each directory only contains a limited number of subdirectories or files.

e.g. if the identifier is aaaa-bb-cc-ddddeeee, the path to the item would be c:\store\aaaa\bbcc\dddd\eeee.dat, limiting each directory to no more than 64k subitems.
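For example, a small (untested) C# sketch of mapping a GUID's hex digits to such a nested path; the exact split and the store root are arbitrary choices:

    using System;
    using System.IO;

    static class GuidPathStore
    {
        // Split the 32 hex digits of the GUID into path segments so that no
        // directory holds more than 64k immediate children, e.g.
        // aaaabbccdddd...  ->  c:\store\aaaa\bbcc\dddd\<remaining digits>.dat
        public static string PathFor(Guid id, string root = @"c:\store")
        {
            string hex = id.ToString("N");          // 32 hex chars, no dashes
            return Path.Combine(root,
                                hex.Substring(0, 4),
                                hex.Substring(4, 4),
                                hex.Substring(8, 4),
                                hex.Substring(12) + ".dat");
        }

        public static void Write(Guid id, byte[] data)
        {
            string path = PathFor(id);
            Directory.CreateDirectory(Path.GetDirectoryName(path));
            File.WriteAllBytes(path, data);
        }

        public static byte[] Read(Guid id) => File.ReadAllBytes(PathFor(id));
    }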

Daniel Earwicker
Very similar to the way git stores chunks, right? I'll do some performance tests with that scheme.
Hugo S Ferreira
I did something like this with mutual fund data. It works well. The trick is to find the right balance. It's going to depend on your particular data. You might also be able to do some hashing if you have too many clumpy areas. See my answer for details.
Nosredna
NTFS is a real dog performance-wise; you can get away with this on Linux, but not on NTFS.
jottos
@jottos - when you say "get away with this", do you mean the cops or the Scooby Doo gang are going to show up at the OP's house if they try my suggestion on Windows? The whole point is that NTFS slows down when there are many files in one directory. By dividing the files into a deeper hierarchy, this is avoided, and better performance is maintained. It works like a multi-level hash table.
Daniel Earwicker
Multi-level hash table! Never thought of that... I don't think GUID search is that good in SQLite, but maybe I could tweak it a little further by splitting the GUID into 4 Int32 columns (or 2 Int64 columns). Gonna try it out...
Hugo S Ferreira
My implementation was on both Windows and Linux. That's why I broke up the files. Worked fine on both.
Nosredna
You would definitely want to break things up into subdirectories somehow. But I am not sure this is the best idea for lots of tiny files. Your bottleneck is going to be disk IO, especially if you are accessing random files all over the disk. You are at the mercy of the OS filesystem cache for keeping files in memory. With a database, you may have a little more control over what stays in memory. At the very least you could make the DB's memory ~= system memory. Also, with a DB and 1 KB objects you may not need BLOBs; most DBs support very large varchars. It may be worth benchmarking varchars vs. BLOBs.
AngerClown
+1  A: 

You need to call the prepare function only once per statement, with the parameter denoted e.g. by ? (so SELECT data FROM store WHERE id=? is the statement you'd prepare); then what you do "millions of times" is just bind the parameter into the prepared statement and call sqlite_step -- these are fast operations. It's worth benchmarking whether blob open (SQLite's incremental blob I/O, sqlite3_blob_open) might not be even faster. In other words, I recommend sticking with SQLite and digging into its low-level interface (from managed C++ if you must) for maximum performance -- it's really an amazing little engine, and it has often surprised me favorably with its performance!
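For reference, a sketch of what prepare-once/bind-many can look like from .NET, assuming the System.Data.SQLite ADO.NET provider and an id column stored as text (untested; class and connection-string details are illustrative):

    using System;
    using System.Data;
    using System.Data.SQLite;   // assumption: the System.Data.SQLite ADO.NET provider

    // Sketch: prepare the SELECT once, then only re-bind the parameter per lookup.
    class SqliteBlobStore : IDisposable
    {
        private readonly SQLiteConnection _conn;
        private readonly SQLiteCommand _select;

        public SqliteBlobStore(string dbPath)
        {
            _conn = new SQLiteConnection("Data Source=" + dbPath);
            _conn.Open();
            _select = new SQLiteCommand("SELECT data FROM store WHERE id = @id", _conn);
            _select.Parameters.Add("@id", DbType.String);
            _select.Prepare();                          // compile the statement once
        }

        public byte[] Get(Guid id)
        {
            _select.Parameters["@id"].Value = id.ToString();   // re-bind and execute
            return (byte[])_select.ExecuteScalar();
        }

        public void Dispose()
        {
            _select.Dispose();
            _conn.Dispose();
        }
    }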

Alex Martelli
I am already preparing my statements, though I never tried blob open. Need to assess its performance. Thanks.
Hugo S Ferreira
A: 

Did you consider trying an object database, like db4o? It can persist any CLR object and access them quickly with a query language (it supports LINQ!). I haven't had millions of objects, but with a few thousand, access was fairly quick; no major difference from a similar SQL query with an indexed id field.

Hrvoje
That seems interesting. I think I'll do some performance tests with it.
Hugo S Ferreira
Hugo, how did those performance tests go?
Alex Baranosky
A: 

How about a binary file with fixed-size blocks of around 2k, where the first 4 bytes of each block hold the length of the object...

The location of object i is i*2048 bytes; read 2048 bytes for the object and take the actual object length from the first 4 bytes (unsigned).
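A minimal (untested) C# sketch of reading from that kind of fixed-block file; it only works while no object's payload exceeds 2044 bytes:

    using System;
    using System.IO;

    static class BlockFile
    {
        private const int BlockSize = 2048;

        // Read object number i: seek to i * 2048, read one block, and take the
        // real payload length from the first 4 bytes (unsigned, little-endian).
        public static byte[] ReadObject(string path, long i)
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                fs.Seek(i * BlockSize, SeekOrigin.Begin);

                var block = new byte[BlockSize];
                fs.Read(block, 0, BlockSize);

                int length = (int)BitConverter.ToUInt32(block, 0);
                var payload = new byte[length];
                Array.Copy(block, 4, payload, 0, length);
                return payload;
            }
        }
    }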

Osama ALASSIRY
Although the typical object is very small, nothing prevents it from being larger than 2k. I think the biggest object I have is around 30k in this particular instantiation of the warehouse. Relying on fixed-size chunks would probably require partitioning big objects and dealing with consistency issues. Nice suggestion, but I would rather offload those problems to the underlying infrastructure.
Hugo S Ferreira
This won't work in that case; a database could be your best choice...
Osama ALASSIRY
A: 

I like Earwicker's solution. The way I've dealt with this is very similar.

What I did was this:

Let's say your guid is 3F2504E0-4F89-11D3-9A0C-0305E82C3301.

Hash the GUID down to a three-letter hash, aaa-zzz.

Suppose, for the sake of argument, that your guid hashes down to "xap".

Your info will be found in the file c:\store\x\xa\xap\3F2504E04F8911D39A0C0305E82C3301.dat

Naturally, there are many variants of this strategy. For instance, xap could be a file with all the binary objects appended together, with a header or an external file that has the guids and offsets into the file.
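As an illustration, an (untested) C# sketch of one possible way to fold a GUID down to a three-letter bucket and build the path; the particular hash function is an arbitrary choice:

    using System;
    using System.IO;

    static class HashedPathStore
    {
        // Fold the GUID's bytes into a number and turn it into a 3-letter code
        // (aaa..zzz, 17,576 buckets). The exact hash used here is arbitrary.
        public static string ThreeLetterHash(Guid id)
        {
            int h = 0;
            foreach (byte b in id.ToByteArray())
                h = (h * 31 + b) % 17576;               // 26^3 buckets

            char first  = (char)('a' + h / 676);
            char second = (char)('a' + (h / 26) % 26);
            char third  = (char)('a' + h % 26);
            return new string(new[] { first, second, third });
        }

        // e.g. "xap" -> c:\store\x\xa\xap\3F2504E04F8911D39A0C0305E82C3301.dat
        public static string PathFor(Guid id, string root = @"c:\store")
        {
            string hash = ThreeLetterHash(id);
            return Path.Combine(root,
                                hash.Substring(0, 1),
                                hash.Substring(0, 2),
                                hash,
                                id.ToString("N").ToUpperInvariant() + ".dat");
        }
    }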

Nosredna
A: 

You can check whether HDF5 structures are suitable for your task.

zzr
Never heard of it. Gonna check. Thx.
Hugo S Ferreira
You're welcome :) I'm experimenting with HDF5 via PyTables from Python in my current project, and I may try to use it as an intermediate data structure between Python "ETL" scripts and analysis in R. If you share your test results, that would be great :)
zzr
Yes, I'll definitely publish some comparative results as soon as I implement these several strategies.
Hugo S Ferreira
A: 

I tend to agree with Alex: if you write your own solution you are reinventing stuff that's likely already in SQLite. But if you must...

You can likely make a BTree work here. It is the workhorse of any database, and your problem space isn't all that bad. Tens of millions of 1k objects is still only tens of billions of bytes, so the file is manageable by the OS, and there are a lot of BTree examples out there to try.

Compared with using the filesystem directory structure to essentially create a BTree analogue, using a real BTree is going to be far faster.

Another solution that might be of interest is MogileFS, which is a distributed, redundant file system.

jottos
+1 for MogileFS.
Jauder Ho
A: 

I don't know whether SQLite supports indexes or not, but if it does, then you can speed things up by creating an index over the ID field.

If it doesn't, then your best option is a B+ tree. Thanks

mfawzymkh