views:

422

answers:

6

I am using Delphi 2009. I have a very simple data structure, with 2 fields:

  1. A string that is the key field I need to retrieve by and is usually 4 to 15 characters in length.
  2. A string that is the data field that may be any size, from 1 character up to say 10,000 characters.

The difficulty is that I may have several million of these records, so they may total as much or more than 10 GB in size. Obviously, I'm looking for an on-disk solution rather than an in-memory solution.

My program needs to randomly retrieve these records, based on the key field. That's that part that needs to be made as efficient as possible.

Should I use a database for such a simple structure, and if so, which database would be best to handle this and be simplest to implement?

Alternatively, is there a simple on-disk data structure not requiring a full-blown database that would work just as well?


Well, all I needed was the one answer to kick me back into reality. I was looking for something simpler than even a simple database. But when the no-duh answer is to use a database, then I realize I've already answered this question with my own answer to another question: Best database for small applications and tools.

My answer was DISQLite3 for the reasons I specified there. And that's what I'll probably with for my implementation.


A few more good answers with some possibilities. That's great. I'll be able to try a few different methods to see what works best.


More contemplation, and I have had to change the accepted answer to the GpStructuredStorage solution.

In my case, a million records totalling several Gigabytes will put a strain on a database structure. Specifically, the B* tree that is used to store the index in most databases is fast, but will slow down for some operations such as reindexing a million values.

The only thing you'll find faster than B* for an index is a hash table. And that is precisely what is provided in gabr's suggested addition to the GpStructuredStorage solution. I think it's quite elegant the way he segmented the hash value to give a 4 level directory structure.

The key reason why I can go to a hash solution is that I only need random access by the keys. I do not need to sort by the keys. If sorting was needed, then the gains of the speed of the hash table would be lost and the database system would be a no-brain winner.

When I get down to implementing this, I should do a comparison of this technique versus a database. Maybe I'll compare with both Firebird and SQLite which would both be worthy opponents.

+5  A: 

For more than 10GB in data, a database is exactly what you need. It will handle indexing for rapidly locating data (your random retrieval), the functionality for adding, modifying, and deleting data, and the actual storage, as well as much more if you so choose.

There are dozens of posts here related to which databases are available for use in Delphi, including built-ins and FOS ones like Firebird.

Ken White
So which database would be best geared towards this specific data structure, or are you saying it doesn't matter?
lkessler
It doesn't matter. There are several viable alternatives; Firebird is one I've used myself and it works really well, and it's free to boot. :-) There are others that may work better, depending on your specific needs (contained fully in the executable, or no DBA required, etc.); I can't recommend anything more specific without more info on what you're actually doing.
Ken White
It does seem a shame to carry around an entire database for a single two-column. So I'd choose the smallest, simplest, least-capable database that meets your performance requirements. In particular, you don't seem to need SQL or relationality, so there would be no value to you in having those features supported. BDB (below) would be a good option if Delphi bindings are available.
Larry Lustig
@Larry: I'd have to disagree. :-) 10GB of data needing quick random access is more than a simple two-column database. Of course, your definition of "quick" might be different than mine. Considering that the *standard* (note the italics) RAM in a desktop now is 3 or 4 GB, 10GB is pretty big. The key is the "quick" part, which cries out for an optimized database with good index capabilities. Zeos is a good library for accessing several databases, as long as you're not using D2009 or higher; they don't have a non-beta Unicode version available yet.
Ken White
@Larry: Support for unneeded features doesn't take anything away. However, a mature database system has probably much more work invested already into file access and caching strategies than one could reasonably achieve on their own, when writing a "simple" on-disk data structure. Why not make use of it?
mghie
+1  A: 

BerkleyDB is exactly that

glebm
Does it have Delphi bindings? A quick google doesn't show much.
Larry Lustig
http://code.google.com/p/bdbdelphi/
glebm
@Glex: One developer, no download available, and the SVN contains 2 revisions, the last one nearly 3 years old. Doesn't look that promising.
mghie
These are just bindings dude. If BerkleyDB changed that much in the last 3 years, it won't be that hard to fix the API changes anyway.
glebm
Maybe BerkleyDB didn't change that much, but I doubt the files will work without any changes in Delphi 2009 or 2010. Did you try?
mghie
There is an article in the issues section on what to change so that it works in Delphi 7. Delphi >7 is a strict subset of Delphi 7.However, good point, I have not tried it.
glebm
Here are my FPC bindings btw http://www.stack.nl/~marcov/bdb.zip
Marco van de Voort
+1  A: 

If you more often need large datasets, and have some money to spare, simply stuff 16GB ( 500-750 Eur) in a machine, and make a separate process with some 64-bit compiler (*) of it that you query over e.g. shared mem or other IPC method.

In that case you can use the in-memory approach till Delphi 64-bit finally comes out. Since your data seems to be simple ( a map from array of char to array of char) it is easily to export over IPC.

This is of course if this approach has any merit for your case (like it is a cache or so), which I can't determine from your question.

(*) I recommend FPC of course :-)

I did this once, till about 5 million objects, 5 GB of data.

I got permission to open source the container types I made for it, they are at:

http://www.stack.nl/~marcov/lightcontainers.zip (warning: very dirty code)

mghie: to answer in another cliche: There is no silver bullet

Databases have a lot of other assumptions too

  • their generalized approach make relative inefficient use of memory. Most notably your dataset using normal memory storage techniques falls inside the affordable memory ranges, which are of course typically bigger for a server (my bad assumption here, apparantly) than for a client.
  • databases assume that their resultsets can reduced to small sets within the database-server with a relative straight kind of processing, and assisted by indexing.
  • they have a relatively high latency.
  • they are relatively bad in some kinds of processing (like multidimensional analysis/ OLAP, which is why databases need to be extended for that)

This makes databases relatively bad for use in e.g. caches, loadbalancers etc. Of course that is all provided that you need the speed. But the initial question felt a bit speed-sensitive to me.

In a past job my function in an database oriented firm was to do everything but that, IOW fix the problems when the standard approach couldn't hack it (or required 4 socket Oracle servers for jobs where the budget didn't warrant such expenses). The solution/hack written above was a bit of OLAPpy, and connected to hardware (a rfid chipprogramming device), requiring some guaranteed response time. Two months of programming time, still runs, and couldn't even buy a windows server + oracle license for the cost.

Marco van de Voort
The problem with that approach is that all of the data needs to be read into memory on application startup, or at least the indexing structures need to. The indexing structures need to be calculated and written first, too. When developing all this one starts to duplicate much of the work that has been done already for database systems...
mghie
Yes, mghie identifies the main problem. Also, it would be very expensive for me to buy all my users 16 GB of RAM for their machines. :-)
lkessler
lkessler: If it is for an user app, forget it. mghie: I need to have more room to comment on you, I'll do that in the post.
Marco van de Voort
@Marco: Thanks for the edit, I understand now where you're coming from and agree with most points. But of your 4 points about database assumptions at least the second is a clear match for this question, and the fourth doesn't matter here. The other two are of real concern, but while I think a tuned hand-written solution can easily outperform many databases it is doubtful that the development effort would be well spent, in the context of this question. After all it's probably only a part of the whole application. Effort should be spent only *after* it's clear (measured!) that's the bottleneck.
mghie
In this case not I guess (though I'd have to rewind the original question to which I originally replied). I was hacking synedit late at night when I replied it, and I was frustrated. I assume you know the feeling :-)
Marco van de Voort
+1  A: 

Since your data is more than 3GB, you will need to make sure what ever database engine you select either handles tables that large, or split things up into multiple tables, which I would suggest doing no matter what the maximum size of a single table. If you perform the split, perform it as evenly as possible on a logical key break so that its easy to determine which table to use by the first or first two characters of the key. This will greatly reduce the search times by eliminating any records which could never match your query to start with.

If you just want raw performance, and will only be performing read only lookups into the data, then your better served by an ordered index file(s) using a fixed size record for your keys which points to your data file. You can then perform a binary search easily on this data and avoid any database overhead. For even more of a performance gain, you can pre-load/cache the midpoints into memory to reduce repetitive reads.

A simple fixed size record for your specs might look like:

type
  rIndexRec = record
    KeyStr  : String[15];  // short string 15 chars max
    DataLoc : integer;     // switch to int64 if your using gpHugeFile
  end;

For initial loading, use the Turbo Power sort found in the SysTools, which the latest version for Delphi 2009/2010 can be downloaded on the songbeamers website. The DataLoc would be the stream position of your datastring record, which writing/reading might look like the following:

function WriteDataString(aDataString:String;aStream:tStream):integer;
var
  aLen : integer;
begin
  Result := aStream.Position;
  aLen := Length(aDataString);
  aStream.Write(aLen,sizeOf(aLen));
  aStream.Write(aDataString[1],aLen*sizeOf(Char));
end;

function ReadDataString(aPos:Integer;aStream:tStream):String;
var
  aLen : integer;
begin
  if aStream.Position <> aPos then
    aStream.Seek(aPos,soFromBeginning);
  result := '';
  aStream.Read(aLen,SizeOf(aLen));
  SetLength(Result,aLen);
  if aStream.Read(Result[1],aLen*sizeOf(Char)) <> aLen*SizeOf(Char) then
    raise Exception.Create('Unable to read entire data string');
end;

When you are creating your index records, the dataloc would be set to datastring record position. It doesn't matter the order in which records are loaded, as long as the index records are sorted. I used just this technique to keep a 6 billion record database up to date with monthly updates, so it scales to the extreme easily.

EDIT: Yes, the code above is limited to around 2GB per datafile, but you can extend it by using gpHugeFile, or segmenting. I prefer the segmenting into multiple logical files < 2gb each, which will take up slightly less disk space.

skamradt
Your code is limited to 2GB files until a 64 bit Delphi becomes available.
mghie
Possibly. It would require timing tests to see how fast it is and comparisons to already developed databases to verify.
lkessler
+2  A: 

You should analyse your data. If

  1. a sizeable part of the data values is larger than the default file system block size,
  2. you don't want to search in the data values using SQL (so it doesn't matter what format they are stored in), and
  3. you really need random access over the whole database,

then you should test whether compressing your data values increases performance. The decompression of data values (especially on a modern machine with multiple cores, performed in background threads) should incur only a small performance hit, but the gains from having to read fewer blocks from the hard disc (especially if they are not in the cache) could be much larger.

But you need to measure, maybe the database engine stores compressed data anyway.

mghie
That's a good thought. Compressing may speed up. I'll implement first, see if it's fast enough. If not, I'll try your suggestion.
lkessler
+3  A: 

why all the bragging? just use GpStructuredStorage(4 TB restriction and with little time invested in class you can go over), it will take you few hours to get used to it but it worths the time. Hope it helps

GpStructuredStorage can retrieve the file names extremely fast(I've tested it), you need to save each record as a file in GpStructuredStorage and retrieve each name as a string in a string list, 1 milion string names(because you mentioned about stringlist) needs a few MB in RAM which is not much, use a TStream descendant to write data to a file in GpStructuredStorage, I do not have time today to write a simple example, but on Saturday or Sunday I will write a tutorial for GPStructuredStorage on my blog.


[Added by gabr - I hope that will not be considered a terrible breach of netiquette. It's just that my thoughts don't fit in a comment (sizewise) and that it feels stupid to add another answer just to write this ...]

While GpStructuredStorage can store loads of data, finding it may be a slow process. What I usually do in such cases is to create a hash of the key and convert it into hex (say 00FFA784). Then I convert this hex hash into folder structure (in this case it would be /00/FF/A7/84) and store relevant data in this folder, either as a file, as attributes or a combination of both.

This approach speeds data retrieval but slows down data insertions and is therefore recommended only for mostly static data. If the data is fairly dynamic I'd certainly recommend using database and not GpStructuredStorage.

delphigeist
I heard about GPStructuredStorage, but wouldn't have tought it might be applicable to this problem. Maybe it is. The only question would be once it's got a million items in it, will it become as slow as a Windows file system, or did gabr optimize the retrieval with a fast tree or his great cache? I might just try this when I compare. Price is right.
lkessler
Actually, I have no idea how GpStructuredStorage would behave on such big set but you're certainly welcome to try :)
gabr
As it turns out, the database will be created all at once the first time (imported from another data set). After that, it will be updated on a record by record basis (only a few records at a time), so maybe in my case, with gabr's suggestion on using your hash function as the folder path, GpStructuredStorage might be a possible solution.
lkessler
I especially like gabr's idea of grouping the hash into 2 character combinations to make a 4 level folder structure. With a million records, each level will only average 33 entries which should make finding the files fairly quick, even if just a sequential directory/file search is used.
lkessler
gabr: In http://stackoverflow.com/questions/222699/which-embedded-database-to-use-in-a-delphi-application/222725#222725 you answered Firebird, and I commented agreeing with you. But I haven't tried Firebird yet. How would you compare it with the GPStructuredStorage solution for my particular problem?
lkessler
I don't think that quickly finding the data would be a problem for any database, but it's very important to cause as little I/O as possible when searching for and retrieving data. I have no experience with `GpStructuredStorage`, so I can't comment on it, but database systems make sure that indexing structures are as small as possible and in adjacent disc blocks so as to minimize hard disc seek times. Does `GpStructuredStorage` maintain folder structure information in one continuous area? Is there any technique to minimize or even prevent fragmentation caused by file modifications?
mghie
delphigeist: I'm looking forward to your tutorial on GPStructuredStorage. Have you compared its speed to a database?
lkessler