Hi All,

I have an application which may need to process billions of objects. Each object is of the TRange class type. These ranges are created at different points in an algorithm, depending on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all of the (billions of) objects and add them to a collection, the system throws an out-of-memory error. Now I want to iterate through each object, mainly for two purposes:

  1. To apply an operation to each TRange object (e.g. output certain properties).
  2. To get a cumulative sum of a certain property (e.g. each range has a Weight property, and I want to retrieve a total weight that is the sum of all the range weights).

How do I effectively create an iterator for these objects without raising an out-of-memory error?

I have handled the first case by passing a procedure pointer to the algorithm function. For example:

procedure createRanges(aProc: TRangeProc); // aProc is a pointer to a procedure that takes a TRange
var
  range: TRange;
  rangerec: TRangeRec;
begin
  range := TRange.Create;
  try
    while canCreateRange do begin // certain conditions needed to create a range
      rangerec := ReturnRangeRec;
      range.Update(rangerec); // don't create a new object, reuse the same one
      if Assigned(aProc) then aProc(range);
    end;
  finally
    range.Free;
  end;
end;

But the problem with this approach is that to add new functionality, say retrieving the total weight I mentioned earlier, I either have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.

Thank you all in advance
Pradeep

+7  A: 

For such large amounts of data you should only keep a portion of the data in memory; the rest should be serialized to the hard drive. I tackled a problem like this as follows:

  1. I created an extended storage that can keep a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live simultaneously in memory.
  2. Then I derived the record classes from the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
  3. Every time you need a new or already existing record, you ask the extended storage for it. If the maximum number of objects is exceeded, the storage streams some of the least-used records back to the hard drive.

This way the records are transparent: you always access them as if they were in memory, but they may get loaded from the hard drive first. It works really well. By the way, RAM works in a very similar way: it only holds a certain subset of all the data on your hard drive. That is your working set.

I did not post any code because it is beyond the scope of the question itself and would only confuse.
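For illustration only, here is a rough sketch of what such an extended storage could look like (TStorableRec, TDiskBackedStore and the file layout are made-up names, not the actual classes; error handling and the destructor are omitted):

uses Classes, SysUtils;

type
  TStorableRec = class // record classes stream themselves (step 2 above)
  public
    Id: Integer;
    procedure SaveToStream(S: TStream); virtual; abstract;
    procedure LoadFromStream(S: TStream); virtual; abstract;
  end;
  TStorableRecClass = class of TStorableRec;

  TDiskBackedStore = class
  private
    FInMemory: TList;      // records currently in memory, least recently used first
    FMaxInMemory: Integer; // maximum number of records kept in memory at once
    FFolder: string;       // folder that evicted records are streamed to
    function FileNameFor(Id: Integer): string;
    procedure EvictOldest;
  public
    constructor Create(const AFolder: string; AMaxInMemory: Integer);
    procedure Add(Rec: TStorableRec);
    function Get(Id: Integer; RecClass: TStorableRecClass): TStorableRec;
  end;

constructor TDiskBackedStore.Create(const AFolder: string; AMaxInMemory: Integer);
begin
  inherited Create;
  FInMemory := TList.Create;
  FFolder := AFolder;
  FMaxInMemory := AMaxInMemory;
end;

function TDiskBackedStore.FileNameFor(Id: Integer): string;
begin
  Result := IncludeTrailingPathDelimiter(FFolder) + IntToStr(Id) + '.rec';
end;

procedure TDiskBackedStore.EvictOldest;
var
  Rec: TStorableRec;
  S: TFileStream;
begin
  Rec := TStorableRec(FInMemory[0]); // least recently used record
  S := TFileStream.Create(FileNameFor(Rec.Id), fmCreate);
  try
    Rec.SaveToStream(S);             // the record serializes itself
  finally
    S.Free;
  end;
  FInMemory.Delete(0);
  Rec.Free;                          // only the on-disk copy remains
end;

procedure TDiskBackedStore.Add(Rec: TStorableRec);
begin
  if FInMemory.Count >= FMaxInMemory then
    EvictOldest;                     // keep the working set bounded
  FInMemory.Add(Rec);                // newest entry is the most recently used
end;

function TDiskBackedStore.Get(Id: Integer; RecClass: TStorableRecClass): TStorableRec;
var
  i: Integer;
  S: TFileStream;
begin
  for i := 0 to FInMemory.Count - 1 do
    if TStorableRec(FInMemory[i]).Id = Id then
    begin
      Result := TStorableRec(FInMemory[i]);
      FInMemory.Move(i, FInMemory.Count - 1); // mark as most recently used
      Exit;
    end;
  Result := RecClass.Create;         // not in memory: stream it back in from disk
  Result.Id := Id;
  S := TFileStream.Create(FileNameFor(Id), fmOpenRead);
  try
    Result.LoadFromStream(S);
  finally
    S.Free;
  end;
  Add(Result);
end;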

Runner
+1  A: 

Look at TgsStream64. This class can handle huge amounts of data through file mapping.

http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas

Andrei K.
File mapping won't help much, since a file map still limits the amount of data in memory at any moment. While you could map a file of hundreds of gigabytes, a memory map will only give you a view of 1 to 2 GB of data at a time.
Workshop Alex
A: 

"Runner" has a good answer how to handle this!

But I would like to know if you could do a quick fix: make the TRange objects smaller. Maybe you have a big ancestor class? Can you take a look at the instance size of the TRange object? Maybe you would be better off using packed records?
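For example, a quick way to check it (InstanceSize is a class method inherited from TObject; the Writeln assumes a console app):

// Shows the per-instance memory footprint of TRange, including hidden overhead.
Writeln('TRange instance size: ', TRange.InstanceSize, ' bytes');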

André
Delphi can only create 32-bit apps, which means you're limited to 2GB of address space. Every object has 4 bytes of overhead for the VMT pointer, so even with *no* payload, 1 billion records = 4 GB. And OP said "billion*s*", plural. So no, smaller objects won't help.
Joe White
Actually, for D2009+, with the Monitor field that was added in D2009, every TObject instance has an 8-byte overhead: 4 bytes for the class (VMT) pointer and 4 bytes for the monitor.
Marjan Venema
In Delphi 2009 (and higher), every "empty" object occupies a full 8 bytes because of the TMonitor instance reference - see http://blogs.teamb.com/craigstuntz/2009/03/25/38138/ ... second :)
mjustin
+1  A: 

But the problem with this approach is that to add new functionality, say retrieving the total weight I mentioned earlier, I either have to duplicate the algorithm function or pass an optional out parameter.

It's usually done like this: you write an enumerator function (like you did) which receives a callback procedure pointer (you did that too) and an untyped pointer ("Data: Pointer"). You define the callback so that its first parameter is that same untyped pointer:

type
  TRangeProc = procedure(Data: Pointer; range: TRange);

procedure enumRanges(aProc: TRangeProc; Data: Pointer);
begin
  {for each range}
    aProc(Data, range); // pass the caller's Data pointer along with each range
end;

Then if you want to, say, sum all ranges, you do it like this:

type
  TSumRecord = record
    Sum: Int64;
  end;
  PSumRecord = ^TSumRecord;

procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
  SumRecord^.Sum := SumRecord^.Sum + range.Weight;
end;

function SumRanges: Int64;
var SumRec: TSumRecord;
begin
  SumRec.Sum := 0;
  enumRanges(TRangeProc(@SumProc), @SumRec);
  Result := SumRec.Sum;
end;

Anyway, if you need to create billions of ANYTHING you're probably doing it wrong (unless you're a scientist modelling something extremely large-scale and detailed). Even more so if you need to create billions of things every time you want one of them. This is never good. Try to think of alternative solutions.

himself
There are perfectly valid cases where you need a billion objects. I had a situation like this when playing around with a custom XML database. I had to atomize the XML documents, which resulted in a lot of objects (one for each node, attribute, etc.). The only way was to extend the storage to the hard drive. XML databases are usually very large, often 10 times larger than the actual XML documents.
Runner
I do not claim, however, that in his case it cannot be done differently. I do not know the problem deeply enough to say that.
Runner
I don't say there aren't, but most of the time when people say "my case really needs billions of objects", they're wrong. You're probably no exception. I don't know what your objective was, but you could have just kept the DB on the hard drive and read parts of it only when they were needed. If you wanted random access, you could have built indexes in a single pass.
himself
I did that. I used the approach I suggested in my answer. I never had a billion objects in memory. It seems we did not understand each other. I just said that there are cases when a billion objects are needed. But I agree, cases where a billion objects in memory are needed are very, very rare.
Runner
A: 

This part:

As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects.

sounds a bit like calculating Fibonacci numbers. Maybe you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach - it works by storing already calculated intermediate results in a hash map.
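As a rough sketch of that hash-map idea in Delphi (assuming Delphi 2009+ for TDictionary; ComputeRange and the Integer key are hypothetical stand-ins for however the ranges are actually derived):

uses Generics.Collections;

var
  RangeCache: TDictionary<Integer, TRange>; // cache of already computed ranges

function GetRange(Key: Integer): TRange;
begin
  // Reuse a previously computed range instead of creating a redundant copy.
  if not RangeCache.TryGetValue(Key, Result) then
  begin
    Result := ComputeRange(Key);      // hypothetical: the expensive creation step
    RangeCache.Add(Key, Result);
  end;
end;

// During setup: RangeCache := TDictionary<Integer, TRange>.Create;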

Ulrich Gerhardt
A: 

Handling billions of objects is possible but you should avoid it as much as possible. Do this only if you absolutely have to...
I once created a system that needed to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around them decided when an object would be saved to disk and thus removed from memory. Basically, whenever I requested an object, this class would check whether it was loaded or not. If not, it would re-create the object from disk, put it on top of a stack, and then write the bottom object of this stack to disk. As a result, my stack had a fixed (maximum) size, and it allowed me to use an unlimited number of objects, with reasonably good performance too.
Unfortunately, I don't have that code available anymore; I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support, plus a bunch more for the stack controller which maintains all those objects. But it would technically allow you to create an unlimited number of objects, since you're trading RAM for disk space.
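A minimal sketch of what the streaming support could look like (not the original code, which is no longer available; it assumes TRange carries only a Weight field):

uses Classes;

type
  TRange = class
  public
    Weight: Double;
    procedure SaveToStream(S: TStream);
    procedure LoadFromStream(S: TStream);
  end;

procedure TRange.SaveToStream(S: TStream);
begin
  S.WriteBuffer(Weight, SizeOf(Weight)); // write the raw field; add more fields as needed
end;

procedure TRange.LoadFromStream(S: TStream);
begin
  S.ReadBuffer(Weight, SizeOf(Weight));  // read the fields back in the same order
end;

The stack controller on top of this would call SaveToStream when it pushes an object out of memory and LoadFromStream when it pulls it back in.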

Workshop Alex
I suggested a very similar approach. But it is better to stream the least-used objects back to the hard drive. You just track a usage count for each object. This way you always have an optimal working set. I have the code, but it is not very easy to understand because it is part of my XML database research project.
Runner