I'm reading in a large text file with 1.4 million lines that is 24 MB in size (average 17 characters a line).

I'm using Delphi 2009 and the file is ANSI but gets converted to Unicode upon reading, so you can fairly say the text, once converted, is 48 MB in size.

( Edit: I found a much simpler example ... )

I'm loading this text into a simple StringList:

  AllLines := TStringList.Create;
  AllLines.LoadFromFile(Filename);

I found that the lines of data seem to take much more memory than their 48 MB.

In fact, they use 155 MB of memory.

I don't mind Delphi using 48 MB or even as much as 60 MB allowing for some memory management overhead. But 155 MB seems excessive.

This is not a fault of StringList. I previously tried loading the lines into a record structure, and I got the same result (160 MB).

I don't see or understand what could be causing Delphi or the FastMM memory manager to use 3 times the amount of memory necessary to store the strings. Heap allocation can't be that inefficient, can it?

I've debugged this and researched it as far as I can. Any ideas as to why this might be happening, or ideas that might help me reduce the excess usage would be much appreciated.

Note: I am using this "smaller" file as an example. I am really trying to load a 320 MB file, but Delphi is asking for over 2 GB of RAM and running out of memory because of this excess string requirement.

Addendum: Marco Cantu just came out with a White Paper on Delphi and Unicode. Delphi 2009 has increased the overhead per string from 8 bytes to 12 bytes (plus maybe 4 more for the actual pointer to the string). An extra 16 bytes per 17 x 2 = 34-byte line adds almost 50%. But I'm seeing over 200% overhead. What could the extra 150% be?
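For reference, here is my back-of-the-envelope per-line estimate using those figures (heap allocator granularity and the list array's slack are not counted):

  payload:         17 chars x 2 bytes = 34 bytes
  string header:   12 bytes (Delphi 2009)
  null terminator: 1 char x 2 bytes   =  2 bytes
  list pointer:    4 bytes
  -----------------------------------------------
  total:           ~52 bytes, roughly 50% over the 34-byte payload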


Success!! Thanks to all of you for your suggestions. You all got me thinking. But I'll have to give Jan Goyvaerts credit for the answer, since he asked:

...why are you using TStringList? Must the file really be stored in memory as separate lines?

That led me to the solution: instead of loading the 24 MB file as a 1.4-million-line StringList, I can group my lines into natural groups my program knows about. This resulted in 127,000 lines loaded into the string list.

Now each line averages 190 characters instead of 17. The overhead per StringList line is the same but now there are many fewer lines.

When I apply this to the 320 MB file, it no longer runs out of memory and now loads in less than 1 GB of RAM. (And it only takes about 10 seconds to load, which is pretty good!)

There will be a little extra processing to parse the grouped lines, but it shouldn't be noticeable in the real-time processing of each group.
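Roughly, the grouping pass looks like this (a sketch, not my exact code; it assumes TStreamReader from Delphi 2009's Classes unit, and that a line starting with '0' begins a new group, GEDCOM-style):

  // uses Classes, SysUtils
  procedure LoadGrouped(const Filename: string; Groups: TStringList);
  var
    Reader: TStreamReader;
    Line, Current: string;
  begin
    Reader := TStreamReader.Create(Filename, TEncoding.Default);
    try
      Current := '';
      while not Reader.EndOfStream do
      begin
        Line := Reader.ReadLine;
        if (Line <> '') and (Line[1] = '0') and (Current <> '') then
        begin
          Groups.Add(Current);   // flush the finished group
          Current := '';
        end;
        if Current = '' then
          Current := Line
        else
          Current := Current + #10 + Line;  // cheap separator for later parsing
      end;
      if Current <> '' then
        Groups.Add(Current);     // don't forget the last group
    finally
      Reader.Free;
    end;
  end;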

(In case you were wondering, this is a genealogy program, and this may be the last step I needed to allow it to load all the data about one million people in a 32-bit address space in less than 30 seconds. So I've still got a 20-second buffer to play with to add the indexes into the data that will be required to allow display and editing of the data.)

+1  A: 

Are you relying on Windows to tell you how much memory the program is using? It's notorious for overstating the memory used by a Delphi app.

I do see plenty of extra memory use in your code, though.

Your record structure is 20 bytes; if there is one such record per line, you're looking at more data for the records than for the text.

Furthermore, a string has an inherent 4-byte overhead, which is another 25%.

I believe there is a certain amount of allocation granularity in Delphi's heap handling but I don't recall what it is at present. Even at 8 bytes (two pointers for a linked list of free blocks) you're looking at another 25%.

Note that we are already up to over a 150% increase.

Loren Pechtel
The overhead of a UnicodeString is four bytes for the length, four bytes for the reference count, and two bytes for the null at the end.
Rob Kennedy
In my previous example with records, I specifically stated I was comparing loading the record and assigning the string to loading the record without assigning the string. Therefore the difference was due to the string alone, and not the 20 bytes in the record.
lkessler
+7  A: 

What if you made your original record use AnsiString? That chops it in half immediately. Just because Delphi defaults to UnicodeString doesn't mean you have to use it.

Additionally, if you know the length of each string exactly (within a character or two), then it might be even better to use short strings and shave off a few more bytes.
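For example (a sketch; the record names and the 20-character cap are assumptions):

  type
    // AnsiString field: 1 byte per character instead of 2
    TLineRecAnsi = record
      Text: AnsiString;
    end;

    // ShortString capped at 20 characters: a single length byte of
    // overhead and no heap allocation at all, but ANSI-only and fixed-size
    TLineRecShort = record
      Text: string[20];
    end;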

I am curious whether there might be a better way to accomplish what you are trying to do. Loading 320 MB of text into memory might not be the best solution, even if you can get it down to only requiring 320 MB.

Jim McKeeth
Good answer, and I'll think about it. My program is designed for Unicode, so it would be a shame to have to fall back to ANSI for very large files. I may try file memory mapping. I don't expect that will be fast enough for what I need, but you never know until you try.
lkessler
+4  A: 

By default, Delphi 2009's TStringList reads a file as ANSI, unless there is a Byte Order Mark identifying the file as something else, or you provide an encoding as the optional second parameter of LoadFromFile.

So if you are seeing that the TStringList is taking up more memory than you think, then something else is going on.

Nick Hodges
Thanks, Nick. Hmmm... Can't imagine what else is going on. My example is quite simple.
lkessler
+3  A: 

Are you by any chance compiling the program with FastMM sources from sourceforge and with FullDebugMode defined? In that case, FastMM is not really releasing unused memory blocks, which would explain the problem.

gabr
Good thought, but no. I'm using the FastMM in Delphi 2009. The only option I've changed is the compiler option to turn String Format Checking Off, as has been recommended on several blogs.
lkessler
+6  A: 

You asked me personally to answer your question here. I don't know the precise reason why you're seeing such high memory usage, but you need to remember that TStringList does a lot more than just load your file: it needs to load the file into memory, convert it from Ansi to Unicode, split it into one string for each line, and stuff those lines into an array that will be reallocated many times. Each of those steps requires memory and may result in memory fragmentation.

My question to you is why are you using TStringList? Must the file really be stored in memory as separate lines? Are you going to modify the file in-memory, or just display parts of it? Keeping the file in memory as one big chunk and scanning the whole thing with regular expressions that match the parts you want will be more memory efficient than storing separate lines.

Also, must the whole file be converted to Unicode? While your application is Unicode, your file is Ansi. My general recommendation is to convert Ansi input to Unicode as soon as possible, because doing so saves CPU cycles. But when you have 320 MB of Ansi data that will stay Ansi data, memory consumption will be the bottleneck. Try keeping the file as Ansi in memory, and convert only the parts you'll be displaying to the user.
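A sketch of that approach (RawData, StartPos and Len are hypothetical names; in Delphi 2009 the cast performs the Ansi-to-Unicode conversion):

  var
    RawData: AnsiString;  // the bulk of the file, 1 byte per character
    Shown: string;        // UnicodeString, built only for visible text
  begin
    Shown := string(Copy(RawData, StartPos, Len));  // Ansi -> Unicode here
  end;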

If the 320 MB file isn't a data file you're extracting certain information from, but a data set you want to modify, consider converting it into a relational database, and let the database engine worry how to manage the huge set of data with limited RAM.

Jan Goyvaerts
Thank you Jan for your ideas, which gives me more to think on. Your suggestion of "chunk" makes me want to try loading groups of strings, which average about 150 characters per group rather than the 17 characters per line. Genealogy software should be Unicode.
lkessler
Of course your software should be Unicode. But that doesn't mean you need to hold 320 MB of data in memory in Unicode, when the source isn't Unicode.
Jan Goyvaerts
+4  A: 

I'm using Delphi 2009 and the file is ANSI but gets converted to Unicode upon reading, so you can fairly say the text, once converted, is 48 MB in size.

Sorry, but I don't understand this at all. If your program needs to be Unicode, surely the file being "ANSI" (it must have some character set, like WIN1252 or ISO8859_1) isn't the right thing. I'd first convert it to UTF8. If the file does not contain any chars >= 128, that won't change a thing (the file will even be the same size), but you'll be prepared for the future.

Now you can load it into UTF8 strings, which will not double your memory consumption. On-the-fly conversion of the few strings visible on the screen at any one time to the Delphi Unicode string will be slower, but given the smaller memory footprint your program will perform much better on systems with little (free) memory.
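A sketch of what that looks like in Delphi 2009 (Raw and Visible are hypothetical names; the explicit casts perform the code-page conversions, since UTF8String is a code-page-aware AnsiString type):

  var
    Raw: AnsiString;     // a line as read from the ANSI file
    U: UTF8String;       // 1 byte per character for plain ASCII text
    Visible: string;
  begin
    U := UTF8String(Raw);   // Ansi -> UTF-8 (no growth for chars < 128)
    Visible := string(U);   // UTF-8 -> UTF-16, only for text being shown
  end;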

Now if your program still consumes too much memory with TStringList, you can always use TStrings or even IStrings in your program, and write a class that implements IStrings or inherits from TStrings and does not keep all the lines in memory. Some ideas that come to mind:

  1. Read the file into a TMemoryStream, and maintain an array of pointers to the first characters of the lines. Returning a string is then easy: you only need to build a proper string between the start of the line and the start of the next one, with the CR and LF stripped. (A sketch follows this list.)

  2. If this still consumes too much memory, replace the TMemoryStream with a TFileStream, and do not maintain an array of char pointers, but an array of file offsets for the line starts.

  3. You could also use the Windows API functions for memory-mapped files. That allows you to work with memory addresses instead of file offsets, but does not consume as much memory as the first idea.
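A minimal sketch of the first idea, assuming an ANSI file with CR/LF line endings (Buffer, LineStarts and GetLine are names made up for the example):

  // uses Classes
  var
    Buffer: TMemoryStream;         // whole file as raw ANSI bytes
    LineStarts: array of Integer;  // offset of each line's first character

  procedure IndexLines;
  var
    P: PAnsiChar;
    I, N: Integer;
  begin
    P := Buffer.Memory;
    N := 0;
    SetLength(LineStarts, 1024);
    I := 0;
    while I < Buffer.Size do
    begin
      if N = Length(LineStarts) then
        SetLength(LineStarts, N * 2);                  // grow geometrically
      LineStarts[N] := I;
      Inc(N);
      while (I < Buffer.Size) and (P[I] <> #13) and (P[I] <> #10) do
        Inc(I);
      if (I < Buffer.Size) and (P[I] = #13) then Inc(I);  // skip CR
      if (I < Buffer.Size) and (P[I] = #10) then Inc(I);  // skip LF
    end;
    SetLength(LineStarts, N);                          // trim to actual count
  end;

  function GetLine(Index: Integer): string;
  var
    P: PAnsiChar;
    E: Integer;
    Raw: AnsiString;
  begin
    P := Buffer.Memory;
    E := LineStarts[Index];
    while (E < Buffer.Size) and (P[E] <> #13) and (P[E] <> #10) do
      Inc(E);
    SetString(Raw, P + LineStarts[Index], E - LineStarts[Index]);
    Result := string(Raw);   // one line converted to Unicode, on demand
  end;

Wrapping IndexLines and GetLine in a TStrings descendant would let the rest of the program treat it like a read-only string list.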

mghie
Your 3 ideas are good. But converting to UTF8 is inefficient and wrong in Delphi 2009. I must either keep it in ANSI and convert to Unicode when I need to, or absorb the 24 MB extra (which I'm willing to do) and convert to Unicode for the program to use.
lkessler
Sorry, but I happen to disagree. UTF8 is the right format for data storage and data exchange, and since I/O is much slower than CPU processing it should give you not only smaller disk files, but better performance too. Whatever the internal string format, I would always use UTF8 for the data files.
mghie
Data files are often of much greater value than program code, so optimizing for a particular programming environment is wrong. Their format has to be expressive yet efficient, preferably standardized. UTF8 gives you all of that, and is most common outside of Windows too. What's not to like?
mghie
A: 

Why are you loading that amount of data into a TStringList? The list itself will have some overhead. Maybe TTextReader could help you.

TTextReader only helps to parse the input. I already do that myself very efficiently. I then have to put the parsed lines someplace. I originally tried using records and found this memory-use problem. Then I found the same problem in TStringList and left that in the question as a simpler example.
lkessler
+1  A: 

Part of it could be the block allocation algorithm. As your list grows, it increases the amount of memory allocated with each chunk. I haven't looked at it in a long time, but I believe it goes something like doubling the amount last allocated each time it runs out of room. When you start to deal with lists that large, your allocations end up much larger than you ultimately need.

EDIT: As lkessler pointed out, this increase is actually only 25%, but it should still be considered part of the problem. If you're just beyond the tipping point, there could be an enormous block of memory allocated to the list that isn't being used.
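For illustration, a sketch of sidestepping the growth by presizing (note this only helps if you Add the lines yourself, since LoadFromFile clears the list, and Capacity along with it, before loading):

  List := TStringList.Create;
  List.Capacity := 1400000;  // one up-front allocation instead of 25% grows
  // ...then Add the 1.4 million lines in a loop...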

skamradt
That was a good suggestion, but TStringList.Grow only increases the size by 25% each time, so the most overhead due to this is 25%.
lkessler