Hi,

I have a 1GB file containing pairs of string and long. What's the best way of reading it into a Dictionary, and how much memory would you say it requires?

The file has 62 million rows. I've managed to read it using 5.5GB of RAM.

Say 22 bytes of overhead per Dictionary entry: that's 1.5GB. A long is 8 bytes: that's 500MB. Average string length is 15 chars at 2 bytes per char: that's 2GB. The total is about 4GB, so where does the extra 1.5GB go?
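
As a quick sanity check, that estimate in code:

long rows = 62000000;
long perEntry = 22 /* Dictionary entry overhead */ + 8 /* long */ + 15 * 2 /* string chars */;
Console.WriteLine(rows * perEntry / (1024 * 1024 * 1024.0)); // ~3.5GB, call it 4GB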

The initial Dictionary allocation takes 256MB. I've noticed that every 10 million rows I read consume about 580MB, which fits quite nicely with the calculation above, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB. That's my missing 1.5GB; where does it go?

Thanks.

+1  A: 

You'll need to specify the file format, but if it's just something like name=value, I'd do:

Dictionary<string,long> dictionary = new Dictionary<string,long>();
using (TextReader reader = File.OpenText(filename))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] bits = line.Split('=');
        // Error checking would go here
        long value = long.Parse(bits[1]);
        dictionary[bits[0]] = value;
    }
}

Now, if that doesn't work, we'll need to know more about the file: how many lines there are, etc.

Are you using 64-bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)

The amount of memory required will depend on the length of the strings, number of entries etc.

Jon Skeet
3.5GB on 32-bit Windows.
Unkwntech
I thought 3.5GB was the amount of physical memory the whole system could use, with a 3GB per-process limit. Either way, it's less than 5 :)
Jon Skeet
And your application needs to be built as Any CPU or x64 to take advantage of a 64-bit system.
Aaron Fischer
32-bit Windows still has a 2GB per-process limit by default. The /3GB switch just gives large-address-aware processes up to 3GB of user address space, with the kernel keeping 1GB.
sixlettervariables
+4  A: 

Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)

The main problem is that there is a huge amount of data that presumably needs to be accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?

In any case, I would implement this as a sliding cache. E.g. I would start by loading as much into memory as is feasible (choosing what to load based as far as possible on my expected access pattern), and then keep track of when each element was last accessed. If I hit something that wasn't in the cache, it would be loaded and would replace the oldest item in the cache.

This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.
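
For illustration, here's a minimal sketch of such a sliding cache, assuming string keys and long values as in the question; SlidingCache and loadFromDisk are made-up names, and loadFromDisk stands for whatever routine fetches a missing pair from the file:

using System;
using System.Collections.Generic;

// Sliding (LRU-style) cache sketch: recently used entries stay in memory,
// and the least recently used entry is evicted once capacity is reached.
class SlidingCache
{
    private readonly int capacity;
    private readonly Func<string, long> loadFromDisk; // hypothetical miss handler
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, long>>> map;
    private readonly LinkedList<KeyValuePair<string, long>> recency; // front = most recently used

    public SlidingCache(int capacity, Func<string, long> loadFromDisk)
    {
        this.capacity = capacity;
        this.loadFromDisk = loadFromDisk;
        map = new Dictionary<string, LinkedListNode<KeyValuePair<string, long>>>(capacity);
        recency = new LinkedList<KeyValuePair<string, long>>();
    }

    public long Get(string key)
    {
        LinkedListNode<KeyValuePair<string, long>> node;
        if (map.TryGetValue(key, out node))
        {
            // Hit: move to the front so it counts as most recently used.
            recency.Remove(node);
            recency.AddFirst(node);
            return node.Value.Value;
        }

        // Miss: load from disk, evicting the oldest entry if the cache is full.
        long value = loadFromDisk(key);
        if (map.Count >= capacity)
        {
            map.Remove(recency.Last.Value.Key);
            recency.RemoveLast();
        }
        map[key] = recency.AddFirst(new KeyValuePair<string, long>(key, value));
        return value;
    }
}

The Dictionary plus linked list keeps both the lookup and the move-to-front bookkeeping O(1); tuning is then mostly a matter of picking the capacity.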

In any case, without knowing a little more about the problem, this is merely a 'general solution'.

It may be that just keeping it in a local instance of a SQL DB would be sufficient :)

Andrew Rollings
1 GB is the bare minimum for gaining any kind of performance improvement (after exploiting all possible access patterns). Actually, I'm really thinking about an in-memory DB.
Meidan Alon
Well, in that case, you could try the above... With a bit of tuning, it may well do what you need. :) Failing that, stick it in a local DB instance and let that take care of the caching.
Andrew Rollings
A: 

Loading a 1GB file into memory at once doesn't sound like a good idea to me. I'd virtualize access to the file by loading smaller chunks only when a specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1GB is a real mastodon...

Boyan
It may not sound like a good idea to you, but you can bet it sounds like a good idea to Google, because that's exactly what they do: store all their indexes in memory.
lubos hasko
@lubos hasko: Maybe it's easier when you have a Google datacenter. But assuming every client has one still doesn't sound like a good idea.
Boyan
+5  A: 

Maybe you can convert that 1GB file into a SQLite database with two columns, key and value, and create an index on the key column. After that you can query the database to get the values for the keys you provide.
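
A rough sketch of that, assuming the System.Data.SQLite ADO.NET provider and the same name=value line format Jon Skeet guessed at (file names are illustrative):

using System.Data.SQLite;
using System.IO;

using (SQLiteConnection conn = new SQLiteConnection("Data Source=pairs.db"))
{
    conn.Open();
    new SQLiteCommand("CREATE TABLE pairs (key TEXT, value INTEGER)", conn).ExecuteNonQuery();

    // Bulk-load inside a transaction, then build the index afterwards.
    using (SQLiteTransaction tx = conn.BeginTransaction())
    using (SQLiteCommand insert = new SQLiteCommand("INSERT INTO pairs VALUES (@k, @v)", conn))
    using (TextReader reader = File.OpenText("pairs.txt"))
    {
        insert.Parameters.Add(new SQLiteParameter("@k"));
        insert.Parameters.Add(new SQLiteParameter("@v"));
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] bits = line.Split('=');
            insert.Parameters["@k"].Value = bits[0];
            insert.Parameters["@v"].Value = long.Parse(bits[1]);
            insert.ExecuteNonQuery();
        }
        tx.Commit();
    }
    new SQLiteCommand("CREATE INDEX ix_pairs_key ON pairs (key)", conn).ExecuteNonQuery();

    // Lookups become indexed queries; SQLite's page cache keeps hot pages in memory.
    using (SQLiteCommand select = new SQLiteCommand("SELECT value FROM pairs WHERE key = @k", conn))
    {
        select.Parameters.AddWithValue("@k", "someKey");
        long value = (long)select.ExecuteScalar();
    }
}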

huseyint
+9  A: 

Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.

There is a simple method for deciding which parts are best kept in memory:

Put the data into a database.

A real one, like SQL Server Express, MySQL, or Oracle XE (all are free).

Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.

James Curran
A: 

Don't read a 1GB file into memory; even if you have 8GB of physical RAM, you can still run into plenty of problems (based on personal experience).

I don't know what you need to do, but find a workaround: read the file partially and process it in pieces. If that doesn't work, then consider using a database.

dr. evil
+1  A: 

I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.

Since you want to store it in a dict, I assume you need it for fast lookup? You have not clarified which one should be the key, though.

Let's hope you want to use the long values for keys. Then try this:

Allocate a buffer that's as big as the file. Read the file into that buffer.

Then create a dictionary with the long values (32-bit values, I guess?) as keys, and a 32-bit offset as each entry's value.

Now browse the data in the buffer like this: Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.

That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.

At least with C++, this would be a rather memory-efficient way, I think.
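
A rough C# sketch of that scheme, under the same assumptions (the long is the key, each line looks like key=value, and the text is ASCII; the class name and the line format are guesses, not the actual file format):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// One big buffer holds the raw file; the dictionary stores, per key, only the
// offset of that key's value inside the buffer.
class BufferIndex
{
    private readonly byte[] buffer;
    private readonly Dictionary<long, int> offsets = new Dictionary<long, int>();

    public BufferIndex(string path)
    {
        buffer = File.ReadAllBytes(path); // ~1GB once, instead of 62 million string objects

        int lineStart = 0;
        for (int i = 0; i <= buffer.Length; i++)
        {
            if (i == buffer.Length || buffer[i] == (byte)'\n')
            {
                int eq = Array.IndexOf(buffer, (byte)'=', lineStart, i - lineStart);
                if (eq > lineStart)
                {
                    long key = long.Parse(Encoding.ASCII.GetString(buffer, lineStart, eq - lineStart));
                    offsets[key] = eq + 1; // the value starts right after '='
                }
                lineStart = i + 1;
            }
        }
    }

    public string GetValue(long key)
    {
        int start = offsets[key];
        int end = start;
        while (end < buffer.Length && buffer[end] != (byte)'\n' && buffer[end] != (byte)'\r')
            end++; // the value runs to the end of its line
        return Encoding.ASCII.GetString(buffer, start, end - start);
    }
}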

Thomas Tempelmann
+6  A: 

It's important to understand what's happening when you populate a Hashtable. (Dictionary uses a hash table internally, so the same considerations apply.)

When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped onto one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.

At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.

So there are two things that come into play in terms of memory utilization.

The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.

The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.

You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to add, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.
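
For example, with the 62 million rows mentioned in the question:

// Pre-sizing avoids the repeated doubling (and the temporary 2x spike) during load.
Dictionary<string, long> dictionary = new Dictionary<string, long>(62000000);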

Robert Rossney
I've created the Dictionary with a big enough initial capacity. I can also see that memory usage grows slowly, without any big jumps.
Meidan Alon
+1  A: 

Can you convert the 1GB file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.

Perhaps you could memory-map the contents of this (more efficient) file, giving you minimal RAM usage and demand loading, which may be a good trade-off between accessing the file directly on disk all the time and loading the whole thing into a big byte array.
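
A small sketch of that, assuming a pre-built index file of fixed-size 16-byte records (an 8-byte key followed by an 8-byte value) and the System.IO.MemoryMappedFiles API introduced in .NET 4; the file name and layout are illustrative:

using System;
using System.IO.MemoryMappedFiles;

using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile("pairs.idx"))
using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
{
    const int recordSize = 16;  // assumed layout: 8-byte key, then 8-byte value
    long recordIndex = 1234;    // whichever record a lookup (e.g. a binary search) lands on
    long key = view.ReadInt64(recordIndex * recordSize);
    long value = view.ReadInt64(recordIndex * recordSize + 8);
    Console.WriteLine(key + " -> " + value); // only the touched pages get paged into RAM
}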

MarkR
A: 

If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hash tables.

Alternatively you may roll your own solution using some database techniques.

Suppose your original data file looks like this (dots indicate that string lengths vary):

[key2][value2...][key1][value1..][key3][value3....]

Split it into index file and values file.

Values file:

[value1..][value2...][value3....]

Index file:

[key1][value1-offset]
[key2][value2-offset]
[key3][value3-offset]

Records in the index file are fixed-size key -> value-offset pairs, ordered by key. Strings in the values file are also ordered by key.

To get the value for key(N), you would binary-search for the key(N) record in the index, then read the string from the values file starting at value(N)-offset and ending just before value(N+1)-offset.

The index file can be read into an in-memory array of structs (less overhead and much more predictable memory consumption than a Dictionary), or you can do the search directly on disk.
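
A sketch of that lookup, assuming 16-byte index records (an 8-byte long key plus an 8-byte value offset, both files written sorted by key) and UTF-8 strings in the values file; the class name and record layout are assumptions, not a prescribed format:

using System;
using System.IO;
using System.Text;

struct IndexEntry
{
    public long Key;
    public long ValueOffset;
}

class IndexedStore
{
    private readonly IndexEntry[] index;   // 16 bytes per record: predictable memory use
    private readonly BinaryReader values;

    public IndexedStore(string indexPath, string valuesPath)
    {
        values = new BinaryReader(File.OpenRead(valuesPath));

        using (BinaryReader reader = new BinaryReader(File.OpenRead(indexPath)))
        {
            int count = (int)(reader.BaseStream.Length / 16);
            index = new IndexEntry[count];
            for (int i = 0; i < count; i++)
            {
                index[i].Key = reader.ReadInt64();
                index[i].ValueOffset = reader.ReadInt64();
            }
        }
    }

    public string GetValue(long key)
    {
        // Binary search over the sorted, fixed-size index records.
        int lo = 0, hi = index.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (index[mid].Key == key)
            {
                long start = index[mid].ValueOffset;
                long end = mid + 1 < index.Length
                    ? index[mid + 1].ValueOffset           // value(N) ends where value(N+1) starts
                    : values.BaseStream.Length;
                values.BaseStream.Seek(start, SeekOrigin.Begin);
                return Encoding.UTF8.GetString(values.ReadBytes((int)(end - start)));
            }
            if (index[mid].Key < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return null; // key not present
    }
}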

Constantin