views:

458

answers:

5

I have a potentially very large data set that I am trying to visualize all at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an ID.

I have received a second data source that gives more real-time information for each segment, but its IDs do not correspond to the IDs I have.

I have a 1:1 mapping of the data IDs (9-character strings) to the current IDs (long integers). The problem is that there are a lot of IDs, and the data is coming in in no particular order.

The solution I came up with is a hash-map that maps the strings to the road IDs. The problem is that I don't know whether the hash-map will be efficient enough to hold all 166k data entries.

Does anyone have any suggestions and/or hashing algorithms that I can use for this?

+1  A: 

Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."
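
For illustration, here is a minimal sketch of the string-keyed JudySL flavor. It assumes the Judy library is installed; the key and value are invented, and the macros are the ones the Judy documentation describes, so check them against your version:

    #include <Judy.h>      // JudySL: maps null-terminated string keys to word-sized values
    #include <cstdint>
    #include <cstring>

    int main() {
        Pvoid_t judy = (Pvoid_t) NULL;          // an empty JudySL array
        PWord_t slot;

        std::uint8_t key[10];
        std::memcpy(key, "ABC123XYZ", 10);      // hypothetical 9-character data id (plus NUL)

        JSLI(slot, judy, key);                  // insert: slot points at the value cell for key
        *slot = 123456789;                      // hypothetical long-integer segment id

        JSLG(slot, judy, key);                  // lookup: slot is NULL if the key is absent
        long segment_id = (slot != NULL) ? (long) *slot : -1;
        (void) segment_id;

        Word_t bytes_freed;
        JSLFA(bytes_freed, judy);               // free the whole array when done
        return 0;
    }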

RichieHindle
I read the site, and it seems like bad marketing hype. Is it really as good as they say, and if so, why haven't more people heard of them?
Andrei Krotkov
I honestly don't know. I did try out the library a while ago and it seemed to work as advertised, but I didn't do any serious benchmarking and I haven't used it in production code.
RichieHindle
It's not marketing hype; for that you should hire Sun or Microsoft. Most people don't solve problems where it matters (they are using an RDBMS, so everything else is fast enough). The implementation combines badly with OO and VM implementations, and it is heavily architecture-dependent.
Stephan Eggermont
@Stephan So you are saying that Judy is heavily architecture-dependent?
rogerdpack
+2  A: 

If you're only dealing with hundreds of thousands of data points, it will likely not be a problem to go the naive route and just stick with a hash-map.

Even if you have 500,000 9-character strings and an equal number of longs, that's still only about 16 bytes per item, or roughly 8,000,000 bytes (8 MB) total. Even if you double that for overhead, 16 MB is hardly too big to hold in memory at one time.

Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.
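
For scale, a minimal sketch of the naive approach (the key and value here are invented; std::unordered_map needs TR1/C++11, so substitute whatever hash map your platform provides):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, std::int64_t> id_map;
        id_map.reserve(166000);                 // avoid rehashes while loading the mapping

        // Load the 1:1 mapping (a hypothetical entry shown here).
        id_map["ABC123XYZ"] = 4815162342LL;

        // Translate an incoming real-time record's 9-character id.
        auto it = id_map.find("ABC123XYZ");
        if (it != id_map.end()) {
            std::cout << "segment id: " << it->second << "\n";
        }
        return 0;
    }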

Eclipse
+1  A: 

Since the comments on the question indicate the primary concern may be memory usage:

  • Use a pooling or other small-object-optimized allocator; assuming you have access to Boost, you can probably find a drop-in replacement in Boost.Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
  • If you know your strings are fixed-width, make sure you're allocating only enough space to store them. For example, a struct wrapping a fixed-length char[] with a custom comparison operator may work better than a std::string (see the sketch after this list). A std::string carries an additional dynamic allocation (plus space for the corresponding pointer) and some size- and capacity-tracking overhead. (Generally, try to reduce the number of allocations that stick around; that reduces overhead.)
  • (Assuming the STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); a red-black-tree-based std::map may come close to the lookup performance of your hash-map and may (or may not) be more memory-efficient, depending on your standard library implementation.
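
To illustrate the fixed-width-key point above, here is a rough sketch (the struct name and example key are invented): a 9-character key stored inline, with no per-key heap allocation, usable as a std::map key via a comparison operator.

    #include <cstring>
    #include <map>

    // A 9-character key stored inline: no per-key heap allocation, unlike std::string.
    struct SegmentKey {
        char chars[9];                              // exactly 9 characters, no terminating NUL

        explicit SegmentKey(const char* s) { std::memcpy(chars, s, sizeof(chars)); }

        bool operator<(const SegmentKey& other) const {
            return std::memcmp(chars, other.chars, sizeof(chars)) < 0;
        }
    };

    int main() {
        std::map<SegmentKey, long> id_map;
        id_map[SegmentKey("ABC123XYZ")] = 123456789L;   // hypothetical entry

        std::map<SegmentKey, long>::iterator it = id_map.find(SegmentKey("ABC123XYZ"));
        long segment_id = (it != id_map.end()) ? it->second : -1L;
        (void) segment_id;
        return 0;
    }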

Which route you take should be guided by the information you can gather -- try to get a picture of the number of allocations and the per-allocation size/alignment overhead.

You can either instrument your allocator, or insert a few elements and compare the actual memory usage against what you think it should be.
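
As an example of the "instrument your allocator" route, a hedged sketch (names invented, requires C++11's minimal-allocator support): a counting allocator that tallies calls and bytes requested, plugged into the container under test.

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <new>
    #include <string>

    // Global counters, shared by every rebound copy of the allocator.
    static std::size_t g_alloc_calls = 0;
    static std::size_t g_bytes_requested = 0;

    template <class T>
    struct CountingAllocator {
        typedef T value_type;

        CountingAllocator() {}
        template <class U> CountingAllocator(const CountingAllocator<U>&) {}

        T* allocate(std::size_t n) {
            ++g_alloc_calls;
            g_bytes_requested += n * sizeof(T);
            return static_cast<T*>(::operator new(n * sizeof(T)));
        }
        void deallocate(T* p, std::size_t) { ::operator delete(p); }
    };

    template <class T, class U>
    bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
    template <class T, class U>
    bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

    int main() {
        typedef std::map<std::string, long, std::less<std::string>,
                         CountingAllocator<std::pair<const std::string, long> > > Map;
        Map m;
        for (long i = 0; i < 166000; ++i) {
            m["KEY" + std::to_string(i)] = i;       // hypothetical keys and ids
        }
        std::cout << g_alloc_calls << " allocations, "
                  << g_bytes_requested << " bytes requested\n";
        return 0;
    }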

leander
Incidentally, a reduction in memory usage often gives a good speed boost too.
leander
A: 

Since your strings are known up front and have a fixed length, the best solution, both in theory and in practice, is a perfect hash. You could use cmph to generate it.

According to Wikipedia, your keys would take 2.5 bits/key, or about 50 KB. That's negligible compared to the 664 KB for the values.
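
A rough sketch of what using cmph might look like, based on its documented C API (the algorithm choice, keys, and values below are placeholders; check the cmph docs for the calls available in your version). The perfect hash maps each 9-character key to a unique index, so the long values can live in a plain array:

    #include <cmph.h>
    #include <cstring>
    #include <vector>

    int main() {
        // The complete, known-up-front set of 9-character keys (placeholders here).
        const char* keys[] = { "ABC123XYZ", "DEF456UVW", "GHI789RST" };
        unsigned int nkeys = 3;

        // Build a minimal perfect hash function over the key set.
        cmph_io_adapter_t* source = cmph_io_vector_adapter((char**) keys, nkeys);
        cmph_config_t* config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_BDZ);     // one of the algorithms cmph provides
        cmph_t* mphf = cmph_new(config);
        cmph_config_destroy(config);

        // The MPH maps each key to a unique index in [0, nkeys), so the long
        // segment ids can live in a plain array indexed by that value.
        std::vector<long> values(nkeys);
        for (unsigned int i = 0; i < nkeys; ++i) {
            unsigned int idx = cmph_search(mphf, keys[i], (cmph_uint32) std::strlen(keys[i]));
            values[idx] = 100000L + i;              // hypothetical segment ids
        }

        // At run time: hash the incoming 9-character id and index the array.
        unsigned int idx = cmph_search(mphf, "DEF456UVW", 9);
        long segment_id = values[idx];
        (void) segment_id;

        cmph_destroy(mphf);
        cmph_io_vector_adapter_destroy(source);
        return 0;
    }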

MSalters
A: 

This can be useful to you: STXXL, the Standard Template Library for Extra Large Data Sets.

lsalamon