
OK, I think I understand the problem at least now. Let me see if I can rephrase.

Let's start by leaving sets out of it. We'll keep it abstract.

You have a large list L, containing instances of reference type S (for "set"). Such a list is of course logically a mapping from natural numbers N onto S.

L: N --> S

S has the property that two instances can be compared for both reference equality and value equality. That is, there can be two instances of S which are not reference-equal but logically represent the same value.

You have a function F which takes a value of type V (for "vector") and an instance of type S and produces another instance of type S.

F: (V, S) --> S

Furthermore, you know that if F is given an instance of S from L then the resulting instance of S will be value-equal to something on the list, but not necessarily reference-equal.

The problem you face is: given an instance s of S which is the result of a call to F, which member L(n) is value-equal to s?

Yes?

The naive method -- going down L(1), L(2), ... and testing set equality along the way -- will be dead slow; it'll be at least linear in the size of L.

I can think of several different ways to proceed. The easiest is your initial thought: make L something other than a list. Can you make it a HashSet<S> instead of List<S>? If you implement a hashing algorithm and equality method on S then you can build a fast lookup table.
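For instance, here is a minimal sketch of what that could look like. PointSet and the other names are invented for illustration, not taken from the question, and a Dictionary<S, S> is used rather than a bare HashSet<S> so the lookup hands back the stored instance instead of just answering yes/no:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative stand-in for S: value equality means "same set of points",
    // regardless of object identity.
    sealed class PointSet
    {
        private readonly Tuple<int, int>[] points;   // kept sorted so insertion order doesn't matter

        public PointSet(IEnumerable<Tuple<int, int>> pts)
        {
            points = pts.Distinct()
                        .OrderBy(p => p.Item1)
                        .ThenBy(p => p.Item2)
                        .ToArray();
        }

        public override bool Equals(object obj)
        {
            var other = obj as PointSet;
            return other != null && points.SequenceEqual(other.points);
        }

        public override int GetHashCode()
        {
            unchecked
            {
                int hash = 17;
                foreach (var p in points)
                    hash = (hash * 31 + p.Item1) * 31 + p.Item2;
                return hash;
            }
        }
    }

    class Demo
    {
        static void Main()
        {
            var a = new PointSet(new[] { Tuple.Create(1, 2), Tuple.Create(3, 4) });
            var b = new PointSet(new[] { Tuple.Create(3, 4), Tuple.Create(1, 2) });

            Console.WriteLine(ReferenceEquals(a, b));   // False: two distinct objects
            Console.WriteLine(a.Equals(b));             // True: same value

            // Map each value to its canonical instance; a value-equal result of F
            // can then be traced back to the member of L it matches, in O(1) on average.
            var canonical = new Dictionary<PointSet, PointSet> { { a, a } };
            Console.WriteLine(ReferenceEquals(canonical[b], a));   // True
        }
    }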

If that doesn't suit then we'll explore other options.

UPDATE:

OK, so I can see two basic ways to deal with your memory problem. (1) Keep everything in memory using data structures that are much smaller than your current implementation, or (2) change how you store stuff on disk so that you can store an "index" in memory and rapidly go to the right "page" of the disk file to extract the information you need.

You could represent a point as a single short, where the top byte is x and the bottom byte is y, instead of as two ints; that's a savings of 75%.
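A sketch of that packing, assuming each coordinate fits in a byte (0..255); the helper class is hypothetical, not from the question:

    // One point in one short: top byte = x, bottom byte = y.
    static class PointCodec
    {
        public static short Pack(byte x, byte y)
        {
            return (short)((x << 8) | y);
        }

        public static void Unpack(short packed, out byte x, out byte y)
        {
            x = (byte)((packed >> 8) & 0xFF);
            y = (byte)(packed & 0xFF);
        }
    }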

A set of points could be implemented as a sorted array of shorts, which is pretty compact and easy to write a hash algorithm for.

That's probably the approach I'd go for since your data are so compressible.
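Putting those two ideas together, a compact set type might look roughly like this (again just a sketch; CompactPointSet and its details are invented). With the roughly 10 points per set mentioned in the comments, that's about 20 bytes of payload per set instead of 80:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch: a set of points stored as a sorted array of packed shorts.
    sealed class CompactPointSet
    {
        private readonly short[] points;   // one packed (x, y) per element, kept sorted

        public CompactPointSet(IEnumerable<short> packedPoints)
        {
            points = packedPoints.Distinct().OrderBy(p => p).ToArray();
        }

        public override bool Equals(object obj)
        {
            var other = obj as CompactPointSet;
            return other != null && points.SequenceEqual(other.points);
        }

        public override int GetHashCode()
        {
            // The array is sorted, so logically equal sets hash identically
            // no matter what order their points were supplied in.
            unchecked
            {
                int hash = 17;
                foreach (short p in points)
                    hash = hash * 31 + p;
                return hash;
            }
        }
    }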

Eric Lippert
First of all: thanks for taking the time. Yes, you're on the right track: your rephrasing is exactly what I want to do. A HashSet might be an option, but I had memory issues in the past: if S contains on average 10 points, L consists of 1,000,000 instances of S. I guess I need to do something more clever. Will update again in a minute.
Benjamin Podszun
Thanks again. Both the "index to a file on disk" and "use a short to represent both coordinates" ideas were very helpful. "Easy to write a hash algorithm for" is, well, not exactly how I feel when I think about it. I could come up with something, but I would always be afraid of collisions => losing data in my application. Anyway, you spent a ton of your time on my spare-time project, thanks a lot. I'll accept this answer, since it takes me from 10% to 80% on my current problem, and I have to figure out the tiny details on my own anyway.
Benjamin Podszun
@Benjamin: You're welcome. But I think you're misunderstanding what a hash code is for. A hash algorithm is always lossy; its purpose in a hash set is simply to *speed up* the comparison process by enabling you to rapidly identify and reject non-matches. The problem with collisions is that they result in slower lookups, not that they lose data.
Eric Lippert
Maybe I was again unable to communicate my train of thought: I meant loss of data if I were to store nothing but the hash of the set of points as a reduced dataset, because that's what I wanted to ask initially (see the title of the question). Collisions would mean that I now have two different sets of points mapped to one value (an int or long), which would be bad. A HashSet would just have a slightly worse lookup time, as far as I know, but if I keep _only_ the hashes around, I'd need a stable 1:1 mapping.
Benjamin Podszun
@Benjamin: Right. You're not going to get that easily without collisions. A 100+ bit crypto-strength hash would do the trick, but a randomly distributed 32-bit hash code has a 1% chance of collision in the first 9300 hashes and a 99% chance of collision in the first couple hundred thousand hashes. The hashes aren't a substitute for storing the data.
Eric Lippert
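For reference, those collision figures follow from the standard birthday-problem approximation. A quick check, assuming uniformly distributed 32-bit hash codes:

    using System;

    class BirthdayCheck
    {
        static void Main()
        {
            double buckets = Math.Pow(2, 32);

            // P(at least one collision among k random hashes) ≈ 1 - e^(-k(k-1) / (2 * buckets))
            Func<double, double> p = k => 1.0 - Math.Exp(-k * (k - 1) / (2.0 * buckets));

            Console.WriteLine(p(9300));     // ≈ 0.01: about a 1% chance of a collision
            Console.WriteLine(p(200000));   // ≈ 0.99: about a 99% chance of a collision
        }
    }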