OK, I think I understand the problem now, at least. Let me see if I can rephrase it.
Let's start by leaving sets out of it. We'll keep it abstract.
You have a large list L containing instances of reference type S (for "set"). Such a list is of course logically a mapping from the natural numbers N to S.
L: N --> S
S has the property that two instances can be compared for both reference equality and value equality. That is, there can be two instances of S which are not reference-equal but logically represent the same value.
You have a function F which takes a value of type V (for "vector") and an instance of type S and produces another instance of type S.
F: (V, S) --> S
Furthermore, you know that if F is given an instance of S from L then the resulting instance of S will be value-equal to something on the list, but not necessarily reference-equal to it.
The problem you face is: given an instance s of S which is the result of a call to F, which member L(n) is value-equal to s?
Yes?
The naive method (walk down L(1), L(2), ..., testing value equality along the way) will be dead slow; it's at least linear in the size of L.
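Concretely, the naive scan looks something like this (a sketch; ValueEquals is a stand-in for whatever your value-equality test actually is):

```csharp
// Naive linear search: one potentially expensive value-equality test per
// element, so the cost is O(|L|) at best. ValueEquals is hypothetical.
static int FindIndexNaive(List<S> L, S s)
{
    for (int n = 0; n < L.Count; n++)
        if (L[n].ValueEquals(s))
            return n;
    return -1;   // no value-equal member found
}
```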
I can think of several different ways to proceed. The easiest is your initial thought: make L something other than a list. Can you make it a HashSet<S> instead of a List<S>? If you implement a hashing algorithm and an equality method on S then you can build a fast lookup table.
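Here's a sketch of what that might look like. I'm assuming S keeps its contents in some canonical (say, sorted) form; the data field is just a placeholder for whatever S actually holds:

```csharp
using System;

// Sketch: giving S value semantics so that HashSet<S> (or Dictionary<S, int>)
// can hash and compare by value. The "data" field is a placeholder.
sealed class S : IEquatable<S>
{
    private readonly short[] data;   // canonical (sorted) contents

    public S(short[] canonical) { data = canonical; }

    public bool Equals(S other)
    {
        if (other == null || data.Length != other.data.Length)
            return false;
        for (int i = 0; i < data.Length; i++)
            if (data[i] != other.data[i])
                return false;
        return true;
    }

    public override bool Equals(object obj) { return Equals(obj as S); }

    public override int GetHashCode()
    {
        int hash = 17;
        foreach (short d in data)
            hash = unchecked(hash * 31 + d);   // simple fold over the contents
        return hash;
    }
}
```

If you need not just membership but the index n itself, build a Dictionary<S, int> mapping each value to its position in L once, up front; every lookup after that is an expected O(1) dictionary hit instead of a scan.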
If that doesn't suit then we'll explore other options.
UPDATE:
OK, so I can see two basic ways to deal with your memory problem: (1) keep everything in memory using data structures that are much smaller than your current implementation, or (2) change how you store stuff on disk so that you can keep an "index" in memory and rapidly go to the right "page" of the disk file to extract the information you need.
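To make option (2) concrete, here's a rough sketch; the file layout and every name in it are invented for illustration:

```csharp
using System.Collections.Generic;
using System.IO;

// Sketch of option (2): keep only a (hash code -> file offsets) index in
// memory; seek into the data file for candidate records and run the full
// value-equality test on what comes back. All names here are hypothetical.
class DiskIndex
{
    private readonly Dictionary<int, List<long>> offsetsByHash =
        new Dictionary<int, List<long>>();

    public void Add(int hash, long offset)
    {
        List<long> bucket;
        if (!offsetsByHash.TryGetValue(hash, out bucket))
            offsetsByHash[hash] = bucket = new List<long>();
        bucket.Add(offset);   // collisions: several sets may share a hash
    }

    public IList<long> Candidates(int hash)
    {
        List<long> bucket;
        return offsetsByHash.TryGetValue(hash, out bucket)
            ? (IList<long>)bucket
            : new long[0];
    }

    // Seek to one candidate record; the caller deserializes and compares it.
    public static byte[] ReadRecord(FileStream data, long offset, int length)
    {
        data.Seek(offset, SeekOrigin.Begin);
        byte[] buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = data.Read(buffer, read, length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return buffer;
    }
}
```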
On option (1): you could represent a point as a single short, where the top byte is x and the bottom byte is y, instead of as two ints; a savings of 75% (two bytes instead of eight).
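That is (assuming, as the 75% figure implies, that each coordinate fits in a byte, i.e. 0 through 255):

```csharp
// Pack an (x, y) point into one short; assumes 0 <= x, y <= 255.
static short Pack(int x, int y)
{
    return (short)((x << 8) | y);
}

static int UnpackX(short p) { return (p >> 8) & 0xFF; }
static int UnpackY(short p) { return p & 0xFF; }
```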
A set of points could be implemented as a sorted array of shorts, which is pretty compact and easy to write a hash algorithm for.
That's probably the approach I'd go for since your data are so compressible.
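To make that concrete, here's a sketch of building the canonical form. Pack is the helper from above, and once the array is sorted, equality members like the ones sketched earlier work on it unchanged:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: canonicalize a set of points as a sorted array of packed shorts.
// Equal sets then produce identical arrays, so both the hash and the
// equality test are cheap. Names here are illustrative, not from your code.
static short[] Canonicalize(IEnumerable<(int X, int Y)> points)
{
    short[] packed = points.Select(p => Pack(p.X, p.Y)).ToArray();
    Array.Sort(packed);
    return packed;
}
```

A set of k points then costs roughly 2k bytes plus array overhead.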