I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick: I know of no good method for sorting and searching data of this kind.

My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers each (though in 99.9% of the sets, n is less than 15), and there are approximately 1.5 to 2 billion of these sets (unfortunately this size precludes a brute-force search).

I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.

Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)

If I were to give the request (1,3) I would have the sets: (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5).
The request (50) would return no sets.
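
(For concreteness, a minimal brute-force sketch of this query in C++, using std::includes on sorted sets; it is exactly the per-set scan that the data volume rules out, but it pins down the semantics.)

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Returns true if every element of 'query' appears in 'candidate'.
    // Both vectors must be sorted ascending, as std::includes requires.
    bool containsSubset(const std::vector<int>& candidate, const std::vector<int>& query) {
        return std::includes(candidate.begin(), candidate.end(),
                             query.begin(), query.end());
    }

    int main() {
        std::vector<std::vector<int>> data = {
            {1, 2, 3}, {1, 2, 3, 4, 5}, {4, 5, 6, 7}, {1, 3, 8, 9}, {5, 8, 11}};
        std::vector<int> query = {1, 3};
        for (const auto& s : data) {
            if (containsSubset(s, query)) {   // prints (1,2,3), (1,2,3,4,5), (1,3,8,9)
                for (int v : s) std::cout << v << ' ';
                std::cout << '\n';
            }
        }
    }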

By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many, many more sets.

If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.

Does anyone have any clues as to how to pull this off?

edit:
To answer a couple questions and add a few points:

1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).

2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are the same, I might be able to compress the data yet further by adding counters for repeat events rather than simply repeating the events over and over again.

3.) Each number within a processed set actually contains at LEAST two numbers: 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number (see the packing sketch after this list).

4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some chunks of the data, I find sets that contain as many as 22 numbers, but the median is 5 numbers per set and the average is 6 numbers per set.

5.) While I like the idea of building an index of pointers into files, I am a bit leery, because for requests involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the set of all pointers common to the lists, i.e. intersecting the pointer lists for the requested numbers.

6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (the remainder of my quota on that system). The system is a dual-processor server with two quad-core AMD Opterons and 16 gigabytes of RAM.

7.) Yes, 0 can occur. When it does it is an artifact of the data acquisition system, but it does occur in the data.
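
To make point 3 concrete, here is a minimal packing sketch (the function and field names are mine, not anything from the acquisition format): one 14-bit energy value and one 7-bit detector number fit in 21 bits, i.e. three bytes on disk.

    #include <cstdint>

    // Pack a 14-bit energy and a 7-bit detector id into the low 21 bits
    // of a 32-bit word; only three of its bytes need to be written to disk.
    inline uint32_t packHit(uint32_t energy, uint32_t detector) {
        return ((detector & 0x7Fu) << 14) | (energy & 0x3FFFu);
    }

    inline uint32_t energyOf(uint32_t packed)   { return packed & 0x3FFFu; }
    inline uint32_t detectorOf(uint32_t packed) { return (packed >> 14) & 0x7Fu; }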

+2  A: 

Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
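
A rough sketch of that intersection step, assuming each per-value list of set IDs is stored sorted (the helper name is illustrative): intersect starting from the shortest list, so the working set only shrinks.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    // Intersect any number of sorted ID lists; the result holds the IDs
    // present in every list, i.e. the sets containing all queried values.
    std::vector<uint32_t> intersectAll(std::vector<std::vector<uint32_t>> lists) {
        if (lists.empty()) return {};
        std::sort(lists.begin(), lists.end(),
                  [](const std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
                      return a.size() < b.size();
                  });
        std::vector<uint32_t> result = lists.front();   // start from the shortest list
        for (std::size_t i = 1; i < lists.size() && !result.empty(); ++i) {
            std::vector<uint32_t> next;
            std::set_intersection(result.begin(), result.end(),
                                  lists[i].begin(), lists[i].end(),
                                  std::back_inserter(next));
            result.swap(next);
        }
        return result;
    }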

Sparr
I am not entirely certain the lookup table is the way to go; the raw data itself takes up approximately 265 gigabytes, and after initial processing it will take perhaps 230~250 gigabytes (the raw format wastes space and has a lot of extraneous info), so the table would be half the size of the data.
James Matta
Though to be honest, the lookup index had occurred to me, and if nothing better is suggested it will probably be my method, with an index to the index so that I can jump to the right places within the index.
James Matta
+2  A: 

My guess is as follows.

Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).

Now walk through all the sets once, and create the following output files:

  • A file which contains the IDs of all the sets which contain '1'
  • A file which contains the IDs of all the sets which contain '2'
  • A file which contains the IDs of all the sets which contain '3'
  • ... etc ...

If there are 16 entries per set, then on average each of these 2^14 files will contain the IDs of roughly 2^21 sets; with each ID being 4 bytes, this would require about 2^37 bytes (128 GB) of storage.

You'll do the above once, before you process requests.

When you receive requests, use these files as follows:

  • Look at a couple of numbers in the request
  • Open up a couple of the corresponding index files
  • Get the list of all sets which exist in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
  • See which of these few sets satisfy the remainder of the request

My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
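
A minimal sketch of that index-building pass (the file naming, the 4-byte ID record, and readNextSet are placeholders for whatever the processed data actually looks like):

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    // One pass over all sets: append each set's 4-byte ID to the index file of
    // every value the set contains. readNextSet stands in for whatever routine
    // decodes one set from the processed data files.
    void buildIndexFiles(bool (*readNextSet)(std::vector<uint16_t>&)) {
        // Note: opening 16384 files at once can exceed the OS descriptor limit;
        // in practice you would buffer IDs per value and flush them in batches.
        std::vector<std::FILE*> index(16384);
        for (int v = 0; v < 16384; ++v)
            index[v] = std::fopen(("idx_" + std::to_string(v) + ".bin").c_str(), "wb");

        std::vector<uint16_t> set;
        for (uint32_t id = 0; readNextSet(set); ++id)
            for (uint16_t value : set)
                std::fwrite(&id, sizeof id, 1, index[value]);

        for (std::FILE* f : index) std::fclose(f);
    }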

ChrisW
A: 

Just playing devil's advocate for an approach which includes brute force + index lookup :

  1. Create an index with the min, max, and number of elements of each set.
  2. Then apply brute force, excluding sets where max < max(set being searched) or min > min(set being searched).
  3. In the brute force, also exclude sets whose element count is less than that of the set being searched.

95% of your searches would really be brute-forcing a much smaller subset. Just a thought.
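
A sketch of that pruning filter (struct and names are illustrative); everything it lets through still has to be checked in full:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-set summary kept in a small side index.
    struct SetSummary {
        uint16_t minValue;
        uint16_t maxValue;
        uint16_t count;
    };

    // A candidate can only contain the query if its value range covers the
    // query's range and it has at least as many elements; 'query' is sorted.
    bool mightContain(const SetSummary& s, const std::vector<uint16_t>& query) {
        return static_cast<std::size_t>(s.count) >= query.size()
            && s.minValue <= query.front()
            && s.maxValue >= query.back();
    }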

Learning
The subset would be smaller, but why do you think that it would be *very much* smaller 95% of the time? I'm guessing that you'd usually only be excluding about 75% of the 2 billion sets, which still leaves too many.
ChrisW
+10  A: 

Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently), integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al is (at that link) available free online, is very readable, and will go into a lot of detail about how to do this.

You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
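
As a concrete taste of the posting lists the book describes, here is a minimal sketch of one common compression scheme for a sorted ID list, delta (gap) encoding with variable-byte integers; the particular scheme is my choice, one of several the book covers:

    #include <cstdint>
    #include <vector>

    // Store a sorted ID list as gaps between consecutive IDs, each gap written
    // as a variable-byte integer: 7 data bits per byte, high bit set on every
    // byte except the last. Small gaps, which dominate dense lists, cost one byte.
    std::vector<uint8_t> compressPostings(const std::vector<uint32_t>& sortedIds) {
        std::vector<uint8_t> out;
        uint32_t previous = 0;
        for (uint32_t id : sortedIds) {
            uint32_t gap = id - previous;
            previous = id;
            while (gap >= 0x80) {
                out.push_back(static_cast<uint8_t>(gap & 0x7F) | 0x80);
                gap >>= 7;
            }
            out.push_back(static_cast<uint8_t>(gap));
        }
        return out;
    }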

Jay Kominek
The only thing to add is that Manning et al have a chapter on index compression that would be very applicable here -- don't skip it.
SquareCog
Thanks for the book link I read some of the online version and liked it enough to order it. It is going right next to Knuth's "The Art of Computer Programming", Stroustrup's "The C++ Programming Language", and Jossuttis' "The Standard Template Library". My primary references.
James Matta
If you are in the reading mood you may also want to take a look at this: http://portal.acm.org/citation.cfm?doid=1132956.1132959
SquareCog
+1  A: 

Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.

To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.

If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
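
A sketch of that matching loop, assuming each index file is simply a sorted stream of 8-byte file offsets into the master file (the record size is an assumption):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Advance whichever streams are behind until all current offsets agree;
    // every agreement marks a set that contains all of the searched values.
    std::vector<uint64_t> matchingOffsets(std::vector<std::ifstream>& files) {
        std::vector<uint64_t> current(files.size());
        std::vector<uint64_t> matches;
        if (files.empty()) return matches;

        auto readOne = [&](std::size_t i) {
            return static_cast<bool>(
                files[i].read(reinterpret_cast<char*>(&current[i]), sizeof(uint64_t)));
        };
        for (std::size_t i = 0; i < files.size(); ++i)
            if (!readOne(i)) return matches;                 // an empty index: no matches

        for (;;) {
            uint64_t lo = *std::min_element(current.begin(), current.end());
            uint64_t hi = *std::max_element(current.begin(), current.end());
            if (lo == hi) {                                  // all streams agree: a match
                matches.push_back(lo);
                for (std::size_t i = 0; i < files.size(); ++i)
                    if (!readOne(i)) return matches;
            } else {
                for (std::size_t i = 0; i < files.size(); ++i)
                    if (current[i] == lo && !readOne(i)) return matches;
            }
        }
    }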

P.S. Is zero an impossible result value, or do you really have 16384 possibilities?

Mark Ransom
I see that you've already selected another solution while I was thinking about this. Oh well.
Mark Ransom
Well, your solution is fairly similar to the one I selected, and the book Jay recommended fills in a few of the blanks in both methods. Now I am just thinking of ways to reduce the space the data and indices need: some sort of structure that can take out some of the redundancies in the data.
James Matta
Yes indeed, 0 does occur; it is a data acquisition system artifact, but it occurs. Some nonzero value comes down the pipe, but it gets vetoed by something that prevents unwanted signals from getting through. Unfortunately, if the count rate is high, the resulting zero sometimes gets recorded.
James Matta
A: 

I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be easily carried out by finding the segments of the curve that intersect the box that represents the query and then retrieving those segments.
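
(For illustration only: the simplest space-filling mapping is Morton/Z-order bit interleaving, sketched below for two 14-bit coordinates. The curves in the references below, such as Hilbert curves, have better locality, and the real data is higher-dimensional, but the idea is the same.)

    #include <cstdint>

    // Interleave the bits of two 14-bit coordinates into one 28-bit Z-order key.
    // Sorting records by this key keeps nearby points close together on disk.
    uint32_t mortonKey(uint16_t x, uint16_t y) {
        uint32_t key = 0;
        for (int bit = 0; bit < 14; ++bit) {
            key |= ((static_cast<uint32_t>(x) >> bit) & 1u) << (2 * bit);
            key |= ((static_cast<uint32_t>(y) >> bit) & 1u) << (2 * bit + 1);
        }
        return key;
    }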

I believe that this method is far superior to building the insane indexes suggested above because, after looking at it, the index would be as large as the data I wished to store, which is hardly a good thing. A somewhat more detailed explanation of this can be found at:

http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html

James Matta