views: 159
answers: 4

I have an array of about 10-100k ints that I need to store as compressed as possible, and retrieve back into the complete array as fast as possible. What is the best way to handle this in a language like C#?

+6  A: 

That depends on what you mean by "as compressed as possible".

You can use a BinaryWriter to write the integers to a stream, or use BitConverter.GetBytes to get each int as four bytes and copy them into a large array. Either approach stores each int without any extra metadata.

If you want it more compressed than that, the BinaryWriter has a Write7BitEncodedInt method that writes ints with small values in fewer bytes. You can also use the GZipStream class to try to further compress the data once you have it packed in a byte array.

Generally, the smaller you want it, the longer it will take to process. To get the balance between speed and size that you want, you just have to do some testing.
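A minimal sketch of the second approach: BinaryWriter packs the ints back to back, and a GZipStream wraps the output to squeeze it further. The class and method names here are illustrative, not from the original answer.

```csharp
using System;
using System.IO;
using System.IO.Compression;

class IntPacker
{
    // Pack an int array into a GZip-compressed byte array.
    public static byte[] Compress(int[] values)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            using (var writer = new BinaryWriter(gzip))
            {
                writer.Write(values.Length);   // prefix with the count
                foreach (int v in values)
                    writer.Write(v);           // 4 bytes each, no extra metadata
            }
            return ms.ToArray();               // valid even after the stream is closed
        }
    }

    // Reverse the process: decompress and rebuild the array.
    public static int[] Decompress(byte[] data)
    {
        using (var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (var reader = new BinaryReader(gzip))
        {
            int count = reader.ReadInt32();
            var result = new int[count];
            for (int i = 0; i < count; i++)
                result[i] = reader.ReadInt32();
            return result;
        }
    }

    static void Main()
    {
        int[] original = { 1, 1, 2, 3, 500, -7 };
        int[] roundTripped = Decompress(Compress(original));
        Console.WriteLine(string.Join(",", roundTripped)); // 1,1,2,3,500,-7
    }
}
```

Note the length prefix: GZip does not record how many ints were written, so the reader needs the count to rebuild the array.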

Guffa
'try to further compress' - try being the operative word. Compressing something is no guarantee it will actually be any smaller!
zebrabox
+1 "Generally, the smaller you want it, the longer it will take to process" -- there's hardly more to say.
stacker
+2  A: 

Depending on the nature of the values in this int array, run-length encoding might be another option. That is, if contiguous cells in your array all have the same value, you only need to store the first occurrence of the value in that sequence, along with the number of times it will be repeated after that. This might work especially well with "sparse" data.
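The run-length idea above can be sketched in a few lines of C#. This is a hypothetical implementation, not from the answer; it stores each run as a (value, count) pair.

```csharp
using System;
using System.Collections.Generic;

class RunLengthCoder
{
    // Collapse each run of equal neighbours into a (value, count) pair.
    public static List<(int Value, int Count)> Encode(int[] data)
    {
        var runs = new List<(int, int)>();
        int i = 0;
        while (i < data.Length)
        {
            int value = data[i], count = 0;
            while (i < data.Length && data[i] == value) { i++; count++; }
            runs.Add((value, count));
        }
        return runs;
    }

    // Expand the pairs back into the original array.
    public static int[] Decode(List<(int Value, int Count)> runs)
    {
        var output = new List<int>();
        foreach (var (value, count) in runs)
            for (int j = 0; j < count; j++)
                output.Add(value);
        return output.ToArray();
    }

    static void Main()
    {
        int[] sparse = { 0, 0, 0, 0, 7, 7, 0, 0, 0 };
        var runs = Encode(sparse);
        Console.WriteLine(runs.Count); // 3 runs instead of 9 ints
    }
}
```

On sparse data like the example, nine ints compress to three pairs; on data with no repeats, RLE makes the output larger, which echoes the "try being the operative word" caveat above.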

stakx
+1 RLE is pretty fast and easy to implement
stacker
You can encode and decode entire datasets with RLE easily and efficiently -- however, without considerable effort, sensibly using the data while it is compressed will be insane. RLE is a technique used in numerical libraries, but they have spent considerable complex LOC making it usable. Plus, you could have simply mentioned a standard compression scheme that uses RLE -- most of the standard ones do.
Hassan Syed
+2  A: 

100,000 ints is not that big; why do you need to compress it so much?

Paul Creasey
Assuming int = 4 bytes, that's about 390 KB -- a decent size IMO, but it all depends on your definition of 'big'.
zebrabox
A common way of storing a data structure is serialisation, and one common format for that is XML. If you do that to a 100k int array, it gets rather large...
Guffa
Dude, XML is not compressed or fast in any way... come on!
Hassan Syed
+2  A: 

Answer for your specific question

  1. Pick the data type that is large enough and only large enough to store your data --e.g., uint32_t or int64_t. Note: it has to be fixed length.
  2. Write the data out in binary form -- back to back -- to a file.
  3. Read the data back directly into the memory of your array type.

Problem solved in most optimal way. If you want on-disk compression, run the data through a zipping library. Having the data compressed in memory while you are trying to use it is generally a no-no (the general solution uses other techniques). Please indicate if you want information on why it is a no-no.
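The three steps above can be sketched in C# using a fixed-width type (int, 4 bytes) and Buffer.BlockCopy for a bulk copy instead of a per-element loop. The file path and class name are illustrative assumptions.

```csharp
using System;
using System.IO;

class FixedWidthStore
{
    // Step 2: write the ints back to back as raw 4-byte values.
    public static void Save(string path, int[] data)
    {
        byte[] bytes = new byte[data.Length * sizeof(int)];
        Buffer.BlockCopy(data, 0, bytes, 0, bytes.Length); // bulk memcpy, no per-int loop
        File.WriteAllBytes(path, bytes);
    }

    // Step 3: read the file straight back into the array's memory.
    public static int[] Load(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        int[] data = new int[bytes.Length / sizeof(int)];
        Buffer.BlockCopy(bytes, 0, data, 0, bytes.Length);
        return data;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        Save(path, new[] { 10, 20, 30 });
        Console.WriteLine(string.Join(",", Load(path))); // 10,20,30
        File.Delete(path);
    }
}
```

Because every element has the same fixed width, no per-element framing is needed and the round trip is essentially two memory copies plus the file I/O.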

General answer for computing with large data-sets

Specialized mathematics libraries deal with these issues (e.g., octave or matlab), specifically the issues of dealing with more numbers than you can think possible with your computer.

These libraries have an execution engine and a specific language, but you can often programmatically interface with them.

Hassan Syed
"Problem solved in most optimal way." This does not address his compression requirement.
dss539
You can have either compression or speed, as indicated in a comment to the question -- the general solution I mentioned above will take care of representing large amounts of integral data in a usable fashion. Still, I have added more info to the answer.
Hassan Syed