I have an array of about 10-100k ints that I need to store (as compressed as possible), and retrieve back to the complete array the fastest way possible. What is the best way to handle this type of thing in a language like c#.
That depends on what you mean by "as compressed as possible".
You can use a BinaryWriter to write the integers to a stream, or use BitConverter.GetBytes to get each int as four bytes as copy into a large array. Either would store each int without any extra meta data.
If you want it more compressed than that, the BinaryWriter has a Write7BitEncodedInt method that writes ints with small values in fewer bytes. You can also use the GZipStream class to try to further compress the data once you have it packed in a byte array.
Generally, the smaller you want it, the longer it will take to process. To get the balance between speed and size that you want, you just have to do some testing.
Depending on the nature of the values in this int
array, run-length encoding might be another option. That is, if contiguous cells in your array all have the same value, you only need to store the first occurrence of the value in that sequence, along with the number of times it will be repeated after that. This might work especially well with "sparse" data.
100,000 ints is not that big, why do you need to compress it so much?
Answer for your specific question
- Pick the data type that is large enough and only large enough to store your data --e.g., uint32_t or int64_t. Note: it has to be fixed length.
- Write the data out in binary form -- back to back -- to a file.
- Read the data back directly into the memory of your array type.
Problem solved in most optimal way. If you wanted on-disk compression, run the data through a zipping library. having the data compressed in-memory while you are trying to use it is generally a no-no (the general solution uses other techniques). Please indicate if you want information why it is a no-no.
General answer for computing with large data-sets
Specialized mathematics libraries deal with these issues (e.g., octave or matlab), specifically the issues of dealing with more numbers than you can think possible with your computer.
These libraries have an execution engine and a specific language, but you can often programmatically interface with them.