Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:
- all 4096-byte combinations found in it [column 'bytes'], and
- the number of times each byte combination appeared in it [column 'occurrences']
Assume two things:
- I can save data very fast, but
- I can only update my saved data very slowly.
How should I sample the file and save the needed information?
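For reference, here is a rough sketch (in Python, just to illustrate) of how the combinations are produced. I'm assuming overlapping windows that advance 1 byte at a time; the step size can be changed, and it doesn't change the saving problem:

```python
# Rough sketch of what one "4096-byte combination" means here.
# Assumption: overlapping windows advancing 1 byte at a time; set STEP
# to 4096 if non-overlapping blocks are wanted instead.
WINDOW = 4096
STEP = 1

def iter_combinations(path):
    with open(path, "rb") as f:
        data = f.read()                    # the whole ~170 MB fits in memory
    for i in range(0, len(data) - WINDOW + 1, STEP):
        yield data[i:i + WINDOW]           # one value for the 'bytes' column
```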
Here are some approaches that turned out to be (extremely) slow:
- Go through each 4096-byte combination in the file and save it, but first search the table for an existing entry and update its count if one is found. This is unbelievably slow.
- Go through each 4096-byte combination in the file and save rows into a temporary table until it holds 1 million of them. Then go through that table, fix the entries (combine repeating combinations), and copy them into the big table. Repeat with the next 1 million rows, and so on. This is a bit faster, but still unbelievably slow. (A rough sketch of this approach is after this list.)
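A rough sketch of that second approach, reusing the generator from the earlier sketch. The per-batch de-duplication happens in memory with a `Counter` here instead of inside the temporary table, and `flush` is just a placeholder for whatever actually saves a batch and merges it into the big table:

```python
from collections import Counter

BATCH_ROWS = 1_000_000                     # rows to collect before writing out

def count_in_batches(path, flush):
    """Count combinations batch by batch; `flush` is a placeholder for the
    code that saves the batch and merges duplicates into the big table."""
    counts = Counter()
    for combo in iter_combinations(path):  # generator from the sketch above
        counts[combo] += 1
        if len(counts) >= BATCH_ROWS:
            flush(counts)                  # save this batch, combine later
            counts.clear()
    if counts:
        flush(counts)                      # whatever is left at the end
```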
This is basically like gathering statistics about the file.
NOTE: I know that sampling the file can generate tons of data (around 22 GB in my experience), and I know that any solution posted will take a while to finish. What I need is the most efficient saving process.