Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:

  1. all 4096-byte combinations found [column 'bytes'], and
  2. the number of times each byte combination appeared in it [column 'occurrences']

Assume two things:

  1. I can save data very quickly, but
  2. I can update my saved data only very slowly.

How should I sample the file and save the needed information?

Here are some suggestions that turned out to be (extremely) slow:

  • Go through each 4096-byte combination in the file and save it, but first search the table for an existing entry and update its values. This is unbelievably slow.
  • Go through each 4096-byte combination in the file, saving rows into a temporary table until it holds 1 million of them. Go through that table and consolidate the entries (combine repeating byte combinations), then copy them to the big table. Process the next 1 million rows the same way and repeat. This is a bit faster, but still unbelievably slow.

This is kind of like gathering statistics on the file.

NOTE: I know that sampling the file like this can generate tons of data (around 22 GB from experience), and I know that any solution posted will take a while to finish. I need the most efficient saving process.

+1  A: 

The first solution you've described could be sped up significantly if you also hash the data, store the hash of each 4096-byte segment in your database, and compare against that. Comparing strings that are 4096 bytes long takes forever; comparing short hashes is much cheaper:

For each 4096-byte segment in the file
    Hash the segment into something short (even MD5 is fine, and it's quick)
    Look up the hash in your database
        If it exists (the segment may have been seen before)
            Compare the actual segments to confirm a match, and increment the occurrence count
        If it doesn't exist
            It's a new segment - save it to your database

Hashing the segment isn't free, but it's pretty cheap, and the comparisons between hashes will be orders of magnitude cheaper than comparing the full byte segments to each other repeatedly. Hashes are useful for many applications - this is definitely one of them.
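
A minimal sketch of that loop, assuming Python, an in-memory dict standing in for the database table, overlapping 4096-byte segments, and a placeholder file name:

    import hashlib

    WINDOW = 4096
    table = {}                                  # md5 digest -> list of [segment, occurrences]

    with open("input.bin", "rb") as f:          # placeholder file name
        data = f.read()

    for i in range(len(data) - WINDOW + 1):     # assuming overlapping segments
        segment = data[i:i + WINDOW]
        digest = hashlib.md5(segment).digest()  # short, cheap lookup key
        bucket = table.setdefault(digest, [])
        for entry in bucket:
            if entry[0] == segment:             # hash matched: confirm with a full compare
                entry[1] += 1
                break
        else:
            bucket.append([segment, 1])         # new segment: save it

With a real database the dict lookup becomes an indexed query on the hash column, but the shape of the loop stays the same.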

rwmnau
Good idea on the hashing. Somehow that did not cross my mind :) Hashes may speed up the lookup a bit, but the bad news is that you'd still have to iterate over the table many times, and that's what makes the script crawl to a stop
Jami
@Jami: What do you mean "You'd have to do the iteration multiple times"? You're asking (in your example of a 180MB file) to compare 180,000,000 strings to each other - it's going to be slow no matter how you do it. In the beginning, comparing each new byte segment directly to the ones you've already seen is cheap - there aren't that many yet. But as the list of distinct segments you know about grows, the comparison gets astronomically more expensive, and so the speed of your lookup, not the speed of your hash, is what's going to be critically important in determining the speed of your process.
rwmnau
A: 

It's a bit late and I can't think straight, so my algorithm complexity calculation is kind of off :) But if you manage to fit it into memory, you might have a very very quick implementation with a trie. If you can optimize each trie node to take as little memory as possible, it just might work.
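
One way to keep the nodes small is a radix-style trie, where each edge stores a whole run of bytes and unique tails stay as single strings until a split is needed. A minimal sketch, assuming Python, fixed-length 4096-byte blocks taken as overlapping windows, and a placeholder file name:

    class Node:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}   # first byte of an edge label -> (label bytes, child node)
            self.count = 0       # how many blocks end exactly at this node

    def insert(root, key):
        node, i = root, 0
        while i < len(key):
            b = key[i]
            if b not in node.children:
                # no edge starts with this byte: hang the whole remaining tail off one edge
                leaf = Node()
                leaf.count = 1
                node.children[b] = (key[i:], leaf)
                return
            label, child = node.children[b]
            j = 0                # length of the common prefix of label and key[i:]
            while j < len(label) and i + j < len(key) and label[j] == key[i + j]:
                j += 1
            if j == len(label):  # the whole edge matched: descend
                node, i = child, i + j
                continue
            # partial match: split the edge at j
            # (all keys are the same length, so a key never ends in the middle of an edge)
            mid = Node()
            node.children[b] = (label[:j], mid)
            mid.children[label[j]] = (label[j:], child)
            leaf = Node()
            leaf.count = 1
            mid.children[key[i + j]] = (key[i + j:], leaf)
            return
        node.count += 1          # an identical block was inserted before

    def walk(root):
        # yield (block, occurrences) pairs without recursion
        stack = [(root, b"")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                yield prefix, node.count
            for label, child in node.children.values():
                stack.append((child, prefix + label))

    root = Node()
    with open("input.bin", "rb") as f:           # placeholder file name
        data = f.read()
    for i in range(len(data) - 4096 + 1):        # assuming overlapping blocks
        insert(root, data[i:i + 4096])

Whether this fits in memory depends entirely on how much the data repeats and shares prefixes.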

Another thing is basically @rwmnau's suggestion, but don't use predefined hash functions like MD5 - use running totals. Unlike hashes, this is almost free, without any downside for such a large block size (there's plenty of randomness in 4096 bytes). With each new block, you gain one byte and you lose one byte. So calculate a sum of the first 4096 bytes; for each subsequent block, simply subtract the lost byte and add the new one. Depending on the size of the integer you do the sums in, you will have plenty of buckets. Then you will have a much smaller number of blocks to compare byte-for-byte.
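
A rough sketch of the running total, assuming Python, overlapping windows, and a placeholder file name - the sum is updated in O(1) per position and used only to bucket window offsets:

    WINDOW = 4096

    with open("input.bin", "rb") as f:      # placeholder file name
        data = f.read()

    buckets = {}                            # running total -> offsets of windows with that total
    total = sum(data[:WINDOW])              # sum the first window once
    buckets.setdefault(total, []).append(0)

    for i in range(1, len(data) - WINDOW + 1):
        total += data[i + WINDOW - 1] - data[i - 1]   # add the byte that came in, drop the one that left
        buckets.setdefault(total, []).append(i)

Only offsets that land in the same bucket ever need the byte-for-byte comparison.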

Amadan
If I build a trie over the raw bytes, I'd end up with 256^4096 total paths in the tree. If I build it over the hashes instead, I'd end up with 16^32 paths. That's still a lot. I can't think of any way to traverse such a trie in a reasonable amount of time. What do you suggest?
Jami
Regarding running totals, the downside is that I won't be able to filter unique combinations: a byte combination of [0, 0, 1] would get the same total as [1, 0, 0]
Jami
Regarding the trie, you only get 256^4096 total paths if you see all possible 4096-byte combinations. You don't - they wouldn't fit into 170 MB, so it must be much, much less than that. You also don't have 256-way branching everywhere: in most cases, below the top couple of levels, you will have only a couple of branches, and most of the bottom layers will be unique (you can represent them as strings until you need to split them apart). If you have blocks that repeat (and you obviously do, since you're trying to count them) or share prefixes, it's even less - that's the whole point of a trie. Whether it fits depends on your data.
Amadan
Regarding running totals, no hash will give you unique identification - it just sorts things into buckets (i.e. equivalence classes under the hash function), and from there you use some other method, usually brute force. If you have a 32-bit hash (or running total, which plays the same role here), you need to check only about 1/2^32 of your collected blocks for equality byte-by-byte (assuming the buckets are evenly filled), which makes your "unbelievably slow" algorithm roughly 2^32 times faster.
Amadan
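
Continuing the running-total sketch above, a small illustration (again Python, my own assumption about how the buckets would be resolved) of the brute-force step within one equivalence class:

    def resolve_bucket(data, offsets, window=4096):
        groups = []                          # list of [block, occurrences]
        for off in offsets:
            block = data[off:off + window]
            for group in groups:
                if group[0] == block:        # full byte-for-byte check, but only inside this bucket
                    group[1] += 1
                    break
            else:
                groups.append([block, 1])
        return groups

Running this once per bucket yields the final (bytes, occurrences) rows for the table.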
I see the value now in your running totals suggestion! I can use running totals to segregate the data (maybe hash the bytes first to have fewer 'buckets'), then use HDF5 for storage. I'd have to map each hash to its equivalent byte combination somewhere, though...
Jami