Let's say I have a 170 MB file (roughly 180 million bytes). What I need to do is create a table that lists:

  1. all 4096-byte combinations found [column 'bytes'], and
  2. the number of times each byte combination appeared in it [column 'occurrences']

Assume two things:

  1. I can save data very quickly, but
  2. I can update my saved data only very slowly.

How should I sample the file and save the needed information?

Here are some suggestions that turned out to be (extremely) slow:

  • Go through each 4096-byte combination in the file and save it, but first search the table for an existing entry and update its values. This is unbelievably slow.
  • Go through each 4096-byte combination in the file, saving rows into a temporary table until it holds 1 million of them. Go through that table and consolidate the entries (combine repeating byte combinations), then copy them to the big table. Process the next 1 million rows the same way and repeat. This is a bit faster, but still unbelievably slow.

This is kind of like gathering statistics on the file.

NOTE: I know that sampling the file like this can generate tons of data (around 22 GB from experience), and I know that any solution posted will take a while to finish. I need the most efficient saving process.

+1  A: 

The first solution you've described could be sped up significantly if you also hash the data, store the hash of each 4096-byte segment in your database, and compare against that. Comparing strings that are 4096 bytes long takes forever; comparing short hashes is much cheaper:

For each 4096-byte segment in the file
    Hash the segment into something short (even MD5 is fine, and it's quick)
    Look up the hash in your database
        If it exists (the segment may have been seen before)
            Compare the actual segments to confirm a match, and increment the occurrence count
        If it doesn't exist
            It's a new segment - save it to your database

Hashing the segment isn't free, but it's pretty cheap, and the comparisons between hashes will be orders of magnitude cheaper than comparing the full byte segments to each other repeatedly. Hashes are useful for many applications - this is definitely one of them.
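
A minimal sketch of that loop, assuming Python, an in-memory dict standing in for the database table, overlapping 4096-byte segments, and a placeholder file name:

    import hashlib

    WINDOW = 4096
    table = {}                                  # md5 digest -> list of [segment, occurrences]

    with open("input.bin", "rb") as f:          # placeholder file name
        data = f.read()

    for i in range(len(data) - WINDOW + 1):     # assuming overlapping segments
        segment = data[i:i + WINDOW]
        digest = hashlib.md5(segment).digest()  # short, cheap lookup key
        bucket = table.setdefault(digest, [])
        for entry in bucket:
            if entry[0] == segment:             # hash matched: confirm with a full compare
                entry[1] += 1
                break
        else:
            bucket.append([segment, 1])         # new segment: save it

With a real database the dict lookup becomes an indexed query on the hash column, but the shape of the loop stays the same.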

rwmnau
Good idea on the hashing. Somehow that did not cross my mind :) Hashes may speed up the lookup a bit, but the bad news is that you'd still have to iterate over the table many times, and that's what makes the script crawl to a stop
Jami
@Jami: What do you mean "You'd have to do the iteration multiple times"? You're asking (in your example of a 180MB file) to compare 180,000,000 strings to each other - it's going to be slow no matter how you do it. In the beginning, comparing each new byte segment directly to the ones you've already seen is cheap - there aren't that many yet. But as the list of distinct segments you know about grows, the comparison gets astronomically more expensive, and so the speed of your lookup, not the speed of your hash, is what's going to be critically important in determining the speed of your process.
rwmnau
A: 

It's a bit late and I can't think straight, so my algorithm complexity calculation is kind of off :) But if you manage to fit it into memory, you might have a very very quick implementation with a trie. If you can optimize each trie node to take as little memory as possible, it just might work.
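
One way to keep the nodes small is a radix-style trie, where each edge stores a whole run of bytes and unique tails stay as single strings until a split is needed. A minimal sketch, assuming Python, fixed-length 4096-byte blocks taken as overlapping windows, and a placeholder file name:

    class Node:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}   # first byte of an edge label -> (label bytes, child node)
            self.count = 0       # how many blocks end exactly at this node

    def insert(root, key):
        node, i = root, 0
        while i < len(key):
            b = key[i]
            if b not in node.children:
                # no edge starts with this byte: hang the whole remaining tail off one edge
                leaf = Node()
                leaf.count = 1
                node.children[b] = (key[i:], leaf)
                return
            label, child = node.children[b]
            j = 0                # length of the common prefix of label and key[i:]
            while j < len(label) and i + j < len(key) and label[j] == key[i + j]:
                j += 1
            if j == len(label):  # the whole edge matched: descend
                node, i = child, i + j
                continue
            # partial match: split the edge at j
            # (all keys are the same length, so a key never ends in the middle of an edge)
            mid = Node()
            node.children[b] = (label[:j], mid)
            mid.children[label[j]] = (label[j:], child)
            leaf = Node()
            leaf.count = 1
            mid.children[key[i + j]] = (key[i + j:], leaf)
            return
        node.count += 1          # an identical block was inserted before

    def walk(root):
        # yield (block, occurrences) pairs without recursion
        stack = [(root, b"")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                yield prefix, node.count
            for label, child in node.children.values():
                stack.append((child, prefix + label))

    root = Node()
    with open("input.bin", "rb") as f:           # placeholder file name
        data = f.read()
    for i in range(len(data) - 4096 + 1):        # assuming overlapping blocks
        insert(root, data[i:i + 4096])

Whether this fits in memory depends entirely on how much the data repeats and shares prefixes.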

Another thing is basically @rwmnau's suggestion, but don't use predefined hash functions like MD5 - use running totals. Unlike hashes, this is almost free, without any downside for such a large block size (there's plenty of randomness in 4096 bytes). With each new block, you gain one byte and you lose one byte. So calculate a sum of the first 4096 bytes; for each subsequent block, simply subtract the lost byte and add the new one. Depending on the size of the integer you do the sums in, you will have plenty of buckets. Then you will have a much smaller number of blocks to compare byte-for-byte.
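
A rough sketch of the running total, assuming Python, overlapping windows, and a placeholder file name - the sum is updated in O(1) per position and used only to bucket window offsets:

    WINDOW = 4096

    with open("input.bin", "rb") as f:      # placeholder file name
        data = f.read()

    buckets = {}                            # running total -> offsets of windows with that total
    total = sum(data[:WINDOW])              # sum the first window once
    buckets.setdefault(total, []).append(0)

    for i in range(1, len(data) - WINDOW + 1):
        total += data[i + WINDOW - 1] - data[i - 1]   # add the byte that came in, drop the one that left
        buckets.setdefault(total, []).append(i)

Only offsets that land in the same bucket ever need the byte-for-byte comparison.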

Amadan
If I build a trie over the raw bytes, I'd end up with 256^4096 total paths in the tree. If I build it over the hashes instead, I'd end up with 16^32 paths. That's still a lot. I can't think of any way to traverse such a trie in a reasonable amount of time. What do you suggest?
Jami
Regarding running totals, the downside is that I won't be able to filter unique combinations: a byte combination of [0, 0, 1] would get the same total as [1, 0, 0]
Jami
Regarding the trie, you only get 256^4096 total paths if you see all possible 4096-byte combinations. You don't - they wouldn't fit into 170 MB, so it must be much, much less than that. You also don't have 256-way branching everywhere: in most cases, below the top couple of levels, you will have only a couple of branches, and most of the bottom layers will be unique (you can represent them as strings until you need to split them apart). If you have blocks that repeat (and you obviously do, since you're trying to count them) or share prefixes, it's even less - that's the whole point of a trie. Whether it fits depends on your data.
Amadan
Regarding running totals, no hash will give you unique identification - it just sorts things into buckets (i.e. equivalence classes under the hash function), and from there you use some other method, usually brute force. If you have a 32-bit hash (or running total, which plays the same role here), you need to check only about 1/2^32 of your collected blocks for equality byte-by-byte (assuming the buckets are evenly filled), which makes your "unbelievably slow" algorithm roughly 2^32 times faster.
Amadan
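
Continuing the running-total sketch above, a small illustration (again Python, my own assumption about how the buckets would be resolved) of the brute-force step within one equivalence class:

    def resolve_bucket(data, offsets, window=4096):
        groups = []                          # list of [block, occurrences]
        for off in offsets:
            block = data[off:off + window]
            for group in groups:
                if group[0] == block:        # full byte-for-byte check, but only inside this bucket
                    group[1] += 1
                    break
            else:
                groups.append([block, 1])
        return groups

Running this once per bucket yields the final (bytes, occurrences) rows for the table.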
I see the value now in your running totals suggestion! I can use running totals to segregate the data (maybe hash the bytes first to have fewer 'buckets'), then use HDF5 for storage. I'd have to map each hash to its equivalent byte combination somewhere, though...
Jami