views: 1245 · answers: 10

I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage of two files' binary data.

To give a little more detail for you to work with, even though this is potentially applicable to many things, I do have a specific problem that I'm working on. I also currently have a working solution, but I don't think that it is ideal. There are probably many possible optimizations, both in the comparison method and in how the results are stored. Hopefully some people here will be able to give me some new ideas. I will probably edit in some information about my current method after a couple of days, but I don't want to bias people's thoughts about the problem by telling you how I'm already doing it.

The problem I'm working on is clone detection for video game ROM images. For those that don't have experience with emulation, ROMs are dumps of the data on game cartridges. A ROM "clone" is typically a modified version of the same game, the most common type being a translated version. For example, the Japanese and English versions of the original Final Fantasy for the NES are clones. The games share almost all of their assets (sprites, music, etc), but the text has been translated.

There are currently several groups that work on maintaining lists of clones for the various systems, but as far as I can tell, this is all done manually. What I am attempting to do is find a method to detect similar ROM images automatically and objectively, based on data similarity instead of "these seem like the same game". There are several reasons for detecting clones, but one of the major motivations is use with solid compression. This allows all of a game's clones to be compressed together into the same archive, with the entire compressed clone set often taking up only slightly more space than one of the individual ROMs.

Some concerns to consider when coming up with potential approaches:

  • ROMs vary highly in size, depending on the system. Some are small, but modern systems may have large ones, 256MB or more. Some (all?) systems only allow sizes that are powers of 2, so a 130MB game on one of these systems would need a 256MB ROM, largely empty. Note that because of this, some clones may have wildly different sizes, if a game version crosses the threshold and has to use a cartridge that is twice the size.
  • There are currently thousands of known ROMs on many systems, with most systems still having new ones released constantly. Even for older systems, there is a major ROM-hacking community that produces modified ROMs often.
  • Storing similarity data for every possible pair of ROMs would result in millions of rows of data for any of the more popular systems. A system with 5000 ROMs has roughly 12.5 million unique pairs, with a single new game adding another 5000 rows.
  • State of the processing must be recoverable, so that if it is interrupted it can pick up where it left off. With any method, a lot of processing will be required, and assuming that the whole thing will run in one batch is not safe.
  • New ROMs could be added at any time, so the method should not assume that it already has a "complete" set. That is, even after you have already figured out similarity for all existing ROMs, if a new one is added (and this could also occur before previous processing was entirely finished) there must be a method for comparing it to all previous ones, to determine which (if any) it is a clone of.
  • Higher processing speed should be given priority over accuracy (to a point). Knowing whether two ROMs are 94% or 96% similar is not particularly important, but if it takes a day of processing to compare a new ROM to all the previous ones, the program would probably never truly complete.

It's been an interesting problem to work on; I look forward to seeing what other people can come up with. Let me know in the comments if you want any more details, and I'll try to supply them.

+7  A: 

You might want to look at bsdiff, which is a binary diffing/patching system. There is also a thesis with lots of theory.
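For example, a minimal Python sketch (my own illustration, not something bsdiff provides) that uses the size of a generated patch as a rough similarity score; it assumes the bsdiff command-line tool is installed and on your PATH:

    # Rough similarity from the size of a bsdiff patch (illustrative only).
    # Assumes the `bsdiff` CLI is installed: bsdiff OLDFILE NEWFILE PATCHFILE
    import os
    import subprocess
    import tempfile

    def bsdiff_similarity(rom_a: str, rom_b: str) -> float:
        """Return a rough similarity in [0, 1]: near 1.0 means a tiny patch."""
        with tempfile.NamedTemporaryFile(suffix=".patch", delete=False) as tmp:
            patch_path = tmp.name
        try:
            subprocess.run(["bsdiff", rom_a, rom_b, patch_path], check=True)
            patch_size = os.path.getsize(patch_path)
            target_size = os.path.getsize(rom_b)
            return max(0.0, 1.0 - patch_size / target_size)
        finally:
            os.remove(patch_path)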

jpalecek
+1  A: 

You can start by storing something like hash trees. You only need to store one such set of hashes for every ROM, and the required storage space is only proportional to (but much lower than) the size of the ROM, assuming constant block size. The chosen block size must give sufficient granularity to ensure accuracy: e.g. for a minimum size of 128MiB, an accuracy constraint of 1% and a Tiger-128 hash (similar to what they use to check files transferred via DirectConnect), a block size of 1MiB does fine and you can store all the hashes in 128 * 128 / 8 = 2048 bytes! So doing it for 10,000 ROMs would only require about 20MiB of space. Further, you can choose a less secure, but faster and/or smaller hash. Adding a new ROM and checking it for similarity would entail something like:

  1. Split the new ROM in blocks and hash each of them.
  2. For every ROM already in the database, compare (see below) its hashes with the new ROM's hashes.

The comparison function should check for similarity. But it should treat each hash as an indivisible value, i.e. don't bother trying to find a logically significant difference function between two hashes. As long as the block size is low enough and hash collisions are rare enough, accuracy is guaranteed by a simple is-equal comparison.

As you see, the problem is reduced to a simpler one performance-wise: checking much smaller data sets for similarity.
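A minimal sketch of the idea in Python, assuming a fixed 1 MiB block size, MD5 as a stand-in for Tiger-128, and a simple position-by-position is-equal comparison (all of these choices are placeholders, not requirements):

    # Per-block hashing sketch: store one small hash per 1 MiB block of ROM.
    import hashlib

    BLOCK_SIZE = 1024 * 1024  # 1 MiB; tune for granularity vs. storage

    def block_hashes(path: str) -> list:
        """Hash the ROM block by block; this list is all that needs storing."""
        hashes = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                hashes.append(hashlib.md5(block).digest())
        return hashes

    def block_similarity(hashes_a: list, hashes_b: list) -> float:
        """Fraction of corresponding blocks whose hashes are exactly equal."""
        if not hashes_a or not hashes_b:
            return 0.0
        matches = sum(a == b for a, b in zip(hashes_a, hashes_b))
        return matches / max(len(hashes_a), len(hashes_b))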

Eduard - Gabriel Munteanu
This is certainly good in terms of efficiency, but my concern is reliability. If the alignment of the data in one of the files diverges slightly from the other one, all the hashes after that point are totally useless. It would only work with very "rigid" data, unless I'm missing something.
Chad Birch
I think it works well with an application like DC++, where the result you're looking for is two identical files, and you want to know which chunks are "damaged", but it won't necessarily apply well to a situation where you're just trying to detect similarity.
Chad Birch
If you can devise an app-specific block-delimiting scheme (e.g. you separate blocks at what look like subroutine 'ret' instructions) then those blocks can slide around without disturbing the hashes too much. My suggested CRM114 is basically small sliding windows and some statistical data structures.
Liudvikas Bukys
+2  A: 

I think some techniques borrowed from data-compression could be interesting here:

Assume you have two files, A and B.

Compress each file individually and add the compressed sizes together. Then concatenate the two files into a single, large file and compress it as well.

The difference in the sizes will give you a rough estimate of how similar the files are.

I suggest that you try the Burrows-Wheeler Transform (bzip2) to do the compression. Most other compression algorithms only have a limited history. The BWT algorithm, on the other hand, can work on very large chunks of data. The algorithm "sees" both files at the same time and any similarity will result in a higher compression ratio.
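A rough sketch of this in Python using the built-in bz2 module (the scaling to a 0–1 score is my own reading of the idea). One caveat: bzip2's blocks top out at 900KB, so for very large ROMs a compressor with a bigger window, such as LZMA, may capture cross-file redundancy better.

    # Compression-based similarity: compare size(A)+size(B) with size(A+B).
    import bz2

    def _read(path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

    def compression_similarity(path_a: str, path_b: str) -> float:
        """0.0 = nothing shared; values near 1.0 = nearly identical content."""
        a, b = _read(path_a), _read(path_b)
        ca = len(bz2.compress(a, compresslevel=9))
        cb = len(bz2.compress(b, compresslevel=9))
        cab = len(bz2.compress(a + b, compresslevel=9))
        # How much concatenating the files saved, relative to the smaller file.
        return (ca + cb - cab) / min(ca, cb)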

Nils Pipenbrinck
+1  A: 

Two thoughts:

  • Consider organizing the file as a data flow graph and doing some canonicalization on that representation. Since you know the instruction set, this may be feasible, maybe just strapping up a disassembler and doing some text processing.
  • A trainable classifier such as CRM114 might come in handy for giving you a compact representation that gives you some idea whether binaries have much in common.
Liudvikas Bukys
+3  A: 

Though it's been a lot more than "a couple of days", I figured I should probably add my current solution in here.

Nils Pipenbrinck was going in the same direction as my current method. Since one of the main results of finding clones is huge savings from solid archiving, I figured that I could just try compressing any two ROMs together and seeing how much space was saved. I am using the LZMA algorithm in 7zip for this.

The first step is to compress every ROM individually and note the compressed size, then try archiving any two ROMs together and see how much the resulting size differs from their individual compressed sizes. If the combined size is the same as the sum of the individual sizes, they are 0% similar, and if the size is the same as one of them (the largest one), they are identical.

Now, this is a huge number of compression attempts required, so I have a couple of optimizations so far (and would like to figure out more):

  1. Prioritize comparisons based on how similar the compressed sizes are. If ROM A has a compressed size of 10MB and ROM B has a compressed size of 2MB, it is impossible for them to be any more than 20% similar, so comparing them to get the real result can be left until later. Running the same compression algorithm on highly-similar files tends to result in similar-size results, so this finds a lot of the clones very quickly.

  2. Combined with the above, keep both upper and lower "bounds" on the possible similarity between any pair of ROMs. This allows further prioritization. If ROMs A and B are 95% similar, and ROMs B and C are only 2% similar, then you already know that A and C are between 0% and 7%. This is too low to be a clone, so this comparison can be safely postponed or even ignored entirely, unless I really want to know the exact similarities of everything.
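For anyone curious, here is a stripped-down sketch of this measure using Python's built-in lzma module instead of driving 7zip directly; the exact scaling to a percentage is just one reading of the description above.

    # Pairwise LZMA compression measure plus the cheap size-based upper bound.
    import lzma

    def _read(path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

    def pair_similarity(path_a: str, path_b: str) -> float:
        a, b = _read(path_a), _read(path_b)
        ca = len(lzma.compress(a))
        cb = len(lzma.compress(b))
        combined = len(lzma.compress(a + b))
        # combined == ca + cb      -> 0% similar
        # combined == max(ca, cb)  -> effectively identical
        return (ca + cb - combined) / min(ca, cb)

    def upper_bound_from_sizes(ca: int, cb: int) -> float:
        """Optimization 1: ROMs whose compressed sizes differ a lot cannot be
        very similar (10MB vs 2MB caps out at 20%)."""
        return min(ca, cb) / max(ca, cb)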

Chad Birch
It is an interesting problem, I'm surprised more people didn't answer. Your solution is simple and turn-key. A bunch of us (including me) went deeper into custom representations, which I see now you're not so interested in. All you wanted was a simple distance metric. Now just add some clustering.
Liudvikas Bukys
+5  A: 

Use some ideas from Plagiarism Detection algorithms.

My idea:

In order to create a comparable "signature" for each ROM that varies only slightly as small portions change, produce something like a word-frequency graph, but instead of recording the frequencies of words, hash very short sections of the ROM and record the frequencies of the hash values.

Don't just hash one section, then the next section starting from the end of the first, but instead use a sliding window: hash the section starting from byte 1, then hash the same-size section starting from byte 2, then from byte 3, etc. That will negate the effect of variable-sized changed portions within your ROM.

You could use a simple hash function, like the XOR of each 8-bit byte in the window, so that you can easily compute the hash of the next window position by XORing the current hash with the outgoing byte and with the incoming byte. Another alternative hash function may simply be to use the instruction code word length. That may be sufficient to create consistent patterns for the codes representing machine instructions. The important thing is that you'll want a hash function that maps common short sequences of instruction code to the same hash values.

You would probably want fewer hash values with higher frequencies of each, but don't go too far or your graph will be too flat, resulting in difficulty comparing them. Likewise don't go too wide, or you'll have lots of very small frequencies, making comparison hard again.

Store this graph per ROM. Compare frequency graphs for two different ROMs by calculating the sum of the squares of the difference in frequencies for each hash value. If that sums to zero then the ROMs are likely to be identical. The further away from zero it is, the less similar the ROMs will be.
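A small Python sketch of this signature idea, using the rolling XOR hash suggested above with an arbitrary 4-byte window. Note that a plain XOR of bytes only yields 256 possible hash values, so a wider rolling hash may be needed in practice to avoid an overly flat graph.

    # Sliding-window hash frequency signature and sum-of-squares comparison.
    from collections import Counter

    WINDOW = 4  # bytes per hashed window (arbitrary; tune for granularity)

    def rom_signature(data: bytes) -> Counter:
        """Frequency of each rolling-XOR hash value over all window positions."""
        counts = Counter()
        if len(data) < WINDOW:
            return counts
        h = 0
        for b in data[:WINDOW]:
            h ^= b
        counts[h] += 1
        for i in range(WINDOW, len(data)):
            h ^= data[i - WINDOW]  # remove outgoing byte
            h ^= data[i]           # add incoming byte
            counts[h] += 1
        return counts

    def signature_distance(sig_a: Counter, sig_b: Counter) -> int:
        """Sum of squared frequency differences; 0 means likely identical."""
        keys = set(sig_a) | set(sig_b)
        return sum((sig_a[k] - sig_b[k]) ** 2 for k in keys)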

Stephen Denne
+4  A: 

It sounds like you want a binary delta or perhaps an index derived from the application of a binary delta (like its size). You could then compare this index to some baseline that you determine experimentally to decide whether it's a "clone" or not.

There are a lot of similarities between compression and delta creation, so I'd say you aren't far off with your current implementation.

That being said, pairwise comparison of every binary file in your database is probably prohibitively expensive (O(n²), I think). I would try to find a simple hash for identifying possible candidates for comparison. Something conceptually similar to what spdenne and Eduard are suggesting. That is, find a hash which can be applied to every item once, sort that list and then use a finer grained comparison on items whose hashes are close together in the list.

Constructing hashes useful for the general case has been an actively pursued research topic in CS for several years. The LSHKit software library implements some algorithms of this sort. The internet-accessible paper Finding Similar Files in a Large File System seems like it might be targeted more at comparing text files but might be useful to you. The more recent paper Multi-resolution similarity hashing describes a more powerful algorithm. It doesn't appear to be accessible without a subscription, though. You probably want to keep the Wikipedia article on Locality Sensitive Hashing handy as you browse the other resources. They all get pretty technical and the Wikipedia entry itself is pretty math heavy. As a more user-friendly alternative you might be able to apply some ideas (or even executables) from the field of Acoustic Fingerprinting.

If you're willing to abandon the general case it's likely that you can find a much simpler (and faster) domain-specific hash function that works just for your ROMs. Possibly something involving the placement of standard, or common, byte sequences and the value of select bits near them. I don't really know much about your binary format but I'm imagining things that signal the start of sections in the file like regions for sound, images or text. Binary formats frequently store the addresses of these sorts of sections near the beginning of the file. Some also use a chaining mechanism that stores the address of the first section at a known location along with its size. This allows you to move to the next section, which also contains a size, etc. A little investigation will probably allow you to discover any relevant formatting, if you're not already aware of it, and should put you well on your way to constructing a useful hash.

If the hash functions don't get you all the way (or they require input of some sort to define a metric/distance) then there are several binary delta algorithms and implementations available on the web. The one I'm most familiar with is used by the subversion version control system. It uses a binary delta algorithm called xdelta to efficiently store binary file revisions. Here's a link directly to the file in their repository that implements it: xdelta.c. There's probably a tool on the web that makes this more accessible as well.
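To illustrate the "hash every item once, sort, then do finer-grained comparison on near neighbours" idea, here is a hypothetical Python sketch. The coarse key used (a fast, low-preset LZMA compressed size) is my own stand-in for a proper locality-sensitive fingerprint, and fine_similarity() stands for any of the pairwise measures discussed on this page.

    # Coarse-key prefilter: only fully compare items that sort near each other.
    import lzma

    def coarse_key(path: str) -> int:
        """Cheap one-pass key per ROM; compressed size is a placeholder."""
        with open(path, "rb") as f:
            return len(lzma.compress(f.read(), preset=1))

    def candidate_pairs(paths, window: int = 5):
        """Yield only pairs whose coarse keys land close together after sorting."""
        keyed = sorted((coarse_key(p), p) for p in paths)
        for i, (_, a) in enumerate(keyed):
            for _, b in keyed[i + 1 : i + 1 + window]:
                yield a, b

    # Hypothetical usage, with fine_similarity() defined elsewhere:
    # for a, b in candidate_pairs(rom_paths):
    #     if fine_similarity(a, b) > 0.9:
    #         print(a, b, "look like clones")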



Edit: I was very wrong about the computational complexity of the pairwise diffing. It's O(n²) instead of O(n!). Still pretty expensive, but much less than my original assessment.

Waylon Flinn
Lots of great information and links/papers to read through in here, thanks.
Chad Birch
+1  A: 

As Waylon Flinn said, you may need a binary delta algorithm. The rsync algorithm is a good one. It is fast and reliable. See also the utility's documentation.

Yuval F
+1  A: 

The difficulty here is that since you are dealing with executable code, simple changes can propagate across the entire ROM. The addresses and offsets for ALL values can change with the addition of a single variable or no-op instruction. That will make even block-based hashing worthless.

A quick-and-dirty approach would be to hack something up with difflib (or the equivalent in your favorite language), since it gets you a sliding comparison that can deal with data addition or removal. Split the ROM into executable and data sections (if possible). The data section can be compared directly and a similarity ratio calculated, though you'll still have problems with addresses or offsets.

The executable section is more interesting. Read up on the machine's asm format, take the executable and split it into a sequence of opcodes. Leave the opcode and register parts, but mask off the "payload" / "immediate" parts (where it loads the variable addresses). Hand the resulting info to the similarity ratio calculator too.

The unfortunate part is that this is still an O(n^2) operation on the number of ROMs you track, but that can be alleviated with (incremental) clustering or a frequency-based comparison order to reduce the amount of comparisons needed.
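A quick sketch of the difflib part in Python; the splitting into sections and the operand-masking step are machine-specific, so masked() is only a stub here, and be warned that SequenceMatcher can be very slow on multi-megabyte inputs.

    # Similarity ratio between two (possibly masked) byte sections via difflib.
    from difflib import SequenceMatcher

    def masked(section: bytes) -> bytes:
        """Placeholder: a real version would zero out the operand/'payload'
        bytes of each instruction so shifted addresses don't count as changes."""
        return section

    def section_similarity(sec_a: bytes, sec_b: bytes) -> float:
        # autojunk=False stops difflib from discarding frequently occurring bytes.
        matcher = SequenceMatcher(None, masked(sec_a), masked(sec_b),
                                  autojunk=False)
        return matcher.ratio()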

HUAGHAGUAH
+1  A: 

XDelta is pretty useful for getting decent binary diffs: http://xdelta.org

Rik Hemsley