I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage of two files' binary data.
To give a little more detail for you to work with, even though this is potentially applicable to many things, I do have a specific problem that I'm working on. I also currently have a working solution, but I don't think that it is ideal. There are probably many optimizations in terms of the comparison method, and storing the results. Hopefully some people here will be able to give me some new ideas. I will probably edit in some information about my current method after a couple of days, but I don't want to bias peoples' thoughts about the problem by telling you how I'm already doing it.
The problem I'm working on is clone detection for video game ROM images. For those that don't have experience with emulation, ROMs are dumps of the data on game cartridges. A ROM "clone" is typically a modified version of the same game, the most common type being a translated version. For example, the Japanese and English versions of the original Final Fantasy for the NES are clones. The games share almost all of their assets (sprites, music, etc), but the text has been translated.
There are currently several groups that work on maintaining lists of clones for the various systems, but as far as I can tell, this is all done manually. What I am attempting to do is find a method to detect similar ROM images automatically and objectively, based on data similarity instead of "these seem like the same game". There are several reasons for detecting clones, but one of the major motivations is to be used with Solid compression. This allows compression of all game clones together into the same archive, with the entire compressed clone set often taking up only slightly more space than one of the individual ROMs.
Some concerns to consider when coming up with potential approaches:
- ROMs vary highly in size, depending on the system. Some are small, but modern systems may have large ones, 256MB or more. Some (all?) systems only have powers of 2 as possible sizes, a 130MB game on one of these systems would have a 256MB rom, largely empty. Note that because of this, some clones may have wildly different sizes, if a game version crosses the threshold and has to use a cartridge that is twice the size.
- There are currently thousands of known ROMs on many systems, with most systems still having new ones released constantly. Even for older systems, there is a major ROM-hacking community that produces modified ROMs often.
- Storing similarity data for every possible pair of ROMs would result in millions of rows of data for any of the more popular systems. A system with 5000 ROMs would require 25 million rows of similarity data, with a single new game adding another 5000 rows.
- State of the processing must be recoverable, so that if it is interrupted it can pick up where it left off. With any method, a lot of processing will be required, and assuming that the whole thing will run in one batch is not safe.
- New ROMs could be added at any time, so the method should not assume that it already has a "complete" set. That is, even after you have already figured out similarity for all existing ROMs, if a new one is added (and this could also occur before previous processing was entirely finished) there must be a method for comparing it to all previous ones, to determine which (if any) it is a clone of.
- Higher processing speed should be given priority over accuracy (to a point). Knowing whether two ROMs are 94% or 96% similar is not particularly important, but if it takes a day of processing to compare a new ROM to all the previous ones, the program would probably never truly complete.
It's been an interesting problem to work on, I look forward to seeing what other people can come up with. Let me know in the comments if you want any more details, and I'll try to supply them.