Let me start off with a bit of background.

This morning one of our users reported that Testuff's setup file has been reported as infected with a virus by the CA antivirus. Confident that this was a false positive, I looked on the web and found that users of another program (SpyBot) have reported the same problem.

And now, for the actual question.

Assuming the antivirus is looking for a specific binary signature in the file, I'd like to find the matching sequences in both files and hopefully find a way to tweak the setup script to prevent that sequence from appearing.

I tried the following in Python, but it's been running for a long time now and I was wondering if there was a better or faster way.

from difflib import SequenceMatcher

# Read both installers as raw bytes
spybot = open("spybotsd160.exe", "rb").read()
testuff = open("TestuffSetup.exe", "rb").read()

# Longest block of bytes that appears in both files
s = SequenceMatcher(None, spybot, testuff)
print(s.find_longest_match(0, len(spybot), 0, len(testuff)))

Is there a better library for Python or for another language that can do this? A completely different way to tackle the problem is welcome as well.

+1  A: 

Why don't you contact CA and ask them to tell you what they're searching for, for that virus?

Or, you could copy the file and change each individual byte until the warning disappeared (may take a while depending on the size).
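A rough sketch of that idea, blanking whole chunks rather than single bytes to cut down the number of scans. Note that `is_flagged` is a hypothetical callback, not a real API: it stands for whatever you use to run the scanner on a candidate file (for example, writing the bytes to a temp file and checking the scanner's verdict). The chunk size is also just an assumption.

```python
def locate_trigger(data, is_flagged, chunk=4096):
    """Return (start, end) ranges whose blanking makes the detection go away.

    is_flagged is a hypothetical callback: bytes -> bool. It must wrap
    whatever scanner you actually have; nothing here talks to a real one.
    """
    hits = []
    for start in range(0, len(data), chunk):
        end = min(start + chunk, len(data))
        patched = bytearray(data)
        patched[start:end] = bytes(end - start)  # zero out this chunk only
        if not is_flagged(bytes(patched)):
            hits.append((start, end))
    return hits


# Toy stand-in for a scanner: flags any data containing a fixed marker.
def fake_scanner(data):
    return b"EVILSIG" in data


sample = b"A" * 100 + b"EVILSIG" + b"B" * 100
print(locate_trigger(sample, fake_scanner, chunk=32))
```

Once a chunk is found, you could rerun with a smaller chunk size on that range to narrow it further. A blanked installer will not run, of course; this only tells you where the scanner is looking.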

It's possible the virus detection may be a lot more complicated than simply looking for a fixed string.

paxdiablo
+1  A: 

Hey, the complexity and running time of this kind of algorithm is not something to underestimate.

If you're interested, the .ps document linked here gives a good introduction to the topic.

Whether a good implementation of these algorithms exists, I can't tell. Maybe use Google to find one, or post a new question on Stack Overflow :)

regards

mana
+3  A: 

See the longest common substring problem. I guess difflib uses the DP solution, which is certainly too slow to compare executables. You can do much better with suffix trees/arrays.

Using Perl's Tree::Suffix might be the easiest solution. Apparently it gives all common substrings in a specified length range:

@lcs = $tree->lcs;
@lcs = $tree->lcs($min_len, $max_len);
@lcs = $tree->longest_common_substrings;
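If you'd rather stay in Python, here is a minimal sketch of the same idea using a suffix automaton (a close relative of the suffix tree, swapped in here because it is short to write). It builds the automaton over one file and walks the other through it, so the work is roughly linear in the combined size. The function name is my own, and the dict-per-state layout is memory hungry, so treat it as an illustration rather than something tuned for multi-megabyte installers.

```python
def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Longest block of bytes occurring in both a and b (first one found in b)."""
    # Suffix automaton of a: per-state transitions, suffix links, max lengths.
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in a:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter max length.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    # Feed b through the automaton, tracking the current match length.
    state = cur_len = best_len = best_end = 0
    for i, ch in enumerate(b):
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0  # at the root with no transition: no match ends here
        if cur_len > best_len:
            best_len, best_end = cur_len, i + 1
    return b[best_end - best_len:best_end]


print(longest_common_substring(b"xxHELLOWORLDyy", b"zzHELLOWORLDqq"))
```

For the real files you would pass in the two byte strings read with `open(..., "rb").read()`.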
+2  A: 

Note that even if you did find it this way, there's no guarantee that the longest match is actually the one being looked for. Instead, you may find common initialisation code or string tables added by the same compiler for instance.
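Because of that, it may be more useful to list every shared block above some minimum length and inspect the candidates one by one. A minimal sketch, assuming a simple greedy scan: index every `min_len`-byte window of the first file, then walk the second file and extend each hit forward. The function name and the default `min_len` are my own choices, and the index holds one slice per offset, so memory use grows quickly on large files.

```python
def common_blocks(a: bytes, b: bytes, min_len: int = 16):
    """Return (offset_in_a, offset_in_b, length) for shared blocks >= min_len."""
    # Map each min_len-byte window of a to the first offset where it occurs.
    index = {}
    for i in range(len(a) - min_len + 1):
        index.setdefault(a[i:i + min_len], i)

    blocks = []
    j = 0
    while j <= len(b) - min_len:
        i = index.get(b[j:j + min_len])
        if i is None:
            j += 1
            continue
        # Extend the match forward as far as both files agree.
        n = min_len
        while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
            n += 1
        blocks.append((i, j, n))
        j += n  # greedy: continue after this block
    return blocks


first = b"1111COMMONBLOCK1zzzzSECONDSHAREDqq"
second = b"..SECONDSHARED--COMMONBLOCK1.."
for i, j, n in common_blocks(first, second, min_len=8):
    print(i, j, second[j:j + n])
```

Blocks that also show up in unrelated installers built with the same tool are probably runtime or stub code, which helps rule them out.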

Brian
A: 

I suspect that looking for binary strings isn't going to help you. An install program is likely to be doing some 'suspicious' things.

You probably need to talk to CA and spybot about white-listing your installer, or about what is triggering the alert.

Douglas Leeder