Does anyone know of a way to compare two .NET assemblies to determine whether they were built from the "same" source files?

I am aware that there are some differencing utilities available, such as the plugin for Reflector, but I am not interested in viewing differences in a GUI, I just want an automated way to compare a collection of binaries to see whether they were built from the same (or equivalent) source files. I understand that multiple different source files could produce the same IL, and realise that the process would only be sensitive to differences in the IL, not the original source.

The main obstacle to just comparing the byte streams for the two assemblies is that .NET includes a field called "MVID" (Module Version Identifier) in the assembly. This appears to have a different value for every compilation, so if you build the same code twice the resulting assemblies will be different.

A related question is, does anyone know how to force the MVID to be the same for each compilation? This would avoid us needing to have a comparison process that is insensitive to differences in the value of the MVID. A consistent MVID would be preferable, as this means that standard checksums could be used.

The background behind this is that a third-party company is responsible for independently reviewing and signing off our releases, prior to us being permitted to release to Production. This includes reviewing the source code. They want to independently confirm that the source code we give them matches the binaries that we earlier built, tested and currently plan to deploy. We are looking for a process that allows them to independently build the system from the source we supply them with, and then compare the resulting checksums against the checksums for the binaries we have tested.

BTW. Please note that we are using continuous integration, automated builds, source control etc. The issue is not related to an internal lack of control over what source files went into a given build. The issue is that a third party is responsible for verifying that the source we give them produces the same binaries that we have tested and plan to put into Production. They should not be trusting any of our internal systems or controls, including the build server or the source code control system. All they care about is getting the source associated with the build, performing the build themselves, and verifying that the outputs match what we say we are deploying.

The runtime speed of the comparison solution is not particularly important.

thanks

+1  A: 

There are a few ways to do this, depending on the amount of work you're willing to do and the importance of performance and/or accuracy. One way, as Eric J. pointed out, is to compare the assemblies in binary, excluding the parts that change on every compilation. This solution is easy and fast but could give you a lot of false negatives (assemblies reported as different although they were built from the same source). A better way is to drill down using reflection. If performance is critical you can start by comparing the types and, if they match, move on to the member definitions. If the type and member definitions are all equal, you can go further and examine the actual IL of each method, which you can get through the MethodBody.GetILAsByteArray method. Again, you're going to find differences even if the source is the same but was compiled with slightly different flags or a different version of the compiler. I'd say that the best solution is to use a continuous integration tool that tags the build with the changeset number from your source control (you are using one, right?).
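As a rough illustration of the first option (binary comparison that skips the parts known to change), here is a minimal Python sketch. It only knows how to locate one varying field, the 4-byte TimeDateStamp in the COFF file header, which sits 8 bytes after the PE signature. It does not locate the MVID: that lives in the #GUID metadata heap described in ECMA-335 Partition II, and you would have to find its offset yourself and pass it in. The function names and the `extra_ranges` parameter are my own inventions for this sketch, and other fields (debug directory entries, for example) may vary between builds as well.

```python
import hashlib
import struct


def coff_timestamp_offset(data: bytes) -> int:
    """Return the file offset of the 4-byte COFF TimeDateStamp field."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew, at offset 0x3C of the DOS header, points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # Signature (4 bytes) + Machine (2) + NumberOfSections (2) precede the stamp.
    return pe_offset + 8


def masked_digest(data: bytes, ranges) -> str:
    """SHA-256 of data with every (offset, length) range overwritten by zeros."""
    buf = bytearray(data)
    for offset, length in ranges:
        if offset < 0 or offset + length > len(buf):
            raise ValueError("mask range outside file")
        buf[offset:offset + length] = bytes(length)
    return hashlib.sha256(buf).hexdigest()


def same_ignoring(path_a: str, path_b: str, extra_ranges=()) -> bool:
    """Compare two files, ignoring the COFF timestamp and any extra ranges.

    extra_ranges is a hypothetical hook: (offset, length) pairs supplied by
    the caller, e.g. the 16 bytes of the MVID once its offset is known.
    """
    digests = []
    for path in (path_a, path_b):
        with open(path, "rb") as f:
            data = f.read()
        ranges = [(coff_timestamp_offset(data), 4), *extra_ranges]
        digests.append(masked_digest(data, ranges))
    return digests[0] == digests[1]
```

Something like `same_ignoring("ours.dll", "theirs.dll", extra_ranges=[(mvid_offset, 16)])` would then be the whole check, where the file names and `mvid_offset` are placeholders. The appeal of this shape is that the reviewer only has to audit the list of ignored ranges.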

A related article

Diadistis
(Q. edited to include additional detail) You and Eric J are correct regarding ignoring the variant portion of the file. This is simple if the format is documented, but I've not found a ref yet. Do you know of one? Regarding using reflection, we are inclined towards the simplest solution, because the external party will need to understand and test the utility. If it's provided by the dev team, there will be greater suspicion of it than if the software were provided by a fourth party. Ignoring a few bytes in the file will be simpler than using reflection.
Clayton
A: 

I am facing a similar problem at the moment. An independent govt regulator must give approval for our system/s and they expect to be able to be given the source code, build it themselves (using processes, hardware, software and systems provided by us if required, so different options/versions etc are not an issue) and then produce checksums of the produced binaries which should be identical to the checksums we produced on our build server and form our submission for request for approval.

They agree/understand that the Microsoft compiler adds an MVID and date/time stamps on every compile, and that a match is therefore impossible to achieve by running a straight checksum algorithm like SHA1 or MD5 over the produced files. They are happy for some sort of tool to be developed/used that can achieve the desired result (providing of course that they also review/approve said tool), for example by filtering out the known differences like MVID and timestamps... But of course then comes the challenge of actually writing such a thing and having it be reliable/reproducible.

It seems stupid to me since our build process is very controlled/formal and is based on labeled code from source control built by a build script/system on a dedicated build server (different server for test and prod) and the results are SHA1 summed. We can currently prove that what we run in production matches the SHA1s we submitted, but we can't prove that the source code we said we built actually produced the binaries we said it did.

And of course in the world of government regulation you just don't really have a choice but to find a way to meet their requirements. Another factor that hurts us is that our Linux-based C++ systems using "make" do indeed produce binary-identical executables from the same source code no matter when they are built, so the regulator's expectation is that we should be able to do this for all our products, even those using VS2008 VC9 C++ and/or C#, but it's simply not the way that Microsoft compilers work!

If anyone finds examples of how to do either method (filtering or reflection), please share.

Ryan
+1  A: 

It's not too painful to use command-line tools to filter out MVID and date-time stamps from a text representation of the IL. Suppose file1.exe and file2.exe are built from the same sources:

c:\temp> ildasm /all /text file1.exe | find /v "Time-date stamp:" | find /v "MVID" > file1.txt

c:\temp> ildasm /all /text file2.exe | find /v "Time-date stamp:" | find /v "MVID" > file2.txt

c:\temp> fc file1.txt file2.txt

Comparing files file1.txt and FILE2.TXT

FC: no differences encountered
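If you would rather hand the reviewer a single checksum per assembly than an fc result, the same filtering can be done in a few lines of script. This is only a sketch in Python: it assumes the ildasm dump has already been written to a text file by plain redirection (so it is not UTF-16), it reuses the two substrings from the find /v filters above, and the function names are made up for the example.

```python
import hashlib

# Substrings marking lines expected to differ between otherwise identical
# builds; these mirror the find /v filters in the commands above.
VOLATILE_MARKERS = (b"Time-date stamp:", b"MVID")


def normalized_digest(lines, markers=VOLATILE_MARKERS) -> str:
    """SHA-256 over the lines that contain none of the markers.

    Line endings are normalized to LF so that CRLF/LF differences between
    machines do not change the digest.
    """
    h = hashlib.sha256()
    for line in lines:
        if any(m in line for m in markers):
            continue  # drop volatile lines, like find /v does
        h.update(line.rstrip(b"\r\n") + b"\n")
    return h.hexdigest()


def digest_of_dump(path: str) -> str:
    """Digest of an ildasm text dump stored at path (read as raw bytes)."""
    with open(path, "rb") as f:
        return normalized_digest(f)
```

Running `digest_of_dump` on an unfiltered dump from each side (say `ildasm /all /text file1.exe > file1.txt`) should give equal digests exactly when fc reports no differences on the filtered text, apart from line-ending differences, which the sketch deliberately ignores.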

Jerry Currry