I think you have a fair bit of development ahead of you. The process is conceptually simple, but implementing it takes real work.
You have to break each file down into a set of samples at some rate or frequency, then apply a hash function to each image or segment of sound. You can then compare the hashes for collisions to find images that are the same, or sets of sound samples that are present in both streams.
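As a rough illustration of the hash-and-compare step, here is a minimal Python sketch. It assumes both streams have already been decoded to raw bytes, and the names window_hashes and find_common_segments are made up for this example. It uses an exact hash (SHA-256), so it only finds byte-identical segments; media that has been re-encoded or resampled would need a more tolerant hash.

```python
import hashlib
from collections import defaultdict


def window_hashes(samples: bytes, window: int, step: int):
    """Yield (offset, digest) for each window of `window` bytes, advancing by `step`."""
    for offset in range(0, len(samples) - window + 1, step):
        yield offset, hashlib.sha256(samples[offset:offset + window]).digest()


def find_common_segments(a: bytes, b: bytes, window: int = 4):
    """Return sorted (offset_in_a, offset_in_b) pairs whose windows are byte-identical."""
    index = defaultdict(list)
    # Index stream A using non-overlapping windows.
    for off_a, digest in window_hashes(a, window, step=window):
        index[digest].append(off_a)

    matches = []
    # Slide over stream B one byte at a time so a match need not be aligned.
    for off_b, digest in window_hashes(b, window, step=1):
        for off_a in index.get(digest, ()):
            matches.append((off_a, off_b))
    return sorted(matches)


# Toy data standing in for decoded samples (hypothetical values).
stream_a = bytes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
stream_b = bytes([0, 0, 5, 6, 7, 8, 0, 0, 0])
print(find_common_segments(stream_a, stream_b))  # prints [(4, 2)]
```

Indexing one stream and sliding over the other keeps the lookup cheap, but the window size is a guess you would have to tune for your own sample rate or frame size.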
For audio, you could use NAudio or DirectX to decode the stream. For video, you could look at any library that can decode a video file into separate images, such as DirectShow.