views:

54

answers:

2

I'm searching for strategies one might use to programmatically find files which may be duplicates of each other. Specifically in this case, videos.

I'm not looking for exact matches (as nice as that would be in the land of rainbows and sunshine). I'm just looking to collect pairs of video which content might be the same so that a human can compare them to confirm. For example, same content, different resolution.

The strategies I have so far:

  • Hashing
  • Comparing file size
  • Comparing length of video
  • Comparing file names
  • Holding findings persistently to "remember" previous duplicates
  • Mixing and matching strategies above

Are there any strategies, or refinements of the strategies listed above you are aware of?

Does anyone know of any hash functions that produce ranges of hashing to indicate that the overall content is "close".

+2  A: 

For efficient n-way comparison you'll need to reduce the videos to a small parameter space (a "fingerprint") that has a similarity metric that correlates well with video similarity. Hashing for instance isn't a good parameter space, because small differences in input videos leads to large differences in hashes. On the opposite side of the spectrum, video length isn't a good parameter because rather different videos can have the same length.

A good parameter space depends on what kind of differences you want to ignore, and what kind to amplify. One option that might work would be to divide the video into 10 second intervals in the time dimensions and into 16 rectangles in the space dimension. Then take the average color of each rectangle over the 10 second interval. Then use the euclidean distance between the parameter vectors as the similarity metric. (i.e. for each time interval, for each square, for each color channel, subtract the two intensities, take the square and add it all together) If you need to detect clips that might be small parts of other clips it gets a bit trickier, but the general principle of calculating feature vectors should work. For instance scene change detection should help in creating video length invariant parameters.

Ants Aasma
+1 for euclidean distance: smaller means more similar. nice!
Karussell
Perfect, *great* new direction for me to look in. Thanks
Präriewolf
A: 

This will be almost impossible for a computer to tell. The slightest difference in video streams, such as the width being one pixel less, will result in a completely different datastream. To make any meaningful comparison you will have to recode the videos to a known format and resolution, with a very low framerate. Then you can start looking at each frame to see if they are similar to each other. This is a very intensive job, both computationally and algorithmically.

calmh