I'm searching for strategies one might use to programmatically find files that may be duplicates of each other, specifically videos in this case.
I'm not looking for exact matches (as nice as that would be in the land of rainbows and sunshine). I just want to collect pairs of videos whose content might be the same, so that a human can compare them and confirm. For example: same content, different resolution.
The strategies I have so far:
- Hashing
- Comparing file size
- Comparing length of video
- Comparing file names
- Holding findings persistently to "remember" previous duplicates
- Mixing and matching the strategies above
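
To illustrate what I mean by mixing and matching, here is a rough sketch that flags a pair when the durations nearly match or the file names are similar, then ranks duration matches first. The `VideoInfo` record, the function names, and both thresholds are placeholders of my own, and it assumes the durations have already been read by some other tool (ffprobe, for instance). It compares every pair, so it would need bucketing to scale to a large library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path


@dataclass
class VideoInfo:
    path: str
    duration_s: float  # assumed to be extracted beforehand by another tool


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two file stems, ignoring case."""
    stem_a, stem_b = Path(a).stem.lower(), Path(b).stem.lower()
    return SequenceMatcher(None, stem_a, stem_b).ratio()


def candidate_pairs(videos, duration_tol_s=1.0, min_name_ratio=0.8):
    """Return pairs whose durations nearly match or whose names are similar."""
    results = []
    for a, b in combinations(videos, 2):
        same_length = abs(a.duration_s - b.duration_s) <= duration_tol_s
        ratio = name_similarity(a.path, b.path)
        if same_length or ratio >= min_name_ratio:
            results.append((a.path, b.path, same_length, round(ratio, 2)))
    # Duration matches come first, then higher name similarity.
    results.sort(key=lambda r: (r[2], r[3]), reverse=True)
    return results
```

File size is left out on purpose here, since a re-encode at a different resolution changes it completely.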
Are there any other strategies, or refinements of the strategies listed above, that you are aware of?
Does anyone know of any hash functions that produce similar hash values for similar inputs, so that the distance between two hashes indicates how "close" the overall content is?
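
To make the "close" hash idea concrete, this is the kind of thing I have in mind: a minimal sketch of an average hash over a single frame, compared by Hamming distance. I am not claiming it is the right choice for video. It assumes a frame has already been decoded into a 2D grid of grayscale values at least 8 pixels on each side, and every function name here is hypothetical.

```python
def downscale(pixels, size):
    """Block-average a 2D grayscale grid down to size x size.

    Assumes both dimensions of `pixels` are at least `size`.
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        r0, r1 = r * h // size, (r + 1) * h // size
        row = []
        for c in range(size):
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels, size=8):
    """Return a size*size-bit integer: one bit per cell brighter than the mean."""
    small = downscale(pixels, size)
    flat = [v for row in small for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes; smaller means closer."""
    return bin(a ^ b).count("1")
```

Because the frame is shrunk to a fixed grid before hashing, the same picture at two resolutions should land on the same or a nearby hash, and a pair could be flagged whenever the distance across a few sampled frames stays under some threshold.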