+1  Q: 

Search by hash?

I had the idea of a search engine that would index web items like other search engines do now, but would store only the file's title, URL and a hash of its contents.

This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.

This would be more useful for non-textual items like images, executables and archives.
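
To make the idea concrete, here is a minimal sketch of such an index. It assumes SHA-256 as the content hash and uses an in-memory dictionary in place of a real search engine's store; the `HashIndex` class and the example URLs are hypothetical, not an existing service's API.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """Toy in-memory index: content hash -> list of (title, url) records."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    @staticmethod
    def digest(content: bytes) -> str:
        # SHA-256 is an assumption; any strong cryptographic hash would do.
        return hashlib.sha256(content).hexdigest()

    def add(self, title: str, url: str, content: bytes) -> str:
        """Record where this content was seen; only title, URL and hash are kept."""
        key = self.digest(content)
        self._by_hash[key].append((title, url))
        return key

    def lookup(self, content: bytes):
        """Return every (title, url) whose content is byte-for-byte identical."""
        return list(self._by_hash.get(self.digest(content), []))


index = HashIndex()
index.add("Logo", "http://example.com/logo.png", b"\x89PNG fake image bytes")
index.add("Logo mirror", "http://mirror.example.org/logo.png", b"\x89PNG fake image bytes")
index.add("Other", "http://example.com/other.zip", b"PK different bytes")

# Search with a local copy of the file to find every place it appears.
for title, url in index.lookup(b"\x89PNG fake image bytes"):
    print(title, url)
```

Note that an exact hash like this only matches byte-identical copies; a re-encoded image or repacked archive would hash differently.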

I was wondering if there is already something similar?

+1  A: 

Well, for images, there's TinEye (http://tineye.com/), which will one-up that and find you similar images too.

zigdon
A: 

It's not a bad idea. Sometimes I stumble upon a file and try to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained by various means: a web browser, a download manager, or simply copying from a network share.

aku
+2  A: 

Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.

In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.

The main trick for numerical hashes is basically dimension reduction, so far. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
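
As a rough illustration of the string case, here is a simplified simhash sketch. It assumes unweighted whitespace-separated tokens and uses the first 8 bytes of MD5 as the per-token hash; both are choices made for this example, not part of any reference implementation. Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 bytes taken from each token hash


def simhash(text: str) -> int:
    """Simplified simhash: each token votes +1/-1 on every bit position."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "the quick brown fox jumps over the lazy dog near the river bank"
edited = "the quick brown fox jumps over the lazy cat near the river bank"

print(hamming_distance(simhash(original), simhash(edited)))
```

A lookup would then treat two items as near-duplicates when the Hamming distance falls below some threshold, which has to be tuned for the data.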

I'm also doing a little research in this field, although I guess Stack Overflow might not be the right place for nascent work.

Tyler
A: 

If I understand your proposal right, http://bitzi.com/ has done this for a while.

rjmunro