I have an idea for a site that involves uploading files. What I'd like, and I'm wondering if it's possible, is this: when a user clicks "Browse" and selects a file, can the site automatically scan its database for similar files before the file is uploaded? Kind of like the automatic "Related Questions" that appear when you ask a question on this site.

A: 

It's possible to get the file name without uploading the file so you can do the search based on the file name. The content would only be available after the upload.
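As a rough sketch of the file-name search, assuming the browser sends the selected name to the server before the upload starts, something like the following could rank stored names by closeness. The function `similar_file_names` and the sample names are made up for illustration; `difflib.get_close_matches` is from the Python standard library.

```python
import difflib

def similar_file_names(candidate, known_names, limit=5, cutoff=0.6):
    """Return stored file names that resemble the candidate name."""
    # Compare case-insensitively, but return each name as it was stored.
    by_lowered = {}
    for name in known_names:
        by_lowered.setdefault(name.lower(), name)
    matches = difflib.get_close_matches(
        candidate.lower(), list(by_lowered), n=limit, cutoff=cutoff
    )
    return [by_lowered[match] for match in matches]

# Hypothetical names already in the site's database.
existing = ["holiday_photos.zip", "Holiday-Photos-2009.zip", "tax_return.pdf"]
suggestions = similar_file_names("holiday_photos_2009.zip", existing)
print(suggestions)
```

This only compares names, so two copies of the same file with unrelated names would not be matched.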

tucaz
The site could use an upload-first procedure: upload the file, then tag it, comment on it, and fill in the meta info.
Ben S
+1  A: 

Sure, that's possible. But you'll have to come up with your own definition of "similar", as well as an algorithm for finding similar files.

File Type differences

Different file types should be compared differently. For example, a text file is well suited to a diff-based comparison, but comparing images or videos for similarity is considerably more difficult.
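For the text-file case, a minimal sketch of a diff-style comparison could use `difflib.SequenceMatcher` from the Python standard library, which scores how much two sequences have in common. The function name `text_similarity` and the sample strings are hypothetical, and comparing line by line is just one possible choice.

```python
import difflib

def text_similarity(a, b):
    """Return a score from 0.0 to 1.0 based on the lines two texts share."""
    # Comparing lists of lines mirrors what a line-based diff would do.
    matcher = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return matcher.ratio()

stored = "line one\nline two\nline three\n"
upload = "line one\nline 2\nline three\n"
score = text_similarity(stored, upload)
print(round(score, 2))
```

A threshold on this score (which value counts as "similar" is up to you) would decide whether a stored file is shown as a match.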

Difficulty of comparisons

Also, comparing an upload against a large number of files is expensive, since it's typically done pairwise: the new file has to be compared with every stored file. Some indexing methods could make the search more efficient, but I don't see an easy way to do this quickly.
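One indexing idea, sketched here for text files only and under the assumption that sharing words is a good enough first filter, is an inverted index from words to file names. The expensive pairwise comparison then only runs against files that share some words with the upload. The names `build_index` and `candidates` and the sample documents are invented for this example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in set(text.lower().split()):
            index[word].add(name)
    return index

def candidates(index, text, min_shared=2):
    """Return names of documents sharing at least min_shared words with text."""
    shared_counts = defaultdict(int)
    for word in set(text.lower().split()):
        for name in index.get(word, ()):
            shared_counts[name] += 1
    return {name for name, shared in shared_counts.items() if shared >= min_shared}

# Hypothetical stored documents.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "lorem ipsum dolor",
    "c.txt": "a quick brown dog",
}
index = build_index(documents)
shortlist = candidates(index, "quick brown cat")
print(sorted(shortlist))
```

This does not help with images or videos, which would need a different kind of index.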

Crowd Source Alternative

Another alternative would be to have the users of the site point out the similarities; that way you simply display a list of the most popular files that were voted similar. Of course, this doesn't help when uploading a new file, but it can help you gain insight as to what users find similar.

What many sites do to compare similarity of content is to allow users to tag items. If one item shares many of the same tags with another, they're likely similar. This is probably the easiest approach.

This also has the benefit that any content type can be compared to any other content type. So text files that have the same tags as a video can be presented as similar.
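The tag comparison could be sketched like this, using the share of tags two items have in common (Jaccard similarity) as the score. That scoring rule is my assumption, not the only option, and `tag_similarity`, `most_similar`, and the sample catalogue are hypothetical.

```python
def tag_similarity(tags_a, tags_b):
    """Return shared tags divided by all distinct tags (Jaccard similarity)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(item_tags, catalogue, limit=3):
    """Return names of catalogue items with overlapping tags, best match first."""
    scored = [(tag_similarity(item_tags, tags), name) for name, tags in catalogue.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]

# Hypothetical catalogue mixing content types.
catalogue = {
    "beach.mp4": {"holiday", "beach", "2009"},
    "notes.txt": {"holiday", "beach"},
    "invoice.pdf": {"tax"},
}
related = most_similar({"holiday", "beach"}, catalogue)
print(related)
```

Because only tags are compared, the video and the text file can both show up as related to the same item.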

Ben S
I never thought about "Crowd Source Alternative" - thanks for your suggestions.
Wazle
+1, crowd-sourced tagging is probably the way to go in a generalized scenario.
Neil N