We have an images folder that contains about a million images. We need to write a program that fetches an image based on a keyword entered by the user, matching the keyword against the file names to find the right image. Looking for any suggestions. Thanks, N

+3  A: 
  1. Keep the images on a separate site or subdomain. You probably don't want all 1M files in a single directory, of course.

  2. You need a database with (at least) three tables:

    ImageFile  
        ID  
        Filepath

    Keyword
        ID
        theWord

    ImageKeyword
        ImageID
        KeywordID
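
To make the lookup through these three tables concrete, here is a minimal sketch that uses in-memory collections as stand-ins for the tables. The class and method names (`KeywordLookup`, `FindPaths`) and the case-insensitive comparison are assumptions for illustration; in a real system the same join would run as SQL inside the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordLookup
{
    // In-memory stand-ins for the three tables (hypothetical types).
    public sealed class ImageFile { public int ID; public string Filepath; }
    public sealed class Keyword { public int ID; public string TheWord; }
    public sealed class ImageKeyword { public int ImageID; public int KeywordID; }

    // Roughly equivalent to:
    //   SELECT f.Filepath
    //   FROM Keyword k
    //   JOIN ImageKeyword ik ON ik.KeywordID = k.ID
    //   JOIN ImageFile f ON f.ID = ik.ImageID
    //   WHERE k.theWord = @word
    public static List<string> FindPaths(
        IEnumerable<ImageFile> files,
        IEnumerable<Keyword> keywords,
        IEnumerable<ImageKeyword> links,
        string word)
    {
        var query =
            from k in keywords
            where string.Equals(k.TheWord, word, StringComparison.OrdinalIgnoreCase)
            join ik in links on k.ID equals ik.KeywordID
            join f in files on ik.ImageID equals f.ID
            select f.Filepath;

        return query.ToList();
    }
}
```

The link table is what lets one image carry many keywords and one keyword point at many images without duplicating either.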
egrunin
As well as this, you could hash the image so that you can check if the image actually already exists. Don't use MD5 as it can produce the same result for different files - try SHA1 or higher.
Dominic Zukiewicz
@downvote?? Sheesh, tough crowd.
egrunin
@Dominic: sure. What kind of app are you thinking of that would benefit from that?
egrunin
@Dominic Zukiewicz: "Don't use MD5 and instead use SHA-1"?! Fine, MD5 is 128 bits and SHA-1 is 160, but feeding SHA-1 with anything larger than 80 bytes will eventually result in a collision. Saying that SHA-1 will never produce collisions is just silly talk.
Patrick
@egrunin - If you want to check whether the exact file already exists in the DB, a hash would help. But I was saying that certain algorithms have been known to produce the same key for completely different files. @Patrick - I appreciate that these algorithms have been broken, especially with images having such a diversity of data. Can we agree on SHA-256? Just trying to balance speed with data compactness.
Dominic Zukiewicz
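
The duplicate check discussed in these comments could be sketched roughly as follows, assuming SHA-256 as the thread settles on. The names `DuplicateCheck`, `HashHex` and `TryRegister` are hypothetical, and the in-memory set stands in for what would more likely be an indexed hash column in the database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class DuplicateCheck
{
    // Returns the SHA-256 digest of the stream as a lowercase hex string.
    public static string HashHex(Stream content)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Records the file's hash; returns false if identical content was already registered.
    public static bool TryRegister(string filePath, ISet<string> knownHashes)
    {
        using (var stream = File.OpenRead(filePath))
        {
            return knownHashes.Add(HashHex(stream));
        }
    }
}
```

A matching hash only says that two files are byte-for-byte identical; the same picture saved at a different size or quality would hash differently.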
+1  A: 

Store all (images & keywords) in a database.

You can use a full-text index to search for the words, or store each word as a separate entry.

You will also have much faster access to the metadata (filename, creation date, etc.) without retrieving (or opening) the image itself.

This is probably much faster than relying on a file system that is not made to store one million entries in a single folder.

GvS
A: 

There is Win32 API FindFirstFile, FindNextFile, FindClose: http://msdn.microsoft.com/en-us/library/aa364418(VS.85).aspx - probably they map somehow into .NET as well. Use them to search for the image without any databases.

justadreamer
+1  A: 

Getting a million file names from a folder will take a lot of time. I would suggest that you get the file names and put them in a database. That way you can search the names within seconds instead of minutes.

Guffa
A: 

My first thoughts for such a large number of images would be to create an inverted-list to use as an index.

If you are able to maintain this list, searching would be relatively quick, and you wouldn't have to trawl through a million images, which I'm guessing would be too time-consuming for you.

I'd start with looking for some inverted-list implementations.
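
As a rough idea of what such an index might look like, here is a minimal in-memory sketch. It assumes file names are made of words separated by spaces, underscores, hyphens, dots or commas; the class name `FileNameIndex` and that separator list are illustrative choices, not an existing library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class FileNameIndex
{
    // Assumed word separators inside file names.
    private static readonly char[] Separators = { ' ', '_', '-', '.', ',' };

    // word -> set of paths whose file name contains that word (case-insensitive keys)
    private readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Splits the file name (without extension) into words and records the path under each.
    public void Add(string path)
    {
        string name = Path.GetFileNameWithoutExtension(path);
        foreach (string word in name.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> paths;
            if (!_postings.TryGetValue(word, out paths))
            {
                paths = new HashSet<string>();
                _postings[word] = paths;
            }
            paths.Add(path);
        }
    }

    // Returns every path indexed under the keyword, or an empty collection.
    public ICollection<string> Find(string keyword)
    {
        HashSet<string> paths;
        return _postings.TryGetValue(keyword, out paths) ? paths : new HashSet<string>();
    }
}
```

Building the index still means walking the folder once, but each search afterwards is a single dictionary lookup. Note that this sketch matches whole words only, so "car" would not find "cars".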

Ben Cawley
+1  A: 

This is the obvious approach, but I would imagine it would be pretty slow for a million images:

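    // Requires: using System; using System.Collections.Generic; using System.IO;
    public IList<string> GetMatchingImages(string path, string keyword)
    {
        var matches = new List<string>();

        // Reads every file name in the folder up front; this is the slow part.
        var images = Directory.GetFiles(path);

        foreach (var image in images)
        {
            // Compare against the file name only (not the folder path), ignoring case.
            var fileName = Path.GetFileName(image);
            if (fileName.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(image);
            }
        }

        return matches;
    }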
Paul Hiles
A: 

One simple solution is a database in which you store an ID, the path, and a varchar (string) field in which you'll keep all the keywords. (The keywords could be stored in a different table for efficiency purposes.)

That way you could search by filename or by the keywords associated with an image.

Juan Nunez
+1  A: 

Depending on the operating system, I suggest you use Indexing Service, Windows Desktop Search, or the latest version of Windows Search. This solves your problem of file lookup based on a keyword, addresses the performance issues with regard to the number of files within a folder, is scalable, and is easily extended.

The DSearch sample at http://msdn.microsoft.com/en-us/library/dd940335(VS.85).aspx does almost exactly what you want and is easy to implement.

For example, if you are querying a million files and need to move files into subfolders to improve performance, you can simply create the folders and move the files. You will not need to change any code.

If you need to change how keywords are applied, such as using the keywords of the file's summary properties, then you only need to change the query.

For the later operating systems, you do not even need to install any software, because the search feature is part of the operating system and available through OleDB. If you want to use Advanced Query Syntax (AQS), Microsoft provides a typed library to access the COM interfaces, which makes it easy to generate the SQL command to query the index database.
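
As a hedged sketch of what the OleDB route might involve, the helper below only builds the connection string and the SQL text for a file-name search. The class name `WindowsSearchQuery` and the folder used in the example are assumptions; actually running the query would mean passing these strings to an `OleDbConnection` and `OleDbCommand` on a Windows machine where the folder is covered by the index.

```csharp
using System;

public static class WindowsSearchQuery
{
    // Connection string for the Windows Search OLE DB provider.
    public const string ConnectionString =
        "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";

    // Builds a Windows Search SQL statement listing indexed files under `folder`
    // whose file name contains `keyword`. Single quotes are doubled for escaping;
    // LIKE wildcards (% and _) in the keyword are passed through unchanged.
    public static string Build(string folder, string keyword)
    {
        string scope = "file:" + folder.Replace('\\', '/').Replace("'", "''");
        string pattern = "%" + keyword.Replace("'", "''") + "%";

        return "SELECT System.ItemPathDisplay FROM SystemIndex " +
               "WHERE SCOPE='" + scope + "' " +
               "AND System.FileName LIKE '" + pattern + "'";
    }
}
```

For instance, `WindowsSearchQuery.Build(@"C:\Images", "cat")` produces a query scoped to `file:C:/Images` with the pattern `%cat%`.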

Honestly, all these other suggestions about databases, and so on, are a waste of time.

AMissico
These methods work if the keywords will be embedded in the file metadata. The people suggesting databases are assuming otherwise, and that he wants centralized editing of keywords.
egrunin
@egrunin: You can store the keywords in the file's Summary information provided by the operating system, which is stored as an Alternate Data Stream. Keywords can be managed through Windows Explorer. Everything is already provided.
AMissico
A: 

Just rename all the images to their respective keywords delimited by spaces. Then use the OS's own search feature.

If that doesn't work, only then look for fancier solutions.

CannibalSmith