We have an images folder containing about a million images. We need to write a program that fetches an image based on a keyword entered by the user; the keyword is matched against the file names to find the right image. Looking for any suggestions. Thanks, N
Keep the images on a separate site or subdomain. You probably don't want all 1M files in a single directory, of course.
You need a database with (at least) three tables:
ImageFile: ID, Filepath
Keyword: ID, theWord
ImageKeyword: ImageID, KeywordID
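A minimal sketch of the lookup against that schema, assuming SQL Server and plain ADO.NET (the connection string is yours to supply; table and column names are the ones above):

using System.Collections.Generic;
using System.Data.SqlClient;

public IList<string> FindImagePaths(string connectionString, string keyword)
{
    const string sql =
        @"SELECT f.Filepath
          FROM ImageFile f
          JOIN ImageKeyword ik ON ik.ImageID = f.ID
          JOIN Keyword k ON k.ID = ik.KeywordID
          WHERE k.theWord = @keyword";

    var paths = new List<string>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@keyword", keyword);
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                paths.Add(reader.GetString(0)); // Filepath column
            }
        }
    }
    return paths;
}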
Store all (images & keywords) in a database.
You can use a full-text index to search for the words, or store each word as a separate entry.
And you will have much faster access to the metadata (filename, creation date, etc.) without retrieving (or opening) the image itself.
This will probably be much faster than relying on a file system that is not designed to store a million entries in a single folder.
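If you go the full-text route, a short sketch assuming SQL Server, a hypothetical Images table with Filepath and Keywords columns, and a full-text index already defined on Keywords:

using System.Collections.Generic;
using System.Data.SqlClient;

// Hypothetical schema: Images(Filepath, Keywords), full-text index on Keywords.
public IList<string> FindByFullText(string connectionString, string keyword)
{
    var paths = new List<string>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT Filepath FROM Images WHERE CONTAINS(Keywords, @keyword)", connection))
    {
        command.Parameters.AddWithValue("@keyword", keyword);
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                paths.Add(reader.GetString(0));
            }
        }
    }
    return paths;
}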
There are the Win32 APIs FindFirstFile, FindNextFile, and FindClose: http://msdn.microsoft.com/en-us/library/aa364418(VS.85).aspx - they probably map into .NET somehow as well. Use them to search for the image without any database.
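For example (a sketch assuming .NET 4 or later), Directory.EnumerateFiles streams results as the directory is walked - essentially FindFirstFile/FindNextFile under the covers - so you never hold all million names in memory at once:

using System.Collections.Generic;
using System.IO;

public static IEnumerable<string> FindByPattern(string folder, string keyword)
{
    // The wildcard pattern is matched against file names by the file system,
    // and results are yielded one at a time instead of all at once.
    return Directory.EnumerateFiles(folder, "*" + keyword + "*");
}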
Getting a million file names from a folder will take a lot of time. I would suggest that you get the file names and put them in a database. That way you can search the names within seconds instead of minutes.
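A sketch of that one-time load, assuming SQL Server, an ImageFile table with a Filepath column (and an identity ID), and SqlBulkCopy to keep the million-row insert fast; the names here are illustrative:

using System.Data;
using System.Data.SqlClient;
using System.IO;

public static void LoadFileNames(string connectionString, string folder)
{
    // Stage the file paths in memory, then push them to the server in one bulk copy.
    var table = new DataTable();
    table.Columns.Add("Filepath", typeof(string));
    foreach (var path in Directory.EnumerateFiles(folder))
    {
        table.Rows.Add(path);
    }

    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "ImageFile";
        bulkCopy.ColumnMappings.Add("Filepath", "Filepath");
        bulkCopy.WriteToServer(table);
    }
}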
My first thoughts for such a large number of images would be to create an inverted-list to use as an index.
If you are able to maintain this list, searching becomes relatively quick and you wouldn't have to trawl through a million file names, which I'm guessing would be too time-consuming for you.
I'd start with looking for some inverted-list implementations.
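A minimal in-memory sketch of the idea - each word appearing in a file name maps to the files whose names contain it (how you tokenize the names and persist the index is up to you):

using System;
using System.Collections.Generic;
using System.IO;

public class ImageIndex
{
    // word -> file paths whose names contain that word
    private readonly Dictionary<string, List<string>> index =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string filePath)
    {
        var name = Path.GetFileNameWithoutExtension(filePath);
        foreach (var word in name.Split(new[] { ' ', '_', '-' },
                                        StringSplitOptions.RemoveEmptyEntries))
        {
            List<string> files;
            if (!index.TryGetValue(word, out files))
            {
                files = new List<string>();
                index[word] = files;
            }
            files.Add(filePath);
        }
    }

    public IList<string> Lookup(string keyword)
    {
        List<string> files;
        return index.TryGetValue(keyword, out files) ? files : new List<string>();
    }
}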
This is the obvious approach, but I would imagine it would be pretty slow for a million images:
public IList<string> GetMatchingImages(string path, string keyword)
{
    var matches = new List<string>();
    // GetFiles loads every file name in the folder into memory at once.
    var images = System.IO.Directory.GetFiles(path);
    foreach (var image in images)
    {
        // Compare against the file name only, so a keyword that happens to
        // appear in the folder path doesn't produce false matches.
        if (System.IO.Path.GetFileName(image).Contains(keyword))
        {
            matches.Add(image);
        }
    }
    return matches;
}
One simple solution is a database in which you store an ID, the path, and a varchar (string) field in which you'll keep all the keywords. (The keywords could be stored in a different table for efficiency purposes.)
That way you could search by filename or by keywords associated to an image.
Depending on the operating system, I suggest you use Indexing Service, Windows Desktop Search, or the latest version of Windows Search. This solves your problem of file lookup based on keyword, it addresses the performance issues in regards to the number of files within a folder, it is scalable, and easily extended.
The DSearch sample at http://msdn.microsoft.com/en-us/library/dd940335(VS.85).aspx does almost exactly what you want and is easy to implement.
For example, if you are querying a million files and need to move files into subfolders to increase performance, you can simply create the folders and move the files. You will not need to change any code.
If you need to change how keywords are applied, such as using the keywords of the file's summary properties, then you only need to change the query.
On the later operating systems you do not even need to install any software, because the search feature is part of the operating system and available through OleDB. If you want to use Advanced Query Syntax (AQS), Microsoft provides a type library for the COM interfaces that makes it easy to generate the SQL command to query the index database.
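A sketch of such an OleDB query against the Windows Search index (the Search.CollatorDSO provider; the property names and LIKE filter are illustrative, and the keyword is inlined and quote-escaped here for simplicity):

using System.Collections.Generic;
using System.Data.OleDb;

public static IList<string> SearchIndex(string keyword)
{
    const string connectionString =
        "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";

    // Windows Search SQL: restrict results to file names containing the keyword.
    string sql =
        "SELECT System.ItemPathDisplay FROM SYSTEMINDEX " +
        "WHERE System.FileName LIKE '%" + keyword.Replace("'", "''") + "%'";

    var results = new List<string>();
    using (var connection = new OleDbConnection(connectionString))
    using (var command = new OleDbCommand(sql, connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                results.Add(reader.GetString(0));
            }
        }
    }
    return results;
}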
Honestly, all these other suggestions about databases, and so on, are a waste of time.
MSDN search of windows search at http://social.msdn.microsoft.com/Search/en-US?query=windows+search
Related Search Technologies to Windows Search at http://msdn.microsoft.com/en-us/library/bb286798(VS.85).aspx
Searching a million files in one folder is going to be prohibitively slow. (See my response at http://stackoverflow.com/questions/2979432/directory-file-size-calculation-how-to-make-it-faster/3050354#3050354 to "Directory file size calculation - how to make it faster?".)
I can search my hard drive of ~300,000 files for "*tabcontrol.cs" in less than a second. The first query takes approx. 4000 ms, and each subsequent query, using a different search term, takes 300-600 ms.
Update: I just switched from "Indexing Service" to "Windows Search", and I can now search 300,000 files over 58 GB for "filename: tabcontrol" in 1.25 seconds, with subsequent searches taking 0.13 to 0.26 seconds.
See the DSearch sample at http://msdn.microsoft.com/en-us/library/dd940335(VS.85).aspx for how easy this is to implement.
"Searching the Desktop" at http://blogs.msdn.com/b/coding4fun/archive/2007/01/05/1417884.aspx
Searching for a file across a hard drive is a slow, tedious operation. Learn how to take advantage of the Windows Desktop Search API and database to find files very quickly. Add innovative new features to your applications using the search capabilities built-in to Vista and available for Windows XP.
Just rename all the images to their respective keywords delimited by spaces. Then use the OS's own search feature.
If that doesn't work, only then look for fancier solutions.