views:

239

answers:

1

Hello,

I am looking for the fastest way to check for the existence of an object. The scenario is pretty simple, assume a directory tool, which reads the current hard drive. When a directory is found, it should be either created, or, if already present, updated.

First lets only focus on the creation part:

    public static DatabaseDirectory Get(DirectoryInfo dI)
    {
        var result = DatabaseController.Session
                      .CreateCriteria(typeof (DatabaseDirectory))
                      .Add(Restrictions.Eq("FullName", dI.FullName))
                      .List<DatabaseDirectory>().FirstOrDefault();

        if (result == null)
        {
            result = new DatabaseDirectory
                         {
                             CreationTime = dI.CreationTime,
                             Existing = dI.Exists,
                             Extension = dI.Extension,
                             FullName = dI.FullName,
                             LastAccessTime = dI.LastAccessTime,
                             LastWriteTime = dI.LastWriteTime,
                             Name = dI.Name
                         };
        }
        return result;
    }

Is this the way to go regarding:

  • Speed
  • Separation of Concern

What comes to mind is the following: A scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan, and look them up this way. On the other hand, this may be not suitable for large sets of data, like files (which will be 600.000 or more)...

Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...

Thanks, Chris

PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)

Note: In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is, what is the fastest way to "check for the existance of an object, where the ID is not known, just a property of that object"

And batching might be possible, by selecting a big group with something like "starts with C:Testfiles\" but the problem then remains, how do I know in advance how big this set will be. I cant select "max 1000" and check in this buffered dictionary, because i might "hit next to the searched dir"... I hope this problem is clear. The most important part, is, is buffering really affecting performance this much. If so, does it make sense to load the whole DB in a dictionary, containing only PATH and ID (which will be OK, even if there are 1.000.000 object, I think..)

+1  A: 

First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.

In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.

However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:

Dictionary<string, DirectoryInfo> directories;
session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();

Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).

As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.

Stuart Childs
Ok, thanks for your answer. The bulk part is interesting, I have to investigate this in detail. Problem with the first suggestion: I do not know the ID. The path is not the ID, I will add this in the post above to make it more clear. Thanks for the answer...
Christian