tags:
views: 104
answers: 1

I have around 2 million strings of varying lengths that I need to compress and store in MongoDB GridFS as files.

The strings are currently stored in a TEXT column of an MS SQL table. I wrote a sample app that reads each row, compresses the string, and stores it as a GridFS file.

There is one reader thread and a pool of 50 worker threads storing the results. It works, but it is very slow (100 records per second on average).

Is there any way to speed up the import into GridFS?

I'm using MongoDB 1.6 on Windows with the MongoCSharp driver, in C# on .NET.
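For reference, the pipeline looks roughly like the sketch below (a stripped-down illustration, not my actual code): one reader streams rows out of the SQL Server table into a bounded queue, and 50 workers compress each value with GZip and hand it to GridFS. The connection string, the table and column names ("Documents", "Id", "Body"), and the `writeToGridFs` callback are placeholders; the actual GridFS write depends on the driver version, so it is left as a delegate. It assumes .NET 4 for `BlockingCollection` and `Task`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

static class GridFsImport
{
    // writeToGridFs stands in for the driver-specific GridFS write
    // (filename + compressed bytes); everything else is plain .NET.
    public static void Run(string sqlConnectionString, Action<string, byte[]> writeToGridFs)
    {
        // Bounded queue so the single reader cannot run arbitrarily far ahead of the writers.
        var queue = new BlockingCollection<Tuple<string, string>>(boundedCapacity: 1000);

        // 50 consumers: compress each TEXT value and store it as a GridFS file.
        var workers = new Task[50];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                    writeToGridFs(item.Item1 + ".gz", Compress(item.Item2));
            }, TaskCreationOptions.LongRunning);
        }

        // Single reader: stream rows out of the MS SQL table.
        // "Documents", "Id" and "Body" are placeholder table/column names.
        using (var conn = new SqlConnection(sqlConnectionString))
        using (var cmd = new SqlCommand("SELECT Id, Body FROM Documents", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    queue.Add(Tuple.Create(reader.GetInt32(0).ToString(), reader.GetString(1)));
            }
        }

        queue.CompleteAdding();          // let the workers drain the queue and exit
        Task.WaitAll(workers);
    }

    // GZip-compress a string; GridFS only ever sees the resulting byte[].
    static byte[] Compress(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            using (var writer = new StreamWriter(gzip))
                writer.Write(text);
            return output.ToArray();     // valid even after the MemoryStream is closed
        }
    }
}
```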

+2  A: 

I think I found the issue inside the MongoDB CSharp driver by profiling it while running a very simple app that puts 1000 strings into 1000 GridFS files.

It turns out that 97% of the time is spent checking whether a file with the same filename already exists in the collection. I added an index on the filename field and it's now blazing fast!
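For anyone hitting the same thing: the fix amounts to indexing `filename` on the `fs.files` metadata collection (assuming the default "fs" GridFS root). From the mongo shell that is just `db.fs.files.ensureIndex({ filename: 1 })`. A rough C# sketch is below; the exact index-creation call varies between versions of the MongoCSharp driver, so treat the `MetaData.CreateIndex` signature, the namespace, and the database name as assumptions and check what your driver version actually exposes.

```csharp
using MongoDB;  // MongoCSharp driver; the namespace may differ by driver version

class CreateFilenameIndex
{
    static void Main()
    {
        var mongo = new Mongo();        // defaults to localhost:27017
        mongo.Connect();

        var db = mongo["test"];         // placeholder database name
        var files = db["fs.files"];     // GridFS metadata collection (default "fs" root)

        // Ascending, non-unique index on "filename" (assumed CreateIndex signature);
        // equivalent to db.fs.files.ensureIndex({ filename: 1 }) in the shell.
        var keys = new Document();
        keys["filename"] = 1;
        files.MetaData.CreateIndex(keys, false);

        mongo.Disconnect();
    }
}
```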

The question for me is: if the driver needs to keep filenames unique and performs that check anyway, why doesn't it create a unique index on the field when one is missing? What's the reason behind that?

Khash
That is weird. File names do not have to be unique in GridFS, since there is already the _id primary key, right?
Thilo
I am not sure about the GridFS spec, but profiling the sample app with the MongoDbCSharp library certainly shows that it checks whether the file Exists *and* throws an exception if it does not.
Khash