views: 307
answers: 8

The question's title includes the word "Stream" because the question below is a concrete example of a more general doubt I have about Streams:

I have a problem with two possible solutions, and I want to know which one is better:

  1. I download a file, save it to disk (2 min), read it and write the contents to the DB (+ 2 min).
  2. I download a file and write the contents directly to the DB (3 min).

If the write to the DB fails, I'll have to download the file again in the second case, but not in the first.

Which is best? Which would you use?

+1  A: 

There's no reason step 2 has to take two minutes twice. While you download the file, you can stream it through variables in memory on the way to the database.

Unless you have a compelling reason to keep a file-system copy of the file, I would go with #2 in most cases.
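
A minimal sketch of that direct approach in C# (the URL, connection string, table, and XML element name are all hypothetical placeholders; it assumes an XML feed whose records can be inserted one at a time):

    using System.Data;
    using System.Data.SqlClient;
    using System.Net;
    using System.Xml;

    class DirectToDb
    {
        static void Main()
        {
            // Hypothetical URL, connection string, and schema.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (WebResponse response = request.GetResponse())
            using (XmlReader reader = XmlReader.Create(response.GetResponseStream()))
            using (SqlConnection connection = new SqlConnection(
                "Data Source=.;Initial Catalog=Example;Integrated Security=True"))
            {
                connection.Open();
                using (SqlTransaction transaction = connection.BeginTransaction())
                {
                    SqlCommand insert = new SqlCommand(
                        "INSERT INTO Items (Value) VALUES (@value)", connection, transaction);
                    insert.Parameters.Add("@value", SqlDbType.NVarChar, 256);

                    // Only the current record is held in memory: the tail of the
                    // file is still downloading while earlier records are inserted.
                    while (reader.ReadToFollowing("item"))
                    {
                        insert.Parameters["@value"].Value = reader.ReadElementContentAsString();
                        insert.ExecuteNonQuery();
                    }
                    transaction.Commit();
                }
            }
        }
    }

Nothing is committed until the whole stream has been read, so an interrupted download rolls the transaction back cleanly.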

Jekke
You are right, the first option takes 2 steps, unlike the second one.
Jader Dias
I corrected it in the question now.
Jader Dias
+2  A: 

I would assume that if the write to the database fails due to something in the contents of the file, it will always fail no matter how many times I try to write the same contents to the database. In this case, the only solution is to (fix and) re-download the file anyway. If the write to the database is failing because of something in the database, you've got bigger problems than whether you need to download the file again.

Go with Option #2.

tvanfosson
+1  A: 

I don't understand the qualifiers you've added regarding the times or having to download the file twice, but, if the system is strapped for memory, caching your download to the disk and then sending it to the DB may really be your only option (assuming your data provider can accept a stream).

EDIT: In the original post the author describes writing directly to the database as a two-stage process, which I assume to be 1. download the file into a variable, 2. stream the variable's contents to the DB. If he's streaming directly into the DB in option 2, then I agree that's the better way to go.

overslacked
What do you mean by strapped for memory? Writing directly to the DB would consume more memory?
Jader Dias
If the system has, for example, 25MB free, and you want to insert 45MB of data, you could not store all the data in memory, and you would have to cache it to disk and then send it to the DB in smaller chunks. However, with the changes to your question, I agree option 2 is the better way to go.
overslacked
But with streaming, the data still occupies 45MB? Isn't each chunk discarded as soon as it is used?
Jader Dias
It really depends on how the "pipelining" is set up and how the data provider reads the stream you pass it, but - in theory, at least - if you stream directly into the database you shouldn't have any memory problems. I do not know if this is directly supported by any data providers, however.
overslacked
+1  A: 

I would go with option two. There shouldn't be failures very often, and when there are you can just re-download. If for some reason you need to have that local copy on the file system then don't download, save, read, and send to database... just download and send to database at the same time you're saving to the file system.

Max Schmeling
I am wondering how to implement it, as in .NET I would have to keep the contents in memory to write them to 2 places
Jader Dias
You can read from your StreamReader and then write to multiple StreamWriters.
Max Schmeling
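
A minimal sketch of that tee approach (the URL and paths are hypothetical; the second writer stands in for whatever feeds the database import):

    using System.IO;
    using System.Net;

    class TeeCopy
    {
        static void Main()
        {
            // Hypothetical URL and destination paths.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            using (StreamWriter diskCopy = new StreamWriter(@"C:\cache\data.xml"))
            using (StreamWriter dbFeed = new StreamWriter(@"C:\cache\db-import.xml"))
            {
                char[] buffer = new char[8192];
                int read;
                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Each chunk is written to every destination and then reused,
                    // so memory use stays at one buffer regardless of file size.
                    diskCopy.Write(buffer, 0, read);
                    dbFeed.Write(buffer, 0, read);
                }
            }
        }
    }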
+2  A: 

To detail Jekke's reply:

Depending on the file system creates many occasions for failure: you must create a valid file name, make sure the file system isn't full, make sure the file can be opened and written to by you but not by anyone else, deal with concurrent use, etcetera.

The only benefit of writing to file I can think of is that you'll know the download completed successfully prior to doing anything with the database. If you can hold the contents in memory, do that instead. If you can't, and really insist on not touching the database in case of an interrupted download, at least use .NET's built-in support to help you with the tricky bits (e.g. IsolatedStorageFileStream).
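
For example, a minimal sketch of caching the download in isolated storage before touching the database (the URL and cache file name are hypothetical):

    using System.IO;
    using System.IO.IsolatedStorage;
    using System.Net;

    class IsolatedCacheDownload
    {
        static void Main()
        {
            // Hypothetical URL and cache file name.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly())
            using (IsolatedStorageFileStream cache =
                       new IsolatedStorageFileStream("download.tmp", FileMode.Create, store))
            {
                using (Stream download = request.GetResponse().GetResponseStream())
                {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = download.Read(buffer, 0, buffer.Length)) > 0)
                        cache.Write(buffer, 0, read);
                }

                // Reaching this point means the download completed successfully;
                // only now do we rewind the cached copy for the database import.
                cache.Seek(0, SeekOrigin.Begin);
                // ... hand 'cache' to the code that writes to the DB ...
            }
        }
    }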

reinierpost
+1 for IsolatedStorageFileStream
Jader Dias
+3  A: 

Unless the increased latency is really killing you, I'd usually go for Option 1 unless there's a good reason you don't want the data on the file system (e.g. concerns about security, capacity, ...).

Or maybe Option 3 as suggested by Max Schmeling, save to the filesystem at the same time as writing to the database.

Disk space is cheap, and it's often useful to have a backup of downloaded data (e.g. to test changes to your database writing code, as evidence of the contents of data downloaded, ...).

Joe
+1 for going against the flow
Jader Dias
+1  A: 

I'd choose option 3. Save it to disk and store the URI in the database. I've never been a fan of storing files in a database.

Quibblesome
In my case the file (XML) isn't stored in the DB, but its parsed data is.
Jader Dias
Wait but if you've parsed the data then you have it in memory so the question is then irrelevant, no?
Quibblesome
No, it's not irrelevant. I don't see your point.
Jader Dias
You download the file. You parse the file. You insert into the database. If the database insert fails then surely you still have the objects you created during the parsing? Unless you are merely extracting primitives?
Quibblesome
Humm, now I see...
Jader Dias
A: 

I'd go for an option not mentioned so far (except perhaps in comments), which is the subject of my blog post about blobstreams: set up a processing pipeline of streams that takes care of downloading and interpreting the file you need. Then use code to read interpreted records from this compound stream and do the needed inserts/updates in your database inside one transaction (per file or per record, as your functional requirements dictate).

This kind of scenario is where Stream-based classes excel. It means you never have the entire file anywhere, on disk or in memory, at the same time while processing. Since you mentioned that downloading the file takes minutes, it could be big. Can your system take the intermediate storage of the full file (maybe more than once: in memory and on disk)? Even if multiple files get processed concurrently?

Also, if you found out in practice that the chain is not reliable enough for you, and you would like to be able to temporarily store the downloaded file to disk and then repeat processing of it without having to download it again, this is easy. All that is needed is an extra Stream in the pipeline that checks whether the file is already in your "already downloaded files" cache (in some folder, in isolated storage, whatever) and returns the bytes from there instead of actually plugging a downloading Stream into your processing pipeline; see the sketch below.
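
As an illustration of the caching half of that idea, here is a minimal sketch of a pass-through Stream (a hypothetical class, error handling omitted) that copies everything read from the download into a cache stream without ever holding the whole file:

    using System;
    using System.IO;

    // Read-only pass-through: every chunk read from the inner (download) stream
    // is also written to the cache stream, so a complete local copy exists once
    // the pipeline has consumed the stream.
    class CachingReadStream : Stream
    {
        private readonly Stream inner;
        private readonly Stream cache;

        public CachingReadStream(Stream inner, Stream cache)
        {
            this.inner = inner;
            this.cache = cache;
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            int read = inner.Read(buffer, offset, count);
            if (read > 0)
                cache.Write(buffer, offset, read); // tee each chunk into the cache
            return read;
        }

        public override bool CanRead { get { return true; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return false; } }
        public override long Length { get { throw new NotSupportedException(); } }
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }
        public override void Flush() { cache.Flush(); }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
        public override void SetLength(long value) { throw new NotSupportedException(); }
        public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    }

Wrapped as, say, XmlReader.Create(new CachingReadStream(download, cacheFile)), the parser sees the download unchanged while a complete copy lands in the cache for any later retry.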

peSHIr