views: 307
answers: 8

The question's title includes the word "Stream" because the question below is a concrete example of a more general doubt I have about Streams:

I have a problem with two possible solutions, and I want to know which one is better:

  1. I download a file, save it to disk (2 min), read it and write the contents to the DB (+ 2 min).
  2. I download a file and write the contents directly to the DB (3 min).

If the write to the DB fails, I'll have to download the file again in the second case, but not in the first.

Which is best? Which would you use?

+1  A: 

There's no reason step 2 has to take two minutes twice. While you download the file, you can stream it through variables in memory on the way to the database.

Unless you have a compelling reason to keep a file-system copy of the file, I would go with #2 in most cases.
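
A minimal sketch of that direct approach in C# (the URL, connection string, table, and XML element name are all hypothetical placeholders; it assumes an XML feed whose records can be inserted one at a time):

    using System.Data;
    using System.Data.SqlClient;
    using System.Net;
    using System.Xml;

    class DirectToDb
    {
        static void Main()
        {
            // Hypothetical URL, connection string, and schema.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (WebResponse response = request.GetResponse())
            using (XmlReader reader = XmlReader.Create(response.GetResponseStream()))
            using (SqlConnection connection = new SqlConnection(
                "Data Source=.;Initial Catalog=Example;Integrated Security=True"))
            {
                connection.Open();
                using (SqlTransaction transaction = connection.BeginTransaction())
                {
                    SqlCommand insert = new SqlCommand(
                        "INSERT INTO Items (Value) VALUES (@value)", connection, transaction);
                    insert.Parameters.Add("@value", SqlDbType.NVarChar, 256);

                    // Only the current record is held in memory: the tail of the
                    // file is still downloading while earlier records are inserted.
                    while (reader.ReadToFollowing("item"))
                    {
                        insert.Parameters["@value"].Value = reader.ReadElementContentAsString();
                        insert.ExecuteNonQuery();
                    }
                    transaction.Commit();
                }
            }
        }
    }

Nothing is committed until the whole stream has been read, so an interrupted download rolls the transaction back cleanly.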

Jekke
You are right, the first option takes 2 steps, unlike the second one.
Jader Dias
I corrected it in the question now.
Jader Dias
+2  A: 

I would assume that if the write to the database fails due to something in the contents of the file, it will always fail no matter how many times I try to write the same contents to the database. In this case, the only solution is to (fix and) re-download the file anyway. If the write to the database is failing because of something in the database, you've got bigger problems than whether you need to download the file again.

Go with Option #2.

tvanfosson
+1  A: 

I don't understand the qualifiers you've added regarding the times or having to download the file twice, but, if the system is strapped for memory, caching your download to the disk and then sending it to the DB may really be your only option (assuming your data provider can accept a stream).

EDIT: In the original post the author describes writing directly to the database as a two-stage process, which I assume to be 1. download the file into a variable, 2. stream the variable's contents to the DB. If he's streaming directly into the DB in option 2, then I agree that's the better way to go.

overslacked
What do you mean by strapped for memory? Writing directly to the DB would consume more memory?
Jader Dias
If the system has, for example, 25MB free, and you want to insert 45MB of data, you could not store all the data in memory, and you would have to cache it to disk and then send it to the DB in smaller chunks. However, with the changes to your question, I agree option 2 is the better way to go.
overslacked
But with streaming, the data still occupies 45MB? Isn't each chunk discarded as soon as it is used?
Jader Dias
It really depends on how the "pipelining" is set up and how the data provider reads the stream you pass it, but - in theory, at least - if you stream directly into the database you shouldn't have any memory problems. I do not know if this is directly supported by any data providers, however.
overslacked
+1  A: 

I would go with option two. There shouldn't be failures very often, and when there are you can just re-download. If for some reason you need to have that local copy on the file system then don't download, save, read, and send to database... just download and send to database at the same time you're saving to the file system.

Max Schmeling
I am wondering how to implement it, as in .NET I would have to keep the contents in memory to write them to 2 places
Jader Dias
You can read from your StreamReader and then write to multiple StreamWriters.
Max Schmeling
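
A minimal sketch of that tee approach (the URL and paths are hypothetical; the second writer stands in for whatever feeds the database import):

    using System.IO;
    using System.Net;

    class TeeCopy
    {
        static void Main()
        {
            // Hypothetical URL and destination paths.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            using (StreamWriter diskCopy = new StreamWriter(@"C:\cache\data.xml"))
            using (StreamWriter dbFeed = new StreamWriter(@"C:\cache\db-import.xml"))
            {
                char[] buffer = new char[8192];
                int read;
                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Each chunk is written to every destination and then reused,
                    // so memory use stays at one buffer regardless of file size.
                    diskCopy.Write(buffer, 0, read);
                    dbFeed.Write(buffer, 0, read);
                }
            }
        }
    }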
+2  A: 

To detail Jekke's reply:

Depending on the file system creates many occasions for failure: you must create a valid file name, make sure the file system isn't full, make sure the file can be opened and written to by you but not by anyone else, deal with concurrent use, etcetera.

The only benefit of writing to file I can think of is that you'll know the download completed successfully prior to doing anything with the database. If you can hold the contents in memory, do that instead. If you can't, and really insist on not touching the database in case of an interrupted download, at least use .NET's built-in support to help you with the tricky bits (e.g. IsolatedStorageFileStream).
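
For example, a minimal sketch of caching the download in isolated storage before touching the database (the URL and cache file name are hypothetical):

    using System.IO;
    using System.IO.IsolatedStorage;
    using System.Net;

    class IsolatedCacheDownload
    {
        static void Main()
        {
            // Hypothetical URL and cache file name.
            WebRequest request = WebRequest.Create("http://example.com/data.xml");

            using (IsolatedStorageFile store = IsolatedStorageFile.GetUserStoreForAssembly())
            using (IsolatedStorageFileStream cache =
                       new IsolatedStorageFileStream("download.tmp", FileMode.Create, store))
            {
                using (Stream download = request.GetResponse().GetResponseStream())
                {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = download.Read(buffer, 0, buffer.Length)) > 0)
                        cache.Write(buffer, 0, read);
                }

                // Reaching this point means the download completed successfully;
                // only now do we rewind the cached copy for the database import.
                cache.Seek(0, SeekOrigin.Begin);
                // ... hand 'cache' to the code that writes to the DB ...
            }
        }
    }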

reinierpost
+1 for IsolatedStorageFileStream
Jader Dias
+3  A: 

Unless the increased latency is really killing you, I'd usually go for Option 1 unless there's a good reason you don't want the data on the file system (e.g. concerns about security, capacity, ...).

Or maybe Option 3 as suggested by Max Schmeling, save to the filesystem at the same time as writing to the database.

Disk space is cheap, and it's often useful to have a backup of downloaded data (e.g. to test changes to your database writing code, as evidence of the contents of data downloaded, ...).

Joe
+1 for going against the flow
Jader Dias
+1  A: 

I'd choose option 3. Save it to disk and store the URI in the database. I've never been a fan of storing files in a database.

Quibblesome
In my case the file (XML) isn't stored in the DB, but its parsed data is.
Jader Dias
Wait but if you've parsed the data then you have it in memory so the question is then irrelevant, no?
Quibblesome
No, it's not irrelevant. I don't see your point.
Jader Dias
You download the file. You parse the file. You insert into the database. If the database insert fails then surely you still have the objects you created during the parsing? Unless you are merely extracting primitives?
Quibblesome
Humm, now I see...
Jader Dias
A: 

I'd go for an option not mentioned so far (except perhaps in comments), which is the subject of my blog post about blobstreams: set up a processing pipeline of streams that takes care of downloading and interpreting the file you need. Then use code to read interpreted records from this compound stream and do the needed inserts/updates in your database inside one transaction (per file or per record, as your functional requirements dictate).

This kind of scenario is where Stream-based classes excel. It means you never have the entire file anywhere, on disk or in memory, at the same time while processing. Since you mentioned that downloading the file takes minutes, it could be big. Can your system take the intermediate storage of the full file (maybe more than once: in memory and on disk)? Even if multiple files get processed concurrently?

Also, if you found out in practice that the chain is not reliable enough for you, and you would like to be able to temporarily store the downloaded file to disk and then repeat processing of it without having to download it again, this is easy. All that is needed is an extra Stream in the pipeline that checks whether the file is already in your "already downloaded files" cache (in some folder, in isolated storage, whatever) and returns the bytes from there instead of actually plugging a downloading Stream into your processing pipeline; see the sketch below.
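
As an illustration of the caching half of that idea, here is a minimal sketch of a pass-through Stream (a hypothetical class, error handling omitted) that copies everything read from the download into a cache stream without ever holding the whole file:

    using System;
    using System.IO;

    // Read-only pass-through: every chunk read from the inner (download) stream
    // is also written to the cache stream, so a complete local copy exists once
    // the pipeline has consumed the stream.
    class CachingReadStream : Stream
    {
        private readonly Stream inner;
        private readonly Stream cache;

        public CachingReadStream(Stream inner, Stream cache)
        {
            this.inner = inner;
            this.cache = cache;
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            int read = inner.Read(buffer, offset, count);
            if (read > 0)
                cache.Write(buffer, offset, read); // tee each chunk into the cache
            return read;
        }

        public override bool CanRead { get { return true; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return false; } }
        public override long Length { get { throw new NotSupportedException(); } }
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }
        public override void Flush() { cache.Flush(); }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
        public override void SetLength(long value) { throw new NotSupportedException(); }
        public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    }

Wrapped as, say, XmlReader.Create(new CachingReadStream(download, cacheFile)), the parser sees the download unchanged while a complete copy lands in the cache for any later retry.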

peSHIr