I want to make a class (let's call the class HugeStream) that takes an IEnumerable<Stream> in its constructor. This HugeStream should implement the Stream abstract class.

Basically, I have one or more UTF-8 streams coming from a DB that, when put together, make a gigantic XML document. The HugeStream needs to be file-backed so that I can seek back to position 0 of the whole stitched-together stream at any time.

Anyone know how to make a speedy implementation of this?

I saw something similar created at this page but it does not seem optimal for handling large numbers of large streams. Efficiency is the key.

On a side note, I'm having trouble visualizing Streams and am a little confused now that I need to implement my own Stream. If there's a good tutorial on implementing the Stream class that anyone knows of, please let me know; I haven't found any good articles browsing around. I just see a lot of articles on using already-existing FileStreams and MemoryStreams. I'm a very visual learner and for some reason can't find anything useful to study the concept.

Thanks,

Matt

A: 

If you only read data sequentially from the HugeStream, then it simply needs to read each child stream (and append it into a local file as well as returning the read data to the caller) until the child-stream is exhausted, then move on to the next child-stream. If a Seek operation is used to jump "backwards" in the data, you must start reading from the local cache file; when you reach the end of the cache file, you must resume reading the current child stream where you left off.

So far, this is all pretty straight-forward to implement - you just need to indirect the Read calls to the appropriate stream, and switch streams as each one runs out of data.

The inefficiency of the quoted article is that it runs through all the streams every time you read to work out where to continue reading from. To improve on this, you need to open the child streams only as you need them, and keep track of the currently-open stream so you can just keep reading more data from that current stream until it is exhausted. Then open the next stream as your "current" stream and carry on. This is pretty straight-forward, as you have a linear sequence of streams, so you just step through them one by one. i.e. something like:

int currentStreamIndex = 0;
Stream currentStream = childStreams[currentStreamIndex++];

...

public override int Read(byte[] buffer, int offset, int count)
{
    int totalBytesRead = 0;

    while (count > 0)
    {
        // Read what we can from the current stream
        int numBytesRead = currentStream.Read(buffer, offset, count);
        count -= numBytesRead;
        offset += numBytesRead;
        totalBytesRead += numBytesRead;

        // If we haven't satisfied the read request, check whether the child stream is exhausted.
        if (count > 0)
        {
            // A short read does not mean the stream is finished: only a return value of 0
            // signals the end of a stream, so loop around and read again.
            if (numBytesRead > 0)
                continue;

            // If we run out of child streams to read from, we're at the end of the HugeStream, and there is no more data to read
            if (currentStreamIndex >= numberOfChildStreams)
                break;

            // Otherwise, close the current child-stream and open the next one
            currentStream.Close();
            currentStream = childStreams[currentStreamIndex++];
        }
    }

    // Here, you'd write the data you've just read (into buffer) to your local cache stream

    // Report how many bytes were actually placed in buffer (0 means the end of the HugeStream)
    return totalBytesRead;
}

To allow seeking backwards, you just have to introduce a new local file stream that you copy all the data into as you read (see the comment in my pseudocode above). You also need a state flag so you know whether you are reading from the cache file or from the current child stream. While you are in the cached state, you simply redirect any Read calls to the cache stream. Seeking is easy because the cache holds the entire history of the data read from the HugeStream, so the seek offsets are identical between the HugeStream and the cache.

If you read or seek back to the end of the cache stream, you need to resume reading data from the current child stream. Just go back to the logic above and continue appending data to your cache stream.
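To make the cache idea concrete, here is a minimal sketch of how the pieces might fit together. It is not a tuned implementation: the temp-file location, the 81920-byte buffer size, the `FillCacheTo` helper and the decision to answer `Length` by reading every remaining child are all assumptions made for illustration. Seeks past the cached data are handled by simply reading ahead, and seeks past the end are clamped to the end.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: a read-only, seekable stream over a sequence of child
// streams, with everything read so far mirrored into a temporary cache file.
public sealed class HugeStream : Stream
{
    private readonly IEnumerator<Stream> children; // advanced lazily, one child at a time
    private readonly FileStream cache;             // history of all data read so far
    private Stream current;                        // null once every child is exhausted
    private long position;                         // logical position in the HugeStream
    private long cachedLength;                     // number of bytes held in the cache

    public HugeStream(IEnumerable<Stream> childStreams)
    {
        children = childStreams.GetEnumerator();
        current = children.MoveNext() ? children.Current : null;
        // Assumption: a temp file that the OS deletes when the stream is closed.
        cache = new FileStream(Path.GetTempFileName(), FileMode.Create,
            FileAccess.ReadWrite, FileShare.None, 81920, FileOptions.DeleteOnClose);
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }
    public override bool CanWrite { get { return false; } }
    public override long Position
    {
        get { return position; }
        set { Seek(value, SeekOrigin.Begin); }
    }
    // The total length is unknown until every child has been read to the end.
    public override long Length
    {
        get { FillCacheTo(long.MaxValue); return cachedLength; }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n;
        if (position < cachedLength)
        {
            // Replaying history: serve the request from the cache file.
            if (cache.Position != position) cache.Position = position;
            n = cache.Read(buffer, offset, (int)Math.Min(count, cachedLength - position));
        }
        else
        {
            // At the end of the cache: pull new data from the children and append it.
            n = ReadFromChildren(buffer, offset, count);
            if (cache.Position != cachedLength) cache.Position = cachedLength;
            cache.Write(buffer, offset, n);
            cachedLength += n;
        }
        position += n;
        return n;
    }

    private int ReadFromChildren(byte[] buffer, int offset, int count)
    {
        while (current != null && count > 0)
        {
            int n = current.Read(buffer, offset, count);
            if (n > 0) return n;   // a short read is fine; the caller can ask again
            current.Dispose();     // 0 bytes means this child is exhausted
            current = children.MoveNext() ? children.Current : null;
        }
        return 0;
    }

    // Reads ahead (appending to the cache) until the cache holds at least
    // `target` bytes or the children run dry. The logical position is preserved.
    private void FillCacheTo(long target)
    {
        long saved = position;
        position = cachedLength;
        var scratch = new byte[81920];
        while (cachedLength < target && Read(scratch, 0, scratch.Length) > 0) { }
        position = saved;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        long target = origin == SeekOrigin.Begin ? offset
                    : origin == SeekOrigin.Current ? position + offset
                    : Length + offset;               // End: forces a full read
        if (target < 0) throw new IOException("Cannot seek before the start.");
        FillCacheTo(target);                         // no-op for backward seeks
        position = Math.Min(target, cachedLength);   // clamp seeks past the end
        return position;
    }

    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            if (current != null) current.Dispose();
            children.Dispose();
            cache.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

With this sketch you can read the stream to the end, call `Seek(0, SeekOrigin.Begin)` and read it again; the second pass comes entirely from the cache file. One caveat: some consumers ask a seekable stream for its `Length` up front, and here that forces every child to be read into the cache before the first byte is returned.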

If you wish to be able to support full random access within the HugeStream you will need to support seeking "forwards" (beyond the current end of the cache stream). If you don't know the lengths of the child streams beforehand, you have no choice but to simply keep reading data into your cache until you reach the seek offset. If you know the sizes of all the streams, then you could seek directly and more efficiently to the right place, but you will then have to devise an efficient means for storing the data you read to the cache file and recording which parts of the cache file contain valid data and which have not actually been read from the DB yet - this is a bit more advanced.
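For the case where the child stream sizes are known up front, the first step is working out which child a given offset falls in. The sketch below covers only that arithmetic, using running totals and a binary search; `StreamMap` and `TryLocate` are hypothetical names, and the harder part (recording which regions of the cache file hold valid data) is deliberately left out.

```csharp
using System;

// Hypothetical helper: maps an offset in the combined stream to the index of
// a child stream plus an offset inside that child, given all child lengths.
public sealed class StreamMap
{
    private readonly long[] starts; // starts[i] = combined offset where child i begins

    public long TotalLength { get; private set; }

    public StreamMap(long[] lengths)
    {
        starts = new long[lengths.Length];
        long total = 0;
        for (int i = 0; i < lengths.Length; i++)
        {
            starts[i] = total;
            total += lengths[i];
        }
        TotalLength = total;
    }

    // Returns false when the offset lies outside the combined stream.
    public bool TryLocate(long offset, out int index, out long localOffset)
    {
        index = -1;
        localOffset = 0;
        if (offset < 0 || offset >= TotalLength) return false;

        int i = Array.BinarySearch(starts, offset);
        if (i < 0)
        {
            // No exact match: ~i is the first child starting after the offset,
            // so the offset lies inside the child just before it.
            i = ~i - 1;
        }
        else
        {
            // Exact match: zero-length children share a start offset, so step
            // forward to the last of them, which is the one that holds data.
            while (i + 1 < starts.Length && starts[i + 1] == offset) i++;
        }

        index = i;
        localOffset = offset - starts[i];
        return true;
    }
}
```

Given the index and local offset, a seekable child can be positioned directly; a child that cannot seek still has to be read from its beginning up to that point.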

I hope that makes sense to you and gives you a better idea of how to proceed...

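(You shouldn't need to implement much more than the Read and Seek methods to get this working. The remaining abstract members of Stream, such as Write, SetLength and Flush, can simply throw NotSupportedException or do nothing in a read-only stream.)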

Jason Williams
Jason, thanks for the in-depth explanation. Later today, I'll implement my Stream and see if I can get it to work as expected. I'll vote up your answer once I reach 15 reputation. :)
Matt
By the way, what is the OpenStream method you wrote in the pseudocode? Can't the currentStream just be set to the childStreams[currentStreamIndex++], or do I need extra handling to "open" a stream?
Matt
OpenStream is just pseudocode to show you where you'd open the child stream. i.e. assuming you have created the streams (childstream[i] = new ???Stream();) then you would use childStream.Open() to actually open the stream, then childStream.Read() and childStream.Close() when you're done with it.
Jason Williams
C# streams don't have an Open() method. Is it "opened" when Read is called?
Matt
Oops - Brain not in gear! Yes, when you create a new C# stream, you don't need to open it, but just start reading from it. (I was thinking of calls like File.Open() which opens a file and returns a FileStream for it, and I've spent years in other languages where you have to explicitly Open streams). You're right - you won't need the OpenStream() bit! Many apologies. I'll correct the pseudocode...
Jason Williams
I don't understand why you use "if (numBytes > 0)" where we check if the current stream is exhausted; wouldn't we want to check "if (count <= 0)", meaning no more bytes are available to be read?
Matt
Sorry - just poor typing. Should be 'if (count > 0)', i.e. if we still need to read more data, then move on to the next stream. I'll correct the code.
Jason Williams