I'm downloading some files asynchronously into a large byte array, and I have a callback that fires off periodically whenever some data is added to that array. If I want to give developers the ability to use the last chunk of data that was added to the array, then... well, how would I do that? In C++ I could give them a pointer to somewhere in the middle, and then perhaps tell them the number of bytes that were added in the last operation so they at least know which chunk they should be looking at... I don't really want to give them a second copy of that data; that's just wasteful.

I'm just wondering whether people would want to process this data before the file has finished downloading. Would anyone actually want to do that? Or is it a useless feature anyway? I already have a callback for when the buffer (the entire byte array) is full, and then they can dump the whole thing without worrying about start and end points...

A: 

I think you shouldn't bother.

Why on earth would anyone want to use it?

SLaks
You know, I was thinking about this from a theoretical standpoint, and perhaps you are too, but from a realistic standpoint you're quite right... not another soul on this planet is ever going to use this library anyway. But out of curiosity, do you mean it's not a very useful feature, or do you mean what I just said?
Mark
I think it's not useful in the first place.
SLaks
+3  A: 

You can't give them a pointer into the array, but you could give them the array along with the start index and length of the new data.
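For example, a sketch of what that callback could look like (the names here are illustrative, not from the answer):

public delegate void ChunkReceivedHandler(byte[] buffer, int start, int length);

public class Downloader
{
    // Raised after each append; handlers read buffer[start] through
    // buffer[start + length - 1] directly, with no extra copy.
    public event ChunkReceivedHandler ChunkReceived;

    protected void OnChunkReceived(byte[] buffer, int start, int length)
    {
        ChunkReceivedHandler handler = ChunkReceived;
        if (handler != null)
            handler(buffer, start, length);
    }
}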

But I have to wonder what someone would use this for. Is this a known need, or are you just guessing that someone might want it someday? And if so, is there any reason why you couldn't wait to add the capability until someone actually needs it?

John Knoeller
**YAGNI: You Ain't Gonna Need It.** Generally, don't add features to an API until there is a clear use case. Adding features later is easy - removing them once they're in can be nearly impossible. http://en.wikipedia.org/wiki/You_ain%27t_gonna_need_it
LBushkin
No, it's not a known need. I certainly don't need it for the project I'm working on/creating this library for. Just theorizing about how someone might use this... anyway, this sounds reasonable. Thanks :)
Mark
Better yet, use a System.ArraySegment<byte>.
Trillian
+1  A: 

Copying a chunk of a byte array may seem "wasteful," but then again, object-oriented languages like C# tend to be a little more wasteful than procedural languages anyway. A few extra CPU cycles and a little extra memory consumption can greatly reduce complexity and increase flexibility in the development process. In fact, copying the bytes to a new location in memory sounds like good design to me, as opposed to the pointer approach, which gives other classes access to private data.

But if you do want to use pointers, C# does support them. Here is a decent-looking tutorial. The author is correct when he states, "...pointers are only really needed in C# where execution speed is highly important."
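For illustration, a minimal sketch of the pointer approach (my own sketch, not from the linked tutorial; it must be compiled with the /unsafe switch):

unsafe static int Checksum(byte[] buffer, int start, int length)
{
    int sum = 0;
    // Pin the array so the garbage collector can't move it
    // while we hold a raw pointer into it.
    fixed (byte* p = &buffer[start])
    {
        for (int i = 0; i < length; i++)
            sum += p[i]; // direct access, no copy
    }
    return sum;
}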

Phil
A: 

That sounds like you want an event.

public class ArrayChangedEventArgs : EventArgs {
    public ArrayChangedEventArgs(byte[] array, int start, int length) {
        Array = array;
        Start = start;
        Length = length;
    }
    public byte[] Array { get; private set; }
    public int Start { get; private set; }
    public int Length { get; private set; }
}

// ...
// and in your class:

public event EventHandler<ArrayChangedEventArgs> ArrayChanged;

protected virtual void OnArrayChanged(ArrayChangedEventArgs e)
{
    // using a temporary variable avoids a common potential multithreading issue
    // where the multicast delegate changes midstream.
    // Best practice is to grab a copy first, then test for null

    EventHandler<ArrayChangedEventArgs> handler = ArrayChanged;

    if (handler != null)
    {
        handler(this, e);
    }
}

// finally, your code that downloads a chunk just needs to call OnArrayChanged()
// with the appropriate args

Clients hook into the event and get called when things change. This is what most client code in .NET expects from an API ("call me when something happens"). They can hook into the event with something as simple as:

yourDownloader.ArrayChanged += (sender, e) =>
    Console.WriteLine(String.Format("Just downloaded {0} byte{1} at position {2}.",
            e.Length, e.Length == 1 ? "" : "s", e.Start));
plinth
Err... I already *do* have an event; I was asking about the best format in which to send the data to the user during that event, i.e., what should the event args `e` contain?
Mark
+7  A: 

.NET has a struct that does exactly what you want:

System.ArraySegment.

In any case, it's easy to implement it yourself too: just make a constructor that takes a base array, an offset, and a length, then implement an indexer that offsets indexes behind the scenes, so your ArraySegment can be used seamlessly in place of an array.
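For instance, a minimal sketch of such a hand-rolled segment (the type name ByteSegment is mine, purely for illustration):

using System;

public struct ByteSegment
{
    private readonly byte[] m_array;
    private readonly int m_offset;
    private readonly int m_count;

    public ByteSegment(byte[] array, int offset, int count)
    {
        if (array == null) throw new ArgumentNullException("array");
        if (offset < 0 || count < 0 || offset + count > array.Length)
            throw new ArgumentOutOfRangeException();
        m_array = array;
        m_offset = offset;
        m_count = count;
    }

    public int Count { get { return m_count; } }

    // The indexer translates segment-relative indexes into indexes on the
    // backing array, so callers can treat the segment like an array of its own.
    public byte this[int index]
    {
        get
        {
            if (index < 0 || index >= m_count) throw new IndexOutOfRangeException();
            return m_array[m_offset + index];
        }
    }
}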

Stefan Monov
Oh, well there's a nice clean solution!
Mark
+1  A: 

Whether this is needed or not depends on whether you can afford to accumulate all the data from a file before processing it, or whether you need to provide a streaming mode where you process each chunk as it arrives. That depends on two things: how much data there is (you probably would not want to accumulate a multi-gigabyte file in memory), and how long the file takes to arrive completely (if you are getting the data over a slow link, you might not want your client to wait until it has all arrived).

So it is a reasonable feature to add, depending on how the library is to be used, and streaming mode is usually a desirable attribute, so I would vote for implementing it. However, the idea of putting the data into one array seems wrong, because it fundamentally implies a non-streaming design and requires an additional copy. What you could do instead is keep each chunk of arriving data as a discrete piece, stored in a container for which adding at the end and removing from the front is efficient, as sketched below.
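A minimal sketch of that idea, assuming a Queue<byte[]> as the container (Queue gives O(1) enqueue at the back and dequeue at the front; the class name is illustrative):

using System.Collections.Generic;

public class ChunkBuffer
{
    private readonly Queue<byte[]> m_chunks = new Queue<byte[]>();

    // Called by the downloader as each piece arrives.
    public void Add(byte[] chunk)
    {
        m_chunks.Enqueue(chunk);
    }

    // Called by the consumer; returns null when no chunk is waiting.
    public byte[] TakeNext()
    {
        return m_chunks.Count > 0 ? m_chunks.Dequeue() : null;
    }
}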

Permaquid
I'm building a wrapper for `HttpWebRequest` (`WebClient` doesn't do everything I want it to). As far as I know, the only asynchronous read operation I can perform on that (`BeginRead`) puts the data into a `byte[]` array (which for me is a nice enough format to work with). *Most* of the time I *do* know the final size of the array by looking at the `ContentLength` header, so I can allocate the memory appropriately. If the file is huge, however, I've designated an upper limit, at which point the data can be dumped to a file (I've implemented a 'buffer full' callback).
Mark
...anyway, I don't think writing it to a byte array is really inefficient; I'm not too sure it *can* be done more efficiently. Why would I want discrete, randomly sized chunks of data anyway? That would be so much harder to work with. Using `ArraySegment` as others have suggested, developers can handle each chunk as it arrives, or wait for the whole file and do whatever they want with the full array, rather than having to mesh it together afterwards.
Mark
+1  A: 

I agree with the OP: sometimes you just plain need to pay some attention to efficiency. I don't think providing an API is the best example, though, because that case certainly calls for leaning toward safety and simplicity over efficiency.

However, a simple example is processing large numbers of huge binary files with zillions of records in them, such as when writing a parser. Without a mechanism like System.ArraySegment, the parser becomes a big memory hog and is greatly slowed down by creating a zillion new data elements, copying all the memory over, and fragmenting the heck out of the heap. It's a very real performance issue. I write these kinds of parsers all the time for telecommunications work, where each of many switches generates millions of records per day in each of several categories, as variable-length binary structures that need to be parsed into databases.

Using the System.ArraySegment mechanism rather than creating a new structure copy for each record tremendously speeds up the parsing and greatly reduces the parser's peak memory consumption. These are very real advantages, because the servers run multiple parsers and run them frequently; speed and memory conservation translate into very real cost savings in not having to dedicate so many processors to the parsing.

System.ArraySegment is very easy to use. Here's a simple example that tracks the individual records in a typical big binary file, where each record has a fixed-length header and a variable-length body (obvious exception handling omitted):

using System;
using System.Collections.Generic;
using System.IO;

public struct MyRecord
{
    public ArraySegment<byte> header;
    public ArraySegment<byte> data;
}

public class Parser
{
    const int HEADER_SIZE = 10;
    const int HDR_OFS_REC_TYPE = 0;
    const int HDR_OFS_REC_LEN = 4;
    byte[] m_fileData;
    List<MyRecord> records = new List<MyRecord>();

    bool Parse(FileStream fs)
    {
        int fileLen = (int)fs.Length;
        m_fileData = new byte[fileLen];
        fs.Read(m_fileData, 0, fileLen); // a single Read is assumed to fill the buffer here
        fs.Close();
        fs.Dispose();
        int offset = 0;
        while (offset + HEADER_SIZE < fileLen)
        {
            int recType = (int)m_fileData[offset + HDR_OFS_REC_TYPE];
            switch (recType) { /* puke if not a recognized type */ }
            // the record length is a 16-bit big-endian field in the header
            int varDataLen = ((int)m_fileData[offset + HDR_OFS_REC_LEN]) * 256
                     + (int)m_fileData[offset + HDR_OFS_REC_LEN + 1];
            if (offset + HEADER_SIZE + varDataLen > fileLen) { /* puke as file has odd bytes at end */ }
            MyRecord rec = new MyRecord();
            rec.header = new ArraySegment<byte>(m_fileData, offset, HEADER_SIZE);
            rec.data = new ArraySegment<byte>(m_fileData, offset + HEADER_SIZE,
                          varDataLen);
            records.Add(rec);
            offset += HEADER_SIZE + varDataLen;
        }
        return true;
    }
}

The above example gives you a list with an ArraySegment for each record in the file, while leaving all the actual data in place in one big array per file. The only overhead is the two array segments in each MyRecord struct. When processing the records, each segment's Array, Offset, and Count properties let you operate on the elements of a record as if it were its own byte[] copy, without actually making one.
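For example, a small helper (hypothetical, just to show the Offset arithmetic) that reads a byte out of a record's data segment without copying:

static byte GetDataByte(MyRecord rec, int index)
{
    // Index into the shared backing array, shifted by the segment's start.
    return rec.data.Array[rec.data.Offset + index];
}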

Christo