I've created an HttpModule in ASP.NET to allow users to upload large files. I found some sample code online that I was able to adapt for my needs. I grab the file if it is a multi-part message and then I chunk the bytes and write them to disk.

The problem is that the file is always corrupt. After some research, it turns out that HTTP headers or multipart message tags are prepended to the first bytes I receive. I can't figure out how to strip those bytes so that I get only the file.

Junk data such as this is prepended to the top of the file:

-----------------------8cbb435d6837a3f
Content-Disposition: form-data; name="file"; filename="test.txt"
Content-Type: application/octet-stream

This header information corrupts the file I am receiving, so I need to get rid of it before I write the bytes.

Here is the code I wrote to handle the upload:

public class FileUploadManager : IHttpModule
{
    public int BUFFER_SIZE = 1024;

    protected void app_BeginRequest(object sender, EventArgs e)
    {
        // get the context we are working under
        HttpContext context = ((HttpApplication)sender).Context;

        // make sure this is multi-part data
        if (context.Request.ContentType.IndexOf("multipart/form-data") == -1)
        {
            return;
        }

        IServiceProvider provider = (IServiceProvider)context;
        HttpWorkerRequest wr = 
        (HttpWorkerRequest)provider.GetService(typeof(HttpWorkerRequest));

        // only process this file if it has a body and is not already preloaded
        if (wr.HasEntityBody() && !wr.IsEntireEntityBodyIsPreloaded())
        {
            // get the total length of the body
            int iRequestLength = wr.GetTotalEntityBodyLength();

            // get the initial bytes loaded
            int iReceivedBytes = wr.GetPreloadedEntityBodyLength();

            // open file stream to write bytes to
            using (System.IO.FileStream fs = 
            new System.IO.FileStream(
               @"C:\tempfiles\test.txt", 
               System.IO.FileMode.CreateNew))
            {
                // *** NOTE: This is where I think I need to filter the bytes 
                // received to get rid of the junk data but I am unsure how to 
                // do this?

                int bytesRead = BUFFER_SIZE;
                // Create an input buffer to store the incoming data
                byte[] byteBuffer = new byte[BUFFER_SIZE];
                while ((iRequestLength - iReceivedBytes) >= bytesRead)
                {
                    // read the next chunk of the file
                    bytesRead = wr.ReadEntityBody(byteBuffer, byteBuffer.Length);
                    fs.Write(byteBuffer, 0, byteBuffer.Length);
                    iReceivedBytes += bytesRead;

                    // write bytes so far of file to disk
                    fs.Flush();
                }
            }
        }
    }
}

How can I detect and strip this header information so that only the file's bytes remain?

A: 

What you're running into is the boundary used to separate the various parts of the HTTP request. There should be a header at the beginning of the request called Content-Type, and within that header there's a boundary parameter like so:

Content-Type: multipart/mixed;boundary=gc0p4Jq0M2Yt08jU534c0p

Once you find this boundary, simply split your request on the boundary with two hyphens (--) prepended to it. In other words, split your content on:

"--"+Headers.Get("Content-Type").Split("boundary=")[1]

Sorta pseudo-code there, but it should get the point across. This should divide the multipart form data into the appropriate sections.

For more info, see RFC 1341.

It's worth noting that the final boundary also has two hyphens appended to it, so the last delimiter in the body looks like --boundary-- rather than --boundary.
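
The boundary handling described above can be sketched in C# roughly as follows. This is only an illustration: the class and method names (MultipartBoundary, GetBoundary, PartDelimiter, ClosingDelimiter) are made up for this example, and it assumes the boundary parameter may be quoted and may sit next to other parameters in the header value.

```csharp
using System;

public static class MultipartBoundary
{
    // Returns the raw boundary token from a Content-Type header value,
    // or null if no boundary parameter is present.
    public static string GetBoundary(string contentType)
    {
        if (contentType == null) return null;

        // Parameters are separated by ';', e.g. "multipart/mixed;boundary=abc"
        foreach (string piece in contentType.Split(';'))
        {
            string p = piece.Trim();
            if (p.StartsWith("boundary=", StringComparison.OrdinalIgnoreCase))
            {
                // Use Substring rather than Split('=') because '=' is a legal boundary character
                return p.Substring("boundary=".Length).Trim('"');
            }
        }
        return null;
    }

    // Delimiter that introduces each part of the body
    public static string PartDelimiter(string boundary)
    {
        return "--" + boundary;
    }

    // Delimiter that terminates the whole multipart body
    public static string ClosingDelimiter(string boundary)
    {
        return "--" + boundary + "--";
    }
}
```

In the module from the question, the input would presumably be context.Request.ContentType. Note that the junk line shown in the question is already the part delimiter, that is, the boundary with the two extra hyphens in front.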

EDIT: Okay, so the problem you're running into is that you're not breaking the form data into the necessary components. The sections of a multipart/form-data request can each individually be treated as separate requests (meaning they can contain headers). What you should probably do is read the bytes into a string:

string formData = Encoding.ASCII.GetString(byteBuffer);

split into multiple strings based on the boundary:

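// each delimiter inside the body is CRLF + "--" + the boundary from the Content-Type header
string boundary = "\r\n--" + context.Request.ContentType.Split(new[] { "boundary=" }, StringSplitOptions.None)[1];
// escape the boundary so its characters are not interpreted as a regex pattern
string[] parts = Regex.Split( formData, Regex.Escape( boundary ) );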

loop through each string, separating headers from content. Since you actually want the byte values of the content, keep track of the data offset and copy from the original buffer: converting from ASCII back to bytes is lossy, because Encoding.ASCII replaces every byte above 0x7F with '?':

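int dataOffset = 0; // offset of the current part within byteBuffer
for( int i=0; i < parts.Length; i++ ){
    string part = parts[i];
    int headerEnd = part.IndexOf( "\r\n\r\n" );

    // skip pieces without a header block (e.g. the "--" after the final boundary)
    if( headerEnd >= 0 ){
        string header = part.Substring( 0, headerEnd );
        int bodyStart = dataOffset + headerEnd + 4;
        byte[] body = new byte[ part.Length - headerEnd - 4 ];

        // ASCII decoding yields one char per byte, so string offsets match byte offsets
        Array.Copy( byteBuffer, bodyStart, body, 0, body.Length );

        // body now contains your binary data
    }

    // advance past this part and the boundary that followed it
    dataOffset += part.Length + boundary.Length;
}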

NOTE: This is untested, so it may require some tweaking.
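
If you'd rather avoid the string round trip entirely, the same idea can be sketched directly on the bytes. This is a rough example rather than a drop-in fix: the MultipartBytes class and its methods are hypothetical names, it assumes the whole request body has already been collected into a single byte array, and it only extracts the first part.

```csharp
using System;
using System.Text;

public static class MultipartBytes
{
    // Returns the index of the first occurrence of needle in haystack
    // at or after start, or -1 if it does not occur.
    public static int IndexOf(byte[] haystack, byte[] needle, int start)
    {
        for (int i = start; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }

    // Extracts the content of the first part of a fully buffered multipart body.
    // "boundary" is the raw token from the Content-Type header (no leading hyphens).
    public static byte[] FirstPartBody(byte[] body, string boundary)
    {
        byte[] headerEnd = Encoding.ASCII.GetBytes("\r\n\r\n");
        byte[] delimiter = Encoding.ASCII.GetBytes("\r\n--" + boundary);

        // The part headers end at the first blank line
        int headerStop = IndexOf(body, headerEnd, 0);
        if (headerStop < 0) return null;
        int dataStart = headerStop + headerEnd.Length;

        // The content runs up to the CRLF that precedes the next delimiter
        int dataEnd = IndexOf(body, delimiter, dataStart);
        if (dataEnd < 0) return null;

        byte[] data = new byte[dataEnd - dataStart];
        Array.Copy(body, dataStart, data, 0, data.Length);
        return data;
    }
}
```

When reading in small chunks, as in the question's loop, keep in mind that a delimiter can straddle two chunks, so a streaming version would need to hold back the last few bytes of each chunk before writing them to disk.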

cmptrgeekken
Yes, that is exactly what I am talking about. I am not sure of the context of your Headers.Get code. I am reading in bytes from the request, but Headers.Get seems to return a string. I am dealing with files that could be pure binary, such as images, so how can I use the string header information to remove it? In other words, in my loop that reads in the bytes, how do I ignore the bytes that belong to these header lines? I'm not making the correlation between your pseudo-code that returns a string and the byte array I am filling.
dr.ess
Answer updated. Let me know if it's what you're looking for.
cmptrgeekken