Let me start off by saying that I'm using the twisted.web framework. twisted.web's file uploading didn't work the way I wanted (it only included the file data, not any other information). cgi.parse_multipart doesn't work the way I want either (same problem; twisted.web uses this function). cgi.FieldStorage didn't work because I'm getting the POST data through Twisted, not a CGI interface; as far as I can tell, FieldStorage tries to read the request from stdin. And twisted.web2 didn't work for me because its use of Deferred confused and infuriated me (too complicated for what I want).

That being said, I decided to try and just parse the HTTP request myself.

Using Chrome, the HTTP request is formed like this:

------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="upload_file_nonce"

11b03b61-9252-11df-a357-00266c608adb
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename="login.html"
Content-Type: text/html

<!DOCTYPE html>
<html>
  <head> 

...

------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename=""


------WebKitFormBoundary7fouZ8mEjlCe92pq--

Is this always how it will be formed? I'm parsing it with regular expressions, like so (pardon the wall of code):

Note: I snipped out most of the code to show only what I thought was relevant (the regular expressions). This is from the __init__ method (the only method so far) of an Uploads class I built. The full code can be seen in the revision history.

# Fragment from the per-line loop inside Uploads.__init__ (requires "import re").
if line == "--{0}--".format(boundary):
    finished = True

# A blank line ends the headers of the current part.
if in_header and not line:
    in_header = False
    # No Content-Type header means this part is not a file upload.
    if 'type' not in current_file:
        ignore_current_file = True

if in_header:
    m = re.match(
        r'Content-Disposition: form-data; name="(.*?)"; filename="(.*?)"$', line)
    if m:
        input_name, current_file['filename'] = m.group(1), m.group(2)

    m = re.match(r"Content-Type: (.*)$", line)
    if m:
        current_file['type'] = m.group(1)
else:
    # Past the headers: accumulate the body of the current part.
    if 'data' not in current_file:
        current_file['data'] = line
    else:
        current_file['data'] += line

You can see that I start a new "file" dict whenever a boundary is reached, and I set in_header to True to say that I'm parsing headers. When I reach a blank line, I switch it to False, but not before checking whether a Content-Type was set for that form value. If it wasn't, I set ignore_current_file, since I'm only looking for file uploads.
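For reference, the loop described above could be written out as a self-contained function, roughly like the sketch below. The name parse_uploads and the dict keys are made up for illustration (they are not from the Uploads class), and it assumes the whole body is available as text with CRLF line endings, so it would not be safe for binary uploads as written.

```python
import re


def parse_uploads(body, boundary):
    """Return a list of dicts for the file parts of a multipart/form-data body.

    Hypothetical helper: parts without a Content-Type header are skipped,
    mirroring the "only file uploads" rule described in the question.
    """
    delimiter = "--" + boundary
    closing = delimiter + "--"
    files = []
    current = None       # dict for the part being parsed, or None before the first boundary
    in_header = False

    for line in body.split("\r\n"):
        if line in (delimiter, closing):
            # A boundary ends the previous part; keep it only if it was a file.
            if current is not None and "type" in current:
                current["data"] = "\r\n".join(current["data"])
                files.append(current)
            if line == closing:
                break
            current = {"data": []}
            in_header = True
        elif current is None:
            continue  # ignore anything before the first boundary
        elif in_header:
            if not line:
                in_header = False  # blank line: headers are done
                continue
            m = re.match(
                r'Content-Disposition: form-data; name="(.*?)"; filename="(.*?)"$', line)
            if m:
                current["name"], current["filename"] = m.group(1), m.group(2)
            m = re.match(r"Content-Type: (.*)$", line)
            if m:
                current["type"] = m.group(1)
        else:
            current["data"].append(line)
    return files
```

Called with the request body and the boundary string taken from the request's Content-Type header, it returns one dict per uploaded file with name, filename, type and data keys.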

I know I should be using a library, but I'm sick to death of reading documentation, trying to get different solutions to work in my project, and still having the code look reasonable. I just want to get past this part -- and if parsing an HTTP POST with file uploads is this simple, then I shall stick with that.

Note: this code works perfectly for now, I'm just wondering if it will choke on/spit out requests from certain browsers.

+1  A: 

The content-disposition header has no defined order for fields, plus it may contain more fields than just the filename. So your match for filename may fail - there may not even be a filename!
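One way to avoid depending on field order is to let a header parser split the parameters instead of matching them positionally. The sketch below uses Python 3's standard email.message.Message for that; the helper name disposition_params is an assumption for illustration, and this is only one possible approach, not necessarily what any web framework does internally.

```python
from email.message import Message


def disposition_params(header_value):
    """Split a Content-Disposition value into (disposition, params dict).

    Hypothetical helper: the order of the parameters does not matter,
    and a missing filename simply means the key is absent.
    """
    msg = Message()
    msg["Content-Disposition"] = header_value
    disposition = msg.get_content_disposition()
    # get_params() returns (name, value) pairs; the first pair is the
    # disposition itself, so it is skipped here.
    params = dict(msg.get_params(header="content-disposition")[1:])
    return disposition, params
```

With this, params.get("filename") returns None for plain form fields, so the "no filename at all" case falls out without a second regexp.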

See RFC 2183 (edit: that's for mail; see RFC 1806, RFC 2616 and maybe more for HTTP).

Also, in these kinds of regexps I would suggest replacing every space with \s* and not relying on character case.
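As a rough illustration of that suggestion, the two patterns from the question could be loosened as below. The constant names FILENAME_RE and CONTENT_TYPE_RE are made up here, and the patterns are a sketch rather than a complete header grammar (for example, they do not handle escaped quotes inside a filename).

```python
import re

# Whitespace around ":", ";" and "=" is optional, header and parameter names
# are matched case-insensitively, and filename may come after other fields.
FILENAME_RE = re.compile(
    r'^Content-Disposition\s*:\s*form-data\b.*?;\s*filename\s*=\s*"(.*?)"',
    re.IGNORECASE)

# Captures the header value with leading and trailing whitespace dropped.
CONTENT_TYPE_RE = re.compile(r'^Content-Type\s*:\s*(.*?)\s*$', re.IGNORECASE)
```

A line with no filename field simply fails to match FILENAME_RE, which the calling code has to treat as "not a file" rather than as an error.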

mvds
These all specify mail or attachment, but I'm primarily looking for the `form-data` Content-Disposition. They look the same, though, so I'll probably look at all of the examples.
Carson Myers
Whatever the exact RFC, I think relying on the presence or absence and exact order of fields isn't very wise. Even if `filename` is the only field right now, more may be added in the future and your code will break.
mvds
+1  A: 

You're trying to avoid reading documentation, but I think the best advice is to actually read:

to make sure you don't miss any cases. An easier route might be to use the poster library.

ars
It's not that I'm trying to _avoid_ documentation, it's that I either can't find enough of it, or it leads to a dead end (doesn't do what I want). I will certainly give those a read, though.
Carson Myers
Got it, I misread that bit. Definitely read the RFCs since they're the official word on these things, but you'd be right to question whether actual browser implementations in the wild have dark corners.
ars
Also, it seems that poster is only for creating the HTTP request -- I can't see anything in that documentation about decoding. Can it decode as well?
Carson Myers
I meant it would make it easier to test your own implementation (versus thinking about what cases you might be missing ...). Not sure about server-side libraries, but there's probably something in the WSGI/Pylons code base.
ars
True enough. Thanks for the help
Carson Myers