Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:

read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;

A safer way (which for example does not allow bogus characters through) is to do the following:

use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
$_ = decode ('UTF-8', $_, Encode::FB_CROAK);
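
Since FB_CROAK makes decode die on malformed input, a request handler would normally want to trap that failure rather than let it propagate. The following is only a minimal sketch of one way to do so; the helper name decode_strict is hypothetical and not part of Encode. Note that decode returns the decoded string, so the result has to be captured, and that it may alter its source argument when a CHECK value is given, which is why the helper works on a copy.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string,
# or undef if the octets are not valid UTF-8.
sub decode_strict {
    my ($octets) = @_;    # copy; decode() may modify its argument when CHECK is set
    my $text = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK) };
    return $text;         # undef when decode() croaked inside the eval
}
```

A caller could then reject the request whenever decode_strict returns undef.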

I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).

I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this, or knows a similar, safe “native” way of doing the UTF-8 conversion?

UPDATE:

I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).

Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:

my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split (/&/)) {
        tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die ("bad input ($_)"); }
    }
}

In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.

+6  A: 

Don't pre-optimize. Do it the conventional way first, then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and handcuffed doesn't give you any benefit.

Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.
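
As a rough illustration of loading a module only once: under mod_perl a startup file is typically pulled in by the parent server process (for example through a PerlRequire directive), so anything it loads is compiled before the children are forked. The sketch below assumes such a file; the name startup.pl is only a convention, not something taken from this thread.

```perl
# startup.pl (hypothetical file name), loaded once by the parent server process
use strict;
use warnings;

use Encode ();   # compile Encode once, before the server forks its children

1;               # a required file must return a true value
```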

brian d foy
What's wrong with pre-optimizing, when you know exactly what you are going to need? Why go through the trouble of benchmarking if you don't have to (because you have shaved off all unnecessary logic)? Of course, you have a point regarding mod_perl, and I generally acknowledge the fact that your knowledge of Perl is about 1000 times greater than mine. So I am certainly taking your point of view into consideration and look forward to hearing other people's points of view.
W3Coder
Well, I don't think you know exactly what you need. It doesn't sound like you know what you are doing.
brian d foy
Very constructive comment. If I knew what to do, would I be asking questions? You see, I believe that is the purpose of this site - not the promotion of commercial books by their authors.
W3Coder
Oh no you didn't.
AmbroseChapel
The purpose of this site is to help people who genuinely want help. You don't appear to want real help. Instead, you're looking for validation for your pre-conceived ideas. I don't feel bad about promoting my books. I don't feel bad about promoting other people's books. "The more that you read, the more things you will know. The more that you learn, the more places you'll go." That's why we write books.
brian d foy
+1  A: 

Don't use escape() to create your posted data. It isn't compatible with URL-encoding; it's a mutant JavaScript oddity which should normally never be used. One of its defects is that it encodes non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.

You should typically use encodeURIComponent() instead.

If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
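
The ordering described above can be sketched as follows. The helper name url_decode is hypothetical, and it is meant to be applied to a single key or value after the body has been split on & and =. The + replacement has to come first, because a literal plus sign arrives as %2B and must survive as a plus.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one application/x-www-form-urlencoded token
# (a key or a value) into raw octets.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                               # '+' means space in form data; do this first
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # then expand each %XX into a single byte
    return $s;
}
```

The result is still a byte string, not decoded text; a %uNNNN sequence produced by escape() is simply left untouched by this pattern.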

To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).
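
For readers without the link, a byte-level check of this kind can look roughly like the sketch below. It follows a widely circulated pattern for validating UTF-8 octets; whether it matches the linked regex exactly is an assumption, and the helper name is_valid_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $octets is a well-formed UTF-8 byte string
# containing no control characters other than tab, LF and CR; otherwise 0.
sub is_valid_utf8 {
    my ($octets) = @_;
    return $octets =~ m/\A(?:
          [\x09\x0A\x0D\x20-\x7E]             # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]              # 2-byte, non-overlong
        | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3-byte, straight
        | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}           # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4-byte, plane 16
    )*\z/x ? 1 : 0;
}
```

Once a byte string has passed this check, the built-in utf8::decode can turn it into a character string without loading any module.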

bobince
Your input is much appreciated. I was aware of the shortcomings of escape (), binary/multipart posting, etc., but the RegEx you are linking to seems very useful. Whether my approach to decoding UTF-8 makes sense or not, time will show, but your answer is definitely helpful, thanks very much!
W3Coder