ansaurus

Question

How can I convert CGI input to UTF-8 without Perl's Encode module?

Answer 1

+6 A:

Don't pre-optimize. Do it the conventional way first then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and hadcuffed doesn't give you any benefit.

Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.

brian d foy 2010-09-19 14:33:14

What's wrong with pre-optimizing, when you know exactly what you are going to need? Why go through the trouble of benchmarking if you don't have to (because you have shaved off all unnecessary logic)?Of course, you have a point regarding mod_perl, and I generally acknowledge the fact that your knowledge of Perl is about 1000 times greater than mine. So I am certainly taking your point of view into consideration and look forward to hearing other peoples' point of view.

W3Coder 2010-09-19 19:32:41

Well, I don't think you know exactly what you need. It doesn't sound like you know what you are doing.

brian d foy 2010-09-19 19:36:39

Very constructive comment. If I knew what to do, would I be asking questions? You see, I believe that is the purpose of this site - not the promotion of commercial books by their authors.

W3Coder 2010-09-20 07:40:06

Oh no you didn't.

AmbroseChapel 2010-09-21 12:28:02

The purpose of this site is to help people who genuinely want help. You don't appear to want real help. Instead, you're looking for validation for your pre-conceived ideas. I don't feel bad about promoting my books. I don't feel bad about promoting other people's books. "The more that you read, the more things you will know. The more that you learn, the more places you'll go." That's why we write books.

brian d foy 2010-09-22 13:56:56

Answer 2

+1 A:

Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.

You should typically use encodeURIComponent() instead.

If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).

To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).

bobince 2010-09-21 09:59:07

Your input is much appreciated. I was aware of the shortcomings of escape (), binary/multipart posting, etc., but the RegEx you are linking to seems very useful. Whether my approach to decoding UTF-8 makes sense or not, time will show, but your answer is definitely helpful, thanks very much!

W3Coder 2010-09-21 13:24:25

ansaurus

tags:

views:

answers:

How can I convert CGI input to UTF-8 without Perl's Encode module?

related questions