I'm reading in URLs, and they often contain percent-encoded characters.

Example: %C3%A9 is actually é

According to http://www.microsystools.com/products/sitemap-generator/faq/character-percentage-url-encoding/ , characters in the upper half of 8-bit ASCII (128-255) are encoded as UTF-8, and then their bytes are written as hex. When I get my URL, the %HEX sequences have been re-encoded as 8-bit ASCII, and I need to convert them back to their true 8-bit ASCII characters. Is there a function or library I can use? If not, how would I go about the conversion?

I'm using C/C++.

A: 

And what's your question? And what programming language do you use?

archimed7592
+1  A: 

First you need to URL-decode. There is no function for this in cross-platform C++, but, luckily for you, it is not a hard problem. Copy bytes from source to target. Non-% bytes just get copied. When you hit %XX, convert XX from hex characters to binary, and you have your byte.
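The copy loop described above could look roughly like this. It is a minimal sketch, not a complete URL parser: the names `url_decode` and `hex_value` are made up for illustration, and it assumes that a malformed escape (a `%` not followed by two hex digits) should simply be copied through unchanged.

```cpp
#include <cstddef>
#include <string>

// Returns the numeric value (0-15) of a hex digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: turns every %XX escape into the single byte it names.
// Bytes that are not part of a valid escape are copied as-is (an assumption).
std::string url_decode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>((hi << 4) | lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Calling `url_decode("caf%C3%A9")` should give the five bytes `c a f 0xC3 0xA9`, which is "café" in UTF-8.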

This gives you a buffer of text in UTF-8. You say you want 'ASCII', which is ISO-646, but ISO-646 has no accented e. I can think of several possibilities for what you really want:

  1. ISO-8859-1. You can use ICU to convert UTF-8 to ISO-8859-1.
  2. ISO-646. You can also use ICU, and I believe it can turn accented characters into their closest ISO-646 equivalents.
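If pulling in ICU is too heavy and option 1 is all you need, the UTF-8 to ISO-8859-1 step can also be done by hand, since ISO-8859-1 only covers code points U+0000 to U+00FF. The sketch below is a hand-rolled substitute, not ICU: the name `utf8_to_latin1` is hypothetical, and replacing anything unrepresentable or malformed with `?` is an assumption. It does not attempt the transliteration from option 2.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO-8859-1 (Latin-1).
// Anything above U+00FF, or any malformed sequence, becomes '?' (an assumption).
std::string utf8_to_latin1(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) {  // plain ASCII: same byte in both encodings
            out.push_back(static_cast<char>(b));
            ++i;
            continue;
        }
        // U+0080..U+00FF are encoded as lead byte 0xC2 or 0xC3 plus one continuation byte.
        if ((b == 0xC2 || b == 0xC3) && i + 1 < utf8.size()) {
            unsigned char c = static_cast<unsigned char>(utf8[i + 1]);
            if ((c & 0xC0) == 0x80) {
                out.push_back(static_cast<char>(((b & 0x1F) << 6) | (c & 0x3F)));
                i += 2;
                continue;
            }
        }
        // Not representable in Latin-1 (or malformed): emit one '?' and
        // skip the lead byte together with any continuation bytes after it.
        out.push_back('?');
        ++i;
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
            ++i;
        }
    }
    return out;
}
```

For the example in the question, the UTF-8 bytes `0xC3 0xA9` come out as the single byte `0xE9`, which is é in ISO-8859-1.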
bmargulies