Hi,
I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.
In the sitemaps.org (entity escaping) examples, they give this URL:
http://www.example.com/ümlat.php&q=name
Which, once percent-encoded as UTF-8, ends up as (according to them):
http://www.example.com/%C3%BCmlat.php&q=name
However, when I try this with rawurlencode() in PHP, I end up with:
http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname
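For anyone who wants to reproduce this, here is a minimal sketch. It assumes the URL is held as UTF-8 bytes (the "ü" is written as `\xC3\xBC` so the result does not depend on the source file's encoding), and the variable names are only illustrative.

```php
<?php
// The example URL from sitemaps.org; "\xC3\xBC" is the UTF-8 byte sequence for "ü".
$url = "http://www.example.com/\xC3\xBCmlat.php&q=name";

// rawurlencode() percent-encodes every byte except letters, digits and - _ . ~
// so the URL delimiters (":", "/", "&", "=") are encoded along with the "ü".
$encoded = rawurlencode($url);

echo $encoded, "\n";
```

As far as I can tell, the delimiters get encoded because rawurlencode() is meant for a single URL component rather than a whole URL.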
I've sort of worked around this by using this function found on PHP.net:
// Percent-encode everything, then restore the RFC 3986 reserved characters.
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
    '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+',
    '$', ',', '/', '?', '#', '[', ']');
$string = str_replace($entities, $replacements, rawurlencode($string));
However, according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused that I don't even know what's right.
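To make the workaround concrete, here is a hedged sketch that wraps the snippet above in a function and runs it on the sitemaps.org example. The function name `encode_keep_reserved` is made up for illustration, and the final `htmlspecialchars()` call is my assumption about how to perform the entity-escaping step that the sitemaps.org page describes; it is not part of the PHP.net snippet.

```php
<?php
// Hypothetical helper: percent-encode a string, then turn the encoded
// RFC 3986 reserved characters back into their literal form.
function encode_keep_reserved($string)
{
    $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
        '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
    $replacements = array('!', '*', "'", '(', ')', ';', ':', '@', '&', '=', '+',
        '$', ',', '/', '?', '#', '[', ']');
    return str_replace($entities, $replacements, rawurlencode($string));
}

// "\xC3\xBC" is the UTF-8 byte sequence for "ü".
$url = "http://www.example.com/\xC3\xBCmlat.php&q=name";

$percentEncoded = encode_keep_reserved($url);

// Assumed second step: escape XML special characters before writing <loc>.
$xmlEscaped = htmlspecialchars($percentEncoded, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $percentEncoded, "\n";
echo $xmlEscaped, "\n";
```

On this input the first line matches the sitemaps.org result, and the second has the `&` replaced by `&amp;`. One caveat: a `%` already present in the input is encoded again as `%25`, so this sketch should only be fed URLs that are not yet percent-encoded.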
What's the correct way to encode a URL for use in the sitemap?
Relevant spec: RFC 3986