I have some URL that contains special characters. For example:

http://www.example.com/bléèàû.html

If you type this URL into a browser, my web server shows the correct page (it can handle special characters).

I have looked at the sitemaps spec, and it's not clear whether a sitemap file can contain special characters. From what I understand of the protocol, if the URL works, the server serves the correct page, and the XML file is UTF-8 encoded, then it's OK.

For example, I believe this is a valid sitemap entry:

   <url>
      <loc>http://www.example.com/bléèàû.html</loc>
      <changefreq>weekly</changefreq>
   </url>

Can anyone confirm this?

[Update] The reason I'm reluctant to encode the special characters is that I don't want to introduce duplicate URLs for the same content. For example:

http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html

and

http://www.example.com/bléèàû.html

would serve the same page. I presume Google would pick up both URLs, one through its normal indexing and one through the sitemap. Unfortunately, Google has a tendency to downgrade the page rank of sites that have duplicate URLs pointing to the same page.

A: 

The spec says that sitemaps must be UTF-8 encoded. If the URL is correct and is used on your site, then yes, include it in your sitemap.

Cal Jacobson
+1  A: 

While it's true that a sitemap must be encoded in UTF-8, that doesn't mean that the URLs contained therein have all of UTF-8 available to them. URLs are supposed to be a subset of US-ASCII, as specified in RFC 1738. I don't believe your example URLs are valid according to that reference; they should instead be URL-encoded (escaped).

Joe Holloway
+2  A: 

The sitemaps specification doesn't say. It shows examples of URLs in various escaped forms but does not definitively say whether the first example (raw characters) is allowable. It only calls them ‘URL’s, with no reference to a particular definition of ‘URL’ or RFC which would clarify whether they mean old-school ASCII URIs, or IRIs (which may contain non-ASCII characters).

So it would be safest to %-escape the UTF-8 encoding of the URL. The link will then work globally, and all modern browsers should present it to the user with the original Unicode characters.

<loc>http://www.example.com/bl%C3%A9%C3%A8%C3%A0%C3%BB.html</loc>
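
If you generate the sitemap from a script, the escaping could be done along these lines. This is only a minimal Python sketch: the helper name `escape_url_for_sitemap` is made up for illustration, it assumes the host name is plain ASCII, and it leaves any `%` already in the path alone so that an already-escaped URL is not escaped twice (a literal `%` in a file name would need separate handling).

```python
from urllib.parse import quote, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def escape_url_for_sitemap(url: str) -> str:
    """Percent-escape the UTF-8 bytes of the path of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes as UTF-8 by default; "/" and "%" are kept as-is so that
    # path separators and existing %XX escapes are left untouched.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


url = escape_url_for_sitemap("http://www.example.com/bléèàû.html")
# escape() handles XML-special characters such as "&" before the URL goes into <loc>.
print(f"<loc>{escape(url)}</loc>")
```

Because the function leaves existing escapes alone, running it on its own output returns the same string, so every page ends up with a single escaped form in the sitemap.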
bobince
A: 

For those users who can't easily enter special characters **, I suggest you also accept the closest ASCII approximation on the server side (e.g. alias or forward /bleeau.html to /bléèàû.html).

** For example cell phone browsers with limited keyboards and persons with different keyboard language layouts.
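
One possible way to compute that approximation, sketched in Python. The function name `ascii_approximation` is made up for illustration, and the approach assumes the accented letters have a Unicode decomposition into a base letter plus combining marks; letters such as ø or ß do not, and would pass through unchanged.

```python
import unicodedata


def ascii_approximation(path: str) -> str:
    """Strip accents from `path` by dropping combining marks (hypothetical helper)."""
    # NFKD splits a letter like "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", path)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_approximation("/bléèàû.html"))  # /bleeau.html
```

The server could then map the stripped path back to the real one with an alias or a redirect.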

Chris Nava