views: 481
answers: 2

I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

The solution seems to be to:

  • Convert the character string into a sequence of bytes using the UTF-8 encoding
  • Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte (see the sketch below)
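
As a concrete illustration of those two steps, here is a minimal sketch in Python (the tag value is just an example, not content from this site):

    from urllib.parse import quote

    tag = "München"  # hypothetical non-ASCII tag name

    # quote() does essentially the two steps above: it encodes the string as
    # UTF-8 and replaces every byte outside the unreserved set
    # (letters, digits, -._~) with %HH.
    slug = quote(tag, safe="")

    print(slug)                         # M%C3%BCnchen
    print("site.com/tags/id/" + slug)   # site.com/tags/id/M%C3%BCnchen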

However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering whether Google is still smart enough to handle searches in URLs that contain encoded data - or whether I should attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages).

A: 

Do you know what language everything will be in? Is it all Latin-based?

If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, it just uses %HH.
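
A hedged sketch of that idea in Python, using NFKD normalization as a stand-in for a hand-built lookup table and %HH as the fallback (the function name and example tags are mine, not from the thread):

    import unicodedata
    from urllib.parse import quote

    def slugify(tag):
        """Transliterate to ASCII where possible, otherwise fall back to %HH."""
        # NFKD splits accented letters into base letter + combining mark,
        # so "Ź" becomes "Z" plus U+0301; dropping the marks leaves plain ASCII.
        decomposed = unicodedata.normalize("NFKD", tag)
        ascii_form = decomposed.encode("ascii", "ignore").decode("ascii")
        if ascii_form.strip():
            return quote(ascii_form, safe="")
        # Nothing usable survived (e.g. Armenian, CJK): percent-encode instead.
        return quote(tag, safe="")

    print(slugify("Zürich"))    # Zurich
    print(slugify("Գլխավոր"))   # %D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80

A hand-built table gives finer control over collisions and over letters like ł that NFKD cannot decompose, which is roughly what the WordPress accent table mentioned in the comments does.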

Earlz
Well, I have borrowed an accent-converter table (Ź into Z), which you can find in the WordPress code base. But I don't know what you mean by `%HH`.
Xeoncross
`Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte`
Earlz
How do you convert each byte to hexadecimal?
Xeoncross
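
To answer the byte-to-hexadecimal question literally: in most languages it is a one-line format operation. A minimal sketch in Python (the unreserved set follows RFC 3986; the function name is mine):

    # Unreserved characters per RFC 3986; every other byte becomes %HH.
    UNRESERVED = set(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )

    def percent_encode(text):
        out = []
        for byte in text.encode("utf-8"):           # step 1: UTF-8 bytes
            if chr(byte) in UNRESERVED:
                out.append(chr(byte))
            else:
                out.append("%{:02X}".format(byte))  # step 2: 0xC5 -> "%C5"
        return "".join(out)

    print(percent_encode("Ź"))   # %C5%B9
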
+3  A: 

Firstly, search engines really don't care about the URLs themselves. URLs help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam them, and no major search engine wants that. The allinurl: operator is merely a Google feature to help advanced users, not something that gets factored into organic rankings. Any benefit you get from a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site -- and there is some evidence this can be negative with the advent of negative PR too.

From Google Webmaster Central

Does that mean I should avoid rewriting dynamic URLs at all?

That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.

I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As far as Unicode goes, you don't understand how this works: the request goes to the hex-encoded destination, but the rendering engine must know how to handle the encoding if it wishes to decode it back into something visually appealing. Google will render (i.e. decode) encoded Unicode URLs properly.

Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
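
The hostname uses a different mechanism from the %HH path encoding: internationalized domain names are converted to an ASCII "punycode" form. A small Python illustration, with a made-up domain:

    # The stdlib idna codec implements IDNA 2003; stricter IDNA 2008 rules
    # live in the third-party "idna" package.
    host = "bücher.example"
    print(host.encode("idna"))   # b'xn--bcher-kva.example'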

I wanted to show you an example of the path encoding; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:

Hypertext Transfer Protocol
    GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
        [Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
            [Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
        Request Version: HTTP/1.0
    User-Agent: Wget/1.11.4\r\n
    Accept: */*\r\n
    Host: hy.wikipedia.org\r\n
    Connection: Keep-Alive\r\n
    \r\n

As you can see, wget, just like a browser, will URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.

Evan Carroll
Xeoncross
No, it isn't better: it's simply the same. `/$id` would make it slightly more difficult on users. All URLs must be encoded per RFC 3986 before you can make the request; the fact that your browser has the ability to encode the link you give it is just a nicety. Technically, if the server does it, you open yourself up to the almost non-existent market that doesn't have the ability to decode/encode Unicode links. Wikipedia also does this (the Unicode representation is the anchor text; the link itself is encoded). Per the spec, this is the way it is supposed to be.
Evan Carroll
So then what should I do? When I am creating a link that contains a UTF-8 string like `<a href="site.com/tags/id/non-ascii-tag">non-ascii-tag</a>`, should I just trust the browser to encode the URI - or should I run it through some kind of encoder function so the browser doesn't have to?
Xeoncross
Run it through the encoder; the browser will decode the path and query portions of the URL for visual appeal, but either way it almost certainly doesn't matter to a search engine. I imagine they normalize all Unicode URLs by encoding them anyway.
Evan Carroll
Run it through what encoder?
Xeoncross
A URL-encoder, of course.
Evan Carroll
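
For what it's worth, in Python that encoder would be urllib.parse.quote, and unquote shows the readable form a browser may display after decoding the path (the tag value is only an example):

    from urllib.parse import quote, unquote

    tag = "São Paulo"                                       # hypothetical tag
    href = "http://site.com/tags/id/" + quote(tag, safe="")

    # What goes into the markup: the %HH-encoded link.
    print('<a href="{0}">{1}</a>'.format(href, tag))
    # <a href="http://site.com/tags/id/S%C3%A3o%20Paulo">São Paulo</a>

    # What a browser may show in the address bar after decoding for readability.
    print(unquote(href))
    # http://site.com/tags/id/São Paulo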