views: 481
answers: 2

I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

The solution seems to be to:

  • Convert the character string into a sequence of bytes using the UTF-8 encoding
  • Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte (see the sketch below)
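
As a concrete illustration of those two steps, here is a minimal sketch in Python (the tag value is just an example, not content from this site):

    from urllib.parse import quote

    tag = "München"  # hypothetical non-ASCII tag name

    # quote() does essentially the two steps above: it encodes the string as
    # UTF-8 and replaces every byte outside the unreserved set
    # (letters, digits, -._~) with %HH.
    slug = quote(tag, safe="")

    print(slug)                         # M%C3%BCnchen
    print("site.com/tags/id/" + slug)   # site.com/tags/id/M%C3%BCnchen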

However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering whether Google is still smart enough to handle searches in URLs that contain encoded data - or whether I should attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages).

A: 

Do you know what language everything will be in? Is it all Latin-based?

If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, it just uses %HH.
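
A hedged sketch of that idea in Python, using NFKD normalization as a stand-in for a hand-built lookup table and %HH as the fallback (the function name and example tags are mine, not from the thread):

    import unicodedata
    from urllib.parse import quote

    def slugify(tag):
        """Transliterate to ASCII where possible, otherwise fall back to %HH."""
        # NFKD splits accented letters into base letter + combining mark,
        # so "Ź" becomes "Z" plus U+0301; dropping the marks leaves plain ASCII.
        decomposed = unicodedata.normalize("NFKD", tag)
        ascii_form = decomposed.encode("ascii", "ignore").decode("ascii")
        if ascii_form.strip():
            return quote(ascii_form, safe="")
        # Nothing usable survived (e.g. Armenian, CJK): percent-encode instead.
        return quote(tag, safe="")

    print(slugify("Zürich"))    # Zurich
    print(slugify("Գլխավոր"))   # %D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80

A hand-built table gives finer control over collisions and over letters like ł that NFKD cannot decompose, which is roughly what the WordPress accent table mentioned in the comments does.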

Earlz
Well, I have borrowed an accent-converter table (Ź into Z), which you can find in the WordPress code base. But I don't know what you mean by `%HH`.
Xeoncross
`Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte`
Earlz
How do you convert each byte to hexadecimal?
Xeoncross
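
To answer the byte-to-hexadecimal question literally: in most languages it is a one-line format operation. A minimal sketch in Python (the unreserved set follows RFC 3986; the function name is mine):

    # Unreserved characters per RFC 3986; every other byte becomes %HH.
    UNRESERVED = set(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )

    def percent_encode(text):
        out = []
        for byte in text.encode("utf-8"):           # step 1: UTF-8 bytes
            if chr(byte) in UNRESERVED:
                out.append(chr(byte))
            else:
                out.append("%{:02X}".format(byte))  # step 2: 0xC5 -> "%C5"
        return "".join(out)

    print(percent_encode("Ź"))   # %C5%B9
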
+3  A: 

Firstly, search engines really don't care about the URLs themselves. URLs help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam them, and no major search engine wants that. The allinurl: operator is merely a Google feature to help advanced users, not something that gets factored into organic rankings. Any benefit you get from a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site -- and there is some evidence this can be negative with the advent of negative PR too.

From Google Webmaster Central

Does that mean I should avoid rewriting dynamic URLs at all?

That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.

I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As far as Unicode goes, you don't understand how this works: the request goes to the hex-encoded destination, but the rendering engine must know how to handle the encoding if it wishes to decode it back into something visually appealing. Google will render (i.e. decode) encoded Unicode URLs properly.

Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
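
The hostname uses a different mechanism from the %HH path encoding: internationalized domain names are converted to an ASCII "punycode" form. A small Python illustration, with a made-up domain:

    # The stdlib idna codec implements IDNA 2003; stricter IDNA 2008 rules
    # live in the third-party "idna" package.
    host = "bücher.example"
    print(host.encode("idna"))   # b'xn--bcher-kva.example'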

I wanted to show you an example of the path encoding; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:

Hypertext Transfer Protocol
    GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
        [Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
            [Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
        Request Version: HTTP/1.0
    User-Agent: Wget/1.11.4\r\n
    Accept: */*\r\n
    Host: hy.wikipedia.org\r\n
    Connection: Keep-Alive\r\n
    \r\n

As you can see, wget, just like a browser, will URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.

Evan Carroll
Xeoncross
No, it isn't better: it's simply the same. `/$id` would make it slightly more difficult on users. All URLs must be encoded per RFC 3986 before you can make the request; the fact that your browser has the ability to encode the link you give it is just a nicety. Technically, if the server does it, you open yourself up to the almost non-existent market that doesn't have the ability to decode/encode Unicode links. Wikipedia also does this (the Unicode representation is the anchor text; the link itself is encoded). Per the spec, this is the way it is supposed to be.
Evan Carroll
So then what should I do? When I am creating a link that contains a UTF-8 string like `<a href="site.com/tags/id/non-ascii-tag">non-ascii-tag</a>`, should I just trust the browser to encode the URI - or should I run it through some kind of encoder function so the browser doesn't have to?
Xeoncross
Run it through the encoder; the browser will decode the path and query portions of the URL for visual appeal, but either way it almost certainly doesn't matter to a search engine. I imagine they normalize all Unicode URLs by encoding them anyway.
Evan Carroll
Run it through what encoder?
Xeoncross
A URL-encoder, of course.
Evan Carroll
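
For what it's worth, in Python that encoder would be urllib.parse.quote, and unquote shows the readable form a browser may display after decoding the path (the tag value is only an example):

    from urllib.parse import quote, unquote

    tag = "São Paulo"                                       # hypothetical tag
    href = "http://site.com/tags/id/" + quote(tag, safe="")

    # What goes into the markup: the %HH-encoded link.
    print('<a href="{0}">{1}</a>'.format(href, tag))
    # <a href="http://site.com/tags/id/S%C3%A3o%20Paulo">São Paulo</a>

    # What a browser may show in the address bar after decoding for readability.
    print(unquote(href))
    # http://site.com/tags/id/São Paulo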