Quick one.
I'm using mod_rewrite and have most replacements in place:
- empty space = _
- æ = ae
- Æ = ae
and so on.
What would be the natural replacement character for / ?
Thanks
I would use a dash -, as Google separates words like this for SEO purposes, or even an underscore _, as both are perfect for readability.
The underscore is considered a word character, so foo_bar is one word, not two. The hyphen is not considered a word character, so foo-bar is two words.
So you should use the hyphen - wherever you want to separate two parts and the underscore _ wherever you want to connect two parts. Since / is used to separate, I would prefer the hyphen - in that case.
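To illustrate the word-character distinction, here is a minimal sketch using Python's regex class \w, which matches letters, digits and the underscore. It only shows how regex tokenization behaves; whether a given search engine splits words the same way is an assumption. The helper name words is made up for the example.

```python
import re

def words(text):
    # \w covers letters, digits and "_", so an underscore never ends a token,
    # while a hyphen does.
    return re.findall(r"\w+", text)

print(words("foo_bar"))  # ['foo_bar']
print(words("foo-bar"))  # ['foo', 'bar']
```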
- Æ = ae
You don't necessarily need to do that. You can put non-ASCII Unicode characters in a URL as percent-encoded UTF-8 bytes. So:
http://en.wikipedia.org/wiki/%C3%86
displays in browsers as:
http://en.wikipedia.org/wiki/Æ
and either can be pasted into the address bar.
Space and slash can be encoded as %20 and %2F. However, those forms do still appear as percents in the browser, because they're otherwise reserved characters. So they don't look quite as pretty. There is an additional problem with %2F in that traditional scripting environments based around CGI can't read them, and Apache by default deliberately blocks them to stop such scripts getting confused and leaving security holes.
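As a rough illustration of the encodings mentioned here, this sketch uses Python's standard urllib.parse. The sample strings are invented for the example; note that quote() leaves / unencoded unless safe="" is passed.

```python
from urllib.parse import quote, unquote

# Non-ASCII characters are encoded as their UTF-8 bytes, one %xx per byte.
encoded = quote("Æ")
print(encoded)              # %C3%86
print(unquote(encoded))     # Æ

# safe="" forces the slash to be encoded as well; by default quote() keeps "/".
reserved = quote("a b/c", safe="")
print(reserved)             # a%20b%2Fc
```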
So I make title slugs by removing completely:
# % ' ( ) ? [ ] (U+00AD soft hyphen)
along with any control characters (U+0000 to U+001F except U+000A, and U+007F to U+009F). Then replacing any run of:
" $ & * + , / : ; < = > @ \ ^ (U+0020 space) (U+000A newline)
with a single underscore. This removes the necessity for a %xx sequence to appear in the URL. (For Unicode characters there will still be %xx sequences, but the user won't see them.)
You can use hyphen instead of underscore if you prefer, whichever is prettier. Search engines should be fine with either.
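The rules above could be sketched roughly as follows. This is not the answerer's actual code: the function name slugify and the sep parameter are made up, and the control-character ranges are an assumption (U+0000 to U+001F other than newline, plus U+007F to U+009F).

```python
import re

# Removed entirely: # % ' ( ) ? [ ] and the soft hyphen (U+00AD), plus
# control characters. Assumed ranges: C0 except newline, DEL, and C1.
REMOVE = re.compile(r"[#%'()?\[\]\u00ad\u0000-\u0009\u000b-\u001f\u007f-\u009f]")

# Any run of these collapses into a single separator.
REPLACE = re.compile(r'["$&*+,/:;<=>@\\^ \n]+')

def slugify(title, sep="_"):
    """Build a URL slug: drop the REMOVE set, then collapse REPLACE runs."""
    cleaned = REMOVE.sub("", title)
    return REPLACE.sub(sep, cleaned)

print(slugify("AC/DC: Back in Black"))   # AC_DC_Back_in_Black
print(slugify("What's new? (2009)"))     # Whats_new_2009
print(slugify("Æther/Ore", sep="-"))     # Æther-Ore
```

Non-ASCII letters such as Æ pass through untouched, so they would still be percent-encoded on the wire when the slug is placed in a URL.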
ETA, re the comment on the other answer:
"Because I use php to search for the name in my database. So each character has to be converted back to the original character else no match will be found."
In that case you can't do anything unrecoverable to the characters, though. You can't change ‘Æ’ to ‘ae’, spaces must be encoded as ‘%20’ and slashes as ‘%2F’. This will result in slightly ugly URLs, and the ‘%2F’ will give you deployment problems on Apache and IIS.
If you need to key solely on a title you'll need to add a (UNIQUE indexed) column on the processed slug to look up, as suggested above. However note that you then can't rename/correct a page title, as it will change the slug, breaking the URL.
A common approach to get around this is to include a numeric ID in addition to the slug (see, for example, how SO does it). You can also 301-redirect where the slug name is wrong for optimal SEO.
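A minimal sketch of the ID-plus-slug idea, assuming a hypothetical /pages/<id>/<slug> URL layout, an in-memory dict standing in for the database, and a path that has already been percent-decoded. The lookup keys only on the numeric ID; a missing or outdated slug yields a 301 to the canonical URL. All names here (PAGES, slug_for, resolve) are illustrative.

```python
import re

# Hypothetical stand-in for a database table keyed by numeric ID.
PAGES = {42: "Æther and Ore"}

def slug_for(title):
    # Simplified slug rule for this sketch: runs of space or slash become one hyphen.
    return re.sub(r"[ /]+", "-", title)

def resolve(path):
    """Return (status, canonical_url_or_title) for a path like /pages/42/some-slug."""
    match = re.fullmatch(r"/pages/(\d+)(?:/([^/]*))?", path)
    if not match:
        return 404, None
    page_id = int(match.group(1))
    title = PAGES.get(page_id)
    if title is None:
        return 404, None
    canonical = f"/pages/{page_id}/{slug_for(title)}"
    if path != canonical:
        # Wrong, missing or outdated slug: redirect rather than fail.
        return 301, canonical
    return 200, title

print(resolve("/pages/42/old-title"))      # (301, '/pages/42/Æther-and-Ore')
print(resolve("/pages/42/Æther-and-Ore"))  # (200, 'Æther and Ore')
```

Because the slug is never used for the lookup, a title can be renamed later without breaking old links.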