I need to build a website that will host articles, and I would like to give them friendly URLs. For example:

Title: Article Test

should become http://www.example.com/articles/article_test

Of course I need to remove some characters from the title like ? or #, but I'm not sure which ones to remove.

Can someone tell me what characters are safe to keep?

thanks!

+3  A: 

The format of a URI is defined in RFC 3986. See section 3.3, which covers the path component, for details.

joschi
A: 

I think you're looking for something like "URL encoding": encoding a URL so that it's "safe" to use on the web.

Here's a reference for that. If you don't want any special characters, just remove every character that would require URL encoding:

http://www.w3schools.com/TAGS/ref_urlencode.asp
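
As a rough illustration of what URL encoding does to a title, here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The function name `encode_segment` is made up for this example, and passing `safe=""` is an assumption so that a slash inside a title is encoded too instead of being read as a path separator.

```python
from urllib.parse import quote, unquote


def encode_segment(title: str) -> str:
    """Percent-encode a title so it can be used as a single URL path segment.

    safe="" is an assumption for this sketch: it makes quote() encode "/"
    as well, so a slash in the title cannot split the path.
    """
    return quote(title, safe="")


if __name__ == "__main__":
    encoded = encode_segment("Article Test?")
    print(encoded)           # Article%20Test%3F
    print(unquote(encoded))  # Article Test?
```

Any character that changes in the output is one that needs encoding, so it is a candidate for removal if you want a clean URL.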

Andy White
+1  A: 

Hi, you are better off keeping only certain characters (a whitelist) than removing specific characters (a blacklist).

You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:

  1. Lower case letters (convert upper case to lower)
  2. Numbers, 0 through 9
  3. A dash - or underscore _
  4. Tilde ~

Everything else has a potentially special meaning. For example, you may think you can use +, but it is often decoded as a space, notably in query strings. & is dangerous too, especially if you use rewrite rules.

As with the other comments, check out the standards and specifications for complete details.
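
The whitelist above could be sketched in Python roughly like this. The function name `whitelist_filter` is hypothetical, and turning whitespace into underscores is an assumption added only to reproduce the `article_test` example from the question; the list itself does not say what to do with spaces.

```python
import re

# Anything that is NOT a lowercase letter, digit, underscore, tilde or dash.
_DISALLOWED = re.compile(r"[^a-z0-9_~-]")


def whitelist_filter(title: str) -> str:
    """Lowercase the title and keep only the whitelisted characters."""
    lowered = title.strip().lower()
    # Assumption: runs of whitespace become one underscore, as in the
    # question's example URL.
    underscored = re.sub(r"\s+", "_", lowered)
    return _DISALLOWED.sub("", underscored)


if __name__ == "__main__":
    print(whitelist_filter("Article Test"))      # article_test
    print(whitelist_filter("What? #1 Article"))  # what_1_article
```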

carl
what about the period?
Neil N
A: 

What you need is address rewriting. This Wikipedia article should give you enough information to start with. Specifically, if you use Apache, read about mod_rewrite.

ya23
+1  A: 

There are two sets of characters you need to watch out for - Reserved and Unsafe.

The reserved characters are: ampersand ("&"), dollar ("$"), plus sign ("+"), comma (","), forward slash ("/"), colon (":"), semi-colon (";"), equals ("="), question mark ("?"), and the 'at' symbol ("@").

The characters generally considered unsafe are: space, less than and greater than ("<>"), open and close brackets ("[]"), open and close braces ("{}"), pipe ("|"), backslash ("\"), caret ("^"), tilde ("~"), percent ("%"), and pound ("#"). Note that the tilde was listed as unsafe in the older RFC 1738, but RFC 3986 classifies it as unreserved.

I may have forgotten one or more, which leads me to echo Carl V's answer. In the long run you are probably better off using a whitelist of allowed characters and then encoding the string than trying to stay abreast of the characters that are disallowed by servers and systems.
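
If you just want to see which characters in a title fall into either group, a small Python sketch like the following might help. The names `RESERVED`, `UNSAFE` and `flagged_chars` are invented for this example, and the sets simply mirror the lists in this answer, which may be incomplete.

```python
# Sets taken from the lists in this answer; they may not be exhaustive.
RESERVED = set("&$+,/:;=?@")
UNSAFE = set(" <>[]{}|\\^~%#")


def flagged_chars(title: str) -> list[str]:
    """Return the reserved or unsafe characters found in the title, sorted."""
    return sorted(set(title) & (RESERVED | UNSAFE))


if __name__ == "__main__":
    print(flagged_chars("What? #1"))      # [' ', '#', '?']
    print(flagged_chars("article_test"))  # []
```

An empty result only means the title avoids the characters listed here, which is one more reason to prefer a whitelist.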

Gary.Ray
+9  A: 

To quote section 2.3 of RFC 3986:

"Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde."

Skip Head
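Following up on the RFC 3986 quote above: a quick way to test whether a candidate URL segment uses only unreserved characters could look like this in Python. The names `UNRESERVED` and `is_unreserved_only` are made up for the sketch; the character set is the one described in the quoted section 2.3.

```python
import string

# Unreserved characters per the quoted text: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def is_unreserved_only(segment: str) -> bool:
    """True if every character in the segment is unreserved.

    Note: an empty string also returns True, since it has no characters
    to reject.
    """
    return all(ch in UNRESERVED for ch in segment)


if __name__ == "__main__":
    print(is_unreserved_only("article_test"))  # True
    print(is_unreserved_only("article test"))  # False
```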
A: 

There was a similar question here. Check it out; you may find some useful answers there as well (there were quite a lot of them).

ldigas
+1  A: 

From the context you describe, I suspect that what you're actually trying to make is something called an 'SEO slug'. The generally accepted best practice for those is:

  1. Convert to lower-case
  2. Convert entire sequences of characters other than a-z and 0-9 to one hyphen (-) (not underscores)
  3. Remove 'stop words' from the URL, i.e. not-meaningfully-indexable words like 'a', 'an', and 'the'; Google 'stop words' for extensive lists

So, as an example, an article titled "The Usage of !@%$* to Represent Swearing In Comics" would get a slug of "usage-represent-swearing-comics".
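
The three steps above might be sketched in Python as follows. The function name `seo_slug` is hypothetical, and the `STOP_WORDS` set is a tiny illustrative sample chosen to reproduce the example; a real list would be much longer.

```python
import re

# Illustrative sample only; real stop-word lists are far more extensive.
STOP_WORDS = {"a", "an", "the", "of", "to", "in"}


def seo_slug(title: str) -> str:
    """Build a slug: lowercase, hyphenate, then drop stop words."""
    lowered = title.lower()
    # Each run of characters other than a-z and 0-9 becomes one hyphen.
    hyphenated = re.sub(r"[^a-z0-9]+", "-", lowered)
    # Splitting on hyphens also discards empty pieces at the ends.
    words = [w for w in hyphenated.split("-") if w and w not in STOP_WORDS]
    return "-".join(words)


if __name__ == "__main__":
    print(seo_slug("The Usage of !@%$* to Represent Swearing In Comics"))
    # usage-represent-swearing-comics
```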

chaos
Is it really a good approach to remove these "stop words" from the url? Would search engines penalize a website because of this?
Paulo
Search engines are generally believed to only acknowledge some portion of the URL and/or to give reduced significance to later portions, so by removing stop words what you're doing is maximizing the number of keywords you embed in your URL that you have a chance of actually ranking on.
chaos
A: 

From an SEO perspective, hyphens are preferred over underscores. Convert to lowercase, remove all apostrophes, then replace each run of non-alphanumeric characters with a single hyphen. Trim any excess hyphens off the start and end.
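
A minimal Python sketch of that recipe, assuming ASCII titles, could look like this. The name `hyphen_slug` is invented for the example, and also stripping the curly apostrophe (U+2019) is an extra assumption, since titles pasted from word processors often contain it.

```python
import re


def hyphen_slug(title: str) -> str:
    """Lowercase, drop apostrophes, hyphenate the rest, trim the ends."""
    slug = title.lower()
    # Remove apostrophes first so "don't" becomes "dont", not "don-t".
    # The curly apostrophe is included as an assumption.
    slug = re.sub("['\u2019]", "", slug)
    # Each run of non-alphanumeric characters becomes a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")


if __name__ == "__main__":
    print(hyphen_slug("Don't Panic: A Guide!"))  # dont-panic-a-guide
```

Removing apostrophes before the hyphen step is what keeps contractions together as one word.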

Mark