Folks,

Does anybody know how Wikipedia, or MediaWiki in general, encodes the URI from the page title? It's not normal URI encoding: spaces are replaced with underscores, single quotation marks are not encoded, and so on. Is there any reference on that?

Cheers, Parsa

+1  A: 

The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.

Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.

The logic looks something like this:

  • Decode character references (e.g. &eacute; becomes é)
  • Convert spaces to underscores
  • Check whether the title is a reference to a namespace or interwiki
  • Remove hash fragments (e.g. Apple#Name)
  • Remove forbidden characters
  • Forbid subdirectory links (e.g. ../directory/page)
  • Forbid triple tilde sequences (~~~) (for some reason)
  • Limit the size to 255 bytes
  • Capitalise the first letter
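
To make the steps above concrete, here is a minimal Python sketch of the same sequence. It is not MediaWiki's code: the function name normalize_title and the FORBIDDEN character set are assumptions made up for illustration, the namespace/interwiki check is skipped entirely, and the sketch rejects forbidden characters with an error instead of removing them.

```python
import html
import re

MAX_TITLE_BYTES = 255  # size limit from the list above

# Hypothetical forbidden-character set; the real one lives in MediaWiki's config.
FORBIDDEN = re.compile(r"[<>\[\]{}|]")


def normalize_title(raw: str) -> str:
    """Rough sketch of the listed steps; not a port of Title::secureAndSplit."""
    # Decode character references such as &eacute;
    title = html.unescape(raw)
    # Convert spaces to underscores, collapsing runs into one
    title = re.sub(r"[ _]+", "_", title)
    # (The namespace / interwiki prefix check is omitted in this sketch.)
    # Drop the hash fragment: Apple#Name -> Apple
    title = title.split("#", 1)[0].strip("_")
    # Reject forbidden characters (assumed set, see FORBIDDEN above)
    if FORBIDDEN.search(title):
        raise ValueError("forbidden character in title")
    # Reject relative path components such as ../directory/page
    if any(part in (".", "..") for part in title.split("/")):
        raise ValueError("relative path in title")
    # Reject triple tilde sequences
    if "~~~" in title:
        raise ValueError("tilde sequence in title")
    # Enforce the size limit, measured in UTF-8 bytes
    if not title or len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        raise ValueError("title is empty or too long")
    # Capitalise the first letter
    return title[0].upper() + title[1:]
```

With this sketch, normalize_title("caf&eacute; au lait") gives "Café_au_lait" and normalize_title("apple#Name") gives "Apple", while "../directory/page" raises ValueError.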

Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.

I hope that helps!

lonesomeday
+1  A: 

hxxp://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 describes the restrictions their engine enforces on article names.

They should have something like this in their LocalSettings.php: $wgArticlePath = '/wiki/$1';

and a matching URI rewrite configuration on the server. They seem to be using Apache (judging by the HTTP header), so it's probably mod_rewrite: hxxp://www.mediawiki.org/wiki/Manual:Short_URL

You can also refer to the index.php file for an article on Wikipedia like this: hxxp://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to hxxp://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited hxxp://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
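
As a rough illustration of the two URL forms, here is a small Python sketch that builds both from a title. The helper names and the SAFE_CHARS set are assumptions for this example, not what MediaWiki actually ships; only the '/wiki/$1' pattern and the /w/index.php path come from the text above.

```python
from urllib.parse import quote

ARTICLE_PATH = "/wiki/$1"     # same shape as the $wgArticlePath setting above
SCRIPT_PATH = "/w/index.php"  # script path seen in the Wikipedia example above

# Characters left unescaped in this sketch. The exact set is an assumption;
# the apostrophe is included because the question says it is not encoded.
SAFE_CHARS = "/:;@$!*(),~'"


def encode_title(title: str) -> str:
    """Replace spaces with underscores, then percent-encode the rest."""
    return quote(title.replace(" ", "_"), safe=SAFE_CHARS)


def article_url(title: str) -> str:
    """Short form: substitute the encoded title for $1 in the article path."""
    return ARTICLE_PATH.replace("$1", encode_title(title))


def script_url(title: str) -> str:
    """Long form: pass the encoded title as the title query parameter."""
    return f"{SCRIPT_PATH}?title={encode_title(title)}"
```

Here article_url("Foo bar") gives "/wiki/Foo_bar" and script_url("Foo bar") gives "/w/index.php?title=Foo_bar". Passing safe="" to quote instead would produce the %28 and %29 escapes seen in the naming conventions link; both spellings decode to the same title.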

Zygmunt
I believe `mod_rewrite` does not rewrite URLs to `index.php?title=Foo_bar`. The links are rewritten (if at all) to `index.php/Foo_bar` and then read by `$_SERVER['REQUEST_URI']` or something similar.
lonesomeday
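
Whichever way the server hands the request over, the engine ends up with the same title. A hedged Python sketch of that idea, reading the title either from the title query parameter or from the path: the function name, ARTICLE_PREFIX and SCRIPT_PATHS are assumptions for illustration and say nothing about how MediaWiki itself parses requests.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote, urlsplit

ARTICLE_PREFIX = "/wiki/"                    # assumed short-URL prefix
SCRIPT_PATHS = ("/w/index.php", "/index.php")  # assumed script locations


def title_from_url(url: str) -> Optional[str]:
    """Return the page title named by a URL, or None if none is found."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # parse_qs already percent-decodes values
    raw = None
    if "title" in query:
        # index.php?title=Foo%20bar
        raw = query["title"][0]
    elif parts.path.startswith(ARTICLE_PREFIX):
        # /wiki/Foo_bar
        raw = unquote(parts.path[len(ARTICLE_PREFIX):])
    else:
        # index.php/Foo_bar
        for script in SCRIPT_PATHS:
            if parts.path.startswith(script + "/"):
                raw = unquote(parts.path[len(script) + 1:])
                break
    if not raw:
        return None
    # Spaces and underscores name the same page
    return raw.replace(" ", "_")
```

All of "/w/index.php?title=Foo%20bar", "/wiki/Foo_bar" and "/index.php/Foo_bar" come out as "Foo_bar", and "/wiki/O%27Reilly" and "/wiki/O'Reilly" both come out as "O'Reilly".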