I'm revealing my embarrassing ignorance of REGEX-fu here, but: I currently have a website where a load of the articles' URLs are written as "article_name", whilst the newer ones are written as "article-name".

I want to move all of them to using dashes, so is there a regular expression I could use to rewrite the older URLs to their newer equivalents?

Thanking you in advance!

+1  A: 

How will mod_rewrite know what the actual URL is supposed to be? You can rewrite all articles to use the underscore or the dash, but mod_rewrite has no way to tell whether the new location exists.

For example,

/I_Like_Bees      is stored as   /path/i_like_bees
/I-like-flowers   is stored as   /path/i-like-flowers

You want i-like-bees to rewrite to i_like_bees.

  • If you rewrite underscores to dashes, i_like_bees wouldn't be found.
  • If you rewrite dashes to underscores, i-like-flowers wouldn't be found.

If you stored all your articles consistently, you could easily make a rewrite rule work. Otherwise you probably have to write a script that checks whether the directory exists and does a 301 redirect to the correct place.
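A minimal sketch of such a script, in Python rather than whatever the site actually runs. The function name resolve_article and the doc_root parameter are made up for illustration, and it assumes each article is a directory under the document root and that request paths start with a slash:

```python
import os
from typing import Optional, Tuple


def resolve_article(doc_root: str, request_path: str) -> Optional[Tuple[int, str]]:
    """Return (status, path) for a request, or None if no variant exists on disk."""
    directory, name = request_path.rsplit("/", 1)
    # Try the name as requested first, then the dashed and underscored variants.
    candidates = [name, name.replace("_", "-"), name.replace("-", "_")]
    for candidate in candidates:
        path = f"{directory}/{candidate}"
        if os.path.isdir(os.path.join(doc_root, path.lstrip("/"))):
            # 200 if the requested name exists as-is, 301 if a variant does.
            status = 200 if candidate == name else 301
            return status, path
    return None
```

With the two articles above on disk, a request for /path/i-like-bees would come back as a 301 to /path/i_like_bees, while /path/i-like-flowers is served directly.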

Byron Whitlock
Sorry, I wasn't quite clear - all the articles will be normalised, I just want to put rules in place to handle old links and prevent linkrot.
Keith Williams
+3  A: 

First you must achieve consistency in the existing URLs. Basically, you have to normalize all existing names to always use dashes. Ok, you've done that.

We're starting with the following assumption:

The URL is roughly of the form:

http://example.com/articles/what-ever/really-doesnt_matter/faulty_article_name

where only URLs under /articles should be rewritten, and only the /faulty_article_name part needs to be sanitized.

Greatly updated, with something that actually works

For Apache:

RewriteEngine     On
RewriteRule       ^(/?articles/.*/[^/]*?)_([^/]*?_[^/]*)$ $1-$2 [N]
RewriteRule       ^(/?articles/.*/[^/]*?)_([^/_]*)$       $1-$2 [R=301]

That's generally inspired by GApple's answer.

The first /? ensures that this code will run on both vhost confs and .htaccess files. The latter does not expect a leading slash.

I then add the articles/ part to ensure that the rules only apply for URLs within /articles.

Then, while we have at least two underscores in the URL, we keep looping through the rules. When we end up with only one remaining underscore, the second rule kicks in, replaces it with a dash, and does a permanent redirect.

Phew.
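To see what the two rules do without touching a server, here is a rough Python emulation. The two patterns are copied unchanged from the rules above; the rewrite function and its loop are only my approximation of how [N] and [R=301] behave, not Apache's actual processing:

```python
import re

# The two patterns from the rules above, unchanged.
LOOP_RULE = re.compile(r"^(/?articles/.*/[^/]*?)_([^/]*?_[^/]*)$")
FINAL_RULE = re.compile(r"^(/?articles/.*/[^/]*?)_([^/_]*)$")


def rewrite(path):
    """Return (new_path, redirected) after emulating the [N] loop and the final redirect."""
    # Rule 1 with [N]: start over while the last segment has two or more underscores.
    while True:
        m = LOOP_RULE.match(path)
        if not m:
            break
        path = f"{m.group(1)}-{m.group(2)}"
    # Rule 2 with [R=301]: replace the one remaining underscore and redirect.
    m = FINAL_RULE.match(path)
    if m:
        return f"{m.group(1)}-{m.group(2)}", True
    return path, False


print(rewrite("/articles/what-ever/really-doesnt_matter/faulty_article_name"))
```

Only the last path segment is touched, so the underscore in really-doesnt_matter survives. Note that the .*/ in both patterns requires at least one directory between articles/ and the article name, so a URL like /articles/foo_bar would not match.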

kch
Normalisation has been achieved - I've just migrated to a new CMS (wordpress), so all the articles use dashes for whitespace now. The rules are going in a wordpress plugin for redirecting content, which accepts either static re-directs (don't want to put 50+ individual ones in!!) or a regex.
Keith Williams
Oh, and yes - that's the exact structure of the URLs.
Keith Williams
You mean you're not using apache's mod_rewrite? What's this wordpress plugin? Got a link to it? I'd like to know how exactly it does its redirections. If you could please update your question to point out this fact…
kch
Ah, now it occurs to me that Apache won't really perform a substitution the way gsub would; it expects you to capture the elements you want to reuse and put them back in when generating the final URL. That does make things a bit trickier. gnarf's solution might be your best bet.
kch
Sorry for the long delay - I went off on holiday just after I posted this, only just got back and got around to trying it. Worked like a charm! To answer your question, I didn't want to use mod_rewrite because I don't have access to the HTTP config file (shared hosting), and wordpress inserts its own redirection code into .htaccess.
Keith Williams
A: 

Here's a method: http://yoast.com/apache-rewrite-dash-underscore/

Basically it splits the URL into tokens on either side of the underscore and reassembles them with the underscore replaced. The problem is that it only replaces a single underscore at a time: it will redirect to a closer but not quite correct URL, which will again redirect to an even closer, but possibly still not correct, URL...

It suggests fixing the multiple redirects by having several rewrite conditions & rules with successively more underscores and tokens, but this would require as many conditions and rules as you have underscores in your longest title.

Do add qualifiers where you can, though: as written, the rule may also rewrite paths you don't want changed (e.g., image files).
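To get a feel for how long the redirect chain gets, here is a small Python sketch. The pattern is my own stand-in for a "replace one underscore, then redirect" rule, not the exact rule from the linked article, and the example path is made up:

```python
import re

# Assumed single rule: replace the first underscore, then issue a redirect.
ONE_UNDERSCORE = re.compile(r"^([^_]*)_(.*)$")


def redirect_chain(path):
    """Return every URL the client visits, starting with the original request."""
    chain = [path]
    while True:
        m = ONE_UNDERSCORE.match(chain[-1])
        if not m:
            return chain
        chain.append(f"{m.group(1)}-{m.group(2)}")


chain = redirect_chain("/articles/2008/new_years_eve_party")
print(len(chain) - 1, "redirects")
for url in chain:
    print(url)
```

A title with three underscores costs three round trips before the client lands on the final URL, and a long enough title could hit the browser's redirect limit.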

GApple
That's a "Too many redirects" error waiting to happen. I'd stay away from this solution. It's a clever hack, but not without issues.
kch
Well, this could work if instead of using [R=301] one used [N] (For next round)
kch
And it does. See my updated answer.
kch
+1  A: 

A potential different approach to think about:

For this idea I'm assuming that your "old format" and your "new format" live in different directories. If they don't, you might want to consider giving the new format a different directory name.

For instance:

http://site.com/articles/2008/12/31/new_years_celebration
http://site.com/article/2008/12/31/new-years-celebration

In which case you could use mod_rewrite to detect anything in the "old directory" and redirect it to a "redirector.php".

Although on second thought, your mod_rewrite could look for something like this:

RewriteRule ^/?articles/(.*_.*)$  /redirector.php?article=$1 [L]

Matching anything with a _ and sending it through the redirector.

Inside redirector.php you can read $_SERVER['REQUEST_URI'] and use tools like preg_replace, or even database queries, to find the correct URL to redirect to. You can also track the number of hits to old URLs.
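The core of such a redirector might look roughly like this. It is sketched in Python instead of PHP, with re.sub standing in for preg_replace; redirect_target and old_url_hits are hypothetical names, and the in-memory counter is only a placeholder for a real log or database table:

```python
import re
from collections import Counter

old_url_hits = Counter()  # hypothetical stand-in for a persistent hit log


def redirect_target(request_uri):
    """Return the dashed URL to 301 to, or None if the URI needs no redirect."""
    path, sep, query = request_uri.partition("?")
    head, slash, last = path.rpartition("/")
    if "_" not in last:
        return None
    old_url_hits[path] += 1  # record that an old-style URL was requested
    # Only the article name (last segment) is rewritten; the query string is kept.
    new_last = re.sub(r"_", "-", last)
    return f"{head}{slash}{new_last}{sep}{query}"


print(redirect_target("/articles/2008/12/31/new_years_celebration"))
```

Because the replacement happens in one pass, the client gets a single 301 no matter how many underscores the old title contains.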

gnarf
This may turn out to be a much easier solution to implement.
kch
+2  A: 

Try this:

RewriteRule ^([^_]*)_([^_]*_.*) $1-$2 [N]
RewriteRule ^([^_]*)_([^_]*)$ /$1-$2 [L,R=301]

The first rule replaces one underscore at a time until only one is left. The last rule then replaces that final underscore and does an external redirect.

Gumbo
+1 for remembering to add the forward slash when redirecting. I'm not sure I want to update my vhost-htaccess-agnostic answer to actually take this into consideration. Oh so many parentheses.
kch