What is a good complete Regex or some other process that would take "How do you change a title to be part of the url like Stackoverflow?" and turn it into "how-do-you-change-a-title-to-be-part-of-the-url-like-stackoverflow" that is used in the smart urls?

The dev environment I am using is Rails, but if there are other platform-specific solutions (.NET, PHP, Django), I would love to see those too. I am sure I (or another reader) will come across the same problem on a different platform down the line.

-- edit --

I am using custom routes. I mainly want to know how to alter the string so that all special chars are removed, it's all lowercase, and all whitespace is replaced.

+9  A: 

You will want to set up a custom route to point the URL to the controller that will handle it. Since you are using Rails, here is an introduction to using its routing engine.

Edit

Sorry, I misunderstood your question. In Ruby you will need a regex, as you already know; here is one to use:

def permalink_for(str)
    str.gsub(/[^\w\/]|[!\(\)\.]+/, ' ').strip.downcase.gsub(/\ +/, '-')
end
Dale Ragan
+1  A: 

On my LAMP sites I use the mod_rewrite function in .htaccess

Read more here: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

Andrew G. Johnson
+2  A: 

I am not familiar with Rails, but the following is (untested) PHP code. You can probably translate this very quickly to Rails if you find it useful.

$sURL = "This is a title to convert to URL-format. It has 1 number in it!";
// lower-case
$sURL = strtolower($sURL);
// replace all non-word characters with spaces
$sURL = preg_replace("/\W+/", " ", $sURL);
// remove leading and trailing spaces (so we won't start or end with a separator)
$sURL = trim($sURL);
// replace spaces with separators (hyphen)
$sURL = str_replace(" ", "-", $sURL);
echo $sURL;
// outputs: this-is-a-title-to-convert-to-url-format-it-has-1-number-in-it

Hope this helps.

Vegard Larsen
+1  A: 

I don't know much about Ruby or Rails, but in Perl, this is what I would do:

my $title = "How do you change a title to be part of the url like Stackoverflow?";

my $url = lc $title;   # Change to lower case and copy to URL.
$url =~ s/^\s+//g;     # Remove leading spaces.
$url =~ s/\s+$//g;     # Remove trailing spaces.
$url =~ s/\s+/\-/g;    # Change one or more spaces to single hyphen.
$url =~ s/[^\w\-]//g;  # Remove any non-word characters.

print "$title\n$url\n";

I just did a quick test and it seems to work. Hopefully this is relatively easy to translate to Ruby.

Brian
+1  A: 

Assuming that your model class has a title attribute, you can simply override the to_param method within the model, like this:

def to_param
  title.downcase.gsub(/ /, '-')
end

This Railscast episode has all the details. You can also ensure that the title only contains valid characters using this:

# \A and \z anchor the whole string; ^ and $ would only anchor each line
validates_format_of :title, :with => /\A[a-z0-9-]+\z/,
                    :message => 'can only contain letters, numbers and hyphens'
John Topley
A: 

What about funny characters? What are you going to do about those? Umlauts? Punctuation? These need to be considered. Basically, I would use a white-list approach, as opposed to the black-list approaches above: describe which characters you will allow, which characters you will convert (and to what), and then change the rest to something meaningful (""). I doubt you can do this in one regex... Why not just loop through the characters?
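
A minimal Ruby sketch of that white-list loop might look like the following. The helper name `whitelist_slug` and the `CONVERSIONS` table are hypothetical, made up for illustration; a real table would need many more entries.

```ruby
# Hypothetical conversion table (an assumption for this sketch, not from
# any library): characters we choose to convert rather than drop.
CONVERSIONS = {
  'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss', '&' => ' and '
}.freeze

# Hypothetical helper: walks the title one character at a time.
def whitelist_slug(title)
  out = +''
  title.downcase.each_char do |ch|
    # Convert first; a conversion may yield several characters.
    CONVERSIONS.fetch(ch, ch).each_char do |c|
      if c.match?(/[a-z0-9]/)
        out << c                       # allowed: keep as-is
      elsif !out.empty? && !out.end_with?('-')
        out << '-'                     # anything else: a single separator
      end
    end
  end
  out.chomp('-')                       # no trailing separator
end

puts whitelist_slug("How do you change a title to be part of the url like Stackoverflow?")
# prints: how-do-you-change-a-title-to-be-part-of-the-url-like-stackoverflow
```

Everything not explicitly allowed or converted collapses into one dash, so new kinds of input cannot leak unexpected characters into the URL.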

Daren Thomas
+1  A: 

Brian's code, in Ruby:

title.downcase.strip.gsub(/\s+/, '-').gsub(/[^\w\-]/, '')

downcase turns the string to lowercase, strip removes leading and trailing whitespace, the first gsub call globally substitutes each run of whitespace with a single dash, and the second removes everything that isn't a word character (letter, digit, or underscore) or a dash.

Sören Kuklau
+28  A: 

Here's how we do it. Note that there are probably more edge conditions than you realize at first glance.

This is the second version, unrolled for 5x more performance (and yes, I benchmarked it). I figured I'd optimize it because this function can be called hundreds of times per page.

if (String.IsNullOrEmpty(title)) return "";

// to lowercase, trim extra spaces
title = title.ToLower().Trim();

var len = title.Length;
var sb = new StringBuilder(len);
bool prevdash = false;
char c;

for (int i = 0; i < title.Length; i++)
{
    c = title[i];
    if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-')
    {
        if (!prevdash)
        {
            sb.Append('-');
            prevdash = true;
        }
    }
    else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
    {
        sb.Append(c);
        prevdash = false;
    }
    if (i == 80) break;
}

title = sb.ToString();
// remove trailing dash, if there is one
if (title.EndsWith("-"))
    title = title.Substring(0, title.Length - 1);
return title;

To see the previous version of the code (which this one replaces; it is functionally equivalent but 5x faster), view the revision history of this post (click the date link).

Jeff Atwood
What's entityRegex?
CVertex
It would be nice with a version that doesn't just drop accented characters like åäö but instead deaccentuate them to aao... ^^
Oskar Duveborn
+5  A: 
The How-To Geek
+1  A: 

I'd add to the answers here that this is commonly known as a URL 'slug' if you want to google the term.

izb
+1  A: 

There is a small Rails plugin called PermalinkFu that does this.

The escape method transforms a string into one that is suitable for a URL. Have a look at the code; that method is quite simple.

To remove non-ascii chars it uses the iconv lib to translate to 'ascii//ignore//translit' from 'utf-8'. Spaces are then turned into dashes, everything is downcased etc.
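
As a rough sketch of the same pipeline (not PermalinkFu's actual code), the steps could be written as below. It swaps the iconv transliteration for Ruby's built-in `String#unicode_normalize` (Ruby 2.2+), and the helper name `ascii_slug` is hypothetical.

```ruby
# Hypothetical helper; unicode_normalize stands in for the iconv step.
def ascii_slug(text)
  s = text.unicode_normalize(:nfd)   # split "é" into "e" + combining accent
  s = s.gsub(/\p{Mn}/, '')           # drop the combining marks
  s = s.downcase
  s = s.gsub(/[^a-z0-9]+/, '-')      # runs of anything else become one dash
  s.gsub(/\A-+|-+\z/, '')            # trim dashes at both ends
end

puts ascii_slug("éphémère")    # prints: ephemere
puts ascii_slug("Räksmörgås")  # prints: raksmorgas
```

Characters that have no decomposition (for example "ø" or "ß") are not transliterated by this approach; they fall into the "anything else" case and become a dash.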

Lau
+3  A: 
D4V360
+1  A: 

T-SQL implementation, adapted from dbo.UrlEncode:

CREATE FUNCTION dbo.Slug(@string varchar(1024))
RETURNS varchar(3072)
AS
BEGIN
 DECLARE @count int, @c char(1), @i int, @slug varchar(3072)

 SET @string = replace(lower(ltrim(rtrim(@string))),' ','-')

 SET @count = Len(@string)
 SET @i = 1
 SET @slug = ''

 WHILE (@i <= @count)
 BEGIN
  SET @c = substring(@string, @i, 1)

  IF @c LIKE '[a-z0-9--]'
   SET @slug = @slug + @c

  SET @i = @i +1
 END

 RETURN @slug
END
Sören Kuklau
+1  A: 

If you are using Rails edge, you can rely on Inflector.parameterize. Here's the example from the documentation:

  class Person
    def to_param
      "#{id}-#{name.parameterize}"
    end
  end

  @person = Person.find(1)
  # => #<Person id: 1, name: "Donald E. Knuth">

  <%= link_to(@person.name, person_path(@person)) %>
  # => <a href="/person/1-donald-e-knuth">Donald E. Knuth</a>

Also, if you need to handle more exotic characters such as accents (éphémère) in previous versions of Rails, you can use a mixture of PermalinkFu and DiacriticsFu:

DiacriticsFu::escape("éphémère")
=> "ephemere"

DiacriticsFu::escape("räksmörgås")
=> "raksmorgas"

cheers!

Thibaut

--

http://blog.logeek.fr

Thibaut Barrère
Mmm, räksmörgås!
bzlm
hehe - I love those funny comments :)
Thibaut Barrère
+1  A: 

no, no, no. you are all so very wrong. Except for the diacritics-fu stuff, you're getting there, but what about asian characters (shame on ruby developers for not considering their nihonjin brethren)

firefox and safari both display non-ascii characters in the url, and frankly they look great. It is nice to support links like 'http://somewhere.com/news/read/お前たちはアホじゃないかい'

so here's some PHP code that'll do it, but I just wrote it, and haven't stress tested it.

<?php

function slug($str)
{
  $args = func_get_args();
  $args = array_filter($args);  // remove blanks (array_filter returns a new array)
  $slug = mb_strtolower(implode('-', $args));

  $real_slug = '';
  $hyphen = '';
  foreach(SU::mb_str_split($slug) as $c)
  {
    if (strlen($c) > 1 && mb_strlen($c)===1)
    {
      $real_slug .= $hyphen . $c;
      $hyphen = '';
    }
    else
    {
      switch($c)
      {
        case '&':
          $hyphen = $real_slug ? '-and-' : '';
          break;
        case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm':
        case 'n': case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u': case 'v': case 'w': case 'x': case 'y': case 'z':
        case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'H': case 'I': case 'J': case 'K': case 'L': case 'M':
        case 'N': case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U': case 'V': case 'W': case 'X': case 'Y': case 'Z':
        case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':
          $real_slug .= $hyphen . $c;
          $hyphen = '';
          break;
        default:
          $hyphen = $hyphen ? $hyphen : ($real_slug ? '-' : '');
      }
    }
  }

  return $real_slug;
}

Example:

$str = "~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 コリン ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 トーマス ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 アーノルド ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04";
echo slug($str);

Outputs: コリン-and-トーマス-and-アーノルド

the '-and-' is because &'s get changed to '-and-'.
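
For Rails users, the same idea can be sketched in Ruby with Unicode property classes, which keep letters and digits from any script. This is only an approximation of the PHP above, not a port: the helper name `unicode_slug` is hypothetical, and unlike the PHP version an ampersand at the very start or end of the input also becomes "and".

```ruby
# Hypothetical helper: keeps letters, marks and digits from any script.
def unicode_slug(text)
  s = text.downcase
  s = s.gsub('&', ' and ')                   # spell out ampersands
  s = s.gsub(/[^\p{L}\p{M}\p{N}]+/, '-')     # everything else becomes one dash
  s.gsub(/\A-+|-+\z/, '')                    # trim dashes at both ends
end

puts unicode_slug("コリン & トーマス")  # prints: コリン-and-トーマス
```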