Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs.

A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").

I'm looking for a Perl slug function that given any valid Unicode string will return a slug representation (a-z, 0-9 and -).

A super trivial slug function would be something along the lines of:

$input = lc($input);
$input =~ s/[^a-z0-9-]//g;

However, this implementation would not handle internationalization and accents (I want ë to become e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.

My question:

  • What is the most general/practical way to generate Django/Rails type slugs in Perl? This is how I solved the same problem in Java.
A: 

What you have already excludes characters such as ë.

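You might want to change your regular expression to s/[^\w-]//g for readability's sake (\w already covers digits, so a separate \d is redundant). \w does include the _ character though, and on Unicode strings it also matches accented letters such as ë, so if this is not your wish then I suggest s/[^a-z\d-]//g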
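To make the difference between these character classes concrete, here is a minimal sketch using only core Perl. The sample string and variable names are made up for illustration, and the /a modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;
use utf8;   # the string literal below contains non-ASCII characters

# Hypothetical sample input: accents, an underscore, a hyphen and punctuation.
my $input = "zoë_łódź-42!";

# \w follows Unicode rules on this string, so the accented letters survive.
(my $unicode_w = $input) =~ s/[^\w-]//g;

# The /a modifier restricts \w to [A-Za-z0-9_], so they are removed.
(my $ascii_w = $input) =~ s/[^\w-]//ga;

# An explicit ASCII class also removes the underscore.
(my $explicit = $input) =~ s/[^a-z0-9-]//g;

binmode(STDOUT, ':encoding(UTF-8)');
print "$unicode_w\n$ascii_w\n$explicit\n";
```

None of the three variants turns ë into e; they only differ in which characters they delete.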

EDIT: Unless you wanted to replace the characters? There's not much you can do but slug it out, so to speak. Sometimes there are no tricky ways to do something.

OmnipotentEntity
I think you misunderstood the question. I want to turn ë into e, not remove ë. "ë" is just an example; the solution should cover é, è, ë, etc.
knorv
Then what you're looking for is one of the CPAN modules other users have posted; Text::Unaccent looks to be your best bet.
OmnipotentEntity
+1  A: 

Adding Text::Unaccent to the beginning of the chain looks like it will do what you want.

David Dorward
+3  A: 

Are you looking for something like Text::Unidecode?

phaylon
+6  A: 

The slugify filter currently used in Django translates (roughly) to the following Perl code:

use strict;
use warnings;
use Unicode::Normalize;

sub slugify {
    my ($input) = @_;

    $input = NFKD($input);         # Decompose characters, e.g. "ë" becomes "e" plus a combining mark
    $input =~ tr/\000-\177//cd;    # Strip non-ASCII characters (>127), including the combining marks
    $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
    $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
    $input = lc($input);           # Lowercase the result
    $input =~ s/[-\s]+/-/g;        # Collapse each run of spaces and hyphens into a single hyphen

    return $input;
}

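The NFKD step already turns decomposable accented characters such as ë into a base letter plus a combining mark, so the accent is dropped when the non-ASCII characters are stripped. For characters that have no such decomposition (ø or ß, for example), throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon).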
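As a rough sketch of how the two pieces could be combined: the function name slugify_translit and the optional loading of Text::Unidecode are my own assumptions, not something Django or the module prescribes. Unicode::Normalize ships with Perl, while Text::Unidecode has to be installed from CPAN, so the sketch skips transliteration when the module is missing.

```perl
use strict;
use warnings;
use utf8;                              # the sample strings below are non-ASCII
use Unicode::Normalize qw(NFKD);

# Text::Unidecode is a CPAN module, not core; load it only if it is installed.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical helper: Django-style slug with optional transliteration.
sub slugify_translit {
    my ($input) = @_;

    $input = NFKD($input);             # "ë" becomes "e" + U+0308
    $input =~ s/\p{Mn}+//g;            # drop the combining marks
    # Transliterate what is left (ø, ß, ...) when the module is available.
    $input = Text::Unidecode::unidecode($input) if $have_unidecode;
    $input =~ tr/\000-\177//cd;        # strip any remaining non-ASCII
    $input =~ s/[^\w\s-]//g;           # keep word characters, spaces, hyphens
    $input =~ s/^\s+|\s+$//g;          # trim both ends
    $input = lc $input;
    $input =~ s/[-\s]+/-/g;            # collapse runs into a single hyphen
    return $input;
}

print slugify_translit("Crème brûlée"), "\n";
print slugify_translit("  Hello, World!  "), "\n";
```

The accented examples above come out the same with or without Text::Unidecode, because NFKD handles them; the module only matters for characters that do not decompose.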

Cameron
+1  A: 

String::Dirify is used for making slugs in the blogging software Movable Type/Melody.

daxim
Does that handle Unicode or just ISO-8859?
MkV
Codepoints beyond 255 are untouched.
daxim