ansaurus

Question

How do I use regex to do this in Python?

Answer 1

+5 A:

import re
text = "this isn't alphanumeric"
result = re.sub(r'\W','-',text) # result will be "this-isn-t-alphanumeric"

The \W class is the inverse of the \w class, which consists of alphanumeric characters and underscores ([a-zA-Z0-9_] for ASCII matching; in Python 3, \w on str patterns also matches Unicode letters and digits unless re.ASCII is set). Thus, replacing every character that matches \W with a dash leaves you with a string that consists only of alphanumerics, underscores, and dashes, suitable for a URL.
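
As a small illustration of how \w depends on the matching mode, here is a sketch that assumes Python 3; the sample text is made up for the example.

```python
import re

# "\u00e9" is a precomposed e-acute; the sample text is illustrative only.
text = "caf\u00e9 del mar & friends"

# Default str patterns are Unicode-aware: the accented letter counts as \w.
unicode_result = re.sub(r'\W', '-', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter
# is replaced as well.
ascii_result = re.sub(r'\W', '-', text, flags=re.ASCII)

print(unicode_result)  # café-del-mar---friends
print(ascii_result)    # caf--del-mar---friends
```

Note that each replaced character becomes its own dash, so " & " turns into three dashes.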

Amber 2010-04-06 23:40:26

He may want underscores replaced by dashes as well.

jemfinch 2010-04-06 23:47:09

It's possible. If that's the case, `r'[\W_]'` as the pattern will accomplish that easily.
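
A quick sketch of the difference, assuming the goal is to turn underscores into dashes too (the sample string is invented for the example):

```python
import re

text = "snake_case isn't alphanumeric"

# \W alone leaves underscores in place, because "_" is part of \w.
keeps_underscores = re.sub(r'\W', '-', text)

# [\W_] matches non-word characters and underscores.
no_underscores = re.sub(r'[\W_]', '-', text)

print(keeps_underscores)  # snake_case-isn-t-alphanumeric
print(no_underscores)     # snake-case-isn-t-alphanumeric
```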

Amber 2010-04-06 23:49:29

Answer 2

A:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

SamB 2010-04-06 23:41:00

-1: Just because regex isn't the solution for everything doesn't mean it's not the solution for some things.

Amber 2010-04-06 23:44:08

Regular expressions are very much the correct solution to this problem, pithy quotations notwithstanding.

jemfinch 2010-04-06 23:46:32

Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they still have their original problem.

Paul McGuire 2010-04-07 01:26:25

Answer 3

+1 A:

This response doesn't use regular expressions, but it should also work, and it gives finer control over which kinds of characters are filtered. It uses the unicodedata module to check each character's Unicode category and replaces symbols and spaces with dashes.

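import unicodedata

# See http://www.dpawson.co.uk/xsl/rev2/UnicodeCategories.html for character categories
# Sc, Sk, Sm, So are the symbol categories; Zs is the space separator.
replace = ('Sc', 'Sk', 'Sm', 'So', 'Zs')

def symbolsReplaceDashes(text):
    # In Python 3, str is already Unicode, so no unicode() conversion is needed.
    # (In Python 2, pass a unicode string rather than a byte string.)
    L = []
    for char in text:
        if unicodedata.category(char) in replace:
            L.append('-')
        else:
            L.append(char)
    return ''.join(L)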

If the output contains characters beyond basic ASCII alphanumerics, you may need to percent-encode it with something like urllib.quote(output.encode('utf-8')) (urllib.parse.quote(output) in Python 3).
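
A minimal sketch of that encoding step, assuming Python 3, where the function lives in urllib.parse; the slug value is invented for the example.

```python
from urllib.parse import quote

# Illustrative value; "\u00e9" is a precomposed e-acute.
slug = "caf\u00e9-del-mar"

# quote() encodes str input as UTF-8 by default and percent-escapes the
# bytes that are not ASCII letters, digits, or safe punctuation.
encoded = quote(slug)

print(encoded)  # caf%C3%A9-del-mar
```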

David Morrissey 2010-04-06 23:50:35

Answer 4

+1 A:

Instead of regex, if you want to escape a string for use in a URL, use urllib.quote() or urllib.quote_plus(). For more complex queries, you might want to build the URL using urllib.urlencode(). You can reverse the quoting with urllib.unquote() and urllib.unquote_plus(). (In Python 3, these functions live in urllib.parse.)
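
For reference, a short sketch of these calls under Python 3 (imported from urllib.parse); the title and query parameters are made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

title = "this isn't alphanumeric"

# quote() escapes spaces as %20; quote_plus() uses "+" instead.
quoted = quote(title)
plus_quoted = quote_plus(title)
print(quoted)       # this%20isn%27t%20alphanumeric
print(plus_quoted)  # this+isn%27t+alphanumeric

# urlencode() builds a query string from key/value pairs.
query = urlencode({"q": title, "page": 2})
print(query)        # q=this+isn%27t+alphanumeric&page=2

# Both forms of quoting are reversible, unlike the regex substitution.
restored = unquote(quoted)
restored_plus = unquote_plus(plus_quoted)
```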

Lie Ryan 2010-04-06 23:54:02

Since the OP is asking for something which is a lossy transformation, my guess is less that they want to escape the string, and more that they want to generate "nice" URLs from things like post titles, et cetera.
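
If "nice" URLs are indeed the goal, a minimal sketch might look like the following. The name slugify is hypothetical, and collapsing runs of characters, trimming dashes, and lowercasing are assumptions that go beyond what the question states.

```python
import re

def slugify(title):
    """Hypothetical helper: lossy conversion of a title to a URL slug."""
    # Replace each run of characters outside [a-zA-Z0-9] with one dash.
    slug = re.sub(r'[^a-zA-Z0-9]+', '-', title)
    # Drop leading/trailing dashes and normalize case.
    return slug.strip('-').lower()

print(slugify("Hello, World! It's 2010_04"))  # hello-world-it-s-2010-04
```

Unlike quoting, this cannot be reversed: different titles can map to the same slug.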

Amber 2010-04-07 04:05:17
