ansaurus

Question

How do I remove special characters from the end of every word in a string?

Answer 1

+5 A:

>>> import re
>>> test = "i am test-ing., i am test.ing-, i am_, test_ing,"
>>> re.sub(r'([^\w\s]|_)+(?=\s|$)', '', test)
'i am test-ing i am test.ing i am test_ing'

Matches one or more punctuation characters or underscores (([^\w\s]|_)+) followed by either whitespace (\s) or the end of the string ($). The (?= ) construct is a lookahead assertion: it checks for the whitespace without including it in the match, so the whitespace doesn't get replaced; only the ([^\w\s]|_)+ part gets replaced.

Okay, but why [^\w\s]|_, you ask? The first part, [^\w\s], matches any character that is neither a word character (alphanumeric or underscore, \w) nor whitespace (\s), i.e. punctuation characters. Except we do want to eliminate underscores, so we then include those with |_.

John Kugelman 2010-08-25 00:27:51
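
For readers who want the pieces of the pattern spelled out, here is a minimal sketch of the same regex written with re.VERBOSE and wrapped in a function. The names TRAILING_PUNCT and strip_trailing_punct are made up for illustration, and the capturing group is swapped for a non-capturing one, which does not change what re.sub removes.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer above,
# laid out with re.VERBOSE so each part can carry a comment.
TRAILING_PUNCT = re.compile(r"""
    (?: [^\w\s] | _ )+   # one or more punctuation characters or underscores
    (?= \s | $ )         # lookahead: whitespace or end of string, not consumed
""", re.VERBOSE)

def strip_trailing_punct(text):
    """Remove punctuation/underscores that sit at the end of a word."""
    return TRAILING_PUNCT.sub('', text)

print(strip_trailing_punct("i am test-ing., i am test.ing-, i am_, test_ing,"))
# i am test-ing i am test.ing i am test_ing
```

Punctuation inside a word (test-ing, test.ing, test_ing) is left alone because it is not followed by whitespace or the end of the string.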

John: thanks for the reply, i'd like to know what's the difference between $ and \Z?

killown 2010-08-25 00:41:23

-1 Normally (non-MULTILINE) there *is* a difference; `$` perlishly matches the end of the input string OR A NEWLINE AT THE END OF THE STRING. `\Z` matches only at the end of the string, which is usually the desired behaviour.

John Machin 2010-08-25 01:00:17

more precisely: "OR just before A NEWLINE AT ..."

John Machin 2010-08-25 01:06:58
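
The difference described in the two comments above can be seen with a small sketch. It uses a deliberately simple pattern (a comma at the end) rather than the one from the answer, since in the answer's pattern the \s alternative already matches a trailing newline, so swapping $ for \Z there should not change the result.

```python
import re

s = "test,\n"  # note the trailing newline

# `$` also matches just before a newline at the end of the string.
dollar_hit = re.search(r',$', s) is not None

# `\Z` matches only at the very end of the string.
z_hit = re.search(r',\Z', s) is not None

# Without the trailing newline, `\Z` matches as well.
z_hit_no_newline = re.search(r',\Z', "test,") is not None

print(dollar_hit, z_hit, z_hit_no_newline)
# True False True
```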

This solution also deletes excess whitespace between words, which is presumably an unintentional (and possibly undesirable) side-effect.

jchl 2010-08-25 09:12:11

I think using `r'([^\w\s]|_)+(?=\s|$)'` instead will fix the whitespace deletion problem.

jchl 2010-08-25 09:17:03
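
To make the whitespace issue concrete, here is a rough comparison. It assumes the earlier, simpler pattern was `[\W_]+(?=\s|$)` (the `[\W_]+` part is quoted in the answer's explanation; the rest is a guess), and the function names are made up for illustration.

```python
import re

def strip_simpler(text):
    # Assumed earlier form: \W also matches whitespace, so runs of
    # spaces get swallowed along with the punctuation.
    return re.sub(r'[\W_]+(?=\s|$)', '', text)

def strip_fixed(text):
    # Form suggested in the comment above: whitespace is excluded
    # from the characters that can be removed.
    return re.sub(r'([^\w\s]|_)+(?=\s|$)', '', text)

text = "i am  test-ing.,   done"   # two and three spaces between words

print(repr(strip_simpler(text)))   # 'i am test-ing done'
print(repr(strip_fixed(text)))     # 'i am  test-ing   done'
```

With the simpler pattern the extra spaces collapse to one; with the fixed pattern the original spacing is preserved and only the trailing `.,` is removed.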

@jchl Good catch. I noticed this but thought I could get away with the simpler regex, ha.

John Kugelman 2010-08-25 13:07:21
