ansaurus

Question

python, regex split and special character

Answer 1

+3 A:

Your regex should be (\s) instead of (\W), like this:

l = re.compile(r"(\s)").split(s)

The code above gives exactly the output you requested. However, the following line makes more sense:

l = re.compile(r"\s").split(s)

It splits on whitespace characters and does not return the spaces themselves as matches. You may need them, though, so I posted both versions.
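The effect of the capturing group can be seen side by side in a short sketch. It assumes Python 3 (where every str is Unicode) and uses the sample sentence from this thread; the variable names are only illustrative.

```python
import re

s = "La felicità è tutto"

# Capturing group: the whitespace separators are kept in the result.
with_separators = re.split(r"(\s)", s)

# No group: the separators are discarded.
without_separators = re.split(r"\s", s)

print(with_separators)     # ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
print(without_separators)  # ['La', 'felicità', 'è', 'tutto']
```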

Andrew Hare 2009-03-15 11:32:00

Thanks, it works when printing single words. Why does printing the list show Unicode hex codes instead of the decoded characters?

alexroat 2009-03-15 11:35:32

That is by design: the output is valid Python code that you could copy and paste back in. Since you might be working in a non-Unicode environment, it is printed in the most portable way possible.

Porges 2009-03-15 11:37:57
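The difference between printing a list and printing one of its elements can be sketched as follows. This is a minimal illustration that assumes Python 3, where the list repr shows non-ASCII characters directly; ascii() is used here only to reproduce the escaped form that Python 2 prints for unicode strings.

```python
words = ["felicità", "è"]

# ascii() mimics the escaped repr that Python 2 shows for unicode strings.
escaped = ascii(words)
print(escaped)      # ['felicit\xe0', '\xe8']

# Printing an element on its own uses str(), so the character is shown as-is.
print(words[0])     # felicità
```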

Thanks, Andrew. You fully answered all my doubts.

alexroat 2009-03-15 11:39:59

@alexroat: please accept an answer if it is helpful.

S.Lott 2009-03-15 12:01:46

Done, but I have a further question: why, with \s, are characters such as ()[]- no longer treated as separators?

alexroat 2009-03-15 13:29:53

They are characters used by the regex syntax. If you want to split a string wherever a ] occurs, you should escape it as \] (just as you do when pattern matching with a regex). Welcome to Stack Overflow :)

Andrea Ambu 2009-03-15 13:39:15
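One way to split on whitespace and on a chosen set of punctuation at the same time is a character class built with re.escape. The sketch below assumes Python 3; the punctuation set "()[]-" comes from the question above, and the variable names are hypothetical.

```python
import re

s = "(La felicità è tutto)"

# re.escape() backslash-escapes the characters that are special in a regex,
# so they can be placed safely inside a character class.
punctuation = re.escape("()[]-")
pattern = "([" + punctuation + r"\s])"

parts = re.split(pattern, s)
print(parts)
# ['', '(', 'La', ' ', 'felicità', ' ', 'è', ' ', 'tutto', ')', '']
```

The empty strings at both ends appear because the string starts and ends with a separator.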

OK, I think I understand: \W matches everything that is not alphanumeric, and \s matches the set of whitespace characters. But it seems that "à" is not considered alphanumeric? My guess is that I need to split on every non-alphanumeric character ([^a-zA-Z0-9_]), with accented characters counted as alphanumeric. Any idea how to do that?

alexroat 2009-03-15 13:44:03
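Whether "à" counts as a word character depends on whether the pattern is Unicode-aware. A small sketch, assuming Python 3, where str patterns are Unicode-aware by default and re.ASCII restores the ASCII-only meaning of \W:

```python
import re

s = "La felicità è tutto"

# Default for str patterns in Python 3: accented letters are word characters.
unicode_aware = re.split(r"(\W)", s)

# With re.ASCII, \W means [^a-zA-Z0-9_], so "à" and "è" become separators
# and "felicità" is broken into "felicit" and "à".
ascii_only = re.split(r"(\W)", s, flags=re.ASCII)

print(unicode_aware)  # ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
print(ascii_only)
```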

Answer 2

A:

Try passing the re.UNICODE flag to the regular expression:

l = re.compile(r"\W", re.UNICODE).split(s)

kgiannakakis 2009-03-15 11:36:49

It doesn't work; I've already tried that. However, Andrew Hare's solution works well.

alexroat 2009-03-15 11:38:44

Have you tried without the parentheses?

kgiannakakis 2009-03-15 11:42:48

Yes, but then it behaves like the string split (it removes the whitespace), and I want to keep it. However, re.UNICODE messes up the encoding and changes some characters.

alexroat 2009-03-15 13:32:11

Answer 3

+1 A:

I think it's overkill to use a regexp in this case. If all you want to do is split the string on whitespace characters, I recommend using the string's split method:

s = 'La felicità è tutto'
words = s.split()

danvari 2009-03-15 12:59:43

My intention is to keep the whitespace in the list, so the string split does not help: it removes whitespace and is not as configurable as the regex split.

alexroat 2009-03-15 13:25:50

@alexroat: Why exactly do you need the spaces? You know that they occur between each word (list item). Can't your algorithm add them back in where necessary?

Mark 2010-07-13 05:24:34
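Adding the spaces back afterwards only works when every separator is a single space. The sketch below assumes Python 3 and a hypothetical input with a double space and a tab; it compares rebuilding the string from str.split() with rebuilding it from a capturing regex split.

```python
import re

# Hypothetical input: a double space and a tab between words.
s = "La  felicità\tè tutto"

# str.split() discards the separators, so the original spacing is lost.
words = s.split()
rebuilt_from_words = " ".join(words)

# A capturing split keeps every separator, so joining restores the input.
parts = re.split(r"(\s)", s)
rebuilt_from_parts = "".join(parts)

print(rebuilt_from_words == s)  # False
print(rebuilt_from_parts == s)  # True
```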

Answer 4

A:

Well, after some further tests on Andrew Hare's answer, I saw that characters such as ()[]- are no longer treated as separators. What I want is to split a sentence, keeping all the separators, into words made of alphanumeric characters, with that set extended to include accented characters (that is, everything marked as alphanumeric in Unicode). So kgiannakakis's solution is closer to correct, but it is missing a conversion of the string s to unicode.

Take this extension of the first example:

# -*- coding: utf-8 -*-
import re
s = "(La felicità è tutto)"  # plain byte string (UTF-8), not an explicit unicode literal
l = re.compile(r"([\W])", re.UNICODE).split(unicode(s, 'utf-8'))  # decode s from UTF-8 to unicode, then split

print " string> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is now:

 string> (La felicità è tutto)
 wordlist> [u'', u'(', u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto', u')', u'']
 word> 
 word> (
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto
 word> )
 word>

That is exactly what I'm looking for.

Cheers :)

Alessandro

alexroat 2009-03-15 14:22:00
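The empty strings at the start and end of the list above come from the string beginning and ending with a separator. If they are not wanted, they can be filtered out. A minimal sketch, assuming Python 3 (where no explicit decoding or re.UNICODE flag is needed for str input); the name tokens is only illustrative.

```python
import re

s = "(La felicità è tutto)"

parts = re.split(r"(\W)", s)

# Drop the empty strings produced at the edges of the input.
tokens = [p for p in parts if p]

print(tokens)
# ['(', 'La', ' ', 'felicità', ' ', 'è', ' ', 'tutto', ')']
```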

Answer 5

+1 A:

Using a Unicode regular expression will work, provided you give it a unicode string to start with (which you haven't in the provided example). Try this:

# -*- coding: utf-8 -*-
import re
s = u"La felicità è tutto"  # "Happiness is everything" in Italian
l = re.compile(r"(\W)", re.UNICODE).split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

Results:

 s> La felicità è tutto
 wordlist> [u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Your string s is created as a str, which is probably UTF-8 encoded; that is different from a unicode object.
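The same distinction exists in Python 3 between bytes and str, which makes the problem easy to reproduce. The sketch below is an illustration under that assumption: raw stands in for a Python 2 byte string, and the variable names are hypothetical.

```python
import re

text = "La felicità è tutto"
raw = text.encode("utf-8")  # bytes, comparable to a Python 2 str

# Splitting the undecoded bytes: the two bytes of "à" are not word
# characters, so "felicità" is cut off after "felicit".
byte_parts = re.split(rb"(\W)", raw)

# Decoding first gives the regex real characters to work with.
text_parts = re.split(r"(\W)", raw.decode("utf-8"))

print(byte_parts)
print(text_parts)  # ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
```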

TheSoundOfMatt 2010-07-13 05:17:58
