How can I correctly split a string containing a sentence with special characters, using whitespace as the separator? Using the regex split method I cannot obtain the desired result.

Example code:

# -*- coding: utf-8 -*-
import re


s="La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)").split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicit', '\xc3', '', '\xa0', '', ' ', '', '\xc3', '', '\xa8', '', ' ', 'tutto']
 word> La
 word>  
 word> felicit
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> 
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> tutto

while I'm looking for an output like:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Note that s is a string returned from another method, so I cannot force the encoding like this:

s=u"La felicità è tutto"

I haven't found a satisfactory explanation in the official Python documentation on Unicode and regular expressions.

Thanks.

Alessandro

+3  A: 

Your regex should be (\s) instead of (\W) like this:

l = re.compile("(\s)").split(s)

The code above will give you the exact output you requested. However, the following line makes more sense:

l = re.compile("\s").split(s)

which splits on whitespace characters and does not return the separators themselves. You may need them, though, so I posted both versions.
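
As a small sketch of the difference between the two patterns: the sample text below is written as a unicode literal with escapes purely for illustration (in the question, s arrives as a byte string from another method), and the variable names are my own.

```python
import re

# "La felicità è tutto", written with escapes so the example does not
# depend on the source-file encoding (an assumption made for this sketch).
s = u"La felicit\xe0 \xe8 tutto"

# Capturing group: each matched separator is kept in the result list.
with_separators = re.split(r"(\s)", s)

# No capturing group: the separators are discarded.
without_separators = re.split(r"\s", s)

print(with_separators)
print(without_separators)
```

The first list has seven items (four words and three spaces); the second has only the four words.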

Andrew Hare
Thanks, it works when printing the single words. Why does printing the list show hex escape codes instead of the decoded characters?
alexroat
That's intentional: the output is valid Python code that you could copy and paste back in, and since you might be working in a non-Unicode environment, it is written in the most portable way possible.
Porges
Thanks Andrew, you have fully answered all my doubts.
alexroat
@alexroat: please accept an answer if it is helpful.
S.Lott
Done, but I have a further question: why, with \s, are characters such as ()[]- not treated as separators?
alexroat
They are characters used by the regex syntax. If you want to split a string where a ] occurs, you should escape it as \] (just as you do when pattern matching with a regex). Welcome to Stack Overflow :)
Andrea Ambu
OK, maybe I understand: \W is everything that is not alphanumeric, and \s is the set of all whitespace characters. But it seems that "à" is not considered alphanumeric? My idea is to split on every single non-alphanumeric character ([^a-zA-Z0-9_]), while treating accented characters as alphanumeric. Any idea how to do that?
alexroat
A: 

Try defining an encoding for the regular expression:

l=re.compile("\W", re.UNICODE).split(s)
kgiannakakis
It doesn't work; I've already tried that. However, the solution of Andrew Hare works well.
alexroat
Have you tried without the parentheses?
kgiannakakis
Yes, but then the behaviour is like the string split (it removes the whitespace), and I want to keep it. Also, re.UNICODE messes up the encoding, changing some characters.
alexroat
+1  A: 

I think it's overkill to use a regexp in this case. If the only thing you want to do is split the string on whitespace characters, I recommend using the split method on the string:

s = 'La felicità è tutto'
words = s.split()
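
A quick sketch of what split() keeps and what it discards. The doubled space in the sample text is my own addition to make the effect visible, and the variable names are only illustrative.

```python
# "La  felicità è tutto" with two spaces after "La" (escapes used so the
# example does not depend on the source-file encoding).
s = u"La  felicit\xe0 \xe8 tutto"

# split() with no argument splits on runs of whitespace and drops them.
words = s.split()

# Joining with a single space rebuilds a sentence, but the original
# spacing (the doubled space) is not recoverable from the word list.
rejoined = u" ".join(words)

print(words)
print(rejoined == s)
```

So this approach is enough when the exact separators do not matter, and not enough when they must be preserved.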
danvari
My intention is to keep the whitespace in the list, so the string split is not helpful here: it removes the whitespace and is not as configurable as the regex split.
alexroat
@alexroat: Why exactly do you need the spaces? You know that they occur between each word (list item), so can't your algorithm add them back in where necessary?
Mark
A: 

Well, after some further tests on Andrew Hare's answer, I've seen that characters such as ()[]- are no longer treated as separators. What I want is to split a sentence into words while keeping all the separators, where a word is a run of alphanumeric characters, extended to include accented characters (that is, everything marked as alphanumeric in Unicode). So the solution of kgiannakakis is closer to correct, but it is missing the conversion of the string s to unicode.

Take this extension of the first example:

# -*- coding: utf-8 -*-
import re
s = "(La felicità è tutto)"  # plain byte string (UTF-8), not a unicode literal
# decode s from UTF-8 to unicode, then split on every non-word character
l = re.compile("([\W])", re.UNICODE).split(unicode(s, 'utf-8'))

print " string> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is now:

 string> (La felicità è tutto)
 wordlist> [u'', u'(', u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto', u')', u'']
 word> 
 word> (
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto
 word> )
 word>

That is exactly what I'm looking for.
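
For reference, here is roughly the same idea wrapped in a function, as a sketch rather than a definitive implementation. The helper name split_keep_separators is made up for this example, the UTF-8 default is an assumption about the input, and dropping the empty strings at the ends is my own choice. It is written to run on both Python 2.7 and Python 3; on Python 3, str patterns are already Unicode-aware, so the re.UNICODE flag is redundant there.

```python
import re


def split_keep_separators(raw, encoding="utf-8"):
    """Hypothetical helper: split text into words and separators.

    Byte strings are decoded first (UTF-8 is assumed by default), so that
    \\W is evaluated against characters rather than individual bytes.
    """
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # The capturing group keeps every non-word character as its own item.
    parts = re.split(r"(\W)", text, flags=re.UNICODE)
    # Drop the empty strings produced at the ends and between
    # adjacent separators.
    return [part for part in parts if part != u""]


# "(La felicità è tutto)" as UTF-8 bytes, standing in for the string
# returned by the other method.
raw = u"(La felicit\xe0 \xe8 tutto)".encode("utf-8")
tokens = split_keep_separators(raw)
print(tokens)
```

Joining the tokens back together with an empty string gives the decoded sentence again, since no character is discarded.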

Cheers :)

Alessandro

alexroat
+1  A: 

Using a unicode regular expression will work, provided you give it a unicode string to start with (which you haven't done in the provided example). Try this:

s=u"La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)",re.UNICODE).split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

Results:

 s> La felicità è tutto
 wordlist> [u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Your string s is created as a str type and will probably be UTF-8 encoded, which is not the same thing as a unicode object.
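
A minimal sketch of that difference, using one word from the example. The variable names are my own, and the escape \xe0 stands for "à" so the snippet does not depend on the source-file encoding.

```python
text = u"felicit\xe0"           # unicode text: one item per character
encoded = text.encode("utf-8")  # byte string: "à" becomes two bytes

print(len(text))     # number of characters
print(len(encoded))  # number of bytes

# The two trailing bytes are the \xc3 and \xa0 that show up as separate
# list items in the question's output.
last_two = encoded[-2:]

# Decoding the bytes gives back the original text.
round_trip = encoded.decode("utf-8")
```

Here the text has 8 characters while the encoded form has 9 bytes, which is why a byte-oriented \W cuts "à" in half.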

TheSoundOfMatt