ansaurus

Question

Python Unicode Regular Expression Question

Answer 1

A:

This might help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html

b3rx 2009-07-23 00:02:25

Answer 2

+1 A:

You probably want either to enable the DOTALL flag or to use the search method instead of the match method, i.e.:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

or:

# search will find matches even if they aren't at the start of the string
... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you matches. (See which one is the type you want.)
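To make the difference concrete, here is a minimal sketch in Python 3 syntax. The two-line sample string is made up for illustration and is not the asker's data:

```python
import re

# Hypothetical sample: the keyword sits on the second line.
text = "Adobe newsletter\nTo leave, reply UNSUBSCRIBE today\n"

# Without DOTALL, '.' stops at the newline, so match() (which is anchored
# at the start of the string) cannot reach the second line.
plain = re.compile(".*UNSUBSCRIBE.*")
no_match = plain.match(text)

# With DOTALL, '.' also matches newlines, so the match spans both lines.
dotall = re.compile(".*UNSUBSCRIBE.*", re.DOTALL)
whole = dotall.match(text)

# search() tries later start positions, so the original pattern
# matches only the line that contains the keyword.
line_only = plain.search(text)

print(no_match)
print(repr(whole.group(0)))
print(repr(line_only.group(0)))
```

The DOTALL version returns the entire text, while search() returns only the second line, which is why the two fixes are not interchangeable.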

As an aside: You seem to be getting the encoded text (which is bytes) and decoded text (characters) confused. This isn't uncommon, especially in pre-3.x Python. In particular, this is very suspicious:

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You're de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string doesn't have an encoding anymore.)

The rest of your code is trying to do matches on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded unicode string instead.
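A rough sketch of the intended flow, in Python 3 syntax (where bytes and str are separate types, which makes the distinction easier to see). The byte string is a made-up stand-in for whatever rawdata really holds:

```python
import re

# Hypothetical encoded input; 0xB3 is the letter U+0142 in ISO-8859-2.
rawdata = b"Wys\xb3ano. UNSUBSCRIBE here"

# Decoding turns encoded bytes into text (characters).
decoded = rawdata.decode("ISO-8859-2")

# Run the regular expression on the decoded text, not on the raw bytes.
found = re.search("UNSUBSCRIBE", decoded)

print(type(rawdata).__name__, type(decoded).__name__)
print(found.group(0))
```

Encoding (for example to UTF-8) would then only happen at the very end, when the result is written out.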

Laurence Gonsalves 2009-07-23 00:14:40

Thank you very much. After adding the re.DOTALL flag this behaved exactly as I was expecting. It seems like .* behaves differently on ASCII; in ASCII it was matching newlines for me, but with the decoded non-ASCII text it was not, though I may have just been unclear on this. Thanks for clarifying encoded text and decoded text as well. This is my first project dealing with different encodings and I appreciate the clarification.

Brian Stinar 2009-07-24 14:37:48

Answer 3

A:

With default flag settings, .* doesn't match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe occurs before the first newline. You could fix that by using re.DOTALL.

HOWEVER you haven't inspected what you got with the Adobe match: it's 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the whole text!!

You definitely need to lose the trailing .* -- you're not interested and it slows down the match. Also you should lose the leading .* and use search() instead of match().
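A small sketch of how much each form actually matches, in Python 3 syntax with a made-up three-line string standing in for the real page:

```python
import re

# Hypothetical multi-line message; only one word is of interest.
text = "line one\nplease UNSUBSCRIBE me\nline three\n"

# Leading and trailing '.*' combined with DOTALL swallow the entire text.
greedy = re.compile(".*UNSUBSCRIBE.*", re.DOTALL).match(text)

# A bare pattern with search() reports just the word and where it is.
tight = re.compile("UNSUBSCRIBE").search(text)

print(len(greedy.group(0)), len(text))   # same length: whole text matched
print(tight.span(), tight.group(0))
```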

The re.UNICODE flag is of no use to you in this case -- read the manual and see what it does.
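For what it's worth, the flag affects classes such as \w, \b, \d and \s, not literal text. The sketch below is in Python 3, where Unicode matching is already the default for str patterns and re.ASCII switches it off, so the contrast is shown with re.ASCII rather than by adding re.UNICODE. The sample string is made up:

```python
import re

# Made-up sample: three non-ASCII letters (U+017C, U+00F3, U+0142), then 'w'.
text = "\u017c\u00f3\u0142w UNSUBSCRIBE"

# A literal pattern finds the same match with or without the flag.
with_flag = re.search("UNSUBSCRIBE", text, re.UNICODE)
without_flag = re.search("UNSUBSCRIBE", text)

# The flag only matters for classes such as \w: in ASCII mode the
# non-ASCII letters are not treated as word characters.
unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(with_flag.span() == without_flag.span())
print(unicode_words)
print(ascii_words)
```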

Why are you transcoding your data into UTF-8 and searching on that? Just leave it in Unicode.

Someone else pointed out that in general you need to decode &#1234; etc thingies (numeric character references) before doing any serious work on your data ... but didn't mention the &laquo; etc thingies (named entities) with which your data is peppered :-)
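One way to handle both kinds of reference in one step, assuming Python 3.4 or later is available (html.unescape does not exist in the Python 2 used in the question). The fragment is invented for illustration:

```python
import html

# Hypothetical fragment with two named references and one numeric reference.
fragment = "&laquo;UNSUBSCRIBE&raquo; &#1234;"

# html.unescape() converts both kinds to the characters they stand for.
clean = html.unescape(fragment)

print(clean)
```

Doing this after decoding the bytes and before running any regular expressions means the patterns see real characters rather than markup.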

John Machin 2009-07-23 02:11:17

Answer 4

A:

Your question is about regular expressions, but your problem can possibly be solved without them; instead use the standard string replace method.

import urllib
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
decoded = raw.decode('iso-8859-2')
type(decoded)    # decoded is now <type 'unicode'>
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

If nothing else, the above shows how to handle the encoding: simply decode into a unicode string and work with that. But note that this only works well for the case where you have only one or a very small number of substitutions to make (and those substitutions are not pattern based) because replace() can only handle one substitution at a time.

For both string and pattern based substitutions you can do something like this to effect multiple replacements at once:

import re
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'),
                (u'UNS.*IBE', u'@wobble@'),
                (u'Dublin', u'Sydney'))

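def replacer(m):
    # Exactly one alternative's group holds the matched text; its position
    # among the groups is the index of the corresponding REPLACEMENTS entry.
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1]

# Wrap each pattern in its own group and join them into one alternation.
r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS))
substituted = r.sub(replacer, decoded)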
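Here is the same idea run on a short made-up string, in Python 3 syntax. Note one swapped-in detail: this sketch looks up the replacement with m.lastindex instead of searching m.groups(), which assumes the patterns contain no capturing groups of their own:

```python
import re

REPLACEMENTS = (("[aA]dobe", "!twiddle!"),
                ("UNS.*IBE", "@wobble@"),
                ("Dublin", "Sydney"))

# One capturing group per pattern, joined into a single alternation.
combined = re.compile("|".join("(%s)" % pattern for pattern, _ in REPLACEMENTS))

def replacer(m):
    # m.lastindex is the 1-based number of the group that matched.
    return REPLACEMENTS[m.lastindex - 1][1]

sample = "adobe asks Dublin readers to UNSUBSCRIBE"   # made-up text
result = combined.sub(replacer, sample)
print(result)
```

All three substitutions happen in a single pass over the text, so an earlier replacement can never be re-matched by a later pattern.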

mhawke 2009-07-23 04:01:02
