Hello,

I am using Python 2.4 and I am having some problems with Unicode regular expressions. I have tried to put together a clear and concise example of my problem. It looks as though there is either a problem with how Python recognizes the different character encodings or a problem with my understanding. Thank you very much for taking a look!

#!/usr/bin/python
#
# This is a simple python program designed to show my problems with regular expressions and character encoding in python
# Written by Brian J. Stinar
# Thanks for the help! 

import urllib # To get files off the Internet
import chardet # To identify character encodings
import re # Python Regular Expressions 
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded)) # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
   print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
   print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
   print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''

I am working on a substitution project, and am having a difficult time with the non-ASCII encoded files. This problem is part of a bigger project - eventually I would like to substitute the text with other text (I got this working in ASCII, but I can't identify occurrences in other encodings yet.) Thanks again.

http://brian-stinar.blogspot.com

-Brian J. Stinar-

A: 

this might help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html

b3rx
+1  A: 

You probably want either to enable the DOTALL flag or to use the search method instead of the match method, i.e.:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

or:

# search will find matches even if they aren't at the start of the string
... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you matches. (See which one is the type you want.)
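
To make the difference concrete, here is a minimal sketch. The sample text is a hypothetical stand-in for the downloaded page (the keyword sits after the first newline, as it does there), and it is written to run on Python 3 as well:

```python
import re

# Hypothetical stand-in for the page: UNSUBSCRIBE appears after the first newline.
text = u"Adobe Systems\nClick here to UNSUBSCRIBE\nAmsterdam\n"

plain = re.compile(u".*UNSUBSCRIBE.*")
dotall = re.compile(u".*UNSUBSCRIBE.*", re.DOTALL)

# match() is anchored at position 0 and "." stops at a newline, so this is None.
print(plain.match(text))

# With DOTALL, "." also consumes newlines, so the match covers the whole string.
print(dotall.match(text).span())

# search() scans forward, so it matches only the line that holds the keyword.
print(plain.search(text).group(0))
```

So DOTALL gives you one match covering everything, while search() gives you just the matching line.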

As an aside: You seem to be getting the encoded text (which is bytes) and decoded text (characters) confused. This isn't uncommon, especially in pre-3.x Python. In particular, this is very suspicious:

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You're de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string doesn't have an encoding anymore.)

The rest of your code is trying to do matches on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded unicode string instead.
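
A rough sketch of that workflow, assuming a few hypothetical bytes in place of the real rawdata (0xB3 is "ł" in ISO-8859-2): decode once, do all the matching on the decoded text, and encode only when the result has to be written out again.

```python
import re

# Hypothetical bytes standing in for rawdata; 0xB3 is "ł" in ISO-8859-2.
rawdata = b"Adobe\nZ\xb3oty UNSUBSCRIBE\n"

# Decode: bytes -> text (characters). The result no longer "has" an encoding.
decoded = rawdata.decode("iso-8859-2")

# Match against the decoded text, using a text pattern.
found = re.compile(u"UNSUBSCRIBE").search(decoded)
print(found.start())

# Encode only at the boundary, when bytes are needed again.
utf8_bytes = decoded.encode("utf-8")
```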

Laurence Gonsalves
Thank you very much. After adding the re.DOTALL flag this behaved exactly as I was expecting. It seems like .* behaves differently on ASCII; in ASCII it was matching newlines for me, but with the decoded non-ASCII text it was not, though I may have just been unclear on this. Thanks for clarifying encoded text and decoded text as well. This is my first project dealing with different encodings and I appreciate the clarification.
Brian Stinar
A: 

With default flag settings, .* doesn't match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe occurs before the first newline. You could fix that by using re.DOTALL.

HOWEVER you haven't inspected what you got with the Adobe match: it's 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the whole text!!

You definitely need to lose the trailing .* -- you're not interested and it slows down the match. Also you should lose the leading .* and use search() instead of match().
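
It is worth printing what the match actually covers. A small sketch, using a made-up sample line rather than the real page:

```python
import re

# Hypothetical sample; "Adobe" is on the first line, as in the real page.
text = u"Welcome to Adobe Systems, makers of things\nUNSUBSCRIBE\n"

# With the wildcards, match() swallows the entire first line.
wide = re.match(u".*Adobe.*", text)
print(len(wide.group(0)))

# A bare pattern with search() covers only the word itself.
tight = re.search(u"Adobe", text)
print(tight.span())
```

The tight span is what you want if the goal is to substitute the text later.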

The re.UNICODE flag is of no use to you in this case -- read the manual and see what it does.
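
As a hedged illustration: the flag changes what classes like \w and \b consider a word character, not how literal text or "." is matched. The sketch below assumes Python 3, where re.UNICODE is already the default for text patterns and re.ASCII is the opt-out; neither of those details applies to Python 2.4.

```python
import re

word = u"caf\u00e9"  # "café"

# Assumption: Python 3 semantics (re.ASCII does not exist in Python 2).
ascii_words = re.findall(r"\w+", word, re.ASCII)      # \w limited to ASCII
unicode_words = re.findall(r"\w+", word, re.UNICODE)  # \w includes the accented letter
print(ascii_words)
print(unicode_words)

# A literal pattern finds the same span under either flag.
haystack = u"please UNSUBSCRIBE me"
span_ascii = re.search(u"UNSUBSCRIBE", haystack, re.ASCII).span()
span_unicode = re.search(u"UNSUBSCRIBE", haystack, re.UNICODE).span()
print(span_ascii == span_unicode)
```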

Why are you transcoding your data into UTF-8 and searching on that? Just leave it in Unicode.

Someone else pointed out that in general you need to decode &#1234; etc thingies before doing any serious work on your data ... but didn't mention the &laquo; etc thingies with which your data is peppered :-)
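
One possible way to turn both kinds of references into real characters before searching, assuming Python 3.4 or later, where the standard library provides html.unescape (it is not available in Python 2.4, so treat this as a sketch for newer interpreters). The snippet is a hypothetical sample, not text from the page:

```python
import html  # assumption: Python 3.4+

# Hypothetical snippet with two named entities and one numeric reference.
snippet = u"&laquo;UNSUBSCRIBE&raquo; &#1234;"

# unescape() handles named and numeric character references alike.
clean = html.unescape(snippet)
print(clean)
```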

John Machin
A: 

Your question is about regular expressions, but your problem can possibly be solved without them; instead use the standard string replace method.

import urllib
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
decoded = raw.decode('iso-8859-2')
type(decoded)    # decoded is now <type 'unicode'>
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

If nothing else, the above shows how to handle the encoding: simply decode into a unicode string and work with that. But note that this only works well for the case where you have only one or a very small number of substitutions to make (and those substitutions are not pattern based) because replace() can only handle one substitution at a time.

For both string and pattern based substitutions you can do something like this to effect multiple replacements at once:

import re
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'),
                (u'UNS.*IBE', u'@wobble@'),
                (u'Dublin', u'Sydney'))

def replacer(m):
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1]

r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS))
substituted = r.sub(replacer, decoded)
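
For anyone who wants to try this without downloading the page, here is a self-contained variant on a made-up sample string. It looks up the replacement through m.lastindex instead of searching m.groups(); that is my substitution, not part of the code above, and like the original it assumes the individual patterns contain no capturing groups of their own.

```python
import re

# Same pattern/replacement pairs as above.
REPLACEMENTS = ((u"[aA]dobe", u"!twiddle!"),
                (u"UNS.*IBE", u"@wobble@"),
                (u"Dublin", u"Sydney"))

# One alternation with a capturing group per pattern.
combined = re.compile(u"|".join(u"(%s)" % pat for pat, _ in REPLACEMENTS))

def replacer(m):
    # lastindex is the 1-based number of the group that matched.
    return REPLACEMENTS[m.lastindex - 1][1]

# Hypothetical sample text standing in for the decoded page.
sample = u"adobe in Dublin\nUNSUBSCRIBE here"
result = combined.sub(replacer, sample)
print(result)
```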
mhawke