tags:

views:

175

answers:

2

I have some unicode text that I want to clean up using regular expressions. For example I have cases where u'(2'. This exists because for formatting reasons the closing paren ends up in an adjacent html cell. My initial solution to this problem was to look ahead at the contents of the next cell and using a string function determine if it held the closing paren. I knew this was not a great solution but it worked. Now I want to fix it but I can't seem to make the regular expression work.

missingParen=re.compile(r"^\(\d[^\)]$")

My understanding of what I think I am doing:
^ at the beginning of the string I want to find
( an open paren, the paren has to be backslashed because it is a special character
\d I also want to find a single digit
[ I am creating a special character class
^ I don't want to find what follows
) which is a close paren
$ at the end of the string

And of course the plot thickens I made a silly assumption that because I placed a \d I would not find (33 but I am wrong so I added a {1} to my regular expression and that did not help, it matched (3333, so my problem is more complicated than I thought. I want the string to be only an open paren and a single digit. Is this the more clever approach

missingParen=re.compile(r"^\(\d$")

And note S Lott _I already tagged it beginner so you can't pick up any cheap points Not that I don't appreciate your insights I keep meaning to read your book, it probably has the answer

+1  A: 

Okay sorry for using this a a stream of consciousness thinking stimulator but it appears that writing out my original question got me on the path. It seems to me that this is a solution for what I am trying to do:

  missingParen=re.compile(r"^\(\d$")
PyNEwbie
+1  A: 

Are the answers to my question Regex and Unicode relevant?

dbr