ansaurus

Question

Python raw strings and Unicode: how to use Web input as regexp patterns?

Answer 1

+3 A:

Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
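To illustrate the point, here is a minimal sketch (written for Python 3). The variable `user_input` is a hypothetical stand-in for whatever text your web framework hands you; it is not taken from the question's code. The value is passed to `re.compile()` as it is, with no "raw string" conversion step.

```python
import re

# Hypothetical stand-in for text received from a web form. The user typed
# a backslash, "d", "+", and so on; written as a literal here, those
# backslashes have to be doubled (or the literal written with an r prefix).
user_input = "\\d+\\.\\d"

# The r"" literal produces exactly the same string value.
assert user_input == r"\d+\.\d"

# No extra processing: the string is already a usable pattern.
pattern = re.compile(user_input)
match = pattern.search("version 12.5 released")
print(match.group(0))
```

The only place the `r` prefix matters is in the source code literal; a string that arrives at runtime never goes through the parser's escape handling.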

Alex Martelli 2010-01-17 16:44:33

Yeah, I should have asked the other question first. Now I got it, I understand this one makes no sense.

e-satis 2010-01-17 16:58:40

Answer 2

A:

The "r" prefix just prevents Python from interpreting "\" escapes in a string literal. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes you are free to interpret the way you want.

So to address this problem:

be sure you use Unicode (e.g. encoded as UTF-8) all along the way
when you get the string, it will be Unicode, and "\n", "\t" and "\a" will be literal characters (a backslash followed by a letter), so you don't need to care whether to escape them or not.
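The two points above can be sketched as follows (Python 3). The name `raw_bytes` and its content are made up for the illustration, and UTF-8 is an assumption about how the client encoded the request.

```python
import re

# Hypothetical request body: the UTF-8 bytes for "café" followed by a
# backslash and the letter "n" (NOT a newline character).
raw_bytes = b"caf\xc3\xa9\\n"

# Decode once at the boundary, then work with text only.
text = raw_bytes.decode("utf-8")

# Six characters: c, a, f, é, backslash, n. No real newline in there.
assert len(text) == 6
assert "\n" not in text

# The re module itself understands the two-character sequence \n as
# "match a newline", so the input works as a pattern without any escaping.
found = re.compile(text).search("caf\u00e9\nmenu")
print(found is not None)
```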

e-satis 2010-01-17 17:06:35

»from interpreting "/" in a string« – it's "\".

poke 2010-01-17 17:11:32

God, how do you manage to read this quickly? And find the mistakes on the fly?

e-satis 2010-01-17 17:13:06

Answer 3

+1 A:

Note the following in your first example:

>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True

While these constructs look different, they all do the same thing: they create a string object (p1 and p3 are str objects, p2 and p4 are unicode objects in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the following quoted string, namely as Unicode text (u) and/or as raw text (r), in which backslashes are not treated as escape sequences. In the end it doesn't matter how a string was created, raw or not: internally it is stored the same way.
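The prefix only makes a visible difference once the literal contains a backslash. A small sketch of that, written for Python 3 (the variable names are just for illustration):

```python
raw = r"\n"       # raw literal: backslash followed by "n"
escaped = "\\n"   # normal literal with an escaped backslash
newline = "\n"    # normal literal: a single newline character

# Two different spellings of the same two-character string.
assert raw == escaped
assert len(raw) == 2

# Without the r prefix the parser turns \n into one character.
assert len(newline) == 1
assert raw != newline

# All three are ordinary str objects; nothing remembers the prefix.
assert type(raw) is type(escaped) is type(newline) is str
```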

When you get text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with Unicode content, you should work only with unicode objects internally and convert all str objects to unicode objects (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode it to your local encoding, you will run into problems with characters that encoding cannot represent.
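A minimal sketch of the "convert everything to Unicode at the edge" idea. The helper name `to_text` and the UTF-8 default are assumptions made for this example, not something from the question. It relies on `bytes` being an alias of `str` in Python 2.6+, and is shown here as run under Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it if it is a byte string.

    Hypothetical helper: the default encoding is an assumption and should
    be whatever the client actually used.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


decoded = to_text(b"\xc3\xa9t\xc3\xa9")   # UTF-8 bytes for "été"
unchanged = to_text("pattern")            # already text, returned as is

print(decoded == "\u00e9t\u00e9")
print(unchanged)
```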

A different approach would be to use Python 3, whose str type supports Unicode directly and stores all text as Unicode, so you only need to care about the encoding at the boundaries where bytes are decoded or encoded.
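For comparison, a short Python 3 sketch (the sample text is made up). A str pattern matches non-ASCII text without any prefix, but the re module refuses to mix a str pattern with bytes input, which is why the web input still has to be decoded first.

```python
import re

text = "caf\u00e9 cr\u00e8me"

# str patterns are Unicode-aware by default in Python 3, so \w also
# matches accented letters.
words = re.findall(r"\w+", text)
print(words)

# Mixing a str pattern with undecoded bytes raises TypeError.
try:
    re.search(r"\w+", b"cafe")
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```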

poke 2010-01-17 17:10:05
