ansaurus

Question

Python: 2.6 and 3.1 string matching inconsistencies

Answer 1

+3 A:

You are printing i[1:-3] but comparing i[1:-2] in the loop.
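To see why the two slices disagree, here is a minimal sketch using the sample key whose repr() is reported in the comments on this thread. The variable names `key`, `printed` and `compared` are only for illustration.

```python
# Sample key taken from the repr() output posted in the comments:
# a double quote, the data, a double quote, then a newline.
key = '"9930-encounterlcd_Prod-OT"\n'

printed = key[1:-3]   # the slice used in the print call
compared = key[1:-2]  # the slice used in the comparison

print(repr(printed))   # '9930-encounterlcd_Prod-O'
print(repr(compared))  # '9930-encounterlcd_Prod-OT'
```

The printed value loses one more trailing character than the compared value, so the output is misleading about what was actually compared.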

Very Important Question

Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

katrielalex 2010-10-05 21:50:10

The XML schema is absolutely terrible. In the above example, when I search all I know is 9626 (the rest of the title should be stored in several other fields...). I need to make sure that there aren't any applications that have crazy names that my xpath query won't find.

fandingo 2010-10-05 23:00:10

You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects).

fandingo 2010-10-05 23:01:33

Answer 2

A:

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?

for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
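One caveat, sketched below: strip() with no arguments removes only whitespace, so if the keys are also wrapped in double quotes (as the repr() output elsewhere in this thread suggests), a second strip('"') is probably needed. The helper name `clean_key` and the sample dictionary are hypothetical stand-ins, not the asker's real data.

```python
def clean_key(line):
    """Drop surrounding whitespace (including line endings), then double quotes."""
    return line.strip().strip('"')

# Hypothetical stand-in for the real dictionary: keys are raw lines from the
# file, values are plain dicts with an 'id' entry.
d = {
    '"9930-encounterlcd_Prod-OT"\n': {'id': '9930-encounterlcd_Prod-OT'},
    '"9626-example"\r\n': {'id': '9626-example'},
}

# Collect every key that still differs from its id after cleaning.
mismatches = [(clean_key(i), d[i].get('id'))
              for i in d
              if clean_key(i) != d[i].get('id')]
print(mismatches)
```

Because strip() does not care whether the line ends in "\n" or "\r\n", the same code should behave identically on Python 2 and Python 3 without any slice arithmetic.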

Alan Franzoni 2010-10-05 22:05:31

Answer 3

A:

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.

Instead of

print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))

do

print('%r %r' % (i, d[i].get('id')))

Note leaving off the [1:-3] so that you can see what is in i before you slice it.
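As a small illustration of the difference, assuming two hypothetical lines that differ only in their line ending:

```python
# Two made-up lines: one with a Windows line ending, one with a Unix one.
a = '"9930-encounterlcd_Prod-OT"\r\n'
b = '"9930-encounterlcd_Prod-OT"\n'

print('%s|%s' % (a, b))  # %s: the line endings are easy to miss
print('%r %r' % (a, b))  # %r: every escape is spelled out
```

With %r the extra "\r" shows up explicitly, which is the kind of difference that silently changes slice offsets.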

Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":

How are you opening the file (two answers please, for Python 2 and 3)? Are you running on Windows? Have you tried getting the repr() as I suggested?

Update after actual input finally provided by OP:

If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *nix text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python 2. Universal newlines mode is the default in Python 3. Note that the Python 3 docs say that you can also use 'rU' in Python 3; this would save you having to test which Python version you are using.
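A minimal sketch of universal newlines in action, assuming Python 2.6 or later: io.open with newline=None (its default) translates "\r\n" to "\n" when reading, on both Python 2 and 3. It may be a more durable spelling than 'rU', since the 'U' mode flag was removed in Python 3.11. The temporary file and its contents are made up for the example.

```python
import io
import os
import tempfile

# Write a small file with Windows line endings, in binary mode so that
# nothing is translated on the way out. The contents are made up.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'"9930-encounterlcd_Prod-OT"\r\n"9626-example"\r\n')

# newline=None (the default) enables universal newlines:
# "\r\n" and "\r" are both translated to "\n" when reading.
with io.open(path, 'r', newline=None) as f:
    lines = list(f)

os.remove(path)
print([repr(line) for line in lines])
```

Each line now ends in a single "\n", so the same [1:-2] slice removes the quotes and the line ending regardless of where the file was created.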

John Machin 2010-10-05 23:13:37

I'm opening the file with "for i in open('file.xml', 'r'):". Linux system. With the new slices (where python3 doesn't work), I get '9930-encounterlcd_Prod-O' '9930-encounterlcd_Prod-OT' calling repr() on both strings. Without taking any slices, I get '"9930-encounterlcd_Prod-OT"\n' '9930-encounterlcd_Prod-OT'. Thanks for your comments.

fandingo 2010-10-06 00:52:11

@fandingo: Pls edit your question with the repr() results. So far we have: with Python3, `i` contains <doublequote> <data> <doublequote> <newline>, and the 2nd string contains <data>, so slicing [1:-2] is required. Please tell us the repr() results from Python2 -- we need to investigate the 2-3 difference.

John Machin 2010-10-06 01:44:25

@John, I updated my post with the repr() results. They format better than in a comment.

fandingo 2010-10-07 16:08:54

@fandingo: Are you aware that you can change the accepted answer?

John Machin 2010-10-17 06:28:12

Answer 4

A:

Russell Borogove is right.

Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.

In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].

I just added a manual check for the Python major version.

Here is the fixed code:

import sys

for i in d:
    # The keys in d contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 chars.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))

Thanks for the responses everyone! I appreciate your help.

fandingo 2010-10-06 03:46:06

@fandingo: "In Python 2, the newline is being interpreted as two characters" -- this will be regarded with complete surprise by the average reader, especially the ones who have noted that you are running on Linux (i.e. a non-Windows operating system). I dare to suggest that there is a problem in your code (or in code that it calls) further upstream. By the way, what are the two characters that correspond to the newline in Python 2, and are they 'str' or 'unicode'?

John Machin 2010-10-06 08:30:16

Yeah, what John said. Also, strip() as suggested above is a much much better option than testing the version info.

Russell Borogove 2010-10-07 01:30:54
