I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.

I'm not going to post all my code since it may cause confusion.

Brief explanation: I'm writing an XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It prints out anything that doesn't have a match.

I open the XML file and use a simple 'for' loop to read through the file line by line. If a line matches a regular expression for an 'application' (the XML has different 'application' nodes), I add it to my dictionary, d, as the key. I perform an lxml.etree.xpath() query on the title and store the result as the value. After I go through the whole file, I iterate through my dictionary, d, and try to match each key to its value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title. In Python 3.1.2 every item in the dictionary matches, so nothing is printed. In 2.6.4, every single value is printed (~600 in all). I can't figure out why my string comparisons aren't working.

Without further ado, here's the relevant code:

    for i in d:
        if i[1:-2] != d[i].get('id'):
            print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))

I slice the strings because they differ in form. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and the newline. I also added the Xs and Ys to the print statements to make sure that there weren't any whitespace issues. Here is an example line of output:

X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY

Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.


Edit: So the problem seems to be the way that I am slicing. In Python3,

if i[1:-2] != d[i].get('id'):

this comparison works fine.

In Python2,

if i[1:-3] != d[i].get('id'):

I have to change the offset by one.

Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').

Edit 2: Updated with requested repr() information.

I added a small amount of code to produce the repr() info for the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character. Python2:

'"9626-2008olympics_Prod-SH"\r\n' '9626-2008olympics_Prod-SH'

Python3:

'"9626-2008olympics_Prod-SH"\n' '9626-2008olympics_Prod-SH'

Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?

+3  A: 

You are printing i[1:-3] but comparing i[1:-2] in the loop.


Very Important Question

Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

katrielalex
The XML schema is absolutely terrible. In the above example, when I search all I know is 9626 (the rest of the title should be stored in several other fields...). I need to make sure that there aren't any applications that have crazy names that my xpath query won't find.
fandingo
You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects).
fandingo
A: 

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?

for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
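One caveat: strip() with no arguments removes only whitespace, so the double quotes around the key would still be there. A minimal sketch, assuming keys shaped like the ones shown in the question (the `normalize_key` helper name is hypothetical), removes the line ending first and then the quotes, so the same code handles both "\n" and "\r\n":

```python
def normalize_key(raw):
    # Hypothetical helper, not part of the original code.
    # strip() drops leading/trailing whitespace, including \r and \n;
    # strip('"') then drops the double quotes at either end.
    return raw.strip().strip('"')


# The two raw keys reported in the question's repr() output.
unix_key = '"9626-2008olympics_Prod-SH"\n'
windows_key = '"9626-2008olympics_Prod-SH"\r\n'

print(normalize_key(unix_key) == normalize_key(windows_key))
```

Because no slice offsets are involved, this should not need a check on the Python version.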
Alan Franzoni
A: 

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.

Instead of

print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))  

do

print('%r %r' % (i, d[i].get('id'))) 

Note leaving off the [1:-3] so that you can see what is in i before you slice it.
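As a rough illustration of why %r helps, the sketch below takes the Python 2 key that the question's second edit reports and prints the unsliced key next to the two slices used in the loop. The variable names are only for this example.

```python
# Key and value as reported by repr() under Python 2 in the question.
key = '"9626-2008olympics_Prod-SH"\r\n'
value = '9626-2008olympics_Prod-SH'

print('%r' % key)         # unsliced key: quotes and escapes are visible
print('%r' % key[1:-2])   # the slice used in the comparison
print('%r' % key[1:-3])   # the slice used in the print statement
print(key[1:-2] == value, key[1:-3] == value)
```

With "\r\n" at the end, the [1:-2] slice still carries the closing double quote, while the [1:-3] slice is the bare data, which is why the printed strings looked identical even though the comparison failed.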

Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":

How are you opening the file (two answers please, for Python 2 and 3). Are you running on Windows? Have you tried getting the repr() as I suggested?

Update after actual input finally provided by OP:

If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *x text files portably by using the "universal newlines" option ... open('datafile.txt', 'rU') on Python2 -- read this. Universal newlines mode is the default in Python3. Note that the Python3 docs say that you can use 'rU' also in Python3; this would save you having to test which Python version you are using.
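A minimal sketch of the same idea without a mode flag, assuming a throwaway sample file rather than the real XML: io.open is available from Python 2.6 onward and is the same function as the built-in open on Python 3, and its default newline=None translates "\r\n" to "\n" when reading. It avoids the 'U' flag, which was later deprecated and then removed in Python 3.11.

```python
import io
import os
import tempfile

# Write a throwaway file with a Windows line ending (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix='.xml')
with os.fdopen(fd, 'wb') as f:
    f.write(b'"9626-2008olympics_Prod-SH"\r\n')

# Default newline=None enables universal newlines: \r\n is read back as \n.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = list(f)

os.remove(path)
print(repr(lines[0]))
```

Note that io.open returns unicode strings on Python 2, so the keys would be unicode rather than str there.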

John Machin
I'm opening the file with "for i in open('file.xml', 'r'):". Linux system. With the new slices (where python3 doesn't work), I get '9930-encounterlcd_Prod-O' '9930-encounterlcd_Prod-OT' calling repr() on both strings. Without taking any slices, I get '"9930-encounterlcd_Prod-OT"\n' '9930-encounterlcd_Prod-OT'. Thanks for your comments.
fandingo
@fandingo: Pls edit your question with the repr() results. So far we have: with Python3, `i` contains <doublequote> <data> <doublequote> <newline>, and the 2nd string contains <data>, so slicing [1:-2] is required. Please tell us the repr() results from Python2 -- we need to investigate the 2-3 difference.
John Machin
@John, I updated my post with the repr() results. They format better than in a comment.
fandingo
@fandingo: Are you aware that you can change the accepted answer?
John Machin
A: 

Russell Borogrove is right.

Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.

In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].

I just added a manual check for the Python major version.

Here is the fixed code:

    import sys

    for i in d:
        # The keys in d contain quotes and a newline which need
        # to be removed. In v3, newline = 1 char and in v2,
        # newline = 2 char.
        if sys.version_info[0] < 3:
            if i[1:-3] != d[i].get('id'):
                print('%s %s' % (i[1:-3], d[i].get('id')))
        else:
            if i[1:-2] != d[i].get('id'):
                print('%s %s' % (i[1:-2], d[i].get('id')))

Thanks for the responses everyone! I appreciate your help.

fandingo
@fandingo: "In Python 2, the newline is being interpreted as two characters" -- this will be regarded with complete surprise by the average reader, especially the ones who have noted that you are running on Linux (i.e. a non-Windows operating system). I dare to suggest that there is a problem in your code (or in code that it calls) further upstream. By the way, what are the two characters that correspond to the newline in Python 2, and are they 'str' or 'unicode'?
John Machin
Yeah, what John said. Also, strip() as suggested above is a much much better option than testing the version info.
Russell Borogove