heya,

I have an Excel .CSV file I'm attempting to read in with DictReader.

All seems to be well, except it seems to omit rows, specifically those with missing columns.

Our input looks like:

mail,givenName,sn,lorem,ipsum,dolor,telephoneNumber
[email protected],ian,bay,3424,8403,2535,+65(2)34523534545
[email protected],mike,gibson,3424,8403,2535,+65(2)34523534545
[email protected],ross,martin,,,,+65(2)34523534545
[email protected],david,connor,,,,+65(2)34523534545
[email protected],chris,call,3424,8403,2535,+65(2)34523534545

So some of the rows have missing lorem/ipsum/dolor columns, and it's just a string of commas for those.

We're reading it in with:

import csv

def read_gd_dump(input_file="blah 20100423.csv"):
    with open(input_file, newline='') as f:
        gd_extract = csv.DictReader(f, restval='missing', dialect='excel')
        return {row['something']: row for row in gd_extract}

I checked that "something" (the key for our dict) isn't one of the missing columns, which is what I had originally suspected. It's one of the columns after those.

However, DictReader seems to completely skip over the rows. I tried setting restval to something, didn't seem to make any difference. I can't seem to find anything in Python's CSV docs (http://docs.python.org/library/csv.html) that would explain this behaviour, but I may have misread something.

Any ideas?

Thanks, Victor

EDIT:

It turns out I was pretty stupid - I was indexing the dict on a column ("something") that was empty for some rows in the input CSV file, a fact I didn't even notice in the mass of data (basically there were two ID columns, and I was using the wrong one).

So Alex was right: there were duplicates in "something", and each subsequent entry with an empty "something" was overwriting the previous one.

I've awarded the answer to Alex Martelli.

+1  A: 

Can't reproduce your problem -- when I save that data and then assign list(gd_extract), I see:

[{'telephoneNumber': '+65(2)34523534545', 'ipsum': '8403', 'sn': 'bay', 'dolor': '2535', 'mail': '[email protected]', 'givenName': 'ian', 'lorem': '3424'},
 {'telephoneNumber': '+65(2)34523534545', 'ipsum': '8403', 'sn': 'gibson', 'dolor': '2535', 'mail': '[email protected]', 'givenName': 'mike', 'lorem': '3424'},
 {'telephoneNumber': '+65(2)34523534545', 'ipsum': '', 'sn': 'martin', 'dolor': '', 'mail': '[email protected]', 'givenName': 'ross', 'lorem': ''},
 {'telephoneNumber': '+65(2)34523534545', 'ipsum': '', 'sn': 'connor', 'dolor': '', 'mail': '[email protected]', 'givenName': 'david', 'lorem': ''},
 {'telephoneNumber': '+65(2)34523534545', 'ipsum': '8403', 'sn': 'call', 'dolor': '2535', 'mail': '[email protected]', 'givenName': 'chris', 'lorem': '3424'}]

five dicts, including those with missing ipsum etc. I fear that in your laudable attempt at simplifying the problem you've simplified it excessively, so that your bug has gone away.
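On the restval point: as far as I know, restval is only used when a row has fewer fields than the header, not when a field is present but empty. Here is a minimal sketch of that difference, using made-up sample data (the addresses and the shortened column list are hypothetical, not the asker's file):

```python
import csv
import io

# Hypothetical sample: the first data row has an empty field,
# the second is genuinely short (fewer fields than the header).
sample = (
    "mail,givenName,lorem,telephoneNumber\n"
    "a@example.com,ross,,+65(2)34523534545\n"
    "b@example.com,david\n"
)

rows = list(csv.DictReader(io.StringIO(sample), restval="missing"))

# Empty field between commas: parsed as '' and restval is not used.
print(repr(rows[0]["lorem"]))            # ''
# Short row: the absent trailing columns are filled with restval.
print(repr(rows[1]["lorem"]))            # 'missing'
print(repr(rows[1]["telephoneNumber"]))  # 'missing'
```

So a string of commas like the one in the question should produce empty strings rather than the restval, which would explain why changing restval made no visible difference.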

If you have duplicates in column something (can't check, since you don't have that column in your sample data) that would of course explain the "apparently missing" rows -- they're not missing from the csv reader's returned stream, they get "overwritten" in the dict you're returning. Could that be the issue?
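To illustrate the overwriting, here is a small sketch with invented data; the "something" column and its values are assumptions, since the real column isn't in the sample. It compares the number of rows the reader yields with the number of entries left in the dict, then counts how often each key occurs:

```python
import csv
import io
from collections import Counter

# Hypothetical sample: "something" is empty on two rows.
sample = (
    "mail,something,sn\n"
    "a@example.com,id1,bay\n"
    "b@example.com,,martin\n"
    "c@example.com,,connor\n"
    "d@example.com,id2,call\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
by_key = {row["something"]: row for row in rows}

print(len(rows))         # 4: the reader yields every row
print(len(by_key))       # 3: the two rows keyed '' collapse into one entry
print(by_key[""]["sn"])  # connor: the later row replaced the earlier one

# Report any key that occurs more than once before trusting the dict.
counts = Counter(row["something"] for row in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)        # {'': 2}
```

If the duplicates dict comes back non-empty, the "missing" rows are probably in the reader's output and are only lost when the dict is built.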

Alex Martelli
Woah, *the* Alex Martelli *grins*. Hi =). Are you ever updating Python in a Nutshell, btw? Yeah, I simplified it both to make it easier to read, and also for privacy/compliance reasons (the place I work for is rather...stringent). Turns out you were right, see Edit above.
victorhooi
@victor, I was scheduled to update the Nutshell but I found out I can't do a really great job unless I'm really using the Python release I'm supposed to update the Nutshell to -- I mean in real life as opposed to playing around with it and answering SO questions;-). We're soon upgrading at work so that should eventually change!
Alex Martelli
Alex, aha, looking forward to it =). You should have both a vote and a tick (that's what I did on my end). Check again? Also, I posted a followup question http://stackoverflow.com/questions/2901872/python-checking-for-membership-inside-nested-dict I think it has a fairly easy solution, I'm probably just a bit sleep-deprived right now. Thanks.
victorhooi
@victor, checked at the time, I didn't, when I found that changed I fixed my comment. I'd love to help on your other Q but I'm getting pretty sleep-deprived too and I find it just too hard to grasp -- if you edited that Q to add a toy example of what the employees dict may look like and what output you want in a couple examples of your target function that would probably help a lot (but I'm going to sleep soon anyway now;-).
Alex Martelli
@Alex Martelli: Yeah, I'm sure it's some crazy hour wherever you are =). Thanks for your help. And I followed your advice, and added some dict samples to my other question. Hopefully it'll be a little clearer for whoever's reading it.
victorhooi
A: 

This may have nothing to do with your problem, and Alex's analysis is quite plausible given the lack of information, but you should ALWAYS open a csv file with "rb" or "wb" mode (assuming Python 2.X). If you don't, you run the risk of various mysterious happenings, such as mangled line endings on Windows. A csv file is not a text file; it's a BINARY file.

In any case, please edit your question to show:
(1) (a) a sample file (b) a script (c) output -- which together demonstrate the alleged problem
(2) what version of Python you are running
(3) what OS

Update: For Python 3.X, do as the blessed manual says: """If csvfile is a file object, it should be opened with newline=''.""" Although this advice is included only with csv.reader, it applies equally to csv.writer, csv.DictReader, and csv.DictWriter.
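A minimal Python 3 sketch of that advice, assuming a throwaway file in a temporary directory and invented rows (neither is from the question). It writes a field containing an embedded line break and reads it back, opening the file with newline='' both times:

```python
import csv
import os
import tempfile

# Hypothetical data: one field contains an embedded line break.
rows_out = [["name", "note"], ["ross", "line one\r\nline two"]]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# newline='' stops the file object from translating line endings,
# leaving that job to the csv module.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows_out)

with open(path, newline="") as f:
    rows_in = list(csv.reader(f))

print(rows_in == rows_out)  # True
```

Without newline='', the csv docs warn that newlines embedded in quoted fields may not be interpreted correctly, and that an extra \r can be added on write on platforms that use \r\n line endings.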

John Machin
I'm using Python 3.x - sorry, I should have stated that. Does that change your advice regarding rb/wb?
victorhooi