Hello, I'm stumped with a problem illustrated in the sample below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

I want to detect duplicates in the "PHONE" column and mark each subsequent duplicate by putting the "ID" of the first occurrence in its "REF" column and the value "Yes" in its "DISCARD" column:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how do I go about it? I tried this code but my logic wasn't right, of course.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.

A: 

Sounds like homework. Since this is a CSV file (and thus changing the record size in place is next to impossible), you are best off loading the whole file into memory, manipulating it there, and then writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map from phone number (the key) to id (the value). Before each insert, look up the phone number: if it already exists, update the line containing the duplicate phone number; if it isn't in the map yet, insert the (phone, id) pair.
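
A minimal sketch of that approach (the file names are hypothetical; it assumes the five-column layout from the sample, header included, and marks only the duplicate rows, as described above):

import csv

rows = list(csv.reader(open(r'C:\temp\TEST2.csv', 'rb')))   # whole file in memory
header, body = rows[0], rows[1:]

firstids = {}                        # phone -> id of its first occurrence
for row in body:
    rec_id, phone = row[0], row[2]
    if phone in firstids:            # duplicate: point REF at the first id, mark DISCARD
        row[3] = firstids[phone]
        row[4] = 'Yes'
    else:
        firstids[phone] = rec_id

out = csv.writer(open(r'C:\temp\TESTFIXED.csv', 'wb'), dialect='excel')
out.writerow(header)
out.writerows(body)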

jmucchiello
Yes, it does sound like one, but it is not. It's real work, on a real job, in a non-programmers' work environment. I just simplified the data to focus on expressing my problem. And I was glad to find out that using dicts is the way to go.
Eduardo
A: 
from operator import itemgetter
from itertools import groupby

import csv
verum = list(csv.reader(open('data.csv', 'rb')))  # materialize: a csv reader object has no sort()

verum.sort(key=itemgetter(2,0))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = grp.next()
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first)) 
        for x in grp:
            # all others get the first items id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print line

Outputs:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']
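
In case itertools.groupby is unfamiliar: it batches consecutive items that share a key, which is why the rows are sorted by phone number first. A tiny illustration with made-up data:

from itertools import groupby

data = ['a', 'a', 'b', 'a']
print [(key, list(grp)) for key, grp in groupby(data)]
# -> [('a', ['a', 'a']), ('b', ['b']), ('a', ['a'])]   note the second, separate 'a' group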

Writing the data back is left to the reader ;-)
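
For example, a sketch replacing the print loop above (the output file name is made up):

out = csv.writer(open('data_fixed.csv', 'wb'), dialect='excel')
for line in sorted(grouper(verum), key=itemgetter(0)):
    out.writerow(line)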

THC4k
-1 Reads file into memory. Code legibility is low.
John Machin
I definitely need to learn more about itertools.groupby. But I still don't understand it well.
Eduardo
A: 

I know one thing. I know you don't have to read the entire file into memory to accomplish this.

import csv
myfile = "C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

for row in csv.reader(open(myfile, "r")):
    # setdefault sets the value to the second argument if it hasn't been set, and then
    # returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    if firstid is not row[0]:
        row[4] = "Yes"
    dest.writerow(row)
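
For anyone unfamiliar with setdefault, here is its behaviour in isolation (illustrative values only):

d = {}
print d.setdefault('12345', '1')   # key absent: stores '1' and returns it
print d.setdefault('12345', '3')   # key present: returns the stored '1' and ignores '3'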
Omnifarious
-1 ... reasons (1) using dict.setdefault (which needs explanation) (2) using subscripts instead of meaningful column names (3) using `is not` instead of '!=' (which requires a careful reader to analyse the code to ensure that it works) (4) not using `r` prefix on string constants that are filenames containing backslashes (5) setting row[3] unconditionally instead of only when there's a duplicate (as specified by the OP).
John Machin
Thanks, Omnifarious, for being the first to answer and for pointing me to a solution using a dictionary. Also for correcting your first attempt to help me.
Eduardo
@John Machin, the OP's own sample contradicts the last point: his desired output clearly shows the "REF" column being set for the first row of each duplicate group as well.
Omnifarious
+3  A: 

The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

import csv

phones = {}    # phone number -> id of its first occurrence
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in phones:
                ref = phones[phone]
                discard = "Yes"
            else:
                phones[phone] = id
            writer.writerow((id, name, phone, ref, discard))
Robert Rossney
+1 A very clear and practical solution: doesn't use obfuscatory sophistry (dict.setdefault and itertools.groupby) and uses names instead of numbers for columns.
John Machin
Yes, setdefault does make mine a little less clear. I was attempting to be as efficient as possible, but that's likely not necessary here.
Omnifarious
Your 'with' statements will possibly cause the files to close while a reader or writer is still attached to them.
Omnifarious
What do you mean?
Robert Rossney
When the scope of the second 'with' is over, fout will be closed. But someone could still call writer.writerow().
Omnifarious
@omnifarious: you should have kept quiet and not risked my reading your answer again :-)
John Machin
That's hardly an issue unique to `with` blocks. If you didn't use a `with` block, and closed `fout` explicitly, it would be just as possible for someone to call `writer.writerow()` and discover that the underlying file object was closed.
Robert Rossney
In my case I would change 'w' to 'wb' to avoid a blank line between each row.
Eduardo
That's a platform-specific issue; on Windows either setting produces the same result.
Robert Rossney
@Robert Rossney: for correct operation on any platform, the 'b' flag should be used for both csv readers and csv writers in Python 2.x -- and Windows is the OS where the same behaviour is NOT guaranteed if you leave out the 'b'.
John Machin
I know I've needed to use the 'b' setting in the past on Windows. But I wonder why changing the setting doesn't appear to produce different results when I test this specific code.
Robert Rossney
@Robert Rossney: "doesn't appear to" != "doesn't". Don't wonder, debug! What happens when you use 'w' for mode instead of 'wb' on Windows is that every `\n` in the file is preceded by a spurious `\r`. Use something like `repr(open('x.csv', 'rb').read())` to see exactly what you have in the file with each mode option. Note that csv is in effect a BINARY format which uses `\r\n` as a platform-independent row terminator, and quotes `\r` and `\n` when they appear as part of the data. You don't want Windows to delete `\r` on input and insert it on output. You should use 'b' unconditionally.
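
For example (a sketch; the file name is made up, and on non-Windows platforms both modes happen to produce identical bytes):

import csv
for mode in ('w', 'wb'):
    f = open('x.csv', mode)
    csv.writer(f).writerow(['1', 'JOHN', '12345'])
    f.close()
    print mode, repr(open('x.csv', 'rb').read())
# On Windows: 'w'  -> '1,JOHN,12345\r\r\n'   (text mode inserts a spurious \r)
#             'wb' -> '1,JOHN,12345\r\n'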
John Machin
So it does, and you're quite right.
Robert Rossney
@Robert Rossney, the answer is to not close the file, or make sure the csv reader and writer are closed and no longer using the file before closing it. If you are going to use `with` blocks, you should've had them clean up the csv reader and writer first, the files second.
Omnifarious
Could you possibly point me to an example? The documentation for the csv module gives no indication that it's even *possible* to close a reader or writer.
Robert Rossney
A: 

I work with large CSV files of 40k-plus records; the easiest way to get rid of dupes is with Access:

1. Create a new database.
2. On the Tables tab, use Get External Data to import the CSV as a table.
3. Save the table.
4. On the Queries tab, run the Find Duplicates wizard (match on the phone field; show all fields and the count).
5. Save the query, then export it as .txt under the name dupes.txt.
6. Import the query result as a new table, but do not import the field with the dupe count.
7. Run a Find Unmatched query (match by the phone field; show all fields in the result). Save the query, then export it as .txt under the name unique.txt.
8. Import the unique file into the existing table (dupes).
9. You can now save and export again into whatever file type you need, without any dupes.

Ray Larson