Hello, I'm stumped with a problem illustrated in the sample below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

I want to detect duplicates in the "PHONE" column and mark each subsequent duplicate by putting the "ID" of the first occurrence in its "REF" column and the value "Yes" in its "DISCARD" column:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how do I go about it? I tried this code but my logic wasn't right, of course.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.

A: 

Sounds like homework. Since this is a CSV file (and thus changing the record size in place is next to impossible), you are best off loading the whole file into memory, manipulating it there, and then writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map from phone number (the key) to id (the value). Before each insert, look up the phone number: if it already exists, update the line containing the duplicate phone number; if it isn't in the map yet, insert the (phone, id) pair.
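
A minimal sketch of that approach (the file names are hypothetical; it assumes the five-column layout from the sample, header included, and marks only the duplicate rows, as described above):

import csv

rows = list(csv.reader(open(r'C:\temp\TEST2.csv', 'rb')))   # whole file in memory
header, body = rows[0], rows[1:]

firstids = {}                        # phone -> id of its first occurrence
for row in body:
    rec_id, phone = row[0], row[2]
    if phone in firstids:            # duplicate: point REF at the first id, mark DISCARD
        row[3] = firstids[phone]
        row[4] = 'Yes'
    else:
        firstids[phone] = rec_id

out = csv.writer(open(r'C:\temp\TESTFIXED.csv', 'wb'), dialect='excel')
out.writerow(header)
out.writerows(body)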

jmucchiello
Yes, it does sound like one, but it is not. It's real work, on a real job, in a non-programmers' work environment. I just simplified the data to focus on expressing my problem. And I was glad to find out that using dicts is the way to go.
Eduardo
A: 
from operator import itemgetter
from itertools import groupby

import csv
verum = list(csv.reader(open('data.csv', 'rb')))  # materialize: a csv reader object has no sort()

verum.sort(key=itemgetter(2,0))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = grp.next()
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first)) 
        for x in grp:
            # all others get the first items id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print line

Outputs:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']
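
In case itertools.groupby is unfamiliar: it batches consecutive items that share a key, which is why the rows are sorted by phone number first. A tiny illustration with made-up data:

from itertools import groupby

data = ['a', 'a', 'b', 'a']
print [(key, list(grp)) for key, grp in groupby(data)]
# -> [('a', ['a', 'a']), ('b', ['b']), ('a', ['a'])]   note the second, separate 'a' group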

Writing the data back is left to the reader ;-)
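
For example, a sketch replacing the print loop above (the output file name is made up):

out = csv.writer(open('data_fixed.csv', 'wb'), dialect='excel')
for line in sorted(grouper(verum), key=itemgetter(0)):
    out.writerow(line)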

THC4k
-1 Reads file into memory. Code legibility is low.
John Machin
I definitely need to learn more about itertools.groupby. But I still don't understand it well.
Eduardo
A: 

I know one thing. I know you don't have to read the entire file into memory to accomplish this.

import csv
myfile = "C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

for row in csv.reader(open(myfile, "r")):
    # setdefault sets the value to the second argument if it hasn't been set, and then
    # returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    if firstid is not row[0]:
        row[4] = "Yes"
    dest.writerow(row)
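
For anyone unfamiliar with setdefault, here is its behaviour in isolation (illustrative values only):

d = {}
print d.setdefault('12345', '1')   # key absent: stores '1' and returns it
print d.setdefault('12345', '3')   # key present: returns the stored '1' and ignores '3'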
Omnifarious
-1 ... reasons (1) using dict.setdefault (which needs explanation) (2) using subscripts instead of meaningful column names (3) using `is not` instead of '!=' (which requires a careful reader to analyse the code to ensure that it works) (4) not using `r` prefix on string constants that are filenames containing backslashes (5) setting row[3] unconditionally instead of only when there's a duplicate (as specified by the OP).
John Machin
Thanks, Omnifarious, for being the first to answer and for pointing me to a solution using a dictionary. Also for correcting your first attempt to help me.
Eduardo
@John Machin, the OP's own sample contradicts the last point: his desired output clearly shows the "REF" column being set for the first row of each duplicate group as well.
Omnifarious
+3  A: 

The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

import csv

phones = {}    # phone number -> id of its first occurrence
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in phones:
                ref = phones[phone]
                discard = "Yes"
            else:
                phones[phone] = id
            writer.writerow((id, name, phone, ref, discard))
Robert Rossney
+1 A very clear and practical solution: doesn't use obfuscatory sophistry (dict.setdefault and itertools.groupby) and uses names instead of numbers for columns.
John Machin
Yes, setdefault does make mine a little less clear. I was attempting to be as efficient as possible, but that's likely not necessary here.
Omnifarious
Your 'with' statements will possibly cause the files to close while a reader or writer is still attached to them.
Omnifarious
What do you mean?
Robert Rossney
When the scope of the second 'with' is over, fout will be closed. But someone could still call writer.writerow().
Omnifarious
@omnifarious: you should have kept quiet and not risked my reading your answer again :-)
John Machin
That's hardly an issue unique to `with` blocks. If you didn't use a `with` block, and closed `fout` explicitly, it would be just as possible for someone to call `writer.writerow()` and discover that the underlying file object was closed.
Robert Rossney
In my case I would change 'w' to 'wb' to avoid a blank line between each row.
Eduardo
That's a platform-specific issue; on Windows either setting produces the same result.
Robert Rossney
@Robert Rossney: for correct operation on any platform, the 'b' flag should be used for both csv readers and csv writers in Python 2.x -- and Windows is the OS where the same behaviour is NOT guaranteed if you leave out the 'b'.
John Machin
I know I've needed to use the 'b' setting in the past on Windows. But I wonder why changing the setting doesn't appear to produce different results when I test this specific code.
Robert Rossney
@Robert Rossney: "doesn't appear to" != "doesn't". Don't wonder, debug! What happens when you use 'w' for mode instead of 'wb' on Windows is that every `\n` in the file is preceded by a spurious `\r`. Use something like `repr(open('x.csv', 'rb').read())` to see exactly what you have in the file with each mode option. Note that csv is in effect a BINARY format which uses `\r\n` as a platform-independent row terminator, and quotes `\r` and `\n` when they appear as part of the data. You don't want Windows to delete `\r` on input and insert it on output. You should use 'b' unconditionally.
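
For example (a sketch; the file name is made up, and on non-Windows platforms both modes happen to produce identical bytes):

import csv
for mode in ('w', 'wb'):
    f = open('x.csv', mode)
    csv.writer(f).writerow(['1', 'JOHN', '12345'])
    f.close()
    print mode, repr(open('x.csv', 'rb').read())
# On Windows: 'w'  -> '1,JOHN,12345\r\r\n'   (text mode inserts a spurious \r)
#             'wb' -> '1,JOHN,12345\r\n'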
John Machin
So it does, and you're quite right.
Robert Rossney
@Robert Rossney, the answer is to not close the file, or make sure the csv reader and writer are closed and no longer using the file before closing it. If you are going to use `with` blocks, you should've had them clean up the csv reader and writer first, the files second.
Omnifarious
Could you possibly point me to an example? The documentation for the csv module gives no indication that it's even *possible* to close a reader or writer.
Robert Rossney
A: 

I work with large CSV files of 40k-plus records; the easiest way to get rid of dupes is with Access:

1. Create a new database.
2. On the Tables tab, use Get External Data to import the CSV as a table.
3. Save the table.
4. On the Queries tab, run the Find Duplicates wizard (match on the phone field; show all fields and the count).
5. Save the query, then export it as .txt under the name dupes.txt.
6. Import the query result as a new table, but do not import the field with the dupe count.
7. Run a Find Unmatched query (match by the phone field; show all fields in the result). Save the query, then export it as .txt under the name unique.txt.
8. Import the unique file into the existing table (dupes).
9. You can now save and export again into whatever file type you need, without any dupes.

Ray Larson