I have two long lists: one from a log file that contains lines formatted like

201001050843 blah blah blah <[email protected]> blah blah

and a second file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while maintaining the CSV format.

Example
Log file contains:

201001050843 blah blah blah <[email protected]> blah blah
201001050843 blah blah blah <[email protected]> blah blah

File2 contains:

156456,bob,sagget,[email protected],4564456
156464,bob,otherguy,[email protected],45644562

The output should be:

156464,bob,otherguy,[email protected],45644562

Currently I grab the emails from the log and load them into another list with:

sent_emails =[]
for line in sent:
    try:
        temp1= line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass

And then compare to file2 with either:

lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)

or:

for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass

However, both return all of file2!

Thanks for any help you can give.

EDIT: Thanks for recommending sets; it made a larger speed difference than I would have thought possible. Way to go, hash tables! I will definitely be using sets more often from now on.

+1  A: 

line.split() splits at whitespace. Use line.split(',') instead.

Also: Does the order of the lines matter? If not, then you should really use a set() instead of a list. That will make the code much faster.
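
To illustrate both points, here is a small sketch. The addresses in the question are redacted, so the example.com addresses and the variable names below are made up for illustration only:

```python
# Hypothetical CSV line in the same shape as the question's file2
# (the real addresses are redacted in the post).
line = "156464,bob,otherguy,other@example.com,45644562\n"

# split() with no argument splits on whitespace only, so the whole CSV
# line comes back as a single item and the email is never isolated.
by_whitespace = line.split()

# split(',') separates the CSV fields, so the email becomes its own item.
by_comma = line.strip().split(',')

# Hypothetical set of addresses taken from the log. A membership test on
# a set is a hash lookup rather than a scan through a list.
sent_emails = {"sent@example.com"}
already_sent = by_comma[3] in sent_emails
```

With the whitespace split, the only item is the entire line, which can never equal an address from the log, so every line of file2 passes the "not in sent_emails" test.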

Aaron Digulla
*facepalm* Can't believe I missed that!
Chance
Now my code works, mere hours after I first said "I'll just write a quick script". Thanks for saving me from myself!
Chance
+1  A: 

You could create the set of emails as you do and then:

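import fileinput
import sys

# emails is a set of emails
for line in fileinput.input("csvfile.csv", inplace=1):
    parts = line.split(',')
    if parts[3] not in emails:
        # line already ends with a newline, so write it back unchanged
        sys.stdout.write(line)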

This only works if the email is always the fourth field (index 3) of each line in the CSV file.

fileinput enables in place editing.

And use a set for the emails instead of a list as Aaron said, not only because of speed but also to eliminate duplicates.
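
As a rough sketch of how the pieces could fit together without editing the file in place, the following uses the standard csv and re modules. The function names, the email_column parameter, and the example.com addresses are assumptions made for illustration; they do not come from the question:

```python
import csv
import re

def extract_emails(log_lines):
    """Collect the text between < and > on each log line into a set."""
    emails = set()
    for line in log_lines:
        match = re.search(r"<([^<>]+)>", line)
        if match:
            emails.add(match.group(1))
    return emails

def unsent_rows(csv_lines, sent_emails, email_column=3):
    """Return the CSV rows whose email field is not in sent_emails.

    Rows too short to have an email column are skipped.
    """
    rows = []
    for row in csv.reader(csv_lines):
        if len(row) > email_column and row[email_column] not in sent_emails:
            rows.append(row)
    return rows

# Hypothetical sample data in the same shape as the question's files.
log = ["201001050843 blah blah blah <sent@example.com> blah blah\n"]
mailing_list = [
    "156456,bob,sagget,sent@example.com,4564456\n",
    "156464,bob,otherguy,other@example.com,45644562\n",
]
remaining = unsent_rows(mailing_list, extract_emails(log))
```

The remaining rows can then be written back out in CSV format with csv.writer(...).writerows(remaining). Like the loop above, this assumes the email is always in the same column.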

Felix Kling
Perfect. Although my problem was actually a typo pointed out by Aaron Digulla, this answers the question I asked in a very clear way, and taught me something.
Chance
A: 

Here's another way, with a minimal check on the email address's position.

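import fileinput

# collect the addresses found between < and > in the log
emails = set()
for line in open("file1"):
    start = line.find("<")
    end = line.find(">")
    if start != -1 and end != -1:
        emails.add(line[start+1:end])

# with inplace=1, whatever is printed is written back to file2
for line in fileinput.FileInput("file2", inplace=1):
    # strip the newline first so a trailing email field still matches
    p = line.strip().split(",")
    for item in p:
        if "@" in item and item not in emails:
            print(line.strip())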

output

$ ./python.py
156464,bob,otherguy,[email protected],45644562
ghostdog74