I have two long lists, one from a log file that contains lines formatted like:
201001050843 blah blah blah <[email protected]> blah blah
and a second file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while maintaining the CSV format.
Example:
Log file contains:
201001050843 blah blah blah <[email protected]> blah blah
201001050843 blah blah blah <[email protected]> blah blah
File2 contains:
156456,bob,sagget,[email protected],4564456
156464,bob,otherguy,[email protected],45644562
The output should be:
156464,bob,otherguy,[email protected],45644562
Currently I grab the emails from the log and load them into another list with:
sent_emails = []
for line in sent:
    try:
        temp1 = line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass
And then compare to file2 with either:
lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)
or:
for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass
However, both return all of file2!
Thanks for any help you can give.
EDIT: Thanks for the recommendations to use sets; they made a larger speed difference than I would have thought possible. Way to go hash tables! I will definitely be using sets more often from now on.
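For anyone landing here later, this is a rough sketch of what the set-based approach can look like. It is not necessarily the exact code from the answers. The helper names (extract_sent_emails, unsent_rows), the regular expression, and the example.com addresses are all made up for illustration, and it assumes the address sits between < and > in the log and appears as its own field in the CSV. One difference from the attempts above is that the rows are split on commas with the csv module rather than with line.split(), which splits on whitespace and so leaves each CSV line as a single chunk that never matches a bare address.

```python
import csv
import io
import re

# Assumption: an address is whatever sits between < and > and contains an @.
EMAIL_IN_BRACKETS = re.compile(r"<([^<>\s]+@[^<>\s]+)>")


def extract_sent_emails(log_lines):
    """Collect every bracketed address from the log into a set."""
    sent = set()
    for line in log_lines:
        for address in EMAIL_IN_BRACKETS.findall(line):
            sent.add(address.strip().lower())
    return sent


def unsent_rows(csv_lines, sent_emails):
    """Yield parsed CSV rows in which no field matches a sent address."""
    for row in csv.reader(csv_lines):
        fields = {field.strip().lower() for field in row}
        # Set membership is a hash lookup, so this stays fast on long lists.
        if fields.isdisjoint(sent_emails):
            yield row


# Hypothetical sample data standing in for the two files.
log = [
    "201001050843 blah blah blah <alice@example.com> blah blah",
    "201001050843 blah blah blah <carol@example.com> blah blah",
]
file2 = [
    "156456,bob,sagget,alice@example.com,4564456",
    "156464,bob,otherguy,dave@example.com,45644562",
]

sent_emails = extract_sent_emails(log)

# Write the surviving rows back out with csv.writer to keep the CSV format.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(unsent_rows(file2, sent_emails))
result = out.getvalue()
print(result, end="")
```

With real files, the two lists would be replaced by open file objects, and the io.StringIO by the output file opened with newline="".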