I have a txt file of repeating lines like this:
Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: http://maps.yahoo.com/
Referer: http://www.yahoo.com
Host: http://pt.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: mail.yahoo.com
Referer: http://www.yahoo.com
Host: http://fr.wikipedia.org
Referer: http://www.wikipedia.org
Host: mail.yahoo.com
Referer: http://www.yahoo.com
I am using the following code to go through the lines and count how many hosts were accessed through each referrer:
dd = {}
for line in open('hosts.txt'):
    if line.startswith('Host'):
        # split on the first ': ' only, so the ':' in 'http://' is kept
        host = line.split(': ', 1)[1].strip('\n')
    elif line.startswith('Referer'):
        referer = line.split(': ', 1)[1].strip('\n')
        dd.setdefault(referer, [0, host])
        dd[referer][0] += 1
print(dd)
For example, for wikipedia.org as the referrer, I want to know how many links or domains have been accessed from it.
Each referrer should appear only once, as a key, and its value should be the number of distinct hosts accessed through it. If a host has already been counted for a given referrer, any later occurrence of the same referrer/host pair should be ignored. The result should look like this:
{'http://www.wikipedia.org': 3, 'http://www.yahoo.com': 3}
The problem with my code is that it counts every occurrence for a referrer, including repeated hosts, because I can't figure out how to relate the Host and Referer lines. Any hint or help is highly appreciated.
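One possible way to relate the two kinds of lines is to remember the most recent Host value and, when the following Referer line arrives, add that host to a set kept per referrer. Sets discard duplicates, so the size of each set is the number of distinct hosts. This is only a minimal sketch: it assumes every Referer line directly follows the Host line it belongs to, and the names `count_unique_hosts` and `sample_lines` are made up for illustration.

```python
from collections import defaultdict


def count_unique_hosts(lines):
    """Return {referer: number of distinct hosts seen with that referer}."""
    hosts_by_referer = defaultdict(set)
    host = None  # most recent Host value not yet paired with a Referer
    for line in lines:
        line = line.strip()
        if line.startswith('Host:'):
            # partition splits at the first ':' only, so 'http://' survives
            host = line.partition(':')[2].strip()
        elif line.startswith('Referer:') and host is not None:
            referer = line.partition(':')[2].strip()
            hosts_by_referer[referer].add(host)  # duplicates are ignored
            host = None
    return {ref: len(hosts) for ref, hosts in hosts_by_referer.items()}


# Hypothetical sample mirroring the lines shown in the question.
sample_lines = [
    'Host: http://de.wikipedia.org', 'Referer: http://www.wikipedia.org',
    'Host: answers.yahoo.com/', 'Referer: http://www.yahoo.com',
    'Host: http://de.wikipedia.org', 'Referer: http://www.wikipedia.org',
    'Host: http://maps.yahoo.com/', 'Referer: http://www.yahoo.com',
    'Host: http://pt.wikipedia.org', 'Referer: http://www.wikipedia.org',
    'Host: answers.yahoo.com/', 'Referer: http://www.yahoo.com',
    'Host: mail.yahoo.com', 'Referer: http://www.yahoo.com',
    'Host: http://fr.wikipedia.org', 'Referer: http://www.wikipedia.org',
    'Host: mail.yahoo.com', 'Referer: http://www.yahoo.com',
]

counts = count_unique_hosts(sample_lines)
print(counts)
```

On the sample lines above this counts three distinct hosts for each of the two referrers (de, pt and fr for Wikipedia; answers, maps and mail for Yahoo). To run it on the real file, the open file object can be passed directly, for example `count_unique_hosts(open('hosts.txt'))`. Note that hosts are compared as exact strings, so `mail.yahoo.com` and `http://mail.yahoo.com/` would count as two different hosts unless they are normalised first.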