ansaurus

Question

Python - Most efficient way to find how often each possible pair of words occurs in the same line in a text file?

Answer 1

+4 A:

from collections import defaultdict
import itertools as it
import re

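# Maps each pair of words to the number of times it appears on the same line.
pairs = defaultdict(int)

# `lines` is any iterable of strings, e.g. an open file object.
for line in lines:
    # combinations() already yields tuples, so each pair can serve as a dict key.
    for pair in it.combinations(re.findall(r'\w+', line), 2):
        pairs[pair] += 1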

# Flatten the dict into (word1, word2, count) tuples.
resultList = [pair + (occurrences,) for pair, occurrences in pairs.items()]

eumiro 2010-10-01 18:59:48

That was amazing, and fast! Would it be hard to change the last bit I mentioned, so that order *does* matter (i.e., (this, is) and (is, this) are distinct entries in the dict)?

Georgina 2010-10-01 19:10:44

@Georgina, if you want to keep the order, replace `combinations` with `permutations` and it should work. And what about letter case? Should 'This' == 'this'?

eumiro 2010-10-01 19:14:05

Case is irrelevant, so I suppose I ought to zap everything to lowercase.

Georgina 2010-10-02 02:46:14
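
Pulling the comments above together, here is a minimal sketch of how lowercasing and the ordered/unordered choice might be combined. The function name `count_pairs`, its `ordered` flag, and the sample lines are assumptions made for illustration; they are not part of the original answer. `Counter` is swapped in for `defaultdict(int)` only for brevity.

```python
from collections import Counter
import itertools as it
import re


def count_pairs(lines, ordered=False):
    """Count how often each pair of words shares a line, ignoring case.

    Hypothetical helper: `ordered` is an illustrative flag, not from the answer.
    """
    pairs = Counter()
    for line in lines:
        # Lowercase first so that 'This' and 'this' are the same word.
        words = re.findall(r'\w+', line.lower())
        if ordered:
            # permutations() emits both (a, b) and (b, a) as separate keys.
            combos = it.permutations(words, 2)
        else:
            # Sorting each pair folds (a, b) and (b, a) into a single key.
            combos = (tuple(sorted(p)) for p in it.combinations(words, 2))
        pairs.update(combos)
    return pairs


sample = ["This is a test", "is this"]  # made-up input lines
unordered = count_pairs(sample)
ordered = count_pairs(sample, ordered=True)
```

In the unordered case the pair is stored under its sorted form, so it is looked up as `unordered[("is", "this")]`. A word repeated within a line is counted once per occurrence; if each pair should count at most once per line, deduplicating `words` with `set()` before pairing would be one way to do it.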
