views:

62

answers:

1

Hi all,

This particular problem is easy to solve, but I'm not so sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!

What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?

For instance, if the text contained only the following two lines:

"This is the white baseball." "These guys have white baseball bats."

You would end up collecting the following stats: (this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.

For the entry (baseball, white: 2), the value would be 2, since this pair of words occurs in the same line a total of 2 times.

Ideally, the stats should be placed in a dictionary, where the keys are alphabetized at the tuple level (i.e., you wouldn't want separate entries for "this, is" and "is, this." We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.

+4  A: 
from collections import defaultdict
import itertools as it
import re

pairs = defaultdict(int)

for line in lines:
    for pair in it.combinations(re.findall('\w+', line), 2):
        pairs[tuple(pair)] += 1

resultList = [pair + (occurences, ) for pair, occurences in pairs.iterkeys()]
eumiro
That was amazing, and fast! Would it be hard to actually change the last bit I mentioned, and make it so that order *does* matter (i.e., (this, is) and (is, this) are actually distinct entries in the dict?
Georgina
@Georgina, if you want to keep the order, then replace the `combinations` with `permutations` and it should work. And how about case of letters in the words? 'This' == 'this'?
eumiro
Case is irrelevant, so I suppose I ought to zap everything to lowercase.
Georgina