Hi all,
This particular problem is easy to solve, but I'm not so sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!
What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?
For instance, if the text contained only the following two lines:
"This is the white baseball." "These guys have white baseball bats."
You would end up collecting the following stats: (this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.
For the entry (baseball, white: 2), the value would be 2, since this pair of words occurs in the same line a total of 2 times.
Ideally, the stats should be placed in a dictionary, where each key is alphabetized at the tuple level (i.e., you wouldn't want separate entries for "this, is" and "is, this"). We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.
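To make the question concrete, here's a minimal sketch of one way to do it in Python, using `itertools.combinations` over the sorted, deduplicated words of each line so each unordered pair is generated exactly once and already alphabetized. The punctuation stripping is a guess at what you'd want; adjust to taste. Processing one line at a time keeps memory proportional to the number of distinct pairs, not the file size:

```python
from collections import Counter
from itertools import combinations

def pair_counts(lines):
    """Count how often each unordered pair of words shares a line."""
    counts = Counter()
    for line in lines:
        # Lowercase, strip trailing punctuation (a simplistic guess),
        # and deduplicate so a word repeated in one line isn't double-counted.
        words = sorted(set(w.strip('."').lower() for w in line.split()))
        # combinations() on a sorted list yields each pair once, already
        # in alphabetical order, so ("is", "this") and ("this", "is")
        # collapse into a single key.
        counts.update(combinations(words, 2))
    return counts

text = ['This is the white baseball.',
        'These guys have white baseball bats.']
stats = pair_counts(text)
print(stats[('baseball', 'white')])  # 2
```

For a large file you'd pass the open file object itself as `lines` so it streams line by line; the per-line cost is quadratic in the number of distinct words on that line, which is usually small.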