OK guys/gals, I'm stuck again on something simple.
I have a text file with multiple lines per entry. The data is in the following format:

firstword word word word
wordx word word word interesting1 word word word word
wordy word word word
wordz word word word interesting2 word word word lastword

This sequence repeats a hundred or so times, with no blank lines. All the other words are the same in every sequence; only interesting1 and interesting2 vary. interesting2 is pertinent to interesting1 but not to anything else, so I want to link the two interesting items together and discard the rest, giving output such as:

interesting1 = interesting2
interesting1 = interesting2
interesting1 = interesting2
etc., one line per sequence

Each line of a sequence begins with a different word.
My attempt was to read the file and use an "if wordx in line" test to identify the first interesting line, slice out the value, then find the second line ("if wordz in line"), slice out its value and concatenate it with the first.
It's clumsy though: I had to use global variables, temp variables, etc. I'm sure there must be a way of identifying the range between firstword and lastword, placing it into a single list, then slicing both values out together.

Any suggestions gratefully acknowledged, thanks for your time

A: 

In that case, make a regexp that matches the repeating text, and has groups for the interesting bits. Then you should be able to use findall to find all cases of interesting1 and interesting2.

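Like so:

import re

text = open("foo.txt").read()
# DOTALL lets .*? run across line breaks; each (.*?) group captures one interesting word
RE = re.compile(r'firstword.*?wordx word word word (.*?) word.*?wordz word word word (.*?) word', re.DOTALL)
print(RE.findall(text))  # a list of (interesting1, interesting2) tuples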

Although as mentioned in the comments, the islice is definitely a neater solution.
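
As a variation on the regexp idea, here is a minimal sketch in Python 3 syntax that anchors each capture to the start of its line and prints the pairs in the "interesting1 = interesting2" form the question asks for. It assumes the four-line layout from the question and literal marker words wordx and wordz. The inline SAMPLE text stands in for foo.txt, and the names SAMPLE, PATTERN and pair_up are made up for illustration.

```python
import re

# Inline sample standing in for the contents of foo.txt (assumed layout).
SAMPLE = """\
firstword word word word
wordx word word word alpha word word word word
wordy word word word
wordz word word word beta word word word lastword
firstword word word word
wordx word word word gamma word word word word
wordy word word word
wordz word word word delta word word word lastword
"""

# [ \t]+ separates tokens without crossing a line break; \S+ is one token.
PATTERN = re.compile(
    r"^wordx(?:[ \t]+\S+){3}[ \t]+(\S+).*\n"  # wordx line: 5th token is interesting1
    r".*\n"                                   # wordy line: skipped
    r"^wordz(?:[ \t]+\S+){3}[ \t]+(\S+)",     # wordz line: 5th token is interesting2
    re.MULTILINE,
)


def pair_up(text):
    """Return one 'interesting1 = interesting2' string per matched block."""
    return ["%s = %s" % (first, second) for first, second in PATTERN.findall(text)]


for entry in pair_up(SAMPLE):
    print(entry)
```

Because the pattern is tied to the wordx and wordz lines rather than to firstword, stray lines between blocks are simply skipped.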

Lennart Regebro
Presuming you mean a four-line re.VERBOSE-style regexp, with the second line something like \s* wordx \S+ \s+ \S+ \s+ \S+ \s+ (\S+) \s+ \S+ \s+ \S+ \s+ \S+ \s+ \S+ \s* \n ... the OP might need a bit of help with that. Spelling it out with a bit of explanation and how to tweak it should get you at least one up-vote ;-)
John Machin
Eh... no, you'd only need a regexp that actually matches the text in question, but does not match parts of it or several repetitions. I don't see a need for it to be four lines or to contain long lines of \s+... In any case, the islice is a better solution. Still, I updated it with a complete solution.
Lennart Regebro
+6  A: 
from itertools import izip, tee, islice

i1, i2 = tee(open("foo.txt"))

for line2, line4 in izip(islice(i1,1, None, 4), islice(i2, 3, None, 4)) :
    print line2.split(" ")[4], "=", line4.split(" ")[4]
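
The snippet above is Python 2. On Python 3, izip is gone and the built-in zip is already lazy, so a port might look like the sketch below. It assumes the file holds nothing but complete four-line blocks. The inline SAMPLE text stands in for foo.txt, and SAMPLE and pair_lines are illustrative names only. It also calls split() with no argument, so repeated spaces or a trailing newline do not shift the index.

```python
from io import StringIO
from itertools import islice, tee

# Inline sample standing in for foo.txt (assumed: only complete 4-line blocks).
SAMPLE = """\
firstword word word word
wordx word word word alpha word word word word
wordy word word word
wordz word word word beta word word word lastword
firstword word word word
wordx word word word gamma word word word word
wordy word word word
wordz word word word delta word word word lastword
"""


def pair_lines(lines):
    """Pair the 5th word of every 2nd line with the 5th word of every 4th line."""
    i1, i2 = tee(lines)                # two independent iterators over the same lines
    seconds = islice(i1, 1, None, 4)   # lines 2, 6, 10, ...
    fourths = islice(i2, 3, None, 4)   # lines 4, 8, 12, ...
    return [(l2.split()[4], l4.split()[4]) for l2, l4 in zip(seconds, fourths)]


for first, second in pair_lines(StringIO(SAMPLE)):
    print(first, "=", second)
```

With a real file, pair_lines(open("foo.txt")) would take the place of the StringIO call. Any extra or missing line throws the fixed stride off, so this only suits input known to be regular.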
truppo
A: 

I've thrown in a bagful of assertions to check the regularity of your data layout.

C:\SO>type words.py

# sample pseudo-file contents
guff = """\
firstword word word word
wordx word word word interesting1-1 word word word word
wordy word word word
wordz word word word interesting2-1 word word word lastword

miscellaneous rubbish

firstword word word word
wordx word word word interesting1-2 word word word word
wordy word word word
wordz word word word interesting2-2 word word word lastword
firstword word word word
wordx word word word interesting1-3 word word word word
wordy word word word
wordz word word word interesting2-3 word word word lastword

"""

# change the RHS of each of these to reflect reality
FIRSTWORD = 'firstword'
WORDX = 'wordx'
WORDY = 'wordy'
WORDZ = 'wordz'
LASTWORD = 'lastword'

from StringIO import StringIO
f = StringIO(guff)

while True:
    a = f.readline()
    if not a: break # end of file
    a = a.split()
    if not a: continue # empty line
    if a[0] != FIRSTWORD: continue # skip extraneous matter
    assert len(a) == 4
    b = f.readline().split(); assert len(b) == 9
    c = f.readline().split(); assert len(c) == 4
    d = f.readline().split(); assert len(d) == 9
    assert a[0] == FIRSTWORD
    assert b[0] == WORDX
    assert c[0] == WORDY
    assert d[0] == WORDZ
    assert d[-1] == LASTWORD
    print b[4], d[4]

C:\SO>\python26\python words.py
interesting1-1 interesting2-1
interesting1-2 interesting2-2
interesting1-3 interesting2-3

C:\SO>
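
For Python 3, the same scan could be sketched as a generator. This is an adaptation, not the script above: it raises ValueError instead of using assert (assertions are skipped under python -O), it checks only the marker words and that the fifth word exists, and the name iter_pairs and its keyword parameters are made up for illustration.

```python
from io import StringIO

# Sample pseudo-file contents, with extraneous matter between blocks.
SAMPLE = """\
firstword word word word
wordx word word word interesting1-1 word word word word
wordy word word word
wordz word word word interesting2-1 word word word lastword

miscellaneous rubbish

firstword word word word
wordx word word word interesting1-2 word word word word
wordy word word word
wordz word word word interesting2-2 word word word lastword
"""


def iter_pairs(lines, first="firstword", x="wordx", z="wordz", last="lastword"):
    """Yield (interesting1, interesting2) for each block starting with `first`."""
    it = iter(lines)
    for line in it:
        head = line.split()
        if not head or head[0] != first:
            continue  # blank line or extraneous matter
        # Read the rest of the block; "" stands in for lines missing at end of file.
        b = next(it, "").split()
        next(it, "")  # the wordy line carries nothing of interest
        d = next(it, "").split()
        if len(b) < 5 or len(d) < 5 or b[0] != x or d[0] != z or d[-1] != last:
            raise ValueError("malformed block after line: %r" % line)
        yield b[4], d[4]


for one, two in iter_pairs(StringIO(SAMPLE)):
    print(one, two)
```

Passing an open file object instead of the StringIO wrapper should work the same way, since both yield lines when iterated.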
John Machin