I have been given a file that I would like to extract the useful data from. The format of the file goes something like this:

LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3

etc...

What I would like to do is remove LINE: and the line number, as well as TOKENKIND:, so that I am left with a single string of the form 'somedata somedata somedata...'

I'm doing this in Python, using regular expressions (which I'm not sure are correct) to match the parts of the file I'd like to remove.

My question is, how can I get Python to match multiple regex groups and ignore them, adding anything that isn't matched by my regex to my output string? My current code looks like this:

import re
import sys

ignoredTokens = re.compile('''
    (?P<WHITESPACE>      \s+             ) |
    (?P<LINE>            LINE:\s[0-9]+   ) |
    (?P<TOKEN>           [A-Z]+:         )
''', re.VERBOSE)

tokenList = open(sys.argv[1], 'r').read()
cleanedList = ''

scanner = ignoredTokens.scanner(tokenList)

for line in tokenList:
    match = scanner.match()

    if match.lastgroup not in ('WHITESPACE', 'LINE', 'TOKEN'):
        cleanedList = cleanedList + match.group(match.lastindex) + ' '

print cleanedList
+1  A: 

How about replacing (^LINE: \d+$)|(^\w+:) with an empty string ""?

Use \n instead of ^ and $ to remove unwanted empty lines also.
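
A minimal sketch of that substitution, assuming the file contents are already in a string. The names sample, junk and cleaned are made up for illustration. Note that ^ and $ only match at the start and end of each line when re.MULTILINE is set. The final split/join is an extra step, not part of the suggestion above, that collapses the leftover spaces and blank lines into single spaces.

```python
import re

sample = """LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3
"""

# re.MULTILINE lets ^ and $ match at every line boundary,
# not only at the very start and end of the string.
junk = re.compile(r'^LINE: \d+$|^\w+:', re.MULTILINE)

stripped = junk.sub('', sample)

# split() with no arguments drops the leftover spaces and empty lines.
cleaned = ' '.join(stripped.split())
print(cleaned)
```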

Amarghosh
Sorry, I don't think I was precise enough. What I would like to know is whether my for loop is the correct way of ignoring anything matched by WHITESPACE, LINE and TOKEN.
greenie
Alex has posted an improved, more Pythonic version of this.
Amarghosh
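
On the for loop itself, a few things look likely to trip it up. Iterating over a string yields single characters, so "for line in tokenList" runs once per character rather than once per line. The pattern contains only the three "ignore" groups, so any successful match has a lastgroup that is one of them and nothing is ever kept. And where none of the groups match (at the start of "somedata", for instance), a match attempt gives None, so match.lastgroup would raise an AttributeError. One possible way round this, sketched below, is to add a catch-all group and walk the matches with finditer(). The DATA group and the clean() helper are hypothetical additions, not part of the original code, and data made up solely of capitals followed by a colon would still be swallowed by TOKEN.

```python
import re

# The question's three groups plus a hypothetical DATA group that
# catches any run of non-whitespace the earlier groups did not claim.
token_re = re.compile(r'''
    (?P<WHITESPACE>      \s+             ) |
    (?P<LINE>            LINE:\s[0-9]+   ) |
    (?P<TOKEN>           [A-Z]+:         ) |
    (?P<DATA>            \S+             )
''', re.VERBOSE)


def clean(text):
    """Return the DATA pieces of text joined by single spaces."""
    kept = []
    for match in token_re.finditer(text):
        # lastgroup names the alternative that matched this token.
        if match.lastgroup == 'DATA':
            kept.append(match.group())
    return ' '.join(kept)


sample = ("LINE: 1\nTOKENKIND: somedata\nTOKENKIND: somedata\n"
          "LINE: 2\nTOKENKIND: somedata\nLINE: 3\n")
print(clean(sample))
```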
+2  A: 
import re

x = '''LINE: 1
TOKENKIND: somedata
TOKENKIND: somedata
LINE: 2
TOKENKIND: somedata
LINE: 3'''

junkre = re.compile(r'(\s*LINE:\s*\d*\s*)|(\s*TOKENKIND:)', re.DOTALL)

print(junkre.sub('', x))
Alex Martelli
Perfect. Removing my for loop and using sub() worked fine. Thanks for your help.
greenie
+1  A: 

There is no need for a regex here. It's Python, after all, not Perl. Keep it simple and use its string manipulation capabilities.

with open("file") as f:
    for line in f:
        # Skip the "LINE: n" rows entirely.
        if line.startswith("LINE:"):
            continue
        if "TOKENKIND" in line:
            # Keep only the text after the first space, i.e. the data.
            print(line.split(" ", 1)[-1].strip())
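
If the goal is the single space-separated string from the question rather than one value per line, the same string-method idea can collect the pieces and join them at the end. This is only a sketch: extract_data and sample are made-up names, and it assumes every data row has the form "KIND: data".

```python
def extract_data(lines):
    """Collect the text after the first colon on each non-LINE row."""
    kept = []
    for line in lines:
        if line.startswith("LINE:"):
            continue
        # partition() splits on the first colon only, so any colon
        # inside the data itself is left alone.
        kind, sep, data = line.partition(":")
        if sep and data.strip():
            kept.append(data.strip())
    return " ".join(kept)


sample = [
    "LINE: 1\n",
    "TOKENKIND: somedata\n",
    "TOKENKIND: somedata\n",
    "LINE: 2\n",
    "TOKENKIND: somedata\n",
    "LINE: 3\n",
]
print(extract_data(sample))
```

Because the function only iterates over its argument, an open file object can be passed in place of the list.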