How would I parse the following input (either going line by line or via regex... or combination of both):

Table[
    Row[
        C_ID[Data:12345.0][Sec:12345.0][Type:Double]
     F_ID[Data:17660][Sec:17660][Type:Long]
     NAME[Data:Mike Jones][Sec:Mike Jones][Type:String]
    ]

    Row[
     C_ID[Data:2560.0][Sec:2560.0][Type:Double]
 ...
    ]
]

There is indentation in there, of course, so it can be split on \n\t (and then cleaned up for the extra tabs \t in the C_ID, F_ID lines and such).

The desired output is something more usable in python:

{'C_ID': 12345, 'F_ID': 17660, 'NAME': 'Mike Jones',....} {'C_ID': 2560, ....}

I've tried going line by line, and then using multiple splits() to throw away what I don't need and keep what I do need, but I'm sure there is a much more elegant and faster way of doing it...

+3  A: 

Parsing recursive structures with regex is a pain because you have to keep state.

Instead, use pyparsing or some other real parser.

Some folks like PLY because it follows the traditional Lex/Yacc architecture.
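
To make the "keep state" point concrete, here is a minimal hand-written sketch that uses only the standard library (it is neither pyparsing nor PLY). It assumes every block opens with a line like "Name[" and closes with a line containing only "]", as in the sample input; the function name parse_blocks is made up for this illustration. The explicit stack is the state that a single regex has no good way to carry.

```python
def parse_blocks(text):
    """Parse nested 'Name[' ... ']' blocks line by line.

    Returns a list of (name, children) tuples; children are either
    nested (name, children) tuples or raw attribute lines (strings).
    Assumes one opener or closer per line, as in the sample input.
    """
    root = ("root", [])
    stack = [root]  # the parser state: blocks that are currently open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("[") and "]" not in line:
            # opener such as "Table[" or "Row["
            node = (line[:-1], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif line == "]":
            # closer for the innermost open block
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            stack.pop()
        else:
            # anything else is kept as a raw attribute line
            stack[-1][1].append(line)
    if len(stack) != 1:
        raise ValueError("unclosed block: %s" % stack[-1][0])
    return root[1]


sample = """
Table[
    Row[
        C_ID[Data:12345.0][Sec:12345.0][Type:Double]
        F_ID[Data:17660][Sec:17660][Type:Long]
    ]
    Row[
        C_ID[Data:2560.0][Sec:2560.0][Type:Double]
    ]
]
"""

tree = parse_blocks(sample)
table_name, rows = tree[0]
print(table_name, len(rows))
```

Each attribute line still has to be taken apart afterwards; a parser library does both steps at once and gives better error messages.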

nosklo
A: 

This excellent page lists many parsers available to Python programmers. Regexes are unsuitable for "balanced parentheses" matching, but any of the third party packages reviewed on that page will serve you well.

Alex Martelli
A: 

This regex:

Row\[[\s]*C_ID\[[\W]*Data:([0-9.]*)[\S\W]*F_ID\[[\S\W]*Data:([0-9.]*)[\S\W]*NAME\[[\S\W]*Data:([\w ]*)[\S ]*

for the first row will match:

$1=12345.0 $2=17660 $3=Mike Jones

Then you can use something like this:

{'C_ID': $1, 'F_ID': $2, 'NAME': '$3'}

to produce:

{'C_ID': 12345.0, 'F_ID': 17660, 'NAME': 'Mike Jones'}

So you need to iterate through your input until it stops matching your rows... Does it make sense?
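
As a rough Python sketch of that iteration: rather than the single long pattern above, this version uses two smaller patterns of my own (ROW_RE and ATTR_RE, not taken from the regex above), one to find each Row[...] block and one to pull the name, Data value and Type out of each attribute line. It assumes that Data values never contain "]" and that each row's closing "]" sits on its own line. The CONVERTERS table and the choice to leave unknown types as strings are also assumptions.

```python
import re

# one attribute line: NAME[Data:...][Sec:...][Type:...]
ATTR_RE = re.compile(r"(\w+)\[Data:([^\]]*)\]\[Sec:([^\]]*)\]\[Type:(\w+)\]")
# one row: everything between "Row[" and a "]" that starts a line
ROW_RE = re.compile(r"Row\[(.*?)\n\s*\]", re.DOTALL)

# map the Type field to a Python converter (assumed mapping)
CONVERTERS = {"Double": float, "Long": int, "String": str}


def parse_rows(text):
    """Return one dict per Row[...] block, keyed by attribute name."""
    rows = []
    for row_match in ROW_RE.finditer(text):
        row = {}
        for name, data, _sec, type_name in ATTR_RE.findall(row_match.group(1)):
            convert = CONVERTERS.get(type_name, str)  # unknown types stay strings
            row[name] = convert(data)
        rows.append(row)
    return rows


sample = """
Table[
    Row[
        C_ID[Data:12345.0][Sec:12345.0][Type:Double]
        F_ID[Data:17660][Sec:17660][Type:Long]
        NAME[Data:Mike Jones][Sec:Mike Jones][Type:String]
    ]

    Row[
        C_ID[Data:2560.0][Sec:2560.0][Type:Double]
    ]
]
"""

print(parse_rows(sample))
# [{'C_ID': 12345.0, 'F_ID': 17660, 'NAME': 'Mike Jones'}, {'C_ID': 2560.0}]
```

Because the attribute pattern captures the name as well as the value, the same pattern serves C_ID, F_ID and NAME, and rows with missing attributes simply produce shorter dicts.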

DmitryK
btw, an alternative solution can be to convert the whole lot to XML and use XSLT to construct output you need.
DmitryK
That will work... kind of. What if I wanted to execute that regex for each row so that it just matched C_ID as $1 and 12345.0 as $2, and then repeat for the next row (with $1 and $2 holding the variable name and value respectively)?
Crazy Serb
Then you will need 3 different regexes, for C_ID, F_ID and NAME respectively. I think you will be better off parsing your input on a per-row basis.
DmitryK
A: 

There really isn't a lot of unpredictable nesting going on here, so you could do this with regexes. But pyparsing is my tool of choice, so here is my solution:

from pyparsing import *

LBRACK,RBRACK,COLON = map(Suppress,"[]:")
ident = Word(alphas, alphanums+"_")
datatype = oneOf("Double Long String Boolean")

# define expressions for pieces of attribute definitions
data = LBRACK + "Data" + COLON + SkipTo(RBRACK)("contents") + RBRACK
sec = LBRACK + "Sec" + COLON + SkipTo(RBRACK)("contents") + RBRACK
# trailing underscore avoids shadowing the builtin type()
type_ = LBRACK + "Type" + COLON + datatype("datatype") + RBRACK

# define entire attribute definition, giving each piece its own results name
attrDef = Group(ident("key") + data("data") + sec("sec") + type_("type"))

# now a row is just a "Row[" and one or more attrDef's and "]"
rowDef = Group("Row" + LBRACK + Group(OneOrMore(attrDef))("attrs") + RBRACK)

# this method will process each row, and convert the key and data fields
# to addressable results names
def assignAttrs(tokens):
    ret = ParseResults(tokens.asList())
    for attr in tokens[0].attrs:
        # use datatype mapped to function to convert data at parse time
        value = {
            'Double' : float,
            'Long' : int,
            'String' : str,
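            # bool(s) is True for any non-empty string; assumes 'true'/'false' spelling
            'Boolean' : lambda s: s.strip().lower() == 'true',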
            }[attr.type.datatype](attr.data.contents)
        ret[attr.key] = value
    # replace parse results created by pyparsing with our own named results
    tokens[0] = ret
rowDef.setParseAction(assignAttrs)

# a TABLE is just "Table[", one or more rows and "]"
tableDef = "Table" + LBRACK + OneOrMore(rowDef)("rows") + RBRACK

test = """
Table[    
  Row[
    C_ID[Data:12345.0][Sec:12345.0][Type:Double]
    F_ID[Data:17660][Sec:17660][Type:Long]
    NAME[Data:Mike Jones][Sec:Mike Jones][Type:String]
  ]    
  Row[
    C_ID[Data:2560.0][Sec:2560.0][Type:Double] 
    NAME[Data:Casey Jones][Sec:Mike Jones][Type:String]
  ]
]"""

# now parse table, and access each row and its defined attributes
results = tableDef.parseString(test)
for row in results.rows:
    print(row.dump())
    print(row.NAME, row.C_ID)
    print()

prints:

[[[['C_ID', 'Data', '12345.0', 'Sec', '12345.0', 'Type', 'Double'],...
- C_ID: 12345.0
- F_ID: 17660
- NAME: Mike Jones
Mike Jones 12345.0

[[[['C_ID', 'Data', '2560.0', 'Sec', '2560.0', 'Type', 'Double'], ...
- C_ID: 2560.0
- NAME: Casey Jones
Casey Jones 2560.0

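The results names assigned in assignAttrs give you access to each of your attributes by name. To see if a name has been omitted, just test "if not row.F_ID:". Note that this test is also true when the attribute is present but falsy (0 or an empty string); "if 'F_ID' not in row:" checks for the name itself.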

Paul McGuire