views:

55

answers:

5

I am trying to parse the output of a statistical program (Mplus) using Python.

The format of the output (example here) is structured in blocks, sub-blocks, columns, etc., where the whitespace and line breaks are very important. Depending on, e.g., the options requested, you get an additional (sub)block or column here or there.

Approaching this using regular expressions has been a PITA and the result is completely unmaintainable. I have been looking into parsers as a more robust solution, but

  1. am a bit overwhelmed by all the possible tools and approaches;
  2. have the impression that they are not well suited for this kind of output.

E.g. LEPL has something called line-aware parsing, which seems to go in the right direction (whitespace, blocks, ...) but is still geared toward parsing programming syntax, not program output.

Suggestions on which direction to look in would be appreciated.

+1  A: 

Based on your example, what you have is a bunch of different, nested sub-formats that, individually, are very easily parsed. What can be overwhelming is the sheer number of formats and the fact that they can be nested in different ways.

At the lowest level you have a set of whitespace-separated values on a single line. Those lines combine into blocks, and how the blocks combine and nest within each other is the complex part. This type of output is designed for human reading and was never intended to be "scraped" back into machine-readable form.

First, I would contact the author of the software and find out if there is an alternate output format available, such as XML or CSV. If done correctly (i.e. not just the print format wrapped in clumsy XML, or with commas replacing whitespace), this would be much easier to handle. Failing that, I would try to come up with a hierarchical list of formats and how they nest. For example,

  1. ESTIMATED SAMPLE STATISTICS begins a block
  2. Within that block MEANS/INTERCEPTS/THRESHOLDS begins a nested block
  3. The next two lines are a set of column headings
  4. This is followed by one (or more?) rows of data, with a row header and data values

And so on. If you approach each of these problems separately, you will find that it is tedious but not complex. Think of each of the above steps as a module that tests the input to see if it matches; if it does, it calls other modules to test further for things that can occur "inside" the block, backtracking if it reaches something that doesn't match what it expects (this is called "recursive descent", by the way).
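
For concreteness, here is a minimal sketch of that approach, assuming the output has been read into a list of lines. The heading strings come from the steps above; the function names and the simplified table layout (a single line of column headings) are illustrative only:

def parse_table(lines, pos):
    """Parse column headings followed by rows of 'label value value ...'."""
    headers = lines[pos].split()
    pos += 1
    rows = {}
    while pos < len(lines) and lines[pos].strip():
        fields = lines[pos].split()
        rows[fields[0]] = dict(zip(headers, map(float, fields[1:])))
        pos += 1
    return rows, pos

def parse_estimated_sample_statistics(lines, pos):
    """Parse the block starting at lines[pos]; return (data, new_pos)."""
    assert lines[pos].strip() == 'ESTIMATED SAMPLE STATISTICS'
    pos += 1
    block = {}
    while pos < len(lines):
        stripped = lines[pos].strip()
        if not stripped:
            pos += 1                    # skip blank separator lines
        elif stripped == 'MEANS/INTERCEPTS/THRESHOLDS':
            block['means'], pos = parse_table(lines, pos + 1)
        else:
            break                       # unknown heading: let the caller decide
    return block, pos

Each sub-block gets its own small function, and the caller decides what to do when none of them matches; that is the backtracking step.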

Note that you will have to do something like this anyway, in order to build an in-memory version of the data (the "data model") on which you can operate.

Jim Garrison
A: 

You could try PyParsing. It enables you to write a grammar for what you want to parse, and its examples go beyond parsing programming languages. But I agree with Jim Garrison that your case doesn't seem to call for a real parser, because writing the grammar would be cumbersome. I would try a brute-force solution, e.g. splitting lines on whitespace. It's not foolproof, but we can assume the output is correct, so if a line has n headers, the next line will have exactly n values.
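
A minimal sketch of that brute-force idea, assuming a table whose first line holds the n column headers and whose following lines each hold a row label plus exactly n values (the function name and the blank-line convention are mine):

def parse_table(lines):
    """Yield (row_label, {header: value}) pairs from one table."""
    lines = iter(lines)
    headers = next(lines).split()          # n column headers
    for line in lines:
        fields = line.split()
        if not fields:                     # a blank line ends the table
            break
        label, values = fields[0], [float(v) for v in fields[1:]]
        yield label, dict(zip(headers, values))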

Adam Schmideg
+1  A: 

Yes, this is a pain to parse. You don't, however, actually need very many regular expressions. An ordinary split may be sufficient for breaking this document into manageable sequences of strings.

These are largely what I call "Head-Body" blocks of text. You have titles, a line of "-" characters, and then data.

What you want to do is collapse a "head-body" structure into a generator function that yields individual dictionaries.

def get_means_intercepts_thresholds(source_iter):
    """Precondition: current line is a "MEANS/INTERCEPTS/THRESHOLDS" line."""
    head = next(source_iter).strip().split()   # the column headings
    junk = next(source_iter).strip()           # the dashed underline
    assert set(junk) <= set('- ')
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break     # the next block's heading
        raw_data = line.strip().split()
        data = dict(zip(head, map(float, raw_data[1:])))
        yield int(raw_data[0]), data

def get_slopes(source_iter):
    """Precondition: current line is a "SLOPES" line."""
    head = next(source_iter).strip().split()   # the column headings
    junk = next(source_iter).strip()           # the dashed underline
    assert set(junk) <= set('- ')
    for line in source_iter:
        if len(line.strip()) == 0: continue
        if line.strip() == "SLOPES": break     # or whatever heading follows
        raw_data = line.strip().split()
        data = dict(zip(head, map(float, raw_data[1:])))
        yield raw_data[0], data

The point is to consume the head and the junk with one set of operations.

Then consume the rows of data which follow using a different set of operations.

Since these are generators, you can combine them with other operations.

def get_estimated_sample_statistics(source_iter):
    """Precondition: current line is the "ESTIMATED SAMPLE STATISTICS" line."""
    for line in source_iter:                   # skip blanks up to the heading
        if len(line.strip()) != 0:
            break
    assert line.strip() == "MEANS/INTERCEPTS/THRESHOLDS"
    for data in get_means_intercepts_thresholds(source_iter):
        yield data
    # The generator above stops after consuming the "SLOPES" heading,
    # so the slope rows follow immediately.
    for data in get_slopes(source_iter):
        yield data

Something like this may be better than regular expressions.
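
A hypothetical driver for these generators might look like this (the filename is illustrative):

with open('mplus_output.txt') as source:
    lines = iter(source)
    for line in lines:
        if line.strip() == "ESTIMATED SAMPLE STATISTICS":
            for row in get_estimated_sample_statistics(lines):
                print(row)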

S.Lott
+1  A: 

My suggestion is to do some rough massaging of the lines into a more useful form. Here are some experiments with your data:

from __future__ import print_function
from itertools import groupby
import string

counter = 0

# Split the file into paragraphs at blank lines, each paragraph into lines.
statslist = [statsblocks.split('\n')
             for statsblocks in open('mlab.txt').read().split('\n\n')]
print(len(statslist), 'blocks')

def blockcounter(lines):
    """groupby key: start a new group whenever a paragraph begins empty."""
    global counter
    if not lines[0]:
        counter += 1
    return counter

blocklist = [[block, list(stats)]
             for block, stats in groupby(statslist, blockcounter)]

for blockno, block in enumerate(blocklist):
    print(120 * '=')
    for itemno, line in enumerate(block[1]):   # the grouped paragraphs
        # Short paragraphs ending in a letter look like headers.
        if len(line) < 4 and any(line[-1].endswith(c) for c in string.ascii_letters):
            print('\n** DATA %i, HEADER (%r)**' % (blockno, line[-1]))
        else:
            print('\n** DATA %i, item %i, length %i **' % (blockno, itemno, len(line)))
        for ind, subdata in enumerate(line):
            if '___' in subdata:
                print(' *** Numeric data starts: ***')
            else:
                if 6 < len(subdata) < 16:
                    print('** TYPE: %s **' % subdata)
                print('%3i : %s' % (ind, subdata))

Tony Veijalainen
A: 

It turns out that tabular program output like this was one of my earliest applications of pyparsing. Unfortunately, that exact example dealt with a proprietary format that I can't publish, but there is a similar example posted here: http://pyparsing.wikispaces.com/file/view/dictExample2.py.
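
For flavor, here is a toy pyparsing grammar for a whitespace-separated table of labeled numeric rows; this is my own sketch, not the structure of the linked example:

from pyparsing import Group, OneOrMore, Regex, Word, alphanums

# a signed decimal number, converted to float as it is parsed
number = Regex(r'-?\d+(\.\d+)?').setParseAction(lambda t: float(t[0]))
label = Word(alphanums + '_/')
row = Group(label + OneOrMore(number))
table = OneOrMore(row)

print(table.parseString("X1  1.000  2.500\nX2  0.500  3.000"))
# -> [['X1', 1.0, 2.5], ['X2', 0.5, 3.0]]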

Paul McGuire