ansaurus

Question

How to extract data from an irregularly formatted data file in Python

Answer 1

A:

You're going to need to define the file format explicitly, and then you should be able to parse that easily.

The first step is figuring out where the data you need is defined. Then throw away everything up to that point. Then start reading.

If eng_tot can move, you need to figure out where it sits in the block of useful data. So read the header line, then entries = line.split(); location = entries.index('eng_tot'), and read the entry at that location from the corresponding line of the output data.
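
A minimal sketch of that lookup, assuming the header line and the data line split into the same number of whitespace-separated columns. The sample lines and the helper name value_under are made up for illustration; the real file layout may differ.

```python
# Hypothetical header and data lines; the real file's layout may differ.
header_line = "step   temp   eng_tot   press"
data_line = "100    1.50   -42.75    0.98"

def value_under(header_line, data_line, name):
    """Return the entry in data_line that sits in the same column as name."""
    entries = header_line.split()
    location = entries.index(name)  # raises ValueError if name is absent
    return data_line.split()[location]

print(value_under(header_line, data_line, "eng_tot"))  # prints -42.75
```

If a data line can be missing a column, the positions shift and the index from the header no longer lines up, so this only works when both lines have the same columns.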

The key is that you need to break down your problem into steps that you know you can do. When looking at something new it's easy to get overwhelmed. If you can just start doing something, you'll find that you can reach the solution without too much trouble after all.

dash-tom-bang 2010-06-21 16:12:15

The problem is the quantity isn't marked by the string eng_tot. While it is always in the same place relative to that string, I don't know how to access it in that way.

Maimon 2010-06-21 16:31:03

Answer 2

+1 A:

I'd probably do this:

iterate over lines in the output
search for one containing eng_tot:
- if 'eng_tot' in line.split(): process_blocks
gobble up lines until one matches all dashes (with optional spaces on either side)
- if re.match(r"\s*-+\s*$", line): process_metrics_block
process the first line of metrics:
- cut the first column off the line (it makes it harder to parse, because it might not be there)
  - sanitized_line = line[8:]
  - eng_total = sanitized_line.split()[0] , the first column is now eng_total
skip lines until you reach another line of dashes, then start again
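
One thing to watch with the dashes test: re.match only anchors at the start of the string, so a pattern for a line of dashes can also match a data line that merely begins with a minus sign unless the pattern ends in $. A small sketch with made-up sample lines:

```python
import re

# re.match only anchors at the start of the string, so a pattern meant to
# match a whole line needs an explicit "$" at the end.
loose = re.compile(r"\s*-+\s*")
strict = re.compile(r"\s*-+\s*$")

samples = [
    "  ---------  \n",   # a separator line
    "-12.5   3.0\n",     # data starting with a negative number
    "\n",                # blank line
]

for line in samples:
    print(repr(line), bool(loose.match(line)), bool(strict.match(line)))
```

The separator line matches both patterns, the blank line matches neither, and the line starting with -12.5 matches only the unanchored pattern.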

After seeing your edits:

You need to import the re (regular expression) module at the top of the file: import re
The process_blocks and process_metrics_block were pseudocode. Those don't exist unless you define them. :) You don't need those exact functions; you can avoid them by using basic looping (while) and conditional (if) statements.
You'll have to make sure you understand what you're doing, not just copy from Stack Overflow! :)
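
As a rough sketch of the plain loop-and-if approach: collect_totals and the sample lines below are made up for illustration, and the sketch assumes the value wanted is the second whitespace-separated field of the first line after each line of dashes.

```python
import re

SEPARATOR = re.compile(r"\s*-+\s*$")  # a line made up only of dashes

def collect_totals(lines):
    """Return the 2nd field of the first line after each dashed separator."""
    totals = []
    seen_header = False
    want_value = False
    for line in lines:
        if not seen_header:
            # Ignore everything before the header naming eng_tot.
            if 'eng_tot' in line.split():
                seen_header = True
        elif SEPARATOR.match(line):
            want_value = True
        elif want_value:
            fields = line.split()
            if len(fields) > 1:
                totals.append(fields[1])
            want_value = False
    return totals

# Made-up sample input, for illustration only.
sample = [
    "some preamble\n",
    "step  eng_tot  temp\n",
    " ---------------- \n",
    "1     -10.5    300\n",
    "2     -10.7    301\n",
    " ---------------- \n",
    "3     -11.2    302\n",
]
print(collect_totals(sample))  # prints ['-10.5', '-11.2']
```

The two flags track where you are in the file, so no helper functions are needed. A line that follows a separator but has fewer than two fields is skipped rather than raising an error.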

It looks like you're trying to do something like this. It seems to work, but I'm sure with some effort, you can come up with something nicer:

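import re

def find_header(lines):
  """Return the index of the line containing the eng_tot column name."""
  for (i, line) in enumerate(lines):
    if 'eng_tot' in line.split():
      return i
  return None

def find_next_separator(lines, start):
  """Return the index of the first all-dashes line after index start."""
  for (i, line) in enumerate(lines[start+1:]):
    # Anchor with $ so a data line starting with a minus sign is not
    # mistaken for a separator.
    if re.match(r"\s*-+\s*$", line):
      return i + start + 1
  return None

if __name__ == '__main__':
  totals = []
  with open('so.txt') as infile:
    lines = infile.readlines()

  # find_next_separator looks strictly after the index it is given,
  # so this search begins two lines below the header.
  header = find_header(lines)
  start = find_next_separator(lines, header+1)

  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break

    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

    # Pull out 2nd column from 1st line of metrics.
    eng_total = metrics_block[0].split()[1]
    totals.append(eng_total)

    start = end

  print(totals)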

You can use a generator to be a little more Pythonic:

# Reuses find_header and find_next_separator from the code above.
def metric_block_iter(lines):
  """Yield (start, end) index pairs of consecutive separator lines."""
  start = find_next_separator(lines, find_header(lines)+1)
  while True:
    end = find_next_separator(lines, start+1)
    if end is None: break
    yield (start, end)
    start = end


if __name__ == '__main__':
  totals = []
  with open('so.txt') as infile:
    lines = infile.readlines()

  for (start, end) in metric_block_iter(lines):
    # Pull out block, after line of dashes.
    metrics_block = lines[start+1:end]

    # Pull out 2nd column from 1st line of metrics.
    eng_total = metrics_block[0].split()[1]
    totals.append(eng_total)

  print(totals)

Stephen 2010-06-21 16:19:45

Here's the code I came up with, but I keep getting a syntax error when I run it: infile = OUTPUT Eng_Total = [] for line in OUTPUT: if 'eng_tot' in line.split(): process_blocks if re.match("\s+-+\s+", line): proccess_metrics_block sanitized_line = line[8:] eng_total = line.split()[0] Eng_Total.append(eng_total)

Maimon 2010-06-21 16:38:54

Here's the code I came up with, but it keeps giving me a syntax error: infile = OUTPUT Eng_Total = [] for line in OUTPUT: if 'eng_tot' in line.split(): process_blocks if re.match("\s+-+\s+", line): proccess_metrics_block sanitized_line = line[8:] eng_total = line.split()[0] Eng_Total.append(eng_total)

Maimon 2010-06-21 16:40:49

@Maimon : update your question with the code so you can format it properly. Comment the code with the places you see syntax errors.

Stephen 2010-06-21 16:51:57

@Maimon : wrote some notes based on your edits.

Stephen 2010-06-21 17:10:45

Well that was kind of my question: how do I process those blocks so that I can get the values I want? Like I said, I'm pretty new to this. I'm still learning the basics and I need this specific piece of code, and not much more.

Maimon 2010-06-21 17:18:40

@Maimon : updated.

Stephen 2010-06-21 20:52:23

For both of these forms I'm getting this error: eng_total = metrics_block[0].split()[1] IndexError: list index out of range

Maimon 2010-06-22 14:40:55

@Maimon : dunno, it worked on my test data. Looks like you're going to have to debug. Try `print metrics_block` to see what it's failing to parse.

Stephen 2010-06-22 14:57:18

When I run print metrics_block[0] and comment out the last part of the code, I get the first line of data (like I want), but every other line is an empty set. However, when I run print metrics_block[0].split()[1], it prints out until the 9th line. I looked back at my previous run with just metrics_block[0], and the 9th line is this: ['\n'].

Maimon 2010-06-22 15:41:54
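
The ['\n'] in that last comment is consistent with the error: '\n'.split() returns an empty list, so indexing it with [1] raises IndexError. One possible guard, as a sketch only. first_total is a hypothetical helper, not part of the answer's code, and it assumes it is acceptable to skip blank or short lines at the top of a block.

```python
def first_total(metrics_block):
    """Return the 2nd field of the first line that has one, else None."""
    for line in metrics_block:
        fields = line.split()
        if len(fields) > 1:
            return fields[1]
    return None

print(first_total(['\n']))                         # prints None
print(first_total(['\n', '  1   -10.5   300\n']))  # prints -10.5
print(first_total([]))                             # prints None
```

In the loop, this would replace the direct metrics_block[0].split()[1] lookup, appending to totals only when the result is not None.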
