ansaurus

Question

Parsing fixed-format data embedded in HTML in python

Answer 1

+2 A:

The only suggestion I can think of is to parse it as if it had fixed-width columns. HTML does not treat newlines as significant, so you cannot rely on them to separate records.

If you have control of the source data, put it into a text file rather than HTML.
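
To make the fixed-width idea concrete, here is a minimal sketch that cuts one record into fields by position. The column boundaries are an assumption inferred from the sample records quoted elsewhere in this thread (three 3-character codes, a 10-character date and a final 3-character code, each separated by one space); `FIELD_SLICES` and `parse_record` are hypothetical names, not part of the question.

```python
# Assumed layout of one 26-character record (inferred from the thread's samples):
# cols 0-2, 4-6, 8-10 are short codes, cols 12-21 a date, cols 23-25 a final code.
FIELD_SLICES = [slice(0, 3), slice(4, 7), slice(8, 11), slice(12, 22), slice(23, 26)]

def parse_record(record):
    """Cut one fixed-width record into fields and strip the space padding."""
    return [record[s].strip() for s in FIELD_SLICES]

print(parse_record("BBB 987     2009-01-02 JSE"))
# -> ['BBB', '987', '', '2009-01-02', 'JSE']
```

Blank columns come back as empty strings, so a missing value stays in its own position instead of shifting the fields after it.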

Jimmy2Times 2009-01-03 20:36:30

I don't have control over the source data... Maybe I can try to parse it, since the data is in a fixed-width format.

BrianH 2009-01-03 20:57:09

Answer 2

+2 A:

I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

import re
# result is assumed to hold the page's HTML source as a string
data = re.findall(r'<body>([^<]*)</body>', result)[0]

Then it should be as easy as:

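start = 0
end = 5
while end < len(data):
    print(data[start:end])
    start = end  # the next chunk begins exactly where this one ended
    end = end + 5
print(data[start:])  # whatever is left after the last full chunk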

(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
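
For the extraction step, a slightly more tolerant sketch is shown here. It assumes the page may use uppercase tags or attributes on the body tag, and that any line breaks inside the body are not part of the data and can be dropped; both are assumptions about the source page, and `extract_body` and `sample_page` are hypothetical names.

```python
import re

def extract_body(html):
    """Return the text between the body tags as one continuous string."""
    match = re.search(r'<body[^>]*>(.*?)</body>', html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return ''
    # Assumption: line breaks are an artifact of the HTML source, not data.
    return match.group(1).replace('\r', '').replace('\n', '')

sample_page = ('<HTML><BODY>\n'
               'AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE\n'
               '</BODY></HTML>')
data = extract_body(sample_page)
print(len(data))
# -> 52
```

Spaces are left untouched on purpose, because in fixed-width data they are padding that keeps the columns aligned.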

Roberto Liffredo 2009-01-03 21:13:07

Answer 3

+1 A:

Once you have the body text as a single, long string, you can break it up as follows. This presumes that each record is 26 characters.

body= "AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA"
for i in range(0, len(body), 26):
    line = body[i:i+26]
    # parse the line
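
Since the loop above silently produces a short final record when the body length is not a multiple of 26, one option is to check the length first. This is only a sketch; `split_records` and `RECORD_LENGTH` are hypothetical names, and raising an error (rather than skipping the tail) is a design choice, not something the original answer prescribes.

```python
RECORD_LENGTH = 26  # record size stated in the answer above

def split_records(body, size=RECORD_LENGTH):
    """Yield consecutive size-character records, refusing a partial tail."""
    if len(body) % size:
        raise ValueError(
            "body length %d is not a multiple of %d" % (len(body), size))
    for i in range(0, len(body), size):
        yield body[i:i + size]

body = "AAA 123 888 2008-10-30 ABC" "BBB 987     2009-01-02 JSE"
for line in split_records(body):
    print(line)
```

Because this is a generator, the length check runs when iteration starts, not when `split_records` is called.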

S.Lott 2009-01-04 00:18:04

Answer 4

A:

EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, never mind my answer, it's not actually relevant.

If you know that each line is 5 space-separated columns, then (once you've stripped out the HTML) you could do something like (untested):

def generate_lines(datastring):
    while datastring:
        # Split off the first five fields; any remainder ends up at index 5.
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]

for line in generate_lines(data):
    process_data_line(line)

Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.
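
A sketch of the parameterised version mentioned above, with the separator and column count passed in. The parameter names `sep` and `columns` are assumptions for illustration, and the sample string is made up.

```python
def generate_lines(datastring, sep=' ', columns=5):
    """Yield lists of up to `columns` fields split on `sep`."""
    while datastring:
        # maxsplit=columns leaves the unprocessed remainder at index `columns`.
        parts = datastring.split(sep, columns)
        datastring = parts[columns] if len(parts) > columns else None
        yield parts[:columns]

rows = list(generate_lines("a,b,c,d,e,f", sep=',', columns=3))
print(rows)
# -> [['a', 'b', 'c'], ['d', 'e', 'f']]
```

As the EDIT at the top of this answer says, this only helps when every field is non-empty and separated by exactly one delimiter, which is not the case for padded fixed-width data.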

Jeff Shannon 2009-01-04 01:08:14

Answer 5

+1 A:

Further suggestions for splitting the string s into 26-character blocks:

As a list:

>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987     2009-01-02 JSE',
 'A4A     288            AAA']

As a generator:

>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print(line)
AAA 123 888 2008-10-30 ABC
BBB 987     2009-01-02 JSE
A4A     288            AAA

In Python 2.x, replace range() with xrange() if s is very long. In Python 3, range() is already lazy, so no change is needed.
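
As a different technique from the slicing above, the standard library's struct module can split the string into records and fields in one pass. This is a sketch for Python 3 that assumes ASCII data and the column layout visible in the sample output (3, 3, 3, 10 and 3 characters, each separated by one space); the name `RECORD` is hypothetical.

```python
import struct

# 's' fields are kept, 'x' skips the single separator character between them.
RECORD = struct.Struct('3sx3sx3sx10sx3s')  # 26 bytes per record

s = ("AAA 123 888 2008-10-30 ABC"
     "BBB 987     2009-01-02 JSE"
     "A4A     288            AAA")

rows = [
    [field.decode('ascii').strip() for field in record]
    for record in RECORD.iter_unpack(s.encode('ascii'))
]
for row in rows:
    print(row)
# -> ['AAA', '123', '888', '2008-10-30', 'ABC']
# -> ['BBB', '987', '', '2009-01-02', 'JSE']
# -> ['A4A', '', '288', '', 'AAA']
```

iter_unpack raises struct.error if the data length is not a multiple of the record size, which doubles as a sanity check on the input.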

akaihola 2009-01-27 21:52:46
