I am using Google's App Engine API

from google.appengine.api import urlfetch

to fetch a webpage. The result of

result = urlfetch.fetch("http://www.example.com/index.html")

is a response object whose content attribute (result.content) holds the HTML as a string. The problem is that the data I want to parse is not really in HTML form, so I don't think a Python HTML parser will work for me. I need to parse all of the plain text in the body of the HTML document. The only problem is that urlfetch seems to return the entire HTML document as a single string, with all newlines and extra spaces removed.

EDIT: Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines, it was the original webpage I was trying to parse that served the HTML file that way... END EDIT

If the document is something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>'

An HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data. But as you can see, the end of one line runs straight into the start of the next, and I don't know how to split them apart. I tried

result.content.split('\n')

and

result.content.split('\r')

but in both cases the resulting list had just 1 element. I don't see any option in Google's urlfetch function to keep the newlines.

Any ideas how I can parse this data? Maybe I need to fetch it differently?

Thanks in advance!

+2  A: 

Only suggestion I can think of is to parse it as if it has fixed width columns. Newlines are not taken into consideration for HTML.

If you have control of the source data, put it into a text file rather than HTML.
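
As a rough sketch of the fixed-width idea: assuming each record is 26 characters wide, the column boundaries can be read off the sample record in the question. The offsets and field names below are guesses made for illustration, not something the source page documents.

```python
# Column boundaries inferred from the sample record
# "AAA 123 888 2008-10-30 ABC". The field names are hypothetical.
FIELDS = [
    ("code", 0, 3),
    ("first", 4, 7),
    ("second", 8, 11),
    ("date", 12, 22),
    ("tag", 23, 26),
]

def parse_record(record):
    """Slice one fixed-width record into a dict of stripped field values."""
    return dict((name, record[start:end].strip()) for name, start, end in FIELDS)

# A blank column comes back as an empty string rather than shifting the others.
row = parse_record("BBB 987     2009-01-02 JSE")
print(row)
```

Slicing by offset, rather than splitting on spaces, is what keeps an empty column from shifting every field after it.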

Jimmy2Times
I don't have control over the source data... Maybe I can try to parse it, since the data is in fixed width format.
BrianH
+2  A: 

I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.

I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like

import re
data = re.findall('<body>([^<]*)</body>', result.content)[0]

then, it should be as easy as:

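width = 5  # record width; replace with the real width of your data
start = 0
end = width
while end < len(data):
    print(data[start:end])
    start = end  # the next record starts where this one ended
    end = end + width
print(data[start:])  # whatever is left over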

(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
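
The body-extraction pattern above assumes a bare, lower-case body tag. If the real page has attributes on the tag or upper-case tags, a slightly more tolerant sketch using only the standard re module could look like this. The helper name extract_body is hypothetical, and a regex is only adequate here because the body is plain text with no nested tags.

```python
import re

# Tolerates attributes on <body>, upper-case tags, and newlines inside the body.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)

def extract_body(html):
    """Return the text between the body tags, or None if there is no body."""
    match = BODY_RE.search(html)
    return match.group(1) if match else None

html = "<html><head></head><body>AAA 123 888 2008-10-30 ABC</body></html>"
print(extract_body(html))
```

Returning None instead of indexing with [0] avoids an IndexError when the page has no body at all.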

Roberto Liffredo
+1  A: 

Once you have the body text as a single, long string, you can break it up as follows. This presumes that each record is 26 characters.

body = "AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA"
for i in range(0, len(body), 26):
    line = body[i:i+26]
    # parse the line
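
One thing worth guarding against, in case the 26-character assumption turns out to be wrong for some pages: a body whose length is not a multiple of the record width silently produces a short final record. A small sketch that makes that visible (split_records is a hypothetical helper, not part of any library):

```python
def split_records(body, width=26):
    """Split body into width-sized records, failing loudly on a partial one."""
    if len(body) % width != 0:
        raise ValueError(
            "body length %d is not a multiple of %d" % (len(body), width)
        )
    return [body[i:i + width] for i in range(0, len(body), width)]

# Two complete 26-character records taken from the question's sample.
body = "AAA 123 888 2008-10-30 ABC" + "BBB 987     2009-01-02 JSE"
for record in split_records(body):
    print(record)
```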
S.Lott
A: 

EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, never mind my answer, it's not actually relevant.


If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):

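def generate_lines(datastring):
    while datastring:
        # Split off the first 5 fields; anything left over lands at index 5.
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None  # no remainder, so this is the last line
        yield splitresult[:5]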

for line in generate_lines(data):
    process_data_line(line)

Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.
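
For data that really is separator-delimited, a parameterized variant of the same idea might look like the sketch below. It assumes no field is ever blank, which does not hold for the sample in the question (some columns are empty), so treat it as an illustration of the parameter passing rather than a fix for that data. The name group_fields is hypothetical.

```python
def group_fields(datastring, columns=5, sep=None):
    """Yield lists of `columns` fields; sep=None splits on any whitespace run."""
    fields = datastring.split(sep)
    for i in range(0, len(fields), columns):
        yield fields[i:i + columns]

data = "AAA 123 888 2008-10-30 ABC BBB 987 332 2009-01-02 JSE"
for line in group_fields(data):
    print(line)
```

With a different layout, something like group_fields(text, columns=2, sep=",") would group comma-separated fields in pairs.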

Jeff Shannon
+1  A: 

Further suggestions for splitting the string s into 26-character blocks:

As a list:

>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987     2009-01-02 JSE',
 'A4A     288            AAA']

As a generator:

>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987     2009-01-02 JSE
A4A     288            AAA

Replace range() with xrange() in Python 2.x if s is very long.

akaihola