Hi there

I am trying to build a parser and save the results as an XML file, but I am running into problems.

Could you please have a look at my code?

Traceback: TypeError: expected string or buffer

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs
osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)
findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL |  re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL |  re.IGNORECASE).findall(soup)
for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)
for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()
+3  A: 

It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

You're trying to parse a BeautifulSoup object using a regular expression. findall expects a string, which is why you get the TypeError. Instead you should be using the findAll method on the soup, like this:

regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents

If you do actually want to parse the document as text with a regular expression then don't use BeautifulSoup - just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works as this is the preferred way to do it. See the documentation for more details.
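
For completeness, here is a rough sketch of the regex-only route mentioned above, using nothing but the standard library. It is only an illustration under some assumptions: SAMPLE_HTML is made-up markup standing in for the contents of OSCTEST.html, build_xml is a hypothetical helper name, and the h1 pattern is slightly adapted from the question so that it skips the rest of the opening tag before capturing.

```python
import re
from xml.dom.minidom import Document

# Hypothetical sample standing in for the contents of OSCTEST.html.
SAMPLE_HTML = '''
<h1 class="title metadata_title content_perceived_text">Denmark</h1>
<span class="content_text"><P>First article.</P></span>
<h1 class="title metadata_title content_perceived_text">Sweden</h1>
<span class="content_text"><P>Second article.</P></span>
'''

# Adapted from the question: [^>]*> consumes the rest of the opening tag,
# so only the text inside the element is captured.
TITLE_RE = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
ARTICLE_RE = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE)


def build_xml(html):
    """Build the root/countries document from a plain HTML string."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)

    # findall() is given a plain string here, not a BeautifulSoup object.
    for header in TITLE_RE.findall(html):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header.strip()))
        countries.appendChild(title_elem)

    for item in ARTICLE_RE.findall(html):
        art_elem = doc.createElement('artikel')
        text = item.replace('<P>', '').replace('</P>', '')
        art_elem.appendChild(doc.createTextNode(text.strip()))
        countries.appendChild(art_elem)
    return doc


doc = build_xml(SAMPLE_HTML)
print(doc.toprettyxml())
```

With a real file you would pass the string returned by read() to build_xml instead of SAMPLE_HTML. Keep in mind that patterns like these are brittle against real-world HTML (attribute order, nested spans, missing end tags), which is a large part of why the BeautifulSoup approach is preferable.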

Mark Byers
Ah yes, BUT it won't find the rest. I have real problems getting BS to find the contents within the tags.
Peter Nielsen
@Peter Nielsen: Can you explain what you mean by 'it won't find the rest'? Does my update answer your question?
Mark Byers
Well, using BS instead of regex leaves me with the problem of how to find the contents inside the tags and not just the entire tag + content. Ty for answering so speedily, by the way :-)
Peter Nielsen
@Peter Nielsen: "how i find the contents inside the tags". Try this: `for tag in soup.findAll('h1'): print tag.contents`
Mark Byers
Uhhhhh.. very, very, very nice. I just got tingly all over ;-) Ty very much.
Peter Nielsen
@Peter, since you like the answer you should upvote and accept it -- this is really fundamental SO etiquette!
Alex Martelli
Ah, thank you.. Got it ..
Peter Nielsen
The thing is, though, what should I do if one of the tags that I am looking for with BS does not have an end tag? It would seem that BS fails in such a case.
Peter Nielsen
@Peter Nielsen: I'm not exactly sure what the problem is. I know that BeautifulSoup can handle invalid HTML but I don't know all the details of how it handles missing end tags. It's rather difficult to go into details in comments due to length limits, lack of formatting, etc. I would suggest that you create a new question describing what new issue you have with some examples of how it fails and what you want, then I am sure that I or someone else on Stack Overflow will be able to help you.
Mark Byers