I am using BeautifulSoup in Python and am having trouble replacing some tags. I am finding <div> tags and checking for children. If those children do not have children (are a text node of NODE_TYPE = 3), I am copying them to be a <p>.

from BeautifulSoup import Tag, BeautifulSoup

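class bar:

 def __init__(self, input):
  self.input = input
  self.soup = BeautifulSoup(self.input)
  self.foo()

 def foo(self):
  # every tag in the document, collected once before the loop starts
  elements = self.soup.findAll(True)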

  for node in elements:

    # ....other stuff here if not <div> tags.

    if node.name.lower() == "div":
      if not node.find('a'):
        newTag = Tag(self.soup, "p")
        newTag.setString(node.text)
        node.replaceWith(newTag)
        nodesToScore.append(newTag)
      else:
        for n in node.findAll(True):
          if n.getString():  # False if has children
            newTag = Tag(self.soup, "p")
            newTag.setString(n.text)
            n.replaceWith(newTag)

I'm getting an AttributeError:

  File "file.py", line 125, in function
    node.replaceWith(newTag)
  File "BeautifulSoup.py", line 131, in replaceWith
    myIndex = self.parent.index(self)
AttributeError: 'NoneType' object has no attribute 'index'

I do the same replacing on node higher up in the for loop and it works correctly. I'm assuming it's having problems because of the additional iterating through node as n.

What am I doing wrong, or what would be a better way to do this? Thanks! P.S. I'm using Python 2.5 on Google App Engine with BeautifulSoup 3.0.8.1.

A: 

The error says:

    myIndex = self.parent.index(self)
AttributeError: 'NoneType' object has no attribute 'index'

This code occurs on line 131 of BeautifulSoup.py. It says that self.parent is None.

Looking at the surrounding code shows that self should equal node in your code, since node is calling its replaceWith method. (Note: The error message says node.replaceWith, but the code you posted shows n.replaceWith. The code you posted does not correspond to the error message/traceback.) So apparently node.parent is None.

You could probably avoid the error by placing

if node.parent is not None:

at some point in the code before node.replaceWith is called.
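
To see why such a guard helps, here is a minimal sketch that uses a toy Node class rather than BeautifulSoup itself. The class, its methods, and the variable names are assumptions made for illustration; only the parent lookup in replaceWith mirrors the failing line in the traceback. It shows that a node which has already been replaced once no longer has a parent, so a second replaceWith on it fails unless it is skipped.

```python
class Node(object):
    """Toy tree node; a stand-in for illustration, not BeautifulSoup's Tag."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []

    def append(self, child):
        child.parent = self
        self.contents.append(child)

    def replaceWith(self, new):
        # Like the failing line in the traceback, this needs a parent.
        my_index = self.parent.contents.index(self)
        self.parent.contents[my_index] = new
        new.parent = self.parent
        self.parent = None  # the old node is now detached from the tree


root = Node("body")
div = Node("div")
root.append(div)

nodes = [div, div]  # the same node reached twice, as can happen with nested loops
replaced = 0
for node in nodes:
    if node.parent is not None:  # skip nodes already removed from the tree
        node.replaceWith(Node("p"))
        replaced += 1
```

In this sketch the second visit to div is skipped; without the guard it would raise an AttributeError on None, much like the one in the question.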

Edit: I suggest you use print statements to investigate where in the HTML you are when node.parent is None (i.e. where the error is occurring). Maybe use print node.contents or print node.previous.contents or print node.next.contents to see where you are. Once you see the HTML it might become obvious what pathological situation you are in which is causing node.parent to be None.
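
One way to narrow this down is to collect the nodes whose parent is None before touching them. The helper below is a hypothetical sketch (find_detached is not part of BeautifulSoup) written for Python 3; it only relies on each object having a parent attribute, so the demo uses plain stand-in objects instead of real tags.

```python
from types import SimpleNamespace


def find_detached(nodes):
    """Return the nodes whose parent is None (no longer attached to a tree)."""
    return [n for n in nodes if getattr(n, "parent", None) is None]


# Stand-in objects for the demo; with BeautifulSoup you would pass the
# list of tags you are iterating over instead.
body = SimpleNamespace(name="body", parent="document")
orphan = SimpleNamespace(name="div", parent=None)

detached_names = [n.name for n in find_detached([body, orphan])]
```

Printing or logging the contents of each detached node should then show which part of the HTML triggers the problem.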

unutbu
Thanks for noticing the `node.replaceWith` vs the `n.replaceWith`. I added the additional code where that is referenced. The `if not` runs fine when the `else` is not present which is why I thought it wasn't relevant but I was wrong.
feesta
@feesta: This is hard to debug without seeing the HTML. I've added an edit (above) suggesting how you might be able to find the HTML which corresponds to the problem.
unutbu
@~ubuntu Thanks! It is working now! I added `if node.parent is None: (log node) else: (the rest)`. I found the bad HTML was `div` tags with only whitespace. That is part of what I'm stripping out. Thanks again!
feesta
@feesta: Ah yes! That's a good way to debug it; better than my suggestion. Two thumbs up :)
unutbu