I've got a comma-separated list in a table cell in an HTML document, but some of the items in the list are linked:

<table>
  <tr>
    <td>Names</td>
    <td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td>
  </tr>
</table>

I've been using Beautiful Soup to parse the HTML, and I can get to the table, but what is the best way to split the cell and return a data structure roughly like:

[
  {'name':'Fred'},
  {'name':'John'},
  {'name':'Barry'},
  {'name':'Roger', 'url':'http://www.example.com/'},
  {'name':'James'},
]
+10  A: 

This is one way you could do it:

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup('''<table>
  <tr>
    <td>Names</td>
    <td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td>
  </tr>
</table>''')

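result = []
# Iterating over the second <td> yields its children: text nodes and nested tags.
for tag in soup.table.findAll('td')[1]:
  if isinstance(tag, BeautifulSoup.NavigableString):
    # Plain text: split on commas and skip the empty pieces around the link.
    for name in tag.string.split(','):
      name = name.strip()
      if name:
        result.append({ 'name': name })
  else:
    # An <a> tag: keep its text as the name and its href as the url.
    result.append({ 'name': tag.string.strip(), 'url': tag["href"] })

print result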
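
For readers without Beautiful Soup, the same split can be sketched with only the Python 3 standard library. This is not part of the original answer: it swaps in `html.parser.HTMLParser`, and the names `NameListParser`, `parse_names`, and the `column` parameter are hypothetical, chosen for illustration. It assumes the names sit in the second cell of each row and that a linked name contains no commas.

```python
from html.parser import HTMLParser


class NameListParser(HTMLParser):
    """Collect comma-separated names from one table column (hypothetical helper)."""

    def __init__(self, column=1):
        super().__init__()
        self.column = column      # zero-based index of the <td> to read
        self.result = []
        self._td_index = -1       # index of the current <td> within its row
        self._in_target = False   # True while inside the chosen <td>
        self._href = None         # href of the enclosing <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._td_index = -1
        elif tag == "td":
            self._td_index += 1
            self._in_target = self._td_index == self.column
        elif tag == "a" and self._in_target:
            # A link without an href leaves _href as None, so its text
            # is treated like plain text below.
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_target = False
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        if not self._in_target:
            return
        if self._href is not None:
            # Text inside a link: one name, stored with its URL.
            name = data.strip()
            if name:
                self.result.append({"name": name, "url": self._href})
        else:
            # Plain text: split on commas and drop empty pieces.
            for name in data.split(","):
                name = name.strip()
                if name:
                    self.result.append({"name": name})


def parse_names(html, column=1):
    parser = NameListParser(column)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result


html = '''<table>
  <tr>
    <td>Names</td>
    <td>Fred, John, Barry, <a href="http://www.example.com/">Roger</a>, James</td>
  </tr>
</table>'''

names = parse_names(html)
print(names)
```

The logic mirrors the loop above: text nodes are split on commas, and link text is stored together with its `href`.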
Mark Byers
+1 nice solution
vikingosegundo
+1 it's really cool
atv
Nice solution indeed! One small note: I would replace "type(tag) is BeautifulSoup.NavigableString" with "isinstance(tag, BeautifulSoup.NavigableString)".
taleinat
I updated it with your suggestion.
Mark Byers