A webpage has a product code I need to retrieve, and it is in the following HTML section:

<table...>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>

</tr>
</table>

So I guess the best way to do this would be to first find the HTML element whose text is 'Product Code#', and then take the second font tag in the same TD.

Ideas?

A: 

You could use this regex (or something similar):

<td>\n\ <font\ size="2">Product\ Code\#</font>\n\ <br>\n\ <font\ size="1">(?<ProductCode>.+?)</font>\n\ </td>

You could probably remove some of the escapes depending on your RegExp engine... I was being cautious.
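As a rough sketch of how this could look in Python, assuming the standard `re` module: Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, and `\s*` is used here in place of the literal newlines and spaces so that small whitespace differences do not break the match. The sample markup and the name `find_product_codes` are made up for illustration.

```python
import re

# Sample markup modelled on the question; the real page may differ.
html = """<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>

</tr>
</table>"""

# \s* tolerates any whitespace between tags; (?P<ProductCode>...) is
# Python's syntax for a named capture group.
PRODUCT_CODE_RE = re.compile(
    r'<font\s+size="2">Product\s+Code#</font>\s*'
    r'<br>\s*'
    r'<font\s+size="1">(?P<ProductCode>.+?)</font>'
)


def find_product_codes(text):
    """Return every product code matched in text, in order of appearance."""
    return [m.group("ProductCode") for m in PRODUCT_CODE_RE.finditer(text)]


print(find_product_codes(html))  # ['2342343']
```

This only holds up as long as the page keeps exactly this tag layout and attribute order.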

Tommy
+1  A: 

Assuming soup is your BeautifulSoup instance:

int(''.join(soup("font", size="1")[0](text=True)))

Or, if you need to get multiple product codes:

[int(''.join(font(text=True))) for font in soup("font", size="1")]
icktoofay
Fails if there are other 'size="1"' columns.
Paul McGuire
@Paul: True, but there aren't any, and it could be restricted to the table it's in if necessary.
icktoofay
A: 

Don't use regular expressions to parse HTML. I would use the following XPath for this task:

//TABLE/TR/TD/FONT[@size='1']

Or, if the font size attribute is not guaranteed to be there and equal to 1:

//FONT[text()='Product Code#']/parent::*/FONT[2]
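A minimal sketch of both lookups in Python, using only the standard library's `xml.etree.ElementTree`. Several assumptions apply: ElementTree is an XML parser, so the sample below is a well-formed variant of the question's markup (`<br/>` instead of `<br>`); name tests are case-sensitive, so the tag names are lowercase to match the sample; and ElementTree implements only a subset of XPath (no `text()` or `parent::`), so the second expression is rephrased as "the second font child of a td that has a font child with that text". For real-world HTML, an HTML-aware parser with full XPath support (lxml, for example) would be the more likely choice.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed markup (note <br/>), because ElementTree parses XML.
xhtml = """<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br/>
 <font size="1">2342343</font>
 </td>
</tr>
</table>"""

root = ET.fromstring(xhtml)

# First expression: every <font size="1"> anywhere below the root.
by_size = [font.text for font in root.findall(".//font[@size='1']")]

# Second expression, rephrased for ElementTree's XPath subset: the second
# <font> inside any <td> that has a <font> child whose text is exactly
# 'Product Code#'.
by_label = [
    font.text
    for font in root.findall(".//td[font='Product Code#']/font[2]")
]

print(by_size)   # ['2342343']
print(by_label)  # ['2342343']
```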
jhandl
+1  A: 

My strategy is:

  • Find text nodes matching the string "Product Code#"
  • For each such node, get the parent <font> element and find the parent's next sibling <font> element
  • Insert the contents of the sibling element into a list

The code:

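# BeautifulSoup 4 (package bs4); BeautifulSoup 3 used
# `from BeautifulSoup import BeautifulSoup` instead.
from bs4 import BeautifulSoup


with open("products.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# For each "Product Code#" text node, step up to its <font> parent,
# then take the contents of the next sibling <font>.
product_codes = [tag.parent.find_next_siblings('font')[0].contents[0]
                 for tag in
                 soup.find_all(string='Product Code#')]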
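
The same three steps can also be sketched without any third-party library, assuming Python 3's `html.parser` module. The class name `ProductCodeParser` and the sample markup are hypothetical, chosen only for this illustration; the parser remembers that it has seen the label and records the text of the next `<font>` in the same cell.

```python
from html.parser import HTMLParser


class ProductCodeParser(HTMLParser):
    """Collect the text of the <font> that follows a 'Product Code#' label."""

    LABEL = "Product Code#"

    def __init__(self):
        super().__init__()
        self.codes = []            # extracted codes, in document order
        self._in_font = False      # currently inside a <font> element
        self._expect_code = False  # label seen; next <font> text is the code

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._in_font = True
        elif tag == "td":
            # A new cell starts: forget any label seen in an earlier cell.
            self._expect_code = False

    def handle_endtag(self, tag):
        if tag == "font":
            self._in_font = False
        elif tag == "td":
            self._expect_code = False

    def handle_data(self, data):
        text = data.strip()
        if not self._in_font or not text:
            return
        if text == self.LABEL:
            self._expect_code = True
        elif self._expect_code:
            self.codes.append(text)
            self._expect_code = False


# Sample markup modelled on the question.
sample = """<table>
<tr>
 <td>
 <font size="2">Product Code#</font>
 <br>
 <font size="1">2342343</font>
 </td>
</tr>
</table>"""

parser = ProductCodeParser()
parser.feed(sample)
print(parser.codes)  # ['2342343']
```

Unlike the BeautifulSoup version, this yields plain strings and silently skips a label that has no following `<font>` in its cell.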
Jesse Dhillon