While using BeautifulSoup to parse a table in HTML, I find that every other row starts with

<tr class="row_k">

instead of a plain tr tag without a class.

Sample HTML

<tr class="row_k"> 
<td><img src="some picture url" alt="Item A"></td> 
<td><a href="some url"> Item A</a></td> 
<td>14.8k</td> 
<td><span class="drop">-555</span></td> 
<td> 
<img src="some picture url" alt="stuff" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="Max llll"> 
</td> 
</tr> 
<tr> 
<td><img src="some picture url" alt="Item B"></td> 
<td><a href="some url"> Item B</a></td> 
<td>64.9k</td> 
<td><span class="rise">+165</span></td> 
<td> 
<img src="some picture url" alt="stuff" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="max llll"> 
</td> 
</tr> 
<tr class="row_k"> 
<td><img src="some picture url" alt="Item C"></td> 
<td><a href="some url"> Item C</a></td> 
<td>4,000</td> 
<td><span class="rise">+666</span></td> 
<td> 
<img src="some picture url" title="stuff"> 
</td> 
<td> 
<img src="some picture url" alt="Maximum lllle"> 

The text I wish to extract is 14.8k, 64.9k, and 4,000.

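import re
import urllib2
import StringIO
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
this1 = urllib2.urlopen('my url').read()
this_1 = BeautifulSoup(this1)
this_1a = StringIO.StringIO()
# Only rows with class "row_k" are matched here, so rows without a class are skipped
for row in this_1.findAll("tr", { "class" : "row_k" }):
  # col.string is None for cells that contain nested tags, hence the fallback to ''
  for col in row.findAll(re.compile('td')):
    this_1a.write(col.string if col.string else '')
Item_this1 = this_1a.getvalue()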

I get the feeling that this code is poorly written. Is there a more flexible tool I could use, such as an XML parser, that someone could suggest?

I am still open to any answers that use BeautifulSoup.
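One way to stay with BeautifulSoup is to select every tr regardless of its class and read the third td of each row. The following is only a minimal sketch under stated assumptions: it uses the newer bs4 package (find_all instead of findAll) rather than the BeautifulSoup 3 import in the question, SAMPLE is a shortened copy of the HTML above with placeholder URLs, and price_cells is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample rows from the question (placeholder URLs).
SAMPLE = """
<tr class="row_k">
<td><img src="a.png" alt="Item A"></td>
<td><a href="#"> Item A</a></td>
<td>14.8k</td>
<td><span class="drop">-555</span></td>
</tr>
<tr>
<td><img src="b.png" alt="Item B"></td>
<td><a href="#"> Item B</a></td>
<td>64.9k</td>
<td><span class="rise">+165</span></td>
</tr>
<tr class="row_k">
<td><img src="c.png" alt="Item C"></td>
<td><a href="#"> Item C</a></td>
<td>4,000</td>
<td><span class="rise">+666</span></td>
</tr>
"""


def price_cells(page_source):
    """Return the text of the third cell of every table row (hypothetical helper)."""
    soup = BeautifulSoup(page_source, "html.parser")
    prices = []
    # Match all rows, with or without class="row_k".
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Skip rows that are too short to have a third cell.
        if len(cells) >= 3:
            prices.append(cells[2].get_text(strip=True))
    return prices


prices = price_cells(SAMPLE)
print(prices)
```

Selecting by position assumes the value of interest is always in the third column; if the real page has header rows or a different layout, the index would need adjusting.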

+2  A: 

I am still learning a lot, but I suggest you try lxml. I will take a stab at this; I think it will mostly get you there, though there may be some niceties I am not certain about.

Assuming this1 is a string:

from lxml.html import fromstring
this1_tree = fromstring(this1)
# Pair each cell with its relative position in the document
all_cells = list(enumerate(this1_tree.cssselect('td')))

The only thing I am not totally certain about is whether you should test the key, the value, or text_content() for each cell to find out whether it has the string you are seeking in the anchor reference or text. That is why I wanted a sample of your HTML. One of those should work:

the_cell_before_numbers=[]
for cell in all_cells:
    if 'Item' in cell[1].text_content():
        the_cell_before_numbers.append(cell[0])

Now that you have the position of the cell before each number, you can get the value you need from the text content of the next cell:

# Look up the cell that follows each 'Item' cell and read its text
todays_prices = [all_cells[i + 1][1].text_content() for i in the_cell_before_numbers]

I am sure there is a prettier way but I think this will get you there.

I tested using your html and I got what you were looking for.
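The same idea can be sketched more compactly with an XPath query that picks the third td of every row directly. This is a hedged example rather than the code that was tested above: SAMPLE is a shortened copy of the question's HTML wrapped in a table element with placeholder URLs, and third_cell_texts is a hypothetical helper name. It uses xpath() instead of cssselect(), since cssselect() depends on the separate cssselect package in newer lxml releases.

```python
from lxml import html

# Shortened copy of the question's rows, wrapped in <table> (placeholder URLs).
SAMPLE = """
<table>
<tr class="row_k">
<td><img src="a.png" alt="Item A"></td>
<td><a href="#"> Item A</a></td>
<td>14.8k</td>
<td><span class="drop">-555</span></td>
</tr>
<tr>
<td><img src="b.png" alt="Item B"></td>
<td><a href="#"> Item B</a></td>
<td>64.9k</td>
<td><span class="rise">+165</span></td>
</tr>
<tr class="row_k">
<td><img src="c.png" alt="Item C"></td>
<td><a href="#"> Item C</a></td>
<td>4,000</td>
<td><span class="rise">+666</span></td>
</tr>
</table>
"""


def third_cell_texts(page_source):
    """Return the stripped text of the third cell in each row (hypothetical helper)."""
    tree = html.fromstring(page_source)
    # td[3] selects the third td child of each tr (XPath positions start at 1).
    return [td.text_content().strip() for td in tree.xpath('//tr/td[3]')]


prices = third_cell_texts(SAMPLE)
print(prices)
```

This relies on the column position instead of searching for the 'Item' text, so it assumes the value always sits in the third column.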

PyNEwbie
I updated with a sample of the html
Pevo
Sorry, I'm brand new to this and I'm not sure how to implement it. =/ Where exactly do I put all of this?
Pevo
Well, I am using lxml instead of BeautifulSoup, so you need to install lxml. You need to go back to an earlier version of this question, as my answer was built using that description. But this code should get you there. It assumes that this1 is the HTML page you pulled in using urllib and that it is a string object.
PyNEwbie
I see. Well, my problems now are of another nature: installing lxml gives me an annoying error. But I believe this will get me where I want eventually. Many thanks.
Pevo
What error did you get?
PyNEwbie
An error regarding Microsoft Visual Basic 9, saying that it failed with exit status 2.
Pevo
PyNEwbie
I already have Microsoft Visual Basic 9 installed. I'm assuming there must be some problem with it.
Pevo