ansaurus

Question

python, lxml and xpath - html table parsing

Answer 1

A:

You need to use a loop to access the row's data, like this:

for row in data:
    for col in row:
        print(col)

Calling next() once as you did will access only the first item, which is why you see one column.

Note that a generator can be iterated over only once; after that it is exhausted. If you change the call process_row(row) to list(process_row(row)), the generator's output is collected into a list, which can be reused.
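As a minimal sketch of that behaviour (the row here is a plain list of strings standing in for the cells of one table row, and this process_row is a simplified, hypothetical version of the one in the question):

```python
def process_row(row):
    """Yield the cells of one row, one at a time (this makes it a generator)."""
    for cell in row:
        yield cell


# Hypothetical stand-in for the cell texts of a single table row.
row = ["a", "b", "c"]

gen = process_row(row)
first_only = next(gen)    # a single next() call yields only the first cell
rest = list(gen)          # iterating picks up where next() left off
second_pass = list(gen)   # the generator is exhausted now, so this is empty

# Wrapping the call in list() gives a real list that can be reused freely.
cells = list(process_row(row))
```

After this runs, second_pass is empty while cells can still be looped over as many times as needed.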

Update: If you need only the third row onward, use data[2:].

interjay 2009-10-16 12:29:58

Thanks, the nested loop and adding the list() call indeed did the trick. But it still doesn't work with the second xpath, which is the one I need (I guess).

2009-10-16 13:51:23

It isn't clear to me why you need the second xpath, see the update to my answer.

interjay 2009-10-16 14:28:43

I need all the table content starting from row 3, and the second xpath only returns one row. Of course I have done what you suggested in your update, but I am curious to know what is wrong with the second xpath, as it would make my code for the following days cleaner.

2009-10-16 15:34:11

Answer 2

+1 A:

This is a generator:

def process_row(row):
    for cell in row.xpath('./td'):
        print(cell.text_content())
        yield cell.text_content()

You're calling it as though you thought it returns a list. It doesn't. There are contexts in which it behaves like a list:

print([r for r in process_row(row)])

but that's only because a generator and a list are both iterable, so a for loop treats them the same way. Using it in a context where nothing iterates over it, e.g.:

return [process_row(row) for row in table.xpath('./tr')]

just creates a new generator object for each value of row and returns a list of those generators. None of them has run yet, so no cell text is produced until something iterates over each one.
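To see what that comprehension actually builds, here is a small sketch. It assumes a table represented as plain nested lists instead of lxml elements, and a simplified, hypothetical process_row that just yields each cell:

```python
import types


def process_row(row):
    """Generator: yields the cells of one row lazily."""
    for cell in row:
        yield cell


# Hypothetical stand-in for the rows of a parsed table.
table = [["a", "b"], ["c", "d"]]

# One generator object per row; none of them has started running yet.
lazy = [process_row(row) for row in table]
all_generators = all(isinstance(g, types.GeneratorType) for g in lazy)

# Wrapping each call in list() runs the generator and stores its results.
eager = [list(process_row(row)) for row in table]
```

Here lazy holds generator objects rather than cell values, while eager is an ordinary list of lists that can be indexed, sliced, and iterated repeatedly.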

So that's your first problem. Your second one is that you're expecting:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

to give you the third and all subsequent rows, but it only sets tbl to the third row. The call to xpath does return the third and all subsequent rows, as a list. It's the [0] at the end that throws away everything except the first of them.
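A rough sketch of the difference, using lxml.html on a made-up page (the HTML below and the names rows, third_row, and data are assumptions for illustration, not the asker's actual document or code):

```python
import lxml.html

# Hypothetical page: the second table has two header rows, then two data rows.
PAGE = """
<html><body>
<table><tr><td>not this table</td></tr></table>
<table>
  <tr><td>title</td></tr>
  <tr><td>name</td><td>value</td></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(PAGE)

# xpath() returns a list of every matching <tr>: here, rows 3 and 4.
rows = doc.xpath("//body/table[2]//tr[position()>2]")

# Indexing with [0] keeps only the first match, i.e. the third row.
third_row = rows[0]

# Iterate over the list itself to process every matching row.
data = [[td.text_content() for td in row.xpath("./td")] for row in rows]
```

Note that the list returned by xpath() has no xpath() method of its own, so a later call such as tbl.xpath('./tr') works on a single element but not on the whole result.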

Robert Rossney 2009-10-16 21:04:38

Thanks for your answer. But removing the [0] at the end of the xpath raises the exception: AttributeError: 'list' object has no attribute 'xpath'

2009-10-17 13:11:35

I do not believe that merely removing the `[0]` from the end of that statement caused that error. You've changed something else, or the error is being raised later.

Robert Rossney 2009-10-17 18:23:36

Forgive that poor soul, I have to admit that my python skills are very likely involved... Here is the actual code snippet bugging me: http://pastebin.com/m522b6970

2009-10-17 20:35:11
