I'm scraping a page and can extract most fields, but I'm having trouble with the address.

<address>
  56 South Ave
  <br>
  Miami, FL 33131
  <br>
</address>

address = myWebPage.xpath("//div[contains(@class,'rightcol')]//address")

I can get the first line, 56 South Ave, using the above code, but I can't get the city, state, and ZIP code. How would I change the code to get the full address?

+1  A: 
//div[contains(@class,'rightcol')]//address/text()[1]

selects the first text-node child of address:

"  
  56 South Ave   
  "

//div[contains(@class,'rightcol')]//address/text()[2]

selects the second text-node child of address:

"       
  Miami, FL 33131       
  "

//div[contains(@class,'rightcol')]//address/text()

selects all text-node children of address: the two shown above, plus any whitespace-only text node such as the one after the last <br>.
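
To turn those text nodes into one address string, the whitespace around each node has to be stripped first. Here is a minimal Python sketch, assuming the XPath call hands back a list of plain strings (as lxml does for text() queries). The names `join_text_nodes` and `sample_nodes` are made up for illustration, and the sample list is a stand-in for the query result rather than real scraped output.

```python
def join_text_nodes(nodes, sep=", "):
    """Collapse whitespace in each text node, drop empty nodes, join the rest."""
    cleaned = (" ".join(node.split()) for node in nodes)
    return sep.join(part for part in cleaned if part)


# Stand-in for the strings a text() query would return for the sample markup,
# including a whitespace-only node after the last <br> (assumed, not scraped).
sample_nodes = ["\n  56 South Ave\n  ", "\n  Miami, FL 33131\n  ", "\n"]

full_address = join_text_nodes(sample_nodes)
print(full_address)  # 56 South Ave, Miami, FL 33131
```

Dropping the empty parts matters because whitespace-only text nodes between tags would otherwise show up as blank entries in the result.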

Dimitre Novatchev
Thanks a lot, Dimitre. It works. Another question for you: although I get OK results when I select either node 1 or node 2, my results are cut short when I use //address/text(). I only get 3 results, whereas there are 10 children of address. This may be due to extra non-alphanumeric characters in the address; I'm not sure. I'd normally do some regex parsing, but I'm not sure whether I can do that within the XPath functions. How do you typically process multiline data to ensure the results are well-formed?
DevX
@DevX: `//address/text()` selects all text nodes that are *immediate* children of an `address` element. If you need all text-node *descendants* of any `address` node, use: `//address//text()`.
Dimitre Novatchev
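
The child versus descendant distinction can be seen with a small sketch that uses only Python's standard library. Note the assumptions: the markup is a hypothetical, well-formed XML variant of the address (with `<br/>` and an invented `<span>` around the city), and since `xml.etree.ElementTree` does not support `text()` in its XPath subset, the two selections are emulated with `.text`/`.tail` and `itertext()` rather than run as real XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: well-formed XML, with the city wrapped in a <span>
markup = "<address>56 South Ave<br/><span>Miami</span>, FL 33131<br/></address>"
address = ET.fromstring(markup)

# Roughly what address/text() selects: text sitting directly inside <address>
child_text = [t for t in [address.text] + [child.tail for child in address] if t]

# Roughly what address//text() selects: text at any depth, in document order
descendant_text = list(address.itertext())

print(child_text)       # ['56 South Ave', ', FL 33131']
print(descendant_text)  # ['56 South Ave', 'Miami', ', FL 33131']
```

The city is missing from the first list because it sits inside a nested element, which is the kind of situation where `//address/text()` returns fewer results than expected.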