My HTML looks like this:

<td>
   <table ..>
      <tr>
         <th ..>price</th>
         <th>$99.99</th>
      </tr>
   </table>
</td>

I am at the current table cell. How would I get the 99.99 value?

I have so far:

td[3].findChild('th')

But I need to do:

Find th with text 'price', then get next th tag's string value.

A: 

With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:

from pyparsing import makeHTMLTags, Combine, Word, nums

th,thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd + 
              th + "$" + floatnum("price") + thEnd)

# html holds the page source as a string
tokens, startloc, endloc = next(priceEntry.scanString(html))

print(tokens.price)

Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern does much more than wrap the given string in "<>"s: it also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, even though I specified "TH" as the table header tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace-skipping behavior also handles extra spaces (between the tag and "$", between "$" and the numeric price, etc.) without your having to sprinkle "zero or more whitespace chars could go here" indicators through the pattern. Lastly, assigning the results name "price" (following floatnum in the definition of priceEntry) makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.

(Combine is used for two purposes: it disallows whitespace between the components of the number, and it returns a single combined token "99.99" instead of the list ["99", ".", "99"].)
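
As a rough illustration of the tolerance described above, here is a Python 3 sketch of the same grammar run against a small sample. The SAMPLE_HTML string and the find_price helper are made up for this example and are not part of pyparsing; the sample deliberately uses an upper-case tag, an attribute, and stray spaces around the "$".

```python
from pyparsing import makeHTMLTags, Combine, Word, nums

# Hypothetical sample input: upper-case tag, an attribute, and stray spaces.
SAMPLE_HTML = """
<td>
   <table>
      <tr>
         <TH class="label">price</TH>
         <th> $ 99.99 </th>
      </tr>
   </table>
</td>
"""

# Same grammar as in the answer above.
th, thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
              th + "$" + floatnum("price") + thEnd)


def find_price(html):
    """Return the first price that follows a 'price' header cell, or None."""
    # scanString yields (tokens, start, end) tuples lazily; take the first one.
    match = next(priceEntry.scanString(html), None)
    if match is None:
        return None
    tokens, startloc, endloc = match
    return tokens.price


print(find_price(SAMPLE_HTML))
```

Passing a default to next() avoids a StopIteration when the page has no matching pair of cells, which is probably what you want when scanning arbitrary pages.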

Paul McGuire
+2  A: 

Think about it in "steps"... given that some x is the root of the subtree you're considering,

x.findAll(text='price')

is the list of all text nodes in that subtree whose text is exactly 'price'. The parents of those nodes are then of course:

[t.parent for t in x.findAll(text='price')]

and if you only want to keep those whose "name" (tag) is 'th', then of course

[t.parent for t in x.findAll(text='price') if t.parent.name=='th']

and you want the "next siblings" of those (but only if they're also 'th's), so

[t.parent.nextSibling for t in x.findAll(text='price')
 if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']

Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:

Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.

for t in x.findAll(text='price'):
  p = t.parent
  if p.name != 'th': continue
  ns = p.nextSibling
  if ns and not ns.name: ns = ns.nextSibling
  if not ns or ns.name not in ('td', 'th'): continue
  print(ns.string)

I've used ns.string, which gives the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.

Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
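
For anyone on Python 3 with Beautiful Soup 4, the loop above might be adapted roughly as follows. This is only a sketch under a few assumptions: it uses the bs4 spellings find_all(string=...) and next_sibling, it parses with the standard library's "html.parser", and the function name cells_after_label and the sample markup are invented for the example. It collects the values in a list instead of printing them.

```python
from bs4 import BeautifulSoup, Tag


def cells_after_label(root, label="price"):
    """Collect the text of the th/td cell that follows each <th>label</th>."""
    results = []
    for t in root.find_all(string=label):
        p = t.parent
        if p.name != "th":
            continue
        ns = p.next_sibling
        # Skip one text node (typically whitespace) between the two cells.
        if ns is not None and not isinstance(ns, Tag):
            ns = ns.next_sibling
        if not isinstance(ns, Tag) or ns.name not in ("td", "th"):
            continue
        # .string is None when the cell contains nested tags.
        results.append(ns.string)
    return results


# Hypothetical sample, shaped like the markup from the comments.
sample = "<table><tr><th>price</th>\n<td>$99.99</td></tr></table>"
soup = BeautifulSoup(sample, "html.parser")
print(cells_after_label(soup))
```

The two early continue statements keep the body flat, as in the original loop; the isinstance checks stand in for the ns.name tests so that text nodes and None are handled the same way.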

Alex Martelli
great answer alex, I have just been using findAll so far and now I feel I can traverse the dom with this knowledge, go python! hehe
Blankman
in the line 'if not ns or ns.name...', if ns is None then ns.name will fail, no?
Blankman
actually my HTML is like: <th>price</th><td>$99.99</td> and p.nextSibling is empty, I tried p.next and that doesn't work either.
Blankman
@Blankman, `or` _short-circuits_, so in `not ns or ns.name` the `ns.name` will not evaluate if `ns` is false -- so, to your first comment, nope, you don't need to worry.
Alex Martelli
@Blankman, wrt your second comment, see my edit -- one extra line of code to optionally skip some text between the parent `th` and the "next" (which then isn't "next"!-) sibling, and an altered check to accept either `th` or `td` as the tag we're looking for.
Alex Martelli
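
The short-circuit point from the comments can be checked without any HTML library. In the sketch below, Node is a made-up stand-in for a parsed tag and should_skip is a hypothetical name for the condition used in the loop; neither comes from Beautiful Soup.

```python
class Node:
    """Made-up stand-in for a parsed tag; only the name attribute matters here."""

    def __init__(self, name):
        self.name = name


def should_skip(ns):
    # `not ns` is evaluated first. When ns is None it is already True,
    # so `ns.name` on the right-hand side is never touched.
    return not ns or ns.name not in ("td", "th")


print(should_skip(None))         # no AttributeError is raised
print(should_skip(Node("td")))
print(should_skip(Node("div")))
```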