I am trying to learn lxml after having used BeautifulSoup. However, I am not a strong programmer in general.

I have the following code in some source html:

<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include:  </i></b></font></p>

Because the text is bold, I want to pull that text out, but I can't seem to detect that this particular line is bold.

When I started this work this evening, I was working with a document that had the word bold in the style attribute, like the following:

<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"><b><i><font size="2" face="Times New Roman" style="font-size:10.0pt;">The reason I like tomatoes include:</font></i></b></p>

I should say that the document I am working from is a fragment: I read in the lines, joined them together, and then used the html.fromstring function:

from lxml import html

txtFile=open(r'c:\myfile.htm','r').readlines()
strHTM=''.join(txtFile)
newHTM=html.fromstring(strHTM)

and so the first line of HTML I have above is newHTM[19].

Hmm, this seems to be getting me closer:

newHTM.cssselect('b')

I don't fully understand it yet, but here is the solution:

for each in newHTM:
    # cssselect('b') returns a list of matching descendants; an empty list is falsy
    if each.cssselect('b'):
        print(each.text_content())
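
For reference, the same idea can probably be written without looping over every child by hand. The following is only a minimal sketch under stated assumptions: the `sample` string is a hypothetical stand-in for the real file, and the names `doc`, `bold_paragraphs`, and `bold_texts` are made up for illustration. It uses the XPath expression `.//p[.//b]`, which selects p elements that have a b element anywhere beneath them.

```python
from lxml import html

# Hypothetical stand-in for the real file contents (assumption for illustration).
sample = (
    '<div>'
    '<p style="font-family:times;text-align:justify"><font size="2"><b><i> '
    'The reasons to eat pickles include:  </i></b></font></p>'
    '<p>Plain paragraph with no bold text.</p>'
    '</div>'
)

doc = html.fromstring(sample)

# Select only the <p> elements that contain a <b> at any depth below them.
bold_paragraphs = doc.xpath('.//p[.//b]')

# text_content() gathers all nested text; strip() drops the padding spaces.
bold_texts = [p.text_content().strip() for p in bold_paragraphs]

for text in bold_texts:
    print(text)
```

This only detects bold text marked up with a b tag. A paragraph that is bold solely through font-weight:bold in its style attribute would need a separate check on that attribute.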
A: 

Using the CSS API really isn't the right approach. If you want to find all b elements, do

strHTM=open(r'c:\myfile.htm','r').read() # no need to split it into lines first
newHTM=html.fromString(strHTM)
bELements = newHTM.findall('b')
for b in bElements:
    print b.text_content()
Martin v. Löwis
This is where I started, and it does not work. As near as I can figure, it is because newHTM is a class, but now I am lost. I am not sure why I decided to operate on each in newHTM, but that was the key.
Burch Kealey
What do you mean, "it does not work"? It works fine for me.
Martin v. Löwis
Well, I am wrong: both newHTM and each in newHTM are the same type of object, so that is not it.
PyNEwbie
Well, I would edit but I can't: fromString should be fromstring, and your list is named differently (bELements vs. bElements). But when I run this code on my HTML fragment, bElements has a length of 0.
PyNEwbie
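
One likely explanation for the empty list, offered as a sketch and not as a confirmed diagnosis of the original file: in lxml (as in ElementTree), findall('b') matches only b elements that are direct children of the element it is called on. In the fragments above, each b sits inside a p (and sometimes a font), so nothing matches at the top level. The path './/b' searches all descendants instead. The `sample` string and variable names below are hypothetical.

```python
from lxml import html

# Hypothetical fragment shaped like the ones in the question (assumption).
sample = (
    '<div>'
    '<p><font size="2"><b><i>The reasons to eat pickles include:</i></b></font></p>'
    '<p>Plain paragraph.</p>'
    '</div>'
)
root = html.fromstring(sample)

# 'b' matches only <b> elements that are direct children of root.
direct_children = root.findall('b')

# './/b' matches <b> elements at any depth below root.
all_descendants = root.findall('.//b')

print(len(direct_children), len(all_descendants))
for b in all_descendants:
    print(b.text_content())
```

This would also explain why looping over each child of newHTM and calling cssselect('b') on it worked: a CSS selector such as 'b' matches descendants at any depth. root.iter('b') is another way to walk every b element in the tree.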