I am trying to learn lxml after having used BeautifulSoup. However, I am not a strong programmer in general.
I have the following markup in some source HTML:
<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include: </i></b></font></p>
Because the text is bolded, I want to pull that text out. However, I can't work out how to detect that this particular line is bolded.
When I started on this earlier this evening, I was working with a document that had the word bold in the style attribute, like the following:
<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"><b><i><font size="2" face="Times New Roman" style="font-size:10.0pt;">The reason I like tomatoes include:</font></i></b></p>
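For a document like that second one, where the style attribute itself says bold, a check along these lines might work. This is only a minimal sketch: the helper name has_bold_style is hypothetical (it is not part of lxml), and it only looks at an inline font-weight:bold declaration, not at stylesheets or numeric weights.

```python
from lxml import html


def has_bold_style(element):
    """Return True if the element's inline style declares font-weight: bold.

    Hypothetical helper for illustration; not an lxml function.
    """
    style = element.get("style", "")
    # Split "a:b;c:d" into declarations, ignoring case and stray whitespace.
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        if name.strip().lower() == "font-weight" and value.strip().lower() == "bold":
            return True
    return False


fragment = html.fromstring(
    '<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;">'
    '<b><i>The reason I like tomatoes include:</i></b></p>'
)
print(has_bold_style(fragment))
```

This would not catch my first example, where the bolding comes only from the b tag, so it is at best half of the answer.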
I should say that the document I am working from is a fragment. I read in its lines, joined them together, and then passed the result to the html.fromstring function:
from lxml import html
txtFile=open(r'c:\myfile.htm','r').readlines()
strHTM=''.join(txtFile)
newHTM=html.fromstring(strHTM)
and so the first line of HTML I showed above is newHTM[19].
Hmm, this seems to be getting me closer:
newHTM.cssselect('b')
I don't fully understand it yet, but here is the solution:
for each in newHTM:
    # cssselect('b') returns a list of matching <b> elements; an empty list is falsy
    if each.cssselect('b'):
        print(each.text_content())
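The same idea can be sketched as a self-contained script. This version is an assumption-laden variation rather than exactly what I ran: it uses xpath(".//b") instead of cssselect so that the separate cssselect package is not needed, and the source string and the name bold_texts are made up for the example.

```python
from lxml import html

# Made-up fragment: one paragraph with bolded text, one without.
source = (
    '<div>'
    '<p style="font-family:times;text-align:justify"><font size="2"><b><i>'
    ' The reasons to eat pickles include: </i></b></font></p>'
    '<p>Plain paragraph, no bold.</p>'
    '</div>'
)
root = html.fromstring(source)

bold_texts = []
for child in root:
    # ".//b" finds <b> descendants at any depth below this child;
    # xpath() returns a list, and an empty list is falsy.
    if child.xpath(".//b"):
        bold_texts.append(child.text_content().strip())

print(bold_texts)
```

Note that this collects the whole text of any top-level element that contains a b tag somewhere inside it, so a paragraph with only one bolded word would be returned in full.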