lxml

How to add a namespace to an attribute in lxml

I'm trying to create an XML entry that looks like this using Python and lxml: <resource href="Unit 4.html" adlcp:scormtype="sco"> I'm having trouble with the adlcp:scormtype attribute. I'm new to XML, so please correct me if I'm wrong. adlcp is a namespace and scormtype is an attribute that is defined in t...
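
One possible approach, sketched rather than authoritative: lxml addresses namespaced attributes with the `{namespace-uri}localname` notation, and the prefix comes from an `nsmap` passed when the element is created. The namespace URI below is an assumption for illustration; use whatever URI your manifest schema actually declares for `adlcp`.

```python
from lxml import etree

# Assumption: placeholder URI for the adlcp namespace; substitute the one
# declared by the schema you are targeting.
ADLCP_NS = "http://www.adlnet.org/xsd/adlcp_rootv1p2"

# nsmap binds the "adlcp" prefix to the namespace URI on this element.
resource = etree.Element("resource", nsmap={"adlcp": ADLCP_NS})
resource.set("href", "Unit 4.html")

# Namespaced attributes are set with the {uri}name form.
resource.set("{%s}scormtype" % ADLCP_NS, "sco")

xml_text = etree.tostring(resource, encoding="unicode")
print(xml_text)
```

The serialized element carries an `xmlns:adlcp` declaration along with the prefixed attribute. In a full document you would normally declare the prefix once on the root element instead.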

lxml equivalent to BeautifulSoup "OR" syntax?

I'm converting some HTML parsing code from BeautifulSoup to lxml. I'm trying to figure out the lxml equivalent syntax for the following BeautifulSoup statement: soup.find('a', {'class': ['current zzt', 'zzt']}) Basically I want to find all of the "a" tags in the document that have a class attribute of either "current zzt" or "zzt". ...
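
A minimal sketch of one way to express this, assuming an exact match on the whole class attribute is what is wanted: XPath has an `or` operator that can combine two attribute tests. The sample markup and `href` values are made up for illustration.

```python
import lxml.html

# Hypothetical sample markup, not taken from the original question.
html = (
    '<div>'
    '<a class="current zzt" href="/1">one</a>'
    '<a class="zzt" href="/2">two</a>'
    '<a class="other" href="/3">three</a>'
    '</div>'
)
doc = lxml.html.fromstring(html)

# Match <a> elements whose class attribute is exactly one of the two values.
links = doc.xpath('//a[@class="current zzt" or @class="zzt"]')
hrefs = [a.get("href") for a in links]
print(hrefs)
```

This compares the full attribute string, so an element with `class="zzt current"` would not match.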

Python web scraping involving HTML tags with attributes

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </...
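
As a rough sketch, assuming the page really has the structure shown above, the author cells can be selected by their class attribute. The author names are placeholder data.

```python
import lxml.html

# Placeholder page following the skeleton from the question.
html = """<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Ada Lovelace</td></tr>
<tr><td class="author">Alan Turing</td></tr>
</tbody></table></div></div></body></html>"""

doc = lxml.html.fromstring(html)

# text_content() gathers all text below the cell, including nested tags.
authors = [td.text_content().strip() for td in doc.xpath('//td[@class="author"]')]
print(authors)
```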

Difference between attributes and style tags in lxml

I am trying to learn lxml after having used BeautifulSoup. However, I am not a strong programmer in general. I have the following code in some source html: <p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include: </i></b></font></p> Because the text is bolded, I want to pull that tex...

can't install lxml (python 2.6.3, osx 10.6 snow leopard)

I try to: easy_install lxml and I get this error: File "build/bdist.macosx-10.3-fat/egg/setuptools/command/build_ext.py", line 85, in get_ext_filename KeyError: 'etree' any hints? ...

How can I view a text representation of an lxml element?

If I'm parsing an XML document using lxml, is it possible to view a text representation of an element? I tried to do : print repr(node) but this outputs <Element obj at b743c0> What can I use to see the node like it exists in the XML file? Is there some to_xml method or something? ...

Clojure equivalent to Python's lxml library?

I'm looking for the Clojure/Java equivalent to Python's lxml library. I've used it a ton in the past for parsing all sorts of html (as a replacement for BeautifulSoup) and it's great to be able to use the same elementtree api for xml as well -- really a trusted friend! Can anyone recommend a similar Java/Clojure library? About lxml ...

How to get path of an element in lxml?

Hello, I'm searching in an HTML document using XPath from lxml in Python. How can I get the path to a certain element? Here's the example from Ruby Nokogiri: page.xpath('//text()').each do |textnode| path = textnode.path puts path end This prints, for example, '/html/body/div/div[1]/div[1]/p/text()[1]' and this is the string I want to ...
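
A hedged sketch of the closest lxml counterpart I am aware of: `getpath()` on the element tree returns an XPath expression for an element. It works on elements rather than text nodes, so for a text node you would ask for the path of its parent element. The sample document is an assumption.

```python
import lxml.html

# Assumed sample document.
doc = lxml.html.fromstring(
    "<html><body><div><p>one</p><p>two</p></div></body></html>"
)
tree = doc.getroottree()

paragraphs = doc.xpath("//p")
# getpath() builds an absolute XPath expression for each element.
paths = [tree.getpath(p) for p in paragraphs]
for path in paths:
    print(path)

# Each path can be evaluated again to get back to the same element.
round_trip = [tree.xpath(path)[0] for path in paths]
```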

python, lxml and xpath - html table parsing

Hello, I'm new to lxml, quite new to Python, and could not find a solution to the following: I need to import a few tables with 3 columns and an undefined number of rows starting at row 3. When the second column of any row is empty, this row is discarded and the processing of the table is aborted. The following code prints the table...
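
A minimal sketch of the rule described above, under two assumptions: "starting at row 3" means the first two rows are headers to skip, and an "empty" cell contains only whitespace. The table contents are invented.

```python
import lxml.html

# Invented table: two header rows, then data rows.
html = """<table>
<tr><th>h1</th><th>h2</th><th>h3</th></tr>
<tr><td>sub</td><td>sub</td><td>sub</td></tr>
<tr><td>a</td><td>1</td><td>x</td></tr>
<tr><td>b</td><td>2</td><td>y</td></tr>
<tr><td>c</td><td></td><td>z</td></tr>
<tr><td>d</td><td>4</td><td>w</td></tr>
</table>"""

doc = lxml.html.fromstring(html)

rows = []
# position() >= 3 skips the first two rows of the table.
for tr in doc.xpath("//tr[position() >= 3]"):
    cells = [td.text_content().strip() for td in tr.xpath("./td")]
    # Stop at the first row whose second column is empty; that row is dropped.
    if len(cells) < 3 or not cells[1]:
        break
    rows.append(cells)

print(rows)
```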

Weird lxml behavior

Hello. Consider the following snippet: import lxml.html html = '<div><br />Hello text</div>' doc = lxml.html.fromstring(html) text = doc.xpath('//text()')[0] print lxml.html.tostring(text.getparent()) #prints <br>Hello text I was expecting to see '<div><br />Hello text</div>', because br can't have nested text and is "self-closed" (I...
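
One likely explanation, offered as a sketch in Python 3 syntax: in lxml's tree model, text that follows an element is stored as that element's `tail`. The text node's `getparent()` is therefore the `<br>`, and `tostring()` includes the tail by default.

```python
import lxml.html

doc = lxml.html.fromstring('<div><br />Hello text</div>')
text = doc.xpath('//text()')[0]
parent = text.getparent()

# The string result knows whether it came from .text or .tail.
print(parent.tag, text.is_tail)

# with_tail=False serializes the element without its trailing text.
print(lxml.html.tostring(parent, with_tail=False, encoding="unicode"))

# To reach the enclosing element, step up once more for tail text.
container = parent.getparent() if text.is_tail else parent
print(lxml.html.tostring(container, encoding="unicode"))
```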

How to extract some text using lxml?

Hello. I want to extract some text from a certain website. Here is the address of the page I want to scrape: http://news.search.naver.com/search.naver?sm=tab%5Fhty&where=news&query=times&x=0&y=0 On this page, I want to extract the subject and content fields separately. For example, if you open tha...

PAMIE and lxml related question

Hello, I'm making a web scraper. I have received a lot of help here on Stack Overflow, and my scraper is now almost finished except for several related problems :) I uploaded my script source to http://elca.pastebin.com/m52e7d8e0 The current problem is this: if you look at line 74 of my script source, you can see this line "thepage = urllib.urlopen(the...

How to use lxml to get a message from a website?

The site exam.com displays a weather line such as: Tokyo: 25°C I want to use Django 1.1 and lxml to get information from the website. I want to get only the "25". The HTML of exam.com is structured as follows: <p id="resultWeather"> <b>Weather</b> Tokyo: <b>25</b>°C </p> I'm a student. I'm doing a small project with my friend...
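
A small sketch, assuming the markup is exactly as quoted and the temperature is always in the second `<b>` inside the paragraph. Fetching the page and the Django side are left out.

```python
import lxml.html

# Markup as quoted in the question.
html = '<p id="resultWeather"> <b>Weather</b> Tokyo: <b>25</b>°C </p>'
doc = lxml.html.fromstring(html)

# Select the text of the second <b> inside the identified paragraph.
values = doc.xpath('//p[@id="resultWeather"]/b[2]/text()')
temperature = int(values[0]) if values else None
print(temperature)
```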

How can you select a node that's an unknown number of levels deep from a tag in XPath?

For example, if I have <form name="blah"> <input name="1"/> <input name="2"/> <table> <tr> <td> <unknown number of levels more> <input name="3"/> </td> </tr> </table> </form> How can I put together a query that will return inputs 1, 2 and 3? Edit: I should note I'm not interested i...
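
For illustration, the XPath `//` abbreviation selects descendants at any depth, so `//form[@name="blah"]//input` should cover all three inputs. The sketch below replaces the "unknown number of levels" with a couple of arbitrary wrapper elements.

```python
from lxml import etree

# The <div><span> wrappers stand in for the unknown nesting levels.
form = etree.fromstring(
    '<form name="blah">'
    '<input name="1"/><input name="2"/>'
    '<table><tr><td><div><span><input name="3"/></span></div></td></tr></table>'
    '</form>'
)

# "//" between the two steps means "any descendant", however deep.
inputs = form.xpath('//form[@name="blah"]//input')
names = [i.get("name") for i in inputs]
print(names)
```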

The choice of XML/XSL lib for Python 2.6.x

Currently I have two options, lxml and libxml2, that both seem to work. I have tried benchmarking both, specifically for parsing in-memory strings and files into XML, and for importing XSLT stylesheets and applying them. While purely performance-based tests indicate that lxml comes out on top (applying stylesheets specifically), libxml2 seems to have bee...

Lxml html xpath context

Hello, I'm using lxml to parse an HTML file and I'd like to know how I can set the context of an XPath search. What I mean is that I have a node element and want to make an XPath search only inside this node, as if it were the root. For example, I have a form node and want the XPath search //input to return only inputs of the given form as opposed to all ...
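
A brief sketch of the usual idiom: an expression starting with `//` always searches from the document root, even when called on a sub-element, while a leading `.` makes it relative to the element it is called on. The sample markup is hypothetical.

```python
import lxml.html

# Hypothetical markup with two forms.
html = (
    '<div>'
    '<form id="a"><input name="x"></form>'
    '<form id="b"><input name="y"><input name="z"></form>'
    '</div>'
)
doc = lxml.html.fromstring(html)
form_b = doc.xpath('//form[@id="b"]')[0]

# ".//input" is evaluated relative to form_b.
inside = [i.get("name") for i in form_b.xpath('.//input')]

# "//input" ignores the context node and searches the whole document.
everything = [i.get("name") for i in form_b.xpath('//input')]

print(inside, everything)
```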

Is it possible for lxml to work in a case-insensitive manner?

I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over said websites, so I have to take what I'm given. They have a variety of casings for the tag and attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are as stubborn as to insist on full...
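
A possible workaround, sketched under assumptions: when the page is parsed with lxml's HTML parser, tag and attribute names come out lowercased, so only attribute values need case folding. XPath 1.0 has no `lower-case()` function, but `translate()` can stand in for ASCII letters. The sample page is invented.

```python
import lxml.html

# Invented page with mixed-case tags, attribute names, and values.
html = (
    '<HTML><HEAD>'
    '<META NAME="Keywords" CONTENT="a, b">'
    '<meta name="DESCRIPTION" content="desc">'
    '</HEAD><BODY></BODY></HTML>'
)
doc = lxml.html.fromstring(html)

# translate() maps uppercase ASCII letters to lowercase inside XPath 1.0.
LOWER_NAME = (
    'translate(@name, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
    '"abcdefghijklmnopqrstuvwxyz")'
)

keywords = doc.xpath('//meta[%s="keywords"]/@content' % LOWER_NAME)
description = doc.xpath('//meta[%s="description"]/@content' % LOWER_NAME)
print(keywords, description)
```

This handles ASCII only; non-ASCII casing would need a different approach, such as filtering in Python.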

Pass XML fragments as stylesheet parameters with lxml?

I'm starting to use lxml in Python for processing XML/XSL documents, and in general it seems very straightforward. However, I'm not able to find a way to pass an XML fragment as a stylesheet parameter when doing a transformation. For example, in PHP it is possible to pass DOMDocument XML fragments as stylesheet parameters, so that one can...

Is there a way to parse html with lxml, but manipulate it with minidom?

I have an application where I've been using html5lib to liberally parse html. I use the minidom interface, because I need a real DOM API and ElementTree is not appropriate for what I'm doing. Here's how I do this: parser = html5lib.XHTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom')) parser.parse(html) However, parsing huge ...

Replacing elements with lxml.html

Hello, I'm fairly new to lxml and HTML Parsers as a whole. I was wondering if there is a way to replace an element within a tree with another element... For example I have: body = """<code> def function(arg): print arg </code> Blah blah blah <code> int main() { return 0; } </code> """ doc = lxml.html.fromstring(body) codeblocks = do...
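
A rough sketch of one way to swap elements, assuming the goal is to replace each `<code>` element with a new element. The `<pre>` replacement is purely hypothetical, since the excerpt does not say what the new element should be. The trailing text is copied explicitly because it belongs to the old element as its `tail`.

```python
import lxml.html
from lxml import etree

# Shortened sample wrapped in a <div> so there is a single root element.
body = "<div><code>print 1</code> Blah blah blah <code>return 0;</code></div>"
doc = lxml.html.fromstring(body)

for code in doc.xpath("//code"):
    # Hypothetical replacement element; swap in whatever you actually need.
    pre = etree.Element("pre")
    pre.text = code.text
    # Copy the tail so the text after the old element is kept.
    pre.tail = code.tail
    code.getparent().replace(code, pre)

result = lxml.html.tostring(doc, encoding="unicode")
print(result)
```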