questions about lxml | ansaurus

lxml

How to validate XML file against DTD/XMLSchema using lxml.etree.iterparse()

lxml easily validates XML files against any DTD or XMLSchema as long as you're using etree.XML(). I need to do the same with etree.iterparse(), so that the whole XML file isn't loaded into memory. There are two problems here: 1. The DTD is ignored by iterparse (whether internal or external). 2. The XML is validated against the XMLSchema, but errors ...

lxml removing <?xml ...> tags when parsing?

I'm currently working on parsing XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the element <?xml ...>. For example: from lxml import etree tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etr...
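A minimal sketch of one way to get the declaration back, assuming it is enough to re-emit it on output: the XML declaration is not a node in the parsed tree, so it has to be requested again when serializing. The import fallback to the standard library is an assumption added so the snippet also runs without lxml installed; `tostring(..., xml_declaration=True)` exists in both (in the standard library from Python 3.8).

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API for portability
    import xml.etree.ElementTree as etree

source = b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'

# Parsing keeps the element tree; the declaration itself is not stored as a node.
root = etree.fromstring(source)

# Ask the serializer to write a declaration again.
serialized = etree.tostring(root, encoding="utf-8", xml_declaration=True)
print(serialized)
```

The declaration written on output is generated by the serializer, so its quoting and spacing may differ from the original file.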

Adding attributes to existing elements, removing elements, etc with lxml

I parse the XML using: from lxml import etree tree = etree.parse('test.xml', etree.XMLParser()) Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces, or just elements in general, such as <rdf:description><dc:title>Example</dc:title></rdf:description> and I want to remove that entire element as...

Using lxml to find order of text and sub-elements

Let's say I have the following HTML: <div> text1 <div> t1 </div> text2 <div> t2 </div> text3 </div> I know how to get the text and subelements of the enclosing div using lxml.html. But is there a way to access both text and sub elements in an iterative manner that preserves order? In other words, I want to know where the "fr...
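One possible approach, sketched under the assumption that the `.text`/`.tail` model is what is needed here: an element's `.text` is the text before its first child, and each child's `.tail` is the text that follows that child. Walking them in turn gives document order. The helper name `ordered_content` is hypothetical, and the stdlib import fallback is only there so the snippet runs without lxml; the question itself uses lxml.html, which exposes the same two attributes.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API for portability
    import xml.etree.ElementTree as etree


def ordered_content(element):
    """Yield non-blank text chunks and child elements in document order."""
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield child
        # The tail is the text between this child and the next sibling.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


root = etree.fromstring(
    "<div> text1 <div> t1 </div> text2 <div> t2 </div> text3 </div>"
)
parts = [p if isinstance(p, str) else "<%s>" % p.tag for p in ordered_content(root)]
print(parts)  # ['text1', '<div>', 'text2', '<div>', 'text3']
```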

Error building 'lxml.etree' extension

I'm trying to install lxml on an Ubuntu server running Python 2.6 (in a virtualenv; the system Python is 2.5). I've checked out via svn and as a result I've also installed Cython, as per the instructions. However, I get the following error when running python setup.py build: Building lxml version 2.3.alpha1-76211. Building with Cython...

Building a graph of the structure of an XML document

I'd like to build a graph showing which tags are used as children of which other tags in a given XML document. I've written this function to get the unique set of child tags for a given tag in an lxml.etree tree: def iter_unique_child_tags(root, tag): """Iterates through unique child tags for all instances of tag. Iteration st...
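The asker's function is cut off above, so the following is not their code. It is a rough sketch of the idea, assuming a plain dictionary from parent tag to the set of child tags is an acceptable graph representation. The name `child_tag_graph` is hypothetical, and the stdlib import fallback is only there so the snippet runs without lxml.

```python
from collections import defaultdict

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API for portability
    import xml.etree.ElementTree as etree


def child_tag_graph(root):
    """Map each tag to the set of tags that appear as its direct children."""
    graph = defaultdict(set)
    for parent in root.iter():
        # Skip comments and processing instructions, whose tag is not a string.
        if not isinstance(parent.tag, str):
            continue
        for child in parent:
            if isinstance(child.tag, str):
                graph[parent.tag].add(child.tag)
    return dict(graph)


sample = etree.fromstring("<a><b><c/></b><b><d/></b><c/></a>")
graph = child_tag_graph(sample)
print(graph)
```

Each (parent, child) pair in the result can be written out as one edge for a graph-drawing tool.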

Python parsing: lxml to get just part of a tag's text

I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery: <p><span class="Title">Name</span>Dave Davies</p> <p><span class="Title">Address</span>123 Greyfriars Road, London</p> Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder...

screen-scraping

XPath: match multiple elements in one expression

Hi, I'm trying to do this (using lxml): //*[@id="32808345" or @id="33771423" or @id="15929470" or @id="33771117" or @id="15929266"] in order to get all elements, no matter what tag, with the specified IDs. I'm getting the following traceback: invalid attribute predicate This is how I'm generating the str (if that is relevant to t...

lxml version problem - unable to call findall method!

lxml 1.3 gives the following error for the line below: self.doc.findall('.//field[@on_change]') File "/home/.../code_generator/xmlGenerator.py", line 158, in processOnChange onchangeNodes = self.doc.findall('.//field[@on_change]') File "etree.pyx", line 1042, in etree._Element.findall File "/usr/lib/python2.5/site-packages/lxm...

Using a remote stylesheet that includes other stylesheets with relative paths

I'd like to do an XSL transform on a DocBook document using lxml.etree.XSLT. Although the documentation mentions that etree.XSLT() takes a first parameter of xslt_input, I can't seem to find any docs on what this parameter is meant to be. Passing it a file that is open for reading seems to work; passing it a filename in a string does n...

How can I remove all elements matching an xpath in python using lxml?

So I have some XML like this: <bar> <foo>Something</foo> <baz> <foo>Hello</foo> <zap>Another</zap> </baz> </bar> And I want to remove all the foo nodes. Something like this doesn't work: params = xml.xpath('//foo') for n in params: xml.getroot().remove(n) It gives ValueError: Element is not a child of this node. Wha...
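The ValueError arises because `remove()` only accepts a direct child of the element it is called on, while the matches sit at different depths. A minimal sketch of one way around it is to remove each match from its own parent. The sample below assumes a well-formed version of the document in the question, and it builds a parent map so that it also runs on the standard library fallback; with lxml itself, `node.getparent().remove(node)` does the same job without the map.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API for portability
    import xml.etree.ElementTree as etree

xml_text = (
    "<bar><foo>Something</foo>"
    "<baz><foo>Hello</foo><zap>Another</zap></baz></bar>"
)
root = etree.fromstring(xml_text)

# Map every element to its parent so each match can be removed from
# the node that actually contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Collect the matches first so the tree is not modified while iterating.
for node in list(root.iter("foo")):
    parent_of[node].remove(node)

result = etree.tostring(root)
print(result)  # b'<bar><baz><zap>Another</zap></baz></bar>'
```

Removing an element also drops its tail text, which matters if the document has mixed content.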

Doing an XSL transform of a branch of a Docbook element tree

I'd like to use the docbook XSL stylesheets to render various parts of a document, without transforming the entire thing. The complication is that some of these parts have <footnoteref> elements whose linkend attributes are not located within the same chunk. In other words, I want to process a branch of the tree which includes the <foo...

Installing Python extensions on OS X, missing MacOSX10.4u.sdk error

I'm attempting to install various python extensions on OS X (10.6.4), with a python.org python (Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)). Consistently running into a problem on the gcc step. Here's a sample from compiling Cython (btw, I'm attempting to install Cython in order to install lxml): In file included from /usr/includ...

remove everything between 2 tags that span branches of an xml tree.

I'm trying to remove everything in an XML document between 2 tags, using Python and lxml. The problem is that the tags can be in different branches of the tree (but always at the same depth). An example document might look like this: <root> <p> Hello world <start />this is a paragraph </p> <p> Goodbye world. <end />I'm leaving now...

Is there a way to force lxml to parse Unicode strings that specify an encoding in a tag?

I have an XML file that specifies an encoding, and I use UnicodeDammit to convert it to unicode (for reasons of storage, I can't store it as a string). I later pass it to lxml but it refuses to ignore the encoding specified in the file and parse it as Unicode, and it raises an exception. How can I force lxml to parse the document? This ...

Manipulating list from lxml xpath queries

Today I tried lxml because I got very nasty HTML output from a particular web service and didn't want to use the re module, just for a change and to learn something new. And I did, browsing http://codespeak.net/lxml/ and http://stackoverflow.com in parallel. I won't try to explain the above HTML template, but as an overview, it's full of deliber...

lxml XPath processing of multi-line field

I'm doing some scraping of a page and I'm fine with getting most fields, but having some problems with the address. <address> 56 South Ave <br> Miami, FL 33131 <br> </address> address = myWebPage.xpath("//div[contains(@class,'rightcol')]//address") I can get the first line, 56 South Avenue, using the above code. But I can't...

screen-scraping

Weird behaviour with lxml getiterator()

Hi all. I have the following XML document: <x> <a>Some text</a> <b>Some text 2</b> <c>Some text 3</c> </x> I want to get the text of all the tags, so I decided to use getiterator(). My problem is that it adds blank lines for a reason I can't understand. Consider this: >>> for text in document_root.getiterator(): ... print t...

Python setuptools: how can I add a dependency on libxml2-dev and libxslt1-dev?

My application needs lxml >= 2.1, but installing lxml requires libxml2-dev and libxslt1-dev; otherwise the lxml installation fails with an error. Is there a way, using Python setuptools, to declare this as a dependency in my setup.py? ...

Once I have identified the beginning and end parts of a section of an html document using lxml, how do I get everything between them

I am working with some HTML files. I am trying to figure out a way to consistently get to some text that exists in the documents. I know that the section I want begins with some bolded words and I know that the section ends with other bolded words. bolded_items=atree.cssselect('b') myKeys=[item for item in bolded_items if item.text if...
