lxml easily validates XML files against any DTD or XMLSchema as long as you're using etree.XML().
I need to do the same trick with etree.iterparse(), so that the whole XML file isn't loaded into memory. There are two problems here:
1. DTD is ignored by iterparse (be it internal or external)
2. XML is validated against XMLSchema, but errors ...
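A minimal sketch of streaming validation, assuming the `schema` keyword that `etree.iterparse()` accepts in current lxml releases. The tiny schema, the `stream_items` helper and the sample document are made up for illustration. How and when validation errors surface may differ between lxml versions, so this only shows the valid-document path.

```python
import io
from lxml import etree

# Hypothetical schema: a <root> holding one or more integer <item> elements.
schema = etree.XMLSchema(etree.XML(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""))

def stream_items(data, schema):
    """Collect the text of each <item> while validating during parsing."""
    seen = []
    for _event, elem in etree.iterparse(
        io.BytesIO(data), events=("end",), tag="item", schema=schema
    ):
        seen.append(elem.text)
        elem.clear()  # release the element's content once it has been handled
    return seen

items = stream_items(b"<root><item>1</item><item>2</item></root>", schema)
```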
I'm currently parsing XML documents and modifying them (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the <?xml ...> declaration. For example:
from lxml import etree
tree = etree.fromstring('<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etr...
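The declaration is not an element, so it is not kept in the parsed tree. It can be requested again at serialization time. A minimal sketch, assuming the input is passed as bytes (Python 3 lxml rejects `str` input that carries an encoding declaration):

```python
from lxml import etree

# Parse from bytes; the <?xml ...?> declaration is consumed by the parser.
root = etree.fromstring(
    b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
)

# Ask the serializer to emit a declaration again.
out = etree.tostring(root, xml_declaration=True, encoding="utf-8")
```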
I parse in the XML using
from lxml import etree
tree = etree.parse('test.xml', etree.XMLParser())
Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces or just elements in general such as
<rdf:description><dc:title>Example</dc:title></rdf:description>
and I want to remove that entire element as...
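A minimal sketch of removing a namespaced element. The namespace URIs, the `<doc>` wrapper and the `<body>` sibling are assumptions, since the excerpt does not show the declarations. The prefixes used in the XPath only need to match the URIs, not the prefixes in the file.

```python
from lxml import etree

# Assumed namespace URIs; substitute whatever the real document declares.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = etree.XML(
    '<doc xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<rdf:description><dc:title>Example</dc:title></rdf:description>"
    "<body>kept</body>"
    "</doc>"
)

# An element can only be removed through its own parent.
for node in root.xpath("//rdf:description", namespaces=NS):
    node.getparent().remove(node)

remaining = [child.tag for child in root]
```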
Let's say I have the following HTML:
<div>
text1
<div>
t1
</div>
text2
<div>
t2
</div>
text3
</div>
I know how to get the text and subelements of the enclosing div using lxml.html. But is there a way to access both text and sub elements in an iterative manner, that preserves order? In other words, I want to know where the "fr...
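One possible approach is the XPath `node()` test, which returns text nodes and child elements in document order. A minimal sketch, with `iter_in_order` as a hypothetical helper name:

```python
from lxml import html

fragment = html.fromstring(
    "<div>text1<div>t1</div>text2<div>t2</div>text3</div>"
)

def iter_in_order(element):
    """Yield ('text', str) or ('element', Element) for each direct child node."""
    for node in element.xpath("node()"):
        if isinstance(node, str):
            # Text results are str subclasses; skip whitespace-only runs.
            if node.strip():
                yield ("text", node.strip())
        else:
            yield ("element", node)

parts = [
    (kind, value if kind == "text" else value.tag)
    for kind, value in iter_in_order(fragment)
]
```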
I'm trying to install lxml, on an Ubuntu server running Python 2.6 (in a virtualenv - the system Python is 2.5).
I've checked out via svn and as a result I've also installed Cython, as per the instructions.
However, I get the following error when running python setup.py build:
Building lxml version 2.3.alpha1-76211.
Building with Cython...
I'd like to build a graph showing which tags are used as children of which other tags in a given XML document.
I've written this function to get the unique set of child tags for a given tag in an lxml.etree tree:
def iter_unique_child_tags(root, tag):
    """Iterates through unique child tags for all instances of tag.
    Iteration st...
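For the graph itself, a single pass over the tree may be enough. This is a sketch under that assumption, not the truncated function above; `child_tag_graph` is a hypothetical name.

```python
from collections import defaultdict
from lxml import etree

def child_tag_graph(root):
    """Map each element tag to the set of tags seen among its direct children."""
    graph = defaultdict(set)
    for parent in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(parent.tag, str):
            continue
        for child in parent:
            if isinstance(child.tag, str):
                graph[parent.tag].add(child.tag)
    return dict(graph)

sample = etree.XML("<a><b><c/></b><b><d/></b><c/><!-- note --></a>")
graph = child_tag_graph(sample)
```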
I'm working in Python with HTML that looks like this. I'm parsing with lxml, but could equally happily use pyquery:
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>
Pulling out 'Name' and 'Address' is dead easy, whatever library I use, but how do I get the remainder...
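In lxml, text that follows an element's closing tag is stored on that element as `.tail`. A minimal sketch, with a wrapping `<div>` added only so the two paragraphs parse as one fragment:

```python
from lxml import html

doc = html.fromstring(
    "<div>"
    '<p><span class="Title">Name</span>Dave Davies</p>'
    '<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'
    "</div>"
)

# span.text is the label; span.tail is the text after </span>.
fields = {
    span.text: (span.tail or "").strip()
    for span in doc.xpath('//span[@class="Title"]')
}
```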
Hi,
I'm trying to do this (using lxml):
//*[@id="32808345" or @id="33771423" or @id="15929470" or @id="33771117" or @id="15929266"]
in order to get all elements, no matter what tag, with the specified id's. I'm getting the following traceback:
invalid attribute predicate
this is how I'm generating the str (if that is relevant to t...
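An error of this kind typically comes from `find()`/`findall()`, which only support the limited ElementPath syntax; that this is the cause here is an assumption, since the traceback is cut off. The `xpath()` method accepts full XPath 1.0, including `or`. A minimal sketch with a made-up document:

```python
from lxml import etree

root = etree.XML(
    '<r><a id="32808345"/><b id="1"/><c id="15929266"/></r>'
)

ids = ["32808345", "15929266"]
# Build the predicate from the id list, then evaluate it with full XPath.
expr = "//*[" + " or ".join('@id="%s"' % i for i in ids) + "]"
matches = root.xpath(expr)
tags = [el.tag for el in matches]
```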
lxml 1.3 gives the following error for the line below:
self.doc.findall('.//field[@on_change]')
File "/home/.../code_generator/xmlGenerator.py", line 158, in processOnChange
onchangeNodes = self.doc.findall('.//field[@on_change]')
File "etree.pyx", line 1042, in etree._Element.findall
File "/usr/lib/python2.5/site-packages/lxm...
I'd like to do an XSL transform on a DocBook document using lxml.etree.XSLT.
Although the documentation mentions that etree.XSLT() takes a first parameter of xslt_input, I can't seem to find any docs on what this parameter is meant to be. Passing it a file that is open for reading seems to work; passing it a filename in a string does n...
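`etree.XSLT()` expects an already parsed stylesheet (an element or element tree), not a filename. A minimal sketch with a made-up inline stylesheet; for a file on disk, something like `etree.XSLT(etree.parse("stylesheet.xsl"))` would be the equivalent, where the filename is hypothetical.

```python
from lxml import etree

# Toy stylesheet: output the article title as plain text.
stylesheet = etree.XML(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:output method="text"/>'
    '<xsl:template match="/">'
    '<xsl:value-of select="/article/title"/>'
    "</xsl:template>"
    "</xsl:stylesheet>"
)

transform = etree.XSLT(stylesheet)  # compile the parsed stylesheet once
doc = etree.XML("<article><title>Hello</title></article>")
result = str(transform(doc))        # apply it to a parsed document
```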
So I have some XML like this:
<bar>
<foo>Something</foo>
<baz>
<foo>Hello</foo>
<zap>Another</zap>
</baz>
</bar>
And I want to remove all the foo nodes. Something like this doesn't work
params = xml.xpath('//foo')
for n in params:
    xml.getroot().remove(n)
Giving
ValueError: Element is not a child of this node.
Wha...
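`remove()` only works on direct children, and the nested `foo` is a child of `baz`, not of the root. A minimal sketch that removes each node through its own parent instead:

```python
from lxml import etree

tree = etree.ElementTree(etree.XML(
    "<bar><foo>Something</foo>"
    "<baz><foo>Hello</foo><zap>Another</zap></baz></bar>"
))

for node in tree.xpath("//foo"):
    # Ask the node for its actual parent rather than assuming the root.
    node.getparent().remove(node)

result = etree.tostring(tree.getroot())
```

Note that `remove()` also discards the removed element's tail text, which matters for mixed content.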
I'd like to use the docbook XSL stylesheets to render various parts of a document, without transforming the entire thing.
The complication is that some of these parts have <footnoteref> elements whose linkend attributes are not located within the same chunk. In other words, I want to process a branch of the tree which includes the <foo...
I'm attempting to install various python extensions on OS X (10.6.4), with a python.org python (Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)). Consistently running into a problem on the gcc step. Here's a sample from compiling Cython (btw, I'm attempting to install Cython in order to install lxml):
In file included from /usr/includ...
I'm trying to remove everything in an XML document between 2 tags, using Python & lxml. The problem is that the tags can be in different branches of the tree (but always at the same depth). An example document might look like this:
<root>
<p> Hello world <start />this is a paragraph </p>
<p> Goodbye world. <end />I'm leaving now...
I have an XML file that specifies an encoding, and I use UnicodeDammit to convert it to unicode (for reasons of storage, I can't store it as a string). I later pass it to lxml but it refuses to ignore the encoding specified in the file and parse it as Unicode, and it raises an exception.
How can I force lxml to parse the document? This ...
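One commonly suggested workaround is to encode the text yourself and give the parser an explicit encoding, which overrides the declaration in the document. A minimal sketch; `parse_from_unicode` is a hypothetical helper name and the sample text is made up.

```python
from lxml import etree

# The parser-level encoding takes precedence over the one declared in the XML.
utf8_parser = etree.XMLParser(encoding="utf-8")

def parse_from_unicode(text):
    """Parse a str that still carries a (possibly stale) encoding declaration."""
    return etree.fromstring(text.encode("utf-8"), parser=utf8_parser)

text = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'
root = parse_from_unicode(text)
```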
Today I tried lxml because I got very nasty HTML output from a particular web service, and I didn't want to go with the re module, just for a change and to learn something new. And I did, browsing http://codespeak.net/lxml/ and http://stackoverflow.com in parallel.
I won't try to explain above html template, but just for overview it's full of deliber...
I'm doing some scraping of a page and I'm fine with getting most fields, but having some problems with the address.
<address>
56 South Ave
<br>
Miami, FL 33131
<br>
</address>
address = myWebPage.xpath("//div[contains(@class,'rightcol')]//address")
I can get the first line, 56 South Ave, using the above code. But I can't...
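The later lines are stored as the `.tail` of each `<br>`, not as the `.text` of `<address>`. Selecting `text()` children returns all of them. A minimal sketch, with the `rightcol` wrapper reconstructed from the XPath in the question:

```python
from lxml import html

page = html.fromstring(
    '<div class="rightcol"><address>\n56 South Ave\n<br>\n'
    "Miami, FL 33131\n<br>\n</address></div>"
)

# text() matches every text node directly inside <address>, tails included.
raw = page.xpath("//div[contains(@class,'rightcol')]//address/text()")
lines = [t.strip() for t in raw if t.strip()]
```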
Hi all. I have the following XML document:
<x>
<a>Some text</a>
<b>Some text 2</b>
<c>Some text 3</c>
</x>
I want to get the text of all the tags, so I decided to use getiterator().
My problem is, it adds blank lines for a reason I can't understand. Consider this:
>>> for text in document_root.getiterator():
... print t...
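The iterator also visits the root `<x>`, whose `.text` is the whitespace between `<x>` and the first child. A minimal sketch that skips empty or whitespace-only text, using `iter()` (the current name for `getiterator()`) and the sample document with its first closing tag corrected:

```python
from lxml import etree

document_root = etree.XML(
    "<x>\n  <a>Some text</a>\n  <b>Some text 2</b>\n  <c>Some text 3</c>\n</x>"
)

# Keep only elements whose text has non-whitespace content.
texts = [
    el.text.strip()
    for el in document_root.iter()
    if el.text and el.text.strip()
]
```

Passing `remove_blank_text=True` to `etree.XMLParser` is another option when parsing.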
My application needs lxml >= 2.1,
but installing lxml requires libxml2-dev and libxslt1-dev to be installed first;
otherwise the lxml installation fails with an error.
Is there a way, using Python setuptools, to declare this as a dependency in my setup.py....
...
I am working with some html files. I am trying to figure out a way to consistently get to some text that exists in the documents. I know that the section I want begins with some bolded words and I know that the section ends with other bolded words.
bolded_items=atree.cssselect('b')
myKeys=[item for item in bolded_items if item.text if...