lxml

Using an XML catalog with Python's lxml?

Is there a way, when I parse an XML document using lxml, to validate that document against its DTD using an external catalog file? I need to be able to work with the fixed attributes defined in a document’s DTD. ...

Why doesn't xpath work when processing an XHTML document with lxml (in python)?

I am testing against the following test document: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>hi there</title> </head> <body> ...
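A likely cause, assuming the truncated question is about unprefixed XPath expressions returning nothing: every element in this document lives in the XHTML namespace, and XPath 1.0 has no default namespace, so `//title` matches nothing. A minimal sketch follows. The prefix `x` is an arbitrary choice, the DOCTYPE is left out for brevity, and the body content is a placeholder because the original document is cut off.

```python
from lxml import etree

# Bytes input: lxml rejects str input that carries an encoding declaration.
xhtml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>hi there</title></head>"
    b"<body><p>placeholder</p></body>"
    b"</html>"
)
root = etree.fromstring(xhtml)

# Map a prefix of our choosing to the XHTML namespace URI.
ns = {"x": "http://www.w3.org/1999/xhtml"}

unprefixed = root.xpath("//title/text()")                 # finds nothing
prefixed = root.xpath("//x:title/text()", namespaces=ns)  # finds the title text
print(unprefixed, prefixed)
```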

How to match a text node then follow parent nodes using XPath

I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1', then grab the contents of the relevant content node. <doc> <block> <title>Text 1</title> <content>Stuff I want</content> </block> <block> <title>Text 2</title> <content>S...
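One way to express this, as a minimal sketch: put the title test in a predicate on the enclosing `block`, so there is no need to walk back up to the parent. The content of the second block is a placeholder, since the original example is truncated.

```python
from lxml import etree

doc = etree.fromstring(
    "<doc>"
    "<block><title>Text 1</title><content>Stuff I want</content></block>"
    "<block><title>Text 2</title><content>placeholder</content></block>"
    "</doc>"
)

# Select the block whose <title> is 'Text 1', then step down to its <content>.
via_predicate = doc.xpath("//block[title='Text 1']/content/text()")

# Equivalent form that matches the title first and then follows the parent axis.
via_parent = doc.xpath("//title[text()='Text 1']/../content/text()")
print(via_predicate, via_parent)
```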

Need python lxml syntax help for parsing html

Hello! I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with: HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each ...

How can I print entity numbers in my xml document instead of entity names using python's lxml?

Heyas, I'm using lxml and Python to generate XML documents (just using etree.tostring(root)), but at the moment the resulting XML displays HTML entities as named entities ( &lt ; ) rather than their numeric values ( &#60 ; ). How exactly do I go about changing this so that the result uses the numeric values instead of the names? T...

Decoding problems in Django and lxml

I have a strange problem with lxml when using the deployed version of my Django application. I use lxml to parse another HTML page which I fetch from my server. This works perfectly well on my development server on my own computer, but for some reason it gives me UnicodeDecodeError on the server. ('utf8', "\x85why hello there!", 0, 1,...

Python lxml screen scraping?

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do. This is why I am here. I need to scrape a page for all of its viewable text: strip out all tags and JavaScript. I need it to leave me with what text is vi...

Python Iterator Help + lxml

I have this script: import lxml from lxml.cssselect import CSSSelector from lxml.etree import fromstring from lxml.html import parse website = parse('http://xxx.com').getroot() selector = website.cssselect('.name') for i in range(0,18): print selector[i].text_content() As you can see the for loop stops after a number of ti...

Python: adding namespaces in lxml

I'm trying to specify a namespace using lxml similar to this example (taken from here): <TreeInventory xsi:noNamespaceSchemaLocation="Trees.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> </TreeInventory> I'm not sure how to add the Schema instance to use and also the Schema location. The documentation got me start...
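A minimal sketch of one way to produce that element: declare the prefix through `nsmap` when creating the root, then set the attribute using Clark notation (`{namespace-uri}localname`). The variable names here are illustrative only.

```python
from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# nsmap declares xmlns:xsi on the root element.
root = etree.Element("TreeInventory", nsmap={"xsi": XSI})

# Namespaced attributes are addressed as "{uri}name".
root.set("{" + XSI + "}noNamespaceSchemaLocation", "Trees.xsd")

xml = etree.tostring(root).decode()
print(xml)
```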

Problem using py2app with the lxml package

I am trying to use 'py2app' to generate a standalone application from some Python scripts. The Python uses the 'lxml' package, and I've found that I have to specify this explicitly in the setup.py file that 'py2app' uses. However, the resulting application program still won't run on machines that haven't had 'lxml' installed. My Setup.p...

How to parse malformed HTML in python

I need to browse the DOM tree of a parsed HTML document. I'm using uTidyLib before parsing the string with lxml: a = tidy.parseString(html_code, options) dom = etree.fromstring(str(a)) Sometimes I get an error; it seems that tidylib is not able to repair malformed HTML. How can I parse every HTML file without getting an error (parsing...
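One option worth trying, sketched under the assumption that the tidy step can be dropped: lxml ships its own HTML parser, `etree.HTMLParser`, which tries to recover from broken markup instead of raising. The sample markup below is made up for illustration.

```python
from lxml import etree

# Deliberately broken markup: unclosed <p> and <b> tags.
broken = "<div><p>unclosed paragraph<b>bold text<p>second paragraph</div>"

parser = etree.HTMLParser()  # recovery from malformed input is on by default
root = etree.fromstring(broken, parser)

# The parser wraps the fragment in a full <html><body> tree.
paragraphs = root.xpath("//p")
print(root.tag, len(paragraphs))
```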

How can I instantiate a comment element programmatically using lxml?

Hi there, I'm using lxml to programmatically build HTML and I need to include a custom comment in the output. Whilst there is code in lxml to cope with comments (they can be instantiated when parsing existing HTML code), I cannot find a way to instantiate one programmatically. Can anyone help? ...
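A minimal sketch using the `etree.Comment` factory, which creates a comment node that can be appended like any other element. The element names and comment text are placeholders.

```python
from lxml import etree

root = etree.Element("div")

# etree.Comment builds a comment node; append it like a regular child.
root.append(etree.Comment(" custom comment "))
etree.SubElement(root, "p").text = "hello"

html = etree.tostring(root, method="html").decode()
print(html)
```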

How can I make lxml's parser preserve whitespace outside of the root element?

I am using lxml to manipulate some existing XML documents, and I want to introduce as little diff noise as possible. Unfortunately by default lxml.etree.XMLParser doesn't preserve whitespace before or after the root element of a document: >>> xml = '\n <etaoin>shrdlu</etaoin>\n' >>> lxml.etree.tostring(lxml.etree.fromstring(xml)) '<e...

Creating a doctype with lxml's etree

I want to add doctypes to my XML documents that I'm generating with lxml's etree. However, I cannot figure out how to add a doctype. Hardcoding and concatenating the string is not an option. I was expecting something along the lines of how PIs are added in etree: pi = etree.PI(...) doc.addprevious(pi) But it's not working for me. How ...
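One facility that may help, as a hedged sketch: `etree.tostring` accepts a `doctype` keyword that is written out ahead of the root element. The declaration is still passed as a string, so this may not satisfy the "no hardcoding" constraint, but it does avoid manual concatenation with the serialized output. The XHTML doctype is used purely as an example value.

```python
from lxml import etree

root = etree.Element("html")
etree.SubElement(root, "body")

# The doctype keyword is emitted before the serialized tree.
xml = etree.tostring(
    root,
    doctype='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
            '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
).decode()
print(xml)
```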

Finding the parent tag of a text string with ElementTree/lxml

I'm trying to take a string of text, and "extract" the rest of the text in the paragraph/document from the HTML. My current approach is trying to find the "parent tag" of the string in the HTML that has been parsed with lxml. (If you know of a better way to tackle this problem, I'm all ears!) For example, search the tree for "TEXT S...
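A possible approach, as a minimal sketch: select the text node with XPath, then call `getparent()` on the result, which lxml provides on text results returned by `xpath()`. The markup and the search string `NEEDLE` are hypothetical stand-ins for the truncated example.

```python
from lxml import html

page = html.fromstring(
    "<div><p>Intro paragraph.</p>"
    "<p>Some words, then the NEEDLE, then the rest.</p></div>"
)

# $needle is an XPath variable, filled in from the keyword argument.
hits = page.xpath("//text()[contains(., $needle)]", needle="NEEDLE")

# Text results remember the element they belong to.
parent = hits[0].getparent()
paragraph_text = parent.text_content()
print(parent.tag, paragraph_text)
```

Note that for text following a child element (tail text), `getparent()` returns the element the text trails, so that case may need an extra step up.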

lxml retrieving odd items with cssselector

In my test document I have a few elements with the class "item". Currently I'm using the following to select everything in the HTML file with this class: Selection = html.cssselect(".item") I'd like it to select all the odd items, like this in JavaScript using jQuery: Selection = $(".item:odd"); Trying that verbatim I get the following e...

How to get lxml working under IronPython?

I need to port some code that relies heavily on lxml from a CPython application to IronPython. lxml is very Pythonic and I would like to keep using it under IronPython, but it depends on libxslt and libxml2, which are C extensions. Does anyone know of a workaround to allow lxml under IronPython or a version of lxml that doesn't have th...

Changing the default indentation of etree.tostring in lxml

I have an XML document which I'm pretty-printing using lxml.etree.tostring: print etree.tostring(doc, pretty_print=True) The default level of indentation is 2 spaces, and I'd like to change this to 4 spaces. There isn't any argument for this in the tostring function; is there a way to do this easily with lxml? ...
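In newer lxml releases (4.5 and later), `etree.indent` rewrites the whitespace in the tree in place and takes a `space` argument, which may be the simplest route. A minimal sketch with a made-up document:

```python
from lxml import etree

root = etree.fromstring("<a><b><c/></b></a>")

# Rewrites text/tail whitespace in place, using four spaces per level.
etree.indent(root, space="    ")

xml = etree.tostring(root).decode()
print(xml)
```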

How do you install lxml on OS X Leopard without using MacPorts or Fink?

I've tried this and run into problems a bunch of times in the past. Does anyone have a recipe for installing lxml on OS X without MacPorts or Fink that definitely works? Preferably with complete 1-2-3 steps for downloading and building each of the dependencies. ...

Python packages depending on libxml2 and libxslt

Apart from lxml, is anyone aware of Python packages that depend on libxml2 and libxslt? ...