Is there a way, when I parse an XML document using lxml, to validate that document against its DTD using an external catalog file? I need to be able to work with the fixed attributes defined in a document’s DTD.
...
I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
...
I'm trying to parse some HTML with XPath. Following the simplified XML example below, I want to match the string 'Text 1', then grab the contents of the relevant content node.
<doc>
<block>
<title>Text 1</title>
<content>Stuff I want</content>
</block>
<block>
<title>Text 2</title>
<content>S...
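One possible approach, sketched under assumptions: the sample below is modeled on the simplified XML in the question, and the second block's content ("Other stuff") is a made-up placeholder because the original is truncated. An XPath predicate on the title child can pick the matching block, and the path then steps to its content node.

```python
from lxml import etree

# Sample data modeled on the question; "Other stuff" is a placeholder.
xml = """<doc>
  <block>
    <title>Text 1</title>
    <content>Stuff I want</content>
  </block>
  <block>
    <title>Text 2</title>
    <content>Other stuff</content>
  </block>
</doc>"""

doc = etree.fromstring(xml)

# Select the text of <content> in every <block> whose <title> is exactly 'Text 1'.
matches = doc.xpath("//block[title='Text 1']/content/text()")
print(matches)
```

For real HTML rather than XML, the same expression should work on a tree parsed with lxml.html, with the tag names adjusted.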
Hello!
I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with:
HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each ...
Heyas
I'm using lxml and Python to generate XML documents (just using etree.tostring(root)), but at the moment the resulting XML displays HTML entities as named entities (&lt;) rather than their numeric values (&#60;). How exactly do I go about changing this so that the result uses the numeric values instead of the names?
T...
I have a strange problem with lxml when using the deployed version of my Django application. I use lxml to parse another HTML page which I fetch from my server. This works perfectly well on my development server on my own computer, but for some reason it gives me UnicodeDecodeError on the server.
('utf8', "\x85why hello there!", 0, 1,...
I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do. This is why I am here. I need to scrape a page for all of its viewable text, stripping out all tags and JavaScript. I need it to leave me with what text is vi...
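A minimal sketch of one way to do this, assuming the page is already available as a string (the `html` sample here is invented for illustration). It drops script and style elements with `etree.strip_elements` and then reads the remaining text of the body with `text_content()`.

```python
import lxml.html
from lxml import etree

# Invented sample page; a real page would come from a file or an HTTP response.
html = """<html><head><title>T</title><script>var x = 1;</script>
<style>p { color: red; }</style></head>
<body><p>Hello <b>world</b></p><script>alert('hi');</script></body></html>"""

doc = lxml.html.fromstring(html)

# Remove <script> and <style> elements together with their contents,
# but keep any text that follows them (with_tail=False).
etree.strip_elements(doc, "script", "style", with_tail=False)

# text_content() concatenates all text nodes below the element.
body = doc.find("body")
visible = " ".join(body.text_content().split())
print(visible)
```

This only approximates "viewable" text: elements hidden through CSS or filled in by JavaScript are not accounted for.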
I have this script-
import lxml
from lxml.cssselect import CSSSelector
from lxml.etree import fromstring
from lxml.html import parse
website = parse('http://xxx.com').getroot()
selector = website.cssselect('.name')
for i in range(0,18):
    print selector[i].text_content()
As you can see the for loop stops after a number of ti...
I'm trying to specify a namespace using lxml similar to this example (taken from here):
<TreeInventory xsi:noNamespaceSchemaLocation="Trees.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</TreeInventory>
I'm not sure how to add the Schema instance to use and also the Schema location.
The documentation got me start...
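A sketch of how this might be built, assuming the goal is exactly the element shown above. The prefix is declared through `nsmap`, and the attribute is named in Clark notation (`{namespace}name`); the constant name `XSI` is just a label chosen for this example.

```python
from lxml import etree

# Namespace URI from the question's example; XSI is a local name for it.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

root = etree.Element(
    "TreeInventory",
    # Clark notation binds the attribute to the xsi namespace.
    attrib={"{%s}noNamespaceSchemaLocation" % XSI: "Trees.xsd"},
    # nsmap declares the xsi prefix on the element.
    nsmap={"xsi": XSI},
)

result = etree.tostring(root, pretty_print=True).decode()
print(result)
```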
I am trying to use 'py2app' to generate a standalone application from some Python scripts. The Python uses the 'lxml' package, and I've found that I have to specify this explicitly in the setup.py file that 'py2app' uses. However, the resulting application program still won't run on machines that haven't had 'lxml' installed.
My Setup.p...
I need to browse the DOM tree of a parsed HTML document.
I'm using uTidyLib before parsing the string with lxml
a = tidy.parseString(html_code, options)
dom = etree.fromstring(str(a))
Sometimes I get an error; it seems that tidylib is not able to repair malformed HTML.
How can I parse every HTML file without getting an error (parsing...
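One option worth trying, offered as a sketch rather than a guaranteed fix: lxml's own HTML parser recovers from many kinds of broken markup, so the tidy step may not be needed. The `broken` string below is an invented example of malformed input.

```python
import lxml.html

# Invented malformed input: <p> and <b> are never closed.
broken = "<div><p>unclosed paragraph<b>bold</div>"

# lxml.html uses libxml2's HTML parser, which tries to repair such markup.
root = lxml.html.fromstring(broken)

print(root.tag)
print(lxml.html.tostring(root))
```

Input that is empty or badly damaged can still raise an error, so wrapping the call in a try/except is probably still sensible.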
Hi there
I'm using lxml to programmatically build HTML and I need to include a custom comment in the output. Whilst there is code in lxml to cope with comments (they can be instantiated when parsing existing HTML code), I cannot find a way to instantiate one programmatically.
Can anyone help?
...
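A small sketch using the `etree.Comment` factory, assuming a plain etree tree is acceptable; the element names and comment text are made up for the example.

```python
from lxml import etree

root = etree.Element("div")

# etree.Comment creates a comment node that can be appended like an element.
comment = etree.Comment(" custom comment ")
root.append(comment)

etree.SubElement(root, "p").text = "hello"

result = etree.tostring(root).decode()
print(result)
```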
I am using lxml to manipulate some existing XML documents, and I want to introduce as little diff noise as possible. Unfortunately by default lxml.etree.XMLParser doesn't preserve whitespace before or after the root element of a document:
>>> xml = '\n <etaoin>shrdlu</etaoin>\n'
>>> lxml.etree.tostring(lxml.etree.fromstring(xml))
'<e...
I want to add doctypes to my XML documents that I'm generating with LXML's etree.
However, I cannot figure out how to add a doctype. Hardcoding and concatenating the string is not an option.
I was expecting something along the lines of how PI's are added in etree:
pi = etree.PI(...)
doc.addprevious(pi)
But it's not working for me. How ...
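One approach that may fit, sketched with an assumed XHTML doctype: `etree.tostring` accepts a `doctype` keyword, so the declaration is handed to the serializer instead of being concatenated onto the output by hand. Whether this counts as "hardcoding" for the asker's purposes is an open question.

```python
from lxml import etree

root = etree.Element("html")

# The doctype string is an assumption for illustration; substitute your own.
doctype = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)

# The serializer emits the doctype before the root element.
result = etree.tostring(root, doctype=doctype).decode()
print(result)
```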
I'm trying to take a string of text, and "extract" the rest of the text in the paragraph/document from the html.
My current approach is trying to find the "parent tag" of the string in the HTML that has been parsed with lxml. (If you know of a better way to tackle this problem, I'm all ears!)
For example, search the tree for "TEXT S...
In my test document I have a few elements with the class "item". Currently I'm using the following to select everything in the HTML file with this class:
Selection = html.cssselect(".item")
I'd like it to select all the odd items, like this in JavaScript using jQuery:
Selection = $(".item:odd");
Trying that verbatim I get the following e...
I need to port some code that relies heavily on lxml from a CPython application to IronPython.
lxml is very Pythonic and I would like to keep using it under IronPython, but it depends on libxslt and libxml2, which are C extensions.
Does anyone know of a workaround to allow lxml under IronPython or a version of lxml that doesn't have th...
I have an XML document which I'm pretty-printing using lxml.etree.tostring
print etree.tostring(doc, pretty_print=True)
The default level of indentation is 2 spaces, and I'd like to change this to 4 spaces. There isn't any argument for this in the tostring function; is there a way to do this easily with lxml?
...
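A sketch of one possibility, assuming a reasonably recent lxml (the `etree.indent` function was added in lxml 4.5). It rewrites the whitespace in the tree in place, after which the tree can be serialized without `pretty_print`. The sample document is invented.

```python
from lxml import etree

# Invented sample document.
doc = etree.fromstring("<root><child><leaf/></child></root>")

# Re-indent the tree in place using four spaces per level (lxml >= 4.5).
etree.indent(doc, space="    ")

result = etree.tostring(doc, encoding="unicode")
print(result)
```

On older lxml versions this function is not available, so a different approach would be needed there.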
I've tried this and run into problems a bunch of times in the past. Does anyone have a recipe for installing lxml on OS X without MacPorts or Fink that definitely works?
Preferably with complete 1-2-3 steps for downloading and building each of the dependencies.
...
Apart from lxml, is anyone aware of Python packages that depend on libxml2 and libxslt?
...