I'm trying to wean myself from BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't figure out how to use the "find" and "findall" operators.
By looking at the docs for html5lib, I came up with this for a test program:
import cStringIO
f = cStringIO.S...
I have an XML file, books.xml (http://msdn.microsoft.com/en-us/library/ms762271(VS.85).aspx).
I would like to cat books.xml and get every book id along with the genre for that book id.
Similar to
cat books.xml | python reader.py
Any tips or help would be appreciated. Thanks.
...
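A minimal sketch of such a reader.py, assuming the linked sample has the shape `<catalog><book id="..."><genre>...</genre>...</book></catalog>`. The function names `book_genres` and `main` are made up for illustration, and the standard-library ElementTree is used here; the same calls also exist in lxml.etree.

```python
import sys
import xml.etree.ElementTree as ET  # lxml.etree provides the same calls used here


def book_genres(xml_bytes):
    """Return (id, genre) pairs for every <book> element in the document."""
    root = ET.fromstring(xml_bytes)
    # Assumption: each <book> carries an "id" attribute and a <genre> child.
    return [(book.get("id"), book.findtext("genre")) for book in root.iter("book")]


def main():
    # Read the piped document as bytes so the XML declaration decides the encoding.
    for book_id, genre in book_genres(sys.stdin.buffer.read()):
        print(book_id, genre)
```

With a call to `main()` added at the bottom of reader.py, `cat books.xml | python reader.py` would print one id and genre per line.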
I am using CherryPy in a RESTful web service, and the server returns XML as a result (lxml is used to create the XML). Some of those XML documents are quite large. I have noticed that memory is not released after a request that returns a large XML document has been processed.
So, I have isolated the problem and created this very short dummy examp...
How can I iterate over all tags which have a specific attribute with a specific value? For instance, let's say we need the data1, data2 etc... only.
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, it is not this"> ... </interesting t...
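One way to approach this is an attribute predicate in the path expression. The sketch below is only an illustration: the sample input is well-formed XML, the attribute value "yes" is hypothetical (the excerpt does not show the matching value), and `texts_with_attribute` is a made-up helper name. It uses the standard-library ElementTree; lxml.etree accepts the same `iterfind` path, and for invalid HTML like the sample above the document would first have to be parsed with lxml's `etree.HTMLParser()`.

```python
import xml.etree.ElementTree as etree


def texts_with_attribute(xml_text, name, value):
    """Collect the text of every element whose attribute `name` equals `value`."""
    root = etree.fromstring(xml_text)
    # Simple predicate; this sketch assumes `value` contains no single quotes.
    path = ".//*[@{}='{}']".format(name, value)
    return [el.text for el in root.iterfind(path)]


# Hypothetical input modeled loosely on the excerpt above.
sample = (
    '<body>'
    '<interesting attrib1="no">skip</interesting>'
    '<interesting attrib1="yes">data1</interesting>'
    '<interesting attrib1="yes">data2</interesting>'
    '</body>'
)
matches = texts_with_attribute(sample, "attrib1", "yes")
```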
I am trying to parse a file encoded in UTF-8. No operation causes a problem apart from writing to a file (or at least I think so). A minimal working example follows:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')
example.txt:
<html>
<body>
<invalid ...
I have been looking for a while for a Python module/API that does something I believe is quite simple:
Read an XML file
Add/Edit/Remove entries
So far I've found several snippets that interface with complicated object-oriented databases, but nothing as dead simple as:
xml = etree.parse ('file.xml')
xml.add(xpath, new_node(attrs))
xml.remo...
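The ElementTree API (shared by lxml.etree and the standard library) is not far from the pseudocode above, although there is no single `add(xpath, ...)` call. A rough sketch, where the element names, the ids, and the helper name `edit_items` are all invented for illustration:

```python
import xml.etree.ElementTree as etree  # the same calls exist in lxml.etree


def edit_items(xml_text):
    """Add, edit, and remove <item> entries, then return the serialized tree."""
    root = etree.fromstring(xml_text)

    # Add: append a new child with attributes and text.
    etree.SubElement(root, "item", {"id": "3"}).text = "new"

    # Edit: locate an entry by attribute and change it in place.
    first = root.find("./item[@id='1']")
    if first is not None:
        first.set("status", "edited")

    # Remove: removal goes through the parent element.
    for stale in root.findall("./item[@id='2']"):
        root.remove(stale)

    return etree.tostring(root, encoding="unicode")


result = edit_items('<items><item id="1">a</item><item id="2">b</item></items>')
```

For a file on disk, `etree.parse('file.xml')` followed by `tree.write('file.xml')` would replace the `fromstring`/`tostring` pair.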
For a customer I have to install a django webserver on SUSE Linux Enterprise Server 11, 64 bit (short: SLES 11).
When I add repositories from http://software.opensuse.org I can install python-lxml:
sudo zypper install python-lxml
The result is that the site-packages are installed in /usr/lib/python2.6/site-packages. However when I tr...
I'm adapting the following code (created via advice in this question), which took an XML file and its DTD and converted them to a different format. For this problem only the loading section is important:
xmldoc = open(filename)
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(xmldoc, parser)
This wo...
I'd like to store some relatively simple stuff in XML in a cascading manner. The idea is that a build can have a number of parameter sets, and the Python script creates the necessary build artifacts (*.h etc.) by reading these sets; if two sets have the same parameter, the latter one replaces the former.
There are (at least) two diffe...
I have a tool producing NewsML type XML files and I want to validate them after producing the files.
I'm receiving an error:
Attempt to load network entity http://www.w3.org/TR/ruby/xhtml-ruby-1.mod
The python call is:
parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
treeObject = etree.parse(f, parser)
First I'm not sure...
In [1]: from lxml import etree
I've got an HTML document:
In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())
Its doctype is parsed correctly:
In [3]: root.getroottree().docinfo.doctype
Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'
But when serializ...
I want to use the lxml cleaner to get rid of all HTML, but then use a regex to autolink something:
[ABC] -> <a href="bah bah bah">ABC</a>
What is the right way to handle this without XSS and such?
...
I'm using lxml.html.cleaner to clean HTML from an input text. How can I change \n to <br /> in lxml.html?
...
Hello
I am working on a project using lxml. Here is a sample XML:
<PatientsTree>
<Patient PatientID="SKU065427">
<Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006050107501192100000001">
<Series SeriesInstanceUID="2.16.840.1.113669.1919.1176798690"/>
<Series SeriesInstanceUID="2.16.840.1.113669.1919.117708...
Is there a way to remove/escape HTML tags using lxml.html and not BeautifulSoup, which has some XSS issues? I tried using the cleaner, but I want to remove all HTML.
...
Hi,
I'm trying to parse large XML files (>3GB) like this:
context = lxml.etree.iterparse(path)
for action, el in context:
    pass  # do sth. with el
I thought that with iterparse the data is not loaded completely into RAM, but according to this article I'm wrong:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (see Listing 4)
...
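For context, iterparse still builds the tree as it goes, so memory only stays flat if processed elements are released. A minimal sketch of that idea follows, using the standard-library ElementTree, whose `iterparse` lxml mirrors; the function name `count_records` and the `rec` tag are made up for illustration. With lxml, the linked article additionally deletes already-processed preceding siblings via `getprevious()` and `getparent()`, which this sketch leaves out.

```python
import io
import xml.etree.ElementTree as etree  # lxml.etree offers the same iterparse call


def count_records(source, tag):
    """Stream through `source`, counting `tag` elements and freeing each one."""
    count = 0
    for event, el in etree.iterparse(source, events=("end",)):
        if el.tag == tag:
            count += 1  # placeholder for the real per-element work
            # Drop the element's children, text, and attributes once handled.
            el.clear()
    return count


# Tiny in-memory stand-in for a large file on disk.
demo = io.BytesIO(b"<root><rec><a>1</a></rec><rec><a>2</a></rec></root>")
total = count_records(demo, "rec")
```

Note that `clear()` alone leaves empty element shells attached to the root, so a very large number of records can still accumulate some memory.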
I'm using the following code to locate a div:
parser = etree.HTMLParser()
tree = etree.parse(StringIO(page), parser)
div = tree.xpath("//div[@class='content']")[0]
My only problem is that after doing this I do not want to rely on lxml to extract the contents of said div: I just want to get back the raw XML the div contains. Is this ...
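One possible approach is to serialize the div's leading text and each child in turn. This is only a sketch on a well-formed sample with the standard-library ElementTree; `inner_xml` is a made-up helper name. With lxml the same `tostring` call exists, and for a tree built by `etree.HTMLParser()` passing `method="html"` may be preferable.

```python
import xml.etree.ElementTree as etree  # lxml.etree provides the same tostring call


def inner_xml(element):
    """Return the markup inside `element`, without the element's own tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the child's tail text, so nothing is lost.
        parts.append(etree.tostring(child, encoding="unicode"))
    return "".join(parts)


div = etree.fromstring('<div class="content">Hello <b>world</b>!</div>')
raw = inner_xml(div)
```

Calling `etree.tostring(div, encoding="unicode")` instead would include the surrounding `<div>` tags as well.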
I've got the following BeautifulSoup code, a bit simplified.
soup = BeautifulSoup(html)
for item in soup.findAll('div', id=compile('^result_')):
    q = item.find('a', {'class': 'title'})
    if q:
        ...
    q = item.find('div', {'class': ['one', 'two']})
    if q:
        ...
I profiled it, and it's quite slow. I want to try lxml instead but it seem...
Hi there.
I am using lxml.html to parse some HTML to get links. However, when it hits a link that contains an image it just returns blank. What I'd really like is to detect whether it's an image and then return the image's alt text.
So it looks like this...
from lxml.html import parse, fromstring
doc = fromstring('<a hr...