lxml

html5lib/lxml examples for BeautifulSoup users?

I'm trying to wean myself from BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators. By looking at the docs for html5lib, I came up with this for a test program: import cStringIO f = cStringIO.S...

Python lxml and stdin

I have an XML file, books.xml (http://msdn.microsoft.com/en-us/library/ms762271(VS.85).aspx). I would like to cat books.xml and get all book ids and the genre for each book id, similar to: cat books.xml | python reader.py Any tips or help would be appreciated. Thanks. ...
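
One possible shape for such a reader.py, as a minimal sketch rather than a definitive answer. It assumes each `<book>` element carries an `id` attribute and a `<genre>` child; the sample bytes below are a hypothetical stand-in for books.xml. The import falls back to the standard library's ElementTree, which offers the same calls used here, when lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for books.xml; the real file has more fields per book.
SAMPLE = b"""<?xml version="1.0"?>
<catalog>
  <book id="bk101"><genre>Computer</genre></book>
  <book id="bk102"><genre>Fantasy</genre></book>
</catalog>"""


def book_genres(stream):
    """Return (id, genre) pairs for every <book> element read from stream."""
    tree = etree.parse(stream)
    return [(book.get("id"), book.findtext("genre")) for book in tree.iter("book")]


# In-memory demo; a binary file object such as sys.stdin.buffer works the same way.
for book_id, genre in book_genres(io.BytesIO(SAMPLE)):
    print(book_id, genre)
```

To use it in a pipe such as `cat books.xml | python reader.py`, the script would pass `sys.stdin.buffer` to `book_genres` instead of the in-memory sample.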

Memory consumption in Cherrypy

I am using CherryPy in a RESTful web service, and the server returns XML as a result (lxml is used to create the XML). Some of those XML documents are quite large. I have noticed that memory is not released after such a request (one that returns a large XML document) has been processed. So I have isolated the problem and created this very short dummy examp...

Find all tags with a specific attribute value

How can I iterate over all tags which have a specific attribute with a specific value? For instance, let's say we need the data1, data2 etc... only. <html> <body> <invalid html here/> <dont care> ... </dont care> <invalid html here too/> <interesting attrib1="naah, it is not this"> ... </interesting t...
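
A sketch of one way to do this with a path expression. The attribute name `attrib1` comes from the excerpt; the value `wanted`, the helper name and the sample document are hypothetical. The sample is well-formed XML so that the fallback import works; for invalid HTML like the excerpt's, the input would presumably have to go through lxml's `etree.HTMLParser()` first.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree supports the same path subset
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed sample document.
SAMPLE = """<root>
  <interesting attrib1="naah, it is not this">skip me</interesting>
  <interesting attrib1="wanted">data1</interesting>
  <other attrib1="wanted">data2</other>
</root>"""


def elements_with_attribute(root, name, value):
    """Return all descendants of root whose attribute `name` equals `value`.

    The value is pasted into the expression, so it must not contain
    a single quote.
    """
    return root.findall(".//*[@{}='{}']".format(name, value))


root = etree.fromstring(SAMPLE)
matches = elements_with_attribute(root, "attrib1", "wanted")
print([el.text for el in matches])
```

The `*` matches any tag name; replacing it with a tag such as `interesting` would restrict the search to that element type.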

Character encoding is violated

I am trying to parse a file encoded in UTF-8. Every operation works except writing to a file (or at least I think so). A minimal working example follows: from lxml import etree parser = etree.HTMLParser() tree = etree.parse('example.txt', parser) tree.write('aaaaaaaaaaaaaaaaa.html') example.txt: <html> <body> <invalid ...

simple python lxml CRUD ?

I have been looking for a while for a Python module/API that does something I believe is quite simple: read an XML file and add/edit/remove entries. So far I've found several snippets that interface with complicated object-oriented databases, but nothing as dead simple as: xml = etree.parse ('file.xml') xml.add(xpath, new_node(attrs)) xml.remo...
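
The element API already covers these operations, though not through `add`/`remove` methods that take an XPath as in the pseudocode above. A minimal sketch, assuming a flat list of hypothetical `<entry>` elements keyed by an `id` attribute:

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree

# Hypothetical document; a real script would start from etree.parse('file.xml').
SAMPLE = "<entries><entry id='1'>old</entry><entry id='2'>stale</entry></entries>"
root = etree.fromstring(SAMPLE)

# Create: append a new child element with attributes and text.
new_entry = etree.SubElement(root, "entry", id="3")
new_entry.text = "new"

# Update: locate an element by attribute, then change its text and attributes.
first = root.find("entry[@id='1']")
first.text = "edited"
first.set("status", "changed")

# Delete: removal always goes through the parent element.
stale = root.find("entry[@id='2']")
root.remove(stale)

print(etree.tostring(root, encoding="unicode"))
```

Writing the result back is then a matter of `etree.ElementTree(root).write(...)` with the desired file name.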

How to install python-lxml on SLES 11, 64 bit?

For a customer I have to install a Django web server on SUSE Linux Enterprise Server 11, 64 bit (short: SLES 11). When I add repositories from http://software.opensuse.org I can install python-lxml: sudo zypper install python-lxml The result is that the site-packages are installed in /usr/lib/python2.6/site-packages. However when I tr...

Python xml etree DTD from a StringIO source?

I'm adapting the following code (created via advice in this question), which takes an XML file and its DTD and converts them to a different format. For this problem only the loading section is important: xmldoc = open(filename) parser = etree.XMLParser(dtd_validation=True, load_dtd=True) tree = etree.parse(xmldoc, parser) This wo...

Which XML style is better when handling it with Python's ElementTree?

I'd like to store some relatively simple stuff in XML in a cascading manner. The idea is that a build can have a number of parameter sets, and the Python script creates the necessary build artifacts (*.h etc.) by reading these sets; if two sets have the same parameter, the latter one replaces the former. There are (at least) two diffe...

Validating XML with DTD fails to import entity using lxml

I have a tool producing NewsML-type XML files and I want to validate them after producing the files. I'm receiving an error: Attempt to load network entity http://www.w3.org/TR/ruby/xhtml-ruby-1.mod The Python call is: parser = etree.XMLParser(load_dtd=True, dtd_validation=True) treeObject = etree.parse(f, parser) First, I'm not sure...

lxml, missing doctype when serialized

In [1]: from lxml import etree I've got an HTML document: In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser()) Its doctype is parsed correctly: In [3]: root.getroottree().docinfo.doctype Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">' But when serializ...

lxml cleaner with a custom tag?

I want to use lxml cleaner to get rid of all HTML, but then use a regex to autolink something: [ABC] -> <a href="bah bah bah">ABC</a> What is the right way to handle this without XSS and such? ...

Python lxml.html linebreaks?

I'm using lxml.html.cleaner to clean HTML from an input text. How can I change \n to <br /> in lxml.html? ...

path to element with conditions on parent(s) attributes using xpath,lxml,python

Hello, I am working on a project using lxml. Here is a sample XML: <PatientsTree> <Patient PatientID="SKU065427"> <Study StudyInstanceUID="25.2.9.2.1107.5.1.4.49339.30000006050107501192100000001"> <Series SeriesInstanceUID="2.16.840.1.113669.1919.1176798690"/> <Series SeriesInstanceUID="2.16.840.1.113669.1919.117708...

Remove all html in python?

Is there a way to remove/escape HTML tags using lxml.html and not BeautifulSoup, which has some XSS issues? I tried using cleaner, but I want to remove all HTML. ...

lxml memory problem

Hi, I'm trying to parse large XML files (>3GB) like this: context = lxml.etree.iterparse(path) for action,el in self.context: # do sth. with el With iterparse I thought the data is not completely loaded into RAM, but according to this article I'm wrong: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (see Listing 4) ...
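
A commonly used pattern for keeping `iterparse` memory flat is to clear each element once it has been handled. The following is a sketch under stated assumptions, not a definitive fix: the helper name, the `row` tag and the sample bytes are hypothetical, and the sibling clean-up step only runs under lxml, since `getprevious` and `getparent` do not exist in the standard library fallback.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree is an acceptable stand-in
    import xml.etree.ElementTree as etree


def iter_text(source, tag):
    """Yield the text of each <tag> element while discarding handled elements."""
    for _event, el in etree.iterparse(source, events=("end",)):
        if el.tag != tag:
            continue
        yield el.text
        # Drop the children, attributes and text of the finished element.
        el.clear()
        # lxml only: detach already-processed siblings from the parent,
        # otherwise the emptied elements still accumulate under the root.
        if hasattr(el, "getprevious"):
            parent = el.getparent()
            if parent is not None:
                while el.getprevious() is not None:
                    del parent[0]


# Hypothetical small input; a path to a multi-gigabyte file can be passed instead.
SAMPLE = b"<log><row>a</row><row>b</row><row>c</row></log>"
print(list(iter_text(io.BytesIO(SAMPLE), "row")))
```

Anything needed from an element has to be read before `clear()` is called, which is why the generator yields first and cleans up afterwards.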

How to get raw XML back from lxml?

I'm using the following code to locate a div: parser = etree.HTMLParser() tree = etree.parse(StringIO(page), parser) div = tree.xpath("//div[@class='content']")[0] My only problem is that, after doing this, I do not want to rely on lxml to extract the contents of said div: I just want to get back the raw XML the div contains. Is this ...

BeautifulSoup is too slow. Can lxml do this?

I've got the following BeautifulSoup code, a bit simplified. soup = BeautifulSoup(html) for item in soup.findAll('div',id=compile('^result_')): q = item.find('a',{'class':'title'}) if q: ... q = item.find('div',{'class':['one','two']}) if q: ... I profiled it, and it's quite slow. I want to try lxml instead but it seem...

Using Python lxml.html how can I find images within link tags?

Hi there. I am using lxml.html to parse some HTML to get links; however, when it hits a link which contains an image it just returns blank. What I'd really like is to be able to detect if it's an image, and then try to return the image alt text. So it looks like this... from lxml.html import parse, fromstring doc = fromstring('<a hr...