lxml

Python: Import Data from Open Office calc with lxml

How can I import data, for example from the field A1? When I use etree.parse() I get an error, because I don't have an XML file. ...
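
One possible approach, sketched under assumptions: an OpenOffice Calc `.ods` file is a ZIP archive whose cell data lives in a member named `content.xml`, so that member can be extracted and passed to the parser instead of the `.ods` file itself. The helper name `read_first_cell` is hypothetical, and the fallback import to the standard library's ElementTree is only there so the sketch also runs where lxml is not installed.

```python
import zipfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is absent
    import xml.etree.ElementTree as etree

# Namespaces used by OpenDocument spreadsheets.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def read_first_cell(ods_file):
    """Return the text of cell A1 of the first sheet (hypothetical helper).

    `ods_file` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(ods_file) as archive:
        root = etree.fromstring(archive.read("content.xml"))
    # The first cell of the first row in document order is A1.
    cell = root.find(".//table:table-row/table:table-cell", NS)
    if cell is None:
        return None
    # A cell keeps its text in one or more text:p paragraphs.
    return "\n".join("".join(p.itertext()) for p in cell.findall("text:p", NS))
```

Cells other than A1 need more care, because empty runs of cells and rows are compressed with repeat-count attributes rather than written out one by one.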

Is there a way to specify a fixed (or variable) number of elements for lxml in Python

There must be an easier way to do this. I need some text from a large number of html documents. In my tests the most reliable way to find it is to look for a specific word in the text_content of the div elements. If I want to inspect a specific element above the one that has my text I have been enumerating my list of div elements and us...

Installing easy_install... to get to installing lxml

I've come to grips with the fact that ElementTree isn't going to do what I want it to do. I've checked out the documentation for lxml, and it appears that it will serve my purposes. To get lxml, I need to get easy_install. So I downloaded it from here, and put it in /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-pac...

Lxml or Xpath content print

I have the following function: def parseTitle(self, post): """ Returns title string with spaces replaced by dots """ return post.xpath('h2')[0].text.replace('.', ' ') I would like to see the content of post. I have tried everything I can think of. How can I properly debug the content? This is a website of movi...

Is there a better way to parse html tables than lxml

I am working with html documents and ripping out tables to parse them if they turn out to be the correct tables. I am happy with the results - my extraction process successfully maps row labels and column headings in over 95% of the cases and in the cases it does not we can identify the problems and use other approaches. In my scanni...

Is the content between anchor tags (a) in html seen as a branch in lxml?

I am trying to get some content in html documents. Some of the documents have a table of contents that very nicely indicates where in the document the content I want to strip out is located. That is, either the value or the text_content of the tag is easily identifiable and points to what I need. For example, I might have two anchor tags in...

Best way to get back to using the power of lxml after having to use a regex to find something in an html document

I am trying to rip some text out of a large number of html documents (numbers in the hundreds of thousands). The documents are really forms but they are prepared by a very large group of different organizations so there is significant variation in how they create the document. For example, the documents are divided into chapters. I mi...

Encrypting XML database in python

I am using XML as the backend for my application, and lxml is used to parse the XML. How can I encrypt this XML file to make sure that the data is protected? Thanks in advance. ...

etree.findall: 'OR'-lookup?

I want to find all stylesheet definitions in a XHTML file with lxml.etree.findall. This could be as simple as elems = tree.findall('link[@rel="stylesheet"]') + tree.findall('style') But the problem with CSS style definitions is that the order matters, e.g. <link rel="stylesheet" type="text/css" href="/media/css/first.css" /> <style>b...
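
One way to keep document order, offered as a sketch rather than the accepted answer: walk the tree once and keep the matching elements, instead of concatenating two `findall()` results. With lxml, an XPath union such as `tree.xpath('//link[@rel="stylesheet"] | //style')` should give the same ordering; the loop below sticks to calls shared by lxml and the standard library's ElementTree. The function name `stylesheet_elements` and the sample document are made up for illustration, and the namespace handling that real XHTML needs is left out.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is absent
    import xml.etree.ElementTree as etree


def stylesheet_elements(root):
    """Collect <link rel="stylesheet"> and <style> elements in document order."""
    found = []
    for el in root.iter():  # iter() visits elements in document order
        if el.tag == "style":
            found.append(el)
        elif el.tag == "link" and el.get("rel") == "stylesheet":
            found.append(el)
    return found


# Invented sample document, without the XHTML namespace for brevity.
doc = etree.fromstring(
    "<html><head>"
    '<link rel="stylesheet" type="text/css" href="/media/css/first.css" />'
    "<style>body { color: black; }</style>"
    '<link rel="stylesheet" type="text/css" href="/media/css/second.css" />'
    "</head><body /></html>"
)
ordered = stylesheet_elements(doc)
```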

Should I strip the XML declaration from suds output before parsing with lxml?

I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml. Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of the Python built-in Unicode class. In es...

From escaped html -> to regular html? - Python

Hi folks, I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately I need the HTML code. How would I go about transforming the escaped HTML into proper markup? Help would be very ...
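
If the escaping consists of ordinary character references such as `&lt;` and `&gt;`, one option is to unescape the string with the standard library before passing it on. This sketch uses Python 3's `html.unescape`; the sample string is invented, and whether the real API responses are escaped once or several times is an assumption to check.

```python
import html

# Invented sample: markup whose angle brackets arrived as character references.
escaped = "&lt;p&gt;Hello &amp;amp; &lt;b&gt;welcome&lt;/b&gt;&lt;/p&gt;"

# unescape() makes a single pass, so a doubly escaped "&amp;amp;" becomes "&amp;".
markup = html.unescape(escaped)
```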

LXML E builder for java?

There is one thing I really love about LXML, and that's the E builder. I love that I can throw XML together like this: message = E.Person( E.Name( E.First("jack"), E.Last("Ripper") ), E.PhoneNumber("555-555-5555") ) To make: <Person> <Name> <First>Jack</First> <Last>Ripper</Last> </Name> <PhoneNumber>555-555-5...

Entity references and lxml

Here's the code I have: from cStringIO import StringIO from lxml import etree xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE root [ <!ENTITY test "This is a test"> ]> <root> <sub>&test;</sub> </root>''') d1 = etree.parse(xml) print '%r' % d1.find('/sub').text parser = etree.XMLParser(resolve_entities=False) d2 =...

Matching first set of elements with xpath...

I have an xml document that looks like this. <foo> <bar type="artist"/> Bob Marley </bar> <bar type="artist"/> Peter Tosh </bar> <bar type="artist"/> Marlon Wayans </bar> </foo> <foo> <bar type="artist"/> Bob Marley </bar> <bar type="artist"/> Peter Tosh </bar> <bar type="artist"/> Marlon Wayans </bar> </foo> <fo...

Find elements based on xsd type with lxml

I am trying to get a list of elements with a specific xsd type with lxml 2.x and I can't figure out how to traverse the xsd for specific types. Example of schema: <xsd:element name="ServerOwner" type="srvrs:string90" minOccurs="0"> <xsd:element name="HostName" type="srvrs:string35" minOccurs="0"> Example xml data: <srvrs:ServerOwner...

Parse html and find data in the html

Hi all. I am trying to use html5lib to parse an html page into something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table: <html> <table> <tr><td>Header</td></tr> <tr><td>Want This</...

Finding inline style with lxml.cssselector

New to this library (no more familiar with BeautifulSoup either, sadly), trying to do something very simple (search by inline style): <td style="padding: 20px">blah blah </td> I just want to select all tds where style="padding: 20px", but I can't seem to figure it out. All the examples show how to select td, such as: for col in page....
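
One way to approach this, as a sketch only: match on the attribute value with a path predicate. The `findall()` call below works in both lxml and the standard library's ElementTree; with lxml, `xpath('//td[@style="padding: 20px"]')` is the XPath equivalent. The sample row is invented, and the comparison is an exact string match, so `padding:20px` without the space would not be found.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is absent
    import xml.etree.ElementTree as etree

# Invented sample row; real pages would usually be parsed with an HTML parser.
row = etree.fromstring(
    "<tr>"
    '<td style="padding: 20px">blah blah </td>'
    '<td style="padding: 5px">other</td>'
    '<td style="padding: 20px">more</td>'
    "</tr>"
)

# Select only the cells whose style attribute equals the given string.
padded = row.findall(".//td[@style='padding: 20px']")
texts = [td.text for td in padded]
```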

extract specific element from nested elements using lxml html

Hi all I am having some problems that I think can be attributed to xpath problems. I am using the html module from the lxml package to try and get at some data. I am providing the most simplified situation below, but keep in mind the html I am working with is much uglier. <table> <tr> <td> <table> <tr><td></t...

How do I match contents of an element in XPath (lxml)?

I want to parse HTML with lxml using XPath expressions. My problem is matching for the contents of a tag: For example given the <a href="http://something">Example</a> element I can match the href attribute using .//a[@href='http://something'] but given the expression .//a[.='Example'] or even .//a[contains(.,'Examp...

Encoding in python with lxml - complex solution

Hi, I need to download and parse a webpage with lxml and build UTF-8 xml output. I think a schema in pseudocode is more illustrative: from lxml import etree webfile = urllib2.urlopen(url) root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True)) txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8)...