Python: Import Data from Open Office calc with lxml
How can I import data, for example from cell A1? When I use etree.parse() I get an error, because I don't have an XML file. ...
There must be an easier way to do this. I need some text from a large number of html documents. In my tests the most reliable way to find it is to look for specific word in the text_content of the div elements. If I want to inspect a specific element above the one that has my text I have been enumerating my list of div elements and us...
I've come to grips with the fact that ElementTree isn't going to do what I want it to do. I've checked out the documentation for lxml, and it appears that it will serve my purposes. To get lxml, I need to get easy_install. So I downloaded it from here, and put it in /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-pac...
I have the following function def parseTitle(self, post): """ Returns title string with spaces replaced by dots """ return post.xpath('h2')[0].text.replace('.', ' ') I would like to see the content of post. I have tried everything I can think of. How can I properly debug the content? This is a website of movi...
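A common way to look inside an element is to serialize it back to markup with etree.tostring(). The sketch below assumes post is an ordinary element; the sample markup and the helper name dump are invented for the illustration and are not taken from the question.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for one `post` element from the scraped page.
post = etree.fromstring(
    "<div class='post'><h2>Some.Movie.Title</h2><p>plot summary</p></div>"
)


def dump(element):
    """Return the element and its whole subtree serialized as text."""
    return etree.tostring(element, encoding="unicode")


print(dump(post))                      # full markup of the element
print([child.tag for child in post])   # quick look at the direct children
print(post.find("h2").text)            # the text the function works on
```

With lxml itself, passing pretty_print=True to etree.tostring() should give indented output, which is easier to read for large elements.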
I am working with html documents and ripping out tables to parse them if they turn out to be the correct tables. I am happy with the results - my extraction process successfully maps row labels and column headings in over 95% of the cases and in the cases it does not we can identify the problems and use other approaches. In my scanni...
I am trying to get some content in html documents. Some of the documents have a table of contents that very nicely indicates where in the document the content I want to strip out is located. That is either the value or text_content of the tag are easily identifiable and point to what I need. For example I might have two anchor tags in...
I am trying to rip some text out of a large number of html documents (numbers in the hundreds of thousands). The documents are really forms but they are prepared by a very large group of different organizations so there is significant variation in how they create the document. For example, the documents are divided into chapters. I mi...
I am using XML as the backend for my application, and lxml is used to parse the XML. How can I encrypt this XML file to make sure that the data is protected? Thanks in advance. ...
I want to find all stylesheet definitions in a XHTML file with lxml.etree.findall. This could be as simple as elems = tree.findall('link[@rel="stylesheet"]') + tree.findall('style') But the problem with CSS style definitions is that the order matters, e.g. <link rel="stylesheet" type="text/css" href="/media/css/first.css" /> <style>b...
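Since two separate findall() calls lose the relative order, one option is a single pass over the tree, because iter() visits elements in document order. This is only a sketch: the function name stylesheet_elements and the sample document are made up here, and it assumes the tags carry no XML namespace (a real XHTML file would usually need the namespace in the tag names).

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module behaves the same for this example
    import xml.etree.ElementTree as etree


def stylesheet_elements(root):
    """Return <link rel="stylesheet"> and <style> elements in document order."""
    found = []
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag == "style":
            found.append(element)
        elif element.tag == "link" and element.get("rel") == "stylesheet":
            found.append(element)
    return found


# Hypothetical sample document, not the one from the question.
sample = etree.fromstring(
    "<html><head>"
    '<link rel="stylesheet" type="text/css" href="/media/css/first.css" />'
    "<style>p { margin: 0 }</style>"
    '<link rel="icon" href="/favicon.ico" />'
    '<link rel="stylesheet" type="text/css" href="/media/css/second.css" />'
    "</head><body /></html>"
)

for element in stylesheet_elements(sample):
    print(element.tag, element.get("href"))
```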
I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml. Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of the Python built-in Unicode class. In es...
Hi folks, I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately I need the HTML code. How would I go about transforming the escaped HTML into proper markup? Help would be very ...
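If the markup arrives as entity references (&lt;b&gt; instead of <b>), one option on Python 3 is the standard library's html.unescape(), which reverses one level of escaping. The sample string below is invented for the illustration, and this assumes the text was escaped exactly once.

```python
import html

# Hypothetical escaped payload, standing in for a field from the API response.
escaped = "&lt;b&gt;bold&lt;/b&gt; &amp;amp; more"

# unescape() converts entity references back into the characters they stand for.
markup = html.unescape(escaped)
print(markup)
```

Only one level is undone per call, so an ampersand that was escaped twice comes back as &amp; and stays valid inside the recovered markup.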
There is one thing I really love about LXML, and that is the E builder. I love that I can throw XML together like this: message = E.Person( E.Name( E.First("Jack"), E.Last("Ripper") ), E.PhoneNumber("555-555-5555") ) To make: <Person> <Name> <First>Jack</First> <Last>Ripper</Last> </Name> <PhoneNumber>555-555-5...
Here's the code I have: from cStringIO import StringIO from lxml import etree xml = StringIO('''<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE root [ <!ENTITY test "This is a test"> ]> <root> <sub>&test;</sub> </root>''') d1 = etree.parse(xml) print '%r' % d1.find('/sub').text parser = etree.XMLParser(resolve_entities=False) d2 =...
I have an xml document that looks like this. <foo> <bar type="artist"/> Bob Marley </bar> <bar type="artist"/> Peter Tosh </bar> <bar type="artist"/> Marlon Wayans </bar> </foo> <foo> <bar type="artist"/> Bob Marley </bar> <bar type="artist"/> Peter Tosh </bar> <bar type="artist"/> Marlon Wayans </bar> </foo> <fo...
I am trying to get a list of elements with a specific xsd type with lxml 2.x and I can't figure out how to traverse the xsd for specific types. Example of schema: <xsd:element name="ServerOwner" type="srvrs:string90" minOccurs="0"> <xsd:element name="HostName" type="srvrs:string35" minOccurs="0"> Example xml data: <srvrs:ServerOwner...
Hi all. I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table: <html> <table> <tr><td>Header</td></tr> <tr><td>Want This</...
New to this library (no more familiar with BeautifulSoup either, sadly), trying to do something very simple (search by inline style): <td style="padding: 20px">blah blah </td> I just want to select all tds where style="padding: 20px", but I can't seem to figure it out. All the examples show how to select td, such as: for col in page....
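An attribute predicate in the path expression can do this kind of selection. The sketch below uses a small made-up table and findall(); it assumes the style attribute matches the string exactly, including the space after the colon.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module accepts the same path expression
    import xml.etree.ElementTree as etree

# Hypothetical sample markup built around the cell from the question.
page = etree.fromstring(
    "<table><tr>"
    "<td style='padding: 20px'>blah blah </td>"
    "<td style='padding: 5px'>other</td>"
    "<td>plain</td>"
    "</tr></table>"
)

# [@style='...'] keeps only the td elements whose style attribute is equal
# to the given string.
cells = page.findall(".//td[@style='padding: 20px']")

for cell in cells:
    print(cell.text)
```

The match is an exact string comparison, so style="padding:20px" (no space) would not be found. With lxml's xpath() method, a looser test such as contains(@style, 'padding') is also possible.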
Hi all I am having some problems that I think can be attributed to xpath problems. I am using the html module from the lxml package to try and get at some data. I am providing the most simplified situation below, but keep in mind the html I am working with is much uglier. <table> <tr> <td> <table> <tr><td></t...
I want to parse HTML with lxml using XPath expressions. My problem is matching on the contents of a tag. For example, given the <a href="http://something">Example</a> element, I can match the href attribute using .//a[@href='http://something'] but given the expression .//a[.='Example'] or even .//a[contains(.,'Examp...
Hi, I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative: from lxml import etree webfile = urllib2.urlopen(url) root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True)) txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8)...
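A runnable variant of that pseudocode might look like the sketch below. It rests on three observations: etree.parse() expects a file name or file object, so raw bytes go through etree.fromstring() instead; xpath() returns a list, so one element has to be picked before serializing; and the encoding is passed as a string. The function name body_as_utf8_xml and the sample page are invented here, and the download step is left out so the example runs offline.

```python
from lxml import etree


def body_as_utf8_xml(html_bytes):
    """Parse lenient HTML and return the <body> subtree as UTF-8 encoded XML."""
    parser = etree.HTMLParser(recover=True)
    # fromstring() takes the document content; parse() would want a file.
    root = etree.fromstring(html_bytes, parser)
    bodies = root.xpath("/html/body")  # xpath() returns a list of matches
    if not bodies:
        return b""
    # Serialize a single element; the encoding name is given as a string.
    return etree.tostring(bodies[0], encoding="utf-8")


# Hypothetical page content; real bytes could come from an HTTP response.
sample_page = b"<html><body><p>Hello<br>world</body></html>"

print(body_as_utf8_xml(sample_page))
```

When the page contains non-ASCII text and declares no charset, passing an explicit encoding to etree.HTMLParser() is probably the safer choice.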