lxml

When parsing HTML, why do I need item.text sometimes and item.text_content() other times?

Still learning lxml. I discovered that sometimes I cannot get the text of an item from a tree using item.text, but if I use item.text_content() I am good to go. I do not see why yet. Any hints would be appreciated. Okay, I am not sure exactly how to provide an example without making you handle a file: here is some code I wrote ...
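
A minimal sketch of the difference, using a made-up fragment rather than the asker's file (which is not shown). In lxml, .text holds only the text that precedes an element's first child, while text_content() from lxml.html collects all text in the subtree:

```python
from lxml import html

# Hypothetical fragment: the <p> begins with a child element,
# so there is no text directly before its first child.
para = html.fromstring("<p><b>Bold</b> then plain text</p>")

direct_text = para.text          # only the text before the first child
full_text = para.text_content()  # all text inside the element, children included

print(repr(direct_text))
print(repr(full_text))
```

If item.text comes back empty, a likely cause is that the visible text sits inside a child element or after one (in a child's .tail).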

lxml problem: is there a switch to ignore undefined namespace prefixes?

I'm parsing a non-compliant XML file (Sphinx's[1] xmlpipe2 format) and would like the lxml parser to ignore the fact that there are unresolved namespace prefixes. An example of the Sphinx XML: <sphinx:schema> <sphinx:field name="subject"/> <sphinx:field name="content"/> <sphinx:attr name="published" type="timestamp"/...

Multiple tag names in lxml's iterparse?

Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal. Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML...
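
One way to stay within a single pass, sketched under assumptions: the document and tag names below are invented, and the filtering is done in plain Python inside the loop rather than through an iterparse argument. The sample is XML, not the HTML the question alludes to.

```python
import io
from lxml import etree

# Hypothetical input and tag names, for illustration only.
source = io.BytesIO(b"<root><a>1</a><b>2</b><c>3</c><a>4</a></root>")
wanted = {"a", "c"}

found = []
for event, elem in etree.iterparse(source, events=("end",)):
    if elem.tag in wanted:
        found.append((elem.tag, elem.text))
        elem.clear()  # drop the element's content once it has been handled

print(found)
```

The file is still read only once; unwanted elements are simply skipped as their end events arrive.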

lxml bug with descendant::*/td or not?

Hello, I'm using lxml to parse a big table and have run into trouble: >>> winvps[0].getnext().xpath("descendant::*") 118: [<Element td at 3a30180>, <Element a at 3a301b0>, <Element font at 3a301e0>, <Element b at 3a30210>, <Element td at 3a30240>, <Element td at 3a30270>, <Element font at 3a302a0>, <Element td at 3a302d0>, <Element td ...

Can I look at the actual line that was the source of an element parsed from an HTML document using lxml?

I have been having fun manipulating HTML with lxml. Now I want to do some manipulation of the actual file: after finding a particular element that meets my needs, I want to know whether it is possible to retrieve the source of the element. I jumped up and down in my chair after seeing sourceline as an attribute of my element but that did not giv...

Does anyone have an example that uses the element.sourceline attribute from lxml.html?

I hope I asked that correctly. I am trying to figure out what element.sourceline does and whether there is some way I can use its features. I have tried building my elements from the HTML a number of ways, but every time I iterate through my elements and ask for sourceline I always get None. When I tried to use the built-in help I don't ge...
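
A small sketch of how sourceline behaves, assuming a tiny XML string parsed with lxml.etree rather than the asker's HTML (which is not shown). sourceline is an attribute, not a method, and it is only filled in for elements that came out of the parser:

```python
from lxml import etree

# Hypothetical three-line document; <child/> sits on the second line.
doc = etree.fromstring("<root>\n  <child/>\n</root>")
child = doc[0]
parsed_line = child.sourceline  # line number recorded by the parser

# An element created in code was never parsed, so it has no line number.
built = etree.Element("made_in_code")
built_line = built.sourceline

print(parsed_line, built_line)
```

This does not explain every case where None shows up, but elements built programmatically rather than parsed are one way to get it.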

Installing lxml when Codespeak.net is down.

Codespeak.net is down and something, somewhere in my buildout wants to easy_install lxml from it, despite me bootstrapping with pip, having it installed already and removing it from my buildout files. How else can I get round this? ...

How to prevent XMLSerializer.serializeToString() from re-ordering attributes?

I'm using jQuery to load arbitrary XML strings (fragments of a larger document) into the browser DOM and manipulate them, then using XMLSerializer to load them back to strings and send them back to the server, where they are processed (by python and lxml) and re-integrated into a full XML document. The XML starts and ends in a git repos...

Parsing HTML with lxml

I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me. Here is the page I am trying to parse: http://bit.ly/bf1T12. I need to get ...

Preserving subelement namespace serialization with lxml

I have a few different XML documents that I'm trying to combine into one using lxml. The problem is that I need the result to preserve the namespaces on each of the sub-documents' root nodes. lxml seems to want to push any namespace declarations used more than once to the root of the new document, which breaks in my application (it is ...

Accessing first element of output in lxml.html

With lxml.html, how do I access single elements without using a for loop? This is the HTML: <tr class="headlineRow"> <td> <span class="headline">This is some awesome text</span> </td> </tr> For example, this will fail with IndexError: for row in doc.cssselect('tr.headlineRow'): headline = row.cssselect('td span.headlin...
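
A hedged sketch of one way to take the first match without hitting IndexError. It swaps in xpath() for cssselect() so it runs without the separate cssselect package, and first_or_none is a hypothetical helper, not part of lxml. The surrounding table element is added here so the fragment parses on its own.

```python
from lxml import html

doc = html.fromstring(
    '<table><tr class="headlineRow"><td>'
    '<span class="headline">This is some awesome text</span>'
    '</td></tr></table>'
)

def first_or_none(matches):
    """Return the first item of a result list, or None if it is empty."""
    return matches[0] if matches else None

headlines = []
for row in doc.xpath('.//tr[@class="headlineRow"]'):
    span = first_or_none(row.xpath('.//span[@class="headline"]'))
    if span is not None:
        headlines.append(span.text)

# A selector that matches nothing yields None instead of raising IndexError.
missing = first_or_none(doc.xpath('.//span[@class="no-such-class"]'))

print(headlines, missing)
```

Both xpath() and cssselect() return lists, so indexing with [0] fails whenever a row has no matching element; checking for an empty list first avoids that.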

How do I parse a result page with a form using lxml?

I am trying to parse a secondary page with a form. I use the example source code from this link: http://blog.ianbicking.org/2007/09/24/lxmlhtml/ In my test I use this URL: http://www.infofer.ro/ As in the example, I use these values: >>> pprint(form.form_values()) [('cboData', '8/30/2010'), ('txtPlecare', 'Bucuresti Nord'), ('txtSosire', 'Const...

lxml library in PHP?

Hi, has anyone found a class for lxml in PHP? I have no idea about Python. If anyone has found a class, library or tutorials, please share them with me. Thanks, Nithish ...

Problems installing lxml on a Mac, it installs but module not found

The code from lxml import etree produces the error ImportError: No module named lxml. Running sudo easy_install lxml results in: lxml 2.2.7 is already the active version in easy-install.pth. Removing lxml-2.2.7-py2.5-macosx-10.3-i386.egg from site-packages and rerunning sudo easy_install lxml results in: Adding lxml 2.2.7 to ea...

Parsing XML with Python

Hi, I have an XML file which I want to parse. It looks something like this: <?xml version="1.0" encoding="utf-8"?> <SHOP xmlns="http://www.w3.org/1999/xhtml" xmlns:php="http://php.net/xsl"> <SHOPITEM> <ID>2332</ID> ... </SHOPITEM> <SHOPITEM> <ID>4433</ID> ... </SHOPITEM> </SHOP> My parsin...
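
The parsing code is cut off above, so the following is only a guess at the likely stumbling block: the default namespace on SHOP. Under that assumption, a sketch of namespace-aware lookups, where the prefix x is an arbitrary label chosen for the example:

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="utf-8"?>
<SHOP xmlns="http://www.w3.org/1999/xhtml" xmlns:php="http://php.net/xsl">
  <SHOPITEM><ID>2332</ID></SHOPITEM>
  <SHOPITEM><ID>4433</ID></SHOPITEM>
</SHOP>"""

root = etree.fromstring(xml)

# Every element inherits the default namespace, so lookups must name it.
ns = {"x": "http://www.w3.org/1999/xhtml"}  # "x" is an arbitrary prefix
ids = [el.text for el in root.findall("x:SHOPITEM/x:ID", namespaces=ns)]

# A lookup without the namespace finds nothing.
unqualified = root.findall("SHOPITEM")

print(ids, unqualified)
```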

Classify a table in lxml

I am working with a large set of HTML documents. One of my tasks is to extract all text from the documents. I have gotten pretty far, but now I am stumped because of the use of tables as containers / formatting structures for information that is not numeric in nature. My goal is to ignore - leave behind - not extract the 'table' if it i...

How to access comments using lxml

I am trying to remove comments from a list of elements that were obtained by using lxml. The best I have been able to do is: no_comments=[element for element in element_list if 'HtmlComment' not in str(type(element))] I am wondering if there is a more direct way? I am going to add something based on Matthew's answer - he got me almost t...
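
A possibly more direct test, sketched on a made-up XML snippet: in lxml, a comment node's tag is the etree.Comment factory itself, so an identity check avoids inspecting type names as strings. This is not necessarily what Matthew's answer proposed, since that answer is not shown here.

```python
from lxml import etree

# Hypothetical document with one comment between two elements.
root = etree.fromstring("<root><a/><!-- a note --><b/></root>")
element_list = list(root)

# Keep everything whose tag is not the Comment factory.
no_comments = [el for el in element_list if el.tag is not etree.Comment]
tags = [el.tag for el in no_comments]

print(len(element_list), tags)
```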

Python lxml - Append new data to an existing XML

I am new to Python/lxml. After reading the lxml site and Dive Into Python, I could not find the solution to my n00b troubles. I have the below XML sample: --------------- <addressbook> <person> <name>Eric Idle</name> <phone type='fix'>999-999-999</phone> <phone type='mobile'>555-555-555</phone> <address...
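
A rough sketch of appending a record with etree.SubElement, assuming the goal is to add another person to the addressbook. The sample is shortened to the parts visible above, and the new person's name and number are invented for the example.

```python
from lxml import etree

# Shortened version of the sample; the real file has more fields.
addressbook = etree.fromstring(
    "<addressbook><person><name>Eric Idle</name>"
    "<phone type='fix'>999-999-999</phone></person></addressbook>"
)

# Build the new <person> directly under the root element.
person = etree.SubElement(addressbook, "person")
etree.SubElement(person, "name").text = "John Cleese"     # hypothetical data
phone = etree.SubElement(person, "phone", type="mobile")  # attribute via keyword
phone.text = "000-000-000"                                # hypothetical data

result = etree.tostring(addressbook, pretty_print=True).decode()
print(result)
```

For a file on disk, the same steps would apply to the root returned by etree.parse(path).getroot(), followed by writing the tree back out.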

Getting non-contiguous text with lxml / ElementTree

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree: <div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div> If I already have the div element as mydiv, then mydiv.text returns just "text1". Using itertext() seems problematic or cumbersome at best since it walks the entire tr...
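
A short sketch of the .tail attribute, which is where lxml and ElementTree store text that follows an element's closing tag. It uses the div from the question, parsed here with lxml.etree since the snippet is well-formed:

```python
from lxml import etree

mydiv = etree.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# "text2" follows the first <span>, so it lives in that span's tail.
first_span = mydiv[0]
text2 = first_span.tail

# All text directly inside the div, without descending into the spans.
direct_text = [mydiv.text] + [child.tail for child in mydiv]

print(text2, direct_text)
```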

How to set a namespace prefix in an attribute value using lxml?

I'm trying to create an XML Schema using lxml. To begin with, something like this: <xs:schema xmlns="http://www.goo.com" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="http://www.goo.com"> <xs:element type="xs:string" name="name"/> <xs:element type="xs:positiveInteger" name="age...
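
One possible approach, sketched from the part of the target document that is visible: declare the prefixes through nsmap on the root element, then write the attribute value as an ordinary string such as "xs:string". lxml treats attribute values as plain text, so the prefix in the value only resolves if it matches a prefix declared in nsmap.

```python
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"
nsmap = {None: "http://www.goo.com", "xs": XS}  # None is the default namespace

schema = etree.Element(
    "{%s}schema" % XS,
    nsmap=nsmap,
    elementFormDefault="qualified",
    targetNamespace="http://www.goo.com",
)

# The "xs:" inside the type value is just text; it relies on the nsmap above.
etree.SubElement(schema, "{%s}element" % XS, type="xs:string", name="name")
etree.SubElement(schema, "{%s}element" % XS, type="xs:positiveInteger", name="age")

serialized = etree.tostring(schema, pretty_print=True).decode()
print(serialized)
```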