lxml

Python: Extract HTML from an XML file

My XML file looks like this: <strings> <string>Bla <b>One &amp; Two</b> Foo</string> </strings> I want to extract the content of each <string> while maintaining the inner tags. That is, I would like to see the following Python string: u"Bla <b>One & Two</b> Foo". Alternatively, I guess I could settle on u"Bla <b>One & Two</b> ...

lxml Changing Unicode Characters

I am using lxml to read through an xml file and change a few details. However, when running it I find that even if I just use lxml to read the file and then write it out again, as below: fil='iTunes Music Library.XML' tre=etree.parse(fil) tre.write('temp.xml') I find Queensrÿche converted to Queensr&#255;che. Anyone know how to fix th...
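
A minimal sketch of one likely cause, assuming the default output encoding is the issue: `write()` defaults to ASCII, so non-ASCII characters come out as numeric character references. Passing an explicit `encoding` keeps them as real characters. The `<artist>` element and the in-memory buffers are stand-ins for the real file, and the stdlib fallback import is an assumption so the sketch runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls are used below
    import xml.etree.ElementTree as etree

# "\u00ff" is the letter "ÿ"; the element name is made up for illustration.
tree = etree.ElementTree(etree.fromstring("<artist>Queensr\u00ffche</artist>"))

default_out = io.BytesIO()
tree.write(default_out)  # ASCII output: the character becomes &#255;

utf8_out = io.BytesIO()
tree.write(utf8_out, encoding="utf-8", xml_declaration=True)  # character preserved

print(default_out.getvalue())
print(utf8_out.getvalue())
```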

Python lxml on App Engine?

Can I use Python lxml on Google App Engine (or do I have to use Beautiful Soup)? I have started using Beautiful Soup but it seems slow. I am just starting to play with the idea of "screen scraping" data from other websites to create some sort of "mash-up". ...

lxml[.objectify] documentElement tagName

I'm receiving data packets in XML format, each with a specific documentRoot tag, and I'd like to delegate specialized methods to take care of those packets, based on the root tag name. This worked with xml.dom.minidom, something like this: dom = minidom.parseString(the_data) root = dom.documentElement deleg = getattr(self,'elem_' + str(...
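
A rough sketch of the same delegation with the ElementTree-style API, where the root element's `tag` attribute takes the place of `documentElement.tagName`. The class, method names, and packet contents are hypothetical, and the stdlib fallback import is an assumption so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls are used below
    import xml.etree.ElementTree as etree


class PacketDispatcher:
    """Hypothetical handler that routes packets by root tag name."""

    def dispatch(self, data):
        root = etree.fromstring(data)
        # Look up elem_<tag>; fall back to elem_unknown when no handler exists.
        handler = getattr(self, "elem_" + root.tag, self.elem_unknown)
        return handler(root)

    def elem_login(self, root):
        return "login for " + root.get("user", "?")

    def elem_unknown(self, root):
        return "no handler for " + root.tag


dispatcher = PacketDispatcher()
results = [
    dispatcher.dispatch('<login user="bob"/>'),
    dispatcher.dispatch("<ping/>"),
]
print(results)
```

If the packets use XML namespaces, `root.tag` has the form `{uri}name`, so the namespace part would need stripping before the lookup.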

Combine lxml XSLT pretty_print with strip-space

I'm cleaning up some gross XML, and so I've had pretty_print = True set in the call to etree.tostring() on my lxml output of the XSL transform. However, that left me with a few junk whitespace nodes from the original input, so I added <xsl:strip-space elements="*"/> ...but that completely collapses all whitespace, ignoring pretty prin...

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is f...

insert tags in ElementTree text

I am using the Python ElementTree module to manipulate HTML. I want to emphasize certain words, and my current solution is: for e in tree.getiterator(): for attr in 'text', 'tail': words = (getattr(e, attr) or '').split() change = False for i, word in enumerate(words): word = clean_word.sub('', wo...

Equivalent of Beautiful Soup's renderContents() method in lxml?

Is there an equivalent of Beautiful Soup's tag.renderContents() method in lxml? I've tried using element.text, but that doesn't render child tags, as well as ''.join(etree.tostring(child) for child in element), but that doesn't render child text. The closest I've been able to find is etree.tostring(element), but that renders the opening...

XML parsing with lxml and Python

Please help me to resolve my problem with lxml (I'm a newbie to lxml). How can I get "Comment 1" from the next file: <?xml version="1.0" encoding="windows-1251" standalone="yes" ?> <!--Comment 1--> <a> <!--Comment 2--> </a> ...

Extract links from a webpage using lxml, xpath and python

I've got this XPath query: /html/body//tbody/tr[*]/td[*]/a[@title]/@href It extracts all the links with the title attribute - and gives the href in Firefox's XPath Checker add-on. However, I cannot seem to use it with lxml. from lxml import etree parsedPage = etree.HTML(page) # Create parse tree from valid page. hyperlinks = parsedP...

lxml and loops to create xml rss in python

I have been using lxml to create the XML of an RSS feed, but I am having trouble with the tags and can't really figure out how to add a dynamic number of elements. Given that lxml seems to just have functions as parameters of functions, I can't seem to figure out how to loop for a dynamic number of items without remaking the entire pa...
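
One way this could be approached, as a sketch: build the fixed part of the feed once, then append one `<item>` per entry in an ordinary loop with `SubElement`. The feed data and titles are invented for illustration, and the stdlib fallback import is an assumption so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls are used below
    import xml.etree.ElementTree as etree

# Hypothetical feed entries; in practice these would come from your own data.
items = [
    {"title": "First post", "link": "http://example.com/1"},
    {"title": "Second post", "link": "http://example.com/2"},
]

rss = etree.Element("rss", version="2.0")
channel = etree.SubElement(rss, "channel")
etree.SubElement(channel, "title").text = "Example feed"

# One <item> per entry, however many entries there are.
for entry in items:
    node = etree.SubElement(channel, "item")
    etree.SubElement(node, "title").text = entry["title"]
    etree.SubElement(node, "link").text = entry["link"]

feed_xml = etree.tostring(rss, encoding="unicode")
print(feed_xml)
```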

How to update xml file using lxml and python?

<example> <login> <id>1</id> <username>kites</username> <password>kites</password> </login> </example> How can I update the password using lxml? And how can I add one more record to the same file? Please provide some sample code ...
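
A minimal sketch of both steps, assuming the structure shown in the question: locate the `<password>` element with `find()` and assign its `text`, then append a second `<login>` with `SubElement`. The new password and the second record's values are made up, and the stdlib fallback import is an assumption so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls are used below
    import xml.etree.ElementTree as etree

XML = (
    "<example><login><id>1</id><username>kites</username>"
    "<password>kites</password></login></example>"
)

root = etree.fromstring(XML)

# Update the existing password in place.
root.find("login/password").text = "new-secret"

# Append one more record; the values here are placeholders.
new_login = etree.SubElement(root, "login")
for tag, value in (("id", "2"), ("username", "guest"), ("password", "guest")):
    etree.SubElement(new_login, tag).text = value

updated_xml = etree.tostring(root, encoding="unicode")
print(updated_xml)
```

To change a file on disk, the same edits could be applied to the root of a tree loaded with `etree.parse()` and saved again with the tree's `write()` method.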

XML to store system paths in Python with lxml

Hi, I'm using an XML file to store configurations for a piece of software. One of these configurations would be a system path like set_value = "c:\\test\\3 tests\\test" I can store it by using: setting = etree.SubElement(settings, "setting", name=tmp_set_name, type=set_type, value=set_value) If I use doc.write(output_file, m...

Is it possible to get all context nodes used to evaluate an XPath result?

Is it possible to get all context nodes used to evaluate an XPath result? In the code below: test_xml = """ <r> <a/> <a> <b/> </a> <a> <b/> </a> </r> """ test_root = lxml.etree.fromstring(test_xml) res = test_root.xpath("//following-sibling::*[1]/b") for node in res: print test_root.getroottree().getpath(no...

replacing node text using lxml.objectify while preserving attributes

Using lxml.objectify like so: from lxml import objectify o = objectify.fromstring("<a><b atr='someatr'>oldtext</b></a>") o.b = 'newtext' results in <a><b>newtext</b></a>, losing the node attribute. It seems to be directly replacing the element with a newly created one, rather than simply replacing the text of the element. If I try ...

lxml has essentially nothing

The lxml package for Python seems to be absolutely broken on my system. I am not sure of the problem, as all of the files are in place, it seems. My suspicion is that the problem is in __init__.py, but I don't have enough practice with the system to make an accurate diagnosis or fix the problem. Here is some code that I think will help dia...

HTML parser for GAE

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results. Which pure Python HTML parser have you found performs best? My priority is th...

delete xml node using lxml

Hello friends, <login> <user> <userid>admin</userid> </user> . . . . <user> <userid>admin</userid> </user> </login> This is my XML file. When I use the clear() or del method, it clears all the children and leaves a blank node <user/>. How can I avoid creating this blank node? It causes problems when I use...
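
A sketch of one way to drop the whole node instead of emptying it: call `remove()` on the parent, so no blank `<user/>` is left behind. The sample data (one `admin` and one `guest` user) is invented for illustration, and the stdlib fallback import is an assumption so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls are used below
    import xml.etree.ElementTree as etree

# Hypothetical sample data with two distinct users.
XML = (
    "<login>"
    "<user><userid>admin</userid></user>"
    "<user><userid>guest</userid></user>"
    "</login>"
)

root = etree.fromstring(XML)

# findall() returns a list, so removing while looping over it is safe.
for user in root.findall("user"):
    if user.findtext("userid") == "admin":
        root.remove(user)  # removes the element itself, not just its children

remaining = [u.findtext("userid") for u in root.findall("user")]
print(remaining)
```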

Python: Separating an HTML snippet into paragraphs

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance: ''' <p class="my_class">Hello!</p> <p>What's up?</p> <p style="whatever: whatever;">Goodbye!</p> ''' Should become: ['<p class="my_class">Hello!</p>', '<p>What's up?</p>' '<p style="whatever: w...

Parsing broken XML with lxml.etree.iterparse

Hi, I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file in memory). Unfortunately, the file contains some bad ASCII characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover ...