I am already aware of tag-based HTML parsing in Python using BeautifulSoup, htmllib, etc.

However, I want a powerful engine that can handle complex tasks, such as reading HTML tables and lists, and present the results as simple-to-use objects within code. Does Python have such powerful libraries?

+2  A: 

BeautifulSoup is a nice library that provides a good way to parse HTML, with some handy ways to extract the data easily.

What you are trying to do can easily be done with some simple regular expressions. You can write regular expressions that search for a particular pattern of data and extract the data you need.
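
As a rough illustration of the regular-expression approach, here is a minimal sketch that pulls the text of each list item out of a small snippet. The sample markup, the pattern, and the helper name `extract_list_items` are all made up for this example, and the pattern assumes simple, well-formed, non-nested `<li>` elements; it is likely to break on messier real-world pages.

```python
import re

# Hypothetical sample markup; real pages are rarely this regular.
html = """
<ul>
  <li>Apples</li>
  <li class="fruit">Pears</li>
</ul>
"""

# Match each <li ...>...</li> pair non-greedily and capture the inner text.
LI_PATTERN = re.compile(r"<li\b[^>]*>(.*?)</li>", re.IGNORECASE | re.DOTALL)


def extract_list_items(markup):
    """Return the stripped inner text of every <li> element found."""
    return [text.strip() for text in LI_PATTERN.findall(markup)]


items = extract_list_items(html)
print(items)
```

The non-greedy `(.*?)` keeps each match inside a single `<li>`; nested lists or unclosed tags would need a real parser.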

anand
I am aware of BeautifulSoup and, as I have already mentioned, I am not interested in building more logic around it to do table, list, etc. detection and interpretation. Please read the question before posting.
demos
@demos: If you say, "I know about X, so don't tell me about X," and someone replies, "X does what you want," that doesn't mean they didn't read the question. It may mean they're wrong, but that's a different matter entirely. You should give people the benefit of the doubt, especially when they are making an effort to help you.
Marcelo Cantos
+2  A: 

You might consider lxml, which has a powerful HTML processor. There is also a complementary module built on lxml, called pyquery, that might be just what you're looking for.

PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.

Here is a simple example to get the first <ul> item from aol.com:

>>> from pyquery import PyQuery as pq
>>> from urllib.request import urlopen
>>> data = urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print(first_ul)
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&amp;amp;ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
            </ul>
jathanism
A: 

The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.

With that said...

Here's a link to a blog post by someone who wrote a script to convert HTML tables to Python lists. The actual file is located here.
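
For a sense of what such a conversion can look like, here is a minimal sketch (not the linked script) that uses only the standard library's `html.parser` to turn each table into a 2D list of cell strings. The class name `TableParser` and the sample markup are made up for this example, and the sketch assumes flat, non-nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one 2D list per table seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample markup for illustration.
sample = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample)
parser.close()
print(parser.tables[0])
```

Anything beyond this, such as typed columns, header detection, or nested tables, depends on the data in the page, as noted above.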

I've never heard of a standard Python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has already done what you are trying to do.

Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!

tgray