ansaurus

Question

Answer 1

+2 A:

BeautifulSoup is a nice library that makes it easy to parse HTML and pull data out of it.

What you are trying to do can easily be done with some simple regular expressions. Write a pattern that matches the data you are looking for, then extract just the parts you need.
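As a rough sketch of what that could look like, a non-greedy pattern can pull the text out of each list item. The sample markup and the LI_PATTERN name are made up for illustration and are not taken from the question, and the snippet assumes small, well-formed input:

```python
import re

# Hypothetical sample markup; real pages are rarely this tidy.
html = """
<ul>
  <li>Apples</li>
  <li class="citrus">Oranges</li>
  <li>Pears</li>
</ul>
"""

# Non-greedy match on whatever sits between <li ...> and </li>.
# \b keeps the pattern from matching other tags such as <link>.
# DOTALL lets an item span several lines; IGNORECASE tolerates <LI>.
LI_PATTERN = re.compile(r"<li\b[^>]*>(.*?)</li>", re.DOTALL | re.IGNORECASE)

items = [text.strip() for text in LI_PATTERN.findall(html)]
print(items)  # ['Apples', 'Oranges', 'Pears']
```

This only holds up for flat, predictable markup. Nested lists or unclosed tags will break the pattern, and at that point a real parser is the safer choice.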

anand 2010-07-02 17:13:45

I am aware of BeautifulSoup and, as I have already mentioned, I am not interested in building more logic around it to detect and interpret tables, lists, etc. Please read the question before posting.

demos 2010-07-02 17:17:23

@demos: If you say, "I know about X, so don't tell me about X," and someone replies, "X does what you want," that doesn't mean they didn't read the question. It may mean they're wrong, but that's a different matter entirely. You should give people the benefit of the doubt, especially when they are making an effort to help you.

Marcelo Cantos 2010-07-02 17:27:58

Answer 2

+2 A:

You might consider lxml, which has a powerful HTML processor. A complementary module built on top of lxml, pyquery, might be just what you're looking for.

PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.

Here is a simple example to get the first <ul> item from aol.com:

>>> from pyquery import PyQuery as pq
>>> from urllib.request import urlopen
>>> data = urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print(first_ul)
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&amp;amp;ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
            </ul>

jathanism 2010-07-02 19:05:01

Answer 3

A:

The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.

With that said...

Here's a link to a blog post by someone who wrote a script to convert HTML tables to Python lists. The actual file is located here.
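To give a feel for the 2D-list idea, here is a minimal sketch using only the standard library's html.parser. It is not the linked script; the TableToRows class, the table_to_rows helper, and the sample table are all hypothetical names and data chosen for illustration, and it assumes a single, non-nested table:

```python
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def table_to_rows(html):
    """Return the table cells in `html` as a list of lists of strings."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td><b>12</b></td></tr>
</table>
"""

rows = table_to_rows(sample)
print(rows)  # [['Name', 'Qty'], ['Apples', '3'], ['Pears', '12']]
```

Everything comes back as strings, and rowspan, colspan, and nested tables are ignored. Handling those is exactly the part that depends on the data in the page.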

I've never heard of a standard Python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has already done what you are trying to do.

Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!

tgray 2010-07-02 20:16:03

Complex HTML parsing with Python
