I have a very large HTML file (several megabytes). I know the data I want is under something like <div class=someName>here</div>

What is a good library to parse through the HTML page so I can loop through elements and grab each someName? I want to do this in either C#, Python or C++.

+1  A: 

Xerces is well documented, supported and tested. (C++)

http://xerces.apache.org/xerces-c/

(Yes, it's an XML parser, but it should do the trick as long as the HTML is well-formed enough to parse as XML.)
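To illustrate what "using an XML parser on HTML" looks like, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in, not Xerces itself, and the helper name divs_with_class is made up for this example. It also shows the catch: markup that is not well-formed XML (for instance an unquoted attribute, as in the question) is rejected outright.

```python
import xml.etree.ElementTree as ET


def divs_with_class(xml_text, class_name):
    """Return the text of every <div> whose class attribute equals class_name.

    Hypothetical helper for illustration; it requires well-formed XML input.
    """
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("div") if el.get("class") == class_name]


# Well-formed input parses fine.
well_formed = '<body><div class="someName">here</div><div>other</div></body>'
print(divs_with_class(well_formed, "someName"))

# An unquoted attribute value is legal HTML but not legal XML.
try:
    divs_with_class("<body><div class=someName>here</div></body>", "someName")
except ET.ParseError as exc:
    print("not well-formed XML:", exc)
```

So an XML parser is only a safe choice when you control the input or know it is XHTML-clean.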

altCognito
+13  A: 

I would use Python and BeautifulSoup for the job. It is very solid at handling this kind of stuff.

For your case, you can use SoupStrainer to make BeautifulSoup only parse DIVs in the document that have the class you want, so it doesn't have to build a parse tree for the whole thing in memory.

For example, say your document looks like this:

<div class="test">Hello World</div>
<div class="hello">Aloha World</div>
<div>Hey There</div>

You can write this:

>>> from bs4 import BeautifulSoup, SoupStrainer
>>> doc = '''
...     <div class="test">Hello World</div>
...     <div class="hello">Aloha World</div>
...     <div>Hey There</div>
... '''
>>> findDivs = SoupStrainer('div', {'class':'hello'})
>>> [tag for tag in BeautifulSoup(doc, 'html.parser', parse_only=findDivs)]
[<div class="hello">Aloha World</div>]
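
If a third-party library is not an option, roughly the same filtering idea can be sketched with Python's standard html.parser module. This is only a sketch under assumptions: the names DivCollector and extract are invented for the example, and it collects just the text inside matching DIVs. Because the parser accepts input in pieces via feed(), a multi-megabyte file can be read in chunks rather than loaded as one string.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect the text inside every <div> that carries a given class.

    Hypothetical helper class for illustration only.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # >0 while we are inside a matching <div>
        self.buffer = []    # text pieces of the current matching <div>
        self.results = []   # finished text of each matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested <div> inside a match: track it so we close correctly.
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract(chunks, class_name):
    """Feed an iterable of string chunks to the parser and return the matches."""
    parser = DivCollector(class_name)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.results


doc = (
    '<div class="test">Hello World</div>'
    '<div class="hello">Aloha World</div>'
    '<div>Hey There</div>'
)
print(extract([doc], "hello"))
```

For a real file you could pass something like iter(lambda: f.read(65536), "") as the chunks argument, with f opened in text mode.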
Paolo Bergantino
+2  A: 

The Html Agility Pack is a stellar option if you want to use C#.

Chris Ballance
+1  A: 

Sounds like a case for good old regular expressions.

Input:

<div class="test">Hello World</div>
<div class="somename">Aloha World</div>
<div>Hey There</div>

RegEx:

\<div\sclass\=\"somename\"\>(?<Text>.*?)\<\/div\>

Yields:

Aloha World (note: In a single group named Text)

You would probably need to account for missing enclosing quotes, etc.

Although with regular expressions now you have two problems.
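
The pattern above uses the .NET spelling for a named group, (?<Text>...). For anyone trying this from Python, named groups are written (?P<Text>...) instead. Below is a rough sketch of the same idea that also tolerates missing or single quotes around the class value. The function names make_div_pattern and find_texts are made up for the example, and it still assumes class is the only attribute and that matching DIVs are not nested.

```python
import re


def make_div_pattern(class_name):
    """Build a regex for <div class=NAME>...</div> (hypothetical helper).

    Group 1 captures an optional opening quote, and the backreference
    requires the same quote (or none) after the class name.
    """
    return re.compile(
        r"<div\s+class\s*=\s*([\"']?)"
        + re.escape(class_name)
        + r"\1\s*>(?P<Text>.*?)</div>",
        re.DOTALL,  # let the lazy .*? span line breaks inside the div
    )


def find_texts(html, class_name):
    """Return the captured Text group of every matching div."""
    return [m.group("Text") for m in make_div_pattern(class_name).finditer(html)]


sample = """
<div class="test">Hello World</div>
<div class="somename">Aloha World</div>
<div class=somename>No Quotes</div>
<div>Hey There</div>
"""
print(find_texts(sample, "somename"))
```

Anything beyond this (extra attributes, nested DIVs, comments) is where a real parser starts to pay off.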

Codebrain
Irony is nice. But it's not easy to upvote answers when they're of the form "don't do this".
S.Lott
...now with less irony
Codebrain
A: 

Give TinyXML a try. (C++ XML parser)