ansaurus

Question

Parsing specific elements out of a very large HTML file

Answer 1

+1 A:

Xerces is well documented, supported and tested. (C++)

http://xerces.apache.org/xerces-c/

(Yes, it's an XML parser, but it should do the trick as long as the HTML is well-formed.)

altCognito 2009-04-11 01:29:01

Answer 2

+13 A:

I would use Python and BeautifulSoup for the job. It is very solid at handling this kind of stuff.

For your case, you can use SoupStrainer to make BeautifulSoup parse only the DIVs that have the class you want, so it doesn't have to build a tree for the whole document in memory.

For example, say your document looks like this:

<div class="test">Hello World</div>
<div class="hello">Aloha World</div>
<div>Hey There</div>

You can write this:

>>> from bs4 import BeautifulSoup, SoupStrainer
>>> doc = '''
...     <div class="test">Hello World</div>
...     <div class="hello">Aloha World</div>
...     <div>Hey There</div>
... '''
>>> findDivs = SoupStrainer('div', {'class':'hello'})
>>> [tag for tag in BeautifulSoup(doc, 'html.parser', parse_only=findDivs)]
[<div class="hello">Aloha World</div>]
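
If BeautifulSoup isn't installed, or the file is too big to hand over as one string, a similar filter can be sketched with the standard library's html.parser, feeding the file in chunks. This is a different technique from SoupStrainer, not a drop-in equivalent. DivCollector, find_div_text and the 64 KB chunk size are hypothetical names and values picked for illustration.

```python
import io
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching div (counts nested divs)
        self.parts = []     # text fragments of the div currently being collected
        self.results = []   # finished text of each matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a div nested inside a match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def find_div_text(stream, target_class, chunk_size=64 * 1024):
    """Read a text stream chunk by chunk and return the matching divs' text."""
    parser = DivCollector(target_class)
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)  # the parser buffers tags that straddle two chunks
    parser.close()
    return parser.results


if __name__ == "__main__":
    doc = '''
        <div class="test">Hello World</div>
        <div class="hello">Aloha World</div>
        <div>Hey There</div>
    '''
    # For a real file, pass an object from open(path, encoding="utf-8") instead.
    print(find_div_text(io.StringIO(doc), "hello"))  # ['Aloha World']
```

Only the current chunk and the text of matching divs are held in memory. The depth counter assumes every div is closed, so badly broken markup may need a more forgiving parser.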

Paolo Bergantino 2009-04-11 01:29:21

Answer 3

+2 A:

The Html Agility Pack is a stellar option if you want to use C#.

Chris Ballance 2009-04-11 01:37:30

Answer 4

+1 A:

Sounds like a case for good old regular expressions.

Input:

<div class="test">Hello World</div>
<div class="somename">Aloha World</div>
<div>Hey There</div>

RegEx:

\<div\sclass\=\"somename\"\>(?<Text>.*?)\<\/div\>

Yields:

Aloha World (note: In a single group named Text)

You will probably need to account for missing enclosing quotes, extra attributes, etc.

Although with regular expressions now you have two problems.
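
For anyone trying the pattern from Python, here is a rough sketch. Python spells the named group (?P<Text>...) instead of (?<Text>...), and this version also tolerates single-quoted or unquoted class values. DIV_TEXT and extract_text are hypothetical names, and "somename" is just the class from the sample input.

```python
import re

# (["']?) captures an optional opening quote; \1 requires the same closing quote.
# DOTALL lets the text span several lines; IGNORECASE accepts <DIV> as well.
DIV_TEXT = re.compile(
    r"""<div\s+class\s*=\s*(["']?)somename\1\s*>(?P<Text>.*?)</div>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_text(html):
    """Return the Text group of every matching div, in document order."""
    return [m.group("Text") for m in DIV_TEXT.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<div class="test">Hello World</div>\n'
        '<div class="somename">Aloha World</div>\n'
        '<div>Hey There</div>\n'
    )
    print(extract_text(sample))  # ['Aloha World']
```

It still misses divs with extra attributes or several classes, and a nested div cuts the match short at the first closing tag, so treat it as a quick hack for markup you control.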

Codebrain 2009-04-11 10:42:33

Irony is nice, but it's not easy to upvote answers when they're of the form "don't do this".

S.Lott 2009-04-11 11:05:38

...now with less irony

Codebrain 2009-04-11 11:51:55

Answer 5

A:

Give TinyXML a try. (C++ XML parser)

2009-04-12 15:07:07
