views: 324
answers: 2

Hello to all, I'm writing an HTML text feature extractor in C++. The program needs to be REALLY fast: I need to extract these features in milliseconds per HTML page, memory usage needs to stay low, and Unicode support would be nice.

I know how difficult it is to have all of these things, but I want a parser that at least comes close.

Does anybody have a suggestion?

+1  A: 

I would run the HTML through Tidy first, and then use an XML/XHTML parser (Xerces) to parse the code.

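In code, that pipeline looks roughly like this. This is only a minimal sketch of the Tidy-then-Xerces approach, not a finished implementation: it assumes TidyLib (libtidy) and Xerces-C++ are installed, and Tidy's header names and option ids vary a little between releases.

    // Minimal sketch of the Tidy -> Xerces pipeline: repair the raw HTML with
    // TidyLib in-process, then hand the resulting XHTML to Xerces-C++.
    #include <string>

    #include <tidy.h>
    #include <tidybuffio.h>   // shipped as <buffio.h> in older Tidy releases

    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/parsers/XercesDOMParser.hpp>
    #include <xercesc/framework/MemBufInputSource.hpp>
    #include <xercesc/dom/DOM.hpp>

    // Repair raw HTML into well-formed XHTML without spawning a tidy process.
    static std::string tidyToXhtml(const std::string& html) {
        TidyDoc tdoc = tidyCreate();
        TidyBuffer out = {0};
        tidyBufInit(&out);

        tidyOptSetBool(tdoc, TidyXhtmlOut, yes);       // emit XHTML
        tidyOptSetBool(tdoc, TidyQuiet, yes);
        tidyOptSetBool(tdoc, TidyShowWarnings, no);
        tidySetCharEncoding(tdoc, "utf8");

        tidyParseString(tdoc, html.c_str());
        tidyCleanAndRepair(tdoc);
        tidySaveBuffer(tdoc, &out);

        std::string xhtml(reinterpret_cast<const char*>(out.bp), out.size);
        tidyBufFree(&out);
        tidyRelease(tdoc);
        return xhtml;
    }

    int main() {
        using namespace xercesc;
        XMLPlatformUtils::Initialize();
        {
            std::string xhtml = tidyToXhtml("<html><body><p>broken <b>page</body>");

            XercesDOMParser parser;
            parser.setValidationScheme(XercesDOMParser::Val_Never);
            parser.setDoNamespaces(false);

            MemBufInputSource src(
                reinterpret_cast<const XMLByte*>(xhtml.data()),
                xhtml.size(), "tidied-page");
            parser.parse(src);

            DOMDocument* doc = parser.getDocument();
            // ... walk doc's DOM tree here and extract the text features ...
            (void)doc;
        }
        XMLPlatformUtils::Terminate();
        return 0;
    }

Whether the extra Tidy pass fits a per-page millisecond budget is something to benchmark on real pages.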
Vivin Paliath
Tidy is not really fast, especially since it is a separate process.
EFraim
What do you think about libxml++ and Tidy? A lot of people have recommended that I use Tidy to avoid problems with broken HTML.
Alessandro
@EFraim, Tidy has a C++ wrapper. http://users.rcn.com/creitzel/tidy.html#cplusplus. So it will not be a separate process and he can compile it to native code.
Vivin Paliath
@Vivin: I stand corrected on the process thing, but will using two parsers be any faster than using one?
EFraim
I think it's implementation-dependent, right? It depends on how that HTML parser is implemented, although IMO any parser that tries to handle HTML will probably tidy it up first so that malformed HTML becomes much easier to deal with. I've heard that Tidy and Xerces are both pretty fast. I guess the other alternative is to use a DOM parser of some kind.
Vivin Paliath
@Vivin: actually I'm thinking of using libxml++ because of its DOM parser; would it be slower than Xerces?
Alessandro
Not sure; I haven't seen any benchmarks. I think libxml++ should be fine. The main idea is that you will probably need to run the HTML through Tidy first, and then you can use whichever XML parser you like.
Vivin Paliath
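One note on the libxml++ route: libxml2, the C library that libxml++ wraps, also ships an HTML parser of its own that can recover from malformed markup, so a separate Tidy pass may not be strictly necessary. A minimal sketch against the raw libxml2 C API (assumed installed):

    // Minimal sketch: parse broken HTML directly with libxml2's HTML parser.
    #include <cstdio>
    #include <cstring>
    #include <libxml/HTMLparser.h>
    #include <libxml/tree.h>

    // Recursively print element names -- a stand-in for real feature extraction.
    static void walk(xmlNode* node) {
        for (xmlNode* cur = node; cur != nullptr; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE)
                std::printf("element: %s\n", reinterpret_cast<const char*>(cur->name));
            walk(cur->children);
        }
    }

    int main() {
        const char* html = "<html><body><p>broken <b>page</body>";

        htmlDocPtr doc = htmlReadMemory(
            html, static_cast<int>(std::strlen(html)), "page.html", "UTF-8",
            HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc == nullptr)
            return 1;

        walk(xmlDocGetRootElement(doc));

        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }

This keeps the work in a single parser pass, which speaks to EFraim's question about whether two parsers can really be faster than one.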
A: 

WebKit has a reputation for being very fast.

StackedCrooked
WebKit is not an HTML parser, for a start; it is a rendering engine. It has a parser inside, but pulling in WebKit just for the parser is overkill.
EFraim