views: 324
answers: 2

Hello to all, I'm writing an HTML text feature extractor in C++. The program needs to be REALLY fast: I need to extract these features in milliseconds per HTML page, memory usage needs to stay low, and Unicode support would be nice.

I know how difficult it is to have all of these things, but I want a parser that at least comes close.

Does anybody have a suggestion?

+1  A: 

I would run the HTML through Tidy first, and then use an XML/XHTML parser (Xerces) to parse the code.

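In code, that pipeline looks roughly like this. This is only a minimal sketch of the Tidy-then-Xerces approach, not a finished implementation: it assumes TidyLib (libtidy) and Xerces-C++ are installed, and Tidy's header names and option ids vary a little between releases.

    // Minimal sketch of the Tidy -> Xerces pipeline: repair the raw HTML with
    // TidyLib in-process, then hand the resulting XHTML to Xerces-C++.
    #include <string>

    #include <tidy.h>
    #include <tidybuffio.h>   // shipped as <buffio.h> in older Tidy releases

    #include <xercesc/util/PlatformUtils.hpp>
    #include <xercesc/parsers/XercesDOMParser.hpp>
    #include <xercesc/framework/MemBufInputSource.hpp>
    #include <xercesc/dom/DOM.hpp>

    // Repair raw HTML into well-formed XHTML without spawning a tidy process.
    static std::string tidyToXhtml(const std::string& html) {
        TidyDoc tdoc = tidyCreate();
        TidyBuffer out = {0};
        tidyBufInit(&out);

        tidyOptSetBool(tdoc, TidyXhtmlOut, yes);       // emit XHTML
        tidyOptSetBool(tdoc, TidyQuiet, yes);
        tidyOptSetBool(tdoc, TidyShowWarnings, no);
        tidySetCharEncoding(tdoc, "utf8");

        tidyParseString(tdoc, html.c_str());
        tidyCleanAndRepair(tdoc);
        tidySaveBuffer(tdoc, &out);

        std::string xhtml(reinterpret_cast<const char*>(out.bp), out.size);
        tidyBufFree(&out);
        tidyRelease(tdoc);
        return xhtml;
    }

    int main() {
        using namespace xercesc;
        XMLPlatformUtils::Initialize();
        {
            std::string xhtml = tidyToXhtml("<html><body><p>broken <b>page</body>");

            XercesDOMParser parser;
            parser.setValidationScheme(XercesDOMParser::Val_Never);
            parser.setDoNamespaces(false);

            MemBufInputSource src(
                reinterpret_cast<const XMLByte*>(xhtml.data()),
                xhtml.size(), "tidied-page");
            parser.parse(src);

            DOMDocument* doc = parser.getDocument();
            // ... walk doc's DOM tree here and extract the text features ...
            (void)doc;
        }
        XMLPlatformUtils::Terminate();
        return 0;
    }

Whether the extra Tidy pass fits a per-page millisecond budget is something to benchmark on real pages.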
Vivin Paliath
Tidy is not really fast, especially since it is a separate process.
EFraim
What do you think about libxml++ and Tidy? A lot of people have recommended that I use Tidy to avoid problems with broken HTML.
Alessandro
@EFraim, Tidy has a C++ wrapper. http://users.rcn.com/creitzel/tidy.html#cplusplus. So it will not be a separate process and he can compile it to native code.
Vivin Paliath
@Vivin: I stand corrected on the process thing, but will using two parsers be any faster than using one?
EFraim
I think it's implementation-dependent, right? It depends on how that HTML parser is implemented, although IMO any parser that tries to handle HTML will probably tidy it up first so that malformed HTML becomes much easier to deal with. I've heard that Tidy and Xerces are both pretty fast. I guess the other alternative is to use a DOM parser of some kind.
Vivin Paliath
@Vivin: actually I'm thinking of using libxml++ because of its DOM parser; would it be slower than Xerces?
Alessandro
Not sure; I haven't seen any benchmarks. I think libxml++ should be fine. The main idea is that you will probably need to run the HTML through Tidy first, and then you can use whichever XML parser you like.
Vivin Paliath
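One note on the libxml++ route: libxml2, the C library that libxml++ wraps, also ships an HTML parser of its own that can recover from malformed markup, so a separate Tidy pass may not be strictly necessary. A minimal sketch against the raw libxml2 C API (assumed installed):

    // Minimal sketch: parse broken HTML directly with libxml2's HTML parser.
    #include <cstdio>
    #include <cstring>
    #include <libxml/HTMLparser.h>
    #include <libxml/tree.h>

    // Recursively print element names -- a stand-in for real feature extraction.
    static void walk(xmlNode* node) {
        for (xmlNode* cur = node; cur != nullptr; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE)
                std::printf("element: %s\n", reinterpret_cast<const char*>(cur->name));
            walk(cur->children);
        }
    }

    int main() {
        const char* html = "<html><body><p>broken <b>page</body>";

        htmlDocPtr doc = htmlReadMemory(
            html, static_cast<int>(std::strlen(html)), "page.html", "UTF-8",
            HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
        if (doc == nullptr)
            return 1;

        walk(xmlDocGetRootElement(doc));

        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }

This keeps the work in a single parser pass, which speaks to EFraim's question about whether two parsers can really be faster than one.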
A: 

WebKit has a reputation for being very fast.

StackedCrooked
WebKit is not an HTML parser, for a start; it is a rendering engine. It has a parser inside, but pulling in WebKit just for the parser is overkill.
EFraim