I want to pull the text out of html files for indexing purposes, and do so as fast as possible. Rather than create something from scratch, I want to see how much I can find already done for me.

Currently I'm just piping the output of html2text, which works, but since it is written in Python and also tries to prettify the text, I'm sure the speed could be improved.

So, with Linux/Unix being the priority, which C/C++ libraries would be best suited to this kind of task?

+2  A: 

To extract the text, you can use an HTML parser such as htmlcxx or libxml. You can also use any XML library after tidying up the HTML. For indexing the text, you can use CLucene.

Vijay Mathew
libxml will do. Xapian is the indexer in this case.
Named