I want to pull the text out of HTML files for indexing purposes, and do so as fast as possible. Rather than write something from scratch, I'd like to see how much of this has already been done for me.
Currently I'm just piping the output of html2text, which works, but between it being written in Python and it trying to prettify the text, I'm sure the speed could be improved.
So, with Linux/Unix being the priority, which C/C++ libraries would be best suited to this kind of task?