I write a lot of parsers. Until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.
Now I want to separate the two tasks.
Since about 80% of my work involves only parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser is best. Ideally, it would work much like HtmlUnit's parser.
==============================Edited=========================================
By "best", I mean it should have at least the following features:
- Speed
- Ease of locating any HtmlElement by its "id", "name", or "tag type".
It is fine if it doesn't clean up dirty HTML. I don't need to clean any HTML source; I just need the easiest way to navigate between HtmlElements and harvest data from them.