I want to use an HTML parser that does the following in a clean, elegant way:
- Extract text (this is most important)
- Extract links, meta keywords
- Reconstruct the original document (optional, but nice to have)
From my investigation so far, Jericho seems to fit. Are there any other open-source libraries you would recommend?
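
To make the first two requirements concrete, here is a rough sketch of the kind of extraction I mean. It does not use Jericho. It assumes the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`), and the names `HtmlExtractSketch`, `Extracted`, and `extract` are made up for illustration. It makes no attempt to reconstruct the original document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlExtractSketch {

    /** Hypothetical collector: gathers text, link targets, and meta keywords. */
    static final class Extracted extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<>();
        String keywords; // stays null if no <meta name="keywords"> is seen

        @Override
        public void handleText(char[] data, int pos) {
            // Append each text chunk the parser reports, separated by a space.
            text.append(data).append(' ');
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Record the href of every <a> element that has one.
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // <meta> has no end tag, so it arrives here as a simple tag.
            if (tag == HTML.Tag.META) {
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "keywords".equalsIgnoreCase(name.toString())) {
                    keywords = content.toString();
                }
            }
        }
    }

    static Extracted extract(String html) {
        Extracted out = new Extracted();
        try (Reader reader = new StringReader(html)) {
            // true = ignore charset declarations, since the input is already a String.
            new ParserDelegator().parse(reader, out, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title>"
                + "<meta name=\"keywords\" content=\"parser, html\"></head>"
                + "<body><p>Hello <a href=\"http://example.com/a\">world</a></p></body></html>";
        Extracted e = extract(html);
        System.out.println("text:     " + e.text.toString().trim());
        System.out.println("links:    " + e.links);
        System.out.println("keywords: " + e.keywords);
    }
}
```

Something along these lines, but more forgiving of real-world markup and with less boilerplate, is what I am hoping a dedicated library would give me.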