I need a plain text representation of an arbitrary HTML file (e.g., a blog post). So far that's not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (read: "p" elements) should be justified in the plain text view (to a given number of columns) and, if possible, hyphenated to make the result more readable. Also, the resulting text file must be UTF-8 or UTF-16.
Simple plain text conversion I can do with XSLT; that's close to trivial. But justifying the text is beyond its capabilities (not quite true, because XSLT is Turing complete, but close enough to reality).
FOP and XSL-FO don't work either. They do as requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).
I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since the text contains characters like ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to spell out in the XSLT stylesheet the groff escape sequences for dozens of Unicode characters.
Another way could be conversion to TeX and output as plain text, but I have never tried this before with (La)TeX.
Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way: a solution should preferably work without root rights to install, using PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.
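
To make the justification requirement concrete, here is a minimal Python sketch of what I mean, using only the standard library's textwrap module. The function names justify_line and justify_paragraph are my own, made up for illustration. It does no hyphenation, and it assumes every character occupies exactly one column (true for ellipses and curly quotes, not for combining marks or wide CJK characters), so treat it as a rough illustration rather than a finished solution.

```python
import textwrap


def justify_line(line: str, width: int) -> str:
    """Spread extra spaces between words so the line fills `width` columns."""
    words = line.split()
    if len(words) < 2:
        return line  # a single word cannot be stretched
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    if spaces < gaps:
        return line  # defensive: the line is already longer than `width`
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost `extra` gaps get one additional space each.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def justify_paragraph(text: str, width: int = 72) -> str:
    """Wrap one paragraph and justify every line except the last."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return ""
    body = [justify_line(line, width) for line in lines[:-1]]
    return "\n".join(body + [lines[-1]])
```

The result is an ordinary Python string, so writing it with open(path, "w", encoding="utf-8") would cover the UTF-8 requirement. What the sketch leaves open is exactly the hard part: hyphenation, and extracting the paragraphs from the HTML in the first place.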