I need a plain text representation of an arbitrary HTML file (e.g., a blog post). So far that's not a problem; there are dozens of HTML-to-text converters. However, the text in paragraphs (read "p elements") should be justified in the plain text view (to a certain number of columns) and, if possible, hyphenated to give a more readable result. Also, the resulting text file must be UTF-8 or UTF-16.

Simple plain text conversion I can do with XSLT; that's close to trivial. But justifying text is beyond its capabilities (not quite true, because XSLT is Turing complete, but close enough to reality).

FOP and XSL-FO don't work either. They do what is requested, but FOP's plain text output is horrible (the developers say that it is not intended for such usage).

I also experimented with HTML -> XSLT -> Roff, but I'm stuck with groff, and its Unicode support is far from optimal. Since there are characters like ellipses ("...") and typographically correct quotation marks, it is quite cumbersome to spell out the groff escape sequences for dozens of Unicode characters in the XSLT stylesheet.

Another way could be conversion to TeX and output as plain text, but I have never tried this before with (La)TeX.

Perhaps I have missed something really simple. Does anyone have an idea how I could achieve the above? By the way: a solution should preferably work without needing root rights to install, using PHP, Python, Perl, XSLT or any program found in a half-decent Linux distro.

+4  A: 

Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.

There are two features missing, though. To justify the text, you'll need to add spaces to each line but that shouldn't be a big issue (see this code example).

For hyphenation, try this project.
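As a rough illustration of the "add spaces to each line" step, here is a minimal sketch that uses only the standard textwrap module. It assumes the paragraph text has already been extracted from the HTML (for example with BeautifulSoup); the helper names justify_line and justify_paragraph are made up for this example, and hyphenation is not handled.

```python
import textwrap


def justify_line(line, width):
    """Pad the gaps between words so that the line is exactly `width` wide."""
    words = line.split()
    if len(words) < 2:
        # A single word has no gaps to stretch; leave it as it is.
        return line
    gaps = len(words) - 1
    total_spaces = width - sum(len(word) for word in words)
    if total_spaces < gaps:
        # Line is already too long for the width; fall back to single spaces.
        return " ".join(words)
    base, extra = divmod(total_spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word)
        # The leftmost `extra` gaps get one additional space each.
        parts.append(" " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def justify_paragraph(text, width=72):
    """Wrap `text` to `width` columns and fully justify all but the last line."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return ""
    justified = [justify_line(line, width) for line in lines[:-1]]
    justified.append(lines[-1])  # the last line stays ragged, as is usual
    return "\n".join(justified)
```

Putting all the extra spaces into the leftmost gaps is the simplest choice; alternating between left and right on successive lines would look more even. A hyphenation step would have to run before or inside the wrapping, which textwrap does not do by itself.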

Aaron Digulla
It's a bit BYO, but, yes, with a bit of logic implemented between textwrap and the hyphenation it could work. I like the pure Python approach.
Boldewyn
A: 

Links or Lynx might be worth a try; see the -dump switch. You can then easily handle the encoding part separately using iconv or something similar.
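If iconv is not at hand, the re-encoding step can also be sketched in Python. This is only an illustration: the function name reencode is hypothetical, and the source encoding is an assumption, since what the browser actually emits depends on its configuration and locale.

```python
def reencode(data, src_encoding, dst_encoding="utf-8"):
    """Decode `data` (bytes) from `src_encoding` and return it as `dst_encoding` bytes."""
    # Undecodable input raises UnicodeDecodeError instead of being silently mangled.
    text = data.decode(src_encoding)
    return text.encode(dst_encoding)
```

The bytes passed in would be the captured output of the -dump run. Note that Python's "utf-16" codec writes a byte order mark at the start of the output.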

zoul
Lynx doesn't do justification, does it? As far as I know, it completely ignores CSS (which is of little use for terminal browsers). So, it's simply converting HTML to text, which is rather simple with other methods, too. Sorry.
Boldewyn
And did you try some of the links variants? These are much better at formatting, some of them might as well support justification.
zoul
Could be that I missed something, but I found neither Lynx nor Links nor Elinks nor Links-hacked nor w3m doing justification or even hyphenation.
Boldewyn
Ah, sorry then, I was just guessing.
zoul
No problem! Actually, I'd really like to see a terminal-based browser supporting the CSS "tty" media type (@media tty).
Boldewyn
+1  A: 

If you are familiar with Emacs, you may open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it.

You can even write a small function to do the job:

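(require 'w3m)  ; provided by the emacs-w3m package

(defun my-html-to-justified-text (html-file text-file)
  "Convert HTML-FILE to fully justified plain TEXT-FILE."
  (find-file html-file)
  ;; Render the HTML in the current buffer.
  (w3m-rendering-buffer)
  ;; Justify the whole buffer, then save it under the new name.
  (set-justification-full (point-min) (point-max))
  (write-file text-file))

(my-html-to-justified-text "~/tmp/2.html" "~/tmp/2.txt")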
Török Gábor
This is nice too, and I like a Lisp solution... I accepted Aaron's answer because it also covers the hyphenation part, but thank you for showing what Emacs can do.
Boldewyn