ansaurus

Question

Answer 1

+4 A:

Try Python. Use BeautifulSoup to parse the HTML. The textwrap module will allow you to format the text.
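As a rough sketch of the parse-then-wrap step: the snippet below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, purely to avoid a third-party dependency. `ParagraphExtractor` and `html_to_wrapped_text` are hypothetical names made up for this illustration, and it only looks at `<p>` elements, so treat it as a starting point rather than a full converter.

```python
from html.parser import HTMLParser
import textwrap


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the collected pieces and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <i>, ...) also lands here.
        if self._buffer is not None:
            self._buffer.append(data)


def html_to_wrapped_text(html, width=40):
    """Return the paragraphs of `html`, each wrapped to `width` columns."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(textwrap.fill(p, width=width) for p in parser.paragraphs)
```

With BeautifulSoup installed, the extractor class could presumably be replaced by a loop over the parsed `<p>` tags; the `textwrap.fill` call would stay the same.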

There are two features missing, though. To justify the text, you'll need to pad the gaps between words with extra spaces so that each line reaches the full width, but that shouldn't be a big issue (see this code example).

For hyphenation, try this project.
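For the justification step, a minimal sketch could look like the following. It wraps with `textwrap` and then spreads the leftover spaces across the word gaps, giving the leftmost gaps one extra space when the division is uneven. `justify_line` and `justify_paragraph` are hypothetical helper names, and leaving the last line ragged is an assumption about the desired output, not something `textwrap` provides.

```python
import textwrap


def justify_line(line, width):
    """Pad the gaps between words so the line is exactly `width` wide."""
    words = line.split()
    if len(words) < 2:
        return line  # a single word cannot be stretched
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    base, extra = divmod(spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word)
        # The first `extra` gaps get one additional space.
        parts.append(" " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


def justify_paragraph(text, width=40):
    """Wrap `text` to `width` and justify every line except the last."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return ""
    body = [justify_line(line, width) for line in lines[:-1]]
    return "\n".join(body + [lines[-1]])


if __name__ == "__main__":
    print(justify_paragraph(
        "The quick brown fox jumps over the lazy dog near the river bank",
        width=20))
```

Hyphenation would slot in before the wrapping step, by offering `textwrap` extra break points inside long words.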

Aaron Digulla 2009-10-30 09:12:38

It's a bit BYO, but, yes, with a bit of logic implemented between textwrap and the hyphenation it could work. I like the pure Python approach.

Boldewyn 2009-10-30 09:18:03

Answer 2

A:

Links or lynx might be worth a try, see the -dump switch. The encoding part you can then easily solve separately using iconv or something similar.

zoul 2009-10-30 09:16:30

Lynx doesn't do justification, does it? As far as I know, it completely ignores CSS (which is of little use for terminal browsers). So, it's simply converting HTML to text, which is rather simple with other methods, too. Sorry.

Boldewyn 2009-10-30 11:27:53

And did you try some of the Links variants? They are much better at formatting, and some of them might even support justification.

zoul 2009-10-30 12:14:32

Could be that I missed something, but I found neither Lynx nor Links nor Elinks nor Links-hacked nor w3m doing justification or even hyphenation.

Boldewyn 2009-11-02 11:27:42

Ah, sorry then, I was just guessing.

zoul 2009-11-02 12:02:30

No problem! Actually, I'd really like to see a terminal-based browser supporting the CSS tty media type (@media tty).

Boldewyn 2009-11-03 10:51:46

Answer 3

+1 A:

If you are familiar with Emacs, you may open the HTML file in Emacs-W3M (i.e. M-x w3m-find-file foo.html), save the rendered page as a plain text file, and then call M-x set-justification-full on it.

You can even write a small function to do the job:

(require 'w3m)  ; emacs-w3m, which provides w3m-rendering-buffer

(defun my-html-to-justified-text (html-file text-file)
  "Render HTML-FILE with emacs-w3m and save it as fully justified TEXT-FILE."
  (find-file html-file)
  (w3m-rendering-buffer)
  (set-justification-full (point-min) (point-max))
  (write-file text-file))

(my-html-to-justified-text "~/tmp/2.html" "~/tmp/2.txt")

Török Gábor 2009-11-10 16:04:49

This is nice too, and I like a Lisp solution... I'm accepting Aaron's answer because it also covers the hyphenation part, but thank you for showing what Emacs can do.

Boldewyn 2009-11-10 20:41:04


Justified plain text from HTML
