Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write something in Google Docs, like this, and run that command, it outputs (after removing the CSS and JavaScript) this:

\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!

So the formatting's all messed up. I'm sure someone has already solved details like these somewhere out there.

A: 

Is simply stripping tags and excess line breaks acceptable?

html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')

The first gsub strips tags, the second collapses consecutive line breaks into one, and the third removes a line break at the start or end of the string.
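If line breaks from `<br/>` and block elements matter, one option is to turn those tags into newlines before stripping the rest. The sketch below is only an illustration: `strip_html` is a made-up helper name, the list of block tags is an assumption about what should end a line, and a regex approach like this does not decode entities such as `&amp;` or cope with malformed markup.

```ruby
# Hypothetical helper (not part of the one-liner above): converts <br> and the
# closing tags of some common block elements into newlines, then strips tags.
def strip_html(html)
  html
    .gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, '')  # drop script/style bodies
    .gsub(/<br\s*\/?>/i, "\n")                          # explicit line breaks
    .gsub(/<\/(?:p|div|h[1-6]|li|tr)\s*>/i, "\n")       # assumed set of block tags
    .gsub(/<\/?[^>]*>/, '')                             # strip every remaining tag
    .gsub(/[ \t]+/, ' ')                                # collapse spaces and tabs
    .gsub(/ ?\n ?/, "\n")                               # trim spaces around breaks
    .gsub(/\n{2,}/, "\n")                               # collapse repeated breaks
    .strip
end

sample = "<h1>Test</h1>\n<h2>HELLO THERE</h2>" \
         "<p>I am some text<br/>on the next line!!!</p><div>OKAY!</div>"
puts strip_html(sample)
```

On this sample, each heading, each half of the paragraph, and the div land on their own line. To keep one blank line between blocks instead, the `\n{2,}` rule could be swapped for `\n{3,}` with a `"\n\n"` replacement.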

Matchu
I wish, but I'd like the text to keep the same spacing/breaks (the way it appears, at least). A double-spaced line in the HTML sometimes gets converted into 5-10 line breaks with that approach. Updated the question.
viatropos
@viatropos: Is it acceptable to simply remove redundant line breaks, then?
Matchu
Then I'd have to build my own parser in the end :). I don't really have the time to do that right now.
viatropos
@viatropos: Why does removing excess line breaks require a parser? See edited answer.
Matchu
Ohh. `<br/>` tags. Gotcha.
Matchu
I'm looking for something that has already solved these little issues. Thanks for the help, though; when I get some time later I'd be down to work through it. Until then, if there's something that's ready to go, that'd be awesome.
viatropos
@viatropos: It's still not clear to me exactly what output you want. You might want to include an example, since it seems like you're assuming that the output you desire is the format that everyone would desire. Such is likely not the case, and you may, in fact, be forced to do your own work in this case to get exactly what you want.
Matchu
A: 

Actually, this is much simpler:

require 'rubygems'
require 'nokogiri'

puts Nokogiri::HTML(my_html).text

You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.

Matchu
A: 

You could start with something like this:

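require 'open-uri'
require 'rubygems'
require 'nokogiri'

uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
# URI.open comes from open-uri; a bare open(uri) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(uri))
# Drop nodes that carry no visible text
doc.css('script').each { |node| node.remove }
doc.css('link').each { |node| node.remove }
# Collapse runs of spaces and runs of newlines in the remaining body text
puts doc.css('body').text.squeeze(" ").squeeze("\n")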
Levi
If you care to clean up a bit more, replace the last line with this: puts doc.css('body').text.split("\n").collect { |line| line.strip }.join("\n")
Levi
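To see how the two cleanup variants in this answer differ, here is a small sketch run on a made-up string standing in for `doc.css('body').text`. The method names are hypothetical, and `tidy_lines` is an assumed extra step (capping consecutive blank lines at one) that nobody in the thread proposed.

```ruby
# squeeze only collapses runs of the same character, so a line holding a
# single space still separates two newlines and survives.
def squeeze_whitespace(text)
  text.squeeze(" ").squeeze("\n")
end

# Trims every line, but blank lines are kept (as empty strings).
def strip_lines(text)
  text.split("\n").collect { |line| line.strip }.join("\n")
end

# Assumed extra step: allow at most one blank line in a row.
def tidy_lines(text)
  strip_lines(text).gsub(/\n{3,}/, "\n\n").strip
end

text = "  Title  \n \n   \nFirst   line\n\n\nSecond line  "
p squeeze_whitespace(text)
p strip_lines(text)
p tidy_lines(text)
```

Note that neither `strip_lines` nor `tidy_lines` touches spaces inside a line, so `First   line` keeps its three spaces; chaining a `squeeze(" ")` would handle that if it matters.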
This is really close! I ran this on http://docs.google.com/View?id=dffk85xk_63f29hv2hn and it's almost perfect. It's just not including the first line break (<p> or <div> tags around content). Is there a way to include that?
viatropos
Hmm, nothing is jumping out at me. Are you sure you don't want to just pipe this through lynx?
Levi
A: 

You want hpricot_scrub:

http://github.com/UnderpantsGnome/hpricot_scrub

You can specify which tags to strip / keep in a config hash.

Matt M.