I'm looking for a way to convert text like this:
" <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n <html xml:lang=\"en\" lang=\"en\"
xmlns=\"http://www.w3.org/1999/xhtml\">\n \t<head>\n \t\t<title>My Page
Title</title>\n \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO-
8859-1\">\n <style type=\"text/css\" media=\"screen\"> \n \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page
Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>“I want my passion for
business plan and my pride in my work to show in every step of our company: from the
labels and papers, to our relationships with our customers, to the enjoyment of each bottle
of My Company business plan. As we expand our production, my dream is to plant a company
of my own to specialize in good business, my personal favorite
varietal.”</p>\n</blockquote>\n<p class=\"author\"><cite>- John
Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always
felt a deep connection to this......"
Into this:
My Page Title. Production Manager. I want my passion for business plan and my pride in my
work to show in every step of our company: from the labels and papers, to our
relationships with our customers, to the enjoyment of each bottle of My Company business
plan. As we expand our production, my dream is to plant a company of my own to specialize
in good business, my personal favorite varietal.
That's just extracting all the text before the first period. But it must:
- Strip HTML tags
- Replace \n with ". " (and multiple \n\n\n with ". ")
- Replace \t with " "
- Replace \s+ with " "
- Unescape HTML entities such as &ldquo; (“)
- Replace " with '
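For reference, the steps above could be sketched roughly like this using only Ruby's standard library. This is a minimal sketch, not a thorough solution: `plain_text_excerpt` and `EXTRA_ENTITIES` are hypothetical names made up for illustration, the regex-based tag stripping is crude compared to a real HTML parser such as Nokogiri, and the entity table only covers a few named entities that `CGI.unescapeHTML` does not decode.

```ruby
require "cgi/escape" # provides CGI.unescapeHTML

# Hypothetical, incomplete table of named entities that CGI.unescapeHTML
# leaves untouched (it only decodes &amp; &lt; &gt; &quot; and numeric forms).
EXTRA_ENTITIES = {
  "&ldquo;" => "\u201C",
  "&rdquo;" => "\u201D",
  "&nbsp;"  => " "
}.freeze

# Hypothetical helper: turns an HTML string into a one-line plain-text excerpt.
def plain_text_excerpt(html)
  # Drop <style>/<script> elements entirely so CSS/JS does not leak into the text.
  text = html.gsub(%r{<(style|script)\b.*?</\1\s*>}mi, " ")
  # Strip the remaining tags (crude: assumes no ">" inside attribute values).
  text = text.gsub(/<[^>]*>/, " ")
  # Decode the extra named entities first, then the ones CGI knows about,
  # so that something like "&amp;ldquo;" is only decoded once.
  text = text.gsub(/&[a-z]+;/i) { |entity| EXTRA_ENTITIES.fetch(entity, entity) }
  text = CGI.unescapeHTML(text)
  # Treat each run of newlines as a sentence break; collapse tabs and
  # repeated whitespace inside each line; skip lines that end up empty.
  lines = text.split(/\n+/).map { |line| line.gsub(/\s+/, " ").strip }
  text = lines.reject(&:empty?).join(". ")
  # Replace double quotes with single quotes.
  text.tr('"', "'")
end

puts plain_text_excerpt("<h1>My Page\tTitle</h1>\n\n\n<h3>Production   Manager</h3>")
# prints: My Page Title. Production Manager
```

Note that this follows the list literally, so a line that already ends in a period will end up with a doubled one after the newline replacement.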
After starting to write something like that, I figured this has probably already been solved more thoroughly somewhere else. Does anyone know a good one-liner for creating a plain-text excerpt from an HTML string like this (in Ruby)?
I use Nokogiri for full-featured HTML parsing, but it seems like it'd be just as difficult to use that.