I'm looking for a way to convert text like this:


"  <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"&gt;\n  <html xml:lang=\"en\" lang=\"en\"
 xmlns=\"http://www.w3.org/1999/xhtml\"&gt;\n   \t<head>\n   \t\t<title>My Page 
Title</title>\n   \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO-
8859-1\">\n      <style type=\"text/css\" media=\"screen\"> \n       \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>&#8220;I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.&#8221;</p>\n</blockquote>\n<p class=\"author\"><cite>- John Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always felt a deep connection to this......"

Into this:


My Page Title. Production Manager. I want my passion for business plan and my pride in my
work to show in every step of our company:  from the labels and papers, to our 
relationships with our customers, to the enjoyment of each bottle of My Company business 
plan.  As we expand our production, my dream is to plant a company of my own to specialize
in good business, my personal favorite varietal.

That's just extracting all the text before the first period. But it must:

  • Strip HTML tags
  • Replace \n with ". " (and multiple \n\n\n with ". ")
  • Replace \t with " "
  • Replace \s+ with " "
  • Unescape HTML entities such as &#8220; and &#8221;
  • Replace " with '
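
For reference, a naive version of those steps might look like the sketch below. It uses only the standard library, and the tag stripping is a crude regex rather than a real parser. The method name `html_excerpt` is made up for illustration, and dropping the contents of `<style>`/`<script>` blocks is an extra assumption that is not in the list above.

```ruby
require "cgi"

# Hypothetical helper (name and step order are illustrative assumptions).
def html_excerpt(html)
  # Assumption: CSS/JS bodies are not wanted in the excerpt, so drop them whole.
  text = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, "")
  text = text.gsub(/<[^>]+>/, "")          # strip remaining tags (crude, not a parser)
  text = CGI.unescapeHTML(text)            # decode &amp;, &quot;, &#8220;, ...
  text = text.tr("\t", " ").strip          # tabs to spaces, trim both ends
  text = text.gsub(/[.\s]*\n\s*/, ". ")    # newline runs (and a period before them) to ". "
  text = text.gsub(/\s+/, " ")             # collapse any remaining whitespace runs
  text.tr('"', "'")                        # straight double quotes to single quotes
end

puts html_excerpt("<h1>My Page Title</h1>\n<h3>Production Manager</h3>")
# prints: My Page Title. Production Manager
```

Tags are stripped before entities are decoded so that an escaped `&lt;b&gt;` in the text is kept as literal text instead of being removed as a tag.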

After starting to do something like that, I figured this is probably already solved somewhere else more thoroughly. Does anyone have a good one-liner way to create a plain text excerpt from an HTML string like this (in Ruby)?

I use Nokogiri for full-featured HTML parsing, but it seems like it'd be just as difficult to use that.

A: 

Does it have to be in Ruby? I can write it in PHP:

$text = '<html> ...';
// Newline runs become ". ", other whitespace runs " ", and " becomes '.
$patterns = array('/\n+/', '/\s+/', '/"/');
$result = preg_replace($patterns, array('. ', ' ', '\''), html_entity_decode(strip_tags($text)));
dmitrig01
A: 

Hmm. That seems like rather a lot of functionality for a one-liner. If you just want to parse and display an HTML page as plain text, I'd recommend piping it through w3m, a text-mode web browser.

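string = "..." # your string

# Feed the HTML to w3m on stdin and print the plain text it writes back.
IO.popen("w3m -T text/html", "r+") do |pipe|
  pipe.write string
  pipe.close_write # signal end of input so w3m can start rendering
  puts pipe.read
end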

Gives me:

My Page Title

Production Manager

    “I want my passion for business plan and my pride in my work to show in
    every step of our company: from the labels and papers, to our relationships
    with our customers, to the enjoyment of each bottle of My Company business
    plan. As we expand our production, my dream is to plant a company of my own
    to specialize in good business, my personal favorite varietal.”

- John Smith

Born and raised on the north coast of California, John Smith always felt a deep
connection to this......

For the rest of the substitutions, I'd recommend applying regexp replacements either before or after the w3m step, depending on your exact needs.
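
As a rough sketch of the "after" variant: w3m wraps long lines and separates blocks with blank lines, so one option is to treat blank lines as paragraph breaks, rejoin the wrapped lines inside each paragraph, and then join the paragraphs with ". ". The method name `flatten_dump` is hypothetical, and the sketch assumes the w3m output has already been read into a string.

```ruby
# Hypothetical helper: turn w3m's dumped text into a single-line excerpt.
def flatten_dump(text)
  text.strip
      .split(/\n\s*\n/)                            # blank lines separate paragraphs
      .map { |para| para.gsub(/\s+/, " ").strip }  # rejoin wrapped lines, drop indentation
      .reject(&:empty?)
      .join(". ")                                  # paragraph breaks become ". "
      .tr('"', "'")                                # straight double quotes to single quotes
end
```

Inside the block above you would call it as `flatten_dump(pipe.read)`. Note that a paragraph that already ends in a period still gets ". " appended, so you may want one more substitution to tidy that up.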

Brian Campbell