views:

624

answers:

5

Hi there,

I have a page that will list news articles. To cut down on the page's length, I only want to display a teaser (the first 200 words / 600 letters of the article) and then display a "more..." link, that, when clicked, will expand the rest of the article in a jQuery/Javascript way. Now, I've all that figured out and even found the following helper method on some paste page, which will make sure, that the news article (string) is not chopped up right in the middle of a word:

 def shorten (string, count = 30)
    if string.length >= count
      shortened = string[0, count]
      splitted = shortened.split(/\s/)
      words = splitted.length
      splitted[0, words-1].join(" ") + ' ...'
    else
      string
    end
  end

The problem that I have is that the news article bodies that I get from the DB are formatted HTML. So if I'm unlucky, the above helper will chop up my article string right in the middle of an html tag and insert the "more..." string there (e.g. between ""), which will corrupt my html on the page.

Is there any way around this or is there a plugin out there that I can use to generate excerpts/teasers from an HTML string?

Regards,

Sebastian

A: 

you would have to write a more complex parsers if you dont want to split in the middle of html elements. it would have to remember if it is in the middle of a <> block and if its between two tags.

even if you did that, you would still have problems. if some put the whole article into an html element, since the parser couldnt split it anywhere, because of the missing closing tag.

if it is possible at all i would try not to put any tags into the articles or keep it to tags that dont contain anything (no <div> and so on). that way you would only have to check if you are in the middle of a tag which is pretty simple:

  def shorten (string, count = 30)
     if string.length >= count
       shortened = string[0, count]
       splitted = shortened.split(/\s/)
       words = splitted.length
       if(splitted[words-1].include? "<")
         splitted[0,words-2].join(" ") + ' ...'
       else
         splitted[0, words-1].join(" ") + ' ...'
     else
       string
     end   
  end
LDomagala
+1  A: 

My answer here should do work. The original question (err, asked by me) was about truncating markdown, but I ended up converting the markdown to HTML then truncating that, so it should work.

Of course if your site gets much traffic, you should cache the excerpt (perhaps when the post is created/updated, you could store the excerpt in the database?), this would also mean you could allow the user to modify or enter their own excerpt

Usage:

>> puts "<p><b><a href=\"hi\">Something</a></p>".truncate_html(5, at_end = "...")
=> <p><b><a href="hi">Someth...</a></b></p>

..and the code (copied from the other answer):

require 'rexml/parsers/pullparser'

class String
  def truncate_html(len = 30, at_end = nil)
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        results << p_e[0][0..new_len]
        new_len -= p_e[0].length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    if at_end
      results << "..."
    end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
dbr
oh i like yours, it fixes the problem with tags around the text
LDomagala
A: 

You can use a combination of Sanitize and Truncate.

truncate("And they found that many people were sleeping better.", 
  :omission => "... (continued)", :length => 15)
# => And they found... (continued)

I'm doing a similar task where I have blog posts and I just want to show a quick excerpt. So in my view I simply do:

sanitize(truncate(blog_post.body), :length => 150))

That strips out the HTML tags, gives me the first 150 characters and is handled in the view so it's MVC friendly.

Good luck!

mwilliams
+2  A: 

Thanks a lot for your answers! However, in the meantime I stumbled upon the jQuery HTML Truncator plugin, which perfectly fits my purposes and shifts the truncation to the client-side. It doesn't get any easier :-)

Sebastian
A: 

I would have sanitized the HTML and extracted the first sentence. Assuming you have an article model, with a 'body' attribute that contains the HTML:

# lib/core_ext/string.rb
class String
  def first_sentence
    self[/(\A[^.|!|?]+)/, 1]
  end
end

# app/models/article.rb
def teaser
  HTML::FullSanitizer.new.sanitize(body).first_sentence
end

This would convert "<b>This</b> is an <em>important</em> article! And here is the rest of the article." into "This is an important article".

August Lilleaas