For a blog-like project, I want to take the first few paragraphs, headers, lists, or whatever else fits within a given number of characters from a Markdown-generated HTML fragment, and display them as a summary.

So if I have

<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>

Assume I want a summary made of the elements whose text fits within the first 150 characters. It does not have to be exact: I could simply take the first 150 characters, tags included, but that would probably leave artifacts at the tail that are harder to handle. The result should contain the h1, the first p and the ul, but not the final p, which would otherwise be truncated. If the first element alone has more than 150 characters, I would take that whole element.

How could I get this? Using XPath or a regex? I am a bit short on ideas here...

Edit

First I want to give a big THANK YOU to all of you who replied!

While I got really great answers in this thread, I actually found it much easier to hook in before the Markdown interpreter runs: take the first n text blocks separated by \r\n\r\n and pass only those on for HTML generation.

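  class String
    # Returns the leading Markdown blocks (separated by blank lines)
    # whose combined length stays within `length`.
    def summarize_md(length)
      sum = ""
      split(/\r\n\r\n/).each do |ea|
        # Always keep the first block, even if it alone exceeds the limit.
        break if !sum.empty? && sum.length + ea.length > length
        sum += "#{ea}\r\n\r\n"
      end
      sum
    end
  end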

While this could probably be reduced to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions. Still, since my question could be read as if the HTML were the starting point (and not the Markdown text), I'll accept the first answer... I hope that's fair...
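For what it's worth, here is a minimal sketch of a more compact variant of the same idea. It is an illustration only, not the code actually in use: `summarize_md_blocks` is a hypothetical name, and the sketch assumes that blocks may be separated by either \r\n or \n blank lines and that the first block is always kept, as described in the question.

```ruby
# Hypothetical helper (not from the original post): keeps whole Markdown
# blocks while their combined length stays within `limit`.
def summarize_md_blocks(markdown, limit)
  total = 0
  markdown.split(/(?:\r?\n){2,}/)            # blank-line separated blocks
          .each_with_index
          .take_while { |block, i|
            total += block.length
            i.zero? || total <= limit        # the first block is always kept
          }
          .map(&:first)
          .join("\n\n")
end

md = "# hello world\r\n\r\nfirst paragraph\r\n\r\nsecond paragraph that is long"
puts summarize_md_blocks(md, 30)
```

The separator lengths are not counted against the limit here, which seemed acceptable given that the limit does not have to be exact.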

A: 

Using XPath is the most robust and flexible. Here's a sample app:

require 'rubygems'
require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

doc = Nokogiri::HTML.parse(html)
doc.xpath('//text()').each do |node|
  text = node.text
  break if summary.length + text.length >= LIMIT
  summary << text
end

puts summary
puts summary.length

The XPath //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can narrow the expression accordingly.

Mark Thomas
@Jan as a followup to your question in comments elsewhere, this can easily be amended to give you the tags.
Mark Thomas
A: 

A pure XPath 1.0 solution:

substring(/*,1,150)

where the parent of the provided XHTML fragment is the top element (/* or /html).

A very exact XPath 2.0 solution exists:

   for $t in (//text())[not(sum((.| preceding::text())/string-length(.)) gt 150)]
     return
       ($t, '&#xA;')

Do note: The XML document must be parsed in a mode that discards the white-space-only text nodes. Otherwise string-length(.) must be replaced by string-length(normalize-space(.))
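Since libxml2-based parsers such as Nokogiri only implement XPath 1.0, the cumulative-length filter from the XPath 2.0 expression can also be sketched in plain Ruby once the text nodes have been extracted into an array. This is an illustration under assumptions, not part of the XPath solution itself: `texts_within` is a hypothetical helper, and the whitespace handling imitates string-length(normalize-space(.)).

```ruby
# Hypothetical helper mirroring the XPath 2.0 predicate: keep each text
# whose running total (itself included) does not exceed `limit`.
def texts_within(texts, limit)
  running = 0
  texts.select do |t|
    # Comparable to string-length(normalize-space(.))
    running += t.strip.gsub(/\s+/, ' ').length
    running <= limit
  end
end

texts = ["hello world", "\n", "Lets say these are 100 chars", "\n    ",
         "some bla bla, 40 chars", "some other text"]
p texts_within(texts, 50)
```

Because the running total never decreases, everything after the first text that overflows the limit is rejected as well, so the result is always a prefix of the input.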

Dimitre Novatchev
Interesting solution(s) indeed! Do you think it would be possible to get the tags also, not just text()?
Jan Limpens
@Jan: Not with XPath 1.0. The XPath 2.0 expression can be adjusted a little. However, it must then also take into account the string lengths of the tags (opening and closing) plus the attributes (names and values). If you only have XPath 1.0 available, then an XSLT solution can be provided, in case you don't exclude XSLT as a possibility. Anyway, you haven't specified in this question that you want the tags, too. So, feel free to ask a new question and also tag it XSLT :)
Dimitre Novatchev
A: 

How could I get this?

XSLT, of course!

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:param name="pMaxLength" select="73"/>
    <xsl:template match="node()">
        <xsl:param name="pPrecedingLength" select="0"/>
        <xsl:variable name="vContent">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates select="node()[1]">
                    <xsl:with-param name="pPrecedingLength"
                                    select="$pPrecedingLength"/>
                </xsl:apply-templates>
            </xsl:copy>
        </xsl:variable>
        <xsl:variable name="vLength"
                      select="$pPrecedingLength + string-length($vContent)"/>
        <xsl:if test="$pMaxLength > $vLength and
                      (string-length($vContent) or not(node()))
                      or not($pPrecedingLength)">
            <xsl:copy-of select="$vContent"/>
            <xsl:apply-templates select="following-sibling::node()[1]">
                <xsl:with-param name="pPrecedingLength" select="$vLength"/>
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Output:

<html>
    <h1>hello world</h1>
    <p>Lets say these are 100 chars</p>
    <ul>
        <li>some bla bla, 40 chars</li>
    </ul>
</html>
Alejandro
A: 

For my uses I have always wanted to strip the tags, because they can contain all sorts of nastiness that would totally hose the display of the summary. They can also seriously skew the character count, depending on the tags and whether they carry attributes.

I've used something like this many times.

require 'nokogiri'

html = %q{
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
}

doc = Nokogiri::HTML(html)
puts doc.content.gsub(/\n/, ' ').squeeze(' ').strip[0, 150]  # at most 150 characters

Which outputs

hello world Lets say these are 100 chars some bla bla, 40 chars some other text

I'll leave it to you to figure out how to ignore or subtract the text from the final <p> tag, but looking up that tag and grabbing its content and then stripping it from the end of the string shouldn't be too hard.
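If cutting mid-word at the tail is the main concern, one possible refinement (an assumption of this sketch, not something the snippet above does) is to back up to the last space before the limit. `truncate_words` is a hypothetical helper that works on the already extracted plain text:

```ruby
# Hypothetical helper: collapses whitespace and truncates to at most
# `limit` characters without splitting a word when a space is available.
def truncate_words(text, limit)
  clean = text.gsub(/\s+/, ' ').strip
  return clean if clean.length <= limit

  cut = clean[0, limit + 1]                 # one extra char to detect a word boundary
  cut = cut[0, cut.rindex(' ') || limit]    # back up to the last space, if any
  cut.rstrip
end

puts truncate_words("hello   world\nLets say these are 100 chars", 13)
```

When there is no space inside the limit at all, the sketch falls back to a hard cut at `limit` characters.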

Greg