ansaurus

Question

Get first few elements of a html fragment with xpath on ruby

Answer 1

A:

Using XPath is the most robust and flexible approach. Here's a sample app:

require 'rubygems'
require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

doc = Nokogiri::HTML.parse(html)
doc.xpath('//text()').each do |node|
  text = node.text
  break if summary.length + text.length >= LIMIT
  summary << text
end

puts summary
puts summary.length

The XPath //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can narrow the expression.
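
As a rough illustration of narrowing the selection, here is a minimal sketch that keeps only the text inside <p> elements. It assumes a made-up fragment wrapped in a single root element, and it uses Ruby's bundled REXML instead of Nokogiri so that it runs without extra gems; the same XPath string can be passed to doc.xpath in the Nokogiri sample above.

```ruby
require 'rexml/document'

# Hypothetical fragment, wrapped in one root element so it parses as XML.
xml = <<~XML
  <div>
    <h1>hello world</h1>
    <p>first paragraph</p>
    <ul>
      <li>a list item</li>
    </ul>
    <p>second paragraph</p>
  </div>
XML

doc = REXML::Document.new(xml)

# //p/text() selects only text nodes that are direct children of <p> elements,
# so the heading and the list item are skipped.
paragraph_texts = REXML::XPath.match(doc, '//p/text()').map(&:value)

puts paragraph_texts
```

The resulting array can be fed to the same length-limited loop as before.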

Mark Thomas 2010-10-21 00:30:56

@Jan, as a follow-up to your question in the comments elsewhere, this can easily be amended to give you the tags.

Mark Thomas 2010-10-21 17:27:14

Answer 2

A:

A pure XPath 1.0 solution:

substring(/*,1,150)

This returns the first 150 characters of the string value of the top element, where the parent of the provided XHTML fragment is the top element (/* or /html).

A very exact XPath 2.0 solution exists:

   for $t in (//text())[not(sum((.| preceding::text())/string-length(.)) gt 150)]
     return
       ($t, '&#xA;')

Note: the XML document must be parsed in a mode that discards whitespace-only text nodes. Otherwise, string-length(.) must be replaced by string-length(normalize-space(.)).
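
To make the logic of that expression easier to follow, here is a plain-Ruby sketch of the same running-total filter, written against an array of strings standing in for the text nodes. The helper names normalize_space and leading_texts and the sample nodes are made up for illustration; this is not how an XPath engine evaluates the expression.

```ruby
# Mirrors XPath normalize-space(): collapse runs of XML whitespace
# (space, tab, CR, LF) into one space and trim both ends.
def normalize_space(text)
  text.gsub(/[ \t\r\n]+/, ' ').sub(/\A /, '').sub(/ \z/, '')
end

# Keeps each text node whose running total of normalized lengths,
# counting the node itself, does not exceed the limit.
def leading_texts(texts, limit)
  total = 0
  texts.take_while do |text|
    total += normalize_space(text).length
    total <= limit
  end
end

# Hypothetical text nodes, including whitespace-only ones between elements.
nodes = ["hello world", "\n", "a" * 100, "\n    ",
         "some bla bla, 40 chars", "\n", "some other text"]

puts leading_texts(nodes, 140).inspect
```

Because whitespace-only nodes normalize to an empty string, they add nothing to the running total, which is why the normalize-space variant tolerates a parser that keeps them.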

Dimitre Novatchev 2010-10-21 02:42:28

Interesting solution(s) indeed! Do you think it would be possible to get the tags also, not just text()?

Jan Limpens 2010-10-21 06:12:37

@Jan: Not with XPath 1.0. The XPath 2.0 expression can be adjusted a little; however, it must also take into account the string lengths of the tags (opening and closing) plus the attributes (names and values). If you only have XPath 1.0 available, then an XSLT solution can be provided, in case you don't exclude XSLT as a possibility. Anyway, you haven't specified in this question that you want the tags, too. So, feel free to ask a new question and also tag it XSLT :)

Dimitre Novatchev 2010-10-21 12:43:47

Answer 3

A:

How could I get this?

XSLT, of course!

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:param name="pMaxLength" select="73"/>
    <xsl:template match="node()">
        <xsl:param name="pPrecedingLength" select="0"/>
        <xsl:variable name="vContent">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates select="node()[1]">
                    <xsl:with-param name="pPrecedingLength"
                                    select="$pPrecedingLength"/>
                </xsl:apply-templates>
            </xsl:copy>
        </xsl:variable>
        <xsl:variable name="vLength"
                      select="$pPrecedingLength + string-length($vContent)"/>
        <xsl:if test="$pMaxLength > $vLength and
                      (string-length($vContent) or not(node()))
                      or not($pPrecedingLength)">
            <xsl:copy-of select="$vContent"/>
            <xsl:apply-templates select="following-sibling::node()[1]">
                <xsl:with-param name="pPrecedingLength" select="$vLength"/>
            </xsl:apply-templates>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Output:

<html>
    <h1>hello world</h1>
    <p>Lets say these are 100 chars</p>
    <ul>
        <li>some bla bla, 40 chars</li>
    </ul>
</html>

Alejandro 2010-10-21 16:06:09

Answer 4

A:

For my uses I always wanted to strip tags because they could include all sorts of nastiness that would totally hose the display of the summary. They could also seriously skew the character count, depending on the tags and whether they contain attributes.

I've used something like this many times.

require 'nokogiri'

html = %q{
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
}

doc = Nokogiri::HTML(html)
# Flatten newlines, collapse repeated spaces, then keep at most 150 characters.
puts doc.content.gsub(/\n/, ' ').squeeze(' ').strip[0, 150]

Which outputs

hello world Lets say these are 100 chars some bla bla, 40 chars some other text

I'll leave it to you to figure out how to ignore or subtract the text from the final <p> tag, but looking up that tag and grabbing its content and then stripping it from the end of the string shouldn't be too hard.
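
One way the stripping step might look, as a minimal sketch: it assumes the text of the final <p> has already been looked up and is passed in as a plain string, and the helper name drop_trailing_text is made up for illustration.

```ruby
# Removes trailing_text from the end of summary if it is there,
# then trims the whitespace left behind. Requires Ruby 2.5+ for delete_suffix.
def drop_trailing_text(summary, trailing_text)
  summary.delete_suffix(trailing_text).rstrip
end

summary = 'hello world Lets say these are 100 chars some bla bla, 40 chars some other text'
last_paragraph = 'some other text' # assumed to be the content of the final <p>

puts drop_trailing_text(summary, last_paragraph)
```

It is probably safest to do this before cutting the string down to the limit, since a truncated ending would no longer match the paragraph text.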

Greg 2010-10-21 22:42:31
