For a blog-like project, I want to take the first few paragraphs, headers, lists or whatever else falls within a given number of characters from a Markdown-generated HTML fragment, and display that as a summary.
So if I have
<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
<li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
Suppose I want a summary made of the text within the first 150 characters. It does not have to be exact: I could just take the first 150 characters including tags and go from there, but that would probably leave artifacts at the tail that are harder to handle. The result should give me the h1, the first p and the ul, but not the final p, which would otherwise be truncated. If the first element alone has more than 150 characters, I would take the full first element.
How could I do this? With XPath, or with a regex? I am a bit out of ideas here...
Edit
First I want to give a big THANK YOU to all of you who replied!
While I got really great answers in this thread, I actually found it much easier to hook in before the Markdown interpreter kicks in, take the first n text blocks separated by \r\n\r\n, and pass just those on for Markdown generation.
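class String
  # Returns the leading blocks (separated by \r\n\r\n) that fit within length characters.
  def summarize_md(length)
    summary = ""
    split(/\r\n\r\n/).each do |block|
      # Stop before the block that would push the summary past the limit.
      break if summary.length + block.length > length
      summary += "#{block}\r\n\r\n"
    end
    summary
  end
end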
While this code could probably be reduced to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions. Anyway, since my question could be read as if the HTML were the starting point (and not the Markdown text), I'll just give the accepted answer to the first person who replied... I hope that's fair...
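As a possible refinement of the split-on-blank-lines approach above, here is a minimal sketch that also covers the rule from the question that the first block is always kept, even when it alone exceeds the limit. The method name summarize_blocks and its parameters are hypothetical, and two things differ from the snippet above by assumption: it accepts both \r\n and \n line endings, and it counts only the block text (not the separators) against the limit.

```ruby
# Hypothetical helper, not part of the original post.
# Keeps leading blank-line-separated blocks of Markdown text up to `limit`
# characters, always including the first block.
def summarize_blocks(text, limit)
  kept = []
  used = 0
  # Assumption: blocks are separated by one or more blank lines (CRLF or LF).
  text.strip.split(/(?:\r?\n){2,}/).each do |block|
    # Always keep the first block; stop once the next one would exceed the limit.
    break if !kept.empty? && used + block.length > limit
    kept << block
    used += block.length
  end
  kept.join("\n\n")
end
```

The result can then be handed to whatever Markdown renderer the project uses, in the same way as the output of summarize_md.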