I need to get a short excerpt of news items written in HTML to show on my front page. Obviously I can't use something as simple as substr because it might leave tags unclosed or even leave half a tag.

Which is easier:

  • Converting the HTML to decent-looking plain text and taking a piece of that
  • Taking the beginning from the HTML and closing any unclosed tags at the cutoff (will this always look OK?)

And how would I go about implementing the chosen solution?

+5  A: 

Simplest way is to strip all HTML from the item text using strip_tags() before truncating it.
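As a rough illustration of that approach, here is a minimal sketch. The function name plainExcerpt and the 200-character default are made up for this example, and it assumes the mbstring extension is available so that multi-byte characters are not cut in half.

```php
<?php
// Hypothetical helper: plain-text excerpt of an HTML news item.
// $limit counts characters (not bytes); assumes the mbstring extension.
function plainExcerpt(string $html, int $limit = 200): string
{
    // Drop all tags, then decode entities such as &amp; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace left behind by the removed markup.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    if (mb_strlen($text) <= $limit) {
        return $text; // short enough, nothing to cut
    }

    $cut = mb_substr($text, 0, $limit);

    // Back up to the last space so the excerpt does not end mid-word.
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace);
    }

    return $cut . '...';
}
```

Because the entities are decoded, the result should go through htmlspecialchars() again when it is printed into the front page. Note also that strip_tags() does not insert spaces, so text from adjacent block elements can end up joined together.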

Ben James
Using this now for automatically generated excerpts. It's not the best, but it's ok, since I provided news posters with special markup to specify their own excerpts.
Bart van Heukelom
+2  A: 

I would take the 2nd option if it's important to retain the HTML structure of the original news item.

A simple way to implement this would be to run your fragment through Tidy to close off any unclosed tags. In particular, see the tidy::cleanRepair method.
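A sketch of how that might look, assuming PHP's tidy extension is installed. The function name htmlExcerpt is hypothetical, and this version uses the procedural tidy_repair_string() rather than tidy::cleanRepair because it returns the repaired string in one call. The limit here counts markup characters as well as text.

```php
<?php
// Hypothetical helper: cut the HTML, then let Tidy close whatever is left open.
// Requires the tidy extension; $limit counts markup characters too.
function htmlExcerpt(string $html, int $limit = 200): string
{
    if (mb_strlen($html) <= $limit) {
        return $html; // nothing to cut
    }

    $cut = mb_substr($html, 0, $limit);

    // If the cut landed inside a tag (a '<' with no '>' after it),
    // drop the partial tag so Tidy is not handed half a tag.
    $lastOpen  = strrpos($cut, '<');
    $lastClose = strrpos($cut, '>');
    if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen)) {
        $cut = substr($cut, 0, $lastOpen);
    }

    // 'show-body-only' asks Tidy for just the fragment,
    // without the <html>/<head>/<body> wrapper it would otherwise add.
    $repaired = tidy_repair_string($cut, array('show-body-only' => true), 'utf8');

    return $repaired === false ? $cut : trim($repaired);
}
```

Whether the result always looks OK depends on the markup: a table or list cut in the middle will be closed correctly but may still look odd on the front page.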

Richard Nguyen
+1  A: 

You could try parsing your data as XML and then truncating only the "pure" text nodes.

Note: This solution requires the input to be valid XML with a reasonably consistent structure.
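One way this idea could be sketched with PHP's DOM extension is shown below. The function name truncateTextNodes and the wrapper element name root are invented for the example, and it assumes mbstring is available. HTML-only entities such as &nbsp; are not valid XML, so input containing them would be rejected.

```php
<?php
// Hypothetical helper: keep the first $limit characters of text,
// cut the text node where the limit falls, and drop everything after it.
function truncateTextNodes(string $xmlFragment, int $limit): string
{
    libxml_use_internal_errors(true); // collect parse errors instead of printing warnings

    $doc = new DOMDocument();
    // Wrap the fragment in a single root element so it parses as one document.
    if (!$doc->loadXML('<root>' . $xmlFragment . '</root>')) {
        throw new InvalidArgumentException('Input is not valid XML');
    }

    $xpath     = new DOMXPath($doc);
    $remaining = $limit;

    // Text nodes come back in document order.
    foreach (iterator_to_array($xpath->query('//text()')) as $text) {
        $length = mb_strlen($text->data);
        if ($length < $remaining) {
            $remaining -= $length; // whole node fits, keep going
            continue;
        }

        // The limit falls in this node: collect every node after it,
        // shorten this one, then remove the collected nodes.
        $following  = iterator_to_array($xpath->query('following::node()', $text));
        $text->data = mb_substr($text->data, 0, $remaining);
        foreach ($following as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
        break;
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}
```

Since only text nodes are shortened and the elements are serialized by the DOM, every tag that remains is closed, and the limit applies to visible text rather than markup.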

cimnine
+2  A: 

Hello, I know exactly what you are looking for: it's called website scraping. Here is how you can scrape a website. Use a library such as PHP Simple HTML DOM Parser (download here: PHP Simple HTML DOM Parser). Finally, here is the code showing how you can scrape Slashdot:

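// Load the library (adjust the path to wherever simple_html_dom.php was saved)
require_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks and collect the plain text of each part
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);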
streetparade
Not using this in this situation, but helpful nonetheless.
Bart van Heukelom