I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class.

This is what I have now:

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 $html = file_get_html('http://www.website.com');

foreach($html->find('td[width="380"] p table') as $article) {
$item['title'] = $article->find('span.title', 0)->innertext;
$item['description'] = $article->find('.ingress', 0)->innertext;
$item['link'] = $article->find('.lesMer', 0)->href;     
$item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;     
$articles[] = $item;
}


//Creating an instance of FeedWriter class. 
$TestFeed = new FeedWriter(RSS2);


 //Use wrapper functions for common channel elements

 $TestFeed->setTitle('Testing & Checking the RSS writer class');
 $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
 $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

  //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0

  $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');


foreach($articles as $row) {

    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($row['title']);
    $newItem->setLink($row['link']);
    $newItem->setDate($row['pubDate']);
    $newItem->setDescription($row['description']);

    //Now add the feed item
    $TestFeed->addItem($newItem);
}

  //OK. Everything is done. Now generate the feed.
  $TestFeed->genarateFeed();

?>

How can I make this code simpler? Right now there are two foreach statements; how can I combine them?

Because the scraped news is in Norwegian, I need to apply html_entity_decode() to the title. I've tried it here, but I couldn't get it to work:

foreach($html->find('td[width="380"] p table') as $article) {
$item['title'] = html_entity_decode($article->find('span.title', 0)->innertext, ENT_NOQUOTES, 'UTF-8');
$item['description'] = "<img src='" . $article->find('img[width="100"]', 0)->src . "'><p>" . $article->find('.ingress', 0)->innertext . "</p>";    
$item['link'] = $article->find('.lesMer', 0)->href;     
$item['pubDate'] = unix2rssdate(strtotime($article->find('span.presseDato', 0)->plaintext));
$articles[] = $item;
}

Thanks :)

+2  A: 

Well, for a simple combination of the two loops, you could build the feed as you parse through the HTML:

<?php
include("FeedWriter.php");
include('simple_html_dom.php');

$html = file_get_html('http://www.website.com');

//Creating an instance of FeedWriter class. 
$TestFeed = new FeedWriter(RSS2);
$TestFeed->setTitle('Testing & Checking the RSS writer class');
$TestFeed->setLink('http://www.ajaxray.com/projects/rss');
$TestFeed->setDescription(
  'This is test of creating a RSS 2.0 feed Universal Feed Writer');

$TestFeed->setImage('Testing the RSS writer class',
                    'http://www.ajaxray.com/projects/rss',
                    'http://www.rightbrainsolution.com/images/logo.gif');

//parse through the HTML and build up the RSS feed as we go along
foreach($html->find('td[width="380"] p table') as $article) {
  //Create an empty FeedItem
  $newItem = $TestFeed->createNewItem();

  //Look up and add elements to the feed item   
  $newItem->setTitle($article->find('span.title', 0)->innertext);
  $newItem->setDescription($article->find('.ingress', 0)->innertext);
  $newItem->setLink($article->find('.lesMer', 0)->href);     
  $newItem->setDate($article->find('span.presseDato', 0)->plaintext);     

  //Now add the feed item
  $TestFeed->addItem($newItem);
}

$TestFeed->genarateFeed();
?>

What's the issue you're seeing with html_entity_decode? If you give us a link to a page it doesn't work on, that might help.
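
While narrowing that down, it may help to make PHP print its errors and to guard each lookup. The sketch below is only a suggestion: inner_text_or() is a hypothetical helper, not part of Simple HTML DOM or FeedWriter, and it assumes that find() with an index returns null when nothing matches. A plain stdClass object stands in for a scraped element so the snippet runs on its own.

```php
<?php
// Development only: report every notice, warning and error on screen.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: return the innertext of a found element,
 * or a default string when the lookup returned nothing.
 */
function inner_text_or($element, $default = '')
{
    return is_object($element) ? $element->innertext : $default;
}

// Stand-in for the result of $article->find('span.title', 0).
$found = new stdClass();
$found->innertext = 'Ny &oslash;velse';

echo inner_text_or($found), PHP_EOL;             // element was found
echo inner_text_or(null, '(no title)'), PHP_EOL; // nothing matched
```

In the loop above, this would be used as inner_text_or($article->find('span.title', 0)), so that a table without a title produces an empty string rather than an error.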

Parrots
Try the script with www.mil.no in the line $html = file_get_html('http://www.website.com'); When I add html_entity_decode, the PHP script won't generate the feed. Any thoughts?
mofle
+4  A: 

It seems that you loop through $html to build an array of articles, then loop through that array to add them to a feed. You can skip a whole loop by adding items to the feed as they're found. To do this, you'll need to move your FeedWriter constructor up a bit in the execution flow.

I'd also add a couple of helper functions to improve readability, which may help maintainability in the long run. Encapsulating feed creation, item parsing, and so on should make it easier if you ever need to plug in a different provider class for the feed or change the parsing rules. There are further improvements that could be made to the code below (for example, html_entity_decode is on a separate line from the $item['title'] assignment), but you get the general idea.

What is the issue you're having with html_entity_decode? Do you have a sample input/output?

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 // Create new instance of a feed
 $TestFeed = create_new_feed();

 $html = file_get_html('http://www.website.com');

 // Loop through html pulling feed items out
 foreach($html->find('td[width="380"] p table') as $article) 
 {
    // Get a parsed item
    $item = get_item_from_article($article);

    // Get the item formatted for feed
    $formatted_item = create_feed_item($TestFeed, $item);

    //Now add the feed item
    $TestFeed->addItem($formatted_item);
 }

 //OK. Everything is done. Now generate the feed.
 //(Method name spelled as in the question's own call.)
 $TestFeed->genarateFeed();


// HELPER FUNCTIONS

/**
 * Create new feed - encapsulated in method here to allow
 * for change in feed class etc
 */
function create_new_feed()
{
     //Creating an instance of FeedWriter class. 
     $TestFeed = new FeedWriter(RSS2);

     //Use wrapper functions for common channel elements
     $TestFeed->setTitle('Testing & Checking the RSS writer class');
     $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
     $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

     //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0
     $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');

     return $TestFeed;
}


/**
 * Take in html article segment, and convert to usable $item
 */
function get_item_from_article($article)
{
    $item['title'] = $article->find('span.title', 0)->innertext;
    $item['title'] = html_entity_decode($item['title'], ENT_NOQUOTES, 'UTF-8');

    $item['description'] = $article->find('.ingress', 0)->innertext;
    $item['link'] = $article->find('.lesMer', 0)->href;     
    $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;     

    return $item;
}


/**
 * Given an $item with feed data, create a
 * feed item
 */
function create_feed_item($TestFeed, $item)
{
    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($item['title']);
    $newItem->setLink($item['link']);
    $newItem->setDate($item['pubDate']);
    $newItem->setDescription($item['description']);

    return $newItem;
}
?>
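
On the html_entity_decode question, a standalone check of what the call in get_item_from_article() does can be run without the scraper or the feed class. The sketch below uses a made-up title (it is not taken from the scraped page) and is meant as a starting point for debugging, not as a diagnosis.

```php
<?php
// Hypothetical sample title containing Norwegian entities.
$raw = 'Forsvaret &oslash;ver i Troms&oslash; &amp; B&aring;tsfjord';

// Same call as in get_item_from_article(): named entities become
// UTF-8 characters; ENT_NOQUOTES leaves quote entities untouched.
$decoded = html_entity_decode($raw, ENT_NOQUOTES, 'UTF-8');

// Escaping for XML again turns the bare "&" back into "&amp;"
// while leaving the Norwegian letters as they are.
$escaped = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

echo $decoded, PHP_EOL;
echo $escaped, PHP_EOL;
```

Note that the decoded string contains a bare "&", which is not valid in XML text unless it is escaped again or wrapped in CDATA. Whether the FeedWriter class does that for titles is worth checking in its source; I haven't verified it.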
ConroyP
Thanks, but the feed will only generate if I comment out this line: $item['title'] = html_entity_decode($item['title'], ENT_NOQUOTES, 'UTF-8'); Any solution?
mofle
A: 

How can I make this code simpler?

I know it's not exactly what you're asking, but do you know about http://pipes.yahoo.com/pipes/?

troelskn
Yes, I actively use Pipes, it's very good. But I'm doing it manually to learn PHP.
mofle
A: 

I haven't found a solution to the html_entity_decode problem yet. If I remove html_entity_decode, the feed gets generated, but not if I have it in the code.

Anyone else wanna try?

mofle
A: 

Maybe you can just use something like Feedity (http://feedity.com), which already solves the problem of generating an RSS feed from any webpage.

Samona