Hi all,

I run a small blog network with a page that shows the latest posts from the different blogs on my server. I would like to extend this page to also include new posts from external blogs via RSS feeds.

Currently it’s easy to get the content, since it’s just a simple query selecting posts by date, but I’m struggling to see what the most efficient design would be once external posts are added.

The easiest solution would be to periodically run a cron job that imports posts from the external sites and saves them in the database. However, the author could later alter or remove a post, leaving me displaying “invalid content”.

The best solution would be not to save the posts at all and instead import them directly on the page. But how would this affect usability and loading time? Is it somehow possible to cache the feeds? And if I combine internal posts from a query with external posts imported directly from feeds, how can the result be paginated (10 results per page)?

I hope someone can help me with a small proof of concept, or describe what they believe would be the most effective way of handling this.

PS: For importing feeds I use SimplePie http://simplepie.org

Thanks in advance

A: 

If you already use SimplePie then you can use its caching mechanism to have the feed data cached.
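As a minimal sketch of what enabling that cache could look like: the helper below applies SimplePie's cache settings (`enable_cache`, `set_cache_location`, `set_cache_duration`) to a feed object before it is initialised. The function name `configureFeedCache`, the directory argument and the 30-minute lifetime are assumptions chosen for illustration, not part of SimplePie.

```php
<?php
// Hypothetical helper: applies cache settings to a SimplePie object.
// Call it before set_feed_url() / init().
function configureFeedCache($feed, $cacheDir, $ttlSeconds = 1800)
{
    // The cache directory must exist and be writable by the web server.
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0755, true);
    }

    $feed->enable_cache(true);               // turn caching on
    $feed->set_cache_location($cacheDir);    // where cached feeds are stored
    $feed->set_cache_duration($ttlSeconds);  // seconds before a feed is fetched again

    return $feed;
}
```

With SimplePie loaded, this could be used as `$feed = configureFeedCache(new SimplePie(), 'cache');` followed by the usual `set_feed_url()` and `init()` calls. Until the cache expires, page requests read the stored copy instead of contacting the external site.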

To combine the articles from internal and external sources, create a data structure holding all articles, for example an array of all items sorted by publication timestamp. Then pick the articles for a given page number from this array.
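Picking one page out of such a sorted array could be sketched as follows. The function names `paginate` and `pageCount` are made up for this example, and the array of numbers merely stands in for real post objects that are already sorted newest first.

```php
<?php
// Returns the items belonging to a 1-based page number.
// $posts is assumed to be sorted already (newest first).
function paginate(array $posts, $page, $perPage = 10)
{
    $page = max(1, (int) $page);          // treat anything below 1 as page 1
    $offset = ($page - 1) * $perPage;     // number of items to skip
    return array_slice($posts, $offset, $perPage);
}

// Total number of pages, useful for rendering the page links.
function pageCount(array $posts, $perPage = 10)
{
    return (int) ceil(count($posts) / $perPage);
}

// Example with 25 dummy posts identified by number.
$posts = range(1, 25);
$pageTwo = paginate($posts, 2);    // items 11 to 20
$pageThree = paginate($posts, 3);  // items 21 to 25, the last partial page
```

Note that this still requires the full array in memory for every request, so it suits a modest number of posts.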

Here's some code to create a combined array of posts. This should give you an idea of the steps involved. The Post class represents a post. The internal and external posts are each converted to a Post and stored in the array $posts. This array is sorted by timestamp, and at the end all posts are echoed.

$internalPosts must contain the posts from your system and $feedUrls the URLs of the external feeds. Since I don't know the structure of the internal posts, you must adapt the part where internal posts are converted to generic posts.

// Fill these before running: posts from your own system and the external feed URLs.
$internalPosts = array();
$feedUrls = array();

include_once 'simplepie.inc';

class Post {
    public $title;
    public $link;
    public $description;
    public $publishedAt;

    public function __construct($title, $link, $description, $publishedAt) {
        $this->title = $title;
        $this->link = $link;
        $this->description = $description;
        $this->publishedAt = $publishedAt;
    }   
}

$posts = array();

// Convert internal posts to generic post.
foreach($internalPosts as $item){
    $posts[] = new Post($item->title, $item->link, $item->description, $item->publishedAt);
}

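// Retrieve feeds and add posts.
// A fresh SimplePie object per feed keeps each feed's state separate.
foreach($feedUrls as $url){
    $feed = new SimplePie();
    $feed->set_feed_url($url);
    $feed->init();

    foreach ($feed->get_items() as $item) {
        // get_date('U') gives a Unix timestamp; cast it so sorting compares numbers.
        $posts[] = new Post($item->get_title(), $item->get_link(), $item->get_description(), (int) $item->get_date('U'));
    }
}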

// Sort function.
function byPublicationTimestamp($itemA, $itemB){
    return ($itemB->publishedAt - $itemA->publishedAt);
}

usort($posts, 'byPublicationTimestamp');

foreach($posts as $post){
    echo "<p><a href='$post->link'>$post->title</a><br/>" . date('l, j F Y', $post->publishedAt) . " - $post->description</p>"; 
}

For improved performance, consider storing the combined articles separately and building the pages from this data. You then need to update the combined data whenever a new article is published internally or the cached version of an external feed is refreshed.
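One way this could look, as a rough sketch only: a single table holds both kinds of posts, and the database limits the result to one page. The table `combined_posts`, its columns and the functions `savePost` and `fetchPage` are all hypothetical. An in-memory SQLite database is used so the example is self-contained; for MySQL the DSN and column types would need adapting.

```php
<?php
// In-memory SQLite stands in for the real database (an assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table holding internal and external posts side by side.
// The link is the key, so refreshing a post overwrites it instead of duplicating it.
$db->exec('CREATE TABLE combined_posts (
    link TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT,
    published_at INTEGER NOT NULL,
    source TEXT NOT NULL
)');
$db->exec('CREATE INDEX idx_published ON combined_posts (published_at)');

// Insert or update one post. Call this when an internal post is published
// or when a refreshed feed is processed.
function savePost(PDO $db, $link, $title, $description, $publishedAt, $source)
{
    $stmt = $db->prepare('REPLACE INTO combined_posts
        (link, title, description, published_at, source) VALUES (?, ?, ?, ?, ?)');
    $stmt->execute(array($link, $title, $description, $publishedAt, $source));
}

// Fetch one page, newest first, letting the database do the limiting.
function fetchPage(PDO $db, $page, $perPage = 10)
{
    $stmt = $db->prepare('SELECT * FROM combined_posts
        ORDER BY published_at DESC LIMIT ? OFFSET ?');
    $stmt->bindValue(1, (int) $perPage, PDO::PARAM_INT);
    $stmt->bindValue(2, (max(1, (int) $page) - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Usage: store 12 dummy posts, then read the first two pages.
for ($i = 1; $i <= 12; $i++) {
    $source = ($i % 2) ? 'internal' : 'external';
    savePost($db, "http://example.com/post-$i", "Post $i", '', 1000 + $i, $source);
}
$firstPage = fetchPage($db, 1);
$secondPage = fetchPage($db, 2);
```

Because only ten rows leave the database per request, the cost of a page view no longer grows with the total number of posts.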

If you need to publish the external content shortly after it's published on the original site, I would contact those sites to see whether they can notify you of updates, instead of waiting for the cached version to expire.

EDIT: added sample code.

Kwebble
Thanks for your response! - I don't like option 1 (even though it would be the easiest) because I would need to get ALL the data each time I change the page number. I don't need to publish external content right away, so I believe relying on the cache expiry would be the best way. Could you possibly provide a small proof of concept for option 2? It doesn't need to work, just a small block of code explaining how to do it. Thanks in advance
kris
Do you mean creating a combined array of articles?
Kwebble
Yes. My trouble lies in how to handle the external posts and connect them the right way.
kris
I've added a code example to combine the data and sort it.
Kwebble
Your example provides a great method of combining the two data sources! However, since I need to split the results into 10 posts per page, I believe this method would give terrible performance with a very large number of posts. With just one data source I would limit the results directly in my query. Using your method I have to get all internal and external posts and combine them before eventually splitting them up. Is there any smart method of doing this? Thanks again, I’m very grateful for your help.
kris
That's why I suggested storing the combined data. Perhaps you can insert the external articles into your own system, using a specific label/tag/category to identify them. Then you can query the normal database. Or create a new table with the combined articles and query that.
Kwebble
Kwebble, I'm very grateful for all your help! Since performance is very important to me, I want to make sure this is done in the most effective way. If I extend your example with my own data structure, could I persuade you to show me how to store and query the data? Perhaps the best way would be a cron job running every 10 minutes that combines new posts and stores them. What would you recommend? I will gladly pay you for your help since this is so important to me, and I’m very aware this is not a ‘do-my-job’ forum :-)
kris
Here's a suggestion: store the external posts in their own table and let the cron job keep this up to date. Query both the internal and external tables using a union to combine the results. In MySQL it would look something like this: SELECT int_title, int_text, int_pubdate FROM internal UNION SELECT ext_title, ext_text, ext_pubdate FROM external ORDER BY int_pubdate DESC LIMIT 10; I assumed some column names here. I haven't used union before, so there may be room for improvement. Oh, and for now, I'd just like to help out and learn something on the way. But if you want to, we can use email instead of SO.
Kwebble
Hi Kwebble! I made a script that runs every 30 minutes as a cron job. The script goes through all submitted blogs and grabs the 10 latest posts from each blog feed. Can you see further ways to optimize the process? http://pastie.org/private/rnx2qfupshts0gvjeswtjq - I've never used UNION before, but it’s definitely what I need for step 2!
kris