tags:

views:

225

answers:

5

Hello guys,

I'm able to parse RSS with PHP. What I'm looking for is a way to get only the updated content, and do nothing if there's no new update to the RSS.

For example, I have this RSS file: if there's no new content, nothing happens, but if there is new content, I want to send my users the latest RSS update and not resend what they already have. I'm parsing and sending only the title and link.

I use a cronjob to check for updates every hour. My question is: how can I tell that the feed has been updated and is not the same as the last one? Here's the PHP file that I'm using to read the RSS. Do I write the last content to a file and compare them, or is there another way to determine that the content is now different from the last?

Update: I had to resurrect this post because I'm still trying to get it to work. Although I accepted a few answers, they have been very hard to implement; for example, the hashing option looked like a good idea initially, but since thousands of RSS feeds would be checked, it would be almost impossible to hash them all.

Again, someone suggested HTTP caching - I couldn't find a simple demo, so I'm practically stuck.

Any further suggestions would be highly appreciated.

A: 

Your clients will always be asking for your feed data, so you cannot necessarily control when they ask. I don't think most feed readers obey HTTP Cache-Control / Expires headers, so you cannot rely on the HTTP spec and leverage HTTP caching.

I think your best bet is to just cache your last response and serve all subsequent requests from the cache, updating the cache appropriately when changes are made. Effectively, this means your cost to respond to each client and its stale data is pretty much close to 0 if you just pull it from memcache or the filesystem.
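A minimal sketch of the filesystem variant of this idea, assuming the feed is generated into a local file (`$source` and `$cache` below are placeholder paths, not part of the original answer):

```php
<?php
// Serve the feed from a filesystem cache, refreshing the cache only
// when the source file has a newer modification time.
function cachedFeed(string $source, string $cache): string
{
    if (!file_exists($cache) || filemtime($source) > filemtime($cache)) {
        copy($source, $cache);            // refresh the cached copy
    }
    return file_get_contents($cache);     // cheap read on every request
}
```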

Cody Caughlan
Thanks Cody for your comment - but our system is meant to send information out to clients when there's an update. In fact, it's an SMS system and should only send the latest info and not repeat the last one.
Helen Neely
@Cody Supporting HTTP conditional GET is always a good idea. Do you have any reference to support your claim that it's not respected by clients?
Adam Byrtek
+1  A: 

Because of the diversity of RSS feeds, there is no easy solution to the problem you raised. The main issue is how to determine the uniqueness of an RSS item. It can be the guid, the publish time, or the content itself, but it may be tricky to detect that automatically.

Once you know the uniqueness criteria you can persist all 'old' items and compare them to the newest ones you receive.

HTTP Cache-Control and Expires headers could be used as an optimization for the sites that support them, but unfortunately some don't.
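For instance, once you settle on the guid (falling back to the link when the feed omits one) as the uniqueness criterion, persisting the already-seen items could be sketched like this (the state-file path is a placeholder):

```php
<?php
// Return only the feed items we haven't seen before, tracking seen
// ids (guid, or link when guid is missing) in a plain text file.
function newItems(SimpleXMLElement $rss, string $stateFile): array
{
    $seen = file_exists($stateFile)
        ? file($stateFile, FILE_IGNORE_NEW_LINES)
        : [];

    $fresh = [];
    foreach ($rss->channel->item as $item) {
        $id = (string) $item->guid !== ''
            ? (string) $item->guid
            : (string) $item->link;
        if (!in_array($id, $seen, true)) {
            $fresh[] = $item;     // not in the old list: send this one
            $seen[]  = $id;
        }
    }

    file_put_contents($stateFile, implode("\n", $seen));
    return $fresh;
}
```

The state file would still need the periodic purge discussed below to keep it from growing without bound.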

Gennady Shumakher
Thanks Gennady, your response has given me an idea. I will now write the titles to a file and compare them with the new content when the cronjob runs. If the new one does not appear in the old list, I will send it off. This means I will have to purge the whole list every week to stop it from growing out of control on the server. At least, that's the only option I have now.
Helen Neely
This will work only if you are sure that item title is unique. In general, you can easily find feeds where that's not the case.
Gennady Shumakher
GUID/UUID would be a better candidate for comparison, it's meant to be globally (probabilistically) unique.
Adam Byrtek
+3  A: 

HTTP Conditional GET is probably as close as you're going to get to what you want.
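A minimal sketch of conditional GET with PHP's curl extension, assuming the feed server sends `ETag` and/or `Last-Modified` headers (the function names and state-file format here are illustrative, not from the original answer):

```php
<?php
// Build If-None-Match / If-Modified-Since headers from the validator
// values remembered from the previous fetch.
function conditionalHeaders(array $state): array
{
    $headers = [];
    if (!empty($state['etag'])) {
        $headers[] = 'If-None-Match: ' . $state['etag'];
    }
    if (!empty($state['modified'])) {
        $headers[] = 'If-Modified-Since: ' . $state['modified'];
    }
    return $headers;
}

// Fetch the feed; returns null on 304 Not Modified, meaning there is
// nothing new to parse or send this run.
function fetchIfModified(string $url, string $stateFile): ?string
{
    $state = file_exists($stateFile)
        ? json_decode(file_get_contents($stateFile), true)
        : [];

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => conditionalHeaders($state),
        // Capture the validators from the response for the next run.
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$state) {
            if (stripos($line, 'ETag:') === 0) {
                $state['etag'] = trim(substr($line, 5));
            } elseif (stripos($line, 'Last-Modified:') === 0) {
                $state['modified'] = trim(substr($line, 14));
            }
            return strlen($line);
        },
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    file_put_contents($stateFile, json_encode($state));
    return $status === 304 ? null : $body;
}
```

Note that this only saves bandwidth and parsing when the server supports the validators; feeds that don't send them will always return a full 200 response.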

jalefkowit
+4  A: 

You could use hashes for this, in two ways:

  1. To ease updating - When requesting an update, you hash the whole feed and compare the result with the hash from the last time - if they are identical, you know that the feed did not change and can stop before even parsing it.
  2. To identify changes - On parsing, you hash each item and compare it to the hashes stored from previous runs. If it matches one, you know that you've seen it before.

If the feed in question offers guids for its items you could refine this process by storing guid<>hash pairs. This would make the comparison quicker, as you would only compare items to known previous versions instead of comparing to all previous items.

You'd still need some expiration/purge mechanism to keep the amount of stored hashes within bounds, but given that you only store relatively short strings (depending on the chosen hash algorithm), you should be able to keep quite a backlog before getting performance problems.
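A sketch of the guid<>hash bookkeeping described above, with `$stored` standing in for whatever persistent map (database table, file) is kept between runs:

```php
<?php
// Flag items whose guid is new or whose content hash has changed
// since the previous run. $items maps guid => item content string;
// $stored maps guid => sha1 hash from the previous run.
function changedItems(array $items, array &$stored): array
{
    $changed = [];
    foreach ($items as $guid => $content) {
        $hash = sha1($content);
        if (!isset($stored[$guid]) || $stored[$guid] !== $hash) {
            $changed[] = $guid;      // new item, or edited content
            $stored[$guid] = $hash;  // remember the latest version
        }
    }
    return $changed;
}
```

Because only guid/hash pairs are stored rather than full item content, the backlog stays small, as the answer points out.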

Henrik Opel
It's actually faster not to use a hash, but to compare packets of bytes: to hash, you have to read the whole of both files no matter what and run the hash algorithm, while a byte comparison reads only as much as needed, and the hash algorithm is surely more work than just comparing bytes.
Itay
@Itay: Sure, hashing will take its toll for generating, but the point here is about storing the previous entries and comparing to those. To do the byte comparison, you'd have to store the whole feed and whole feed items in the database, which, depending on the feed, could be a considerable amount of data. Writing and reading those would take time also, but especially reduce the number of past entries one can keep with a given amount of storage space.
Henrik Opel
@Itay - you can hash the old content once, so you only need to hash the new content. If it's a lot of content, you save re-reading the old content, you only read the old hash.
orip
A: 

@Henrik's solution is correct; however, it might be easiest to supply you with an example of hashing the data:

// hash the three channel variables
$hash = sha1($channel_title . $channel_link . $channel_desc);

// here you should check the currently stored database hashed 
// value against current hash value to see if any channel variables
// have recently changed
if ($database_hash != $hash) {
    // you need to update the channel data in your database
    // including the new hash value
}

for ($i = 0; $i < 3; $i++) {

    // hash the item values
    $hash = sha1($item_title . $item_link . $item_description);

    // here you should check the currently stored database hashed 
    // value against all item hash values to see if any item variables
    // have recently changed
    if ($database_hash != $hash) {
        // you need to update the item data in your database
        // including the new hash value
    }

}

Also, if you want a quick check of whether any data in the XML file has changed at all, you can hash the entire XML document as a string. Store this value and compare against it each time your cronjob runs; a different value indicates that some data within the XML file has changed.

$overall_hash = sha1($xmlDoc->saveXML());
cballou