Hello, I'd like to build a C# application that would:

  1. go through the list of my Favorites (for example, in IE)
  2. check whether each site was updated since my last visit
  3. show a list of recently updated URLs

Point 2 seems problematic, since C#'s HttpWebResponse.LastModified property does not work for some blogs and other sites (it reports the current date and time instead).

Any ideas? Thanks

+2  A: 

The Last-Modified header is indeed not set by some web servers, and there is nothing you can do about it. For those cases you'd need to grab the HTML and make a hash of the data. If the hash matches on a next retrieval, it has (very very likely) not changed.
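A minimal sketch of the hashing idea, assuming the page body has already been downloaded as a string and that the digest from the previous visit is stored somewhere (the `PageFingerprint` class and its method names are made up for illustration):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class PageFingerprint
{
    // Returns a hex-encoded SHA-256 digest of the page body.
    public static string Compute(string html)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(html));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    // True when the freshly downloaded body no longer matches the digest
    // that was stored on the previous visit.
    public static bool HasChanged(string storedDigest, string currentHtml)
    {
        return !string.Equals(storedDigest, Compute(currentHtml),
                              StringComparison.OrdinalIgnoreCase);
    }
}
```

Storing only the digest per favorite keeps the saved state small; any byte-level difference in the body counts as a change.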

If part of the HTML changes on every load, you could parse the HTML tree and strip the parts that typically change, such as Google Ads, before hashing. This is a lot more effort than merely checking the header, though, so whether it is worth it depends on your actual use case. A good tool for this is the HTML Agility Pack.
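A rough sketch of that stripping step, assuming the HtmlAgilityPack NuGet package is referenced. Which elements count as "typically changing" is a guess here (script and iframe nodes); `PageNormalizer` and `StableText` are hypothetical names:

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class PageNormalizer
{
    // Assumption: ads and widgets mostly live in <script> and <iframe> nodes.
    // Adjust this XPath for the sites you actually track.
    private const string VolatileXPath = "//script | //iframe";

    // Returns the text of the page with the volatile nodes removed.
    public static string StableText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(VolatileXPath);
        if (nodes != null)
        {
            // Copy to a list first so nodes are not removed while enumerating.
            foreach (var node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.InnerText;
    }
}
```

Hashing the result of `StableText` instead of the raw HTML ignores changes that occur only inside the removed nodes.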

Yet another approach that might yield better results is to measure the distance between two versions of a page and mark as updated those whose distance exceeds a certain threshold. Again, this will fail in many cases (and will now also give false positives). I'm just throwing it out here in case it inspires something useful.
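One cheap way to get such a distance, purely as an illustration, is the Jaccard distance between the sets of words in the two versions. The choice of measure, the `PageDistance` name, and the default threshold of 0.1 are all assumptions, not something taken from a library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageDistance
{
    // Splits text into a case-insensitive set of whitespace-separated words.
    private static HashSet<string> Words(string text)
    {
        var separators = new[] { ' ', '\t', '\r', '\n' };
        return new HashSet<string>(
            text.Split(separators, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);
    }

    // Jaccard distance: 0.0 means the same word set, 1.0 means no shared words.
    public static double Jaccard(string oldText, string newText)
    {
        var a = Words(oldText);
        var b = Words(newText);
        if (a.Count == 0 && b.Count == 0) return 0.0;

        int shared = a.Count(w => b.Contains(w));
        int total = a.Count + b.Count - shared;
        return 1.0 - (double)shared / total;
    }

    // Flags a page as updated when the distance exceeds the threshold.
    public static bool IsUpdated(string oldText, string newText, double threshold = 0.1)
    {
        return Jaccard(oldText, newText) > threshold;
    }
}
```

Note that this needs the previous text (or its word set) to be stored, not just a hash, and the threshold has to be tuned per site.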

Vinko Vrsalovic
That would fail if the page contains Google Text Ads, wouldn't it?
friol
Those are normally JavaScript calls, and you should see all the script tags in what you get back from the server. However, if a database-driven web page so much as shows a date, your hash will change.
rerun
The problem with this approach is pages with dynamic elements, e.g. text ads that change on every load. Some of these (e.g. Google ads) can be filtered out, but the effort may end up being tedious and the result incomplete.
dbkk
+1  A: 

I'm not sure the Last-Modified header will work the way you expect. From the RFC (RFC 2616, section 14.29):

The exact meaning of this header field depends on the implementation of the origin server and the nature of the original resource. For files, it may be just the file system last-modified time. For entities with dynamically included parts, it may be the most recent of the set of last-modify times for its component parts. For database gateways, it may be the last-update time stamp of the record. For virtual objects, it may be the last time the internal state changed.

My interpretation of the spec would lead me to set the Last-Modified header to the current date/time for dynamically generated content (e.g. PHP pages). The server hosting the page really has no idea when the content being built was actually last updated (some of the data came from one database server, some from another, neither record has a field to indicate update time, etc.). It could use the filesystem time of the PHP file itself, but that might not change for months on end while the rendered content changes on every reload. I see no way the server or interpreter could figure this out without guidance from the developer as to which value to use.

So unfortunately I think your best option is to analyze the page content itself, as others have suggested, but it will not be easy to identify changes accurately because of the dynamic content.
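Before falling back to content analysis, it may still be worth checking whether a given response carries a usable Last-Modified value at all. The sketch below uses HttpClient's `HttpResponseMessage`, and the rule "a Last-Modified within a few seconds of the Date header was probably generated at request time" is only a heuristic of mine; `LastModifiedCheck` is a hypothetical helper name:

```csharp
using System;
using System.Net.Http;

public static class LastModifiedCheck
{
    // Returns the Last-Modified value when it looks meaningful, or null when
    // the header is missing or so close to the response's Date header that
    // the server probably just stamped the current time.
    public static DateTimeOffset? TrustedLastModified(
        HttpResponseMessage response, TimeSpan tolerance)
    {
        DateTimeOffset? lastModified = response.Content?.Headers.LastModified;
        if (lastModified == null) return null;

        // Fall back to the local clock if the server sent no Date header.
        DateTimeOffset served = response.Headers.Date ?? DateTimeOffset.UtcNow;
        if ((served - lastModified.Value).Duration() <= tolerance) return null;

        return lastModified;
    }
}
```

When this returns null, treat the header as unreliable and compare the content instead. With the older HttpWebResponse class, `response.Headers["Last-Modified"]` is null when the server did not send the header, which distinguishes a missing header from the current date/time that the LastModified property reports in that case.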

Cory Charlton