Hi There,

For a given URL, I want to check whether the content has changed since the last time I fetched it. The content at the (HTTP) URL is generated by a script that is modified regularly, and I need to see whether those changes to the script cause any regressions.

Prac

A: 

The question needs to specify a language or environment to use. C? Unix shell script? Java? PHP?

General procedure: download the file and compute a SHA-1 hash of it. For each future version, do the same and compare the SHA-1 hashes. If they differ, congrats, your content has changed!
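Since no language was named, here is a minimal sketch of that procedure in Python, using only the standard library. The function names (`sha1_of_bytes`, `fetch`, `has_changed`) are made up for illustration, and where you store the previous digest is left up to you.

```python
import hashlib
import urllib.request


def sha1_of_bytes(data: bytes) -> str:
    """Return the hex SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def fetch(url: str) -> bytes:
    """Download the response body for a URL (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def has_changed(previous_digest: str, new_body: bytes) -> bool:
    """True if the new body hashes to something other than the stored digest."""
    return sha1_of_bytes(new_body) != previous_digest
```

You would store `sha1_of_bytes(fetch(url))` after the first download, then call `has_changed(stored_digest, fetch(url))` on each later run.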

BobMcGee
A: 

A quick way to do this is to check the response headers. If the script generates correct content headers, you can simply check the Last-Modified, ETag, or Content-MD5 header to see whether the content needs to be fetched again. If you have access to the script generating the response, it would be good to add these headers if they are not there.
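A rough sketch of the header check in Python, assuming the server answers HEAD requests and sends at least one of `ETag`, `Last-Modified`, or `Content-MD5` (the first two are the standard HTTP validators, and checking them here is my assumption about what the script emits). The helper names are hypothetical.

```python
import urllib.request

# Headers that say something about the identity or age of the content.
VALIDATOR_HEADERS = ("ETag", "Last-Modified", "Content-MD5")


def pick_validators(headers) -> dict:
    """Keep only the validator headers that are present and non-empty."""
    return {name: headers.get(name) for name in VALIDATOR_HEADERS if headers.get(name)}


def head_validators(url: str) -> dict:
    """Send a HEAD request and return the validator headers (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return pick_validators(response.headers)


def headers_differ(old: dict, new: dict) -> bool:
    """True if any validator present in both snapshots has a different value.

    Returns False when the snapshots share no validator, which means
    "cannot tell" rather than "unchanged".
    """
    shared = old.keys() & new.keys()
    return any(old[name] != new[name] for name in shared)
```

Save the result of `head_validators(url)` between runs and compare with `headers_differ(saved, head_validators(url))`. If the two snapshots have no header in common, fall back to hashing the body.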

If you can't modify the script, or these headers are not present, the second quickest way is to figure out how much of the page is sufficient for a change hash, retrieve that part, and generate a hash to see if it changed. In general, computing an MD5 hash of less than 1 MB of content takes negligible time, with larger data taking longer. If the first part of the page has a timestamp or similar, you don't need to hash anything beyond it, as it will tell you whether the content changed.
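Something like the following Python sketch would hash only the first part of the page. The function name `md5_of_prefix` and the 8192-byte chunk size are arbitrary choices for illustration; the right `limit` depends on where the page's distinguishing content sits.

```python
import hashlib
import io


def md5_of_prefix(stream, limit: int) -> str:
    """Hash at most `limit` bytes read from a binary file-like object."""
    digest = hashlib.md5()
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(8192, remaining))
        if not chunk:  # stream ended before the limit was reached
            break
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()


# Changes beyond the prefix are deliberately invisible to this hash.
first = md5_of_prefix(io.BytesIO(b"hello world"), 5)
second = md5_of_prefix(io.BytesIO(b"hello there"), 5)
print(first == second)
```

The object returned by `urllib.request.urlopen(url)` is file-like, so it can be passed in directly as `stream`, and reading stops once `limit` bytes have been hashed.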

The third case is when the page content changes frequently but you only care whether the formatting or volume of the content changed (not the content itself). Then you will need to identify meaningful structure in the page and compare that. For example, with a page that lists log files, you might not care about the log files themselves, only whether a file or a new source was added. This is the trickiest to detect by far.
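One possible way to approximate "structure" is to compare the sequence of opening tags and ignore the text between them. This is only a sketch of that idea in Python: the class and function names are invented, and what counts as meaningful structure will depend on your page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_of(html: str) -> list:
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structure_changed(reference_html: str, current_html: str) -> bool:
    """True if the tag sequence differs, regardless of text content."""
    return structure_of(reference_html) != structure_of(current_html)
```

With this, a list that gains an entry counts as changed, while an edit to the text inside an existing entry does not.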

GrayWizardx
+1  A: 

Without knowing what language you're using, the simplest solution is to send your request with the If-Modified-Since HTTP header and check for a 304 (Not Modified) response from your server. If the content is a static file generated by a script, then your web server will check against the modified timestamp on the file. You'll get either a 304 response or a 200 (OK) response with the new content.
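As a rough Python illustration of the conditional request: the names `http_date`, `build_request`, and `conditional_get` are hypothetical, and the sketch assumes the server actually honors If-Modified-Since, which a dynamically generated page may not. Note that `urllib` reports a 304 as an `HTTPError`, so it is caught explicitly.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def http_date(epoch_seconds: float) -> str:
    """Format a Unix timestamp as an HTTP date, e.g. for If-Modified-Since."""
    return formatdate(epoch_seconds, usegmt=True)


def build_request(url: str, last_fetch_epoch: float) -> urllib.request.Request:
    """Build a GET request carrying the If-Modified-Since header."""
    return urllib.request.Request(
        url, headers={"If-Modified-Since": http_date(last_fetch_epoch)}
    )


def conditional_get(url: str, last_fetch_epoch: float):
    """Return the new body on 200, or None on 304 (needs network access)."""
    try:
        with urllib.request.urlopen(build_request(url, last_fetch_epoch)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing new since last fetch
            return None
        raise
```

Call `conditional_get(url, timestamp_of_last_fetch)` and treat `None` as "unchanged".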

WarrenB
A: 

Thanks for the answers. @BobMcGee, I can use what you said, but then I won't be able to find the point where the content differs.

So, as Adam commented, I have saved the HTML page as a reference, and each time I get the new HTML from the URL I compare it with the reference file to see what has changed.
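For anyone wanting to do the same comparison, here is a small Python sketch using the standard `difflib` module. The function name and the `reference.html` / `current.html` labels are just placeholders, not the files I actually use.

```python
import difflib


def diff_against_reference(reference: str, current: str) -> list:
    """Return unified-diff lines between the saved page and the new one.

    An empty list means the two pages are identical line for line.
    """
    return list(
        difflib.unified_diff(
            reference.splitlines(),
            current.splitlines(),
            fromfile="reference.html",
            tofile="current.html",
            lineterm="",
        )
    )


if __name__ == "__main__":
    old_page = "<h1>Report</h1>\n<p>42 rows</p>"
    new_page = "<h1>Report</h1>\n<p>43 rows</p>"
    for line in diff_against_reference(old_page, new_page):
        print(line)
```

Lines starting with `-` come from the reference and lines starting with `+` come from the new page, which shows where the content differs.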

prac
Y'know, you could.... ah, vote up and accept one of the answers? So we get some reputation for helping...
BobMcGee