I googled and couldn't find any code that would compare a webpage to a previous version.

In this case the page I'm trying to watch is link text. There are services that can watch a page, but I'd like to set this up on my own server.

I've set this up as a wiki so anyone can add to the code. Here's my idea:

  1. Check if previous version of file exists. If false then download page
  2. If page exists, diff to find differences and email the new content along with dates of new and old versions.

This script would be called nightly via cron or on demand via the browser (the latter is not a priority).

It sounds simple; maybe I'm just not looking in the right place.

A: 

You can check this SO posting for a few ideas, and for information about the challenge of detecting "true" changes to a web page (with fluctuating advertisement blocks and other "noise").

mjv
Valid posts; however, I'm not looking to fingerprint, as in this case it's one site with minor changes happening weekly. So even if the change is minor, it would still be nice to see it.
shaiss
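For pages where the "noise" mentioned above does matter, one option is to filter known-noisy lines out of both versions before comparing them. The following is only a minimal sketch under stated assumptions: the `NOISE_PATTERN` value and the function names `normalize` and `changed` are hypothetical, and a line-based filter like this only works when the noise sits on lines of its own.

```shell
#!/bin/sh
# NOISE_PATTERN is a hypothetical placeholder: an extended regex matching
# lines that change on every request (ads, timestamps). Adjust per page.
NOISE_PATTERN='class="ad"|Generated at'

# normalize FILE: print FILE with the noisy lines removed.
normalize() {
    grep -Ev "$NOISE_PATTERN" "$1"
}

# changed OLD NEW: succeed (exit status 0) only if the two files
# still differ after the noisy lines have been filtered out.
changed() {
    old_clean="$(normalize "$1")"
    new_clean="$(normalize "$2")"
    [ "$old_clean" != "$new_clean" ]
}
```

It could be used as `changed page.html.bak page.html && echo "page updated"`, where both file names are placeholders. For a single site where every small change is of interest, a plain diff without any filtering is probably enough.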
+1  A: 

Perhaps a simple sh-script like this, featuring wget, diff & test?

#!/bin/sh

WWWURI="http://foo.bar/testfile.html"
LOCALCOPY="testfile.html"
TMPFILE="tmpfile"
WEBFILE="changed.html"

MAILADDRESS="$(whoami)"
SUBJECT_NEWFILE="$LOCALCOPY is new"
BODY_NEWFILE="first version of $LOCALCOPY loaded"
SUBJECT_CHANGEDFILE="$LOCALCOPY updated"
SUBJECT_NOTCHANGED="$LOCALCOPY not updated"
BODY_CHANGEDFILE="new version of $LOCALCOPY"

# test for old file
if [ -e "$LOCALCOPY" ]
then
    mv "$LOCALCOPY" "$LOCALCOPY.bak"
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    # old version first, so lines that are new on the page are marked with ">"
    diff "$LOCALCOPY.bak" "$LOCALCOPY" > "$TMPFILE"

    # test for update
    if [ -s "$TMPFILE" ]
    then
        echo "$SUBJECT_CHANGEDFILE"
        ( echo "$BODY_CHANGEDFILE" ; cat "$TMPFILE" ) | tee "$WEBFILE" | mail -s "$SUBJECT_CHANGEDFILE" "$MAILADDRESS"
    else
        echo "$SUBJECT_NOTCHANGED"
    fi
else
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    echo "$BODY_NEWFILE"
    echo "$BODY_NEWFILE" | tee "$WEBFILE" | mail -s "$SUBJECT_NEWFILE" "$MAILADDRESS"
fi
[ -e "$TMPFILE" ] && rm "$TMPFILE"

Update: piped the output through tee, fixed some spelling, and added removal of $TMPFILE.

osti
Great script. I've set it up on my webserver and will post back shortly with the results.
shaiss
The script works like a charm; however, I still believe the ideal solution would be a web language that provides access via the browser.
shaiss
The tee pipe writes the diff to a file and also passes it on to mail. For a more sophisticated version, you probably want to switch to PHP or something similar :)
osti
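To illustrate the tee step from the script in isolation, here is a small sketch that does not send any mail. The function name `save_and_forward`, the demo file path, and the `sed` stand-in for `mail` are all assumptions made for the example.

```shell
#!/bin/sh
# save_and_forward FILE: copy stdin to FILE and also pass it through to
# stdout unchanged, which is the role tee plays between diff and mail.
save_and_forward() {
    tee "$1"
}

# Hypothetical demo file; in the script above this would be $WEBFILE.
demo_file="${TMPDIR:-/tmp}/tee-demo.$$"

# sed stands in for mail here, so the forwarded text is simply printed.
printf 'old line\nnew line\n' | save_and_forward "$demo_file" | sed 's/^/mail would get: /'
```

After this runs, the demo file holds the same two lines that were passed down the pipe, so a web server could serve that file while the mail goes out with identical content.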