How can you detect when somebody else's web page was last updated (or was changed)?
In general, there is no way to know when something on another site has been changed. If the site offers an RSS feed, you should try that. If the site does not offer an RSS feed (or if the RSS feed doesn't include the information you're looking for), then you have to scrape and compare.
01. Open the page for which you want to get the information.
02. Clear the address bar (where you type a site's address), then type or copy/paste the following:
javascript:alert(document.lastModified)
03. Press Enter or click the Go button.
Another Website Has Changed
There are two possible situations for the website's content: it is static or it is dynamic.
Static Content
For static sites:
Download the entire website. For example:
wget -r http://www.website.com
Write a shell script with the following algorithm:
- Change to the directory containing a copy of the website.
- Download a page from the website.
- Compare the page to the existing copy.
- If the pages differ, send an email notification and replace the old copy with the new one.
- Repeat for all pages.
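The steps above might be sketched as a pair of shell functions roughly like the following. This is only an illustration: the names pages_differ and check_page, the argument order, and the use of mail for the notification are assumptions, and it presumes wget and a working mail command are installed.

```shell
#!/bin/sh
# Sketch only: function names, arguments, and the mail command are assumptions.

# Return 0 (true) if the two files differ, or if the old copy does not exist yet.
# $1 = saved copy, $2 = freshly downloaded copy
pages_differ() {
    if [ ! -f "$1" ]; then
        return 0
    fi
    ! cmp -s "$1" "$2"
}

# Download one page, compare it with the saved copy, and on a difference
# send a notification and replace the saved copy with the new one.
# $1 = URL, $2 = path of the saved copy, $3 = address to notify
check_page() {
    tmp=$(mktemp) || return 1
    if ! wget -q -O "$tmp" "$1"; then
        rm -f "$tmp"
        return 1
    fi
    if pages_differ "$2" "$tmp"; then
        echo "$1 has changed" | mail -s "Page changed" "$3"
        mv "$tmp" "$2"
    else
        rm -f "$tmp"
    fi
}
```

A cron job could then call check_page once per URL. How each URL maps to a saved file name depends on how the mirror was created, so that part is left out here.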
Dynamic Content
For web pages with dynamic content, you will need an additional step:
Remove the dynamic content. For example:
lynx -dump http://www.google.com
Since lynx is a text-only browser, it will remove all graphics and most of the dynamic content.
Dynamic Server-generated Content
If the website uses server-side technologies (PHP, SSI, Servlets, etc.) to generate content, a more complicated solution is necessary.
One possible solution would be to check if a certain percentage of the content is the same.
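One rough way to put a number on "a certain percentage" is to count how many lines of the old copy diff reports as removed or altered. The sketch below does that; the function name similarity_percent is hypothetical, and comparing line by line is only one of several reasonable definitions of similarity.

```shell
#!/bin/sh
# Sketch only: similarity_percent is a hypothetical helper, and measuring
# similarity per line is an assumption, not the only possible definition.

# Print the percentage (0-100) of lines in the old copy that appear
# unchanged in the new copy.
# $1 = old copy, $2 = new copy
similarity_percent() {
    # grep -c '' counts lines; "|| true" keeps a zero count from looking like a failure.
    total=$(grep -c '' "$1" || true)
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    # Lines that diff prefixes with "<" exist only in the old copy.
    removed=$(diff "$1" "$2" | grep -c '^<' || true)
    echo $(( (total - removed) * 100 / total ))
}
```

A monitoring script could then treat the page as changed only when the result drops below some threshold, for example 90. That number is arbitrary and would need tuning for each site.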
Another Web Page Has Changed
If you need to check if a single web page has changed (for example, a license agreement), then the solution is much simpler:
- Download copy #1 of the web page.
- Wait a few days.
- Download copy #2 of the web page.
- Write a vi script that removes all the dynamic content.
- Check if copy #1 and copy #2 differ.
- If they differ, send an email and replace copy #1 with copy #2.
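The filtering and comparison steps might look roughly like this. Note that sed is used here in place of a vi script, simply because it is easier to run unattended. The two patterns are hypothetical examples of dynamic content, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the patterns below are hypothetical examples of dynamic content.
# Replace them with whatever actually varies on the page being watched.

# Read a page on stdin and write it to stdout with the volatile lines removed.
strip_dynamic() {
    sed -e '/Page generated at/d' \
        -e '/name="csrf_token"/d'
}

# Return 0 (true) if the two copies still differ after filtering.
# The filtered text is compared by checksum, so no temporary files are needed.
# $1 = copy #1, $2 = copy #2
copies_differ() {
    sum1=$(strip_dynamic < "$1" | cksum)
    sum2=$(strip_dynamic < "$2" | cksum)
    [ "$sum1" != "$sum2" ]
}
```

With the two downloads saved as files, copies_differ copy1.html copy2.html succeeds only when something other than the filtered lines has changed.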
Good luck!
The last-changed time is only as reliable as the information the web server provides. Dynamically generated pages will likely return the time the page was viewed. However, static pages are expected to reflect the actual file modification time.
This is propagated through the HTTP header Last-Modified. The JavaScript trick by AZIRAR is clever and will display this value. In Firefox, going to Tools->Page Info will also display it in the "Modified" field.
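The same header can also be read from the command line without a browser. Here is a minimal sketch, assuming curl is available for fetching the headers; last_modified_from_headers is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Sketch only: last_modified_from_headers is a hypothetical helper name.

# Read raw HTTP response headers on stdin and print the Last-Modified value,
# or print nothing if the server did not send that header.
last_modified_from_headers() {
    awk 'tolower($0) ~ /^last-modified:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the header name and leading blanks
        sub(/\r$/, "")             # drop the trailing carriage return
        print
        exit
    }'
}

# Example (needs network access and curl):
#   curl -sI http://www.website.com/ | last_modified_from_headers
```

The header name is matched case-insensitively because servers and HTTP versions differ in how they capitalize it. An empty result usually means the page is generated dynamically, or the server simply does not report a modification time.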