I had a similar problem. I was trying to devise a safe linking system for a directory of user-submitted links. A user would publish a page on a blog or news site and submit the link to the index. A human would verify that the link was appropriate, then add the page to the index.
The problem was to come up with a way to automate checks that ensured the link was still appropriate over time. For instance, did someone modify the page weeks later and insert racial slurs? Did the news site start telling people 'you must subscribe to read this story'?
I ended up extracting the paragraph (&lt;p&gt;) elements and comparing the cached copy against the current version word for word. In the simplest terms:
cached  = ["Lorem", "Ipsum", "..."]
scanned = ["Lorem", "foo", "..."]
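The extraction step could look something like this. This is only a minimal sketch in Python, assuming BeautifulSoup for the HTML parsing; the function name and the sample markup are illustrative, not what I actually ran.

from bs4 import BeautifulSoup

def extract_words(html):
    """Pull the text out of every <p> element and split it into words."""
    soup = BeautifulSoup(html, "html.parser")
    words = []
    for p in soup.find_all("p"):
        for token in p.get_text().split():
            # Lowercase and strip surrounding punctuation so "Lorem," matches "lorem"
            cleaned = token.strip(".,;:!?\"'()").lower()
            if cleaned:
                words.append(cleaned)
    return words

# cached comes from the copy saved at verification time,
# scanned from what the page serves today
cached  = extract_words("<p>Lorem Ipsum dolor sit amet.</p>")
scanned = extract_words("<p>Lorem foo dolor sit amet.</p>")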
After that, a series of sorting and scoring passes would work over the lists, ignoring common words ('if', 'but', 'can', 'or', 'and') while treating other words (profanity, etc.) with a heavier weight.
This resulted in a scoring system that would all but ignore minor edits and revisions (typos, sentence structure, etc.) but quickly reveal whether the content needed to be examined again. A score was then returned; anything above a threshold was put into a queue for a human to re-verify.
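A rough sketch of that scoring pass, again in Python and purely illustrative: the stop-word set, the weight table, and the threshold below are made-up placeholders, not the values I used.

from collections import Counter

STOP_WORDS = {"if", "but", "can", "or", "and", "the", "a", "of", "to"}
HEAVY_WORDS = {"badword": 50, "subscribe": 10}   # placeholder entries and weights

def score_change(cached, scanned):
    """Score how far the scanned word list has drifted from the cached one."""
    cached_counts  = Counter(w for w in cached  if w not in STOP_WORDS)
    scanned_counts = Counter(w for w in scanned if w not in STOP_WORDS)

    score = 0
    # Words that were added or removed both count toward the score
    for word in set(cached_counts) | set(scanned_counts):
        diff = abs(cached_counts[word] - scanned_counts[word])
        score += diff * HEAVY_WORDS.get(word, 1)   # heavier weight for flagged words
    return score

THRESHOLD = 25   # arbitrary cut-off for this sketch

# Using the cached/scanned word lists produced by the extraction step above:
if score_change(cached, scanned) > THRESHOLD:
    print("queue for human re-verification")

The effect is that a handful of reworded sentences barely moves the score, while an injected slur or a paywall notice jumps past the threshold immediately.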
This also helped to account for major cosmetic changes to the site. I would not trust it to run completely on its own, but with a little help from humans it did its job predictably well. Admittedly, the system was not as efficient as it could have been as far as the methodology goes.