views: 483
answers: 1

I'm trying to download a static mirror of a wiki using wget. I only want the latest version of each article (not the full history or diffs between versions). It would be easy to just download the whole thing and delete the unneeded pages later, but that would take too much time and put needless strain on the server.

There are a number of pages I clearly don't need, such as:

WhoIsDoingWhat?action=diff&date=1184177979

Is there a way to tell wget not to download or recurse into URLs that contain 'action=diff'? Or otherwise to exclude URLs that match some regex?

+2  A: 
-R '*action=diff*,*action=edit*'
chaos
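For context, a minimal sketch of how that flag fits into a full mirroring command; the wiki URL here is a placeholder:

wget --mirror --convert-links --page-requisites \
     -R '*action=diff*,*action=edit*' \
     http://wiki.example.com/

--mirror turns on recursion with unlimited depth and timestamping, and -R takes a comma-separated list of filename patterns to reject.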
It looks like doing that will download the page, reject it, and then delete it (instead of skipping the download altogether).
stonea
It will prevent recursing on the rejected page, though.
stonea
I see no evidence of that. "The ‘--reject’ option works the same way as ‘--accept’, only its logic is the reverse; Wget will download all files except the ones matching the suffixes (or patterns) in the list". (-R is short for --reject; the manual calls its argument rejlist.) That seems to state clearly that it will not download matching patterns.
chaos
Seems like a bug in wget. Other people have had this issue before: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217243
stonea
Hunh. Well, that's friggin' goofy. Sorry, guess you can't quite do all of it with wget then. :(
chaos
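For what it's worth, wget 1.14 and later added a --reject-regex option that is matched against the full URL (query string included) before anything is fetched, which sidesteps the download-then-delete behavior. A sketch, assuming a recent enough wget and the same placeholder URL:

wget --mirror --convert-links --page-requisites \
     --reject-regex 'action=(diff|edit)' \
     http://wiki.example.com/

The regex flavor defaults to POSIX; --regex-type=pcre is available if wget was built with PCRE support.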
If you're using MediaWiki, you could try using the API instead: http://www.mediawiki.org/wiki/API
Adrian Archer
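A minimal sketch of that route, assuming a stock MediaWiki install (the hostname, path, and page name are placeholders). The action=parse module returns the rendered HTML of a page's latest revision, so history and diff pages are never touched:

wget -O WhoIsDoingWhat.json \
     'http://wiki.example.com/w/api.php?action=parse&page=WhoIsDoingWhat&format=json'

To drive this across the whole wiki, you'd first pull the page list with action=query&list=allpages and loop over the titles.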