Is anyone aware of a way to programmatically download images from Wikimedia Commons without registering for a bot account? It seems the only way to get approval for a bot account is if it adds to or edits information already on Wikimedia. If you try to download images without a bot account using some of the API libraries out there, you get error messages instead of the images. It seems like they block anyone not coming in from a browser. Does anyone else have experience with this? Am I missing something here?

A: 

They probably don't want people to scrape their site. The content is freely available, but that doesn't mean they're obliged to facilitate automated bulk downloads.

John at CashCommons
That's what I figured. I'm not suggesting they're obligated to allow scraping, but it does severely limit a publisher's ability to access the content. There's a WordPress plugin for quickly finding Flickr images licensed under the Creative Commons BY and BY-SA licenses and including them in your articles. It would be nice if there were something similar for Wikimedia Commons. Going to the site each time is tedious, which is probably why most CC BY and CC BY-SA content seems to come from Flickr. I'm assuming anyone uploading works under these licenses would be happy to see them widely used.
tomvon
Yeah, those are business decisions by Flickr and Wikimedia. They get to make the rules for how their sites, and the content on them, can be accessed. Likewise, content contributors and developers alike are free to deal or not deal with them based on their terms of use.
John at CashCommons
I get that. I'm really looking for specific information regarding my question from anyone who has experience with the issue, not to be patronized. Thanks anyway.
tomvon
In that case, I suggest that you ask technical questions and avoid editorializing about the intentions of --INSERT WEBSITE--. If those kinds of comments aren't part of your question, then people won't address them in their answers.
John at CashCommons
Thanks for schooling me in the ways of the intarweb.
tomvon
+2  A: 

Try explaining exactly what you want to do, what you've tried, and what error message you got. You're not being very clear...

What libraries have you tried? If you're not aggressive, there are no restrictions on downloading Wikimedia content; I've never heard of any. Some user agents are banned from editing to prevent stupid spamming, but really, I've never heard of downloading restrictions.

If you are trying to scrape a massive number of images by downloading them through Commons, you're doing it wrong (tm). If you are trying to get a few images, anywhere from 10 to 200, you should be able to write a decent tool in a few lines of code, provided that you respect the throttling requirement: if you don't slow down when the API tells you to, the sysadmins are likely to kick you out.
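
For a small batch, a polite downloader needs little more than the public MediaWiki API. Here is a minimal sketch along those lines, assuming Python 3's standard library; the site name, contact address, and file title are placeholders, not anything the answer or Commons prescribes. It asks the API for each file's direct URL (prop=imageinfo) and pauses between downloads:

    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    # Identify yourself; a descriptive User-Agent is the polite default.
    HEADERS = {"User-Agent": "MyPlantSite/0.1 (contact: you@example.com)"}

    def api_get(params):
        url = API + "?" + urllib.parse.urlencode(dict(params, format="json"))
        req = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def download(titles):
        # Ask for the direct upload.wikimedia.org URL of each file.
        data = api_get({
            "action": "query",
            "titles": "|".join(titles),
            "prop": "imageinfo",
            "iiprop": "url",
        })
        for page in data["query"]["pages"].values():
            info = page.get("imageinfo")
            if not info:
                continue  # title has no file behind it
            url = info[0]["url"]
            req = urllib.request.Request(url, headers=HEADERS)
            with urllib.request.urlopen(req) as resp:
                with open(url.rsplit("/", 1)[-1], "wb") as out:
                    out.write(resp.read())
            time.sleep(1)  # one request per second is plenty for 10-200 files

    download(["File:Example.jpg"])  # placeholder title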

If you need a complete image dump (we're talking a few TB), try asking on wikitech-l. We had torrents available when there were fewer images; now it's more complicated, but still doable.

About bot accounts: how deeply have you looked into the system? You need a bot account for fast, unsupervised edits. Bot privileges also unlock a few facilities, such as increased query sizes. But remember: a bot account is simply an augmented user account. Have you tried running anything with a classical account?
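
To illustrate that last point, read queries work with no account at all. A minimal sketch under the same Python 3 assumption (the category and User-Agent are placeholders):

    import json
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Tulips",  # placeholder category
        "cmtype": "file",
        "cmlimit": "50",  # bot accounts merely get larger maximum limits
        "format": "json",
    })
    req = urllib.request.Request(
        API + "?" + params,
        headers={"User-Agent": "MyPlantSite/0.1 (contact: you@example.com)"},
    )
    with urllib.request.urlopen(req) as resp:
        for member in json.load(resp)["query"]["categorymembers"]:
            print(member["title"])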

NicDumZ
Thanks, this is helpful. I have a site about plants and I'd like to include some photos from Wikimedia Commons. I ran a query against http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php to get a list of images in a particular category, then ran another query against http://toolserver.org/~magnus/commonsapi.php to get the metadata for each image. I then used urllib.urlretrieve in a Python script to fetch the actual image. Though I just tried it again and it works, and so does wget. Hmmm, I may have had a bug in the formation of the URL. (A sketch of another likely culprit follows below.)
tomvon
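
For what it's worth, one other likely culprit besides a malformed URL: Wikimedia asks clients to send a descriptive User-Agent, and requests carrying a bare library default are sometimes refused. This is an assumption about the earlier failures, not a diagnosis, but guarding against it is cheap. A sketch assuming Python 3, where the call lives at urllib.request.urlretrieve (the URL and names are placeholders):

    import urllib.request

    # urlretrieve uses the globally installed opener, so attach a real
    # User-Agent there once and every later call sends it.
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", "MyPlantSite/0.1 (contact: you@example.com)")]
    urllib.request.install_opener(opener)

    # URL as returned by the metadata query; placeholder shown here.
    urllib.request.urlretrieve(
        "https://upload.wikimedia.org/wikipedia/commons/x/xy/Some_plant.jpg",
        "Some_plant.jpg",
    )
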
I'm not looking for a complete dump, just a few pics. I'd also like to create a WordPress plugin that lets you search Wikimedia Commons and more easily add images to your site (with proper attribution). Do you know where there's info about the throttling limits? I've done some pretty extensive reading at Commons but don't remember seeing anything about limits. I certainly want to respect the terms of use.
tomvon
See http://www.mediawiki.org/wiki/Manual:Maxlag_parameter for throttling. Note that it's a recommendation, so if you have never actually seen a "maxlag" error or a blocked/autoblocked/ratelimited error code, you have probably never been throttled or blocked. (A sketch of handling maxlag follows below.)
NicDumZ
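
To make the maxlag recommendation concrete: the API reports a lagged server as a response with error code "maxlag" and a Retry-After header. A minimal retry loop, again assuming Python 3's standard library (the User-Agent is a placeholder):

    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {"User-Agent": "MyPlantSite/0.1 (contact: you@example.com)"}

    def api_get(params, max_retries=5):
        # maxlag=5 is the commonly suggested value for non-interactive clients.
        query = urllib.parse.urlencode(dict(params, format="json", maxlag="5"))
        for attempt in range(max_retries):
            req = urllib.request.Request(API + "?" + query, headers=HEADERS)
            with urllib.request.urlopen(req) as resp:
                retry_after = resp.headers.get("Retry-After")
                data = json.load(resp)
            if data.get("error", {}).get("code") != "maxlag":
                return data
            # The servers are lagged; wait as instructed, then try again.
            time.sleep(int(retry_after or "5"))
        raise RuntimeError("API still lagged after %d attempts" % max_retries)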