views: 150
answers: 4

You have a forum (vBulletin) that has a bunch of images - how easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).

Doable in a day? :)

I have a site that uses CodeIgniter as well - would it be even simpler using that?

+2  A: 

Assuming this is to be carried out on the server, cURL + regular expressions are your friends... and yes, doable in a day.

There are also some open-source HTML parsers that might make this cleaner.
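
A minimal sketch of the cURL + regexp approach, assuming a publicly readable thread page (the URL and function name here are just for illustration):

    <?php
    // Fetch one thread page with cURL and pull out the image URLs.
    function fetch_image_urls($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);

        // Grab the src attribute of every <img> tag.
        preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);
        return $matches[1];
    }

    print_r(fetch_image_urls('http://example.com/forum/showthread.php?t=123'));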

Scott Evernden
Looks good! Thanks.
A: 

It depends on where your scraping script runs.

If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows for high-level database access. That would simplify querying all posts in a thread.

If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically, by searching for a NEXT link in each page, or manually, by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>). A regular expression could then be used to extract the image URLs. Finally, the script could use these image URLs to construct another page displaying all the images, or it could download them and create a package.
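
A rough sketch of the automatic variant, assuming the forum's theme marks its next-page link with rel="next" (vBulletin templates vary, so both patterns may need adjusting):

    <?php
    $url    = 'http://example.com/forum/showthread.php?t=123';
    $images = array();

    // Walk the thread page by page, following the NEXT link until none is left.
    while ($url) {
        $html = file_get_contents($url);

        // Collect every image URL on this page.
        preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m);
        $images = array_merge($images, $m[1]);

        // Look for a next-page link; stop when there isn't one.
        if (preg_match('/<a[^>]*rel="next"[^>]*href="([^"]+)"/i', $html, $next)) {
            $url = html_entity_decode($next[1]);
        } else {
            $url = null;
        }
    }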

In the second case, the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.

Ole
Right - I will be parsing it using the image tags, and of course skipping avatars, etc. I have permission from the owner to do this, but I will be acting as a third party without access. Thank you.
A: 

When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. The simplest way to do this is probably just to sleep for X seconds between each fetch.
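
For example, with a plain sleep() between requests, assuming $page_urls holds the thread's page URLs (the one-second pause is arbitrary):

    foreach ($page_urls as $url) {
        $html = file_get_contents($url);
        // ... extract images from $html here ...
        sleep(1); // pause so the forum server isn't hammered
    }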

derobert
A: 

Yes, doable in a day.

Since you already have a working CI setup, I would use it.

I would use the following approach:

1) Make a model in CI capable of:

  • logging in to vBulletin (images are often added as attachments, and you need to be logged in before you can download them). Use something like Snoopy.
  • collecting the URL for the "Last" page button using preg_match(), parsing that URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch after this list)
  • collecting the HTML from all generated links, still using Snoopy
  • finding all images in the HTML using preg_match_all()
  • downloading all images, still using Snoopy
  • moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists
  • saving the image name and exact byte size in a DB table, so you can avoid downloading the same image more than once
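
A sketch of the URL-generation step, assuming vBulletin's plain showthread.php?t=...&page=... format (SEO-rewritten URLs would need different handling):

    <?php
    // Given the href of the "Last" page button, build URLs for pages 1..last.
    function generate_page_urls($last_url) {
        $parts = parse_url($last_url);
        parse_str($parts['query'], $params);
        $last_page = (int) $params['page'];

        $urls = array();
        for ($p = 1; $p <= $last_page; $p++) {
            $params['page'] = $p;
            $urls[] = $parts['scheme'] . '://' . $parts['host']
                    . $parts['path'] . '?' . http_build_query($params);
        }
        return $urls;
    }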

2) Make a method in a controller that collects all images

3) Set up a cron job that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely (note the capital -O; a lowercase -o writes wget's log there instead of the page).
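
The matching crontab entry, assuming a 30-minute interval, could look like:

    */30 * * * * wget -O /tmp/useless.html http://localhost/imageminer/collect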

4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.

Will do - thank you.