I post my ideas on a forum and have started to worry that I will lose them if it gets closed. Do you know a good way to download an entire phpBB3 forum (other people's ideas are worth keeping too!) into a database? Is there software already available, or do I have to write it myself?

UPDATE1:

Well, I can write it myself; it is not that hard a problem, is it? I just don't want to waste time reinventing the wheel.

UPDATE2:

There is an answer at SuperUser: How can I download an entire (active) phpbb forum?

But I preferred to write a Ruby script to back up the forum. It is not a complete solution, but it is enough for me. And yes, it doesn't violate any TOS, if you are worried about that.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'uri'
require 'cgi'
#require 'sqlite3-ruby'

class PHPBB
  def initialize base_url
    @base_url = base_url
    @forums, @topics = Array.new(2) { {} } # one empty hash each
    self.parse_main_page 'main', 'index.php'
    @forums.keys.each do |f|
      self.parse_forum "forum.#{f}", "viewforum.php?f=#{f}"
    end
    @topics.keys.each do |t|
      self.parse_topic "topic.#{t}", "viewtopic.php?t=#{t}"
    end
  end


  def read_file cached, remote
    local = "%s.%s.html" % [__FILE__, cached]
    if File.exist? local
      return IO.read local
    else # download and save
      puts "load #{remote}"
      # open-uri: on Ruby 3.0+ call URI.open instead of open
      content = open(@base_url + remote).read
      File.open(local, "w") { |file| file << content } # block form closes the file
      return content
    end
  end


  def parse_main_page local, remote
    doc = Hpricot(self.read_file(local,remote))
    doc.search('ul.forums/li.row').each do |li|
      fa = li.search('a.forumtitle').first # forum anchor
      f = self.parse_anchor(fa)['f']
      @forums[f] = {
        forum_id: f,
        title: fa.inner_html,
        description: li.search('dl/dt').first.inner_html.split('<br />').last.strip
      }
      ua, pa = li.search('dd.lastpost/span/a') # user anchor, post anchor
      q = self.parse_anchor(pa)
      self.last_post f, q['p'] unless q.nil?
    end
  end

  def last_post f,p
    @last_post = {forum_id: f, post_id: p} if @last_post.nil? or p.to_i > @last_post[:post_id].to_i
  end

  def last_topic f,t
  end


  def parse_forum local, remote, start=nil
    doc = Hpricot(self.read_file(local,remote))
    doc.search('ul.topics/li.row').each do |li|
      ta = li.search('a.topictitle').first # topic anchor
      q = self.parse_anchor(ta)
      f = q['f']
      t = q['t']
      u = self.parse_anchor(li.search('dl/dt/a').last)['u']
      @topics[t] = {
        forum_id: f,
        topic_id: t,
        user_id: u,
        title: ta.inner_html
      }
    end
  end


  def parse_topic local, remote, start=nil
    doc = Hpricot(self.read_file(local,remote))
    if start.nil?
      doc.search('div.pagination/span/a').collect{ |p| self.parse_anchor(p)['start'] }.uniq.each do |p|
        self.parse_topic "#{local}.start.#{p}", "#{remote}&start=#{p}", true
      end
    end
    doc.search('div.postbody').each do |li|
      # do something
    end
  end


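  def parse_url href
    # to_s guards against hrefs without a query string (query is nil then)
    r = CGI.parse URI.parse(href).query.to_s
    r.each_pair { |k,v| r[k] = v.last } # keep the last value of each parameter
  end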


  def parse_anchor hp
    self.parse_url hp.attributes['href'] unless hp.nil?
  end
end
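
The `parse_url` helper above is what turns each anchor's `href` into forum, topic and post ids. Here is a minimal stand-alone sketch of that step using only the standard library. The name `query_params` and the sample URL are made up for illustration, and it uses `URI.decode_www_form` in place of `CGI.parse`, so it is a variation on the script, not a copy of it:

```ruby
require 'uri'

# Hypothetical helper (not part of the script above): returns the query
# parameters of href as a flat Hash. When a key is repeated, the last
# value wins. An href without a query string gives an empty Hash.
def query_params(href)
  query = URI.parse(href).query
  return {} if query.nil?
  URI.decode_www_form(query).to_h
end

# Sample phpBB-style relative link (an assumed example, not a real forum).
params = query_params('./viewtopic.php?f=3&t=42&start=15')
puts "forum=#{params['f']} topic=#{params['t']} start=#{params['start']}"
```

Returning an empty Hash for links without a query string means a lookup such as `params['f']` simply yields `nil` instead of raising.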
+2  A: 

This will be a violation of the Terms of Service and may also be illegal.

Secondly, if the StackOverflow community starts solving these kinds of web-scraping problems, then you know ...

shamittomar
Nope, this does not violate the TOS. What makes you think so? For me this is a standard problem of parsing and sorting data. Do you have a problem with this?
Andrey
@Andrei, can you please provide the URL of the forum?
shamittomar
@shamittomar: Hmm... I would rather not. Does that make it look [even more] evil?
Andrey
If you just want to check the TOS, it is the standard phpBB3 TOS, which does not say anything about downloading an entire forum.
Andrey
A: 

Use Offline Explorer.

M.H
I had thought of that, but I would prefer a SQLite database with only the useful information.
Andrey