ansaurus

Question

Importing a large dataset into a database

Answer 1

A:

Not sure if Taps will help you here: http://adam.blog.heroku.com/past/2009/2/11/taps_for_easy_database_transfers/

stephenmurdoch 2010-03-15 20:31:29

Unfortunately, from what I see, I don't think it will. Taps seems to be for database to database transfers, whereas I have a set of files that need to be imported into a database.

2010-03-15 20:59:19

Answer 2

+1 A:

It will take two months to download the dump from their website. But it should only take a few hours to import this.

The fastest way is to use Postgres' COPY command. You can use it for the authors file, but the editions file needs to be split so that its rows go into both the books and author_books tables.

This script is in Python 2.6, but you should be able to adapt it to Ruby if needed.

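#!/usr/bin/env python
import json

fp = open('editions.json')
ab_out = open('/tmp/author_book.dump', 'w')
b_out = open('/tmp/book.dump', 'w')
for line in fp:
  # Parse everything after the first '/type/edition ' marker as JSON
  vals = json.loads(line.split('/type/edition ', 1)[1])
  # One tab-separated row per edition, terminated by a newline
  b_out.write("%(key)s\t%(title)s\t%(publish_date)s\n" % vals)
  # One row per (edition, author) pair; some editions may list no authors
  for author in vals.get('authors', []):
    ab_out.write("%s\t%s\n" % (vals['key'], author['key']))
fp.close()
ab_out.close()
b_out.close()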

Then to copy to Postgres:

COPY book_table FROM '/tmp/book.dump';
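
One thing to watch with the script above: it writes field values as-is, while COPY's default text format treats tab, newline, and backslash specially and reads \N as NULL. A minimal sketch of an escaping helper follows, assuming a current Python 3; the names copy_escape and copy_row and the sample key are hypothetical, not part of the original script.

```python
def copy_escape(value):
    """Escape one field for PostgreSQL COPY text format; None becomes \\N (NULL)."""
    if value is None:
        return "\\N"
    text = str(value)
    # Double backslashes first so the escapes added below are not re-escaped.
    text = text.replace("\\", "\\\\")
    # Tab is the field delimiter and newline ends a row, so write them as escapes.
    text = text.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r")
    return text


def copy_row(fields):
    """Join escaped fields with tabs and terminate the row with a newline."""
    return "\t".join(copy_escape(f) for f in fields) + "\n"


# Hypothetical edition row: key, a title containing a tab and a backslash, no date.
row = copy_row(["/b/OL1M", "Tabs\tand \\ slashes", None])
```

Passing each row through a helper like this keeps a stray tab or backslash in a title from shifting columns or aborting the COPY partway through.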

Scott Bailey 2010-03-16 00:00:06

When you say I can just use the Postgres copy command for the authors file, what do you mean? Wouldn't I also need to process it into the format that Postgres expects, using a script like this one?

2010-03-16 01:35:16

Yes of course. I did the harder of the two files for you and assumed you could do the easier one yourself.
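
For reference, the easier file could be handled along these lines. This is only a sketch, assuming each line of the authors dump carries some prefix followed by a single JSON object with "key" and "name" fields; the function name author_row and the sample line are hypothetical.

```python
import json


def author_row(line):
    """Return one tab-separated COPY row (key, name) or None to skip the line."""
    start = line.find("{")
    if start == -1:
        return None  # no JSON object on this line
    vals = json.loads(line[start:])
    key, name = vals.get("key"), vals.get("name")
    if key is None or name is None:
        return None  # incomplete record, nothing useful to import
    # Double backslashes and turn tabs into spaces so COPY sees a single field.
    name = name.replace("\\", "\\\\").replace("\t", " ")
    return "%s\t%s\n" % (key, name)


# Hypothetical dump line: a prefix, then the JSON record (the name contains a tab).
sample = '/type/author /a/OL1A {"key": "/a/OL1A", "name": "Jane\\tDoe"}'
print(repr(author_row(sample)))
```

Writing every non-None result to a file gives something COPY can load directly into a two-column authors table.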

Scott Bailey 2010-03-16 05:16:22

Thanks again, got this solved, your help was invaluable.

2010-03-17 05:58:47

Answer 3

A:

Following Scott Bailey's advice, I wrote Ruby scripts to modify the JSON into a format acceptable for the Postgres copy command. In case anyone else runs into this same problem, here are the scripts I wrote:

require 'rubygems'
require 'json'

fp = File.open('./edition.txt', 'r')
ab_out = File.new('./author_book.dump', 'w')
b_out = File.new('./book.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["title"].nil?
      next
    end
    title = vals["title"]
    #Some titles contain backslashes and tabs, which we need to escape and remove, respectively
    title.gsub! /\\/, "\\\\\\\\"
    title.gsub! /\t/, " "
    if ((vals["isbn_10"].nil? || vals["isbn_10"].empty?) && (vals["isbn_13"].nil? || vals["isbn_13"].empty?))
      b_out.puts vals["key"] + "\t" + title + "\t" + '\N' + "\n"
    #Only get the first ISBN number
    elsif (!vals["isbn_10"].nil? && !vals["isbn_10"].empty?) 
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_10"][0] + "\n"
    elsif (!vals["isbn_13"].nil? && !vals["isbn_13"].empty?)
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_13"][0] + "\n"    
    end
    if vals["authors"]
      for author in vals["authors"]
        if !author["key"].nil?
          ab_out.puts vals["key"] + "\t" + author["key"]
        end
      end
    end
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
ab_out.close
b_out.close

and, for the authors file:

require 'rubygems'
require 'json'

fp = File.open('./author.txt', 'r')
a_out = File.new('./author.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["name"].nil?
      next
    end
    name = vals["name"]
    name.gsub! /\\/, "\\\\\\\\"
    name.gsub! /\t/, " "
    a_out.puts vals["key"] + "\t" + name + "\n"
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
a_out.close

2010-03-17 05:58:32
