views:

55

answers:

2

I have a simple 4-column Excel spreadsheet that matches universities to their ID codes for lookup purposes. The file is pretty big (300k).

I need to come up with a way to turn this data into a populated table in my Rails app. The catch is that this is a document that is updated now and then, so it can't just be a one-time solution. Ideally, it would be some sort of ruby script that would read the file and create the entries automatically so that when we get emailed a new version, we can just update it automatically. I'm on Heroku if that matters at all.

How can I accomplish something like this?

+2  A: 

If you can, save the spreadsheet as CSV, there's much better gems for parsing CSV files than for parsing excel spreadsheets. I found an effective way of handling this kind of problem is to make a rake task that reads the CSV file and creates all the records as appropriate.

So for example, here's how to read all the lines from a file using the old, but still effective FasterCSV gem

data = FasterCSV.read('lib/tasks/data.csv')
columns = data.remove(0)
unique_column_index = -1#The index of a column that's always unique per row in the spreadsheet
data.each do | row |
  r = Record.find_or_initialize_by_unique_column(row[unique_column_index])
  columns.each_with_index do | index, column_name |
    r[column_name] = row[index]
  end
  r.save! rescue => e Rails.logger.error("Failed to save #{r.inspect}")
end

It does kinda rely on you having a unique column in the original spreadsheet to go off though.

If you put that into a rake task, you can then wire it into you're Capistrano deploy script, so it'll be run every time you deploy. the find_or_initialize should ensure you shouldn't get duplicate records.

Ceilingfish
Thanks so much Ceilingfish! I'm using Heroku for deployment so I'll have to figure it out, it won't be too hard as the files are updated monthly or so.
Kevin
A: 

Parsing newish Excel files isn't too much trouble using Hpricot. This will give you a two-dimensional array:

require 'hpricot'

doc  = open("data.xlsx") { |f| Hpricot(f) }
rows = doc.search('row')
rows = rows[1..rows.length] # Skips the header row

rows = rows.map do |row|
    columns = []
    row.search('cell').each do |cell|
        # Excel stores cell indexes rather than blank cells
        next_index          = (cell.attributes['ss:Index']) ? (cell.attributes['ss:Index'].to_i - 1) : columns.length
        columns[next_index] = cell.search('data').inner_html
    end
    columns
end
elektronaut