Hi all!

I have an old project in our company that uses a lot of shell scripting. The most popular commands are grep, sed, and sort, and that has been fine so far. But one thing really bothers me: sometimes we receive input data from other companies in CSV-like formats. These input files are connected by several IDs, as if they were just a dump of some DBMS.

Instead of importing the data into some DBMS, there are pages of shell scripts that do 'cut'-ing, 'paste'-ing, and 'join'-ing in combination with 'sort' to emulate a real database.

The best solution seems to be to use SQLite, but I worry about the size of data it can handle and how fast it would be. It also seems like SQLite would make it harder to use shell tools such as 'sed'.

Can someone advise, or describe a success story?

+1  A: 

If shell scripts are working for you, then I see no reason to abandon them (and this is coming from a big DB guy). Even if you do get a database, you'll still want those scripts to do imports/exports and other operations. What I would do is make sure your "cut", "paste", and "join" logic is abstracted away, such that joining two sets of csv data is an easy operation that relies on "black box" code, rather than recoding join logic for every new operation.
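
For illustration, here is a minimal Python sketch of that "black box" idea: one reusable join helper instead of re-coding a sort/join pipeline for every new file pair. The file names and the key column are hypothetical, not from the answer.

import csv

def join_csv(left_path, right_path, key, delimiter=","):
    # Inner-join two CSV files (with header rows) on a shared key column.
    with open(right_path, newline="") as f:
        right_rows = {row[key]: row for row in csv.DictReader(f, delimiter=delimiter)}
    with open(left_path, newline="") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            match = right_rows.get(row[key])
            if match is not None:
                merged = dict(row)
                merged.update(match)
                yield merged

# Hypothetical usage: every script that needs a join calls the same helper.
# for record in join_csv("orders.csv", "customers.csv", key="customer_id"):
#     print(record)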

Joel Coehoorn
A: 

Any database system that has indexes will be faster than parsing text files. For example, if you have 100,000 records, a lookup on an indexed column takes at most about 17 reads (a binary search, since 2^17 > 100,000). Finding a record in a flat file can take anywhere from one to 100,000 reads.

Indexes are the most important factor for speed and fast access, but with databases you also get other benefits, such as data consistency (ensuring all data has a valid type).
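
As a rough sketch of what that looks like in practice (the table and column names here are invented, not from the answer): creating an index turns the lookup into a B-tree search instead of a full scan.

import sqlite3

con = sqlite3.connect("example.db")
con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, ext_id TEXT, payload TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS idx_records_ext_id ON records (ext_id)")
# With the index in place this lookup needs only a handful of page reads,
# instead of scanning every row in the table.
row = con.execute("SELECT payload FROM records WHERE ext_id = ?", ("A12345",)).fetchone()
con.close()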

Milan Babuškov
'join' needs all files to be 'sort'-ed before use, so it's not just a sequential search in a file. Yes, typing was one of the benefits I described to my boss :)
antimirov
+3  A: 
  • size: according to the docs, SQLite can handle huge datasets, easily into the terabytes
  • speed: properly indexed queries and joins are O(log n) or O(n log n), instead of the O(n) or O(n^2) you get from sed, cut, paste and join.
  • sed:
    • if you use it simply to transform a field, it's just as easy to write a one- or two-liner in Python to handle it (see the sketch after this list).
    • if you use it to transform the table structure, it's a totally different task: either trivially solved by ALTER TABLE, or maybe a huge processing project.
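
As a sketch of the field-transform case (the column index and the transformation are only examples, not from the answer), such a filter reads a tab-separated stream on stdin and writes it to stdout, much like sed would:

import csv
import sys

reader = csv.reader(sys.stdin, delimiter="\t")
writer = csv.writer(sys.stdout, delimiter="\t")
for row in reader:
    row[2] = row[2].strip().upper()  # e.g. normalise the third field
    writer.writerow(row)

# Hypothetical usage: python transform_field.py < input.tsv > output.tsv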
Javier
AFAIK, SQLite does not support altering columns.
Milan Babuškov
http://www.sqlite.org/lang_altertable.html
Milan Babuškov
To quote: "It is not possible to rename a column, remove a column, or add or remove constraints from a table."
Milan Babuškov
That's why I said that transforming the table structure can be trivial, or a big project by itself; it all depends on what you want to do. Sometimes it __is__ easier to process with sed than with SQL.
Javier
+1  A: 

Sqlite3 is a Reasonable Plan

Sqlite3 will work fine for you as long as you run the scripts and the database on the same machine. If you want to work over a network, why not just run MySQL?

Things like Ruby, Python, and Perl are useful as scripting languages and should interoperate nicely with all the other shell commands. Although in theory SQL could be used directly, you shouldn't try it; you will want to access the database via a scripting language that already has sqlite3 integration.

You use the scripting language a lot like an awk, sed, or sh script. For example, here are some Ruby commands used to create an sqlite3 db:

require 'rubygems'
require 'sqlite3'
# ... stuff removed here ...
# (the elided code presumably defines aDatabase, aTable, aFile and the
#  SqlParams helper, which appears to expose the column count `n` and the
#  column definitions `sqlCols` parsed from the input file's header line)
db = SQLite3::Database.new(aDatabase)
db.execute("drop table if exists " + aTable)
File.open(aFile, "r") do |f|
  # Build the CREATE TABLE statement from the header line of the input file.
  @sql = SqlParams.new(f.gets)
  sqlCreateStmt = "CREATE TABLE #{aTable} (
      Id INTEGER PRIMARY KEY,
      #{@sql.sqlCols}
      )\n"
  # One "?" placeholder per column; NULL lets the Id column autoincrement.
  placeholders = ("?," * @sql.n).chomp(",")
  sqlInsertStmt = "INSERT INTO #{aTable} VALUES (NULL, #{placeholders})"
  puts @sql.n, ": ", sqlCreateStmt, ": ", placeholders, ": ", sqlInsertStmt
  db.execute(sqlCreateStmt)
  # Wrap all inserts in a single transaction; this is much faster than
  # committing each row separately.
  db.transaction do |d|
    f.each_line do |el|
      t = el.split("\t", @sql.n)
      print '.'
      d.execute(sqlInsertStmt, t)
    end
  end
end
db.close() unless db.closed?
print "\n"
DigitalRoss
awk, sed and sh are a bad idea for this. You need a language like Perl, Python or Ruby.
reinierpost
A: 

I've used SQLite on a project recently. The SQLite file was over 600 MB and it was fast enough for me. You can use indexes to improve speed, and you can check whether an index is actually used with EXPLAIN QUERY PLAN (http://www.sqlite.org/lang_explain.html).

Using SQL for your queries will make your life a lot easier compared to csv parsing.

If you need to combine SQLite with scripting, I recommend using Python. Python has built-in libraries for SQLite and CSV handling, so it is very easy to start working with both and combining them (e.g. writing a simple script that outputs data for your other existing command-line tools).
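
A minimal sketch of that approach (the file, table, and column names here are assumptions, not part of the answer): load a CSV file into SQLite with the standard csv and sqlite3 modules, then use EXPLAIN QUERY PLAN to confirm the index is used.

import csv
import sqlite3

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS people (id TEXT PRIMARY KEY, name TEXT, city TEXT)")

with open("people.csv", newline="") as f:
    rows = ((r["id"], r["name"], r["city"]) for r in csv.DictReader(f))
    with con:  # a single transaction for the whole import
        con.executemany("INSERT OR REPLACE INTO people VALUES (?, ?, ?)", rows)

# Verify that lookups by id use the primary-key index rather than a full scan.
for step in con.execute("EXPLAIN QUERY PLAN SELECT * FROM people WHERE id = ?", ("42",)):
    print(step)

con.close()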

wierob