We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool, and we often have wide files (many columns) with a header row that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just run SQL queries against our text files. Since the analysis is usually ad hoc, the setup for querying new files would have to be minimal (just use the files I specify in this directory) rather than declaring particular tables in some config.

Practically speaking, what's the easiest option? That is, which SQL engine and driver is easiest to set up and use against plain text files?

+2  A: 

MySQL has a CSV storage engine, which might do what you need if your files are CSV files.

Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport that figures out the columns and creates the necessary table.
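
A rough sketch of such a wrapper (the database name 'scratch', the single-space separator, and the TEXT column type are assumptions on my part; mysqlimport names the table after the file):

#!/usr/bin/env python

"""Sketch of a mysqlimport wrapper: create the table from the header line,
then bulk-load the rest of the file with mysqlimport."""

import os
import subprocess
import sys

def load(filename, database="scratch"):
    with open(filename) as f:
        headers = f.readline().split()
    # mysqlimport derives the table name from the file name, so match it.
    table = os.path.splitext(os.path.basename(filename))[0]
    create = "CREATE TABLE IF NOT EXISTS `%s` (%s);" % (
        table, ", ".join("`%s` TEXT" % h for h in headers))
    subprocess.check_call(["mysql", "-e", create, database])
    # --ignore-lines=1 skips the header row; assumes single-space separators.
    subprocess.check_call(["mysqlimport", "--local", "--ignore-lines=1",
                           "--fields-terminated-by= ", database, filename])

if __name__ == "__main__":
    load(sys.argv[1])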

You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.

That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?

sleske
Yes, it really is easier keeping table-oriented data in text files. It's the ad-hoc nature of it, combined with all the command-line statistics tools we have: take this data, join it to that, cut away a few columns, pass it through an ANOVA. Setting up tables all the time, importing and exporting to get to our stats tools would be painful.
dfrankow
Agreed. The issue here is that I use data from many different sources, and the common medium is usually a whitespace-separated file. My job would be easier if I could, for clarity, give 'cut' a field name instead of a column number. Five minutes later I am done with that file, which makes an import into MySQL feel cumbersome.
mcassano
+2  A: 

Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?

Depending on the amount of data, performance could be acceptable.
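
A minimal sketch of that idea using Python's standard-library sqlite3 module (the whitespace separator and the table name "data" are assumptions):

#!/usr/bin/env python

"""Sketch: load a whitespace-separated file with a header row into an
in-memory SQLite database and run a single query against it."""

import sqlite3
import sys

def query(filename, sql):
    with open(filename) as f:
        headers = f.readline().split()
        rows = [line.split() for line in f]
    conn = sqlite3.connect(":memory:")
    # NUMERIC affinity so numeric-looking strings sort and compare as numbers.
    conn.execute("CREATE TABLE data (%s)" %
                 ", ".join("`%s` NUMERIC" % h for h in headers))
    conn.executemany("INSERT INTO data VALUES (%s)" %
                     ", ".join("?" * len(headers)), rows)
    for row in conn.execute(sql):
        print(row)

if __name__ == "__main__":
    query(sys.argv[1], sys.argv[2])

Invoked, for example, as (the script name is made up):

$ sqlmem.py foo "select count(*) from data"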

orip
Ugly, but it's the first thing that came to my mind too. Certainly good enough for prototyping and small jobs.
dmckee
+2  A: 

Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.

I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.

Sample invocation:

$ sql.py --file foo --sql "select count(*) from data"

The code:

#!/usr/bin/env python

"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code

def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"

def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o == "--file":
            filename = a
        elif o == "--sql":
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header.  If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo:
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)
    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile, create_statement
    print >>cmdfile, ".separator ' '"
    print >>cmdfile, ".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)

if __name__ == "__main__":
    sys.exit(main())
dfrankow
Cool, thanks dfrankow!
mcassano
+2  A: 

David Malcolm wrote a little tool named show, which allows you to use SQL-like command-line syntax to parse text files of various formats. It would be pretty easy to extend it to understand CSV or other tabular data.

An example on show's home page:

$ show "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source              |
--------+--------------------+
1633    |kernel              |
1324    |NetworkManager      |
98      |ntpd                |
70      |avahi-daemon        |
63      |dhclient            |
48      |setroubleshoot      |
39      |dnsmasq             |
29      |nm-system-settings  |
27      |bluetoothd          |
14      |/usr/sbin/gpm       |
13      |acpid               |
10      |init                |
9       |pcscd               |
9       |pulseaudio          |
6       |gnome-keyring-ask   |
6       |gnome-keyring-daemon|
6       |gnome-session       |
6       |rsyslogd            |
5       |rpc.statd           |
4       |vpnc                |
3       |gdm-session-worker  |
2       |auditd              |
2       |console-kit-daemon  |
2       |libvirtd            |
2       |rpcbind             |
1       |nm-dispatcher.action|
1       |restorecond         |
ephemient
A: 

I have used Microsoft LogParser to query CSV files several times, and it serves the purpose. It was surprising to see such a useful tool from M$, and free at that!
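
For reference, a typical invocation looks something like this (the file name here is made up, and the exact flags may differ between LogParser versions):

LogParser -i:CSV "SELECT COUNT(*) FROM foo.csv"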

PeterParker