We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool, and we often have wide files (many columns) with a header row that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just run SQL queries against our text files. Since the analysis is usually ad hoc, the setup for querying new files would have to be minimal (just use the files I specify in this directory) rather than declaring particular tables in some config.

Practically speaking, what's the easiest option? That is, which SQL engine and driver is easiest to set up and use against plain text files?

+2  A: 

MySQL has a CSV storage engine, which might do what you need if your files are CSV files.

Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport that figures out the columns and creates the necessary table.
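
A rough sketch of such a wrapper (the database name 'scratch', the single-space separator, and the TEXT column type are assumptions on my part; mysqlimport names the table after the file):

#!/usr/bin/env python

"""Sketch of a mysqlimport wrapper: create the table from the header line,
then bulk-load the rest of the file with mysqlimport."""

import os
import subprocess
import sys

def load(filename, database="scratch"):
    with open(filename) as f:
        headers = f.readline().split()
    # mysqlimport derives the table name from the file name, so match it.
    table = os.path.splitext(os.path.basename(filename))[0]
    create = "CREATE TABLE IF NOT EXISTS `%s` (%s);" % (
        table, ", ".join("`%s` TEXT" % h for h in headers))
    subprocess.check_call(["mysql", "-e", create, database])
    # --ignore-lines=1 skips the header row; assumes single-space separators.
    subprocess.check_call(["mysqlimport", "--local", "--ignore-lines=1",
                           "--fields-terminated-by= ", database, filename])

if __name__ == "__main__":
    load(sys.argv[1])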

You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.

That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?

sleske
Yes, it really is easier keeping table-oriented data in text files. It's the ad-hoc nature of it, combined with all the command-line statistics tools we have: take this data, join it to that, cut away a few columns, pass it through an ANOVA. Setting up tables all the time, importing and exporting to get to our stats tools would be painful.
dfrankow
Agreed. The issue here is that I use data from many different sources, and the common medium is usually a whitespace-separated file. My job would be easier if I could, for clarity, give 'cut' a field name instead of a column number. Five minutes later I am done with that file, which makes an import into MySQL feel cumbersome.
mcassano
+2  A: 

Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?

Depending on the amount of data, performance could be acceptable.
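
A minimal sketch of that idea using Python's standard-library sqlite3 module (the whitespace separator and the table name "data" are assumptions):

#!/usr/bin/env python

"""Sketch: load a whitespace-separated file with a header row into an
in-memory SQLite database and run a single query against it."""

import sqlite3
import sys

def query(filename, sql):
    with open(filename) as f:
        headers = f.readline().split()
        rows = [line.split() for line in f]
    conn = sqlite3.connect(":memory:")
    # NUMERIC affinity so numeric-looking strings sort and compare as numbers.
    conn.execute("CREATE TABLE data (%s)" %
                 ", ".join("`%s` NUMERIC" % h for h in headers))
    conn.executemany("INSERT INTO data VALUES (%s)" %
                     ", ".join("?" * len(headers)), rows)
    for row in conn.execute(sql):
        print(row)

if __name__ == "__main__":
    query(sys.argv[1], sys.argv[2])

Invoked, for example, as (the script name is made up):

$ sqlmem.py foo "select count(*) from data"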

orip
Ugly, but it's the first thing that came to my mind too. Certainly good enough for prototyping and small jobs.
dmckee
+2  A: 

Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.

I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.

Sample invocation:

$ sql.py --file foo --sql "select count(*) from data"

The code:

#!/usr/bin/env python

"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code

def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"

def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o == "--file":
            filename = a
        elif o == "--sql":
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header.  If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo:
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)
    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile, create_statement
    print >>cmdfile, ".separator ' '"
    print >>cmdfile, ".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)

if __name__ == "__main__":
    sys.exit(main())
dfrankow
Cool, thanks dfrankow!
mcassano
+2  A: 

David Malcolm wrote a little tool named show, which allows you to use SQL-like command-line syntax to parse text files of various formats. It would be pretty easy to extend it to understand CSV or other tabular data.

An example on show's home page:

$ show "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source              |
--------+--------------------+
1633    |kernel              |
1324    |NetworkManager      |
98      |ntpd                |
70      |avahi-daemon        |
63      |dhclient            |
48      |setroubleshoot      |
39      |dnsmasq             |
29      |nm-system-settings  |
27      |bluetoothd          |
14      |/usr/sbin/gpm       |
13      |acpid               |
10      |init                |
9       |pcscd               |
9       |pulseaudio          |
6       |gnome-keyring-ask   |
6       |gnome-keyring-daemon|
6       |gnome-session       |
6       |rsyslogd            |
5       |rpc.statd           |
4       |vpnc                |
3       |gdm-session-worker  |
2       |auditd              |
2       |console-kit-daemon  |
2       |libvirtd            |
2       |rpcbind             |
1       |nm-dispatcher.action|
1       |restorecond         |
ephemient
A: 

I have used Microsoft LogParser to query CSV files several times, and it serves the purpose. It was surprising to see such a useful tool from M$, and free at that!
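
For reference, a typical invocation looks something like this (the file name here is made up, and the exact flags may differ between LogParser versions):

LogParser -i:CSV "SELECT COUNT(*) FROM foo.csv"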

PeterParker