views:

369

answers:

6

Does anyone know of any tools to provide simple, fast queries of flat files using a SQL-like declarative query language? I'd rather not pay the overhead of loading the file into a DB since the input data is typically thrown out almost immediately after the query is run.

Consider the data file, "animals.txt":

dog 15
cat 20
dog 10
cat 30
dog 5
cat 40

Suppose I want to extract the highest value for each unique animal. I would like to write something like:

cat animals.txt | foo "select $1, max(convert($2 using decimal)) group by $1"

I can get nearly the same result using sort:

cat animals.txt | sort -t " " -k1,1 -k2,2nr

And I can always drop into awk from there, but this all feels a bit awkward (couldn't resist) when a SQL-like language would seem to solve the problem so cleanly.

I've considered writing a wrapper for SQLite that would automatically create a table based on the input data, and I've looked into using Hive in single-processor mode, but I can't help but feel this problem has been solved before. Am I missing something? Is this functionality already implemented by another standard tool?

Halp!

A: 

We'll I have a lightweight ORM for sqlite that would simplify this task without requiring any configuration files, etc.

If you can using PowerShell has a lot of powerful capabilities for parsing and querying text files (example here). Otherwise using .NET/Mono you can cut that up in and use LINQ in no time.

mythz
I'd like to be able to do everything right from the shell, which this solution doesn't seem to support.
plinehan
Which one? Powershell is bash on steroids which lets you do everything from the shell. As for OrmLite (which is what I would use), you write a few lines program that imports all the data to your db of choice then you can use the sqlite3.exe to query from the command prompt.
mythz
+1  A: 

Perl DBI using DBD::AnyData

harschware
A: 

you can use sqlite. Here's an example using Python.

import sqlite3
conn = sqlite3.connect('/tmp/test.db')
cursor = conn.cursor()
try:
    cursor.execute("""create table table1 (word varchar not null, number varchar not null)""")
except: pass
cursor.execute("insert into table1 values ('dog', '15')")
cursor.execute("insert into table1 values ('cat', '20')")
cursor.execute("insert into table1 values ('dog', '10')")
cursor.execute("select max(number) , word from table1 group by word")
print cursor.fetchall()

output

$ ./python.py
[(u'20', u'cat'), (u'15', u'dog')]
ghostdog74
I'd like to be able to do everything right from the shell, which this solution doesn't seem to support.
plinehan
+1  A: 

I just stumbled across this Python script which does something like what you want, although it only supports very basic queries.

David Johnstone
A: 

I never managed to find a satisfying answer to my question, but I did at least find a solution to my toy problem using uniqs "-f" option, which I had been unaware of:

cat animals.txt | sort -t " " -k1,1 -k2,2nr \
| awk -F' ' '{print $2, " ", $1}' | uniq -f 1

The awk portion above could, obviously, be skipped entirely if the input file were created with columns in the opposite order.

I'm still holding out hope for a SQL-like tool, though.

plinehan
+1  A: 

I wrote TxtSushi mostly to do SQL selects on flat files. Here is the command chain for your example (all of these commands are from TxtSushi):

tabtocsv animals.txt | namecolumns - | tssql -table animals - \
'select col1, max(as_int(col2)) from animals group by col1'

namecolumns is only required because animals.txt doesn't have a header row. You can get a quick sense of what is possible by looking through the example scripts. There are also links to similar tools on the bottom of the main page.

Keith
Very nice. How well does it scale? I'm hoping to deal with muti-gigabyte files which exceed the available RAM on my machine.
plinehan
It does any kind of row filtering or column selection using a streaming approach but as soon as you ask it to do anything requiring a sort (group by, join on, order by all require sorts) it wants to read the full table into memory. In this case you can give the -external-sort option which tells TxtSushi to sort on disk instead, but my current implementation of external sort is very inefficient and needs some work.
Keith