Is there a tool that lets me run SQL-like queries (counting, aggregating, joining, etc.) on CSV files without using a full-fledged database?
Preferably it would be some kind of command-line tool:
sqlcommandline "select count(*) from file1.csv where bladebla"
Not SQL, but take a look at my own OSS project CSVfix, which does some of what you want.
You could try AutoMate (a trial version is available), but using an automation tool for such a simple task seems excessive to me.
Partly it depends on what exactly you want to be able to do -- do you have some specific problems you'd like to solve? This kind of thing I tend to address using awk / grep, and if it gets complicated I'd write a script in Ruby or Python.
On the other hand, I recently had almost exactly the same requirement and solved it using MySQL. I also tried SQLite, but it was too slow (basically because I needed to run thousands of separate queries). SQLite is a good option, particularly if you're working with a language like Python, but I understand it works best if you can do one large query and then process the results in code, which might not really meet your needs.
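As a rough illustration of the "one large query, then process the results in code" approach with SQLite, here is a minimal Python sketch. The `load_csv` helper, the `scores` table, and the sample data are all made up for this example; the real column names would come from your own CSV header.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Load a CSV file object (header row first) into a new SQLite table."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers so header names containing spaces still work.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, marks),
        (row for row in reader if row),  # skip blank lines
    )


# Hypothetical sample data standing in for a real file such as file1.csv.
SAMPLE = "dept,score\na,10\nb,20\na,30\nb,5\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "scores", io.StringIO(SAMPLE))

# One grouped query replaces a separate COUNT/SUM query per department.
# CSV values arrive as text, so cast before doing arithmetic on them.
rows = conn.execute(
    "SELECT dept, COUNT(*), SUM(CAST(score AS INTEGER)) "
    "FROM scores GROUP BY dept ORDER BY dept"
).fetchall()

# Post-process the single result set in Python.
totals = {dept: {"count": n, "sum": total} for dept, n, total in rows}
print(totals)
```

For a real file, you would pass something like `open("file1.csv", newline="")` instead of the `io.StringIO` object.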
If you have the SQL Server tools installed, you might be able to use the included command-line query tool (osql.exe) and treat the file as an OLE DB data source.
I expect you want something a little simpler to deploy, though, and you didn't even mention whether you are on Windows.
You don't say what OS you're on, which would help. I suspect from your stated command line preference that you may be on *nix, though.
If you're on Windows, you could investigate using the Microsoft Text Driver to create a data source targeted on a directory. Then each CSV can be treated as a table.
If you're not afraid of Perl, try DBI with DBD::CSV.
http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm
However, I don't know whether this handles aggregate functions. It is best to import the data into a real database first (SQLite, as has been mentioned, is a good candidate).
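To make the "import it into a real database first" suggestion concrete, here is a small Python sketch that loads CSV data into an in-memory SQLite database and then runs a query with a join and an aggregate. The `query_csv` function and the sample `people` / `children` data are hypothetical and exist only for illustration.

```python
import csv
import io
import sqlite3


def query_csv(sql, tables):
    """Run `sql` against CSV data; `tables` maps table name -> file object."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, csv_file in tables.items():
            reader = csv.reader(csv_file)
            header = next(reader)  # first row supplies the column names
            columns = ", ".join('"' + h.replace('"', '""') + '"' for h in header)
            marks = ", ".join("?" for _ in header)
            conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
            conn.executemany(
                'INSERT INTO "{}" VALUES ({})'.format(table, marks),
                (row for row in reader if row),  # skip blank lines
            )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Hypothetical sample data; real use would open the CSV files instead.
people = io.StringIO("name,age\nann,35\nbob,50\n")
children = io.StringIO("name,child\nann,tom\nann,sue\nbob,joe\n")

# Values are imported as text, hence the CAST before the numeric comparison.
result = query_csv(
    "SELECT people.name, COUNT(children.child) "
    "FROM people JOIN children ON people.name = children.name "
    "WHERE CAST(people.age AS INTEGER) < 40 "
    "GROUP BY people.name",
    {"people": people, "children": children},
)
print(result)
```

With files on disk, the mapping could be built as `{"people": open("people.csv", newline=""), ...}`. Because every column is imported as text, numeric comparisons and sums need an explicit `CAST`, as in the query above.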
I've stumbled upon Pig and Hadoop. It looks like they could do the job. I'll investigate this some more too.
The Jet Database Engine is installed with Windows, or can be downloaded from http://support.microsoft.com/kb/239114/en-us. You can use it from a script to query a number of file types, for example:
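' Query a CSV file through the Jet text driver (VBScript).
Dim cn: Set cn = CreateObject("ADODB.Connection")
Dim rs: Set rs = CreateObject("ADODB.Recordset")
' Data Source is the folder holding the CSV files; each file acts as a table.
cn.Open _
"Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=c:\docs\;" & _
"Extended Properties=""text;HDR=Yes;FMT=Delimited"";"
rs.Open "SELECT * FROM Test.csv", cn
' GetString returns the whole result set as one string.
Dim a: a = rs.GetString
MsgBox a
rs.Close
cn.Close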
There is a Groovy script, gcsvsql, which does exactly what you are asking for, including joins. You can do things like:
gcsvsql "select name,age from /users/data/people.csv where age > 40"
gcsvsql "select sum(score) from people.csv where age < 40"
gcsvsql "select people.name, children.child from people.csv, children.csv where people.name = children.name and people.age < 40"
You can get it from Google code here: