ansaurus

Question

How to count many things with SQLite under Perl ?

Answer 1

+1 A:

This is really not a Perl question but a SQL question. In any case, you do not need the COUNT column as SQLite provides a builtin count function to do the counting for you:

SELECT str, countr(str) FROM mytable GROUP BY str

should give you each unique str and the number of times it appears in the table.

Of course, if you defined your table with str as the primary key, you cannot insert multiple strs by definition, so your table structure needs to be refined.

UPDATE:

If I were to do this (and I am not sure I would), I would set up a table with an autogenerated id column and a column for the string. SQLite's INTEGER PRIMARY KEY, a 64-bit integer would be sufficient to assign a unique id to each string inserted.

Then, I would use the query above to get the frequencies by string.

If you are inserting via Perl's DBI, make sure to turn AutoCommit off during insertion and remember to commit at the end (or periodically).

Creating an index seems almost mandatory but it should be done after all the strings are in the database and before any queries are run.

#!/usr/bin/perl

use strict; use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:counter.db', undef, undef,
    { RaiseError => 1, AutoCommit => 0 },
);

my $sth = $dbh->prepare(q{
    INSERT INTO strings (string) VALUES(?)
});

my @strings = qw( 0123456789 9876543210 );

for ( 1 .. 10 ) {
    my $val = $strings[0.5 > rand];
    $sth->execute($val);
}

$dbh->commit;

my $result = $dbh->selectall_hashref(
    q{SELECT string, count(string) FROM strings GROUP BY string},
    'string',
);

$dbh->disconnect;

use Data::Dumper;
print Dumper $result;

The SQL:

DROP TABLE strings;

CREATE TABLE strings (
    id INTEGER PRIMARY KEY,
    string char(10)
);

Output:

$VAR1 = {
          '9876543210' => {
                            'count(string)' => '9',
                            'string' => '9876543210'
                          },
          '0123456789' => {
                            'count(string)' => '1',
                            'string' => '0123456789'
                          }
        };

Sinan Ünür 2010-01-20 09:18:18

I get the impression that the strings are not in a database. He's planning to add them to a database as a means of counting them. In which case your proposal becomes: insert duplicates, then let SQL count them.

cjm 2010-01-20 09:22:14

So will `insert duplicates, then let SQL count them` be faster than `insert first then update` ?

Galaxy 2010-01-20 09:24:32

@Galaxy: We have no idea. You'd have to try it and see. It probably depends strongly on how many duplicates there are.

cjm 2010-01-20 09:33:27

If I try the `insert duplicates, then count`, when should I add an index on `str` ?The histogram of counts should be of a Poisson distribution.

Galaxy 2010-01-20 09:40:30

@cjm: Yes, exactly, *insert duplicates and let SQL count them*. Putting all the strings in a database has the advantage of being able to run other queries against them later without going through the import process again.

Sinan Ünür 2010-01-20 10:47:01

I have tested, `insert duplicates, then count` uses about 2/3 time of `insert first then update`.In the test, I added a index just before select, and I havenot test without the index.

Galaxy 2010-01-20 11:42:37

If there are too many different strings, the select itself will take too much memory, can this be reduced?

Galaxy 2010-01-20 17:04:09

Answer 2

A:

INSERT OR REPLACE is roughly equivalent to doing a DELETE on the unique constraint(s) using the values from the row to be inserted before doing the INSERT. It's useless for your purpose because you can't get the value of the counter from the old row. (The value for the new row is calculated before it finds out if there is an existing row to replace.)

If you expect that most strings will be unique (i.e., in most cases the UPDATE will do nothing), then it might be more efficient to do the INSERT first and issue the UPDATE only if that fails with a unique constraint error.

But as newt said, a hash would be faster unless you think you'll exceed your address space. (Even if you exceed available RAM, swapping might be faster than a database.)

cjm 2010-01-20 09:20:48

hash is not suitable as the strings take about 3Gx10=30GB, and I am planing to run on a 16GB machine.

Galaxy 2010-01-20 09:29:55

ansaurus

tags:

views:

answers:

How to count many things with SQLite under Perl ?

related questions