
I want to count a large number of strings (>3G), so I chose SQLite with a table of (str TEXT PRIMARY KEY, count INTEGER DEFAULT 1).

There are about 3G strings. Each is 40 characters at 2 bits per character, i.e. 40*2/8 = 10 bytes, so the whole set is about 30 GB. Ten bytes can encode 2^80 distinct values, which is far more than 3G.

So how can I update efficiently?

UPDATE table SET count = count + 1 WHERE str = 'xxx';
-- if no rows were affected, the string is new, so insert it:
INSERT INTO table (str) VALUES ('xxx');
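
For example, in DBI I imagine something roughly like this (just a sketch, assuming an open handle $dbh; counts.db is not my real file, the table name counts is a placeholder, and the table is assumed to already exist):

my $update = $dbh->prepare('UPDATE counts SET count = count + 1 WHERE str = ?');
my $insert = $dbh->prepare('INSERT INTO counts (str) VALUES (?)');

while ( my $str = <> ) {
    chomp $str;
    # execute() returns the number of rows changed ('0E0', i.e. zero but true,
    # when the UPDATE matched nothing), so zero means the string is new
    my $rows = $update->execute($str);
    $insert->execute($str) if $rows == 0;
}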

Or something like INSERT OR REPLACE, which I am not familiar with.

Any suggestions?


I followed Sinan Ünür's approach:

PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA temp_store = MEMORY;
PRAGMA auto_vacuum = NONE;
PRAGMA cache_size = 4000000;
CREATE TABLE kmers ( seq TEXT );

SELECT seq, COUNT(seq) FROM kmers GROUP BY seq;

No index is used, and AutoCommit is 0.

I have not tested whether journal_mode = OFF is actually faster; temp_store should make no difference here.
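
Put together, the loading step looks roughly like this (a sketch only: kmers.db, reading one kmer per line from STDIN, and the one-million-row batch size are just illustrative choices):

#!/usr/bin/perl
use strict; use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:kmers.db', undef, undef,
    { RaiseError => 1, AutoCommit => 1 },
);

# apply the pragmas before opening the big transaction
$dbh->do($_) for (
    'PRAGMA synchronous = OFF',
    'PRAGMA journal_mode = OFF',
    'PRAGMA temp_store = MEMORY',
    'PRAGMA auto_vacuum = NONE',
    'PRAGMA cache_size = 4000000',
);
$dbh->do('CREATE TABLE IF NOT EXISTS kmers ( seq TEXT )');

my $sth = $dbh->prepare('INSERT INTO kmers (seq) VALUES (?)');

$dbh->begin_work;                      # turn AutoCommit off for the bulk load
my $n = 0;
while ( my $seq = <STDIN> ) {
    chomp $seq;
    $sth->execute($seq);
    if ( ++$n % 1_000_000 == 0 ) {     # commit in large batches
        $dbh->commit;
        $dbh->begin_work;
    }
}
$dbh->commit;
$dbh->disconnect;

Committing in large batches avoids paying for a transaction per insert without letting a single transaction grow without bound.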

+1  A: 

This is really not a Perl question but a SQL question. In any case, you do not need the COUNT column as SQLite provides a builtin count function to do the counting for you:

SELECT str, count(str) FROM mytable GROUP BY str

should give you each unique str and the number of times it appears in the table.

Of course, if you defined your table with str as the primary key, you cannot insert the same str more than once by definition, so your table structure needs to be refined.

UPDATE:

If I were to do this (and I am not sure I would), I would set up a table with an autogenerated id column and a column for the string. SQLite's INTEGER PRIMARY KEY, a 64-bit integer, is sufficient to assign a unique id to each string inserted.

Then, I would use the query above to get the frequencies by string.

If you are inserting via Perl's DBI, make sure to turn AutoCommit off during insertion and remember to commit at the end (or periodically).

Creating an index seems almost mandatory but it should be done after all the strings are in the database and before any queries are run.
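
For example (the index name is arbitrary; the table is the strings table defined further down):

-- create the index only after the bulk load, before running the GROUP BY query
CREATE INDEX strings_string_idx ON strings (string);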

#!/usr/bin/perl

use strict; use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:counter.db', undef, undef,
    { RaiseError => 1, AutoCommit => 0 },   # no autocommit: one transaction until the explicit commit
);

my $sth = $dbh->prepare(q{
    INSERT INTO strings (string) VALUES(?)
});

my @strings = qw( 0123456789 9876543210 );

for ( 1 .. 10 ) {
    my $val = $strings[ 0.5 > rand ];   # boolean index picks element 0 or 1 at random
    $sth->execute($val);
}

$dbh->commit;

# count how many times each distinct string was inserted
my $result = $dbh->selectall_hashref(
    q{SELECT string, count(string) FROM strings GROUP BY string},
    'string',
);

$dbh->disconnect;

use Data::Dumper;
print Dumper $result;

The SQL:

DROP TABLE strings;

CREATE TABLE strings (
    id INTEGER PRIMARY KEY,
    string char(10)
);

Output:

$VAR1 = {
          '9876543210' => {
                            'count(string)' => '9',
                            'string' => '9876543210'
                          },
          '0123456789' => {
                            'count(string)' => '1',
                            'string' => '0123456789'
                          }
        };
Sinan Ünür
I get the impression that the strings are not in a database. He's planning to add them to a database as a means of counting them. In which case your proposal becomes: insert duplicates, then let SQL count them.
cjm
So will `insert duplicates, then let SQL count them` be faster than `insert first, then update`?
Galaxy
@Galaxy: We have no idea. You'd have to try it and see. It probably depends strongly on how many duplicates there are.
cjm
If I try `insert duplicates, then count`, when should I add an index on `str`? The histogram of counts should follow a Poisson distribution.
Galaxy
@cjm: Yes, exactly, *insert duplicates and let SQL count them*. Putting all the strings in a database has the advantage of being able to run other queries against them later without going through the import process again.
Sinan Ünür
I have tested it: `insert duplicates, then count` takes about 2/3 the time of `insert first, then update`. In the test I added an index just before the SELECT; I have not tested without the index.
Galaxy
If there are too many distinct strings, the SELECT itself will take too much memory. Can this be reduced?
Galaxy
A: 

INSERT OR REPLACE is roughly equivalent to doing a DELETE on the unique constraint(s) using the values from the row to be inserted before doing the INSERT. It's useless for your purpose because you can't get the value of the counter from the old row. (The value for the new row is calculated before it finds out if there is an existing row to replace.)
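
A quick illustration, using the (str, count) schema from the question (mytable is a placeholder name):

-- schema: str TEXT PRIMARY KEY, count INTEGER DEFAULT 1
INSERT OR REPLACE INTO mytable (str) VALUES ('xxx');   -- new row, count = 1
INSERT OR REPLACE INTO mytable (str) VALUES ('xxx');   -- old row replaced, count is 1 again, not 2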

If you expect that most strings will be unique (i.e., in most cases the UPDATE will do nothing), then it might be more efficient to do the INSERT first and issue the UPDATE only if that fails with a unique constraint error.
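
In DBI terms that could look roughly like this (a sketch only; mytable is a placeholder name, and RaiseError is assumed to be on so the duplicate-key INSERT throws and the eval catches it):

my $insert = $dbh->prepare('INSERT INTO mytable (str) VALUES (?)');
my $update = $dbh->prepare('UPDATE mytable SET count = count + 1 WHERE str = ?');

for my $str (@strings) {
    # the INSERT dies on a unique-constraint violation when RaiseError is on
    eval { $insert->execute($str) };
    # a failure means the row already existed, so bump its counter instead
    $update->execute($str) if $@;
}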

But as newt said, a hash would be faster, unless you think you'll exceed your address space. (Even if you exceed available RAM, swapping might be faster than a database.)
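
For comparison, the hash version is only a few lines (a sketch that reads one string per line from standard input):

#!/usr/bin/perl
use strict; use warnings;

my %count;
while ( my $str = <> ) {
    chomp $str;
    $count{$str}++;                      # key: the string, value: times seen
}

while ( my ($str, $n) = each %count ) {
    print "$str\t$n\n";
}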

cjm
A hash is not suitable, as the strings take about 3G x 10 bytes = 30 GB, and I am planning to run this on a 16 GB machine.
Galaxy