I am writing a test program with Ruby and ActiveRecord that reads a document about 6000 words long, and then tallies up the words with:

recordWord = Word.find_by_s(word)
if recordWord.nil?
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save

This part loops 6000 times... and it takes at least a few minutes to run using sqlite3. Is that normal? I was expecting it to run within a couple of seconds... can MySQL speed it up a lot?

+13  A: 

With 6000 calls to write to the database, you're going to see speed issues. I would save the various tallies in memory and save to the database once at the end, not 6000 times along the way.
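
For example, a minimal sketch of that approach, assuming a words array holding the document's words and the Word model from the question:

# Tally in memory first.
counts = Hash.new(0)
words.each { |word| counts[word] += 1 }

# Then one save per *distinct* word, instead of one per occurrence.
counts.each do |word, count|
  record = Word.find_by_s(word) || Word.new(:s => word, :count => 0)
  record.count = (record.count || 0) + count
  record.save
end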

acrosman
Even writing all the records at the end, it hits the database 6000 times... and took quite a while. I was thinking more like 10-15 seconds max. If I write all the data to a flat file, it should only take 1 or 2 seconds. Can't I tell sqlite3 not to "force write" to the db every time, and instead write once after all the records are inserted?
動靜能量
Look at my answer; using bulk insert should speed it way up.
James Avery
+1  A: 

What kind of database connection are you using? Some databases allow you to connect 'directly' rather than through a TCP connection that goes through the network stack. In other words, if you're sending your data over a network connection, that alone can slow things down.

Another way to boost performance of a database connection is to group SQL statements together in a single command.

For example, if you make a single 6,000-line SQL statement that looks like this

"update words set count = count + 1 where word = 'the'
update words set count = count + 1 where word = 'in'
...
update words set count = count + 1 where word = 'copacetic'"

and run it as a single command, performance will be a lot better. By default, MySQL has a 'packet size' limit of 1 megabyte, but you can raise that in the my.ini file if you want.

Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.

Another thing you could do is keep a count of words in memory and then only write the final total to the database, rather than doing an update every time you come across a word. That will cut the number of statements down a lot, because if you do an update every time you come across the word 'the', that's a huge waste. Words have a 'long tail' distribution and the most common words are hugely more common than the obscure ones. The underlying SQL would then look more like this:

"update words set count = 300 where word = 'the'
update words set count = 250 where word = 'in'
...
update words set count = 1 where word = 'copacetic'"

If you're worried about taking up too much memory, you could count words and periodically 'flush' them: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, consider issuing the SQL commands in batches directly.
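
ActiveRecord's execute generally takes one statement at a time, so here is a rough sketch in that spirit; the single transaction around the updates is where most of the win comes from. This assumes the counts hash described above, the asker's words table with columns s and count, and that every word already has a row:

Word.transaction do
  counts.each do |word, count|
    # Plain SQL update, bypassing per-record ActiveRecord saves.
    Word.connection.execute(
      "UPDATE words SET count = count + #{count} " +
      "WHERE s = #{Word.connection.quote(word)}"
    )
  end
end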

Chad Okere
It is using sqlite3... and if I use ActiveRecord, it looks like it will hit the db and the hard drive 6000 times and thus be quite slow. So I hope to write it all at once...
動靜能量
+2  A: 

I wrote up some quick code in Perl that simply does the following:

  1. Create the database
  2. Insert a record that only contains a single integer
  3. Retrieve that record and verify that it contains what was inserted

And it does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than going through an entire object/relational bridge. Even for this trivial case with SQLite it took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on "traditional hardware."

Using the system monitor I verified that it was primarily disk activity slowing things down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:

  1. Do what people have suggested and find a way around the requirement
  2. Try buying some solid-state disks.

I think #1 is a good way to start :)

Code:

#!/usr/bin/perl

use warnings;
use strict;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');

create_database($dbh);
insert_data($dbh);

sub insert_data {
  my ($dbh) = @_;

  my $insert_sql = "INSERT INTO test_table (test_data) values (?)";
  my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";

  my $insert_sth = $dbh->prepare($insert_sql);
  my $retrieve_sth = $dbh->prepare($retrieve_sql);

  my $i = 0;
  while ($i++ < 6000) {    # 6000 iterations; $i is 1..6000 inside the loop
     $insert_sth->execute(($i));
     $retrieve_sth->execute(($i));

     my $hash_ref = $retrieve_sth->fetchrow_hashref;

     die "bad data!" unless $hash_ref->{'test_data'} == $i;
  }
}

sub create_database {
   my ($dbh) = @_;

   # Drop any existing table; this fails harmlessly on the first run,
   # when the table does not exist yet.
   my $status = $dbh->do("DROP TABLE test_table");
   if (!defined $status) {
     print "DROP TABLE failed\n";
   }

   my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
   $create_statement .= "test_data varchar(255)\n";
   $create_statement .= ");";

   $status = $dbh->do($create_statement);

   # abort if CREATE resulted in an error
   if (!defined $status) {
     die "CREATE failed";
   }
}
earino
Modern databases can easily deal with 6k inserts in a second if they're set up properly, with the right connection types, indexes, disk caching, etc.
Chad Okere
+1  A: 

Without knowing about Ruby and Sqlite, some general hints:

create a unique index on Word.s (you did not state whether you have one)

define a default for Word.count in the database (DEFAULT 1); a migration sketch covering both of these is below

optimize assignment of count:

recordWord = Word.find_by_s(word)
if recordWord.nil?
    recordWord = Word.new
    recordWord.s = word
    recordWord.count = 1
else
    recordWord.count += 1
end
recordWord.save
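
For the first two hints, a possible migration sketch, assuming the usual Rails conventions for a words table backing the Word model (adjust names to your schema):

# Hypothetical migration adding the unique index and the count default.
class AddIndexAndDefaultToWords < ActiveRecord::Migration
  def self.up
    add_index :words, :s, :unique => true
    change_column_default :words, :count, 1
  end

  def self.down
    remove_index :words, :s
    change_column_default :words, :count, nil
  end
end
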
devio
+5  A: 

Take a look at AR::Extensions as well to handle the bulk insertions.

http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/
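
With it, a bulk insert is roughly as follows; this sketch leans on the import method the gem provides, assuming the in-memory counts hash from the other answers (check the linked article for the exact API of your version):

require 'ar-extensions'

# Build one multi-row insert instead of 6000 single-row ones.
columns = [:s, :count]
values  = counts.map { |word, count| [word, count] }
Word.import columns, values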

James Avery
+1  A: 

Use BEGIN TRANSACTION before your updates, then COMMIT at the end.
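
In ActiveRecord terms that means wrapping the whole loop in a transaction, so SQLite syncs to disk once rather than once per save. A minimal sketch, assuming the words array and the loop body from the question:

Word.transaction do
  words.each do |word|
    # ... the find-or-create and save code from the question ...
  end
end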

A: 

OK, I found some general rules:

1) use a hash to keep the counts first, not the db
2) at the end, wrap all the inserts or updates in one transaction, so that it won't hit the db 6000 times
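
Putting both rules together, a minimal sketch (assuming a words array and the Word model from the question):

counts = Hash.new(0)
words.each { |word| counts[word] += 1 }   # rule 1: tally in memory

Word.transaction do                       # rule 2: a single transaction
  counts.each do |word, count|
    recordWord = Word.find_by_s(word) || Word.new(:s => word, :count => 0)
    recordWord.count = (recordWord.count || 0) + count
    recordWord.save
  end
end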

動靜能量