I've got a largish (~1.5M records) table that holds text strings of varying length, which I query for matches:

CREATE TABLE IF NOT EXISTS `shingles` (
  `id` bigint(20) NOT NULL auto_increment,
  `TS` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  `shingle` varchar(255) NOT NULL,
  `count` int(11) NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `shingle` (`shingle`,`TS`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1571668;

My problem is that while I'm doing comparisons against this table I am constantly adding and removing data from it, so maintaining indexes is hard.

I'm looking for best practices for managing the inserts in a timely fashion while maximizing the throughput for the selects. This process is running 24hrs a day and needs to be as quick as possible.

Any help is appreciated.

Update: To clarify, I'm doing one to one matches on the 'shingle' column, not text searches within it.

A: 

Hi jqs,

For starters, use InnoDB instead of MyISAM. That'll solve the problem of doing queries while you also do inserts.

You might need to tweak your MySQL configuration a bit so the memory goes to InnoDB (innodb_buffer_pool_size instead of key_buffer_size).
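
As a rough sketch of that switch, assuming the `shingles` table from the question, the statements below convert the table and show the two memory settings mentioned above. The settings themselves normally live in the server's configuration file, and the right sizes depend on how much RAM the machine has, so no values are suggested here.

```sql
-- Rebuild the table under InnoDB. This rewrites the whole table,
-- so expect it to take a while on ~1.5M rows.
ALTER TABLE shingles ENGINE=InnoDB;

-- Check how much memory each engine's cache is currently given.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB data and index cache
SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM index cache
```

If the buffer pool is still at a small default after the conversion, InnoDB will read from disk far more than it needs to, which can make it look slower than MyISAM.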

Ask Bjørn Hansen
I've already tried moving to InnoDB but that actually slowed things down. I'm now profiling my queries and attempting other performance improvements.
jqs
Did you make sure to configure the MySQL server appropriately for InnoDB? In 97% of all installations InnoDB will be faster (yeah, the number was made up, but I bet it's not far off...)
Ask Bjørn Hansen
+1  A: 

First: your bigint primary key could be killing you here. It's an expensive type to maintain (8 bytes per value versus 4 for an int), and 1.5 million records is nowhere near the limit for unsigned int (~4.29 billion).

Using a bigint for a primary key is even worse in InnoDB, as it stores the PK against each entry in every other index, so that could partially explain the problems when you tried switching. As soon as you're adding and deleting from the table, MyISAM is gonna get screwed if there are a lot of concurrent clients, because it locks the whole table for writes.

A trick to get around the expense of string comparisons is to store crc32(shingle) as well as shingle. You then index this column, but not your varchar. Something like below is how I'd do it:

CREATE TABLE IF NOT EXISTS `shingles` (
  `id` int unsigned NOT NULL auto_increment,
  `TS` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  `crc` int unsigned NOT NULL,  -- crc32() of `shingle`
  `shingle` varchar(255) NOT NULL,
  `count` int(11) NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `crc` (`crc`)
);

-- Store the checksum alongside each string.
insert into shingles (crc, shingle, `count`) values (crc32('testtest'),'testtest',1),(crc32('foobar'),'foobar',4);

-- The crc index narrows the search; the shingle comparison weeds out crc32 collisions.
select * from shingles where crc = crc32('foobar') and shingle = 'foobar';

If you intend to query on 'TS' then add it as the second component of the crc index.
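
A minimal sketch of that change, assuming the table above already exists with its single-column `crc` index. The date in the query is only a placeholder.

```sql
-- Swap the single-column crc index for a composite (crc, TS) index in one statement.
ALTER TABLE shingles DROP INDEX crc, ADD INDEX crc (crc, TS);

-- A lookup that can use both parts of the index.
select * from shingles
where crc = crc32('foobar') and TS >= '2009-01-01 00:00:00' and shingle = 'foobar';
```

Putting `crc` first keeps the index usable for the plain equality lookups as well, since MySQL can use the leftmost column of a composite index on its own.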

J.D. Fitz.Gerald