views: 59
answers: 2

I have a mid-size collection of records (about 20 million) that I need to load into MySQL for use in data analysis. These happen to be records of people visiting places. They are uniquely identified by three data items:

  • place - a unique INT
  • person - a character string, sometimes numeric and sometimes alphanumeric, e.g. AB12345678
  • visit - similar to person

I don't have any control over the person and visit field contents, as these are provided by the different places, and each place does their own thing.

I can find all the records for a person by matching both place and person, and an individual record by matching all three.

I can make this work fine in MySQL by creating a table like this:

CREATE TABLE ENCOUNTER (
  PLACE int(11) NOT NULL,
  PERSON varchar(255) NOT NULL,
  VISIT varchar(255) NOT NULL,
  ARRIVAL_TIME datetime DEFAULT NULL,
  DEPARTURE_TIME datetime DEFAULT NULL,
  EVENT varchar(255) NOT NULL,
  PRIMARY KEY (PLACE,PERSON,VISIT)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
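
For concreteness, the two lookups described above would look roughly like this; the literal values (42, 'AB12345678', 'V0001') are made-up placeholders, not real data:

-- all visits by one person at one place: uses the (PLACE, PERSON) prefix of the primary key
SELECT * FROM ENCOUNTER WHERE PLACE = 42 AND PERSON = 'AB12345678';

-- one specific visit record: uses the full primary key
SELECT * FROM ENCOUNTER WHERE PLACE = 42 AND PERSON = 'AB12345678' AND VISIT = 'V0001';

Both WHERE clauses are leftmost prefixes of the composite primary key, so neither lookup needs an extra index.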

I went with MyISAM because I don't need ACID transactional integrity on this table; it's used for statistical reporting so if it's one or two rows stale it's no problem.

The table will often get hit with UPDATEs that simply change one of the fields, like DEPARTURE_TIME. These UPDATEs will most likely be about twice as frequent as new row INSERTs. There won't be a need to update place, person, or visit identifiers.
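
A typical update would be a single-row change addressed through the primary key, along these lines (again with placeholder values):

-- record the departure time for one existing visit; the WHERE clause
-- names the full primary key, so exactly one row is touched
UPDATE ENCOUNTER
   SET DEPARTURE_TIME = '2012-06-01 14:30:00'
 WHERE PLACE = 42 AND PERSON = 'AB12345678' AND VISIT = 'V0001';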

Here's some questions:

Would I be better off, performance-wise, with a single index and key column concatenating the place/person/visit information?

How much of a hit do I take for varchar indexes? Is it worth trying to constrain them to a fixed-length field?

Any other suggestions from the collected wisdom?

Thanks.

A: 

Your indexes are correct. You're not going to be able to do much better than that.

This is a perfect, non-obvious opportunity to use partitions. I have a feeling that all of your analysis is going to be based on place. If so, partition the table by a hash of the place column, like so:

ALTER TABLE encounter PARTITION BY KEY(place) PARTITIONS 12;

This will make your queries much faster, since MySQL knows it only needs to look at roughly 1/12th of the rows (a single partition) when doing an analysis for one place, and can skip the rest.
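
If you want to confirm that pruning actually happens, EXPLAIN PARTITIONS (available in MySQL 5.1 through 5.6; later versions show a partitions column in plain EXPLAIN) reports which partitions a query will touch; the place value here is just an example:

EXPLAIN PARTITIONS
SELECT COUNT(*) FROM encounter WHERE place = 42;
-- the partitions column should list a single partition (e.g. p3),
-- confirming the other eleven are skipped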

longneck
Thanks. Actually most of the analysis is likely to be based on one of the timestamps in my schema. The trouble is, they can get changed after the row gets created. But partitioning on place is a very good idea.
Ollie Jones
A: 

I can find all the records for a person by matching both place and person, and an individual record by matching all three.

If you are going to search for all places a person visited, you'll need to make an additional index on (person, place).
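
A sketch of that extra index, should it turn out to be needed (the index name is arbitrary):

ALTER TABLE ENCOUNTER ADD INDEX person_place (PERSON, PLACE);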

How much of a hit do I take for varchar indexes? Is it worth trying to constrain them to a fixed-length field?

A key hit (the searched value is found in the index) takes about the same time for INT and VARCHAR keys.

A key miss is more expensive for VARCHAR keys.
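
If the real person and visit identifiers are known to be much shorter than 255 characters, one option is to tighten the declared lengths so that worst-case key sizes and in-memory temporary tables stay smaller; the 32 below is only an illustrative guess at the real maximum, so treat this as something to benchmark rather than a guaranteed win:

ALTER TABLE ENCOUNTER
  MODIFY PERSON VARCHAR(32) NOT NULL,
  MODIFY VISIT VARCHAR(32) NOT NULL;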

Quassnoi
Thanks for this info, Q! I can't search for all the places a person visited, because each place assigns its own person IDs. Place = a hospital; Person = hospital-assigned medical record number; Visit = hospital-assigned patient visit identifier.
Ollie Jones