I have a mid-size collection of records (about 20 million) that I need to load into MySQL for use in data analysis. These happen to be records of people visiting places. They are uniquely identified by three data items:
- place - a unique INT
- person - a character string, sometimes numeric and sometimes alphanumeric, e.g. AB12345678
- visit - similar to person
I don't have any control over the person and visit field contents, as these are provided by the different places, and each place does its own thing.
I can find all the records for a person by matching both place and person, and an individual record by matching all three.
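To make the access pattern concrete, the lookups look something like this (the literal values here are made up for illustration):

```sql
-- All records for one person at one place
SELECT * FROM ENCOUNTER
 WHERE PLACE = 42 AND PERSON = 'AB12345678';

-- A single record: match all three identifiers
SELECT * FROM ENCOUNTER
 WHERE PLACE = 42 AND PERSON = 'AB12345678' AND VISIT = 'V0001';
```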
I can make this work fine in MySQL by creating a table like this:
CREATE TABLE ENCOUNTER (
PLACE int(11) NOT NULL,
PERSON varchar(255) NOT NULL,
VISIT varchar(255) NOT NULL,
ARRIVAL_TIME datetime DEFAULT NULL,
DEPARTURE_TIME datetime DEFAULT NULL,
EVENT varchar(255) NOT NULL,
PRIMARY KEY (PLACE,PERSON,VISIT)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
I went with MyISAM because I don't need ACID transactional integrity on this table; it's used for statistical reporting, so if it's a row or two stale, that's no problem.
The table will often get hit with UPDATEs that simply change one of the fields, like DEPARTURE_TIME. These UPDATEs will most likely be about twice as frequent as new-row INSERTs. There won't be any need to update the place, person, or visit identifiers.
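A typical update just sets one field on an existing row, identified by the full three-part key (values illustrative):

```sql
UPDATE ENCOUNTER
   SET DEPARTURE_TIME = '2014-06-01 17:30:00'
 WHERE PLACE = 42 AND PERSON = 'AB12345678' AND VISIT = 'V0001';
```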
Here are my questions:
Would I be better off, performance-wise, with a single indexed key column concatenating the place/person/visit information?
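To be clear, the alternative I have in mind would look something like this (the column name and separator character are arbitrary; I'd build the value at load time):

```sql
CREATE TABLE ENCOUNTER (
 PLACE_PERSON_VISIT varchar(255) NOT NULL,  -- e.g. '42|AB12345678|V0001'
 ARRIVAL_TIME datetime DEFAULT NULL,
 DEPARTURE_TIME datetime DEFAULT NULL,
 EVENT varchar(255) NOT NULL,
 PRIMARY KEY (PLACE_PERSON_VISIT)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
```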
How much of a hit do I take for varchar indexes? Is it worth trying to constrain them to fixed-length fields?
Any other suggestions from the collected wisdom?
Thanks.