I'm working with a 12-million-record MyISAM table with surname, address, gender, and birthdate fields:
ID  SURNAME  GENDER  BDATE       COUNTY  ADDRESS         CITY
1   JONES    M       1954-11-04  015     51 OAK ST       SPRINGFIELD
2   HILL     M       1981-02-16  009     809 PALM DR     JONESVILLE
3   HILL     F       1979-06-23  009     809 PALM DR     JONESVILLE
4   HILL     F       1941-10-11  009     809 PALM DR     JONESVILLE
5   SMITH    M       1914-07-27  035     1791 MAPLE AVE  MAYBERRY
6   SMITH    F       1954-02-05  035     1791 MAPLE AVE  MAYBERRY
7   STEVENS  M       1962-05-05  019     404 CYPRESS ST  MAYBERRY
.   .        .       .           .       .               .
.   .        .       .           .       .               .
.   .        .       .           .       .               .
The SURNAME, BDATE, and ADDRESS fields are indexed. My goal is to append a field for inferred marital status, defined by the following criteria: for each record, if another record exists in the table with (1) an identical surname, (2) a different gender, (3) an identical address and city, and (4) an age difference of less than 15 years, set MARRIED = 'T'; otherwise set MARRIED = 'F'. In the sample rows above, only records 2 and 3 would qualify.
As a SQL novice, my initial approach was to add a MARRIED column that defaults to 'F' and then use a self-join to set MARRIED = 'T' for matching records:
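-- Add the flag column; every record starts out as not married
ALTER TABLE MY_TABLE
  ADD COLUMN MARRIED CHAR(1) NOT NULL DEFAULT 'F';

-- Self-join: flag T1 when any other record T2 meets all criteria
UPDATE MY_TABLE T1, MY_TABLE T2
SET T1.MARRIED = 'T'
WHERE T1.SURNAME = T2.SURNAME
  AND T1.GENDER != T2.GENDER
  AND T1.ADDRESS = T2.ADDRESS
  AND T1.CITY = T2.CITY
  -- Age gap approximated by the difference in birth years
  AND ABS(YEAR(T1.BDATE) - YEAR(T2.BDATE)) < 15;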
While this works fine on small tables, I learned quickly that I'll probably retire before this process completes on a 12-million row table. My SQL knowledge is very limited, so I'm sure this is a sub-optimal approach. Any suggested alternatives? Perhaps indexing SURNAME + ADDRESS + CITY? Grouping by ADDRESS + CITY first? Better table design? Any suggestions would be appreciated.
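For concreteness, the composite index I have in mind would look something like the sketch below. The index name is just a placeholder, and this assumes the three columns are short enough to fit within MyISAM's key-length limit (otherwise prefix lengths would be needed):

```sql
-- Hypothetical composite index covering the equality columns of the self-join.
-- IDX_SURNAME_ADDR_CITY is a placeholder name, not an existing index.
ALTER TABLE MY_TABLE
  ADD INDEX IDX_SURNAME_ADDR_CITY (SURNAME, ADDRESS, CITY);
```

The idea would be that the join could look up candidate matches on all three columns at once; whether the optimizer actually uses the index would need to be checked with EXPLAIN.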
Thanks in advance for your help!