views:

42

answers:

3

I'm working with a 12-million record MyISAM table with surname, address, gender and birthdate fields:

ID  SURNAME  GENDER       BDATE  COUNTY         ADDRESS         CITY
 1    JONES       M  1954-11-04     015       51 OAK ST  SPRINGFIELD
 2     HILL       M  1981-02-16     009     809 PALM DR   JONESVILLE
 3     HILL       F  1979-06-23     009     809 PALM DR   JONESVILLE
 4     HILL       F  1941-10-11     009     809 PALM DR   JONESVILLE
 5    SMITH       M  1914-07-27     035  1791 MAPLE AVE     MAYBERRY
 6    SMITH       F  1954-02-05     035  1791 MAPLE AVE     MAYBERRY
 7  STEVENS       M  1962-05-05     019  404 CYPRESS ST     MAYBERRY
 .        .       .           .       .               .
 .        .       .           .       .               .
 .        .       .           .       .               .

Surname, bdate, and address fields are indexed. My goal is to append a field for inferred marital status, defined by the following criteria: For each record, if another record exists in the table with (1) an identical surname, (2) a different gender, (3) an identical address, and (4) an age difference of less than 15 years, set married = T; else set married = F.

Being a SQL novice, my initial approach was to add a marital field that defaults to 'F' and then use a self-join to set MARRIED = T.

ALTER TABLE MY_TABLE
ADD COLUMN MARRIED CHAR(1) NOT NULL DEFAULT 'F';

UPDATE MY_TABLE T1, MY_TABLE T2
SET T1.MARRIED = 'T' WHERE
  T1.SURNAME = T2.SURNAME AND
  T1.GENDER != T2.GENDER AND
  T1.ADDRESS = T2.ADDRESS AND
  T1.CITY    = T2.CITY AND
  ABS(YEAR(T1.BDATE)-YEAR(T2.BDATE)) < 15;

While this works fine on small tables, I learned quickly that I'll probably retire before this process completes on a 12-million row table. My SQL knowledge is very limited, so I'm sure this is a sub-optimal approach. Any suggested alternatives? Perhaps indexing SURNAME + ADDRESS + CITY? Grouping by ADDRESS + CITY first? Better table design? Any suggestions would be appreciated.

Thanks in advance for you help!

A: 

Well, indexing all the fields in your WHERE clause would definitely spead up the query.

This means SURNAME, GENDER, ADDRESS, CITY and BDATE.

Annother thing you could try is defining the rules to narrow down the result in the ON part:

UPDATE MY_TABLE T1
  LEFT JOIN MY_TABLE T2
  ON T1.SURNAME = T2.SURNAME
    AND T1.GENDER != T2.GENDER
    AND T1.CITY   = T2.CITY
  SET T1.MARRIED = 'T'
  WHERE ABS(YEAR(T1.BDATE)-YEAR(T2.BDATE)) < 15;
JochenJung
+1  A: 

Watch out for brothers & sisters!

Jaydee
Indeed. This is an absurd idea. For many years, I lived at the same address as 4 women with my surname within 15 years of my age, and was married to none of them. And marriage to them would be quite illegal in this country.
Don Roby
While I agree with both sentiments, this is a comment--not an answer.
OMG Ponies
+1  A: 

I would try a couple of variations to see which performs the best:

Version 1 using a simple Exists but using Date_Add instead of the ABS value function:

Update My_Table
Set Married = 'T'
Where Exists    (
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And (
                        T2.BDate > Date_Add(My_Date.BDate, Interval 15 Year)
                        Or T2.BDate < Date_Add(My_Date.BDate, Interval -15 Year)
                        )
                )

Version 2 using a UNION ALL

Update My_Table
Set Married = 'T'
Where Exists    (
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And T2.BDate > Date_Add(My_Date.BDate, Interval 15 Year)
                Union All
                Select 1
                From My_Table As T2
                Where T2.SurName = My_Table.SurName
                    And T2.Gender != My_Table.Gender
                    And T2.Address = My_Table.Address
                    And T2.City = My_Table.City
                    And T2.BDate < Date_Add(My_Date.BDate, Interval -15 Year
                )

Version 3 using an Inner Join and Date_Add

Update My_Table As T1
    Join My_Table As T2
            On T2.SurName = T1.SurName
                And T2.Gender != T1.Gender
                And T2.Address = T1.Address
                And T2.City = T1.City
Set Married = 'T'
Where T1.BDate > Date_Add(T2.BDate, Interval 15 Year)
        Or T1.BDate < Date_Add(T2.BDate, Interval -15 Year)

Stepping back from the SQL, I think that trying to deduce whether two people are married based on the information provided is going to be fraught with problems. It does not account for couples that have a larger age variance than 15 years (Anna Nicole Smith anyone?) nor does it account for siblings nor does it account for two people that marry but do not change their surnames.

Thomas