views:

44

answers:

2

Hello,

I have a table with the following structure:

WorkerPersons
-------------------------------
ID          (PK)
PersonID    (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------

As you can see, it's a denormalized table, which holds both Worker and Person (and many versions of one Person) data in one table. My wish is to normalize that table, however, as the table holds a lot of data (many many columns), I need to be sure which columns should go to Workers table and which columns to Persons table. The outcome should be like this:

Workers                 Persons
----------------------- ---------------------
ID                      ID
PersonID (now a FK)     PersonColumn1
WorkerColumn1           PersonColumn2
WorkerColumn2           ...
...                     PersonColumnN
WorkerColumnN
----------------------- ---------------------

To do that, I need to analyze which data differs in scope of Person over all unique Persons (wich are separated by PersonID in WorkerPersons). For example:

WorkerPersons
-------------------------------------------------------
ID      PersonID      Column1      Column2      Column3
-------------------------------------------------------
1       PersonA       10.1         John Doe     Single
2       PersonA       10.1         John Doe     Single
3       PersonA       10.1         John Doe     Married
4       PersonB       09.2         Sully        Single
5       PersonB       09.2         Sullivan     Single

In this case, there are 3 versions on PersonA and 2 versions of PersonB. Column1 values are always the same over all versions of Person, and we can move that column to table Worker. But Column 2 and Column3 values change over different versions of Person, so those values should be moved to Person table.

No imagine, I have about 10 tables like this that need to be normalized, with about 40 columns in each. Eeach table holds about 500k to 5m rows.

I need a script that helps me analyse which columns to move where. I need a script that outputs all columns that change in scope of unique Person over the whole table. I've no ideas however how to do that. I experimented with LAG analytical function to compare against the next row but how in the world to output changed columns is beyond me.

Please advise.

Best wishes, Andrew

+2  A: 

Since 10 tables is not a lot, here is (some sort of) pseudo code

for each table_name in tables
  for each column_name in columns
    case (exists (select 1
          from table_name
          group by PersonID
          having min(column_name) = max(column_name))
       when true then 'Worker'
       when false then 'Person'
    end case
  end for
end for

with information schema and dynamic queries you could make the above proper PL/SQL or take the core query and script it in your favourite language.

EDIT: The above assumes no NULLs in column_name.

EDIT2: Other variants of the core query can be

SELECT 1
FROM 
(SELECT COUNT(DISTINCT column_name) AS distinct_values_by_pid
FROM table_name
GROUP BY PersonID) T
HAVING MIN(distinct_values_by_pid) = MAX(distinct_values_by_pid)

Which will return a row if all values per PersonID are the same. (this query also has problems with NULLS, but I consider NULLs a separate issue; you can always cast a NULL to some out-of-domain value for purposes of the above query)

The above query can be also written as

SELECT MIN(c1)=MAX(c1), MIN(c2)=MAX(c2), ...
FROM 
(SELECT COUNT(DISTINCT column_name_1) AS c1, COUNT(DISTINCT column_name_2) AS c2, ...
FROM table_name
GROUP BY PersonID) T

Which will test multiple columns at the same time returning true for columns that belong to 'Workers' and false for columns that should go into 'Persons'.

Unreason
A: 

Thanks, but I solved it by letting Excel create series of selects over table schema information. The final query that it generated was a long list of selects but it works (although it run over a hour). The "core query" (actually a formula in Excel to create to core query):

=IF(AND(C17<>"CLOB";C17<>"NCLOB");"SELECT '"&A17&".'||initcap('"&B17&"') description, decode(count(*),0,'SAME OVE VERSIONS','DIFFERENT OVER VERSIONS') values FROM (SELECT objektid, count(DISTINCT nvl("&B17&","&IF(C17="DATE";"'01.02.0004'";IF(C17="VARCHAR2";"'!#¤¤%¤(%#¤%AS'";"-1234561"))&")) OVER (PARTITION BY objectid) arv FROM "&A17&") WHERE number > 1 union all";"SELECT '"&A17&".'||initcap('"&B17&"') description, 'CLOB field' values from dual union all")