views:

77

answers:

1

I have a Data intensive problem which requires a lot of massaging and data manipulation and I'm putting this out there to see if anyone has an idea as to how to approach it.

In simplest form. I have a lot of tables which can be joined together to give me a price listing for dentists and how much each charges for a procedure.

so we have multiple tables that looks like this.

Dentist | Procedure1 | Procedure2 | Procedure3 | .........| Procedure?
John    | 500        | 342        | 434        | .........| 843
Dave    | 343        | 434        | 322        | NULLs....|
Mary    | 500        | 342        | 434        | .........| 843
Linda   | 500        | 342        | Null       | .........| 843

Dentists can have different number of procedures and different pricing for each procedures. But there are a lot of Dentists that have the same number of procedures and the same rates that goes with it. Internally, we create a unique ID for each of these so-called fee listings.

like John would be 001, Dave would be 002, but Mary would be fee 001 and Linda would be 003 It's not so bad if I have to deal with this data once but these fee listings comes in flat files (csvs) which i basically have to DTS up to a SQL server to work with. and they come on a monthly bases. The pricing could change from month to month for each dentist which then would put them in a different unique ID internally.

Can someone shed some light on as to how to best approach this problem so that it's most efficient to process on a monthly basis without having to do tons of data manipulation?

  1. what's the best approach to finding out the duplicates of the fee listings?
  2. How do i keep track of updating a Dentist's fee listing incase they change their rates the next month? if Mary decides to charge a different fee for procedure2, then she would have a different unique ID internally. how do i keep track of that on a monthly bases without having to delete everything and re-insert?
  3. There are a few million fee listings that I'm working with and some have standard rules that are based on zipcodes and some are just unique fee listings, what's the approach here?
  4. I can write some kind of ad-hoc .net program to work with it but it's a lot of data and working straight in SQL server would be easier for me.

any help would be great, thanks guys.

+1  A: 

You probably need to unpivot the data to normalize it - so that you end up with:

Doctor: DoctorID, DoctorDetails...
FeeSchedule: DoctorID, ScheduleID, EffectiveDate, OtherDetailAtThisLevel...
FeeScheduleDetail: ScheduleID, ProcedureCode, Fee, OtherDetailAtThisLevel...

When the data comes in for a doctor, it is pivoted, a new schedule is created and the detail rows are created from the unpivoted data.

SSIS has an unpivot component which is fine - you would load the schedule first and then the detail. If the format varies significantly, you might need a custom data source or just avoid SSIS.

This system would keep track of new schedules for doctors. If the schedule is identical for a doctor, you could simply not insert it.

If this logic is extensive, you could load the data to staging tables (SSIS or whatever) and do all this in SQL (T-SQL also has an UNPIVOT operator). That can have advantages in that the code is all in one place and can do all its operations in sets.

Regarding the zip codes, if the doctor doesn't have a fee, are these like usual and customary fee? This could simply be determined from the zip code of the doctor row. In this case you have a few options. You can overlay the doctor fee schedule over a zip code fee schedule:

ZipCodeSchedule: ZipScheduleID, ZipCode, EffectiveDate
ZipCodeScheduleDetail: ZipScheduleID, ProcedureCode, Fee

Or you could save this in the regular feeschedule (potentially with some kind of flag that it was defaulted to the UCR).

Cade Roux
Unfortunately, we created internal unique IDs for these Fee listings. so if a doctor shares the same Fee listings, they would have the same internal unique ID. that was a failed attempt at normalization. but that's for individual cases. now we have regional fee listings which are based on zipcodes. now if these fees schedule listings changes, there's really no way of keeping track of them because same zips can have different fee listings depending on the type of fee listings.
stevenjmyu
@megatoast, In that case, my zipcode schedule would need to have a unique ekey on ZipCode, FeeType, EffectiveDate
Cade Roux