I have two sets of data which I need to join, but there is an added problem because the data quality is not great.
The two data sets are Calls (phone calls) and Communications (records created about phone calls). They have IDs called call_id and comm_id respectively. The communication records also carry a call_id, which is what I join on. The problem is that the data collection system was not working correctly to start with, so I have a large number of communications which I cannot match to a specific call. Not every call will have generated a communication.
For each day I need to create a joined list to perform some analysis on. Because some of the links are missing, I get three distinct row types:
- Just Calls
- Just Comms
- Linked comm and call
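To show how the three row types arise, here is a rough sketch in Python (not SSIS) of the full outer join on call_id. The field names date and row_type and the sample values are made up for illustration; my real columns differ.

```python
def full_outer_join(calls, comms):
    """Join comms to calls on call_id, keeping unmatched rows from both sides."""
    calls_by_id = {c["call_id"]: c for c in calls}
    matched_call_ids = set()
    rows = []
    for comm in comms:
        call = calls_by_id.get(comm["call_id"])  # None when the link is missing
        if call is not None:
            matched_call_ids.add(call["call_id"])
            rows.append({"date": call["date"], "row_type": "linked",
                         "call_id": call["call_id"], "comm_id": comm["comm_id"]})
        else:
            rows.append({"date": comm["date"], "row_type": "comm",
                         "call_id": None, "comm_id": comm["comm_id"]})
    for call in calls:
        if call["call_id"] not in matched_call_ids:
            rows.append({"date": call["date"], "row_type": "call",
                         "call_id": call["call_id"], "comm_id": None})
    return rows

# Hypothetical sample data: comm 10 has lost its link to a call.
sample_calls = [
    {"call_id": 1, "date": "2024-01-01"},
    {"call_id": 2, "date": "2024-01-01"},
]
sample_comms = [
    {"comm_id": 10, "call_id": None, "date": "2024-01-01"},
    {"comm_id": 11, "call_id": 2, "date": "2024-01-01"},
]
joined = full_outer_join(sample_calls, sample_comms)
```

With this sample, joined contains one row of each type: comm 10 on its own, comm 11 linked to call 2, and call 1 on its own.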
What I want to do is this: for every "Just Comm" row on a given date, remove one "Just Call" row for the same date. I don't need any values from the calls, I just need to know the call happened. If I do this I will end up with the correct number of rows, because each "Just Comm" row will have removed one "Just Call" row, which for my purposes can be treated as the call that created the comm.
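To make the rule concrete, this is a minimal sketch in Python (not SSIS) of the behaviour I am after. The field names date and row_type and the sample rows are hypothetical, and which "Just Call" row gets dropped is arbitrary (here, the first one encountered).

```python
from collections import Counter

def drop_matched_calls(rows):
    """For each date, drop one call-only row per comm-only row."""
    # Number of call-only rows still to remove, per date.
    to_drop = Counter(r["date"] for r in rows if r["row_type"] == "comm")
    kept = []
    for r in rows:
        if r["row_type"] == "call" and to_drop[r["date"]] > 0:
            to_drop[r["date"]] -= 1  # this call is accounted for by a comm
            continue
        kept.append(r)
    return kept

# Hypothetical joined data: two call-only rows and one comm-only row on Jan 1.
rows = [
    {"date": "2024-01-01", "row_type": "call", "call_id": 1, "comm_id": None},
    {"date": "2024-01-01", "row_type": "call", "call_id": 2, "comm_id": None},
    {"date": "2024-01-01", "row_type": "comm", "call_id": None, "comm_id": 10},
    {"date": "2024-01-01", "row_type": "linked", "call_id": 3, "comm_id": 11},
    {"date": "2024-01-02", "row_type": "call", "call_id": 4, "comm_id": None},
]
result = drop_matched_calls(rows)
```

In this sample, call 1 is removed because of comm 10, while call 2 and call 4 stay. If a date has more "Just Comm" rows than "Just Call" rows, the extra comms are simply kept.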
My problem is how to do this in SSIS. I have got as far as a single data set which contains all the data I need and is a mixture of the three row types mentioned above. How would you recommend I go about deleting the "Just Call" rows?