views:

202

answers:

2

I have two sets of data which I need to join, but there is an added problem because the data quality is not great.

The two data sets are Calls (phone calls) and Communications (records created about phone calls). They have ID's called call_id and comm_id respectively. The communication records also have call_ids to perform the join. The problem is that the data collection system was not working correctly to start with and I have a large number of communication which I cannot match to a specific call. Not all calls will have generated a communication.

For each day I need to create a joined list to perform some analysis on. The problem is that due to the lack of some of the links I get 3 distinct row types:

  • Just Calls
  • Just Comms
  • Linked comm and call

What I want to do is for every row which is a "Just Comm" row on a given date, I should remove a "Just Call" row for the same date. I dont need any values from the calls, I just need to know the call happened. If I do this I will end up with the correct number of rows because all the "just comms" will have removed a "just call" row which as far as I need to know was the call which created the comm.

My problem is how to do this in SSIS. I have got to the point where I have my data set which contains all the data I need and is a mixture of the 3 row types I mentioned above. How would you recommend I go through the process of deleting "Just Call" rows?

A: 

Not sure about SSIS, but I can give you some SQL to start with.

Seems you are not overly interested in the calls, since you are willing to ignore the data if it is not matching a communication.

To me this sounds like a typical LEFT JOIN situation:

SELECT
  call.call_id,
  call.some_data,
  comm.comm._id
  comm.some_data,
FROM
  Communications comm
  LEFT JOIN Calls call ON call.call_id = comm.call_id
WHERE
  <date filter: today>

This would give you all communications of today, paired up only with those calls that are correctly linked. All other calls of today would not end up in the result set.


EDIT:

From the OP's comment to another answer:

I want to take all of the unmatched call and comm records. Start with the call records and go down the list matching them with any comm record from the same day. Each comm record should only be used once. If a call record cannot find an unused comm with a matching date then it should remain null...

I see a key problem with this approach:

Which of the "call" records are the ones that should remain after the procedure? The TOP 17 (with 17 being the "call/comm"-mismatch count for today)? Whatever is left? The latest calls?

Of what value could it possibly be to leave 17 random call records in the result set? They carry no usable information whatsoever. They are not better than the ones you have crossed out. Are you going to display them somewhere: Why?

I would go with the LEFT JOIN and simply gather a count of the rest.

Tomalak
This wouldn't solve my problem because I need to know how many calls there where on each day. This solution will lose me all of the unmatched calls.
Craig Warren
Then you need to define what data and which counts you need, exactly. It's best to provide a small example that shows the situation and the desired output.
Tomalak
A: 

I'm not sure I understood your problem completely, but maybe you could try with a FULL OUTER JOIN:

SELECT
  CL.ID AS CALL_ID,
  CL.DATE AS CALL_DATE,
  CM.ID AS COMM_ID,
  CM.DATE AS COMM_DATE
FROM
  CALLS CL
  FULL JOIN
  COMMUNICATIONS CM ON
    (CM.CALL_ID = CALL.ID)
WHERE
  (CL.ID IS NULL) OR
  (CM.ID IS NULL)

This will return all the rows that contain either a NULL Call ID or Communication ID (i.e. the ones with a "broken link").

You can then use this as a view (let's call it *VCALLS_COMMS*) and join it to itself to find rows with a matching date:

SELECT 
  VCC1.CALL_ID,
  VCC2.CALL_ID
FROM
  VCALLS_COMMS VCC1
  JOIN
  VCALL_COMMS VCC2 ON
    (VCC2.COMM_DATE = VCC1.CALL_DATE)

Maybe it's not exactly what you're looking for, hope it helps.

Diego
So the first step here is what I have already done. A Full Outer Join on the comms and Calls to produce the data set which has all 3 types of rows in it. The where clause would remove those rows which are "both call and comm".The second query using the view will not do what I want.I'll explain further in the next comment...
Craig Warren
I want to take all of the unmatched call and comm records. Start with the call records and go down the list matching them with any comm record from the same day. Each comm record should only be used once.If a call record cannot find an unused comm with a matching date then it should remain null...
Craig Warren
If you are going to randomly cross out call records (up to the number of unmatched communication records) - which call records are the ones that should remain? Of what value are they to you? Are you interested in more than the count? If not, why not simply count them with a separate query?
Tomalak