I've got a large dataset, some of which is duplicate records, identifiable by matching values in two fields.

To find these records, the following query works:

SELECT "Supplier Code", "Cost ex Tax", count("Description")
FROM supplierstuffs
GROUP BY "Supplier Code", "Cost ex Tax"
HAVING count("Description") > 1

Basically what I want to do is concatenate all the values of "Description" within each group into a single row, then replace all of the duplicated rows with that single row.

This is my half-broken query so far; it's kludgy and horrid. My primary goal is to get this working, but if I learn some new SQL tricks along the way, that's not at all a bad thing.

UPDATE supplierstuffs SET "Description" = 
(SELECT array_to_string(array_accum("Description"), ', ') FROM supplierstuffs
GROUP BY "Supplier Code", "Cost ex Tax"
HAVING count("Description") > 1)
WHERE .....

This is as far as I've gotten. What should I be reading to get a bit further? I've read a couple of books and a lot of web pages on the topic. However, in this case I think my problem is not just limited knowledge of SQL (OK, it's not my only problem) but also approaching the problem the wrong way.
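Before wiring it into an UPDATE, the aggregate half can be checked on its own. A minimal sketch using Python's stdlib sqlite3, purely for illustration: the sample rows are made up, and SQLite's group_concat stands in for the array_accum custom aggregate (PostgreSQL 8.4+ ships string_agg built in).

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute('CREATE TABLE supplierstuffs'
            ' ("Supplier Code" TEXT, "Cost ex Tax" REAL, "Description" TEXT)')
cur.executemany('INSERT INTO supplierstuffs VALUES (?, ?, ?)', [
    ("X23", 42.00, "Widget"),
    ("X23", 42.00, "Brass gadget"),
    ("X42", 23.00, "Flange"),   # lone row: filtered out by HAVING
])

# One merged Description per duplicated group -- the value the
# half-broken UPDATE is trying to write back.
rows = list(cur.execute('''
    SELECT "Supplier Code", "Cost ex Tax",
           group_concat("Description", ', ')
    FROM supplierstuffs
    GROUP BY "Supplier Code", "Cost ex Tax"
    HAVING count("Description") > 1
'''))
print(rows)
```

Note that the concatenation order within a group is unspecified in both SQLite's group_concat and PostgreSQL's string_agg unless you impose one.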

EDIT 1:

"Name";"Supplier Code";"Description"
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"04 -14, S - 3XL"
"7CP PODIUM CRICKET PANT  ";"7CP";"08 -14, S - 2XL"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"04 -14, S - 3XL"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"04 -14, S - 3XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL, XS - 2XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL, 8-16"

^^ is what I want to create from vv

"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"S - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"8-16"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T232RG Raglan Sleeve Tee";"T232RG";"XS - 3XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"T444MS Cool dry breathable sporty T-shirts";"T444MS";"XS - 2XL"
"7CP PODIUM CRICKET PANT  ";"7CP";"08 -14"
"7CP PODIUM CRICKET PANT  ";"7CP";"S - 2XL"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"04 -14"
"7CPL PODIUM L/SLV CRICKET POLO";"7CPL";"S - 3XL"
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"04 -14"
"7CPS PODIUM S/SLV CRICKET POLO";"7CPS";"S - 3XL"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"04 -14"
"7CPT PODIUM 3/4 SLV CRICKET POLO";"7CPT";"S - 3XL"

^^ noting that the rows that don't have more than one description line need to remain untouched.

I've so far created the new records in a new table with:

INSERT INTO tmptable
SELECT "Name" , "Supplier Code", array_to_string(array_accum("Description"), ', ')
FROM supplierstuffs
GROUP BY "Name", "Supplier Code"
HAVING count("Description") > 1

So now all that remains is to delete the records that were caught by the concatenation. It seems I can't use a HAVING clause with DELETE FROM? I'm thinking that DELETE FROM table WHERE oid IN (SELECT oids using a having clause) will work?
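That pattern does work: DELETE itself takes no GROUP BY/HAVING, but a subquery in its WHERE clause can. A sketch with Python's stdlib sqlite3 (made-up sample rows; SQLite's rowid plays the part of PostgreSQL's oid here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute('CREATE TABLE supplierstuffs'
            ' ("Name" TEXT, "Supplier Code" TEXT, "Description" TEXT)')
cur.executemany('INSERT INTO supplierstuffs VALUES (?, ?, ?)', [
    ("T232RG Raglan Sleeve Tee", "T232RG", "XS - 3XL"),
    ("T232RG Raglan Sleeve Tee", "T232RG", "8-16"),
    ("7CP PODIUM CRICKET PANT", "7CP", "08 -14, S - 2XL"),  # already merged
])

# Collect the rowids of every row belonging to a duplicated group,
# then delete exactly those rows.
cur.execute('''
    DELETE FROM supplierstuffs
    WHERE rowid IN (
        SELECT s.rowid
        FROM supplierstuffs AS s
        JOIN (SELECT "Name", "Supplier Code"
              FROM supplierstuffs
              GROUP BY "Name", "Supplier Code"
              HAVING count("Description") > 1) AS d
          ON s."Name" = d."Name"
         AND s."Supplier Code" = d."Supplier Code")
''')

rows = list(cur.execute('SELECT * FROM supplierstuffs'))
print(rows)
```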

EDIT 2:

SELECT array_accum(oid)
FROM supplierstuffs
GROUP BY "Name", "Supplier Code", "Colour", "Cost ex Tax"
HAVING count("Description") > 1

returns a few arrays of 2 oids, all of which need to be deleted. I feel I'm very close, and yet so far. Thanks in advance.

A: 

So what you currently have is something like this ...

DESCRIPTION            SUPPLIER_CODE  COST_EX_TAX
Widget                 X23                  42.00 
Brass gadget           X23                  42.00 
Flange                 X42                  23.00 
Flange, steel          X42                  23.00 

... and what you want is ...

DESCRIPTION            SUPPLIER_CODE  COST_EX_TAX
Brass gadget, Widget   X23                  42.00 
Flange, Flange, steel  X42                  23.00 

This still doesn't seem like the right approach. That concatenated DESCRIPTION looks wrong to me. However, you know your data and your customer's requirements better than I do.

APC
I think maybe I didn't explain properly (not sure how else to phrase it in the thread title). What I'm looking to do is to alter the rows with the new data, then delete the duplicates, leaving a single row for each. Does that make more sense? How would I describe this process?
Richo
Oh, and the reason for the dupes is that I'm importing data from a customer's lists into our system. For their old purpose the customer's data format worked, but it's not suitable for our model.
Richo
That's exactly what I'm after! However, I think I may have sorted it with bkm's solution.
Richo
Have it sorted now. I did some manual kludgery with the oids by hand in vim, translating a bunch of rows each containing a single array into one logical array holding a stack of oids. Thanks again for your patience.
Richo
@Richo - you're welcome. Thanks for the points, I'm not sure that I actually provided you with a solution. Your data sample does make the scenario a lot clearer - although I *still* think the data model looks wrong ;)
APC
While you didn't personally solve my problem, the problem is now solved, and I don't think it would have been solved without your help, so I'd argue you were key to my solving it. I agree the data model isn't amazing, but unfortunately 'tell the customers their data is dodgy and unworkable' is not an option, so here I am. Thanks for your time. Have a good one.
Richo
+2  A: 

The following approach will work:

  1. Identify only the duplicate rows and store them in a new table.
  2. Delete duplicate rows from parent table
  3. Concatenate the description column in the table containing only duplicate rows. Concatenate using a group by clause.
  4. Insert all rows from the result of step 3 into the original table.
bkm
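The four steps above can be sketched end to end with Python's stdlib sqlite3 (sample rows are made up; SQLite's group_concat stands in for array_accum/string_agg, rowid for oid, and steps 1 and 3 collapse into a single CREATE TABLE ... AS SELECT):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute('CREATE TABLE supplierstuffs'
            ' ("Name" TEXT, "Supplier Code" TEXT, "Description" TEXT)')
cur.executemany('INSERT INTO supplierstuffs VALUES (?, ?, ?)', [
    ("T232RG Raglan Sleeve Tee", "T232RG", "XS - 3XL"),
    ("T232RG Raglan Sleeve Tee", "T232RG", "8-16"),
    ("7CP PODIUM CRICKET PANT", "7CP", "08 -14"),   # unique: left alone
])

# Steps 1 + 3: store one concatenated Description per duplicated group
# in a holding table.
cur.execute('''
    CREATE TEMP TABLE tmptable AS
    SELECT "Name", "Supplier Code",
           group_concat("Description", ', ') AS "Description"
    FROM supplierstuffs
    GROUP BY "Name", "Supplier Code"
    HAVING count("Description") > 1
''')

# Step 2: delete the duplicated rows from the parent table.
cur.execute('''
    DELETE FROM supplierstuffs
    WHERE EXISTS (SELECT 1 FROM tmptable AS t
                  WHERE t."Name" = supplierstuffs."Name"
                    AND t."Supplier Code" = supplierstuffs."Supplier Code")
''')

# Step 4: put the merged rows back.
cur.execute('INSERT INTO supplierstuffs SELECT * FROM tmptable')

rows = sorted(cur.execute('SELECT * FROM supplierstuffs'))
print(rows)
```

The holding table matters because it freezes the concatenated values before the deletion touches the parent table, so the DELETE and the re-INSERT each read from a stable source.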
Thanks bkm. I can do the move into the temp table, but I'm a bit vague on how it actually changes my problem. I was attempting to do it in place with a GROUP BY and not really getting anywhere. More specifically, I'm not sure how moving the rows to a table that has fewer (but not zero) rows I'm not interested in helps. Do you have any links?
Richo
Oh wait, I think I just got it: 1) create a temp table, 2) insert only the refined rows generated by the concatenation + GROUP BY into the temp table, 3) hose all dupes in the original table, 4) reinsert only the clean rows from the temp table? Thanks, I'll try that.
Richo