I've got a table in Postgres that is chock full of articles. Each article has a url slug associated with it, which is used to display it as example.com/pretty_name as opposed to example.com/2343.

Unfortunately, when I started out, I enforced a unique constraint on urls but neglected to make it case-insensitive. I'd like to right that wrong and start requiring urls to be unique without regard to case.

As a first step, I need to fix all the duplicate urls already present in my database. How can I search the table for rows whose urls are duplicates on a case-insensitive basis, leave one row as is, and append something like '_2' to the rest of the duplicates?

It's especially tricky because I'm not 100% sure that no url is duplicated more than once. That is, I might have 3 rows sharing one url, in which case ideally I'd want the first to be pretty_name, the second to be pretty_name_2 and the third to be pretty_name_3.

+1  A: 

If you have some sort of unique id on the table:

-- Keep the row with the highest id in each case-insensitive group; suffix all the others.
UPDATE articles a1 SET url = a1.url || '_2'
WHERE a1.id NOT IN (SELECT max(a2.id) FROM articles a2 GROUP BY lower(a2.url));

If you don't have a unique id:

-- Same idea, using ctid (the row's physical location) as a stand-in identifier.
UPDATE articles a1 SET url = a1.url || '_2'
WHERE a1.ctid NOT IN (SELECT max(a2.ctid) FROM articles a2 GROUP BY lower(a2.url));
rfusca
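
To see which rows statements like the ones above would touch before running them, a read-only query along these lines may help. It is only a sketch and assumes the table is named articles with a unique id column and a url column, as in the answer; the aliases url_key, copies and kept_id are made up for illustration.

```sql
-- List every case-insensitive url that occurs more than once.
-- kept_id is the row the UPDATE above would leave unchanged (the highest id).
SELECT lower(url) AS url_key,
       count(*)   AS copies,
       max(id)    AS kept_id
FROM articles
GROUP BY lower(url)
HAVING count(*) > 1
ORDER BY copies DESC;
```

Any group reporting more than 2 copies would end up with several rows sharing the same '_2' suffix after a single run of the UPDATE.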
Can you please explain how these statements work? Is this saying to update the _articles_ record that does _not_ have the maximum ID, but that shares a case-insensitive URL with another record? If so, what happens if there's more than one match? Would it convert the URLs of _all_ but the record with the maximum ID?
seh
It updates every row whose id is not the maximum within its group of case-insensitively equal urls. So yes, it would convert the URLs of all but the record with the max id.
rfusca
Thanks again! My own personal database savior. In order to catch multiple duplicates, maybe I'll run the same thing over and over while incrementing the numeral until no duplicates remain. On later runs, I suppose I'd need to cook up a way to tell the database to first chop off the _2 and replace it with a _3.
WIlliam Jones
You *could* greatly reduce the number of runs by replacing '_2' with '_'||round(random()*100). It's not perfect, but if there are only a few variations, you're unlikely to get a repeat.
rfusca
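
As an alternative to repeated runs, the numbering asked for in the question (pretty_name, pretty_name_2, pretty_name_3) could possibly be done in one pass with a window function. This is a sketch rather than a tested migration. It assumes PostgreSQL 8.4 or later, a unique id column, and that "first" means the lowest id, which differs from the answer above (that one keeps the highest id). The index name articles_url_lower_idx is hypothetical.

```sql
-- Number the rows inside each case-insensitive group, lowest id first,
-- then append the number to every row except the first one in its group.
UPDATE articles a
SET url = a.url || '_' || r.rn::text
FROM (
    SELECT id,
           row_number() OVER (PARTITION BY lower(url) ORDER BY id) AS rn
    FROM articles
) AS r
WHERE a.id = r.id
  AND r.rn > 1;

-- Once no duplicates remain, a unique index on the lowercased url
-- enforces case-insensitive uniqueness from then on.
CREATE UNIQUE INDEX articles_url_lower_idx ON articles (lower(url));
```

A generated name can still collide with a url that already exists (for example, if some other article is already called pretty_name_2). In that case the CREATE UNIQUE INDEX step fails, which signals that another round of cleanup is needed.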