I have a rectangular n x m matrix (n != m). What's the best way to find out whether it contains any duplicate rows in MATLAB? And what's the best way to find the indices of the duplicates?

A: 

Run through the rows of the matrix, and for each pair, test if

row1 == row2
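
A minimal sketch of the pairwise comparison described above, in case it helps to see it spelled out. The sample matrix and the names `isDup`, `hasDuplicates`, and `ixDupRows` are illustrative assumptions, and `isequal` is used here as one way to compare two whole rows at once. The cost grows quadratically with the number of rows.

```matlab
x = [1 1; 2 2; 3 3; 4 4; 2 2; 3 3; 3 3];   % illustrative sample matrix
n = size(x, 1);
isDup = false(n, 1);          % isDup(j) becomes true if row j repeats an earlier row
for i = 1:n-1
    if isDup(i)
        continue              % row i is itself a repeat; its matches were found earlier
    end
    for j = i+1:n
        if ~isDup(j) && isequal(x(i,:), x(j,:))
            isDup(j) = true;  % row j repeats row i
        end
    end
end
hasDuplicates = any(isDup);   % true if at least one row is repeated
ixDupRows = find(isDup).';    % indices of the repeats, excluding first occurrences
```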

John at CashCommons
This works, but is definitely both slower and more verbose than the other basic option (i.e. using 'unique()').
bnaul
+6  A: 

Use unique() to find the distinct rows. If you end up with fewer rows than you started with, there are duplicates. It also returns the index of one occurrence of each distinct row (the first one, when called with 'first'). All the other row indexes are your duplicates.

x = [
    1 1
    2 2
    3 3
    4 4
    2 2
    3 3
    3 3
    ];
[u,I,J] = unique(x, 'rows', 'first')
hasDuplicates = size(u,1) < size(x,1)
ixDupRows = setdiff(1:size(x,1), I)
dupRowValues = x(ixDupRows,:)
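
The third output `J` is not used above. As a possible extension (a sketch, not part of the original answer), it can map each duplicate back to the row it repeats and count how often each distinct row occurs. The names `firstOccurrence` and `counts` are assumptions made for illustration.

```matlab
x = [1 1; 2 2; 3 3; 4 4; 2 2; 3 3; 3 3];
[u, I, J] = unique(x, 'rows', 'first');
I = I(:);  J = J(:);                     % force column vectors; orientation can differ between releases
ixDupRows = setdiff(1:size(x,1), I);     % rows that repeat an earlier row
firstOccurrence = I(J(ixDupRows)).';     % for each duplicate, the index of the row it repeats
counts = accumarray(J, 1);               % number of occurrences of each distinct row in u
```

Here `x(ixDupRows(k),:)` equals `x(firstOccurrence(k),:)` for every `k`, and any entry of `counts` greater than 1 marks a row of `u` that appears more than once.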
Andrew Janke
+1: Dang, beat me by 49 seconds!
gnovice
+2  A: 

You can use the functions UNIQUE and SETDIFF to accomplish this:

>> mat = [1 2 3; 4 5 6; 7 8 9; 7 8 9; 1 2 3];    %# Sample matrix
>> [newmat,index] = unique(mat,'rows','first');  %# Finds indices of unique rows
>> repeatedIndex = setdiff(1:size(mat,1),index)  %# Finds indices of repeats

repeatedIndex =

     4     5
gnovice
Shouldn't `repeatedIndex` be `[3,4]`?
AVB
@AVB: No, the fourth and fifth rows of `mat` are repeats of earlier rows.
gnovice
A: 

Say your matrix is M:

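[S,idx1] = sortrows(M);                  % sort rows so that identical rows become adjacent
idx2 = find(all(diff(S,1,1) == 0,2));    % sorted positions whose next row is identical
out = unique(idx1([idx2;idx2+1]));       % map both rows of each matching pair back to M

out will contain the indices of all rows that have a duplicate (including the first occurrence of each), if any.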

upperBound
This will only work if your duplicated rows are next to one another.
gnovice
My mistake. Wrong assumption...
upperBound
@upperBound: Well, technically the OP never *explicitly* said whether or not the duplicated rows abut one another. Although not as general as using UNIQUE, this solution runs *substantially* faster in the specific case of neighboring duplicates, so +1.
gnovice
Removed my false assumption...
upperBound
@upperBound: Well, your new answer is doing something that I don't think the OP wanted. It is returning the indices of *all* rows that are not unique. I think the OP just wanted indices of duplicates *not counting* the first one found. In other words, if rows 2, 4, and 5 are the same, then rows 4 and 5 are considered "duplicates", with row 2 being the "original" (or 2 and 4 could be counted as duplicates, with 5 as the original... there was no order specified by the OP).
gnovice
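
To make the distinction in the last comment concrete, here is a hedged sketch of both variants built on SORTROWS. The variable names `sameAsPrev`, `allNonUnique`, and `repeatsOnly` are assumptions for illustration. Which member of each group is treated as the "original" depends on how `sortrows` orders identical rows, so only the number of repeats per group should be relied on here.

```matlab
M = [1 2 3; 4 5 6; 7 8 9; 7 8 9; 1 2 3];      % sample matrix from the UNIQUE/SETDIFF answer
[S, idx1] = sortrows(M);                      % identical rows end up adjacent in S
sameAsPrev = all(diff(S, 1, 1) == 0, 2);      % sameAsPrev(k): sorted row k+1 equals sorted row k
k = find(sameAsPrev);
allNonUnique = unique(idx1([k; k + 1])).';    % every row that has at least one twin
repeatsOnly = sort(idx1([false; sameAsPrev])).';  % same groups, leaving out one row per group
```

For this matrix `allNonUnique` is `[1 3 4 5]`, while `repeatsOnly` holds two indices, one from each duplicated pair.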
