I have two cell arrays of strings, and I want to check whether they contain the same strings (they do not have to be in the same order, and we do not know whether they have the same length).

For example:

a = {'2' '4' '1' '3'};
b = {'1' '2' '4' '3'};

or

a = {'2' '4' '1' '3' '5'};
b = {'1' '2' '4' '3'};

First I thought of strcmp, but it would require looping over the contents of one cell array and comparing each string against the other. I also considered ismember, using something like:

ismember(a,b) & ismember(b,a)

but we don't know in advance that they are of the same length, and the element-wise & fails when the two results differ in size (the obvious unequal case). So how would you perform this comparison in the most efficient way, without writing too many if/else cases?

+1  A: 

Take a look at the function intersect

What MATLAB Help says:

[c, ia, ib] = intersect(a, b) also returns column index vectors ia and ib such that c = a(ia) and b(ib) (or c = a(ia,:) and b(ib,:)).
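
One possible way to turn the result of intersect into a true/false answer, as a sketch rather than a method stated in this answer: compare the size of the intersection with the number of distinct strings on each side. The variable names c and sameStrings are only illustrative.

```matlab
a = {'2' '4' '1' '3' '5'};
b = {'1' '2' '4' '3'};

% intersect returns the distinct strings common to both cell arrays
c = intersect(a, b);

% The sets match only if the intersection is as large as each set of distinct strings
sameStrings = numel(c) == numel(unique(a)) && numel(c) == numel(unique(b))   % false: '5' is only in a
```

Like the other set-based approaches, this ignores how often a string is repeated.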

Mikhail
I am not sure how to get the solution from the result of `intersect`
Dave
It depends on what exactly you have to do. If you need a scalar boolean telling you whether both vectors contain the same strings, then the solution by gnovice is the right answer for you.
Mikhail
+6  A: 

You could use the function SETXOR, which will return the values that are not in the intersection of the two cell arrays. If it returns an empty array, then the two cell arrays contain the same values:

arraysAreEqual = isempty(setxor(a,b));
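
As a quick illustration, here is the one-liner applied to the two examples from the question. The last part is an assumption about a case the question does not mention: SETXOR treats its inputs as sets, so repeated strings are ignored, and a sorted comparison is one way to take repeats into account.

```matlab
% First example: same strings in a different order
a = {'2' '4' '1' '3'};
b = {'1' '2' '4' '3'};
equal1 = isempty(setxor(a, b))        % true

% Second example: a has an extra string
a = {'2' '4' '1' '3' '5'};
equal2 = isempty(setxor(a, b))        % false, setxor returns {'5'}

% Repeats are ignored by the set comparison
c = {'1' '1' '2'};
d = {'1' '2'};
equalAsSets  = isempty(setxor(c, d))            % true
equalAsLists = isequal(sort(c(:)), sort(d(:)))  % false, counts repeats as well
```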



EDIT: Some performance measures...

Since you were curious about performance measures, I thought I'd test the speed of my solution against the two solutions listed by Amro (which use ISMEMBER and STRCMP/CELLFUN). I first created two large cell arrays:

a = cellstr(num2str((1:10000).'));  % A cell array with 10,000 strings
b = cellstr(num2str((1:10001).'));  % A cell array with 10,001 strings

Next, I ran each solution 100 times over to get a mean execution time. Then, I swapped a and b and reran it. Here are the results:

    Method     |      Time     |  a and b swapped
---------------+---------------+------------------
Using SETXOR   |   0.0549 sec  |    0.0578 sec
Using ISMEMBER |   0.0856 sec  |    0.0426 sec
Using STRCMP   |       too long to bother ;)

Notice that the SETXOR solution has consistently fast timing. The ISMEMBER solution will actually run slightly faster if a has elements that are not in b. This is due to the short-circuit && which skips the second half of the calculation (because we already know a and b do not contain the same values). However, if all of the values in a are also in b, the ISMEMBER solution is significantly slower.
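
The timing loop itself is not shown above. A minimal sketch of how such a measurement might be set up, assuming plain TIC/TOC around each call (the names nRuns, t and meanTime are illustrative, not taken from the original test):

```matlab
a = cellstr(num2str((1:10000).'));   % 10,000 strings
b = cellstr(num2str((1:10001).'));   % 10,001 strings

nRuns = 100;
t = zeros(nRuns, 1);
for k = 1:nRuns
    tic;
    tf = isempty(setxor(a, b));      % the solution being timed
    t(k) = toc;                      % elapsed time of this run, in seconds
end
meanTime = mean(t);                  % mean execution time over all runs
```

Swapping a and b in the timed line gives the second column of the table; absolute numbers will differ from machine to machine.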

gnovice
To gauge the performance, you would need another solution to compare against, like the suggestion you made using a loop and [STRCMP](http://www.mathworks.com/access/helpdesk/help/techdoc/ref/strcmp.html). I imagine the performance would be perfectly fine, but if you discover that the use of [SETXOR](http://www.mathworks.com/access/helpdesk/help/techdoc/ref/setxor.html) really ends up being a bottleneck in your processing, you can try to look at its source code (`type setxor` or `edit setxor`) and rewrite it by trimming some error-checking, etc.
gnovice
Thanks, I think I see what @Mikhail was trying to do. What about performance? It seems that the XOR of two sets is an expensive operation when all I need is a true/false answer.
Dave
oops, I edited my comment and messed up the order.. sorry
Dave
+3  A: 

You can still use the ISMEMBER function as you did, with a small modification:

arraysAreEqual = all(ismember(a,b)) && all(ismember(b,a))
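
To make the modification concrete, this small sketch shows what ISMEMBER returns for the unequal example from the question and why reducing each result with ALL helps. The names inB and inA are only illustrative.

```matlab
a = {'2' '4' '1' '3' '5'};
b = {'1' '2' '4' '3'};

inB = ismember(a, b)    % 1x5 logical: [1 1 1 1 0], since '5' is missing from b
inA = ismember(b, a)    % 1x4 logical: [1 1 1 1]

% inB & inA would raise a size-mismatch error (1x5 vs 1x4).
% Reducing each mask with all() first gives two scalars that && can combine.
arraysAreEqual = all(inB) && all(inA)   % false
```

Because && short-circuits, the second ISMEMBER call is skipped entirely when it is written inline and the first check already fails.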

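Also, you can write the loop version with STRCMP as one line. The check has to run in both directions; testing only a against b would report a subset as equal:

arraysAreEqual = all( cellfun(@(s)any(strcmp(s,b)), a) ) && ...
                 all( cellfun(@(s)any(strcmp(s,a)), b) )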

EDIT: I'm adding a third solution adapted from another SO question:

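g = grp2idx([a(:); b(:)]);   %# GRP2IDX needs the Statistics Toolbox; (:) forces column vectors
v = isequal( unique(g(1:numel(a))), unique(g(numel(a)+1:end)) );   %# ISEQUAL also copes with different counts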
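
To see why comparing group indices works, here is a small sketch using the unequal example from the question. It assumes the Statistics Toolbox is installed (GRP2IDX lives there); the names nA, gA and gB are only illustrative.

```matlab
a = {'2' '4' '1' '3' '5'};
b = {'1' '2' '4' '3'};

% Stack both arrays as one column so every distinct string gets one integer label
g  = grp2idx([a(:); b(:)]);
nA = numel(a);

gA = unique(g(1:nA));        % labels that occur in a
gB = unique(g(nA+1:end));    % labels that occur in b

% Same strings <=> same set of labels; isequal tolerates different lengths
sameStrings = isequal(gA, gB)   % false here, because '5' only occurs in a
```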

In the same spirit, I performed a timing comparison (using the TIMEIT function):

function perfTests()
    a = cellstr( num2str((1:10000)') );
    b = a( randperm(length(a)) );

    timeit( @() func1(a,b) )
    timeit( @() func2(a,b) )
    timeit( @() func3(a,b) )
    timeit( @() func4(a,b) )
end

function v = func1(a,b)
    v = isempty(setxor(a,b));                      %# @gnovice answer
end

function v = func2(a,b)
    v = all(ismember(a,b)) && all(ismember(b,a));
end

function v = func3(a,b)
    v = all( cellfun(@(s)any(strcmp(s,b)), a) );   %# one direction only (a against b)
end

function v = func4(a,b)
    g = grp2idx([a;b]);
    v = all( unique(g(1:numel(a))) == unique(g(numel(a)+1:end)) );
end

The results, in seconds and in the same order as the functions (lower is better):

ans =
     0.032527
ans =
     0.055853
ans =
       8.6431
ans =
     0.022362
Amro