views:

221

answers:

1

I want to do hierarchical agglomerative clustering on texts in MATLAB. Say, I have four sentences,

I have a pen.
I have a paper. 
I have a pencil.
I have a cat. 

I want to cluster the above four sentences to see which are more similar. I know Statistic toolbox has command like pdist to measure pair-wise distances, linkage to calculate the cluster similarity etc. A simple code like:

X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)

works fine and return a dendrogram.

I wonder can I use these command on the texts as I mentioned above. Any thoughts ?


UPDATES:

Thanks to Amro. Read Understood and computed the distance among strings. Code follows:

clc
S1='I have a pen'; % first String

f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events); % selects one text randomly.
% store the texts in a cell array

for kk=1:ii

   S2=events(kk);
   S2=cell2mat(S2);
   Z=levenshtein_distance(S1,S2);
   X(kk)=Z;

end 

I input a string and I had 4 saved strings. Now I calculated the pairwise distance using levenshtein_distance function. It returns a matrix X=[ 17 0 16 18 16].

** I guess this is my pair wise distance matrix. Similar to what pdist does. Is it ?

** Now, I'm trying to input X to compute the linkage like

Z=linkage(X, 'single);

Output I'm getting is:

Error using ==> linkage at 93 Size of Y not compatible with the output of the PDIST function.

Error in ==> Untitled2 at 20 Z=linkage(X,'single') .

Why so ? Can use the linkage function at all ? Help appreciated.

UPDATE 2

clc
S1='I have a pen';

f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events)+1; % total number of strings in the comparison

D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii 

    S2=events(kk);

    %S2=cell2mat(S2);

    for jk=kk+1:ii

  D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

    end

end

D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T).

*Error: ??? Cell contents reference from a non-cell array object. Error in ==> Untitled2 at 22 D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});*

Also, Why am I reading the event from the file inside the first loop ? Doesn't seem logical. Bit confused, if I can work this way or only solution is to input all strings inside the code. Help much appreciated.

UPDATE

code to compare two sentences:

clc
    str1 = 'Fire in NY';
    str2= 'Jeff is sick';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');  

Output D=18.

WITH Different strings:

clc
str1 = 'Fire in NY';
str2= 'NY catches fire';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default'); 

D=28.

Based on distance, a completely different sentence looks similar. What I'm trying to do, If I have stored Fire in NY, I wont store NY catches fire. However, for the first case, I would store as the information is new.

IS LD sufficient to do this ? Help appreciated.

+1  A: 

What you need is a distance function that can handle strings. Check out the Levenshtein distance (edit distance). There are plenty of implementations out there:

Alternatively, you should extract some interesting features (ex: number of vowels, length of string, etc..) to build a vector space representation, then you can apply any of the usual distance measures (euclidean, ...) on the new representation.


EDIT

The problem with your code is that LINKAGE expects the input distances format to match that of PDIST, namely a row vector corresponding to pairs of observations in the order 1-vs-2, 1-vs-3, 2-vs-3, etc.. which is basically the lower half of the complete distance matrix (since its supposed to be symmetric as dist(1,2) == dist(2,1))

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)

Please refer to the documentation of the functions in question for more information...

Amro
L.distance is different from HAC, isn't it ? I'm looking for HAC only. I couldnt find if HAC extracts vector using any special feature. Can I convert texts to vector (all characterless) ? Any hints on how to build a vector space ?
Tinglin
Levenshtein distance is not a clustering algorithm, its a distance function between two strings. This is needed because Hierarchical clustering starts by computing the distance matrix between all pairs of instances `PDIST`, and then start to merge them in a bottom-up approach (agglomerative) `LINKAGE`
Amro
I'm just reading the link you gave. will update on what I could do. Thanks Amro.
Tinglin
Just read the documents and understood. copied the code and ran too. If I understand correctly , the first loop takes one string at a time and second loop it is compared with the rest. I defined one string as S1 and want to read others from a file. I'm updating the code which I just edited following your code. I'm having an error and wondering if I could do this way or not. Thanks very much and help appreciated.
Tinglin
@Tinglin: what is it you're trying to do? hierarchical clustering takes **all** instances and compute their distances among each other (not one in particular vs all others), and then it groups similar instances together to form the clusters.. As to the loop, it will compare the instance in the order I explained before: 1-vs-2, 1-vs-3, 1-vs-4, 2-vs-3, 2-vs-4, and 3-vs-4 (basically all possible unordered pairs)
Amro
@ Amro. I understood clearly what you explained. I'm just trying like : I have one string say (S1). In a file I have previously saved 4 strings (S2, S3, S4, S5). Assume they couldn't be clustered before or may be clustered. I would like to compare (S1-S2) (S1-S3) (S1-S4) (S1-S5) to see if S1 can be cluster with any of the four strings. If yes, I would replace say S2 with S1 and save the file. possible ? Example: I have a pen will replace previous I have a pencil. or, stock exchange fails will replace failure in stock exchange. Or, the whole concept is wrong ? Thanks very much.
Tinglin
@Tinglin: I'm afraid I don't follow your logic. The whole point of clustering is that it discovers intrinsic grouping of your data. This means that given a bunch of instances (strings in your case), it will group them for you such that similar instances are placed together (or in this case a hierarchy of groups). It might help if you read on more on the subject: http://en.wikipedia.org/wiki/Cluster_analysis#Hierarchical_clustering
Amro
@Amro. I read all the documents I had and wiki. I think I have to modify what I'm trying to do and here it goes: I received an information S1='I have a pen'. I stored it in the pc. Now, I have received a new information S2=' I have a pencil'. Before I store S2, I want to determine is S2 same/ similar to S1 ? If yes, I either want to truncate S2 or replace S1 with S2. But if, I find S2 is very different than S1. I store S2 as well. HAC groups similar items but I dont think appropriate for my scenario. Also LD. Have look at the code. Thanks.
Tinglin
@Tinglin: Clearly clustering does not fit your problem. Still, this is way too vague: how would do you determine if two strings are similar/different? Perhaps you should post a new question, but make sure to clearly explain the task you are trying to achieve (based on your description, I suspect you're not sure of the concept yourself!)
Amro
@Amro, you are dead right. I'm still thinking the concept and modifying. I came up with a decent one and posted a new Question. LD works upto some extent though. Thanks again.
Tinglin
I'm linking to your new question for posterity: http://stackoverflow.com/questions/3655612/how-to-compute-similarity-between-two-sentences-syntactical-and-semantical
Amro