I want to do hierarchical agglomerative clustering on text in MATLAB. Say I have four sentences:
I have a pen.
I have a paper.
I have a pencil.
I have a cat.
I want to cluster these four sentences to see which are most similar. I know the Statistics Toolbox has commands like pdist to measure pairwise distances, linkage to calculate the cluster similarity, etc. A simple example like:
X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)
works fine and returns a dendrogram.
Can I use these commands on text like the sentences above? Any thoughts?
UPDATES:
Thanks to Amro. I read the answer, understood it, and computed the distances between strings. The code follows:
clc
S1='I have a pen'; % first String
f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.
ii=numel(events); % number of saved strings
% compute the distance from S1 to each saved string
for kk=1:ii
S2=events(kk);
S2=cell2mat(S2);
Z=levenshtein_distance(S1,S2);
X(kk)=Z;
end
I input one string and had 4 saved strings. I then calculated the pairwise distance using the levenshtein_distance function. It returns a matrix X=[17 0 16 18 16].
** I guess this is my pairwise distance matrix, similar to what pdist returns. Is it?
** Now I'm trying to pass X to linkage, like this:
Z=linkage(X, 'single');
The output I'm getting is:
Error using ==> linkage at 93 Size of Y not compatible with the output of the PDIST function.
Error in ==> Untitled2 at 20 Z=linkage(X,'single')
Why is that? Can I use the linkage function at all? Help appreciated.
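For reference, the size check behind this error message can be sketched as follows. For n observations, pdist returns one distance per unordered pair of rows, i.e. n*(n-1)/2 values, and linkage expects a vector of that length. This is only a minimal sketch reusing the numeric example from the top of the post; it assumes the Statistics Toolbox is available, and the variable candidates is introduced here purely for illustration.

```matlab
% Minimal sketch: length of a pdist-style vector for n observations.
X = [1 2; 2 3; 1 4];            % 3 observations, as in the numeric example
Y = pdist(X, 'euclidean');      % one distance per unordered pair of rows
n = size(X, 1);
fprintf('n = %d, numel(Y) = %d, n*(n-1)/2 = %d\n', n, numel(Y), n*(n-1)/2);

% A vector with 5 entries cannot be pdist output: no integer n gives
% n*(n-1)/2 == 5 (n = 3 gives 3, n = 4 gives 6).
candidates = (2:6) .* ((2:6) - 1) / 2;   % valid lengths for n = 2..6
disp(any(candidates == 5));              % displays 0
```

A vector holding the distances from one string to each of the others has one entry per string, which is a different layout from the one-entry-per-pair vector checked here.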
UPDATE 2
clc
S1='I have a pen';
f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.
ii=numel(events)+1; % total number of strings in the comparison
D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii
S2=events(kk);
%S2=cell2mat(S2);
for jk=kk+1:ii
D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});
end
end
D = D + D'; %'# symmetric distance matrix
%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');
T = linkage(D, 'single');
dendrogram(T)
Error: ??? Cell contents reference from a non-cell array object. Error in ==> Untitled2 at 22 D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});
Also, why am I reading the events from the file inside the first loop? It doesn't seem logical. I'm a bit confused about whether I can work this way, or whether the only solution is to hard-code all the strings. Help much appreciated.
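One way the double loop could be arranged is to collect every string in a single cell array first, so that both loop indices address the same array. This is only a sketch under assumptions: the four example sentences are typed in directly rather than read from events.txt, strs is a hypothetical name, and lev is a stand-in dynamic-programming Levenshtein function written for this example, not the levenshtein_distance function used above. Local functions in a script need MATLAB R2016b or later; on older releases, lev would have to be saved as lev.m.

```matlab
% Sketch: pairwise string distances -> pdist-style vector -> dendrogram.
strs = {'I have a pen.'; 'I have a paper.'; 'I have a pencil.'; 'I have a cat.'};
n = numel(strs);

D = zeros(n);                        % full n-by-n distance matrix
for kk = 1:n
    for jk = kk+1:n                  % upper triangle only
        D(kk, jk) = lev(strs{kk}, strs{jk});
    end
end
D = D + D';                          % mirror to make it symmetric

Y = squareform(D, 'tovector');       % n*(n-1)/2 entries, same layout as pdist
T = linkage(Y, 'single');
dendrogram(T, 'Labels', strs);

function d = lev(a, b)
% Stand-in Levenshtein distance between two char vectors (illustrative only).
m = numel(a);
n = numel(b);
C = zeros(m + 1, n + 1);
C(:, 1) = (0:m)';                    % cost of deleting the first i characters
C(1, :) = 0:n;                       % cost of inserting the first j characters
for i = 1:m
    for j = 1:n
        cost = double(a(i) ~= b(j)); % 0 if the characters match, 1 otherwise
        C(i+1, j+1) = min([C(i, j+1) + 1, C(i+1, j) + 1, C(i, j) + cost]);
    end
end
d = C(m + 1, n + 1);
end
```

If the strings come from a file, the same loop should work once the cell array returned by textscan is used in place of the literal strs; the point is that both strings in each comparison are taken from one cell array.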
UPDATE
Code to compare two sentences:
clc
str1 = 'Fire in NY';
str2= 'Jeff is sick';
D=levenshtein_distance(str1,str2);
D = D + D'; %'# symmetric distance matrix
%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
%D = squareform(D, 'tovector');
T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');
Output: D=18.
With different strings:
clc
str1 = 'Fire in NY';
str2= 'NY catches fire';
D=levenshtein_distance(str1,str2);
D = D + D'; %'# symmetric distance matrix
%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
%D = squareform(D, 'tovector');
T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');
Output: D=28.
Based on distance, the completely different sentence looks more similar than the reworded one. What I'm trying to do is this: if I have already stored Fire in NY, I won't store NY catches fire. However, in the first case I would store the sentence, since the information is new.
Is LD (Levenshtein distance) sufficient to do this? Help appreciated.
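To make the contrast concrete, here is a small sketch that compares the same sentence pairs by shared words instead of by character edits. It swaps in a different measure, the Jaccard similarity of lower-cased word sets, purely for illustration; the helper names words and jaccard are hypothetical and the approach is an assumption, not something taken from the code above.

```matlab
% Sketch: word-level Jaccard similarity (illustrative alternative measure).
% 1 means identical word sets, 0 means no words in common.
words   = @(s) unique(regexp(lower(s), '\s+', 'split'));
jaccard = @(a, b) numel(intersect(words(a), words(b))) / ...
                  numel(union(words(a), words(b)));

s1 = jaccard('Fire in NY', 'Jeff is sick');      % no shared words
s2 = jaccard('Fire in NY', 'NY catches fire');   % shares 'fire' and 'ny'
fprintf('unrelated: %.2f, reworded: %.2f\n', s1, s2);
% prints: unrelated: 0.00, reworded: 0.50
```

On these two pairs, a word-level comparison ranks the reworded sentence as closer, which is the opposite of the ordering the character-level distances gave. Whether that holds up on real event text would need to be checked.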