views:

61

answers:

1

I'm supposed to take two sentences each time and compute if they are similar. By similar I mean, both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.

INPUT2: A Bus is stopped here. A vehicle stops here.

INPUT3: Fire in NY. NY is burnt down.

INPUT4: Fire in NY. 50 died in NY fire.

I don't want to use ontology tree as a soul. I wrote a code to compute Levenshtein distance (LD) between sentences and then decide if the 2nd sentence:

  • can be ignored (INPUT1 and 2),
  • should replace the first sentence (INPUT 3), or
  • store along with the first sentence (INPUT4).

I'm not happy with the code as LD only computes syntactical level (what other methods ?). How can semantic be incorporated (like bus is sort of a vehicle?) .

The code goes here:

%# As the difference is computed, a decision is made on the new event
%# (string 2) to be ignored, to replace existing event (string 1) or to be
%# stored separately. The higher the LD metric, the higher the difference
%# between two strings. Of course, lower difference indices either identical
%# or similar events. However, the higher difference indicates the new event
%# as a fresh event.

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2);   %# Initializing the new length.

g=+1;             %# just constant
m=+0;             %# match is cheaper, we seek to minimize
d=+1;             %# not-a-match is more costly.

% do BC's
L(:,1)=([0:L1-1]*g)';
L(1,:)=[0:L2-1]*g;

m4=0;             %# loop invariant
%# Calculating required edits.
for idx=2:L1;
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;
        m2=L(idx-1,idy) + g;
        m3=L(idx,idy-1) + g;
        L(idx,idy)=min(m1,min(m2,m3)); % only minimum edits allowed.
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)     %# Distance is so less that string 2 seems identical to string 1.
    store=str1;        %# Hence string 2 is ignored. String 1 remains stored.
elseif (D>=5 && D<=15) %# Distance is larger to be identical but not enough to
    %# make string 2 an individual event.
    store= str2;       %# String 2 is somewhat similar to string 1.
                       %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end

Any help is appreciated.

A: 

"Semantically". No simple text-book algorithm for that. Natural language (esp. English) is a very complicated and fickle beast. Let's look at (just a small part of) the provided cases:

INPUT1: Obama signs the law. A new law is signed by Obama.

Signing a law makes it a 'new' law.

INPUT2: A Bus is stopped here. A vehicle stops here.

Need to know a bus is a type if vehicle as well as some sort of time relation. Also, what if the bus did stop but does not normally stop or is no longer stopped? It can be taken several ways.

INPUT3: Fire in NY. NY is burnt down.

Need to know that fires can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

Need to know that fires can kill things (see next). Need to associated the "news headline" (50 WHAT?) with people. The brain can do this somewhat trivially. Computer programs are not brains.

And I'm no English major :-)

pst
Very ture. I found a few works uning WORDNET somewhat try to replate one word to several others (like lock is a part of door, bus is a vehicle, fire can burn so on). But I'm not sure how to implement that in my code.
Tinglin

related questions