The problem is as follows:

I have one summary, usually between 20 and 50 words, that I'd like to compare to other relatively similar summaries. The general category and the geographical location to which the summary refers are already known.

For instance, if people from the same area are writing about building a house, I'd like to be able to list those summaries with some level of certainty that they actually refer to building houses instead of building a garage or a backyard swimming pool.

The data set is currently around 50 000 documents with a growth rate of some 200 documents per day.

Preferred languages would be Python, PHP, C/C++, Haskell or Erlang, whichever might get the job done. Also, if you don't mind, I'd like to understand the reasoning for picking a specific language.

+1  A: 

You could have a look at the WEBSOM project.

Even though their web site has not been updated recently, the problem being solved is very similar. They were processing amounts of data similar to yours (and more) some 10 years ago, so today you could probably run the algorithms on almost any hardware, perhaps even a cell phone.

Pukku
As you probably guessed, my data is mostly in Finnish, so this might prove to be very relevant. I'll have to dig into this tomorrow.
kari.patila
+1  A: 

There isn't really a particular language to pick. You're trying to find semantic similarity. This is a very large area. You might be interested in this paper:

Corpus-based and Knowledge-based Measures of Text Semantic Similarity

BobbyShaftoe
Yeah, I tried to steer clear of the semantic approach, because finding related terms in Finnish is a problem I'm not equipped to tackle.
kari.patila
+2  A: 

You can try to use some string similarity measures, such as Jaccard and Dice, but instead of calculating character overlaps, you calculate word overlaps. For example, using Python, you can use the following:

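def word_overlap(a, b):
    """Return the distinct words that appear in both a and b."""
    return set(a) & set(b)


def jaccard(a, b, overlap_fn=word_overlap):
    r"""
    Jaccard coefficient (/\ represents intersection, |X| the size of X):
        Jaccard(A, B) = |A /\ B| / (|A| + |B| - |A /\ B|)
    """
    # Work on sets so that a repeated word is only counted once;
    # with plain lists, duplicates could push the score above 1.
    a, b = set(a), set(b)
    c = overlap_fn(a, b)
    return float(len(c)) / (len(a) + len(b) - len(c))

jaccard("Selling a beautiful house in California".split(), "Buying a beautiful crip in California".split())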
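
The Dice coefficient mentioned above can be computed on word sets in much the same way. The following is only a minimal sketch: the function name `dice` and the choice to return 0.0 for two empty inputs are my own assumptions, not part of any library.

```python
def dice(a, b):
    """Dice coefficient on word sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)  # count each distinct word once
    if not a and not b:
        return 0.0  # assumption: treat two empty summaries as dissimilar
    return 2.0 * len(a & b) / (len(a) + len(b))


score = dice("Selling a beautiful house in California".split(),
             "Buying a beautiful crip in California".split())
print(score)
```

Dice weights the shared words twice, so for the same pair of summaries it never scores lower than Jaccard; which of the two separates your categories better is something you would have to check on your own data.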
JG
+2  A: 

Since Python has good native support for sets, we can simplify JG's code as follows:

def jaccard(a, b):
    r"""
    Jaccard coefficient (/\ represents intersection, |X| the size of X):
        Jaccard(A, B) = |A /\ B| / (|A| + |B| - |A /\ B|)
    """
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# Split into words first; set("some string") would give a set of characters.
jaccard(set("Selling a beautiful house in California".split()), set("Buying a beautiful crip in California".split()))
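
To apply this to the situation in the question (one summary compared against many), something along these lines might work. This is a rough sketch assuming Python 3, where `\w` also matches letters such as ä and ö; the names `tokens`, `jaccard_sets` and `rank_similar`, the example summaries and the threshold of 0.3 are all made up for illustration and would need tuning on real data.

```python
import re


def tokens(text):
    """Lowercase the text and return its distinct words as a set."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_sets(a, b):
    """Jaccard coefficient of two sets; 0.0 if both are empty."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0


def rank_similar(query, candidates, threshold=0.3):
    """Return (score, summary) pairs at or above threshold, best first."""
    q = tokens(query)
    scored = [(jaccard_sets(q, tokens(c)), c) for c in candidates]
    return sorted([s for s in scored if s[0] >= threshold], reverse=True)


# Hypothetical example data
candidates = [
    "Building a house in Espoo",
    "Building a garage in Espoo",
    "Selling a boat",
]
result = rank_similar("We are building a house in Espoo", candidates)
for score, summary in result:
    print(round(score, 2), summary)
```

Note that the garage summary still scores fairly high here, because it shares every word except one with the query. Plain word overlap cannot tell which word carries the meaning, so you would probably want to remove common words or weight rare ones more heavily before relying on it.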
Chantz