The problem is as follows:
I have one summary, usually between 20 and 50 words long, that I'd like to compare with other, relatively similar summaries. The general category and the geographical location to which the summary refers are already known.
For instance, if people from the same area are writing about building a house, I'd like to be able to list those summaries with some level of certainty that they actually refer to building houses instead of building a garage or a backyard swimming pool.
The data set currently holds around 50,000 documents and grows by roughly 200 documents per day.
Preferred languages would be Python, PHP, C/C++, Haskell or Erlang, whichever gets the job done. Also, if you don't mind, I'd like to understand the reasoning behind picking a specific language.
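
To make the requirement concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes a plain bag-of-words TF-IDF weighting with cosine similarity, which is only one possible approach and not necessarily the right one for this data. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_similar`) and the sample summaries are made up for illustration, and it uses only the Python standard library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, candidates):
    """Return (score, candidate) pairs sorted from most to least similar."""
    vectors = tfidf_vectors([query] + candidates)
    query_vec = vectors[0]
    scored = [(cosine(query_vec, v), c) for v, c in zip(vectors[1:], candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical summaries from the same category and area.
query = "We are building a new house with three bedrooms"
candidates = [
    "Building a house for our family with two bedrooms",
    "Installing a backyard swimming pool this summer",
    "Building a garage next to the driveway",
]

for score, summary in rank_similar(query, candidates):
    print(f"{score:.3f}  {summary}")
```

The score would serve as the "level of certainty" mentioned above: summaries under some cutoff would be left out of the list. I don't know whether word overlap alone is enough to separate a house from a garage in summaries this short, which is part of what I'm asking.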