linguistics

PHP implementation of a Bayes classifier: Assign topics to texts

In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports* ...

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other Western languages) if the whitespace were removed. The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'. In the novel, the Sheep Man is translated as saying things like: "likewesaid, we'lldowhatwecan. Trytoreconnec...
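
One possible approach, offered as a minimal sketch and not as the accepted answer, is dictionary-based segmentation with dynamic programming: for every prefix of the string, remember the split into known words that uses the fewest words. The function name `segment` and the tiny vocabulary are assumptions made for illustration; a real system would need a large word list and probably word frequencies to rank competing splits.

```python
from typing import Iterable, List, Optional


def segment(text: str, vocabulary: Iterable[str]) -> Optional[List[str]]:
    """Split `text` into vocabulary words, preferring the fewest words.

    Returns None when no complete split exists.
    """
    words = set(vocabulary)
    max_len = max((len(w) for w in words), default=0)
    n = len(text)

    # best[i] is the shortest word list covering text[:i], or None if unreachable.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []

    for end in range(1, n + 1):
        # Only look back as far as the longest known word.
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            if prefix is None:
                continue
            piece = text[start:end]
            if piece in words:
                candidate = prefix + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate

    return best[n]


# Hypothetical toy vocabulary, just large enough for the quoted example.
vocab = {"like", "we", "said", "we'll", "do", "what", "can"}
print(segment("likewesaid", vocab))
print(segment("we'lldowhatwecan", vocab))
```

Preferring the fewest words is only a crude tie-breaker. Ambiguous strings generally need a language model to pick the most plausible reading.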

Calculating a relative Levenshtein distance - make sense?

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same". Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters... What I am now doing is taking the distanc...
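
As a rough illustration of the idea, and not necessarily what the asker ended up doing, one common way to make the distance relative is to divide it by the length of the longer string, giving a value between 0 (identical) and 1 (completely different). The sketch below uses plain Levenshtein distance instead of Damerau-Levenshtein to keep it short; the names `levenshtein` and `relative_distance` are assumptions chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Distance normalized by the longer string's length, in [0.0, 1.0]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(relative_distance("abcd", "abcf"))
```

With this normalization, a distance of 4 on a 20-letter word gives 0.2, while a distance of 4 on a 4-letter word gives 1.0, which matches the intuition in the question.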

How to extract words from text as per the context

Hello, I want to extract relevant words from a text statement provided by the user. E.g., for the question "How many sides are there in a rectangle?" the words should be 'rectangles', 'sides', 'many', 'how'. We've discovered that what I'm actually aiming to build is an NLP question-answering system. But right now I only want to extract the requi...