computational-linguistics

How can I split multiple joined words?

I have an array of 1000 or so entries, with examples below: wickedweather, liquidweather, driveourtrucks, gocompact, slimprojector. I would like to be able to split these into their respective words, as: wicked weather, liquid weather, drive our trucks, go compact, slim projector. I was hoping a regular expression may do the trick. But, sinc...
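
One common way to approach this kind of split is dictionary-based segmentation with dynamic programming. The sketch below is only an illustration under stated assumptions: the `VOCAB` set and the `segment` function are hypothetical names, the toy vocabulary is built from the example words in the question, and preferring the split with the fewest words is an arbitrary tie-breaking choice (a real system would more likely score candidates by word frequency).

```python
# Hypothetical toy vocabulary taken from the question's examples.
# A real vocabulary would come from a word list or a corpus.
VOCAB = {
    "wicked", "weather", "liquid", "drive", "our",
    "trucks", "go", "compact", "slim", "projector",
}


def segment(text, vocab=VOCAB):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no full split exists.
    """
    n = len(text)
    max_len = max(map(len, vocab))
    # best[i] is the shortest known list of words covering text[:i],
    # or None if text[:i] cannot be split into vocabulary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


for entry in ["wickedweather", "driveourtrucks", "slimprojector"]:
    words = segment(entry)
    print(entry, "->", " ".join(words) if words else "(no split found)")
```

A plain regular expression cannot do this on its own, because the word boundaries depend on which substrings are known words rather than on any character pattern.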

Word-separating algorithm

What is the algorithm, seemingly in use on domain parking pages, that takes a spaceless bunch of words (e.g. "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (e.g. "the carrot of curiosity")? ...

Morphophoneme processing library in Java

Are there any good Java libraries with prebuilt dictionaries that I can use to try and extract word roots from input words? I asked a more general question which supersedes this question. It is here. Please vote to close this question. ...

Natural Language Parsing tools: what is out there and what is not?

I'm looking for various NLP tools for a project I'm working on, and so far I've found the Stanford NLP projects the most useful. Does anyone know if there are other tools out there that would be useful for a language understander? And more importantly, are there tools that are NOT out there? Most specifically, I'm looking fo...

Computational Linguistics project idea using Hadoop MapReduce

I need to do a project for a Computational Linguistics course. Is there any interesting "linguistic" problem that is data-intensive enough to work on using Hadoop MapReduce? The solution or algorithm should analyse and provide some insight into the "linguistic" domain. However, it should be applicable to large datasets so that I can use hadoo...

Java: remove-common-words-method in the API?

Related: Forum post. Before reinventing the wheel, I need to know whether such a method exists. Stripping words according to a list does not sound challenging, but there are linguistic aspects, such as which words to stress the most in stripping, and what about context? ...

code throws std::bad_alloc, not enough memory or can it be a bug?

I am parsing using a pretty large grammar (1.1 GB, it's data-oriented parsing). The parser I use (bitpar) is said to be optimized for highly ambiguous grammars. I'm getting this error: terminate called after throwing an instance of 'std::bad_alloc' what(): St9bad_alloc dotest.sh: line 11: 16686 Aborted bitpar -p -b 1...

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other Western languages) if the whitespace were removed. The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'. In the novel, the Sheep Man is translated as saying things like: "likewesaid, we'lldowhatwecan. Trytoreconnec...

Find the words in a long stream of characters. Auto-tokenize.

How would you find the correct words in a long stream of characters? Input: "The revised report onthesyntactictheoriesofsequentialcontrolandstate". Google's output: "The revised report on syntactic theories sequential controlandstate" (which is close enough considering the time that they produced the output). How do you think Goo...