I have an array of 1000 or so entries, with examples below:
wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector
I would like to be able to split these into their respective words, as:
wicked weather
liquid weather
drive our trucks
go compact
slim projector
I was hoping a regular expression may do the trick. But, sinc...
What is the algorithm, seemingly in use on domain parking pages, that takes a spaceless bunch of words (e.g. "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (e.g. "the carrot of curiosity")?
...
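One common way to approach this kind of splitting is a dictionary lookup combined with dynamic programming. The sketch below is only an illustration under stated assumptions: the tiny `WORDS` set is a hypothetical stand-in for a real word list, `segment` is a name chosen for this example, and preferring the split with the fewest words is a simple heuristic, not necessarily what parking pages or search engines actually do.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real run would load a full word list.
WORDS = {
    "the", "carrot", "of", "curiosity", "wicked", "liquid", "weather",
    "drive", "our", "trucks", "go", "compact", "slim", "projector",
}
MAX_LEN = max(map(len, WORDS))  # no candidate word can be longer than this


def segment(text):
    """Split `text` into the fewest dictionary words, or return None."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best split of text[start:], as a tuple of words (None if impossible).
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
            word = text[start:end]
            if word in WORDS:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Assumed heuristic: fewer words usually means a more natural split.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return None if result is None else list(result)


print(" ".join(segment("thecarrotofcuriosity")))  # the carrot of curiosity
```

With a realistic dictionary, many strings have several valid splits, so the fewest-words rule is often replaced by a score based on word frequencies; that refinement is not shown here.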
Are there any good Java libraries with prebuilt dictionaries that I can use to try and extract word roots from input words?
I asked a more general question which supersedes this question. It is here. Please vote to close this question.
...
I'm looking for various NLP tools for a project I'm working on and right now I've found most useful the Stanford NLP projects.
Does anyone know if there are other tools that are out there that would be useful for a language understander?
And more importantly, are there tools that are NOT out there?
Most specifically, I'm looking fo...
I need to do a project for a Computational Linguistics course. Is there any interesting "linguistic" problem which is data-intensive enough to work on using Hadoop MapReduce? The solution or algorithm should try to analyse and provide some insight in the "linguistic" domain. However, it should be applicable to large datasets so that I can use hadoo...
Related:
Forum post
Before reinventing the wheel, I need to know whether such a method exists. Stripping words according to a list such as list does not sound challenging, but there are linguistic aspects, such as which words to stress the most in stripping, how about context?
...
I am parsing using a pretty large grammar (1.1 GB, it's data-oriented parsing). The parser I use (bitpar) is said to be optimized for highly ambiguous grammars. I'm getting this error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
dotest.sh: line 11: 16686 Aborted bitpar -p -b 1...
I wondered how you would go about tokenizing strings in English (or other Western languages) if whitespace were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnec...
How would you find the correct words in a long stream of characters?
Input :
"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
Google's Output:
"The revised report on syntactic theories sequential controlandstate"
(which is close enough considering the time that they produced the output)
How do you think Goo...