natural-language

What's a good natural language library to use for paraphrasing?

I'm looking for an existing library to summarize or paraphrase content (I'm aiming at blog posts) - any experience with existing natural language processing libraries? I'm open to a variety of languages, so I'm more interested in the abilities & accuracy. ...

Vista speech recognition in multiple languages

Hi, my primary language is Spanish, but I use all my software in English, including Windows; however, I'd like to use speech recognition in Spanish. Do you know if there's a way to use Vista's speech recognition in a language other than the primary OS language? ...

APIs and Datasets for Natural Languages?

Are there any good APIs and public datasets (dictionaries, phrases) for working with natural languages? Specifically, do any good ones exist for working on translation between English and Korean? ...

How Do You Categorize Based On Text Content?

How does one automatically find categories for text based on content? ...

Your favorite natural language parser?

This is just a poll on what parser you like to use for parsing sentences of natural language syntactically. I am interested in complete software toolkits/solutions. A good answer would list at least some of the following: The name of the parser (obviously) and a link to its webpage. The (programming!) language(s) it's written in. The (...

Word frequency algorithm for natural language processing

Without getting a degree in information retrieval, I'd like to know if there exist any algorithms for counting how frequently words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle. What I'd like: ignore articles, pronouns, etc...
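
For readers skimming this question, here is a minimal sketch of the kind of counting it describes: tokenize, drop common function words, and tally what remains. The `STOP_WORDS` set and the `word_frequencies` helper are hypothetical names made up for this illustration, and the stop-word list is deliberately tiny; a real one would be far longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (articles, pronouns, etc.).
STOP_WORDS = {"a", "an", "the", "i", "you", "he", "she", "it", "we", "they",
              "is", "are", "was", "of", "to", "and", "in", "this", "that"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word occurrences, skipping anything in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

comments = "The service was great. I love the service and the staff."
freq = word_frequencies(comments)
print(freq.most_common(3))
```

`Counter.most_common` then gives the dominant terms, which is roughly the raw material a Wordle-style display needs.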

How do I determine if a random string sounds like English?

I have an algorithm that generates strings based on a list of input words. How do I keep only the strings that sound like English words? I.e., discard RDLO while keeping LORD. EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example, KEAL would be accepted. ...
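
One possible approach, sketched here under loud assumptions: score each candidate by how many of its character bigrams also occur in known English words. The `SAMPLE_WORDS` list, the helper names, and the 0.8 threshold are all hypothetical choices for illustration; a usable version would train on a full dictionary and probably weight bigrams by frequency.

```python
def bigrams(word):
    """Return character bigrams of word, with ^ and $ marking its boundaries."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# Hypothetical toy training list; a real system would use a full word list.
SAMPLE_WORDS = ["lord", "word", "real", "deal", "keep", "lead",
                "road", "kind", "seal", "meal", "lore", "order"]
KNOWN = {bg for w in SAMPLE_WORDS for bg in bigrams(w)}

def english_score(candidate):
    """Fraction of the candidate's bigrams that appear in the training words."""
    grams = bigrams(candidate)
    return sum(bg in KNOWN for bg in grams) / len(grams)

def looks_english(candidate, threshold=0.8):
    # The threshold is an arbitrary assumption; tune it on real data.
    return english_score(candidate) >= threshold

for s in ["LORD", "RDLO", "KEAL"]:
    print(s, round(english_score(s), 2), looks_english(s))
```

With this toy list, RDLO is penalized for the unseen pairs "dl" and a word-final "o", while KEAL passes even though it is not a dictionary word.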

NLP: Qualitatively "positive" vs "negative" sentence

I need your help in determining the best approach for analyzing industry-specific sentences (i.e. movie reviews) for "positive" vs "negative". I've seen libraries such as OpenNLP before, but it's too low-level - it just gives me the basic sentence composition; what I need is a higher-level structure: - hopefully with wordlists - hopefull...

Contextual Natural Language Resources, Where Do I Start?

Where can I find some .NET or conceptual resources to start working with natural language, where I can pull context and subjects from text? I'd prefer not to work with word frequency algorithms. ...

NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, whic...

What options do you recommend for language translation on content driven Web sites?

Please read the whole question. I'm not looking for an approach to managing multi-lingual content, but I'm looking for a way to actually get that multi-lingual content. This usually falls within technical recommendations on most projects I work on, and I hope someone can offer some help. We are working with a client now who has the perso...

What would be the best tool to create a natural DSL in Java?

A couple of days ago, I read a blog entry (http://ayende.com/Blog/archive/2008/09/08/Implementing-generic-natural-language-DSL.aspx) where the author discusses the idea of a generic natural language DSL parser using .NET. The brilliant part of his idea, in my opinion, is that the text is parsed and matched against classes using the same n...

Latent Dirichlet Allocation, pitfalls, tips and programs

I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice. Which program is the "best", where best is some combination of easiest to use, best prior estimation, and fast? How do I incorporate my intuitions about topicality? Let's say I think I know that some items in the corpus a...

Is there a human readable programming language?

I mean, is there a coded language with human style coding? For example: Create an object called MyVar and initialize it to 10; Take MyVar and call MyMethod() with parameters. . . I know it's not so useful, but it can be interesting to create such a grammar. ...

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar etc. What would be the best way of doing this? I am interested in using Python for the project, so I am assuming that using NLTK is the best bet, but I am open t...

What are good starting points for someone interested in natural language processing?

I've recently come up with some new possible projects that would have to deal with deriving 'meaning' from text submitted and generated by users. Natural language processing is the field that deals with these kinds of issues, and after some initial research I found the OpenNLP Hub and university collaborations like the atte...

Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text

I'm working on a project where I need to analyze a page of text and collections of pages of text to determine dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm, or several, that would achieve my goals below? What I want to do is similar...

tf-idf and previously unseen terms

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents have new terms, especially if they are high frequency? Under traditional cosine match...
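
To make the unseen-term problem concrete: with the plain formula idf = log(N / df), a term that appears in no corpus document has df = 0 and the weight is undefined. One common workaround (not necessarily what the asker had in mind) is add-one smoothing, idf = log((1 + N) / (1 + df)) + 1. The sketch below assumes documents are represented as sets of tokens; `smoothed_idf`, `tfidf_vector`, and the toy `corpus` are hypothetical names for this illustration.

```python
import math
from collections import Counter

def smoothed_idf(term, docs):
    """Add-one smoothed IDF; stays finite even when df == 0."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_vector(tokens, docs):
    """Map each token of a query or new document to tf * smoothed idf."""
    tf = Counter(tokens)
    return {t: count * smoothed_idf(t, docs) for t, count in tf.items()}

# Toy corpus: each document is a set of its tokens.
corpus = [{"cat", "sat"}, {"cat", "ran"}, {"dog", "ran"}]

# "zebra" never occurs in the corpus but still gets a finite weight.
query = tfidf_vector(["cat", "zebra", "zebra"], corpus)
print(query)
```

Note the side effect: an unseen term receives the largest possible IDF, so a high-frequency new term can dominate a cosine match, which is exactly the concern the question raises.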

Natural Language/Text Mining and Reddit/social news site

I think there is a wealth of natural language data associated with sites like Reddit, Digg, or news.google.com. I have done a little bit of research with text mining, but can't find how I could use those tools to parse something like Reddit. What kind of applications can you come up with? ...

Split string into sentences using regular expression

I need to match a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression to break the string into a group after every fourth period. Something like: string regex = @"(.*.\s){4}"; System.Text.RegularExpressions.Regex exp = new System.Text.Regul...
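
A note on the pattern quoted in the question: an unescaped `.` matches any character, so `.*.` does not stop at a period. As a rough sketch of one way to do the grouping, here is a Python version (Python rather than C#, purely for brevity) that matches runs of non-period characters, each ended by a literal period or the end of the string, up to four at a time. The `group_sentences` helper is a hypothetical name, and the pattern assumes periods only ever end sentences (no abbreviations or decimals).

```python
import re

# One "sentence" = non-period characters followed by a period (plus trailing
# whitespace) or by the end of the string; take up to four per group.
GROUP_OF_FOUR = re.compile(r"(?:[^.]+(?:\.\s*|$)){1,4}")

def group_sentences(text):
    """Split text into chunks of at most four period-terminated sentences."""
    return [m.strip() for m in GROUP_OF_FOUR.findall(text)]

text = "one. two. three. four. five. six. seven. eight. nine. ten. eleven"
for group in group_sentences(text):
    print(group)
# one. two. three. four.
# five. six. seven. eight.
# nine. ten. eleven
```

The `{1,4}` quantifier lets the final group hold the leftover sentences; changing it to `{4}` would silently drop an incomplete trailing group.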