natural-language

Shell script to find bigrams.

I'm making a shell script to find bigrams, which works, sort of.
    #tokenise words
    tr -sc 'a-zA-Z0-9.' '\012' < "$1" > out1
    #create 2nd list offset by 1 word
    tail -n+2 out1 > out2
    #paste list together
    paste out1 out2
    #clean up
    rm out1 out2
The only problem is that it pairs words from the end and start of the previous sentence. e.g. for th...
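One way to avoid pairing words across sentences is to split the text into sentences first and build bigrams within each one. The sketch below is in Python rather than shell, and it assumes that `.`, `!` and `?` always end a sentence, which is a crude rule (abbreviations like "e.g." will break it). The function name `sentence_bigrams` is made up for illustration.

```python
import re


def sentence_bigrams(text):
    """Return word bigrams that never span a sentence boundary."""
    bigrams = []
    # Assumption: any run of '.', '!' or '?' ends a sentence.
    for sentence in re.split(r"[.!?]+", text):
        # Tokenise on runs of ASCII letters and digits, like the tr call.
        words = re.findall(r"[A-Za-z0-9]+", sentence)
        # Pair each word with its successor inside the same sentence.
        bigrams.extend(zip(words, words[1:]))
    return bigrams


if __name__ == "__main__":
    for first, second in sentence_bigrams("The cat sat. A dog ran."):
        print(first, second)
```

Because the pairing happens per sentence, the last word of one sentence is never joined to the first word of the next.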

(human) Language of a document

Is there a way (a program, a library) to approximately know which language a document is written in? I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal). I don't need perfect matches, only a guess. ...

Can anyone point me at a good example of pretty-printing rules to "English"?

I've got the equivalent of an AST that a user has built using a rule engine. But when displaying a list of the rules, I'd like to be able to "pretty print" each rule into something that looks nice**. Internally when represented as a string they look like s-expressions so imagine something like: (and (contains "foo" "foobar") (equals 4...

Is functional programming the next step towards natural-language programming?

This is my very first question so I am a bit nervous about it because I am not sure whether I get the meaning across well enough. Anyhow, here we go.... Whenever new milestones in programming have been reached it seems they always have had one goal in common: to make it easier for programmers, well, to program. Machine language, opcode...

Algorithm to determine how positive or negative a statement/text is

I need an algorithm to determine if a sentence, paragraph or article is negative or positive in tone... or better yet, how negative or positive. For instance: Jason is the worst SO user I have ever witnessed (-10) Jason is an SO user (0) Jason is the best SO user I have ever seen (+10) Jason is the be...

Natural language automation?

I remember reading about an automation program for Windows that would accept a list of commands like this:
    press the ok button
    put "hello world" into the text control
    press the add button
etc. Can anyone name this program? A thousand thank-yous. ...

Code Golf: Number to Words

The code golf series seems to be fairly popular. I ran across some code that converts a number to its word representation. Some examples would be (powers of 2 for programming fun):
    2 -> Two
    1024 -> One Thousand Twenty Four
    1048576 -> One Million Forty Eight Thousand Five Hundred Seventy Six
The algorithm my co-worker came up with was alm...
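For reference, here is an ungolfed sketch of the conversion in Python. It is not the co-worker's algorithm (the excerpt does not show it); it just splits the number into groups of three digits and names each group. It assumes non-negative integers below 10^12, and the names `number_to_words` and `_below_thousand` are invented for this example.

```python
ONES = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven",
        "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
        "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy",
        "Eighty", "Ninety"]
SCALES = ["", "Thousand", "Million", "Billion"]


def _below_thousand(n):
    """Return the words for 1 <= n <= 999 as a list."""
    parts = []
    if n >= 100:
        parts += [ONES[n // 100], "Hundred"]
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if n > 0:
        parts.append(ONES[n])
    return parts


def number_to_words(n):
    """Convert 0 <= n < 10**12 to words (assumed range)."""
    if n == 0:
        return "Zero"
    groups = []
    scale = 0
    while n > 0:
        # Peel off the lowest three digits and name them.
        n, chunk = divmod(n, 1000)
        if chunk:
            words = _below_thousand(chunk)
            if SCALES[scale]:
                words.append(SCALES[scale])
            groups.append(words)
        scale += 1
    # Groups were collected lowest first, so reverse before joining.
    return " ".join(word for group in reversed(groups) for word in group)


if __name__ == "__main__":
    for value in (2, 1024, 1048576):
        print(value, "->", number_to_words(value))
```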

End user tool for generating a regular expression

We have a SaaS application requirement to allow a user responsible for building a CMS site to define up to 10 custom fields in a form. As part of this field definition we want to add a field validation option which we will store (and apply at runtime) as a regex. Are there any tools, code samples or similar that offer a wizard style f...

Natural Language Unit Conversion in PHP?

I'm looking for a library (preferably in PHP) that can extract weight/height data from a string. I want my users to input something like "I weigh 80 k and I'm 1.8m tall" or even "220 lb" and "6' 1"" and pass it through a function that can extract the quantity and the unit. Does anyone know if there's something like that out there? ...
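As a rough illustration of the extraction step, a regular expression can pull out number/unit pairs. The sketch below is in Python, not PHP, and is not an existing library: the unit list, the `UNIT_ALIASES` table (including the guess that "k" means kilograms) and the function name `extract_quantities` are all assumptions. It does not handle feet-and-inches input such as 6' 1".

```python
import re

# Assumed mapping from spellings a user might type to a canonical unit.
UNIT_ALIASES = {
    "kilos": "kg", "kg": "kg", "k": "kg",
    "lbs": "lb", "lb": "lb",
    "cm": "cm", "m": "m",
}

# A number (optionally with decimals), optional spaces, then a known unit.
# Longer spellings come first so "kg" is not read as "k".
QUANTITY = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kilos|kg|k|lbs|lb|cm|m)\b",
    re.IGNORECASE,
)


def extract_quantities(text):
    """Return a list of (value, canonical_unit) pairs found in text."""
    return [
        (float(value), UNIT_ALIASES[unit.lower()])
        for value, unit in QUANTITY.findall(text)
    ]


if __name__ == "__main__":
    print(extract_quantities("I weigh 80 k and I'm 1.8m tall"))
    print(extract_quantities("220 lb"))
```

Converting between units would then be a separate lookup on the canonical unit names.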

About "AUTOMATIC TEXT SUMMARIZER (linguistic based)"

Hello, I have "AUTOMATIC TEXT SUMMARIZER (linguistic approach)" as my final year project. I have collected enough research papers and gone through them. Still, I am not very clear about the 'how-to-go-for-it' thing. Basically, I found "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found that it is much easier compared to my...

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g., invisible -> in-vi-sib-le There are some syllabification rules that could be used: V CV VC CVC CCV CCCV CVCC *where V is a vowel and C is a consonant. e.g., pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC) I've tried a few methods, among which were using...

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content. What I want to do is find runs of text in articles that ought to link to other articles. So, if article Foo has a run of text like "Students in 8th grade are being en...

Should identifiers and comments be always in English or in the native language of the application and developers?

Duplicate: Do you use another language instead of English Related: Coding in other (spoken) languages Given a programming language (such as Java) that allows you to use non-ASCII identifiers (class, method, variable names), and an application written for non-English users, by developers who speak English only as a foreign language at v...

Natural English language words

I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality. There are 1,000,000 words in the English language including foreign and/or technical words. Can you please suggest such a source (or close to 500k w...

Natural language parser for dates (.NET)?

I want to be able to let users enter dates (including recurring dates) using natural language (e.g. "next Friday", "every weekday"), much like the examples at http://todoist.com/Help/timeInsert. I found this post, but it's a bit old and offered only one solution that I'm not entirely content with. I thought I'd resurrect this question and ...

Problem trimming Japanese string in Java.

I have the following string (Japanese) "　ユーザー名". The first character is "like" whitespace, but its Unicode code point is 12288 (U+3000), so if I do "　ユーザー名".trim() I get the same string (trim doesn't work). If I do trim in C++ it works OK. Does anyone know how to solve this issue in Java? Is there a special trim method for Unicode? ...

Java library that finds sentence boundaries

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use. Here's my experience with BreakIterator: Using the example here: I have the following Japanese: 今日はパソコンを買った。高性能のマックは早...

Misuse of English in the computer literature.....

So recently in the Rails literature the non-word (please, no down grades, I know non-word is a non-word but I'm not publishing this stuff and I don't claim to be more intelligent than those who write books :) P "dasherize" has become somewhat of a de-facto term as in: "to_xml will default to dasherizing the field names" Now in every ot...

What programming language is most like natural language?

I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me. So, a "smart" solution would be to speak a computer language. :) I am interested in how far a programming language can go to get near to (English) natural language. When I say near, I mea...

Algorithm for separating nonsense text from meaningful text

I provided some of my programs with a feedback function. Unfortunately I forgot to include some sort of spam-protection - so users could send anything they wanted to my server - where every feedback is stored in a huge db. In the beginning I periodically checked those feedbacks - I filtered out what was usable and deleted garbage. Th...