I'm making a shell script to find bigrams, which works, sort of.
#tokenise words
tr -sc 'a-zA-Z0-9.' '\012' < "$1" > out1
#create 2nd list offset by 1 word
tail -n+2 out1 > out2
#paste list together
paste out1 out2
#clean up
rm out1 out2
The only problem is that it pairs the last word of one sentence with the first word of the next sentence.
eg for th...
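One possible way around the cross-sentence pairs is to split the text into sentences first and only pair words inside each sentence. The following is a minimal sketch in Python rather than shell; the function name `sentence_bigrams` is made up for illustration, and it assumes that a sentence ends at `.`, `!` or `?` (so a decimal such as `3.14` would be split as well).

```python
import re

def sentence_bigrams(text):
    """Return (word, next_word) pairs that never span a sentence boundary."""
    pairs = []
    # Assumption: '.', '!' and '?' are the only sentence terminators.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9]+", sentence)
        # zip stops at the shorter sequence, so a one-word sentence yields no pair.
        pairs.extend(zip(words, words[1:]))
    return pairs

if __name__ == "__main__":
    for first, second in sentence_bigrams("The cat sat. A dog barked."):
        print(first, second)
```

With the sample text, "sat" is never paired with "A", because the pairs are built per sentence.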
Is there a way (a program, a library) to approximately know which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, just a reasonable guess.
...
I've got the equivalent of an AST that a user has built using a rule engine. But when displaying a list of the rules, I'd like to be able to "pretty print" each rule into something that looks nice. Internally, when represented as a string, they look like s-expressions, so imagine something like:
(and (contains "foo" "foobar") (equals 4...
This is my very first question so I am a bit nervous about it because I am not sure whether I get the meaning across well enough. Anyhow, here we go....
Whenever new milestones in programming have been reached, it seems they have always had one goal in common: to make it easier for programmers, well, to program.
Machine language, opcode...
I need an algorithm to determine if a sentence, paragraph or article is negative or positive in tone... or better yet, how negative or positive.
For instance:
Jason is the worst SO user I have ever witnessed (-10)
Jason is an SO user (0)
Jason is the best SO user I have ever seen (+10)
Jason is the be...
I remember reading about an automation program for Windows that would accept a list of commands like this:
press the ok button
put "hello world" into the text control
press the add button
etc. Can anyone name this program? A thousand thank-yous.
...
The code golf series seems to be fairly popular. I ran across some code that converts a number to its word representation. Some examples would be (powers of 2 for programming fun):
2 -> Two
1024 -> One Thousand Twenty Four
1048576 -> One Million Forty Eight Thousand Five Hundred Seventy Six
The algorithm my co-worker came up with was alm...
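For reference, here is an ungolfed sketch in Python of one common way to do the conversion: split the number into groups of three digits and name each group. It is not the co-worker's algorithm (which is cut off above), and the names `below_thousand` and `number_to_words` are invented for this example. It assumes non-negative integers below 10^12.

```python
ONES = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight",
        "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen",
        "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy",
        "Eighty", "Ninety"]
SCALES = ["", "Thousand", "Million", "Billion"]

def below_thousand(n):
    """Return the words for 1..999 as a list."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "Hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(ONES[n])
    return words

def number_to_words(n):
    """Convert 0 <= n < 10**12 to words (assumed range for this sketch)."""
    if not 0 <= n < 10**12:
        raise ValueError("number out of supported range")
    if n == 0:
        return "Zero"
    groups = []
    scale = 0
    while n:
        n, chunk = divmod(n, 1000)  # peel off the lowest three digits
        if chunk:
            suffix = [SCALES[scale]] if scale else []
            groups.append(below_thousand(chunk) + suffix)
        scale += 1
    # Groups were collected lowest-first, so reverse them for output.
    return " ".join(word for group in reversed(groups) for word in group)

if __name__ == "__main__":
    for value in (2, 1024, 1048576):
        print(value, "->", number_to_words(value))
```

The three examples from the question come out in the same format as listed above.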
We have a SaaS application requirement to allow a user responsible for building a CMS site to define up to 10 custom fields in a form.
As part of this field definition we want to add a field validation option, which we will store (and apply at runtime) as a regex.
Are there any tools, code samples or similar that offer a wizard style f...
I'm looking for a library (preferably in PHP) that can extract weight/height data from a string.
I want my users to input something like "I weigh 80 k and I'm 1.8m tall" or even "220 lb" and "6' 1"", and pass it through a function that can extract the quantity and the unit.
Anyone know if there's something like that out there?
...
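To make the idea concrete, here is a rough regex-based sketch. It is written in Python rather than PHP and is not an existing library; the function name `extract_measurements` and the list of recognised unit spellings are assumptions chosen to cover the three inputs quoted in the question.

```python
import re

# Assumption: only these unit spellings are recognised; extend as needed.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|k|lbs|lb|cm|m)\b", re.IGNORECASE)
# Feet and inches written like 6' 1" (whitespace between the parts is optional).
FEET_INCHES = re.compile(r"(\d+)\s*'\s*(\d+)\s*\"")

def extract_measurements(text):
    """Return a list of (quantity, unit) tuples found in text."""
    results = []
    for feet, inches in FEET_INCHES.findall(text):
        # Feet/inches are normalised to total inches in this sketch.
        results.append((int(feet) * 12 + int(inches), "in"))
    for value, unit in QUANTITY.findall(text):
        results.append((float(value), unit.lower()))
    return results

if __name__ == "__main__":
    for sample in ("I weigh 80 k and I'm 1.8m tall", "220 lb", "6' 1\""):
        print(sample, "->", extract_measurements(sample))
```

The sketch only reports what the user typed. Deciding whether "k" means kilograms, or whether "m" is a height, would need extra rules on top.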
Hello,
My final-year project is "AUTOMATIC TEXT SUMMARIZER (linguistic approach)". I have collected enough research papers and gone through them, but I am still not clear on how to go about it. I also looked at "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found that it is much easier compared to my...
I need to find a fairly efficient way to detect syllables in a word. E.g.,
invisible -> in-vi-sib-le
There are some syllabification rules that could be used:
V
CV
VC
CVC
CCV
CCCV
CVCC
*where V is a vowel and C is a consonant.
e.g.,
pronunciation (5 syllables: pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
I've tried a few methods, among which were using...
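The C/V notation above can be sketched in Python as follows. This does not find syllable boundaries by itself; it only maps letters to C or V and checks whether a candidate hyphenated split uses nothing but the patterns listed in the question. The helper names are invented, and treating only a, e, i, o, u as vowels (so 'y' counts as a consonant) is an assumption.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant
ALLOWED = {"V", "CV", "VC", "CVC", "CCV", "CCCV", "CVCC"}  # patterns from the question

def cv_pattern(syllable):
    """Map each letter to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in syllable.lower() if ch.isalpha())

def split_pattern(hyphenated):
    """Return the C/V pattern of each syllable in a hyphenated word."""
    return "-".join(cv_pattern(part) for part in hyphenated.split("-"))

def uses_allowed_patterns(hyphenated):
    """True if every syllable of the candidate split matches a listed pattern."""
    return all(cv_pattern(part) in ALLOWED for part in hyphenated.split("-"))

if __name__ == "__main__":
    print(split_pattern("in-vi-sib-le"))          # VC-CV-CVC-CV
    print(uses_allowed_patterns("in-vi-sib-le"))  # True
```

A checker like this could be used to filter candidate splits produced by some other method.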
Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being en...
Duplicate: Do you use another language instead of english
Related: Coding in other (spoken) languages
Given a programming language (such as Java) that allows you to use non-ASCII identifiers (class, method, variable names), and an application written for non-English users, by developers who speak English only as a foreign language at v...
I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality.
There are 1,000,000 words in the English language including foreign and/or technical words.
Can you please suggest such a source (or close to 500k w...
I want to be able to let users enter dates (including recurring dates) using natural language (eg "next friday", "every weekday"). Much like the examples at http://todoist.com/Help/timeInsert
I found this post, but it's a bit old and offered only one solution that I'm not entirely content with. I thought I'd resurrect this question and ...
I have the following string (Japanese) "　ユーザー名". The first character looks like whitespace, but its Unicode code point is 12288 (U+3000), so if I do "　ユーザー名".trim() I get the same string back (trim doesn't work).
If I do the trim in C++ it works fine.
Does anyone know how to solve this issue in Java?
Is there a special trim method for Unicode?
...
Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.
Here's my experience with BreakIterator:
Using the example here:
I have the following Japanese:
今日はパソコンを買った。高性能のマックは早...
So recently in the Rails literature the non-word "dasherize" has become something of a de facto term (please, no downgrades: I know "non-word" is a non-word, but I'm not publishing this stuff and I don't claim to be more intelligent than those who write books :) ), as in:
"to_xml will default to dasherizing the field names"
Now in every ot...
I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me.
So, a "smart" solution would be to speak a computer language. :)
I am interested in how far a programming language can go to get near to (English) natural language. When I say near, I mea...
I provided some of my programs with a feedback function. Unfortunately I forgot to include some sort of spam protection, so users could send anything they wanted to my server, where every piece of feedback is stored in a huge database.
In the beginning I periodically checked the feedback - I filtered out what was usable and deleted the garbage. Th...