Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, whic...
Hello friends! I am designing an automatic text summarizer. One of the major modules in this project requires a training corpus. Can someone please help me out by providing a training corpus or referring me to a link where I can download one? Thanks in anticipation.
...
I'm curious if anyone understands, knows or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. However, if you know of any other application that can do the same please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular...
I have a text file, and each line is of the form:
TAB WORD TAB PoS TAB FREQ#
Word PoS Freq
the Det 61847
of Prep 29391
and Conj 26817
a Det 21626
in Prep 18214
to Inf 16284
it Pron 10875
is Verb 9982
to Prep 9343
was Verb 9236
I Pron 8875
for Prep 8412
that Conj 7308
you Pron 6954
Would one of you regex wizards kindly assist me in is...
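The question is cut off above, so the exact goal is not visible. As a rough sketch only, assuming the aim is to pull the word, part-of-speech tag, and frequency out of each line, something like the following might do. The names `LINE_RE`, `parse_line`, and `parse_lines` are made up for illustration, and the pattern assumes the three fields are separated by tabs or spaces and that the count is all digits.

```python
import re

# Assumed layout: optional leading whitespace (the leading TAB), then
# word, PoS tag, and an all-digit count, separated by tabs or spaces.
LINE_RE = re.compile(r"^\s*(?P<word>\S+)\s+(?P<pos>\S+)\s+(?P<freq>\d+)\s*$")


def parse_line(line):
    """Return (word, pos, freq) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # e.g. the "Word PoS Freq" header row
    return m.group("word"), m.group("pos"), int(m.group("freq"))


def parse_lines(lines):
    """Parse an iterable of lines, silently skipping non-matching ones."""
    return [rec for rec in map(parse_line, lines) if rec is not None]
```

Because the header row has no numeric count it is skipped, and the two separate entries for "to" (Inf and Prep) are kept as distinct records.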
Hey there,
I'm trying to find a free downloadable dictionary (or corpus might be the better word) which I can import into MySQL. I need the words to have their type (noun, verb, adjective) associated with them. Any tips on where I can find one? I found one several years ago that worked nicely, but I no longer have it around.
Thanks!
Chris...
I'm trying to filter names out of text blobs. Currently I'm just generating a word list and filtering it by hand, but I've got ~8k words to go, so I'm looking for a better way. I could grab a dictionary and filter them out, but that would cull names like smith and cliff.
What I need is either of the following:
a list of common names (I'...
I'm writing some code that calculates certain statistics about word usage.
Does anyone know where I can find a database of raw news articles from various topics over a period of (say) the last year? Preferably they would be either in plain text format or XML. Trying to scrape content from random web sites isn't a good option.
I kno...
Hi:
I am looking for a simple Russian-to-English word corpus. It can be as simple as a CSV that lists a Russian word in the first column and the equivalent English word in the second. Any ideas where I can find such a thing? Does NLTK have something like this?
Thanks
...
I have a collection of parse trees, and they are in an ASCII representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like Python's significant whitespace vs. braces. The input format is a vertica...
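The input format is truncated above, so this is only a sketch under assumptions: one node label per line, children indented deeper than their parent using spaces (not tabs), and leaves written as bare tokens. The names `parse_indented` and `to_sexpr` are hypothetical. The idea is to keep a stack of open nodes keyed by indentation width and pop it whenever a line is indented no deeper than the node on top.

```python
def parse_indented(text):
    """Build a list of (label, children) nodes from indentation-structured text."""
    roots = []
    # Each stack entry is (indent_width, children_list_of_that_node).
    stack = [(-1, roots)]
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        # Close every open node that cannot be this line's parent.
        while stack[-1][0] >= indent:
            stack.pop()
        node = (raw.strip(), [])
        stack[-1][1].append(node)
        stack.append((indent, node[1]))
    return roots


def to_sexpr(node):
    """Render a (label, children) node with explicit parentheses."""
    label, children = node
    if not children:
        return label  # leaves are emitted as bare tokens
    return "(" + label + " " + " ".join(to_sexpr(c) for c in children) + ")"


example = "S\n  NP\n    DT\n      the\n    NN\n      cat\n  VP\n    VBD\n      sat"
print(to_sexpr(parse_indented(example)[0]))
# prints: (S (NP (DT the) (NN cat)) (VP (VBD sat)))
```

If the real files put several tokens on one line or indent with tabs, the `indent` calculation and the label handling would need adjusting.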
I know this is a long shot, but does anyone know of a dataset of English words that has stress information by syllable? Something as simple as the following would be fantastic:
AARD vark
A ble
a BOUT
ac COUNT
AC id
ad DIC tion
ad VERT ise ment
...
Thanks in advance!
...
Does anybody know of any data on the frequency of the types of mistakes people make when they misspell a word? I'm not referring to the words themselves, but the errors that are made by the typist. For example, I personally make transposition errors the most, followed by deletion errors (that is, not including a letter I sh...
Hi,
I am looking for a corpus of text to run some trial full-text-style data searches across. Either something I can download, or a system that generates it. Something a bit more random would be better, e.g. 1,000,000 Wikipedia articles in a format easy to insert into a two-column database (id, text).
Any ideas or suggestions?
...
Hey guys,
I am looking for a free tagged corpus for a system to train on for Named Entity Recognition. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?
...
I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I ca...
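The asker's own code is not shown in this excerpt, so the following is not a fix for it, just a baseline sketch of the task as described. It assumes the documents are already in memory as `(date, text)` pairs and that matching should be whole-word and case-insensitive; the function name `term_counts` is made up. Compiling the pattern once, outside the loop over documents, is a common way to avoid repeated work.

```python
import re


def term_counts(documents, term):
    """Return [(date, count), ...] for an iterable of (date, text) pairs."""
    # Compile once; \b limits matches to whole words (assumes the term
    # starts and ends with a word character).
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(date, len(pattern.findall(text))) for date, text in documents]


docs = [
    ("2009-01-01", "The cat saw the other cat."),
    ("2009-01-02", "No felines here; concatenate."),
]
print(term_counts(docs, "cat"))
# prints: [('2009-01-01', 2), ('2009-01-02', 0)]
```

If substring matches should count as well (for example "cat" inside "concatenate"), dropping the `\b` anchors or using `text.lower().count(term.lower())` would be the simpler choice.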