corpus

NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of Usenet movie reviews, whic...

CORPUS resource

Hello friends! I am designing an automatic text summarizer. One of the major modules in this project requires a training corpus. Can someone please help me out by providing a training corpus or pointing me to a link to download one? Thanks in anticipation ...

How was the Google Books' Popular passages feature developed?

I'm curious whether anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their popular passage blocks feature. If you know of any other application that can do the same, please post your answer too. If you do not know what I am writing about, here is a link to an example of Popular...

Assistance with Find and Replace Regex

I have a text file, and each line is of the form: TAB WORD TAB PoS TAB FREQ#
    Word  PoS   Freq
    the   Det   61847
    of    Prep  29391
    and   Conj  26817
    a     Det   21626
    in    Prep  18214
    to    Inf   16284
    it    Pron  10875
    is    Verb  9982
    to    Prep  9343
    was   Verb  9236
    I     Pron  8875
    for   Prep  8412
    that  Conj  7308
    you   Pron  6954
Would one of you regex wizards kindly assist me in is...
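The excerpt cuts off before saying what the replacement should do, so the following is only a minimal Python sketch of the parsing step, assuming each data line really is a tab, a word, a tab, a part-of-speech tag, a tab, and an integer frequency. The names `LINE_RE` and `parse_line` are made up for illustration.

```python
import re

# Assumed line shape: TAB, word, TAB, part-of-speech tag, TAB, integer frequency.
LINE_RE = re.compile(r"^\t(?P<word>\S+)\t(?P<pos>\S+)\t(?P<freq>\d+)\s*$")


def parse_line(line):
    """Return (word, pos, freq) for a matching line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("word"), match.group("pos"), int(match.group("freq"))


# A header row such as "Word PoS Freq" has no numeric frequency, so it is skipped.
sample = "\tWord\tPoS\tFreq\n\tthe\tDet\t61847\n\tof\tPrep\t29391\n"
rows = [parsed for parsed in map(parse_line, sample.splitlines()) if parsed]
```

Named groups keep the three fields easy to reuse in whatever find-and-replace the question was actually after.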

Need free English dictionary or Corpus, ultimately for a MySQL database

Hey there, I'm trying to find a free downloadable dictionary (or corpus might be the better word) which I can import into MySQL. I need the words to have their type (noun, verb, adjective) associated with them. Any tips on where I can find one? I found one several years ago that worked nicely, but I no longer have it around. Thanks! Chris...

List of proper names?

I'm trying to filter names out of text blobs. Currently I'm just generating a word list and filtering it by hand, but I've got ~8k words to go, so I'm looking for a better way. I could grab a dictionary and filter out everything in it, but that would cull names like smith and cliff. What I need is either of the following: a list of common names (I'...

Where can I get raw news articles from the last year?

I'm writing some code that calculates certain statistics about word usage. Does anyone know where I can find a database of raw news articles on various topics covering (say) the last year? Preferably they would be in either plain text or XML. Trying to scrape content from random web sites isn't a good option. I kno...

Russian-to-English Parallel Word Corpus?

Hi, I am looking for a simple Russian-to-English word corpus. It can be as simple as a CSV that lists a Russian word in the first column and the equivalent English word in the second. Any ideas where I can find such a thing? Does NLTK have something like this? Thanks ...

converting a treebank of vertical trees to s-expressions

I have a collection of parse trees in an ASCII representation where indentation determines the structure (and closing brackets are implicit). I need to convert them to s-expressions so that parentheses determine the structure. It's a little bit like Python's significant whitespace vs. braces. The input format is a vertica...
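Since the excerpt is cut off before the input format is shown, here is only a rough Python sketch of the general idea, under a hypothetical format: one node label per line, indented with spaces, where a deeper indent means "child of the nearest shallower line above". The function names are invented for this example.

```python
def indented_to_sexpr(text):
    """Convert an indentation-structured tree to an s-expression string."""
    root_children = []
    # Stack of (indent, children_list); a virtual root sits at indent -1.
    stack = [(-1, root_children)]
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip(" "))
        node = (line.strip(), [])
        # Pop back up to the nearest ancestor with a smaller indent.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((indent, node[1]))
    return " ".join(render(node) for node in root_children)


def render(node):
    """Leaves print as bare labels; internal nodes get parentheses."""
    label, children = node
    if not children:
        return label
    return "(" + " ".join([label] + [render(child) for child in children]) + ")"


tree = """S
  NP
    DT
      the
    NN
      cat
  VP
    VBD
      sat
"""
result = indented_to_sexpr(tree)
```

The stack holds the chain of open ancestors, so a line's parent is always whatever remains on top after popping every entry that is indented at least as far. A real treebank with tabs or with several tokens per line would need a different indent measure and label split.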

Corpus/data set of English words with syllabic stress information?

I know this is a long shot, but does anyone know of a dataset of English words that has stress information by syllable? Something as simple as the following would be fantastic: AARD vark A ble a BOUT ac COUNT AC id ad DIC tion ad VERT ise ment ... Thanks in advance! ...

Data on the Frequency of Edit Operations Required to Correct a Misspelt Word

Does anybody know of any data on the frequency of the types of mistakes that people make when they misspell a word? I'm not referring to the words themselves, but to the errors made by the typist. For example, I personally make transposition errors the most, followed by deletion errors (that is, not including a letter I sh...
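Not an answer to where such data lives, but if you have your own list of (typed, intended) pairs, a tally by error type could be sketched roughly as below. This assumes each misspelling is exactly one edit away from the intended word; anything else returns `None`. The sample pairs are invented for illustration and are not real frequency data.

```python
from collections import Counter


def classify_single_edit(typed, intended):
    """Name the single typist error that produced `typed`, or return None."""
    if typed == intended:
        return None
    # Strip the common prefix, then the common suffix, to isolate the difference.
    start = 0
    while start < min(len(typed), len(intended)) and typed[start] == intended[start]:
        start += 1
    end_t, end_i = len(typed), len(intended)
    while end_t > start and end_i > start and typed[end_t - 1] == intended[end_i - 1]:
        end_t -= 1
        end_i -= 1
    diff_t, diff_i = typed[start:end_t], intended[start:end_i]
    if len(diff_t) == 0 and len(diff_i) == 1:
        return "deletion"  # a letter was left out
    if len(diff_t) == 1 and len(diff_i) == 0:
        return "insertion"  # an extra letter was typed
    if len(diff_t) == 1 and len(diff_i) == 1:
        return "substitution"  # a wrong letter was typed
    if len(diff_t) == 2 and diff_t == diff_i[::-1]:
        return "transposition"  # two adjacent letters were swapped
    return None  # more than one edit apart


# Invented example pairs: (what was typed, what was meant).
pairs = [
    ("teh", "the"),
    ("acount", "account"),
    ("helllo", "hello"),
    ("wprd", "word"),
    ("recieve", "receive"),
]
tally = Counter(classify_single_edit(typed, intended) for typed, intended in pairs)
```

The four categories are the usual Damerau-style edit operations; errors that are two or more edits apart would need a full edit-distance alignment instead.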

Looking for data set to test FULLTEXT style searches on

Hi, I am looking for a corpus of text to run some trial fulltext-style searches across. Either something I can download, or a system that generates it. Something a bit more random would be better, e.g. 1,000,000 Wikipedia articles in a format easy to insert into a two-column database (id, text). Any ideas or suggestions? ...

Free Tagged Corpus for Named Entity Recognition

Hey guys, I am looking for a free tagged corpus that a system can train on for Named Entity Recognition. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help? ...

How do I count words in an nltk plaintextcorpus faster?

I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I ca...
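The asker's code is not shown in this excerpt, so the following is not a rewrite of it, just a plain-Python sketch of the stated goal: one (date, count) tuple per document for a given search term. It assumes the documents are already available as (date, text) pairs and uses a simple hypothetical tokenizer rather than NLTK's.

```python
import re

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def term_counts_by_date(documents, term):
    """Return [(date, count), ...] for `term` over (date, text) pairs."""
    term = term.lower()
    results = []
    for date, text in documents:
        # Tokenize each document once and count matches in a single pass.
        count = sum(1 for token in TOKEN_RE.findall(text.lower()) if token == term)
        results.append((date, count))
    return results


docs = [
    ("2010-01-01", "The corpus is small. The CORPUS grows."),
    ("2010-01-02", "No match here."),
]
counts = term_counts_by_date(docs, "corpus")
```

Whether this is faster than the original depends on what that code does; the main point is that each document's text is read and tokenized only once per query.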