I am writing a program that does some text analysis, and I am looking for alternative sources of raw text, similar to what the Wikipedia dumps provide (download.wikimedia.com).

I'd rather not have to crawl websites, parse the HTML, extract the text, and so on.

+6  A: 

What sort of text are you looking for?

There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.

They also have large DVD images full of books available for download.
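Project Gutenberg .txt files usually wrap the book in a header and a license footer, which you probably don't want in your analysis. Here is a rough sketch of stripping them, assuming the file uses the common "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." marker lines. The exact wording varies between files, and the function name strip_gutenberg_boilerplate is just made up for this example.

```python
import re

# Marker lines that typically surround the book body in Project Gutenberg
# plain-text files. The wording varies ("THE" vs "THIS"), so match loosely.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers.

    Hypothetical helper: if a marker is missing, that side is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Tiny made-up sample standing in for a downloaded .txt file.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\n"
    "It was a dark and stormy night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)

body = strip_gutenberg_boilerplate(sample)
print(body)
```

For a real file you would read it with open(path, encoding="utf-8") first and pass the contents in. Older files may use different markers, so check a few by hand before trusting the result.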

Blorgbeard
+1 I came here to post PG.
Joe
A: 

Project Gutenberg has a huge number of e-books in various formats (including plain text).

Nikolaus Gradwohl
+3  A: 

NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.

>>> # If the corpus data is not installed yet, run nltk.download('brown') once.
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
Chris S