Does anyone know where I can find a huge repository of sample documents on a variety of subjects? I'm looking for at least a few thousand documents (Office or PDF should be fine) in order to test some algorithms. The documents should have some common ground: for example, a thousand docs related to programming, another thousand related to ecology, etc.

Anyone know where I can get it?

A: 

On the internetzzz?

Edit: Me? Not being helpful? :)

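import os
import urllib.request

import mechanize  # third-party package: pip install mechanize

# Google search for PDFs; %s is the result offset (0, 10, 20, ...)
template = r"http://www.google.com/search?q=filetype:pdf&hl=en&start=%s&sa=N"
links = []

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Firefox')]
# Walk the first three result pages and collect links that end in "pdf"
for i in range(0, 30, 10):
    br.open(template % i)
    links.extend(link.url for link in br.links(url_regex="^http.+pdf$"))
# Save each PDF in the current directory under the base name of its URL
for url in links:
    urllib.request.urlretrieve(url, os.path.basename(url))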
wuub
You expect him to download a thousand of those manually? -1
musicfreak
Yep, or write a simple script to do it for him.
wuub
+4  A: 

Have you tried using wikipedia? Create a script that:

  1. Calls http://en.wikipedia.org/wiki/Special:Random to get a random page

  2. Follows the resulting redirect, appending ?printable=yes to the end of the URL (to strip the layout cruft). Use wget or an equivalent for these two steps; it will follow the redirect for you.

  3. Pipes the resulting HTML content through an HTML-to-PDF converter.

  4. Repeat 1000 times.

That should get you a wide variety of content.
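For what it's worth, steps 1, 2 and 4 could look roughly like the sketch below, using only the Python standard library instead of wget. This is an illustration under assumptions, not a tested crawler: the names printable_url, fetch and collect, the User-Agent string, the output file names and the one-second pause are all made up for the example. Step 3 is left out, since the converter is whatever tool you prefer; the sketch just saves the HTML to disk.

```python
import time
import urllib.parse
import urllib.request

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
# Hypothetical User-Agent; put your own contact details here.
USER_AGENT = "sample-corpus-sketch/0.1 (contact: you@example.com)"


def printable_url(article_url):
    """Return article_url with printable=yes added to its query string."""
    parts = urllib.parse.urlsplit(article_url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    query.append(("printable", "yes"))
    new_query = urllib.parse.urlencode(query)
    return urllib.parse.urlunsplit(parts._replace(query=new_query))


def fetch(url):
    """Fetch url, following redirects; return (final_url, body_bytes)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.geturl(), response.read()


def collect(count, pause_seconds=1.0):
    """Save `count` random articles as article_0000.html, article_0001.html, ..."""
    for i in range(count):
        # Steps 1 and 2: Special:Random redirects to a random article.
        article_url, _ = fetch(RANDOM_URL)
        _, html = fetch(printable_url(article_url))
        with open("article_%04d.html" % i, "wb") as out:
            out.write(html)
        # Small pause between iterations to be kind to the servers.
        time.sleep(pause_seconds)
```

Calling collect(1000) would cover step 4. The saved files can then be fed to an HTML-to-PDF converter in a separate pass.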

Kazar
I'm sure it will make a few friends at Wikipedia's data centres as well :p
Nissan Fan
That kind of traffic is nothing to them. Add a small pause between requests to be kind (and to stop them blocking you for attempting a DDoS), and there will be no problems.
Kazar
Seriously. They get thousands (if not tens or hundreds of thousands) of hits per second. They won't notice in the slightest.
Daniel Straight
A: 

You can just use the advanced search on the Yahoo Search API to specify the document type you are looking for.

http://developer.yahoo.com/search/boss/boss_guide/Web_Search.html#optional_args_web

If you want a large number of Word documents, specify that document type and then run searches on some pre-selected keywords. That should give you a bunch of documents back.

You could also scrape Google's advanced search and grab the document links that way by specifying a filetype (picked from a random list), e.g.:

http://www.google.co.in/search?q=monkey+badger+filetype%3Apdf
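To get the "common ground" the question asks for, the query URLs could be generated per subject. The sketch below only builds the URLs in the same shape as the example above; it does not fetch or scrape anything. The function name search_url, the TOPICS keyword lists and the FILETYPES list are assumptions made up for the illustration.

```python
import random
import urllib.parse

SEARCH_BASE = "http://www.google.co.in/search"

# Hypothetical keyword lists, one per subject.
TOPICS = {
    "programming": ["compiler", "algorithm"],
    "ecology": ["ecosystem", "biodiversity"],
}

# Hypothetical list of document types to pick from at random.
FILETYPES = ["pdf", "doc", "ppt"]


def search_url(keywords, filetype):
    """Build a search URL for the keywords, restricted to one file type."""
    query = " ".join(list(keywords) + ["filetype:" + filetype])
    return SEARCH_BASE + "?" + urllib.parse.urlencode({"q": query})


def topic_urls(topics=TOPICS, filetypes=FILETYPES):
    """Return one search URL per topic, each with a randomly chosen file type."""
    return {
        name: search_url(words, random.choice(filetypes))
        for name, words in topics.items()
    }
```

Here search_url(["monkey", "badger"], "pdf") reproduces the URL above. Documents found through one topic's URL can then be filed under that topic.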

Jon