For testing purposes I need to create sets of text files that have similar but not identical text. Each set needs to be different from the other sets but also share some commonality with them.

For example, I may need to create 10 sets of 20 documents each for a total of 200 documents. Each document needs about 250 words in it.

If one set's documents are about dogs, then the other sets' documents should be about other animals, so that there is a weak link between the sets (in this case, animals) and a strong link between the documents within a set (dogs in one set, cats in another).

The words in the documents do not need to be in any particular order, nor do they need to be in sentences or make sense.

Does anybody know how I can generate or obtain this type of data for my unit tests?

+3  A: 

How about grabbing some text from Project Gutenberg?

Doug Currie
Great idea, Doug, thanks. I've just been looking on the web and am now trying to work out how to find a collection of books that are about the same subject.
Guy
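
Since the question says the words need no particular order and need not make sense, one alternative to hunting for themed books is to sample words from small vocabularies: one pool shared by every set (the weak link) and one pool per set (the strong link). This is only a rough sketch, not what the answer above proposes. The function names, the `shared_ratio` parameter, and the example word lists are all made up for illustration; real vocabularies could be pulled from Project Gutenberg texts or any word list.

```python
import random
from pathlib import Path


def make_document_sets(shared_words, set_words, docs_per_set=20,
                       words_per_doc=250, shared_ratio=0.3, seed=0):
    """Return {topic: [document, ...]} built by random word sampling.

    shared_words: vocabulary common to all sets (the weak link).
    set_words:    {topic: vocabulary specific to that set} (the strong link).
    shared_ratio: approximate fraction of words drawn from shared_words.
    All of these names are hypothetical, chosen for this sketch.
    """
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    sets = {}
    for topic, topic_words in set_words.items():
        docs = []
        for _ in range(docs_per_set):
            words = [
                rng.choice(shared_words) if rng.random() < shared_ratio
                else rng.choice(topic_words)
                for _ in range(words_per_doc)
            ]
            docs.append(" ".join(words))
        sets[topic] = docs
    return sets


def write_sets(sets, out_dir):
    """Write each document to out_dir/<topic>/doc_NNN.txt."""
    for topic, docs in sets.items():
        topic_dir = Path(out_dir) / topic
        topic_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(docs):
            (topic_dir / f"doc_{i:03d}.txt").write_text(text, encoding="utf-8")


# Example vocabularies (assumed, not from the question)
ANIMAL_WORDS = ["animal", "fur", "tail", "paw", "vet", "pet"]
TOPIC_WORDS = {
    "dogs": ["dog", "bark", "leash", "fetch", "puppy"],
    "cats": ["cat", "meow", "purr", "whiskers", "kitten"],
}
```

Calling `write_sets(make_document_sets(ANIMAL_WORDS, TOPIC_WORDS), "testdata")` would write 2 sets of 20 files with 250 words each; adding more topics to `TOPIC_WORDS` gives more sets. Raising `shared_ratio` makes the sets look more alike, and lowering it makes them more distinct.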