I want to do a bit of lightweight testing and benchmarking for full-text search, so the dataset should have the following qualities:
- 10,000 to 100,000 records.
- A good dispersion of English words.
- In CSV or Excel format, i.e. a file I can download; I don't want to access it via an API.
Something like books or movies with title and description fields would be perfect. I browsed the UCI Machine Learning Repository, but its datasets were too number-oriented. Thanks!