I am looking for large web page datasets for information retrieval and text processing research.

The pages must be in English. News site archives or any other sites with textual content would work, as long as the dataset is no more than 1 GB in size.

Do you know of any good datasets?

A: 

Wikipedia is a popular research corpus for many tasks: it is large, it is free, it is internally linked, and it is semantically annotated to some extent (via infoboxes). The entire Wikipedia can be downloaded for offline use. If you want something smaller than 1 GB, you can always strip out what you don't need.
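
If you go that route, the dumps at https://dumps.wikimedia.org/enwiki/ can be streamed instead of loaded into memory. A minimal Python sketch (the local dump path and the 1 GB cap are assumptions; the {*} namespace wildcard needs Python 3.8+):

    import bz2
    import xml.etree.ElementTree as ET

    # Assumed local path to an English Wikipedia dump downloaded from
    # https://dumps.wikimedia.org/enwiki/
    DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"
    SIZE_LIMIT = 1 * 1024**3  # stop after roughly 1 GB of extracted wikitext

    def iter_articles(path):
        """Stream (title, wikitext) pairs from a MediaWiki XML dump."""
        with bz2.open(path, "rb") as f:
            # iterparse keeps memory bounded as long as finished pages are cleared
            for _, elem in ET.iterparse(f, events=("end",)):
                if elem.tag.rsplit("}", 1)[-1] == "page":  # ignore the XML namespace
                    title = elem.findtext(".//{*}title") or ""
                    text = elem.findtext(".//{*}text") or ""
                    yield title, text
                    elem.clear()

    total = 0
    for title, text in iter_articles(DUMP_PATH):
        total += len(text.encode("utf-8"))
        # ... feed `text` into your own processing pipeline here ...
        if total >= SIZE_LIMIT:
            break

Note that what you get is raw wiki markup; stripping templates and links is a separate step (a parser such as mwparserfromhell can help there).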

Otherwise, the simpler 20 Newsgroups corpus may be a good fit.
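
scikit-learn ships a loader that downloads and caches 20 Newsgroups for you; a minimal sketch:

    from sklearn.datasets import fetch_20newsgroups

    # Downloads and caches the corpus on first call.
    newsgroups = fetch_20newsgroups(
        subset="train",                           # or "test" / "all"
        remove=("headers", "footers", "quotes"),  # keep only the message bodies
    )

    print(len(newsgroups.data), "documents")
    print(newsgroups.target_names[:5])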

larsmans
A: 

Start by surveying what other people in that field are using. If you intend to publish your work and there is a standard dataset in the field, not using it might get your paper rejected; failing to compare adequately against existing work is a common reason for rejection.

André Caron