Hi, I'm working on a cross-language information retrieval system that takes queries in English and searches documents in Russian. To evaluate this system, it would be nice to have a collection of Russian documents to search through. Does anyone out there know of a collection of documents I can search, or websites from which I can easily scrape together a bunch of Russian documents (aside from Wikipedia)?

Documents can be about anything, though it would be nice if they covered some specific area of human knowledge (CS, architecture, engineering, art, literary analysis, whatever...).

+1  A: 

I don't know if this is what you're looking for, but here's a torrent of Russian national standards and laws. They are in dBase4 format, and there is approximately 57.3 GB of data.

Calvin
I'd rather have something that's in Unicode plain text, and I need 200-10000 documents.
If you are using a *nix system, you can use dbview (http://linux.maruhn.com/sec/dbview.html) to convert the dBase files to another format. On Windows, you can use ADODB: http://www.freevbcode.com/ShowCode.asp?ID=9055 http://www.vbcode.com/Asp/showsn.asp?theID=12507. Also, I think Excel can read dBase files, though the .db4 extension is not recognized by default.
Calvin
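
If neither of those tools is handy, here is a rough sketch of doing the dBase-to-Unicode conversion directly in Python, using only the standard library. It relies on the classic dBase III/IV table layout (32-byte header, 32-byte field descriptors terminated by 0x0D, fixed-width records with a one-byte deletion flag). The function names `read_dbf_records` and `dbf_to_text` are made up for illustration, and the `cp866` default is only a guess at the code page: I don't know which encoding the files in that torrent actually use, so try `cp1251` if the output looks wrong. Memo fields (type M), which live in a separate .dbt file, are not handled.

```python
import struct


def read_dbf_records(data, encoding="cp866"):
    """Yield each non-deleted record of a dBase III/IV table as a dict of str.

    `encoding` is an assumption; Russian data is often cp866 or cp1251.
    """
    # Header: record count (uint32), header size (uint16) and record size
    # (uint16), little-endian, stored at byte offsets 4, 8 and 10.
    n_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)

    # Field descriptors are 32 bytes each, start at offset 32,
    # and the list is terminated by a single 0x0D byte.
    fields = []
    pos = 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii")
        length = data[pos + 16]
        fields.append((name, length))
        pos += 32

    for i in range(n_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":  # '*' marks a deleted record
            continue
        row = {}
        offset = 1  # skip the deletion flag byte
        for name, length in fields:
            raw = record[offset:offset + length]
            row[name] = raw.decode(encoding, errors="replace").strip()
            offset += length
        yield row


def dbf_to_text(dbf_path, txt_path, encoding="cp866"):
    """Write one tab-separated UTF-8 line per record (hypothetical helper)."""
    with open(dbf_path, "rb") as f:
        data = f.read()
    with open(txt_path, "w", encoding="utf-8") as out:
        for row in read_dbf_records(data, encoding):
            out.write("\t".join(row.values()) + "\n")
```

Calling `dbf_to_text("input.dbf", "output.txt")` (placeholder file names) would give one line of UTF-8 text per record, which you could then split into however many documents you need.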