The arXiv e-print archive has several terabytes of papers from various fields of science. Some users would like to maintain a full copy of this data on their own computers, while others just want to download the most recent papers in a particular category. The goal is to reduce bandwidth load with some kind of distributed download system (e.g. BitTorrent). I'm looking for ideas for a program or set of programs that would cover all of this.

A: 

My first idea is that this looks an awful lot like Usenet newsgroups, with infinite persistence for messages on the servers. I don't know how well it works with PDFs, though.

sep332
+1  A: 

arXiv recommends Squid in httpd accelerator mode for precisely this purpose. Is there any particular reason why this is not good enough?

janneb
One of the people involved said "First of all it should be noted that because of the arXiv's robot policy, nothing like this is currently possible. In other words over 15 years of research, while accessible on a nibble basis, is not really accessible." The point of this new project is to allow the entire arXiv to be downloaded.
sep332
+1  A: 

The full PDF content is in the Amazon cloud.

While there are more than 600k papers on arXiv, the total size of the PDFs is less than 1/2 TB.

http://arxiv.org/help/bulk_data_s3
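
To put those two figures in perspective, here is a rough back-of-the-envelope sketch. It treats "1/2 TB" as 0.5 × 10^12 bytes and "600k" as 600,000 papers, and the 10 Mbit/s link speed is purely a hypothetical number picked for illustration, not something arXiv or Amazon states.

```python
def average_paper_size_mb(total_bytes: float, paper_count: int) -> float:
    """Mean size per paper in megabytes (1 MB = 10**6 bytes)."""
    return total_bytes / paper_count / 1e6


def transfer_days(total_bytes: float, link_mbit_per_s: float) -> float:
    """Days needed to move total_bytes over a link of the given speed."""
    seconds = total_bytes * 8 / (link_mbit_per_s * 1e6)
    return seconds / 86_400


# Bounds taken from the figures above: under 0.5 TB, over 600,000 papers.
TOTAL_BYTES_UPPER = 0.5e12      # assumption: decimal terabytes
PAPER_COUNT_LOWER = 600_000

# Hypothetical sustained download speed, for illustration only.
LINK_MBIT_PER_S = 10.0

avg_mb = average_paper_size_mb(TOTAL_BYTES_UPPER, PAPER_COUNT_LOWER)
days = transfer_days(TOTAL_BYTES_UPPER, LINK_MBIT_PER_S)

print(f"average PDF size: below about {avg_mb:.2f} MB")
print(f"full copy at {LINK_MBIT_PER_S:.0f} Mbit/s: under about {days:.1f} days")
```

So the average PDF comes out at well under 1 MB, and a full mirror of the PDFs is a matter of days rather than months on an ordinary connection, assuming the download is allowed to run uninterrupted.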

T.

thor