If one has a peer-to-peer system that can be queried, one would like to

  • reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together)
  • avoid excess storage at each node
  • assure good availability to even moderately rare items in the face of client downtime, hardware failure, and users leaving (possibly detecting rare items for archivists/historians)
  • avoid queries failing to find matches in the event of network partitions
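To make the availability requirement concrete: a back-of-the-envelope sketch (my own illustration, not from any particular system) of how the replication factor trades storage against availability. If each node is independently up with probability p, an item stored on r randomly chosen nodes is reachable unless all r replicas are down at once:

```python
def item_availability(p: float, r: int) -> float:
    """Probability an item is reachable when each of its r replicas
    is independently up with probability p."""
    return 1 - (1 - p) ** r

# Even with flaky peers that are up only 60% of the time,
# three replicas already give roughly 94% availability,
# at the cost of 3x total storage across the network.
print(item_availability(0.6, 3))
```

This is why "avoid excess storage" and "assure good availability" pull in opposite directions: the popular items pay for replication they don't need, and the rare items are exactly the ones most likely to be under-replicated.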

Given these requirements:

  1. Are there any standard approaches? If not, is there any respected, but experimental, research? I'm somewhat familiar with distribution schemes, but I haven't seen anything that really addresses learning for robustness.
  2. Am I missing any obvious criteria?
  3. Is anybody interested in working on/solving this problem? (If so, I'm happy to open-source part of a very lame simulator I threw together this weekend, and generally offer unhelpful advice).

@cdv: I've now watched the video and it is very good, and although I don't feel it quite gets to a pluggable distribution strategy, it's definitely 90% of the way there. The questions, however, highlight useful differences from this approach that address some of my further concerns, and give me some references to follow up on. Thus, I'm provisionally accepting your answer, although I consider the question open.

+1  A: 

If you have time, it would be worth checking out the Google tech talk that Wuala gave. They discuss many of these same problems, which they faced when building their peer-to-peer file system.

cdv
+1  A: 

There are multiple systems out there with various aspects of what you seek and each making different compromises, including but not limited to:

Amazon's Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

Kai: http://www.slideshare.net/takemaru/kai-an-open-source-implementation-of-amazons-dynamo-472179

Hadoop: http://hadoop.apache.org/core/docs/current/hdfs_design.html

Chord: http://pdos.csail.mit.edu/chord/

Beehive: http://www.cs.cornell.edu/People/egs/beehive/

and many others. After building a custom system along those lines, I let some of the building blocks out in open source form as well: http://code.google.com/p/distributerl/ (that's not a whole system, but a few libraries useful in building one)
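A common thread running through several of these systems (Dynamo, Kai, Chord) is consistent hashing with an N-way replica preference list. As a rough illustration of that shared building block, here is a minimal sketch; the class name, parameters, and defaults are all my own invention, not taken from any of the systems above:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a key to a position on the ring using a stable hash.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes and N-way replication."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.replicas = replicas
        # Each physical node appears `vnodes` times on the ring,
        # which smooths out load imbalance between nodes.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def preference_list(self, key):
        # Walk clockwise from the key's position, collecting the
        # first `replicas` distinct physical nodes.
        start = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        seen, out = set(), []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in seen:
                seen.add(node)
                out.append(node)
                if len(out) == self.replicas:
                    break
        return out
```

The key property for the tradeoffs discussed here: when a node joins or leaves, only the keys adjacent to its ring positions move, so churn (users leaving, hardware failure) stays cheap. What this sketch does not capture is the questioner's point about popularity: every key gets the same replica count, whereas the requirements above call for replicating popular and rare items differently (which is roughly where Beehive's proactive replication comes in).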

Justin Sheehy
This is a lot of good information, but it is almost too much. I can Google for distributed hash table as well as the next person. Could you say more about the tradeoffs? Thanks :-)
John the Statistician