views: 274

answers: 8
I was thinking about building a program that uses RAID-like (disk) algorithms: if one computer dies, the next steps in to take its place. It needs to scale from 1 to 1000 computers.

I need some advice.

What are the names of the algorithms I need to learn?

At one point I thought it was possible to build it on top of git.

A: 

BitTorrent? :)

Eric Petroelje
hehe. was it so obvious?
Flinkman
Removing the smiley seems a bit over the top...
Gavin Miller
Actually, you probably could build a distributed storage system over BitTorrent, so it was only partially a joke.
Eric Petroelje
Yeah, as I understand it, it's the trackers that need to be distributed, and everyone needs to share their IP for some time, so we can shelter the heroes that support the system the most.
Flinkman
+4  A: 

You may want to read this paper on the Google File System. From the abstract:

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

David Segonds
A: 

You could build something like memcached. Each hash entry could be a file block (e.g. mapping the SHA hash of a block to its contents).
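
A minimal sketch of that idea, using a plain in-memory map where memcached would sit; the BlockStore class and its method names are illustrative, not any particular library's API:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    // Content-addressed block store: the key is the SHA-256 hash of a
    // block, the value is the block itself. A real system would shard
    // this map across nodes (e.g. memcached) instead of one HashMap.
    public class BlockStore {
        private final Map<String, byte[]> blocks = new HashMap<>();

        public String put(byte[] block) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String key = toHex(md.digest(block));
            blocks.put(key, block.clone());
            return key; // the caller keeps the hash and uses it to fetch the block later
        }

        public byte[] get(String key) {
            return blocks.get(key);
        }

        private static String toHex(byte[] digest) {
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }

A nice side effect of content addressing is deduplication: identical blocks hash to the same key, so they are stored only once.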

Jeff Moser
+1  A: 

Distributed hash tables pop into my mind...
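
To give a flavor of how a DHT places data, here is a toy consistent-hashing ring (the core trick behind DHTs such as Chord or Dynamo); the SHA-1 hash and the single ring position per node are simplifications:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Keys and nodes are hashed onto the same circle; each key belongs
    // to the first node at or after its position (wrapping around).
    // Adding or removing a node only remaps the keys on its arc.
    public class HashRing {
        private final SortedMap<Long, String> ring = new TreeMap<>();

        public void addNode(String node) { ring.put(hash(node), node); }

        public void removeNode(String node) { ring.remove(hash(node)); }

        public String nodeFor(String key) {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("SHA-1")
                        .digest(s.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL); // first 8 digest bytes
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // SHA-1 is always available in the JDK
            }
        }
    }

Production DHTs add virtual nodes (many ring positions per machine) to balance load, and replicate each key to the next few nodes on the ring to survive failures.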

f3lix
A: 

Also check out the MapReduce algorithm. It's a relatively simple way of getting high scalability that doesn't force the algorithm designer to think about locking, communication, etc. There are several implementations available, for example the open-source Hadoop from the Apache Software Foundation.
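
The whole model fits in a few lines; here is word count with the map, shuffle, and reduce phases spelled out in a single JVM (Hadoop would run the map and reduce calls in parallel across nodes and do the shuffle for you):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordCount {
        public static void main(String[] args) {
            List<String> lines = List.of("to be or not to be", "to scale or not");

            // Map phase: each line emits (word, 1) pairs.
            // Shuffle phase: pairs are grouped by key.
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (String line : lines) {
                for (String word : line.split("\\s+")) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }

            // Reduce phase: sum the counts for each word.
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
                System.out.println(e.getKey() + "\t" + sum);
            }
        }
    }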

flodin
+1  A: 

Try Hazelcast. It has distributed implementations of Set, List, and more. Hazelcast is an open-source, transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock, and executor service. It is super easy to work with; just add hazelcast.jar to your classpath and start coding. Almost no configuration is required.
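
Getting a cluster-wide map can look roughly like this; treat it as a sketch, since the exact API has shifted between Hazelcast versions:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import java.util.Map;

    public class HazelcastDemo {
        public static void main(String[] args) {
            // Every JVM that runs this joins the same cluster (multicast
            // discovery by default); entries in the map are partitioned
            // and backed up across the members.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            Map<String, String> blocks = hz.getMap("blocks");
            blocks.put("block-1", "contents of block 1");
            System.out.println(blocks.get("block-1"));
        }
    }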

Hazelcast is released under Apache license and enterprise grade support is also available. Code is hosted at Google Code.

Talip Ozturk
+1  A: 

I've seen both Hadoop and the Google File System mentioned, but nobody has specifically mentioned HDFS - the distributed filesystem that comes with Hadoop. You can set the desired level of redundancy, and lose the occasional node without losing your data.
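
For instance, a client can ask for three copies of every block when writing a file. This is a sketch against the Hadoop Java API; the host, port, and path are placeholders, and dfs.replication can also be set cluster-wide in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder address
            conf.setInt("dfs.replication", 3); // keep 3 copies of every block

            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
                out.writeUTF("replicated across the cluster");
            }
        }
    }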

One caveat: You need to make sure the one machine that holds the "namenode" (the master machine and single point of failure in an HDFS cluster) is solid - RAID mirroring, backups, the works. You lose the namenode, you lose the cluster.

bradheintz
A: 

You might want to check out Appistry EAF. It's a distributed execution platform that handles all the failover of tasks for you, so you don't have to build that into your code: if one node fails, another node automatically takes over. And unlike grid solutions, there is no centralized controller, so you remove the single point of failure/bottleneck of those types of solutions.

There is a free download available for up to 5 machines.

Brett McCann