views: 60
answers: 3

What I am trying to do: I have hundreds of servers with very large log files spread across dozens of different clients. I am writing Python scripts to parse the logs in different ways, and I would like to aggregate the data I collect from all of the different servers. I would also like to keep the scripts, which change frequently, centralized.

The idea is to have a harness that can connect to each server, scp the current script over, run it with pexpect or something similar, and either scp the resulting data back as separate files to be aggregated or (preferably, I think) stream the data back and aggregate it on the fly. I do not have SSH keys set up (nor do I want to set them up), but I do have a database with connection information: logins, passwords, and the like.

My question: this seems like a problem that has probably already been solved, so I am wondering whether someone knows of an existing tool that does this kind of thing, or whether there is a solid, proven way of doing it.
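
A minimal sketch of the kind of harness described above, using Paramiko for password-based SSH (Paramiko itself, the script paths, and the get_connection_info() helper are placeholders and assumptions, not part of the existing setup):

    import paramiko

    def run_on_host(host, username, password, local_script,
                    remote_script="/tmp/parse_logs.py"):
        """Copy the parser script to one server, run it, and stream stdout back."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=username, password=password)
        try:
            # Push the current version of the script to the server.
            sftp = client.open_sftp()
            sftp.put(local_script, remote_script)
            sftp.close()

            # Run it and yield output lines as they arrive, so results can be
            # aggregated on the fly instead of copied back as files.
            stdin, stdout, stderr = client.exec_command("python " + remote_script)
            for line in stdout:
                yield line.rstrip("\n")
        finally:
            client.close()

    # get_connection_info() stands in for whatever reads hosts, logins and
    # passwords out of the connection database mentioned above.
    aggregated = []
    for host, username, password in get_connection_info():
        aggregated.extend(run_on_host(host, username, password, "parse_logs.py"))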

A: 

Parallel Python provides some functionality for distributed computing and communication:

http://www.parallelpython.com/
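
A rough sketch of how Parallel Python distributes a function, based on the examples on that site (each remote machine would need a ppserver.py instance running, so it does assume some setup on the servers; the hostnames and parsing logic here are placeholders):

    import pp

    def parse_chunk(lines):
        # Placeholder for the real log-parsing logic.
        return len(lines)

    # Addresses of ppserver.py instances already running on the remote machines.
    ppservers = ("server1:35000", "server2:35000")
    job_server = pp.Server(ppservers=ppservers)

    job = job_server.submit(parse_chunk, (["a log line"],))
    print(job())  # blocks until the job finishes and returns its result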

Dan Lorenc
These are not exactly clustered computers... I think the task I am trying to accomplish is much simpler than what Parallel Python is designed for.
Ichorus
+1  A: 

Take a look at Func. It's a framework for RPC-style communication with a large number of machines using Python. As a bonus, it comes with built-in TLS, so you don't need to layer it on top of SSH tunneling for security.
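
A rough sketch of what driving minions from the Func overlord looks like, going by the examples in Func's documentation (module and method names here are from memory, and each server has to be enrolled with certmaster first, so there is some per-server setup):

    import func.overlord.client as fc

    # "*" targets every enrolled minion; a hostname or glob narrows it down.
    client = fc.Client("*")

    # Run a shell command on each minion; results come back keyed per host.
    results = client.command.run("grep -c ERROR /var/log/app.log")
    for host, output in results.items():
        print(host, output)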

JimB
+2  A: 

Looks like Hadoop is your answer: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
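
That tutorial uses Hadoop Streaming, where the mapper and reducer are plain Python scripts that read stdin and write tab-separated key/value pairs. A log-oriented mapper would look roughly like this (the ERROR pattern is just an illustration):

    #!/usr/bin/env python
    # mapper.py -- Hadoop Streaming feeds log lines on stdin; this emits a
    # count of 1 for every error line, and the reducer sums the counts.
    import sys

    for line in sys.stdin:
        if "ERROR" in line:
            print("error\t1")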

Pyro is also good, but I am not sure whether you can automatically distribute scripts with it: http://pyro.sourceforge.net/features.html

Anurag Uniyal
Wow. Pyro looks powerful; I will definitely be digging into it more. Hadoop looks a bit heavyweight for what I am trying to accomplish. Thanks!
Ichorus