views: 184

answers: 4
I have a large dataset (c. 40 GB) that I want to use for some NLP (largely embarrassingly parallel) across a couple of computers in the lab, to which I do not have root access and on which I only have 1 GB of user space. I experimented with Hadoop, but of course that was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1 GB user-space cap. I have been looking into a couple of Python-based options (I'd rather use NLTK than Java's LingPipe if I can help it), and the distributed-compute options seem to be:

  • IPython
  • Disco

After my Hadoop experience, I am trying to make an informed choice -- any advice on which might be more appropriate would be greatly appreciated.

Amazon's EC2 etc. is not really an option, as I have next to no budget.
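For reference, my understanding of the IPython route is roughly the sketch below. This is a minimal sketch only, assuming the ipyparallel package (the successor to IPython.parallel) with engines already started on the lab machines via ipcluster; tag_chunk and the chunk paths are placeholders for whatever per-chunk NLP step actually runs.

    # Minimal sketch: farm out per-chunk NLP work with ipyparallel.
    # Assumes engines have been started on the lab machines (e.g. via
    # `ipcluster start`) and that each chunk file is reachable from every engine.
    from ipyparallel import Client

    def tag_chunk(path):
        # Placeholder per-chunk job: POS-tag every line of one chunk file.
        import nltk
        tagged = []
        with open(path) as f:
            for line in f:
                tagged.append(nltk.pos_tag(nltk.word_tokenize(line)))
        return tagged

    rc = Client()                     # connect to the running cluster
    view = rc.load_balanced_view()    # dynamic load balancing across engines
    chunk_paths = ["chunks/part_%04d.txt" % i for i in range(100)]  # hypothetical
    results = view.map_sync(tag_chunk, chunk_paths)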

+1  A: 

No actual answers here; I'd have left this as a comment, but on this site low-reputation users can only post answers.

If it's genuinely as parallel as that, and it's only a couple of computers, could you not split the dataset up manually ahead of time? (Rough sketch below.)

Have you confirmed that there isn't a firewall or something similar that would stop you from using a setup like that anyway?

You may only have 1 GB of user space, but if the machines run Linux, what about /tmp? (If Windows, what about %TEMP%?)
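As a rough illustration of the manual split, something like the following could stream the 40 GB file off the USB drive and cut it into pieces small enough to fit under the quota. It's a sketch only; the source path, output pattern, and the 200 MB chunk size are all assumptions, and the output directory is assumed to exist.

    # Sketch: split a huge newline-delimited corpus into fixed-size chunks
    # so each piece fits comfortably under a 1 GB per-user quota.
    CHUNK_BYTES = 200 * 1024 * 1024        # assumed chunk size (~200 MB)
    SRC = "/media/usbdrive/corpus.txt"     # hypothetical source on the USB drive
    DST = "chunks/part_%04d.txt"           # hypothetical output pattern

    def split_corpus(src=SRC, dst=DST, chunk_bytes=CHUNK_BYTES):
        idx, written = 0, 0
        out = open(dst % idx, "w")
        with open(src) as f:
            for line in f:                 # stream line by line, never load 40 GB
                if written >= chunk_bytes:
                    out.close()
                    idx, written = idx + 1, 0
                    out = open(dst % idx, "w")
                out.write(line)
                written += len(line)
        out.close()

    if __name__ == "__main__":
        split_corpus()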

frymaster
+3  A: 

Speak with the IT department at your school (especially if you are in college); if it's for an assignment or research, I bet they would be more than happy to give you more disk space.

swanson
+1  A: 

Definitely speak with the IT department at your school. It's not a good idea to use computer resources that don't belong to you.

I found JPPF, which lets applications with large processing-power requirements run across any number of computers. I'm not sure whether you need to install software on the client machines, but certain ports do need to be open on them.

Gilbert Le Blanc
A: 

If more resources from your computing department are a no-go, you're going to have to break your dataset down into manageable chunks before you do any work on it, and then reduce the per-chunk results into a meaningful whole.
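As a rough sketch of that reduce step, assuming each chunk job writes out a word-frequency count (the pickle files and their naming are hypothetical):

    # Sketch: merge per-chunk word-frequency results into one meaningful set.
    # Assumes each worker pickled a collections.Counter to results/part_NNNN.pkl.
    import glob
    import pickle
    from collections import Counter

    def reduce_counts(pattern="results/part_*.pkl"):
        total = Counter()
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as f:
                total.update(pickle.load(f))   # Counter.update merges counts
        return total

    if __name__ == "__main__":
        counts = reduce_counts()
        print(counts.most_common(20))          # e.g. the 20 most frequent tokens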

More resources from IT would be the way to go.

Good luck!

Ben

Ben Hughes