views: 295

answers: 4
I'm looking for suggestions on a new development system from some programmers who have more experience, but let me give some background on why.

  • We need a new development server for running scripts on large feeds
  • Speed is not a concern; it just needs to finish
  • The scripts are not run often, and are typically coded very quickly (not optimized) in Python or Perl

Current problem:

  • Larger feeds need to be processed, and existing scripts need to be refactored to handle them because of memory errors
  • The system running the scripts is a Win32 machine with 4GB RAM, and a single process will never be allowed more than 2GB of address space

Instead of my team spending time refactoring a rarely used script, I want to be able to throw more memory at it. I know a 64-bit upgrade will help here, but I'm not sure what type of environment is ideal for running scripts that need a lot of memory, so that is what I'm asking for suggestions on.

I've been looking into Solaris servers, FreeBSD servers, just using 64-bit Windows... It's hard to figure out what each system will be capable of once 64-bit versions of Python/Perl are installed and the scripts are actually running. If the system has 16GB of memory, when am I going to hit a memory error for a single process?

Some other things:

  • SSH to a remote server is an acceptable solution (probably ideal so we can have multiple users running scripts)
  • We have VMWare available, so that is another option if anyone has experience/comments about developing with a VMWare client

Any suggestions for a new system, or other things I should consider when deciding, would be great.

+1  A: 

Performance/memory-usage-wise, the operating system makes no big difference. Python works nicely on all of them in 64-bit mode. Unix support for 64-bit mode is a decade older than the Windows support, so there may be slightly fewer problems on Unix. However, support for large collections is fairly new on all systems, so if you expect to need truly large collections (rather than just a nesting of many, many small collections), be prepared for some debugging.

Also notice that on a 64-bit system, all pointers will double in size. So to do the same work that you do in 2GB on a 32-bit system, you need 4GB on a 64-bit system.
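
If you want to gauge this on your own machines, sys.getsizeof (available in Python 2.6 and later) reports per-object sizes; here is a minimal sketch, with the exact numbers depending on the build:

    # Run the same snippet on the 32-bit and 64-bit interpreters and
    # compare the output (the exact numbers vary by build).
    import struct, sys

    print "pointer size: %d bits" % (struct.calcsize("P") * 8)
    print "empty list:   %d bytes" % sys.getsizeof([])
    print "empty dict:   %d bytes" % sys.getsizeof({})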

The system will run out of memory only if the swap space is exhausted, so plan for plenty of swap space in case 16GB is not enough.

Martin v. Löwis
Bear in mind that (a) not everything is a pointer, so going from 32 to 64 bits will not double memory requirements, and (b) if there's not enough real memory, the script will start thrashing, and thrashing on a very large data set means not finishing in any reasonable time.
David Thornley
+3  A: 

"Instead of my team spending time refactoring a rarely used script,"

Clearly, the script is of considerable value, even if run rarely.

Often, small changes will yield big benefits.

Specifically,

  • If you decompose long sections of code to isolate the intermediate and temporary values, you permit more frequent garbage collection. Break up big functions into small functions. This is, perhaps, the hardest.

  • If you replace range with xrange, you can save yourself the creation of transient list objects. In some cases, you may find loops that can benefit from enumerate, replacing the range/xrange business. This is a quick grep.

  • If you reconsider any string concatenation operations and find ways to turn them into lists that you (eventually) merge with " ".join( listOfStrings ), you'll save yourself the creation of a lot of transient intermediate string objects of no real importance. This requires reading the code and doing some refactoring to find += operations among strings (see the sketch below).

You may be able to dramatically reduce memory consumption with only an hour or two of work.
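
A rough before/after sketch of the xrange/enumerate and join points, using a made-up feed-summarizing loop (Python 2):

    # Hypothetical feed-summarizing loop, before and after the quick fixes.

    def summarize_before(lines):
        report = ""
        for i in range(len(lines)):        # materializes a full list of indices
            report += "%d: %s\n" % (i, lines[i].strip())   # new string every pass
        return report

    def summarize_after(lines):
        parts = []
        for i, line in enumerate(lines):   # no intermediate index list
            parts.append("%d: %s" % (i, line.strip()))
        return "\n".join(parts) + "\n"     # one final string, no transient copies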

S.Lott
We are currently doing this, and you're right S.Lott... it doesn't take more than an hour or two. We have also used persistent objects where it makes sense, and split processes into batches where the results can be merged. We have the OK for a new system with lots of RAM though, and it's cheaper than paying developers for x hours of refactoring work. It's more like we will have this system for some scripts while refactoring others, and also working on new projects.
jcoon
Does anybody know how Python fares when using large amounts of virtual memory? What's its locality like, and how do you encourage it?
David Thornley
@coonj: Throwing memory at a bad design may not help much. It may just run longer before it crashes. You don't need to spend unlimited hours on rework. Just enough hours to get past the most egregious memory hog. Often, the memory hog has numerous other problems.
S.Lott
@David Thornley: Please post this as a separate question. When you do, please clarify your question to define "locality" and "encourage". [Encourage whom? To do what?]
S.Lott
+3  A: 

Simply install Ubuntu or Debian for amd64 on a suitable machine. It's easy (much easier than installing FreeBSD or, God forbid, OpenSolaris), pretty straightforward, and Perl and Python will be 64-bit out of the box as part of the default installation.
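
Once the box is up, a tiny sanity check confirms you really got a 64-bit interpreter (the same idea works for Perl via perl -V:archname):

    # Quick check that the installed Python is a 64-bit build.
    import struct
    print struct.calcsize("P") * 8, "bit pointers"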

wazoox
And you get a gazillion Perl / Python modules right in the distro.
Dirk Eddelbuettel
+2  A: 

I once faced a similar issue on 32-bit Linux (with 4 to 8 GB of RAM), albeit using a different scripting language (R). Considerable effort went into slicing and dicing the data into appropriate chunks so it wouldn't go belly-up. The effective limit of 3 GB per process was a real constraint for the data sets I was analyzing.
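
(The Python analogue of that chunking, for anyone in the same spot, is usually just a generator that hands the feed over in batches; a hypothetical sketch:)

    # Hypothetical sketch: feed the script fixed-size batches of lines
    # instead of reading the whole file into memory at once.
    def batches(path, size=100000):
        batch = []
        for line in open(path):
            batch.append(line)
            if len(batch) >= size:
                yield batch
                batch = []
        if batch:
            yield batch

    # for chunk in batches("feed.txt"):   # "feed.txt" and process() are made up
    #     process(chunk)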

Now, on 64-bit Linux (with 12 to 16 GB of RAM), life is considerably easier. No slicing, no dicing. It just fits.

So if your problem set is in that sweet spot, consider going to 64-bit in whatever form suits you best. And as wazoox mentioned, installing Debian or Ubuntu is a breeze, especially with the smaller, more focused 'server' flavors.

Dirk Eddelbuettel
This sounds like a pretty similar situation... good to know it worked out for you. Thanks.
jcoon