views: 423
answers: 10

I know the question title isn't the best. Let me explain.

I do a TON of text processing which converts natural language to XML. These text files get uploaded fairly quickly and thrown into a queue. From there they are pulled one-by-one into a background worker that calls our parser (using Boost.Spirit) to transform the text into XML and load relevant portions into our DB.

The parser can do about 100 of these at a time. I have rate-limiters on the background worker so it only polls our queue every so often, which is why it doesn't perform as fast as it could. I can't spin up more than one background worker right now because my HTTP requests start to drop -- the background worker and the web server exist on the same machine, and I believe it is because CPU usage hits 80-95%, although we could use more RAM on it as well.

I need to scale this better. How would you go about doing it?

In answers to several questions:

  • we use Amazon Web Services, so buying cheap extra hardware is a bit different from spawning a new Amazon instance -- maybe somebody has written code that auto-spawns instances based on load?

  • we do have an HTTP server that just stuffs our files into a queue, so the only reason it would be affected is that the CPU is busy dealing with tons of parsing-related work

  • I already rate-limit our background workers, although we don't utilize that in the parser itself

  • I haven't tried nice here yet, but I've used it in the past -- I need to write down some benchmarks on that

  • the parser is completely separate from the web server -- we have nginx/Merb as our web/application server and a Rake task calling C++ as our background worker -- yet they do exist on the same machine
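To make the setup concrete, here is a minimal Ruby sketch of the kind of rate-limited polling loop the background worker runs. The names (`RateLimitedWorker`, `parse_to_xml`) and the poll interval are illustrative stand-ins, not the real code:

```ruby
require "thread"

# Illustrative rate-limited background worker: poll the shared queue at a
# fixed interval and hand each file to the (CPU-heavy) parser one at a time.
class RateLimitedWorker
  def initialize(queue, poll_interval: 5)
    @queue = queue                   # files waiting to be parsed
    @poll_interval = poll_interval   # seconds between polls -- the rate limiter
  end

  # Pull one file off the queue (nil if none waiting) and parse it.
  def run_once
    file = (@queue.pop(true) rescue nil)  # non-blocking pop
    parse_to_xml(file) if file
    file
  end

  def run
    loop do
      run_once
      sleep @poll_interval   # throttle so the web server keeps some CPU
    end
  end

  private

  def parse_to_xml(file)
    # in the real system this would shell out to the C++ (Boost.Spirit) parser
  end
end
```

The `sleep` between polls is the crude rate limit referred to above; it trades parser throughput for web-server responsiveness.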

+4  A: 

I would buy a couple of cheap computers and do the text processing on those. As Jeff says in his latest post, "Always try to spend your way out of a performance problem first by throwing faster hardware at it."

Can Berk Güder
I think the comment makes sense in a desktop-computer type environment. However, there are many cases where it doesn't make sense, such as when you need to build lots of something cheaply (embedded systems), or when portability (handhelds) or power (wireless) is the primary objective.
Tall Jeff
A: 

I assume you have multiple threads where each belongs to one of two groups

  • group A that downloads text files
  • group B that converts text to xml

If you think group B is limiting your throughput, I would set its threads to a lower priority. If there is enough work, the CPU will still be used 100%, but downloads will not be affected.

If my assumption above is correct, you should also be using multi-core and multi-CPU machine(s), as your performance should scale very well with more CPUs.

bh213
If you continue to fetch files at a high rate, but process them at a lower rate, it is just a matter of time before you overflow your queue.
Oddthinking
Well, it should be pretty obvious to stop accepting new data after some threshold has been reached. The issue is that data processing is taking all the CPU from data acquisition. If they manage to transform some sort of natural language into something structured, I expect one if statement isn't what they are wondering about.
bh213
If queue overflow is a risk, then clearly you just don't have the capacity for the service you want to provide: get more hardware or faster algorithms. Throttling submissions is just admitting you can't handle the load.
slim
+7  A: 

Perhaps just placing the background worker at a lower scheduling priority (e.g. using nice) would help. This means that your server can handle requests when it needs to, but when it's not busy you can go full blast with the text processing.

Methinks it will give you much more benefit than staggering the background worker arbitrarily.
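A minimal Ruby-side sketch of that idea, assuming the Rake task that drives the parser can renice itself (the spawned parser path is purely illustrative):

```ruby
# Raise the niceness (i.e. lower the scheduling priority) of the current
# process before doing CPU-heavy work, so the web server wins CPU contention.
# Equivalent to launching the worker under `nice -n 19`.
Process.setpriority(Process::PRIO_PROCESS, 0, 19)  # 0 = this process

# Child processes inherit the niceness, so a spawned parser is niced too:
# system("/path/to/parser", input_file)   # path is illustrative

puts Process.getpriority(Process::PRIO_PROCESS, 0)  # => 19
```

Raising your own niceness needs no special privileges (lowering it back would require root), so this is a one-line change to the background worker.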

Artelius
right. it's OK to use 100% CPU as long as you yield it when it's needed elsewhere. there will still be some small hit but it should not be the end of the world.
frankodwyer
+3  A: 

I am not sure I am following your question exactly, but it sounds like you have an HTTP engine that feeds a work-pending queue, correct? The background thread takes those queue requests and does the heavy lifting, correct?

So, it sounds like the background process is compute-bound and the foreground process is essentially I/O-bound... or at a minimum limited by the rate at which new work can be submitted.

The best way to optimize such a process is to set your background process at a lower priority than the foreground process. This ensures that the background process stays fed with work to do. Then you set up the queue depth between the processes such that its size is limited to the maximum amount of work you want to have pending at once.
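In Ruby terms, a bounded queue like that can be sketched with `SizedQueue` (the capacity of 100 is just an example): producers block once the maximum amount of pending work is reached, which naturally throttles intake while the low-priority worker drains the queue.

```ruby
require "thread"

MAX_PENDING = 100              # illustrative cap on work queued between processes
work = SizedQueue.new(MAX_PENDING)

# Producer side (the HTTP front end): << blocks once MAX_PENDING items are
# waiting, so uploads slow down instead of the queue growing without bound.
work << "upload-1.txt"

# Consumer side (the low-priority background worker): pop blocks when empty.
file = work.pop
puts file                      # => upload-1.txt
```

In the real system the queue spans two processes, so the same back-pressure idea would live in whatever queue service sits between them rather than in an in-process `SizedQueue`.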

Tall Jeff
A: 

I'd put the parser on its own machine. That way it won't impact the web server.

If you don't have the budget for another machine then use virtualization (OpenVZ is cool if your web server is hosted on Ubuntu or CentOS) to limit the CPU quota for the parser.

ewalshe
+1  A: 

One thing I've done, if you have this available, is to move these parsing services to a cloud hosting service.

I've moved a few of my distributed services (search engines, mass emailing, error logging) to a cloud computing service off of my primary machine and it's been a fantastic load off of our primary web server.

Plus, cloud computing has become cheaper and scales almost infinitely.

jerebear
A: 

If you are having trouble servicing interactive requests, you might try upping the niceness of the CPU-bound tasks, then lowering the niceness of the HTTP server. Basically, try to use the system's scheduler to your advantage and don't treat all tasks as equal.

A: 

I don't know what OS you are using, but most of them have functions to specify priorities of threads/processes. As long as the parser process has a lower priority than the HTTP process, it should be fine.

Vilx-
+1  A: 

I don't understand why you would worry about your CPU being at 100%. If a job needs doing, and it's not I/O bound, then your CPU should be at 100%.

What remains is:

  • Do you have enough CPU to do all the work you need to do, in the time available?

If not you need more machines, a faster CPU, or more CPU-efficient algorithms. The first two options are probably cheaper than the third - depending on the scale of your enterprise!

  • Are there some jobs that need to be more responsive than others?

It sounds like there are. It sounds like you want the HTTP server to be responsive, while the parser jobs can complete at their own pace (as long as the queue empties faster than it fills). As others have pointed out, nice tells the OS to allocate low-priority processes the CPU cycles 'left over' after higher-priority processes have taken what they need (although it's not quite as black and white as that).

slim
A: 

Never forget the power bill / hosting prices. Try a profiler to find the bottleneck in your code. If you've never done that, I'm sure you can reduce CPU consumption to 25-50%.

Szundi