Hello,

My server process is basically an API that responds to REST requests.

Some of these requests start long-running tasks.

Is it a bad idea to do something like this?

get "/crawl_the_web" do
  Thread.new do
    Crawler.new # this will take many many days to complete
  end
end

get "/status" do
  "going well" # this can be run while there are active Crawler threads
end

The server won't be handling more than 1000 requests a day.

+1  A: 

Not the best idea...

Use a background job runner to run jobs.

POST /crawl_the_web should simply add a job to the job queue. The background job runner will periodically check for new jobs on the queue and execute them in order.

You can use, for example, delayed_job for this, setting up a single separate process to poll for and run the jobs. If you are on Heroku, you can run the delayed_job workers in a separate background worker dyno.
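
For concreteness, here's a minimal sketch of that pattern with delayed_job. CrawlJob is an illustrative name, and the backend configuration (e.g. delayed_job_active_record) is omitted:

require 'sinatra'
require 'delayed_job' # plus a configured backend such as delayed_job_active_record

# delayed_job only requires that the job object respond to #perform.
CrawlJob = Struct.new(:start_url) do
  def perform
    Crawler.new # the multi-day crawl runs in the worker process, not the web app
  end
end

post "/crawl_the_web" do
  Delayed::Job.enqueue(CrawlJob.new(params["url"])) # records the job and returns
  "queued"
end

A separate worker process (rake jobs:work in a Rails app, or a small script around Delayed::Worker) polls the queue and runs the jobs.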

Justice
Why is a separate process better than a worker thread?
Alexandre
If the background tasks are long-running, you'll have to queue them to keep the number of simultaneous tasks under control. Yes, you can do that yourself, but it's usually better to delegate to an already-tested solution.
Javier
Multithreading is difficult to get right. You can try writing it yourself: a thread pool with a fixed number of threads, a thread-safe queueing strategy, worker-thread monitoring, and so on (roughly sketched below). Or just use delayed_job.
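
For scale, here's roughly what that DIY version would look like (a sketch only; the pool size of 5 is an arbitrary assumption):

require 'sinatra'
require 'thread'

JOBS = Queue.new # Ruby's built-in Queue is thread-safe

WORKERS = Array.new(5) do
  Thread.new do
    while (job = JOBS.pop) # blocks until a job is available
      begin
        job.call
      rescue => e
        warn "job failed: #{e.message}" # otherwise one failure kills the worker thread
      end
    end
  end
end

get "/crawl_the_web" do
  JOBS << -> { Crawler.new } # enqueue instead of spawning an unbounded thread
  "queued"
end

Even this still leaves monitoring, restarts, and persistence across deploys unsolved, which is the point.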
Justice
My impression is that the issue isn't so much about choosing threads versus processes as it is about choosing a ready-made solution versus writing everything yourself. The ready-made solution happens to use separate processes. If the separate-process approach has additional advantages (beyond just that it's the technique delayed_job uses), they haven't been discussed here yet.
Rob Kennedy
The ready-made solution permits another thread, another process, or another machine to execute long-running jobs and send back results. It has the best of both worlds.
Justice
I wanted to avoid introducing another dependency into my application. If I use worker threads, I only need to manage a single process. Since the number of running threads will always be pretty small (5-10), I think a thread pool and queueing are unnecessary. If things grow, both will have to be introduced, and then I'll consider taking on the dependency of such a queue system. In the situation I described, what are the potential hazards?
Alexandre
A: 

If you do this, how are you planning to stop/restart your Sinatra app? When you finally deploy, your application will probably be served by Unicorn, Passenger/mod_rails, etc. Unicorn manages the lifecycle of its child processes and has no knowledge of any long-running threads you may have launched inside them, so when it kills or recycles a worker, those threads die mid-job. That's a problem.

As someone suggested above, use delayed_job, Resque, or any other queue-based system to run background jobs. You get persistence of the jobs and horizontal scalability (just launch more workers on more nodes), etc.
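
For example, a Resque version of the endpoint might look like this (CrawlJob and the :crawler queue name are illustrative; Resque also needs a running Redis):

require 'sinatra'
require 'resque'

class CrawlJob
  @queue = :crawler # Resque reads the target queue from this instance variable

  def self.perform(start_url)
    Crawler.new # runs inside a Resque worker process, outside the web workers
  end
end

post "/crawl_the_web" do
  Resque.enqueue(CrawlJob, params["url"]) # pushes to Redis and returns immediately
  "queued"
end

Workers started with rake resque:work QUEUE=crawler (on as many machines as you like) drain the queue independently of the web processes' lifecycle.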

Harish
A: 

Starting threads during request processing is a bad idea.

Besides the fact that you cannot control your worker threads (start or stop them in a controlled way), you'll quickly get into trouble if you start a thread inside request processing. Think about what happens: the request ends and the process gets ready to serve the next one, while your worker thread is still running and accessing process-global resources: the database connection, open files, class variables, global variables, and so on. Sooner or later your worker thread (or a library used from it) will affect the main thread, break other requests, and be almost impossible to debug.
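
A contrived illustration of that failure mode (the counter is a stand-in for any shared resource, such as a database connection):

require 'sinatra'

$active_crawls = 0 # process-global state, shared by every request and every thread

get "/crawl_the_web" do
  Thread.new do
    $active_crawls += 1 # unsynchronized read-modify-write on shared state
    Crawler.new         # still running long after this request has finished
    $active_crawls -= 1
  end
  "started"
end

get "/status" do
  # Served while crawler threads are still mutating $active_crawls; with several
  # threads the unsynchronized increments can interleave and the count drifts.
  "#{$active_crawls} crawls running"
end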

You're really better off using separate worker processes. delayed_job, for example, is a really small dependency and easy to use.

Andreas