views:

364

answers:

4

Our Rails application has two environments that we deploy to servers a Staging environment and the default Production environment.

The staging.rb file is a copy of the production.rb form the config/environments folder. The difference between the two is whiny nils is set to true:

config.whiny_nils = true

Since the rails application mostly used for it's API, we ran it on one of our internal staging servers for the developers to work with. This worked, without a hitch for almost 4 months. When the time came to transfer to our production server, the stack began to crash consistently whenever a POST or PUT came in with a large (sometimes VERY VERY large) body. When testing between the two servers the same requests were handled perfectly by the staging server.

The most frustrating part of the crashes/hangs were the lack of logs or tracing down where in the stack (nginx, phusion passenger, ruby 1.9 patch level 243, rails 2.3.4) where the crash was happening. Nothing appeared in the nginx error log, rails log, or anywhere that we could find. Since we were running the production server with updated versions of nginx, passenger, and ruby(higher patch level then staging, but still 1.9) we began reverting each component one at a time, even going so far as transferring all of the executables and support files (basically everything we had installed in /usr/local) to the production machine to no avail. Just as we were about to wipe the machine and try every step again someone suggested switching the production machine to the "staging" environment . . . and like magic, issue resolved!

Wanting to know what could have possibly caused the error, we began combing through the rails core, our own code, and all of our plugins looking for some clue as to what could possibly causing such a massive hang/crash in the production environment, once again to no avail.

The only clue I could find was the behavior. When testing "on" the app(one of the pages the rails app actually serves up) I would crash the app by sending the request, and then after frequent refreshes (usually 3-4) I would be able to produce an error from the Nginx log and eventually the app would begin to handle requests again. The error is as follows:

    Error during failsafe response: incompatible character encodings: UTF-8 and ASCII-8BIT
    2009/10/09 17:52:40 [error] 8691#0: *88 upstream prematurely closed connection while reading response header from upstream, client: *my ip address*, server: myapp.mydomain.com, request: "GET /api/sections/4/edit        HTTP/1.1", upstream: "passenger://unix:/tmp/passenger.8677/master/helper_server.sock:", host: "myapp.mydomain.com"
    *** Exception NoMethodError in application (undefined method `each' for nil:NilClass) (process 8703): from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/rack/request_handler.rb:95:in `process_request'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_request_handler.rb:206:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:376:in `start_request_handler'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:334:in `block in handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/utils.rb:182:in `safe_fork'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:332:in `handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:162:in `start'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:213:in `start'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:261:in `block (2 levels) in spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:126:in `lookup_or_add'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:255:in `block in spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:80:in `block in synchronize'
    from :8:in `synchronize'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:79:in `synchronize'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:254:in `spawn_rails_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:153:in `spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:286:in `handle_spawn_application'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously'
    from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/bin/passenger-spawn-server:61:in `'

Normally when a character encoding error arises, my first turn is to ruby 1.9 . . . however as you can tell from my testing, it was the same version on both machines!

After all of that, I guess I am wondering . . . would anyone have any ideas what is going on? Obviously we can run our app in staging for the time being, but I worry I may have found something deeper that will have to be addressed. Does have any ideas on the next place I should look for where this is happening?

Our setup: Mac OS X Server: 10.6.1,
Rails 2.3.4,
Ruby 1.9p243,
Nginx 0.8.17,
Passenger 2.2.5

Our Required Gems:
environment.rb
daemons
rmagick
test.rb
rspec
rspec-rails
factory-girl
rack-test

Our installed plugins:
acts-as-dag (active record plugin for creating directed-acyclic-graphs)
daemon_generator
globalize2
no-peeping-toms (for testing)
thinking-sphinx


UPDATE (In response to khelll):

I have attempted to add the config.whiny_nils = true to the production environment, however the crash still occurs.

Also, I went back to our staging server and set the environment to "Production" . . .same crash!

Some clarification on what I meant by a "large" request body. One of the POST/PUT's that would consistently crash the application was approximately 20,000 characters (of json). Since the API is consistently used throughout the day with small PUTS/POSTS and remained up, but only crashed/hung when these larger requests were made, I assumed the two were connected.

As far as Rack/Ruby 1.9 is concerned. Due to the amount of information out there about Rack and 1.9, I did upgrade our Rack gem to the latest in the git repository(which supposedly fixed some of the 1.9 issues). I had read about a significant amount of difficulty regarding rewindable_input, ruby 1.9 and more… however since I wasn't getting the rewindable_input error that I had experienced with my other 1.9 app, I did assume that this was a different problem. Also, I ruled out Rack when changing the rails environment resolved the issue (when I searched through the Rack source code, there didn't seem to be any environment specific methods that would be causing the error).

Hope this helps!


UPDATE in response to pauliephonic

There are no messages hitting the rails logs at all (which actually prompted me to search for some time in our web stack for the issue). My clue that a crash/hang is occurring is that once the large request is made, the application returns nothing but 500 errors on every request, however those 500 errors do not show up in the Rails logs.

Our database configuration is identical (we were using mysql cluster, so it was literally identical, now using local mysql database for the time being, but have confirmed the error exits regardless of the database used)

As far as multiple bytes/unicode is concerned . . . we are working in an internationalized app . . . however I don't think the way rails handles unicode changes between production and others right? As I stated above, this happened on a POST or PUT. The way I would test during my debugging was to go to the same edit pages of one of my large, heavily nested models and just try to "save" it. This would crash the app in production, but not crash the app in staging. Every time I tested the same characters, same content, same button, same behavior . . . different response based on environment. I couldn't even pepper puts statements everywhere in my code because (it appeared that) requests weren't making it to the rails application. I was getting no error messages in my Rails logs or Nginx error logs (save the one I posted on multiple refreshes).

A: 

What I can understand is that config.whiny_nils is making the whole difference. Were you I would look at activesupport/lib/active_support/whiny_nils.rb file (which looks so simple) and try from there to play around it to understand what's making the difference. I believe it's something related to the type of exception you are getting in production environment, which might not be raised when using whiny_nils.

I believe that you need to give more details also regarding "Posting a large body crashes our stack" part, cause it might be a problem with Rack and Ruby 1.9

khelll
A: 

From the look of things, that error is Ruby being persnickety about encodings. You've got a string that it thinks is UTF-8, and you're treating it as raw bytes, probably correctly. You need to identify the problematic string and call buggy_string.force_encoding(Encoding::ASCII_8BIT) on it. But hell if I know which piece of code is operating on that string or why it happens only in production. I wouldn't be surprised to discover the problem is actually deep within the bowels of Passenger.

Regarding the difference between staging and production, it's almost certainly an issue with something that should be moved around as a byte string, and it's being treated as a character string. There's a ton of code that only happens in the production environment (like caching) and if any of that code does this, that's your problem.

The whiny_nils thing is likely irrelevant.

Bob Aman
A: 

I would try to run the app in production on apache/passenger, see if the problem is specific to nginx/passenger

pjb3
A: 

I had a similar issue for the last few months -- not frequently enough to really debug it until recently. In my case, telling Passenger where to put temp buffer files did the trick. What made it difficult to find was not only the lack of error messages in the log files but the fact that this buffer is used not only for multipart posts but any kind of large post body.

PassengerUploadBufferDir /tmp
Adam Elhardt