Our Rails application has two environments that we deploy to servers a Staging environment and the default Production environment.
The staging.rb file is a copy of the production.rb form the config/environments folder. The difference between the two is whiny nils is set to true:
config.whiny_nils = true
Since the rails application mostly used for it's API, we ran it on one of our internal staging servers for the developers to work with. This worked, without a hitch for almost 4 months. When the time came to transfer to our production server, the stack began to crash consistently whenever a POST or PUT came in with a large (sometimes VERY VERY large) body. When testing between the two servers the same requests were handled perfectly by the staging server.
The most frustrating part of the crashes/hangs were the lack of logs or tracing down where in the stack (nginx, phusion passenger, ruby 1.9 patch level 243, rails 2.3.4) where the crash was happening. Nothing appeared in the nginx error log, rails log, or anywhere that we could find. Since we were running the production server with updated versions of nginx, passenger, and ruby(higher patch level then staging, but still 1.9) we began reverting each component one at a time, even going so far as transferring all of the executables and support files (basically everything we had installed in /usr/local) to the production machine to no avail. Just as we were about to wipe the machine and try every step again someone suggested switching the production machine to the "staging" environment . . . and like magic, issue resolved!
Wanting to know what could have possibly caused the error, we began combing through the rails core, our own code, and all of our plugins looking for some clue as to what could possibly causing such a massive hang/crash in the production environment, once again to no avail.
The only clue I could find was the behavior. When testing "on" the app(one of the pages the rails app actually serves up) I would crash the app by sending the request, and then after frequent refreshes (usually 3-4) I would be able to produce an error from the Nginx log and eventually the app would begin to handle requests again. The error is as follows:
Error during failsafe response: incompatible character encodings: UTF-8 and ASCII-8BIT 2009/10/09 17:52:40 [error] 8691#0: *88 upstream prematurely closed connection while reading response header from upstream, client: *my ip address*, server: myapp.mydomain.com, request: "GET /api/sections/4/edit HTTP/1.1", upstream: "passenger://unix:/tmp/passenger.8677/master/helper_server.sock:", host: "myapp.mydomain.com" *** Exception NoMethodError in application (undefined method `each' for nil:NilClass) (process 8703): from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/rack/request_handler.rb:95:in `process_request' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_request_handler.rb:206:in `main_loop' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:376:in `start_request_handler' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:334:in `block in handle_spawn_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/utils.rb:182:in `safe_fork' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:332:in `handle_spawn_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:162:in `start' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/railz/application_spawner.rb:213:in `start' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:261:in `block (2 levels) in spawn_rails_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:126:in `lookup_or_add' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:255:in `block in spawn_rails_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:80:in `block in synchronize' from :8:in `synchronize' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server_collection.rb:79:in `synchronize' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:254:in `spawn_rails_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:153:in `spawn_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/spawn_manager.rb:286:in `handle_spawn_application' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:351:in `main_loop' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/lib/phusion_passenger/abstract_server.rb:195:in `start_synchronously' from /usr/local/lib/ruby/gems/1.9.1/gems/passenger-2.2.4/bin/passenger-spawn-server:61:in `'
Normally when a character encoding error arises, my first turn is to ruby 1.9 . . . however as you can tell from my testing, it was the same version on both machines!
After all of that, I guess I am wondering . . . would anyone have any ideas what is going on? Obviously we can run our app in staging for the time being, but I worry I may have found something deeper that will have to be addressed. Does have any ideas on the next place I should look for where this is happening?
Our setup:
Mac OS X Server: 10.6.1,
Rails 2.3.4,
Ruby 1.9p243,
Nginx 0.8.17,
Passenger 2.2.5
Our Required Gems:
environment.rb
daemons
rmagick
test.rb
rspec
rspec-rails
factory-girl
rack-test
Our installed plugins:
acts-as-dag (active record plugin for creating directed-acyclic-graphs)
daemon_generator
globalize2
no-peeping-toms (for testing)
thinking-sphinx
UPDATE (In response to khelll):
I have attempted to add the config.whiny_nils = true to the production environment, however the crash still occurs.
Also, I went back to our staging server and set the environment to "Production" . . .same crash!
Some clarification on what I meant by a "large" request body. One of the POST/PUT's that would consistently crash the application was approximately 20,000 characters (of json). Since the API is consistently used throughout the day with small PUTS/POSTS and remained up, but only crashed/hung when these larger requests were made, I assumed the two were connected.
As far as Rack/Ruby 1.9 is concerned. Due to the amount of information out there about Rack and 1.9, I did upgrade our Rack gem to the latest in the git repository(which supposedly fixed some of the 1.9 issues). I had read about a significant amount of difficulty regarding rewindable_input, ruby 1.9 and more… however since I wasn't getting the rewindable_input error that I had experienced with my other 1.9 app, I did assume that this was a different problem. Also, I ruled out Rack when changing the rails environment resolved the issue (when I searched through the Rack source code, there didn't seem to be any environment specific methods that would be causing the error).
Hope this helps!
UPDATE in response to pauliephonic
There are no messages hitting the rails logs at all (which actually prompted me to search for some time in our web stack for the issue). My clue that a crash/hang is occurring is that once the large request is made, the application returns nothing but 500 errors on every request, however those 500 errors do not show up in the Rails logs.
Our database configuration is identical (we were using mysql cluster, so it was literally identical, now using local mysql database for the time being, but have confirmed the error exits regardless of the database used)
As far as multiple bytes/unicode is concerned . . . we are working in an internationalized app . . . however I don't think the way rails handles unicode changes between production and others right? As I stated above, this happened on a POST
or PUT
. The way I would test during my debugging was to go to the same edit pages of one of my large, heavily nested models and just try to "save" it. This would crash the app in production, but not crash the app in staging. Every time I tested the same characters, same content, same button, same behavior . . . different response based on environment. I couldn't even pepper puts
statements everywhere in my code because (it appeared that) requests weren't making it to the rails application. I was getting no error messages in my Rails logs or Nginx error logs (save the one I posted on multiple refreshes).