Two RAID volumes, VMware kernel/console running on a RAID1, vmdks live on a RAID5. Entering a login at the console just results in SCSI errors, no password prompt. Praise be, the VMs are actually still running. We're thinking, though, that upon reboot the kernel may not start again and the VMs will be down.

We have database and disk backups of the VMs, but not backups of the vmdks themselves.

What are my options?

Our current best idea is

  1. Use VMware Converter to create live vmdks from the running VMs, as if it were a P2V migration.
  2. Reboot the host server and run RAID diagnostics to figure out what in the "h" happened.
  3. Attempt to start ESX again, possibly after rebuilding its RAID volume
  4. Possibly have to re-install ESX on its volume and re-attach VMs
  5. If that doesn't work, attach the "live" vmdks created in step 1 to a different VM host.
A:

It was the backplane. Both drives of the RAID1 and one drive of the RAID5 were inaccessible. Incredibly, the VMware hypervisor continued to run for three days from memory with no access to its host disk, keeping the VMs it managed alive.

At step 3 above we diagnosed the hardware problem and replaced the RAID controller, cables, and backplane. After restart, we re-initialized the RAID by instructing the controller to query the drives for their configurations. Both volumes came up degraded, and both were repaired successfully.

At step 4, it was not necessary to reinstall ESX, although at bootup it did not want to register the VMs. We had to dig up some buried management stuff to instruct the kernel to resignature the VMs. (Search the VMware docs for "resignature.")

I believe that our fallback plan would have worked: the VMware Converter images of the VMs that were running "orphaned" were tested and ran fine with no data loss. I highly recommend performing a VMware Converter imaging of any VM that gets into this state, after shutting down as many services as possible and getting the VM into as read-only a state as possible. Loading a vmdk, either elsewhere or on the original host as a repair, is usually going to be WAY faster than rebuilding a server from the ground up with backups.
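If the Converter images get copied to a second host as the fallback, it may be worth confirming that each copy is byte-identical to the original before you depend on it. Here is a minimal sketch in Python, assuming the images are ordinary files reachable from wherever the script runs. The function names are hypothetical and not part of any VMware tooling, and this is not something we ran during the incident.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file.

    The file is read in chunks so a large disk image never has to
    fit in memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """Return True if both files have the same size and SHA-256 digest."""
    original, copy = Path(original), Path(copy)
    # Cheap check first: a size mismatch means we can skip hashing entirely.
    if original.stat().st_size != copy.stat().st_size:
        return False
    return sha256_of_file(original) == sha256_of_file(copy)
```

You would call it as `copies_match(source_image, copied_image)` with your own paths. A matching digest only shows the copy is intact; it says nothing about whether the guest inside is consistent, so booting the image on a test host is still the real check.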

Aidan Ryan