Friday 14 January 2011

Virtual machine problems - Hung VM on ESX 3.5

Why does it always happen like this! Major problems with a virtual machine just as I am about to take the afternoon off! :(

I get an email from our early warning system (Nagios - very good by the way, http://www.nagios.org) that one of our virtual machines is down.

So I check and it is correct: it is not responding to ping requests. I try to get onto the virtual console and get no response! :(

OK, I'll reset that virtual machine and all will be fine, or so I thought! The reset gets to 95% complete and then just hangs there!

I look on the VMware website and find an article suggesting that the VM should be reset by restarting the management agent services, so I decide to do this. Before I do, I want to make sure that this is the only VM on this physical ESX 3.5 host, so as not to make the situation worse. First things first: I set the host into "maintenance mode" and VMotion all the other VMs to other hosts in the cluster. Once this is done, I have to cancel the maintenance mode request, as it cannot complete while the stuck VM is still on the host.
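For anyone wanting a command-line sanity check that nothing else is left on the host before touching the agents, here is a rough sketch. The helper name others_on_host and the datastore paths are made up for illustration. It just filters a list of .vmx paths, one per line, and prints everything that is not the stuck VM.

```shell
#!/bin/sh
# Hypothetical helper: read registered .vmx paths on stdin (one per line)
# and print every path except the stuck VM's. No output means the stuck
# VM is the only one left on the host.
others_on_host() {
    stuck_vmx=$1
    # -F fixed string, -x whole-line match, -v invert the match.
    # grep exits non-zero when nothing is printed, so mask that.
    grep -vxF -e "$stuck_vmx" || true
}

# Example with made-up paths:
printf '%s\n' \
    /vmfs/volumes/nfs1/stuck/stuck.vmx \
    /vmfs/volumes/nfs1/web01/web01.vmx |
    others_on_host /vmfs/volumes/nfs1/stuck/stuck.vmx
```

On the ESX 3.x service console the list could come from vmware-cmd -l, which as far as I recall prints the config file path of each registered VM, so the call would be along the lines of: vmware-cmd -l | others_on_host /path/to/stuck.vmx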

I am now in a position to restart the management agents without worry, so I connect to the ESX host with PuTTY as root and run the command:

#> service mgmt-vmware restart

Once this completes, I notice that the host has become disconnected in VirtualCenter, so I wait for it to reconnect. Whilst waiting, I see that the reset virtual machine task has been killed.

After a few minutes the host is reconnected, so I try to power on the VM, but it fails and the VM is now shown in VirtualCenter as invalid! OK, back to the internet. I now find an article which suggests I should remove the VM from the inventory and re-add it, so this is what I do.

I attempt the power on again, and it is now stuck at 1% of "registering virtual machine with the host". OK, it must now be time to raise a priority 1 support call with VMware, as this is a production VM. So I speak to VMware and we set up a WebEx to find out what is wrong. The support guy finds a swap file in the VM's directory on the NFS volume, so he removes it. He also finds locks on the files, but they are not reported properly, and he thinks this could just be a limitation of using NFS volumes with VMware. He carries on, relocates the VM to another physical ESX host using VMotion and then tries to power it on. No joy, so he concludes it must be an issue with permissions and the NFS volume. I am not convinced by this, as we have over 100 VMs on this volume and this is the only one with a problem. Anyway, he goes away!
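I did not note down the exact commands the support guy used, so the following is only a sketch of how leftover lock files could be listed by hand. It assumes the locks show up as hidden files named .lck-something in the VM's directory, which is how ESX marks locks on NFS datastores as far as I know. The function name list_nfs_locks and the path in the usage line are made up.

```shell
#!/bin/sh
# Hypothetical helper: list lock files sitting directly in a VM directory.
# Assumes NFS locks appear as hidden files named ".lck-<something>".
list_nfs_locks() {
    vm_dir=$1
    # -maxdepth 1 keeps the search to the VM directory itself.
    find "$vm_dir" -maxdepth 1 -type f -name '.lck-*'
}

# Usage (made-up path):
#   list_nfs_locks /vmfs/volumes/nfs1/myvm
```

A listing like this only shows that a lock exists, not which host is holding it.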

So, I think, let's see what happens if we copy the VM files to a new directory on the storage and then try to add this into the inventory. I copy the files and change the directory name within the .vmx file. I then add this new fileset to the inventory. I try to power on the VM, but it now complains about a lock on one of the vmdk files.
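The copy-and-edit step could be scripted roughly as follows. This is a sketch under assumptions, not a record of exactly what I typed: the function name copy_vm_dir and the example paths are invented, and it assumes the .vmx refers to the old directory as a path component such as /myvm/.

```shell
#!/bin/sh
# Hypothetical helper: copy a VM directory and point path references in
# the copied .vmx at the new directory name.
copy_vm_dir() {
    old_dir=${1%/}
    new_dir=${2%/}
    old_name=$(basename "$old_dir")
    new_name=$(basename "$new_dir")

    # Refuse to overwrite an existing target.
    if [ -e "$new_dir" ]; then
        echo "target exists: $new_dir" >&2
        return 1
    fi

    cp -R "$old_dir" "$new_dir" || return 1

    # Rewrite "/old_name/" path components in each copied .vmx file.
    # old_name is used as a sed pattern, so characters such as "." in
    # the directory name are matched loosely.
    for vmx in "$new_dir"/*.vmx; do
        [ -f "$vmx" ] || continue
        sed "s|/$old_name/|/$new_name/|g" "$vmx" > "$vmx.tmp" &&
            mv "$vmx.tmp" "$vmx"
    done
}

# Usage (made-up paths):
#   copy_vm_dir /vmfs/volumes/nfs1/myvm /vmfs/volumes/nfs1/myvm-copy
```

The original directory is left untouched, so the old fileset is still there if the copy does not help.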

Hmmm, what now!

One last thing we can try before resorting to the backup. This started on one particular host, so I wonder if it has got its NFS locks screwed up and whether a reboot would cure it. I request this particular host to go into "maintenance mode", which now succeeds as the problematic VM is now "down", and then request a reboot of the host through VirtualCenter.

I wait the few minutes it takes to complete, and then take it out of maintenance mode. I now try to restart the problematic VM, and guess what - it starts correctly - much to my relief! :)
