Unable to SSH into ESXi server

You’ve turned on SSH, and even enabled the firewall profile for SSH. But for some reason you still can’t SSH into the ESXi server and get a “Connection Refused” message back when hitting port 22. You’ve also checked your corporate firewall and confirmed it’s not blocking the traffic.

Have a look in Tasks and Events to see if the server’s ramdisk for root is full. There’s a good chance that it is. This is a problem I’ve encountered recently with vSphere 5.1, and a full root ramdisk prevents SSH from starting properly.
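
If you can get to the local console (Tech Support Mode / ESXi Shell), it only takes a minute to confirm. This is just a sketch using standard ESXi commands; the wtmp path comes from the next paragraph:

vdf -h                                  # shows ramdisk usage, including root
esxcli system visorfs ramdisk list      # the same information via esxcli on 5.x
ls -lh /var/log/wtmp                    # see whether wtmp has ballooned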

The cause of the ramdisk filling up is ESXi not truncating /var/log/wtmp properly. That’s a separate issue, but to fix the immediate problem, vMotion all the VMs off the ESXi server first. You may get errors like “Cannot migrate <vmname>”; they’re caused by the full ramdisk too. You can often give it a kick by re-running DRS for the cluster, which gets the migrations going again.

Once all VMs are off the server, reboot it and your problems should go away.


Virtual machine must be running in order to be migrated

Lately I’ve run into a strange issue where an ESXi server gets into a state where the VMs running on it can’t be migrated to other ESXi servers. This is the error that comes up when you try to vMotion the problematic VMs:

A general system error occurred: Virtual machine must be running in order to be migrated.

I’ve contacted VMware support and they tell me there’s no way to fix this other than to shut down and power up the affected VMs. Simply rebooting them does not work, as the problem seems to be linked to their registration with ESXi or vCenter Server. You won’t be able to unregister them directly through Tech Support Mode on ESXi either, as it simply won’t let you while the VM is running.
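
For reference, this is roughly what the power cycle looks like from Tech Support Mode once you’ve found an affected VM. It’s only a sketch; a normal guest shutdown and power on from the vSphere Client does the same thing. <vmid> is whatever ID getallvms reports for the VM:

vim-cmd vmsvc/getallvms               # note the Vmid of the affected VM
vim-cmd vmsvc/power.shutdown <vmid>   # clean guest shutdown (needs VMware Tools)
vim-cmd vmsvc/power.off <vmid>        # hard power off if the shutdown hangs
vim-cmd vmsvc/power.on <vmid>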

After you do this, you had best reboot the ESXi server as well.

If anyone has come up with any alternate solutions, I’d love to hear from you.

ESXi NIC disconnection issues with the HP NC532i

We’re using HP BL495c G6 blades with ESXi 4.0 U1, and the onboard NICs are disconnecting randomly upon reboot. Sometimes it happens after a single reboot, other times it takes four reboots to trigger the issue.

The server uses the HP NC532i embedded NIC which is a rebadged Broadcom 57711E. In ESXi, the NIC uses the bnx2x driver.
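
If you want to double-check which driver and firmware a NIC is actually running, the standard commands from Tech Support Mode will tell you (vmnic0 is just an example, substitute the affected uplink):

esxcfg-nics -l          # lists the physical NICs with their driver and link state
ethtool -i vmnic0       # shows the driver version and NIC firmware version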

Things tried that haven’t worked to date:

  • Broadcom v1.48 and v1.52 drivers on the VMware website
  • Upgrading to Update 2
  • Turning off auto-negotiation and hard coding the speed and duplex of the switchports and NICs (see the sketch after this list)
  • Replacing the system board, so it’s not a failed NIC
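
For reference, this is the sort of thing we did on the ESXi side to take auto-negotiation out of the picture (vmnic0 is an example, and whether a 10GbE uplink will accept a forced speed depends on the NIC/driver):

esxcfg-nics -s 10000 -d full vmnic0     # hard code 10Gb/full duplex on the uplink
esxcfg-nics -a vmnic0                   # put it back to auto-negotiation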

The only way to bring these NICs back to life is to reset the internal switchports on the blade enclosure switches (“shut” and “no shut” the ports).
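
For anyone who hasn’t had to do this on the enclosure switch before, the reset is just the usual IOS port bounce. Gi0/1 below is an example internal (blade-facing) port, so substitute the one mapped to your blade:

configure terminal
 interface GigabitEthernet0/1
  shutdown
  no shutdown
 end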

I have been working with VMware on this; they tell me it’s a known issue and have given me some new Broadcom drivers to try (which haven’t worked yet). I’ve since sent the logs back to VMware for analysis.

If you’re seeing this issue on similar hardware, I’d very much like to hear from you. I’ll also update this post as I get updates from VMware.

Update: Found the problem! The issue is with the firmware on the Cisco Catalyst 3020 switches the blade enclosure uses. Before firmware 12.2(50), the switch does overzealous flap detection on internal switchports, which puts the port into an err-disabled state. The 12.2(50) firmware disables flap detection for internal ports.
If you are not able to upgrade the firmware, use this workaround. It bumps up the flap detection thresholds so the switch doesn’t shut down the port when the blade reboots:
“errdisable flap-setting cause link-flap max-flaps 10 time 10”
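
To confirm the switch is actually err-disabling the ports (rather than something else dropping the link), the standard IOS show commands are enough:

show interfaces status err-disabled
show errdisable flap-values

The first lists any ports currently in the err-disabled state and the cause; the second shows the flap detection thresholds in effect.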

How to install vib files in ESXi

At some stage while in contact with VMware support, they may give you drivers to try. These drivers will be in VIB (vSphere Installation Bundle) format and will look something like vmware-esx-drivers-net-bnx2x-400.1.52.12.v40.4-1.0.4.00000.x86_64.vib. I don’t know why this isn’t documented anywhere in the official docs, but here’s how to install these driver bundles into ESXi (there’s a worked example after the steps).

  1. scp the file to the ESXi server
  2. Run “esxupdate -b <filename>.vib --nosigcheck --nodeps update”.
  3. Run “esxupdate query --vib-view” to confirm that the driver bundle is installed. You may need to reboot for the driver to take effect.
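
Putting it together with the bnx2x bundle above (the hostname esxi01 and the /tmp path are just examples, use whatever suits your environment):

scp vmware-esx-drivers-net-bnx2x-400.1.52.12.v40.4-1.0.4.00000.x86_64.vib root@esxi01:/tmp/
esxupdate -b /tmp/vmware-esx-drivers-net-bnx2x-400.1.52.12.v40.4-1.0.4.00000.x86_64.vib --nosigcheck --nodeps update
esxupdate query --vib-view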

If you are fiddling with different driver revisions, you may need to remove a later version before you can roll back to a previous one; otherwise ESXi won’t let you reinstall an older bundle. To remove a driver bundle, use the following command:

esxupdate -b <driver bundle name> remove

You can get the driver bundle name by running “esxupdate query --vib-view”.
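
Again using the bnx2x bundle as an example. Note that the bundle name to pass to remove is whatever “esxupdate query --vib-view” lists for it, which may not be exactly the same as the .vib filename:

esxupdate query --vib-view
esxupdate -b vmware-esx-drivers-net-bnx2x-400.1.52.12.v40.4-1.0.4.00000 remove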

Recovering from accidental VMFS Datastore deletion

When you forcefully remove a Datastore while ESXi servers are still connected to it, all sorts of weird and wonderful things can happen. We recently had a team member accidentally install ESXi onto a LUN, blowing away the Datastore that was on it. The ESXi servers become unstable, and the VMs that were running from the Datastore go into a zombie state: they may respond to pings but are not fully there, because they are still running from memory while the disks have been pulled out from under them. The vCenter server will also exhibit high DB load as the ESXi servers try to update the statuses of VMs which aren’t actually there anymore.

It’s a mess, and we’ve recently had to go through it. It took us an entire day to discover what really happened, because the person who did it didn’t even realise he had done it, so we had to check LUN presentation and the rest of it. It had us stumped, thinking the VMFS filesystem had somehow become corrupted, until VMware support jumped onto our servers through WebEx and found the problem.

Here’s how you clean up the mess.

  1. Turn off HA and DRS at the cluster level, because they get in the way.
  2. Remove any greyed out VMs from vCenter Server.
  3. Log into each ESXi server individually and remove the “Unknown” greyed out VMs from Inventory.
  4. SSH into each server in turn: run /sbin/services.sh stop, cd to /opt/vmware/vmware/uninstallers and uninstall both aam and vpxa, then run /sbin/services.sh start (see the command sketch after this list). Add the server back into the cluster/vCenter Server; doing this resyncs the ESXi server’s inventory with vCenter Server.
  5. Reboot each ESXi server in turn to remove all zombie processes/VMs, which will still be running in the background.
  6. Now clean up references to the old Datastore. Go into Inventory -> Datastores. If you still see the old Datastore there, click on it, click the Virtual Machines tab, and remove everything listed there from Inventory. Since none of these VMs exist anymore, it’s a safe operation. The Datastore should disappear after that.
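
For step 4, the commands look roughly like this. The uninstaller script names below are from memory and vary between builds, so list the directory first and run whatever is actually there:

/sbin/services.sh stop
cd /opt/vmware/vmware/uninstallers
ls                               # check the exact uninstaller script names for your build
./VMware-vpxa-uninstall.sh       # assumed name: removes the vCenter agent (vpxa)
./VMware-aam-ha-uninstall.sh     # assumed name: removes the HA agent (aam)
/sbin/services.sh start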

It’s not a pretty process. If you can, disconnect any fibre cables before the install, but in the case of blade servers, just be very, very careful.