Unable to SSH into ESXi server

You’ve turned on SSH, and even enabled the firewall profile for SSH. But for some reason you still can’t SSH into the ESXi server, and you get a “Connection Refused” message back when hitting port 22. You’ve also checked your corporate firewall and confirmed that it’s not blocking the traffic.

Have a look in Tasks and Events to see if the server’s ramdisk for root is full. There’s a good chance that it is. This is a problem I’ve encountered with vSphere 5.1 recently, and it prevents SSH from starting properly.

The ramdisk fills up because ESXi isn’t truncating /var/log/wtmp properly. That’s a separate issue; to fix the immediate problem, first vMotion all the VMs off the ESXi server. You may get errors like “Cannot migrate <vmname>”, which are also caused by the full ramdisk. Re-running DRS for the cluster often gives it a kick and gets the migrations going again.

Once all VMs are off the server, reboot it and your problems should go away.


Virtual machine must be running in order to be migrated

Lately I’ve run into a strange issue where an ESXi server gets into a state where the VMs running on it cannot be migrated to other ESXi servers. This is the error that comes up when you try to vMotion the problematic VMs:

A general system error occurred: Virtual machine must be running in order to be migrated.

I’ve contacted VMware support, and they tell me the only fix is to shut down and power on the affected VMs. Simply rebooting them does not work, as the problem seems to be linked to their registration with ESXi or vCenter Server. You won’t be able to unregister them through Tech Support Mode on ESXi either, because ESXi won’t let you unregister a VM while it is running.

After you do this, you had best reboot the ESXi server as well.

If anyone has come up with any alternate solutions, I’d love to hear from you.

ESXi NIC disconnection issues with the HP NC532i

We’re using HP BL495c G6 blades with ESXi 4.0 U1, and the onboard NICs are disconnecting randomly after a reboot. Sometimes it happens after a single reboot; other times it takes four reboots to trigger the issue.

The server uses the HP NC532i embedded NIC which is a rebadged Broadcom 57711E. In ESXi, the NIC uses the bnx2x driver.

Things tried that haven’t worked to date:

  • Broadcom v1.48 and v1.52 drivers from the VMware website
  • Upgrading to Update 2
  • Turning off auto-negotiation and hard-coding the speed and duplex on the switchports and NICs
  • Replacing the system board (so it’s not a failed NIC)

The only way to bring these NICs back to life is to reset the internal switchports on the blade enclosure switches (“shut” and “no shut” the ports).
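
As a rough illustration, bouncing one internal port from the switch CLI looks something like the session below. The “switch” prompt and the interface name GigabitEthernet0/1 are placeholders, not values from our enclosure; substitute the internal port that maps to the affected blade.

```
! Placeholder hostname and interface; adjust for your enclosure
switch# configure terminal
switch(config)# interface GigabitEthernet0/1
! Administratively disable the port, then bring it back up
switch(config-if)# shutdown
switch(config-if)# no shutdown
switch(config-if)# end
```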

I have been working with VMware to resolve the issue. They tell me it’s a known issue and have given me some new Broadcom drivers to try, which haven’t worked so far. I’ve since sent the logs back to VMware for analysis.

If you’re seeing this issue on similar hardware, I’d very much like to hear from you. I’ll also update this post as I get updates from VMware.

Update: Found the problem! The issue is with the firmware on the Cisco Catalyst 3020 switches the blade enclosure uses. Before firmware 12.2(50), the switch would do overzealous flap detection on internal switchports, which put the port into an err-disabled state. The 12.2(50) firmware disables flap detection for internal ports.

If you are not able to upgrade the firmware, use this workaround. It raises the flap detection threshold so the switch doesn’t shut down the port when the server reboots.

  errdisable flap-setting cause link-flap max-flaps 10 time 10
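
For context, the workaround is a global configuration command, so on the switch it would be entered roughly as follows. This is a sketch of a typical IOS session rather than output captured from our 3020s: the “switch” prompt is a placeholder, and the two commands after “end” are my additions for checking the new thresholds and keeping the change across a switch reload.

```
! Placeholder hostname; run on each affected blade switch
switch# configure terminal
! Allow up to 10 link flaps within 10 seconds before err-disabling a port
switch(config)# errdisable flap-setting cause link-flap max-flaps 10 time 10
switch(config)# end
! Confirm the link-flap thresholds now in effect
switch# show errdisable flap-values
! Save the change so it survives a switch reload
switch# copy running-config startup-config
```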