We’re using HP BL495c G6 blades with ESXi 4.0 U1, and the onboard NICs are randomly losing their link when the host reboots. Sometimes it happens after a single reboot; other times it takes four reboots to trigger the issue.
The server uses the HP NC532i embedded NIC, which is a rebadged Broadcom 57711E. In ESXi, the NIC uses the bnx2x driver.
Things tried that haven’t worked to date:
- Broadcom v1.48 and v1.52 drivers from the VMware website
- Upgrading to ESXi 4.0 Update 2
- Turning off auto-negotiation and hard-coding the speed and duplex on both the switchports and the NICs
- Replacing the system board, so it’s not a failed NIC
The only way to bring these NICs back to life is to reset the internal switchports on the blade enclosure switches (“shut” and “no shut” the ports).
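For anyone who hasn’t done this before, a port bounce on a Cisco IOS switch looks roughly like the sketch below. The interface name GigabitEthernet0/1 is only an assumption for illustration; substitute whichever internal port maps to the affected blade.

```
! Interface name is hypothetical: use the internal port for the affected blade
configure terminal
 interface GigabitEthernet0/1
  shutdown
  no shutdown
 end
```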
I have been working with VMware to resolve the issue. They tell me it’s a known issue and have given me some new Broadcom drivers to try (which haven’t worked yet). I’ve since sent the logs back to VMware for analysis.
If you’re seeing this issue on similar hardware, I’d very much like to hear from you. I’ll also update this post as I get updates from VMware.
Update: Found the problem! The issue is with the firmware on the Cisco Catalyst 3020 switches the blade enclosure uses. Firmware releases before 12.2(50) apply overzealous link-flap detection to the internal switchports, which puts the port into an err-disabled state. The 12.2(50) firmware disables flap detection for internal ports.
If you are not able to upgrade the firmware, use this workaround. It raises the flap detection thresholds so the switch doesn’t shut down the port when the blade reboots.
“errdisable flap-setting cause link-flap max-flaps 10 time 10”
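As a rough sketch of how this might be applied: the command goes in global configuration mode, and the show commands below are standard IOS commands for checking err-disabled ports and the current flap thresholds. I’m assuming they are available on your 3020 firmware release; the exact output varies by version.

```
! See whether any ports are currently err-disabled
show interfaces status err-disabled

! Apply the workaround in global configuration mode
configure terminal
 errdisable flap-setting cause link-flap max-flaps 10 time 10
 end

! Confirm the new link-flap thresholds (assumed available on this release)
show errdisable flap-values

! Save the change so it survives a switch reload
copy running-config startup-config
```

A port that is already err-disabled will most likely still need a “shut” and “no shut” to come back up, since the new thresholds only affect flap detection from that point on.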