We’ve been having this weird problem for a while: after sitting idle for a day or two, a VM would drop off the network and stay unavailable until a few minutes after you tried to vMotion it. After a reboot, it worked fine. Networking checked out. Storage checked out. What could be wrong?
Turned out the culprit was memory overcommitment on our ESX servers. Who would have thought! We had more memory allocated to VMs than there was physical memory, even though the green memory indicator said otherwise.
A bit of background on why this happens. ESX uses a number of techniques to conserve memory so that physical memory is used efficiently. When memory starts to get scarce, ESX will engage a “balloon driver” in idle VMs that have VMware Tools installed. The driver inflates inside the guest, which pushes the guest OS to page application memory out to its own swap. ESX then reclaims the freed memory for use in other VMs.
If memory is so tight that ESX can’t make the guest OS swap and can’t grab memory from elsewhere, as a last resort it will start to use VM swap (which is different from guest OS swap). When ESX starts a VM, it writes a file the same size as the memory allocated to that VM. ESX will then swap the more idle VMs out to this file without the guest knowing about it. This has a major impact on performance and... you guessed it, vMotion. Your VM will be unavailable while ESX swaps everything back into memory. You can use the performance graphs to see how much of a VM has been swapped, which tells you how badly it will be affected. I’ve seen a few megs’ worth of swap not affect a VM, but a few hundred megs will.
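To make the overcommitment idea concrete, here is a rough sketch of the sum I should have done up front: total memory allocated to VMs divided by the host’s physical memory. The VM names and sizes are made up for illustration, and the function name is my own; none of this comes from an ESX tool.

```python
# Hypothetical inventory: configured memory per VM, in MB (made-up numbers).
vm_memory_mb = {"web01": 4096, "db01": 8192, "app01": 4096, "test01": 2048}

# Assumed physical RAM on the host, in MB (also made up).
host_physical_mb = 16384


def overcommit_ratio(vm_memory_mb, host_physical_mb):
    """Return total configured VM memory divided by host physical memory."""
    if host_physical_mb <= 0:
        raise ValueError("host_physical_mb must be positive")
    return sum(vm_memory_mb.values()) / host_physical_mb


ratio = overcommit_ratio(vm_memory_mb, host_physical_mb)
print(f"allocated {sum(vm_memory_mb.values())} MB "
      f"on a {host_physical_mb} MB host, ratio {ratio:.3f}")

# A ratio above 1.0 means more memory is allocated than physically exists,
# so ballooning or VM swap can kick in once the VMs actually use it.
if ratio > 1.0:
    print("host is overcommitted")
```

A ratio above 1.0 doesn’t mean you are swapping right now, only that ESX may have to reclaim memory once the VMs really touch what they’ve been given.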
The best way to keep ESX from having to use VM swap is to have enough physical memory (duh!). Failing that, you can ‘reserve’ memory for the more important VMs using memory reservations. If you set a VM’s memory reservation to the amount of memory it has, ESX will never swap it. E.g. if you allocate 4096MB to a VM, set its memory reservation to 4096MB.
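The arithmetic behind reservations is simple enough to sketch. As far as I understand it, only the memory above the reservation can be pushed out to VM swap, so a full reservation leaves nothing to swap. The function below is my own illustration of that assumption, not an ESX API.

```python
def swappable_mb(configured_mb, reservation_mb=0):
    """Memory (in MB) that could still end up in VM swap.

    Assumes only the unreserved part of a VM's memory is eligible
    for host-level swapping.
    """
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reservation_mb


# No reservation: the whole 4096 MB is exposed to VM swap.
print(swappable_mb(4096))

# Full reservation, as in the 4096MB example: nothing left to swap.
print(swappable_mb(4096, 4096))
```

Keep in mind that reserved memory is taken out of the pool for every other VM, so it makes sense to reserve only for the VMs that really can’t afford to stall.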
Update: Apparently Update 5 resolves this problem, but I haven’t tested it yet.