Aug 4 2010

ESXi NIC disconnection issues with the HP NC532i

Brandon Yap

We’re using HP BL495c G6 blades with ESXi 4.0 U1, and the onboard NICs are disconnecting randomly upon reboot. Sometimes it happens after a single reboot; other times it takes four reboots to trigger the issue.

The server uses the HP NC532i embedded NIC which is a rebadged Broadcom 57711E. In ESXi, the NIC uses the bnx2x driver.

Things tried that haven’t worked to date:

  • The Broadcom v1.48 and v1.52 drivers from the VMware website
  • Upgrading to Update 2
  • Turning off auto-negotiation and hard-coding the speed and duplex of the switchports and NICs
  • Replacing the system board, so it’s not a failed NIC

The only way to bring these NICs back to life is to reset the internal switchports on the blade enclosure switches (“shut” and “no shut” the ports).

I have been working with VMware to resolve the issue and they tell me that it’s a known issue and they’ve given me some new Broadcom drivers to try (which haven’t worked yet). I’ve since given the logs back to VMware for analysis.

If you’re seeing this issue on similar hardware, I’d very much like to hear from you. I’ll also update this post as I get updates from VMware.

Update: Found the problem! The issue is with the firmware on the Cisco Catalyst 3020 switches the blade enclosure uses. Before firmware 12.2(50), the switch would do overzealous flap detection on internal switchports, which would put the port into an err-disabled state. The 12.2(50) firmware disables flap detection for internal ports.

If you are not able to upgrade the firmware, use this workaround. It raises the flap detection thresholds so the switch doesn’t shut down the port upon reboot.

errdisable flap-setting cause link-flap max-flaps 10 time 10


Aug 4 2010

How to install vib files in ESXi

Brandon Yap

At some stage while in contact with VMware support, they may give you drivers to try. These drivers will be in VIB (vSphere Installation Bundle) format and will look something like vmware-esx-drivers-net-bnx2x-400.1.52.12.v40.4-1.0.4.00000.x86_64.vib. I don’t know why this is not documented anywhere in the official docs, but here’s how to install these driver bundles into ESXi.

  1. scp the file to the ESXi server.
  2. Run “esxupdate -b <filename>.vib --nosigcheck --nodeps update”.
  3. Run “esxupdate query --vib-view” to confirm that the driver bundle is installed. You may need to reboot for the driver to take effect.

If you are fiddling with different driver revisions, you may need to remove a later version before you can roll back to a previous one, otherwise ESXi won’t let you reinstall an older bundle. To remove a driver bundle use the following command:

esxupdate -b <driver bundle name> remove

You can get the driver bundle name by running “esxupdate query --vib-view”.
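
As a rough sketch, the install and rollback steps above could be wrapped in a small POSIX shell helper. The function names (run, install_vib, remove_vib) and the DRY_RUN variable are hypothetical and made up for illustration; only the esxupdate invocations come from the steps above, written with double hyphens on the assumption that this is what esxupdate expects. By default the helper only prints the commands, so you can check them before running anything on a host.

```shell
#!/bin/sh
# Hypothetical helper around the esxupdate commands described above.
# With DRY_RUN=1 (the default) commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install a VIB that has already been copied to the host, then list bundles.
install_vib() {
    run esxupdate -b "$1" --nosigcheck --nodeps update
    run esxupdate query --vib-view
}

# Remove a bundle by the name shown in the --vib-view listing.
remove_vib() {
    run esxupdate -b "$1" remove
}
```

For example, "install_vib /tmp/driver.vib" prints the two commands it would run. Setting DRY_RUN=0 on the ESXi host itself would execute them instead.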


Jul 28 2010

High performance Cacti for large installations

Brandon Yap

Cacti works fine out of the box for dozens of servers, but when you start to hit the hundreds, with tens of thousands of datasources, you will start to run into bottlenecks, namely poll cycles exceeding 5 minutes and graphs with gaps in them. Sound familiar? Then read on.

Strangely enough, Cacti isn’t optimised for large infrastructures using the default install. You need to make a few tweaks in order to make it perform optimally with hundreds of hosts and thousands of datasources. There are three things to do: install the Cacti Plugin Architecture, install Boost, and tune the MySQL tables which Cacti uses. I’ll go into more detail below.

Cacti Plugin Architecture

The Cacti Plugin Architecture was designed to allow Cacti to be infinitely extended, capable of doing just about anything. Eventually Cacti will come with this baked in but for now it’s a separate install. The installation instructions are sufficient so there’s no point recapping them here.

http://cactiusers.org/wiki/PluginArchitectureInstall

Boost

Now that we have the Cacti Plugin Architecture set up, it’s time to install Boost. Boost labels itself as a large site performance booster for Cacti, written by the people responsible for Cacti themselves. Boost removes the per-polling-cycle RRD updates that Cacti does out of the box and replaces them with an on-demand system where all RRD data is held in RAM and synced to disk at specified times. Note that Boost requires significantly more memory, so be prepared.

http://docs.cacti.net/plugin:boost

Database Tuning

The default Cacti schema lacks indexes. Creating indexes improves performance dramatically. You’ll find the SQL you need to run against the schema at the link below.

http://bugs.cacti.net/view.php?id=1333

Result

We implemented these changes and are now able to poll 300 hosts with a total of 11500+ RRDs in under a minute.


Jul 9 2010

iPhone and iOS 4’s persistent wifi

Brandon Yap

There’s not a lot of information surrounding the new persistent wifi feature in iOS 4 and how it affects iPhones. I did some experimentation and managed to find something which I’d like to share with those who are seeking the same information.

First of all, on iPod Touches persistent wifi works as advertised and stays on whether or not the device is plugged into the charger. This is so it can maintain a constant data connection for Push Notifications and mail checking.

With iPhones the situation is a little different and conditional. Under normal circumstances, where 3G is available and cellular data is turned on, persistent wifi will NOT be enabled. Only when 3G becomes unavailable or cellular data is turned off will persistent wifi be activated. The reasoning is that with the 3G radio already enabled and an internet connection present, why waste battery by enabling wifi as well? If you want wifi to stay on while 3G and cellular data are also enabled, you’ll need to plug the iPhone into a charger.

So to summarise, persistent wifi (as defined by wifi that stays on even after sleeping for 15 mins) will only be enabled on an iPhone under any of the following conditions:

  • 3G is unavailable
  • Cellular data is turned off
  • iPhone is plugged into charger

Note: The ability to disable cellular data separately from 3G was introduced in iOS 4 and is not present in previous iPhone OS revisions.
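
The rule above can be written down as a tiny shell function. This is only a sketch of my reading of the behaviour, where any one of the three conditions is enough; the function and argument names are made up for illustration and it models the iPhone case only.

```shell
#!/bin/sh
# Illustrative only: each argument is "yes" or "no".
#   $1 = is 3G available?
#   $2 = is cellular data turned on?
#   $3 = is the iPhone plugged into a charger?
persistent_wifi() {
    three_g_available=$1
    cellular_data_on=$2
    on_charger=$3
    if [ "$three_g_available" = "no" ] ||
       [ "$cellular_data_on" = "no" ] ||
       [ "$on_charger" = "yes" ]; then
        echo "on"
    else
        echo "off"
    fi
}
```

Calling "persistent_wifi yes yes no" describes the normal case (3G up, cellular data on, not charging), where wifi is allowed to sleep.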


Jun 9 2010

Recovering from accidental VMFS Datastore deletion

Brandon Yap

When you forcefully remove a Datastore while ESXi servers are still connected to it, all sorts of weird and wonderful things can happen. We recently had a team member accidentally install ESXi onto a LUN, blowing away the Datastore that was on it.

The ESXi servers become unstable. The VMs that were running from the Datastore go into a zombie state: they may respond to pings, but they are not fully there because they are still running from memory while the disks have been removed from under them. The vCenter server exhibits high DB load as the ESXi servers try to update the statuses of VMs which aren’t actually there anymore.

It’s a mess. It took us an entire day to discover what really happened, because the person who did it didn’t even realise he had done it, so we had to check LUN presentation and the rest of it. It had us stumped, thinking the VMFS filesystem had somehow become corrupted, until VMware support jumped onto our servers through WebEx and found the problem.

Here’s how you clean up the mess.

  1. Turn off HA and DRS at the cluster level because they get in the way.
  2. Remove any greyed out VMs from vCenter Server.
  3. Log into each ESXi server individually and remove the “Unknown” greyed out VMs from Inventory.
  4. SSH into each server, one at a time. Run /sbin/services.sh stop, cd to /opt/vmware/vmware/uninstallers and uninstall both aam and vpxa, then run /sbin/services.sh start. Add the server back into the cluster/VC; doing this will resync the ESX server’s inventory with vCenter Server.
  5. Reboot each ESX server in turn to remove all zombie processes/VMs which will still be running in the background.
  6. Now we need to clean up references to the old Datastore. Go into Inventory -> Datastores. If you still see the old datastore there, click on it, then click on the Virtual Machines tab, and remove everything listed there from Inventory. Since none of these VMs exist anymore, it’s a safe operation. The datastore should disappear after that.

It’s not a pretty process. If you can, disconnect any fibre cables before the install, but in the case of blade servers, just be very very careful.


Nov 26 2009

iPhone/iPod EQ distortion and clipping

Brandon Yap

All iPhones and iPods these days have a feature called Sound Check. It scans all your music files and sets the playback volume so that they are all roughly the same level. iPhone/iPods also have an equaliser (EQ) function which lets you modify the sound of your music by boosting the bass, treble, and so forth.

Both these features work well by themselves, but when combined with today’s music they can cause issues with how your music sounds. Why?

A lot of music nowadays is mastered to be as loud as possible, reducing the dynamic range and possibly introducing some clipping in the process (the loudness wars, google it). Now while that itself is alright (well not really, but what can we do about it…), if you use the Sound Check feature and you have other quiet music in your collection, it will bump up the playback volume even more. Combine that with an EQ such as Bass Booster and your music will clip quite noticeably.

A good example of this is Metallica’s Death Magnetic. It was already mastered quite loudly and clipping is noticeably present. Turn on Sound Check and Bass Booster and it will clip badly.

So… if you use Sound Check and notice distortion/clipping in your music, turn it off. It may not be your music.


Oct 26 2009

VM disconnects after vMotion

Brandon Yap

We’ve been having this weird problem for a while: if a VM had been sitting idle for a day or two and you tried to vMotion it, it would disconnect from the network and be unavailable for a few minutes. But after you rebooted it, it worked fine. Networking checked out. Storage checked out. What could be wrong??

Turned out the culprit was memory overcommitment on our ESX servers. Who would have thought! We had more memory allocated to VMs than there was physical memory, even though the green memory indicator said otherwise.

A bit of background as to why this happens. ESX uses a number of techniques to conserve memory which allow you to use physical memory efficiently. When memory is starting to get scarce, ESX will engage a “balloon driver” in idle VMs with VMware Tools installed. This driver will inflate, causing apps in memory to write out to the operating system swap. ESX will then reclaim this memory for use in other VMs.

If memory is so tight that it can’t make the OS swap and it can’t grab memory from elsewhere, as a last resort it will start to use VM Swap (different from guest OS swap). When ESX starts a VM, it writes a swap file the same size as the memory allocated to it (less any memory reservation). ESX will then swap the more idle VMs to this swap file without the guest knowing about it. This has a major impact on performance and… you guessed it, vMotion. Your VM will be unavailable while ESX swaps everything back into memory. You can use the performance graphs to determine how badly your VM will be affected if it has been swapped. I’ve seen a few megs worth of swap not affect a VM. But a few hundred megs will.

The best way to make ESX avoid having to use VM swap is to have enough physical memory (duh!). Failing that, you can ‘reserve’ memory for the more important VMs using memory reservations. If you set a VM’s memory reservation to the amount of memory it has, ESX will never swap it. For example, allocate 4096MB to a VM and set its memory reservation to 4096MB.
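
To make the arithmetic concrete, here is a minimal shell sketch. The function names and all the numbers are made up for illustration, and it ignores per-VM overhead memory, so treat it as a back-of-the-envelope check rather than what ESX actually computes.

```shell
#!/bin/sh
# overcommit_mb PHYSICAL_MB VM_MB...
# Prints total memory allocated to VMs minus physical memory.
# A positive result means the host is overcommitted by that many MB.
overcommit_mb() {
    physical=$1
    shift
    total=0
    for vm in "$@"; do
        total=$((total + vm))
    done
    echo $((total - physical))
}

# unreserved_mb ALLOCATED_MB RESERVATION_MB
# Prints how much of a VM's memory is not covered by its reservation,
# i.e. the part that could still end up in VM swap.
unreserved_mb() {
    echo $(( $1 - $2 ))
}
```

For a hypothetical host with 16384MB of physical memory running VMs of 4096, 4096, 8192 and 2048MB, "overcommit_mb 16384 4096 4096 8192 2048" reports 2048MB of overcommitment. A 4096MB VM with a 4096MB reservation has "unreserved_mb 4096 4096" equal to 0, which matches the point above that a fully reserved VM is never swapped.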

Update: Apparently Update 5 resolves this problem but I haven’t tested it yet.


Oct 26 2009

Suspending Windows 7 in VMware Fusion

Brandon Yap

When running Windows 7 as a virtual machine under VMware Fusion, you have the option of Sleeping or Hibernating from within Windows, or suspending the VM from Fusion.

All the methods ultimately serve the same purpose but achieve it differently.

Sleeping and Suspending were very quick and I suspect they invoke the same function, as in VMware Tools calls the Sleep function from within the OS. Contents in memory are held there and the VM is frozen in its current state. Hibernation on the other hand dumps everything from memory into a file within Windows, then halts the system; upon waking, it needs to read everything from this file back into memory. This is what makes Hibernating take a lot longer to initiate and recover from.

In my experience I’ve found that Sleeping or Suspending Windows 7 is quicker.


Oct 26 2009

An iPhone users guide to the Telstra NextG network

Brandon Yap

The Telstra NextG network is undoubtedly the largest and highest performing cellular mobile network in Australia, which makes it a perfect pair with your iPhone.

This article aims to give you a thorough rundown of things you need to know to make use of this network. After all, you are paying an arm and a leg for it!

Network Selection

By default the Carrier selection will be set to Automatic. There are 4 networks to choose from: Telstra Mobile, 3TELSTRA, YES OPTUS, and vodafone AU. 3TELSTRA is the joint 3 and Telstra network operating on the 2100 MHz spectrum. Telstra Mobile is the NextG branded network on 850 MHz. This is the one you want. It has better penetration, better coverage, and faster data speeds. In certain situations the iPhone will lock onto 3TELSTRA and not go back to Telstra Mobile. If this happens to you regularly, then choose Telstra Mobile as the carrier instead of leaving it on Automatic.

Tethering

Tethering is a feature of the iPhone which lets you use your device as a modem. Telstra enabled iPhone tethering on 5th December 2009 via a carrier update. You can tether via either USB or Bluetooth. To enable, go to Settings -> General -> Network -> Internet Tethering.

Free Content

The apps that are free to browse are all the Telstra-owned ones:

Yellow Pages, White Pages, Whereis, Sensis.

Keeping tabs on your data usage

Whether prepaid or post-paid, you will be able to view your data usage online or through a number of apps available in the online store. “Quota” and “Consume” are two outstanding apps. You can also view your data usage through the BigPond Mobile for iPhone portal. You’ll find the link in Safari bookmarks.

APN

The default APN for an iPhone on the Telstra network is “telstra.iph”. This allows you to access the BigPond Mobile for iPhone portal. The portal contains links to various other BigPond areas such as Yellow Pages, White Pages, and most importantly My Account. This APN implements the use of a proxy. Some applications do not work well with this APN and are known to crash and behave oddly. If this happens, then switch your APN to “telstra.internet”. This APN does not use a proxy and works fine with all applications, but you will lose access to the BigPond portal.

Update: On 11/02/2010, Telstra released the 5.2 carrier update which disabled the ability to change APNs.

Application usage notes

Google Reader and RSS feeds in general do not seem to work properly when used with the telstra.iph access point. Users on Whirlpool and Google themselves have stated that Telstra does extra compression on the telstra.iph APN which messes with the RSS content. Switching to telstra.internet alleviates the problem, as does changing the Google Reader URL to use https.

Update: The Google Reader issue seems to have been resolved.

Editor notes

This article will be updated as I discover more information. If you come across this article and would like to contribute, please let me know and I’ll add any information you might like to share.


Oct 26 2009

How to quickly drain your Macs battery

Brandon Yap

Firstly, why would you do this? The simple answer is that when calibrating your battery, the last thing you want to do is sit around waiting for your battery to drain normally. This procedure accelerates the process.

There are a number of ways to drain your Mac’s battery, such as playing a game or watching a DVD. But I have found a quicker way to do it which doesn’t require you to be at your computer.

Just open up one terminal per processor core and run this in it:

yes > /dev/null

Don’t run it without the /dev/null, otherwise it will consume lots of memory and eventually lock up your system. Now if you open another Terminal and run “top”, you should see 0% idle. Let it run this way for a while and your battery will drain very quickly. Once it reaches around 5%, I would suggest stopping all the yes processes and letting it drain by itself. Otherwise when you power up your system these processes will start running again.
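
If you’d rather not open one Terminal per core by hand, the same idea can be sketched as a pair of shell functions. The names (core_count, start_burn, stop_burn, BURN_PIDS) are hypothetical, and the core count lookup assumes either getconf _NPROCESSORS_ONLN or sysctl -n hw.ncpu works on your system.

```shell
#!/bin/sh
# Hypothetical helpers: start one "yes > /dev/null" per core, stop them later.

# Number of online processor cores (getconf first, sysctl as a fallback).
core_count() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || sysctl -n hw.ncpu
}

BURN_PIDS=""

# start_burn [N]: start N background yes processes (default: one per core)
# and remember their PIDs in BURN_PIDS.
start_burn() {
    n=${1:-$(core_count)}
    i=0
    while [ "$i" -lt "$n" ]; do
        yes > /dev/null &
        BURN_PIDS="$BURN_PIDS $!"
        i=$((i + 1))
    done
}

# stop_burn: kill every process started by start_burn.
stop_burn() {
    if [ -n "$BURN_PIDS" ]; then
        kill $BURN_PIDS 2>/dev/null
    fi
    BURN_PIDS=""
}
```

Source the file, run start_burn, and call stop_burn from the same shell once the battery is down to around 5%.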