VMware – HA issues

Most of the time your HA issues are going to be DNS related, so make sure that your vCenter server can ping all your hosts by FQDN without issue. In some cases, though, a stubborn server may not want to play the game even when everything is configured properly.
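A quick way to run that ping check against every host in one go is a small loop like the one below. This is only a sketch: it assumes a Unix-like shell on a machine that uses the same DNS servers as vCenter, the host names in the usage line are placeholders, and `PING_CMD` is a made-up variable that lets you swap in a different ping command.

```shell
# Ping each FQDN once and report which ones fail.
# PING_CMD is a hypothetical override; the default assumes a Unix-style ping.
check_fqdns() {
    ping_cmd=${PING_CMD:-ping -c 1}
    failed=0
    for host in "$@"; do
        if $ping_cmd "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    # Non-zero exit status if any host could not be reached by name.
    return "$failed"
}
```

Call it with your own names, for example `check_fqdns esx01.example.local esx02.example.local`. A FAIL can mean either a DNS problem or a host that does not answer ping, so check name resolution first.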

The method below is considered a “last effort”, as you’ll need to run some CLI commands on the ESX box, but I have found it useful in a few situations.

This page has a great write-up on which files HA uses and how to temporarily stop the HA service: http://itknowledgeexchange.techtarget.com/virtualization-pro/vmware-ha-failure-got-you-down/

Remember: to get to the command line on ESXi, go to the console, press Alt-F1, type “unsupported” (note: you cannot see what you are typing), then enter the root password.

The main bits are as follows:

Stop the HA service

service vmware-aam stop

Check that the HA processes have stopped (if any are still listed, use the kill command to kill them)

ps ax | grep aam | grep -v grep

Move the current HA config files to a backup directory (before restarting HA)

cd /etc/opt/vmware/aam

mkdir .old

mv * .old

mv .[a-z]* .old
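One thing to watch with the commands above: the `.[a-z]*` pattern also matches `.old` itself, so the last `mv` will complain about moving the directory into itself (the other files still get moved). Below is a minimal sketch of the same backup step that skips the backup directory. The function name is made up for illustration, and it assumes a POSIX-style shell is available on the host.

```shell
# Move everything in a config directory (dotfiles included) into DIR/.old.
backup_aam_config() {
    dir=$1
    backup="$dir/.old"
    mkdir -p "$backup" || return 1
    for entry in "$dir"/* "$dir"/.[!.]* "$dir"/..?*; do
        # Skip glob patterns that matched nothing.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        # Never try to move the backup directory into itself.
        [ "$entry" = "$backup" ] && continue
        mv "$entry" "$backup"/ || return 1
    done
}
```

On the host this would be run as `backup_aam_config /etc/opt/vmware/aam` after the HA service has been stopped. If `.old` already holds files from an earlier attempt, files with the same name are overwritten.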

Then go back to vCenter and select Reconfigure for VMware HA on the affected host. Fingers crossed that it starts up and reconfigures without any issues.

VMware – issues stopping / starting a virtual machine

I’ve had this issue in vSphere where a machine appears to be powered on (both to vCenter and ESXi) but is not actually running.

When I try to power off the virtual machine I get “…cannot be performed in the current state (powered on)”, which is somewhat strange.

So I have resorted to the CLI to check the machine’s status and then force it off.
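For reference, the check-then-force-off sequence looks roughly like this with `vmware-cmd`. Treat it as a sketch under assumptions: the function name and the `VMWARE_CMD` override are made up for illustration, the `.vmx` path in the usage line is a placeholder, and the match on “= on” assumes that `getstate` prints a line of the form `getstate() = on`.

```shell
# Print the power state of a VM and, if it reports "on", try a hard stop.
# VMWARE_CMD is a hypothetical override; the default is the vmware-cmd tool.
force_off_if_on() {
    vmx=$1
    cmd=${VMWARE_CMD:-vmware-cmd}
    state=$($cmd "$vmx" getstate) || return 1
    echo "$vmx: $state"
    case $state in
        *"= on"*) $cmd "$vmx" stop hard ;;
        *)        echo "not reported as on, nothing to do" ;;
    esac
}
```

Usage would be something like `force_off_if_on /vmfs/volumes/datastore1/myvm/myvm.vmx`; `vmware-cmd -l` lists the config file paths of the registered machines.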

This page at VMware explains the various methods (they get more extreme as you progress through them): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1014165

Unfortunately for me none of the above methods worked. Both vCenter and the host directly think that the VM is running, though there are no processes on the box and I cannot run the “stop hard” command via the CLI. I get the same error back, “The attempted operation cannot be performed in the current state”, even though get state = on. Hmmm.

Updated : 7/10/2009

Looks like the problem was related to one of our 3.5 ESXi boxes. In our cluster we had a few servers that were upgraded to vSphere while some were left at 3.5. This seemed to work without issue for a while, but eventually DRS didn’t seem to work between them. Upgrading all our hosts to vSphere has fixed the problem, though it has introduced us to the bugs in vSphere.

Updated : 27/01/2010

I’ve had the opposite problem with vSphere: machines that will not delete because they are not in the correct state (Powered Off). Unfortunately the only fix I have at the moment is to reset the host that the virtual machines are currently running on.

 

vSphere – ctrl-alt-del greyed out

This bug has hit me. Looks like users with roles like vm user / power user cannot send “ctrl-alt-del” via the console even though they have the correct permissions. Our users cannot use ctrl-alt-ins as they are connected via RDP to a machine that has the console installed.

Found this : http://communities.vmware.com/thread/220683;jsessionid=480C8A2C9B9EACA9FF2BB4E1BECA2D53?start=15&tstart=0

Looks like it’s a known bug and will be fixed in the upcoming VC4.0 update 1 sometime Q3 2009 :(

Luckily vSphere was set up in our pre-production environment; the machines I have running in production are still 3.5 with VC2.5.

VirtualBox – crashing / freezing

I’ve had some problems since my upgrade to VirtualBox 2.2.0 on OpenSolaris. After some time all of my Linux boxes seem to just die: the virtual machine simply stops responding. Strangely, there was no problem with my Windows VMs after the update.

From what I can tell, the upgrade turned off “IO APIC”, and this is the bit that seemed to cause the problem. Re-enabling it on all of my Linux boxes seems to have fixed the problem. I’ll continue testing for another week and update this post if any problems recur.
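The setting can be flipped in the GUI, but with several Linux VMs a loop over `VBoxManage modifyvm` is quicker. The sketch below makes a few assumptions: the VM names in the usage line are placeholders, `VBOXMANAGE` is a made-up override variable, and the option is spelled `--ioapic` (older releases may expect a single-dash `-ioapic`, so check `VBoxManage modifyvm` on your version). The VMs need to be powered off before their settings can be changed.

```shell
# Turn the IO APIC on for each named VM.
# VBOXMANAGE is a hypothetical override; the default is the VBoxManage tool.
enable_ioapic() {
    vbox=${VBOXMANAGE:-VBoxManage}
    for vm in "$@"; do
        # Report failures but carry on with the remaining VMs.
        $vbox modifyvm "$vm" --ioapic on || echo "could not update $vm" >&2
    done
}
```

For example: `enable_ioapic debian-test centos-build`.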

Updated : 01/09/2009

Here is a bit more on IO APIC from the VirtualBox wiki (from a Windows perspective):
http://www.virtualbox.org/wiki/Migrate_Windows

The hardware dependent portion of the Windows kernel is dubbed “Hardware Abstraction Layer” (HAL). While hardware vendor specific HALs have become very rare, there are still a number of HALs shipped by Microsoft. Here are the most common HALs (for more information, refer to this article: http://support.microsoft.com/kb/309283):

Hal.dll (Standard PC)
Halacpi.dll (ACPI HAL)
Halaacpi.dll (ACPI HAL with IO APIC)

If you perform a Windows installation with default settings in VirtualBox, Halacpi.dll will be chosen as VirtualBox enables ACPI by default but disables the IO APIC by default. A standard installation on a modern physical PC or VMware will usually result in Halaacpi.dll being chosen as most systems nowadays have an IO APIC and VMware chose to virtualize it by default (VirtualBox disables the IO APIC because it is more expensive to virtualize than a standard PIC). So as a first step, you either have to enable IO APIC support in VirtualBox or replace the HAL. Replacing the HAL can be done by booting the VM from the Windows CD and performing a repair installation.

Updated : 5/09/2009

I’ve had even more problems with OpenSolaris crashing completely after upgrading to the newer versions of VirtualBox (3.0.4), and have since reverted to 2.2.0, which has fixed a lot of the hanging issues I encountered.

ESX – network utilization

One of the best articles I have found on this subject is here : http://blog.scottlowe.org/2008/07/16/understanding-nic-utilization-in-vmware-esx/

There is some additional information here on setting up an EtherChannel on the Cisco side : http://blog.scottlowe.org/2006/12/04/esx-server-nic-teaming-and-vlan-trunking/

This can be handy if you need a single VM to use both physical NICs in a load-balanced manner, both outbound and inbound. Of course it’s not really that simple. It only adds a benefit if the VM is communicating with multiple destinations: with IP hash, traffic from a single VM with one IP to a single destination will always be limited to the same physical NIC.

switch(config)#int port-channel 1
switch(config-if)#description NIC team for ESX server
switch(config-if)#int gi0/1
switch(config-if)#channel-group 1 mode on
switch(config-if)#int gi0/2
switch(config-if)#channel-group 1 mode on

As per the article, make sure the switch uses the same EtherChannel load-balancing method as the ESX side. The first command shows the switch’s current load-balance method; the second changes it to IP hash.

show etherchannel load-balance
port-channel load-balance src-dst-ip
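To see why a single VM talking to a single destination always lands on the same physical NIC under IP hash, here is a simplified model of the selection. It assumes the hash is the XOR of the source and destination addresses modulo the number of uplinks, which is how it is commonly described; the exact calculation ESX performs is an assumption here, and the function names are made up for illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Simplified IP-hash model: (source XOR destination) mod number of uplinks.
# Prints the zero-based index of the uplink the pair would use.
ip_hash_uplink() {
    src=$(ip_to_int "$1")
    dst=$(ip_to_int "$2")
    echo $(( ($src ^ $dst) % $3 ))
}
```

With two uplinks, `ip_hash_uplink 10.0.0.1 10.0.0.2 2` prints 1 and `ip_hash_uplink 10.0.0.1 10.0.0.3 2` prints 0. The same source and destination pair gives the same answer every time, so one conversation never spreads across both NICs; only traffic to different destinations can.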

Another solution is to use multiple iSCSI paths. This is newly supported within vSphere; see this post on setting up multiple paths : http://goingvirtual.wordpress.com/2009/07/17/vsphere-4-0-with-software-iscsi-and-2-paths/

Here is another good article on iSCSI within vSphere : http://www.delltechcenter.com/page/A+“Multivendor+Post”+on+using+iSCSI+with+VMware+vSphere

Some important points on using EMC Clariion with vSphere : http://virtualgeek.typepad.com/virtual_geek/2009/08/important-note-for-all-emc-clariion-customers-using-iscsi-and-vsphere.html