OpenSolaris – ZFS recovery after kernel panic

Recently I hit what I thought was a huge disaster with my ZFS array: I was unable to import my zpool without causing the kernel to panic and reboot. I'm still unsure of the exact reason, but it didn't seem to be a hardware fault (zpool import showed all disks as ONLINE).

When I tried to import with zpool import -f tank the machine would lock up, panic and reboot.

The key line from the kernel panic:

> genunix: [ID 361072 kern.notice] zfs: freeing free segment (offset=3540185931776 size=22528)

Nothing I could do would fix it. I tried both of these options in the system file (/etc/system) with no success:

set zfs:zfs_recover=1
set aok=1
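
(For what it's worth, these normally need a reboot to take effect when placed in /etc/system. The same tunables can apparently also be poked into the running kernel with mdb – that's my assumption rather than something from the original fix, so verify the symbol names before relying on it.)

# set the tunables in the live kernel – takes effect immediately, lost at reboot
echo "zfs_recover/W 1" | mdb -kw
echo "aok/W 1" | mdb -kw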

After a quick email from a Sun engineer (kudos to Victor), here's the zdb command line that fixed it:

zdb -e -bcsvL <poolname>

zdb is a read-only diagnostic tool, but it seemed to read through the sectors that had the corrupt data and fix things (I'm not sure how a read-only tool does that). The run took well over 15 hours.

Updated: 20/10/2009

Apparently, if you have set zfs:zfs_recover=1 in your system file, zdb operates in a different manner and fixes the issues it encounters.

Remember to run a zpool scrub <poolname> if you are lucky enough to get it back online.
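
Pulling the above together, the recovery roughly boils down to this (a recap sketch rather than an exact transcript of what I ran; the pool name is the example from above):

# with zfs:zfs_recover=1 (and aok=1) in place, walk the pool with zdb
# this is read-intensive and can take many hours
zdb -e -bcsvL tank

# then try the import again and give it a scrub
zpool import -f tank
zpool scrub tank
zpool status -v tank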

This thread has some additional info…

http://opensolaris.org/jive/message.jspa?messageID=479553

Update 31/05/2012

This command has also helped me when I can't import a pool in read/write mode:

zpool import -F -f -o readonly=on -R /mnt/temp zpool2
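
With the pool imported read-only you can't snapshot it (snapshots are writes), so getting the data off is a file-level copy job. A rough sketch, assuming rsync is installed and /backup is somewhere safe (both are placeholders):

# pool is mounted read-only under /mnt/temp from the import above
rsync -a /mnt/temp/ /backup/zpool2/
# cp -rp /mnt/temp/* /backup/zpool2/ also works if rsync isn't around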

OpenSolaris – white screen on logon

I had this problem when I enabled 3D effects on my server. The screen just goes white and you cannot see anything. Even after a reboot, the screen goes white as soon as you log on to GNOME.

To fix it you'll need to log on using a "Failsafe Terminal" session, then run gnome-cleanup from the command line. This will blow away your GNOME settings (although they are backed up to a file first), but it means you'll be able to log on again without issue.
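
As far as I can tell, gnome-cleanup essentially archives the GNOME configuration directories in your home directory and then removes them. If it isn't on the box for some reason, a hand-rolled equivalent would look something like this (the directory names are my assumption – double-check before deleting anything):

# back the settings up first, then move them out of the way
cd ~
tar cf gnome-settings-backup.tar .gnome2 .gnome2_private .gconf .gconfd
rm -rf .gnome2 .gnome2_private .gconf .gconfd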

VMware – HA issues

Most of the time your HA issues are going to be DNS-related, so ensure that your vCenter server can ping all your hosts by FQDN without issue. In some cases, though, a stubborn host may not want to play the game even when everything is configured properly.
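
A quick sanity check before digging any deeper (hostnames here are examples only):

# from the vCenter server, check forward lookup and reachability for every host
nslookup esx01.example.local
ping esx01.example.local

# and from each ESX host, make sure vCenter and the other hosts resolve as well
ping vcenter.example.local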

This method is considered a "last resort" as you'll need to run some CLI commands on the ESX host, but I have found it useful in a few situations.

This page has a great write-up on which files HA uses and how to temporarily stop the HA service: http://itknowledgeexchange.techtarget.com/virtualization-pro/vmware-ha-failure-got-you-down/

Remember, to get to the console on ESXi press Alt-F1, type "unsupported" (note: you cannot see what you are typing), then enter the root password.

The main steps are as follows:

Stop the HA service

service vmware-aam stop

Check that the HA processes have stopped (if not, kill them manually – see the consolidated sketch after these steps)

ps ax | grep aam | grep -v grep

Move the current HA config files to a backup directory (before restarting HA)

cd /etc/opt/vmware/aam

mkdir .old

mv * .old

mv .[a-z]* .old

Then go back to your vCenter and select Reconfigure for VMware HA on the affected host. Fingers crossed that it starts up and reconfigures without any issues.
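
For convenience, here are those steps rolled into one block (my consolidated sketch of the above, run on the affected host):

# stop HA and make sure nothing is left running
service vmware-aam stop
ps ax | grep aam | grep -v grep
# if the grep still shows processes, kill them by PID, e.g. kill -9 <pid>

# move the current HA config out of the way
cd /etc/opt/vmware/aam
mkdir .old
mv * .old
mv .[a-z]* .old
# (mv will complain about moving .old into itself – that one is harmless)

# then select Reconfigure for VMware HA on the host in vCenter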

VMware – issues stopping/starting a virtual machine

I’ve had this issue in vSphere where a machine appears to be powered on (both to vCenter and ESXi) but is not actually running.

When trying to power off the virtual machine I get "…cannot be performed in the current state (powered on)", which is somewhat strange.

So I have resorted to the CLI to check the machine's status and then force it off.

This page at vmware explains the various methods (as you progress through them they get more extreme): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1014165
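
For reference, the ESXi side of that KB uses vim-cmd from the unsupported console. Something along these lines (the VM id is whatever getallvms reports for the stuck machine):

# list VMs and note the Vmid of the stuck machine
vim-cmd vmsvc/getallvms
# check what the host thinks the power state is
vim-cmd vmsvc/power.getstate <vmid>
# force it off
vim-cmd vmsvc/power.off <vmid>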

Unfortunately for me none of the above methods worked. Both vCenter and the host itself think that the VM is running, though there are no processes for it on the box and I cannot run the "stop hard" command via the CLI. I get the same error back, "The attempted operation cannot be performed in the current state", even though the reported state is powered on. Hmmm.

Updated : 7/10/2009

Looks like the problem was related to one of our ESXi 3.5 boxes. In our cluster we had a few servers that were upgraded to vSphere while some were left at 3.5. It seemed to work without issue for a while, but DRS eventually stopped working between them. Upgrading all our hosts to vSphere has fixed the problem – though it introduced us to the bugs in vSphere.

Updated : 27/01/2010

I've had the opposite problem with vSphere: machines that will not delete because they are not in the correct state (powered off). Unfortunately the only fix I have at the moment is to reset the host that the virtual machines are currently running on.

 

vSphere – ctrl-alt-del greyed out

This bug has hit me. Looks like users with roles like VM User / Power User cannot send Ctrl-Alt-Del via the console even though they have the correct permissions. Our users cannot use Ctrl-Alt-Ins either, as they are connected via RDP to a machine that has the console installed.

Found this : http://communities.vmware.com/thread/220683;jsessionid=480C8A2C9B9EACA9FF2BB4E1BECA2D53?start=15&tstart=0

Looks like it's a known bug and will be fixed in the upcoming VC 4.0 Update 1, sometime in Q3 2009 :(

Luckily vSphere was only set up in our pre-production environment – the machines I have running in production are still on 3.5 with VC 2.5.