HP Gen 8 servers and networking issues – TG3 driver

There is a bug in the tg3 driver on the ESXi hosts (1 Gbit Broadcom cards in the new hosts). If the network card is put under load while NetQueue is enabled, it will sometimes drop all traffic. I’ve disabled NetQueue and the problems have gone away, as per this VMware KB article:

http://kb.vmware.com/kb/2035701

The issues will present themselves as log entries like so:

2012-11-19T18:58:52.137Z cpu17:4155)<6>tg3 : vmnic8: RX NetQ allocated on 1
2012-11-19T18:58:52.138Z cpu17:4155)<6>tg3 : vmnic8: NetQ set RX Filter: 1 [00:50:56:71:46:87 0]
2012-11-19T18:58:52.138Z cpu17:4155)<6>tg3 : vmnic7: RX NetQ allocated on 1
2012-11-19T18:58:52.138Z cpu17:4155)<6>tg3 : vmnic7: NetQ set RX Filter: 1 [00:50:56:71:46:87 0]
2012-11-19T18:59:12.139Z cpu21:4155)<6>tg3 : vmnic4: NetQ remove RX filter: 1
2012-11-19T18:59:12.139Z cpu21:4155)<6>tg3 : vmnic4: Free NetQ RX Queue: 1
2012-11-19T18:59:22.137Z cpu24:4155)<6>tg3 : vmnic4: RX NetQ allocated on 1
2012-11-19T18:59:22.138Z cpu24:4155)<6>tg3 : vmnic4: NetQ set RX Filter: 1 [00:50:56:71:46:87 0]
2012-11-19T18:59:42.138Z cpu21:4155)<6>tg3 : vmnic7: NetQ remove RX filter: 1
2012-11-19T18:59:42.138Z cpu21:4155)<6>tg3 : vmnic7: Free NetQ RX Queue: 1
2012-11-19T18:59:42.140Z cpu21:4155)<6>tg3 : vmnic4: NetQ remove RX filter: 1
2012-11-19T18:59:42.140Z cpu21:4155)<6>tg3 : vmnic4: Free NetQ RX Queue: 1
2012-11-19T19:00:02.139Z cpu28:4155)<6>tg3 : vmnic8: NetQ remove RX filter: 1
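
To spot the pattern quickly in a longer log, something like the following might help. It is only a rough sketch: it assumes the log lines look exactly like the ones above, and the function name count_netq_events is my own invention. It counts how often each vmnic allocates and frees its NetQ RX queue, since repeated allocate/free cycles are what the entries above show.

```python
import re
from collections import Counter

# Matches the two tg3 NetQ messages of interest, e.g.
# "...<6>tg3 : vmnic8: RX NetQ allocated on 1"
# "...<6>tg3 : vmnic4: Free NetQ RX Queue: 1"
NETQ_LINE = re.compile(r"tg3 : (vmnic\d+): (RX NetQ allocated|Free NetQ RX Queue)")


def count_netq_events(lines):
    """Return {vmnic: Counter} with 'allocated' and 'freed' counts per NIC."""
    events = {}
    for line in lines:
        match = NETQ_LINE.search(line)
        if not match:
            continue  # filter set/remove lines and anything else are ignored
        nic, message = match.groups()
        kind = "allocated" if message.startswith("RX NetQ") else "freed"
        events.setdefault(nic, Counter())[kind] += 1
    return events


# Three lines copied from the log above.
sample = [
    "2012-11-19T18:59:12.139Z cpu21:4155)<6>tg3 : vmnic4: Free NetQ RX Queue: 1",
    "2012-11-19T18:59:22.137Z cpu24:4155)<6>tg3 : vmnic4: RX NetQ allocated on 1",
    "2012-11-19T18:59:42.140Z cpu21:4155)<6>tg3 : vmnic4: Free NetQ RX Queue: 1",
]
for nic, counts in sorted(count_netq_events(sample).items()):
    print(nic, dict(counts))
```

A NIC that shows many allocate/free pairs in a short window is a candidate for the behaviour described here, though the counts alone don't prove the bug.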

vmware : IDE to SCSI

I’ve found that VMware Converter (this may be fixed in newer versions) creates VMware guests with an IDE controller. There can be performance issues if you stay with this controller, so the best bet is to change it to one of the VMware SCSI controllers.

Which controller you use depends on which operating system the guest is running: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006621

Guest Operating System       Adapter Type
Windows 2003, 2008, Vista    lsilogic
Windows NT, 2000, XP         buslogic
Linux                        lsilogic

http://sanbarrow.com/vmdk/vmx-ide2scsi.html

You can easily change the type of the virtual controller for a given disk.
Let’s have a look at an example.

# Disk DescriptorFile

version=1
CID=fffffffe
parentCID=ffffffff
createType="twoGbMaxExtentFlat"
# Extent description
RW 4193792 FLAT "diskname-f001.vmdk" 0
RW 2097664 FLAT "diskname-f002.vmdk" 0
# The Disk Data Base
#DDB
ddb.adapterType = "ide"
ddb.virtualHWVersion = "3"
ddb.geometry.cylinders = "6241"
ddb.geometry.heads = "16"
ddb.geometry.sectors = "63"

The disk above uses a virtual IDE controller.

ddb.adapterType = "buslogic"   This entry converts the disk into a SCSI disk with a BusLogic controller
ddb.adapterType = "lsilogic"   This entry converts the disk into a SCSI disk with an LSI Logic controller
ddb.adapterType = "ide"        This entry converts the disk into an IDE disk with an Intel IDE controller
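
If you have several descriptors to change, the edit can be scripted. The following is a minimal sketch and not a VMware tool: set_adapter_type is a hypothetical helper of mine, it assumes the descriptor is the small plain-text file (not a monolithic disk with an embedded descriptor) and that the values are wrapped in plain ASCII double quotes. Work on a copy of the file.

```python
import re

# The three adapter types mentioned in this post.
VALID_ADAPTERS = {"ide", "buslogic", "lsilogic"}

# Captures 'ddb.adapterType = ' so only the quoted value is replaced.
ADAPTER_LINE = re.compile(r'^(ddb\.adapterType\s*=\s*)"[^"]*"', re.MULTILINE)


def set_adapter_type(descriptor_text, adapter):
    """Return descriptor_text with ddb.adapterType set to the given adapter."""
    if adapter not in VALID_ADAPTERS:
        raise ValueError(f"unknown adapter type: {adapter!r}")
    new_text, count = ADAPTER_LINE.subn(
        lambda m: f'{m.group(1)}"{adapter}"', descriptor_text
    )
    if count != 1:
        # Refuse to guess if the entry is missing or appears more than once.
        raise ValueError(f"expected one ddb.adapterType entry, found {count}")
    return new_text


example = 'ddb.adapterType = "ide"\nddb.virtualHWVersion = "3"\n'
print(set_adapter_type(example, "lsilogic"))
```

The function only touches the one line and leaves the extent and geometry entries alone, which is the same edit you would make by hand in vi.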

This changes the hard disk, but doesn’t change the controller itself.

ide0.present = "TRUE"
ide1.present = "TRUE"
scsi0.virtualDev = "lsilogic"
scsi0.virtualDev = "buslogic"
scsi1.virtualDev = "lsilogic"
scsi1.virtualDev = "buslogic"

Use entries like this in your *.vmx file (one virtualDev line per controller). By the way, you can have LSI Logic and BusLogic controllers in one VM.

Think twice before you make changes like this with a boot-disk.

Bluescreen 07b – mass-storage driver:
Activate the appropriate driver in the registry: intelide.sys or vmscsi.sys or symmpi.sys – you may have to add files as well.

If you get the above issue on a w2k8 box you might be able to enable the LSI_SAS driver before you convert the machine to a SCSI controller.

  1. Boot machine with IDE controller
  2. Take a snapshot (for failback)
  3. Regedit and find the following key \\HKLM\SYSTEM\ControlSet001\Services\LSI_SAS
  4. Change the “Start” dword from 4 to 0
  5. Shutdown the machine
  6. Remove all the virtual disks (do not delete the disks, just remove them)
  7. Create copies of each .vmdk file (cp) (for failback)
  8. Edit the .vmdk file for each disk (vi)
  9. Change the ddb.adapterType entry to "lsilogic" (if w2k8)
  10. Re-add existing disks (this should also bring in a LSI SAS controller)
  11. Boot the machine

Black screen with cursor blinking in the top left of the screen:
Write a new partition boot-sector.

OpenSolaris – RTL8111/8168B issues

I’ve got an integrated RTL8111 NIC which seemed to work fine under OpenSolaris 2008.11. But if the NIC was put under load for a varying length of time it seemed to just drop off the network.

At first I thought it was my SMB service dying, but after a quick ping I realised I had lost the entire TCP/IP stack on that particular card. Hmm…

It does come back online if you are patient and wait for about 5 minutes or so.

The web shows that there are some known issues with some cards dropping under load. Most places recommend getting a certified PCIe Intel NIC, and your problems will go away. I’m considering this the last possible option, as I don’t particularly want to spend any more money.

The driver that seems to be at fault is the native rge driver. I have found a bug report that “could” be the issue, but it might be specific to the Realtek 8111C. Add the following at the end of the /etc/system file:

set ip:dohwcksum = 0

This setting is short for “do hardware checksum”. From what I have read, setting it to zero moves the checksum calculations from the network card to your CPU. Checksums are still verified, so it doesn’t leave your system with less error checking.

Update: This seemed to initially fix the problem for me, but the issue still occurred again after some time.

This forum thread also pointed to a similar issue… http://opensolaris.org/jive/thread.jspa?threadID=91282

Another solution may be found at the Sun HCL site. From the HCL I have found the home of the driver for the RTL8111/8168B: http://homepage2.nifty.com/mrym3/taiyodo/eng/

This driver is called gani.

This page has a good bit on moving from the rge to the gani driver: http://schlaepfer.nine.ch/twiki/bin/view/Schlaepfer/SelfMadeNas2 The only problem is that building the gani driver doesn’t seem to be straightforward.

1. In /etc/driver_aliases find rge "pci10ec,8168" and replace it with gani "pci10ec,8168"
2. Move /etc/hostname.rge0 to /etc/hostname.gani0
3. Reboot the system
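
Step 1 above can be sketched as a small text rewrite. This is only an illustration, assuming each line of /etc/driver_aliases has the form driver "alias" as in step 1. The function name swap_driver_alias and the e1000g sample line are mine, not from any Solaris tool, and I would run it against a copy of the file rather than the live one.

```python
def swap_driver_alias(aliases_text, alias="pci10ec,8168", old="rge", new="gani"):
    """Rebind one PCI alias from the old driver to the new one.

    Returns (new_text, number_of_lines_changed). Other aliases owned by
    the old driver are left untouched.
    """
    out = []
    changed = 0
    for line in aliases_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == old and parts[1].strip('"') == alias:
            out.append(f'{new} "{alias}"')
            changed += 1
        else:
            out.append(line)
    # Keep the trailing newline only if the input had one.
    trailing = "\n" if aliases_text.endswith("\n") else ""
    return "\n".join(out) + trailing, changed


# Made-up sample content in the same format as the real file.
sample = 'e1000g "pci8086,100e"\nrge "pci10ec,8168"\n'
new_text, changed = swap_driver_alias(sample)
print(new_text, end="")
print("lines changed:", changed)
```

If the function reports zero lines changed, the alias was not bound to rge in the first place and the rest of the steps are unlikely to help.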

Update2: I thought I had this driver working properly, but after a reboot everything stopped. I couldn’t even ping an IP on the same subnet. The driver was still loaded, as I could ping my own IP.

Now I’m looking into the parameters on the rge driver. To get a list of the parameters the device exposes, type:

ndd -get /dev/rge0 \?

Of the parameters that are listed, only the read and write ones can be changed. adv_pause_cap controls whether pause-frame (flow control) support is advertised, and adv_1000fdx_cap controls whether 1000 Mbit full duplex is advertised. If you disable either of these parameters then it is not negotiated with your switch. It’s probably not worth touching these unless you want to run a gigabit card at 100 half duplex or something.

I’m experimenting with disabling adv_asym_pause_cap at the moment to see if that helps. By default this is enabled. It can be disabled with:

ndd -set /dev/rge0 adv_asym_pause_cap 0
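
If you end up trying several parameters, a small wrapper saves some typing. This is just a sketch around the ndd -set command shown above. The helper names are hypothetical, and it assumes ndd is on the PATH and that you are running as root. As far as I know, values set with ndd do not survive a reboot, so the commands would need to be re-run after one.

```python
import subprocess


def ndd_set_command(device, parameter, value):
    """Build the argv for: ndd -set /dev/<device> <parameter> <value>."""
    return ["ndd", "-set", f"/dev/{device}", parameter, str(value)]


def apply_ndd_settings(device, settings):
    """Run one 'ndd -set' per entry; raises CalledProcessError if ndd fails."""
    for parameter, value in settings.items():
        subprocess.run(ndd_set_command(device, parameter, value), check=True)


# Only prints the command, so this line is safe to run on any machine.
print(" ".join(ndd_set_command("rge0", "adv_asym_pause_cap", 0)))
```

On the Solaris box itself, apply_ndd_settings("rge0", {"adv_asym_pause_cap": 0}) would make the same change as the command above.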

Update3: So far so good? The above seems to have removed the issue. I still have the ip:dohwcksum = 0 setting in the /etc/system file. I might try removing that.

Update4: Removing ip:dohwcksum = 0 did not re-create the issue, so I’m leaving it off.

Update5: problem came back (but took much longer to appear). Hmmm…..

Now I’ve got another problem with the rge driver. My CIFS write speed has dropped right back to about 2MB/s. There don’t seem to be any issues with the read speed, which still pulls through about 70MB/s.

It doesn’t seem to be a CPU bottleneck, so again I’m blaming the rge drivers…  The adv_asym_pause_cap parameter did not seem to make any difference to this particular issue – so I’m not blaming that. I’m currently stuck on this one. hmmm…

Looks like I’m going to have to give up on this one and get an Intel PCIe card. I’ll update this post if a new card fixes all the above problems (therefore pointing at the rge driver as the culprit).

Update6: I’ve got the Intel card, and the problems have not re-appeared as of yet. I’ll update if the problem does show itself, but I believe the problem was the rge driver. Hopefully the rge driver is fixed or updated in future releases of OpenSolaris.

Please leave a message if anyone has made any progress with the rge driver. Cheers.