OpenSolaris – RTL8111/8168B issues

I’ve got an integrated RTL8111 nic which seemed to work fine under opensolaris 2008.11. But if the nic was put under load for a various length of time it seemed to just drop off the network.

At first i thought it was my SMB service dying, but after a quick ping i relised i had lost the entire TCP/IP stack on that particular card. Hmm…

It does come back online if you are patient and wait for about 5mins or so.

The web shows that there is some known issues with some cards dropping under load. Most places recommend to get a certified pci-e intel nic and your problems will go away. I’m considering this the last possible option, as i don’t particularly want to spend any more money.

The driver that seems to be at fault is the rge native driver… I have found this bug link that “could” be the issue, but might be specifically for the Realtek 8111C. Add the following at the end of the /etc/system file;

set ip:dohwcksum = 0

This setting is short for “do hardware checksum”. From what i have read setting this to zero moves the checksum calculations from the network card to your cpu (it doesn’t open your system to less error checking etc)

Update: This seemed to initially fix the problem for me, but the issue still occurred again after some time.

This forum thread also pointed to a similar issue…

Another solution may be found at the sun HLC site.  From the HLC i have found the home of the driver for the  RTL8111/8168B.

This driver is called gani Driver link here

This page has a good bit on moving from rge to gani driver…. The only problem is that creating the gani driver doest seem to be straight forward.

1. In /etc/driver_aliases find rge “pci10ec,8168” and exchange it with gani “pci10ec,8168”
2. Move /etc/hostname.rge0 to /etc/hostname.gani0
3. Reboot the system

Update2: I thought i had this driver working properly, but after a reboot everything stopped. I couldn’t even ping an ip on the same subnet. Driver was still loaded as i could ping my own ip.

Now i’m looking into the parameters on the rge driver. To get a list of the variables the device has to modify type…

ndd -get /dev/rge0 \?

of the parameters that are listed only the read and write ones can be changed. adv_pause_cap relates to duplex settings and adv_1000fdx_cap relates to speed. If you disable either of these parameters then they are not negotiated with your switch. Probably not worth touching these ones unless you want to run a gb card at 100 half duplex or something.

I’m experimenting with disabling adv_asym_pause_cap at the moment to see if that helps. By default this is enabled. This can be disabled via..

ndd -set /dev/rge0 adv_asym_pause_cap 0

Update3: so far so good? — the above seems to have removed the issue. I still have the  ip:dohwcksum = 0 setting in the /etc/system file. I might try removing that.

Update4: removing ip:dohwcksum = 0 did not re-create the issue – so leaving it off.

Update5: problem came back (but took much longer to appear). Hmmm…..

Now I’ve got another problem with the rge driver. My CIFS write speed has dropped right back to about 2MB/s. There doesn’t seem to be any issues with the read speed which still pulls through about 70MB/s.

It doesn’t seem to be a CPU bottleneck, so again I’m blaming the rge drivers…  The adv_asym_pause_cap parameter did not seem to make any difference to this particular issue – so I’m not blaming that. I’m currently stuck on this one. hmmm…

Looks like i’m going to have to give up on this one and get a Intel pci-e card. I’ll update this post if a new card fixes all the above problems (therefore pointing at the rge driver as the culprit)

Update6: I’ve got the Intel card, and the problems have not re-appeared as of yet. I’ll update if the problem does show itself, but i believe the problem was the rge driver. Hopefully the rge driver is fixed / updated in future releases of opensolaris.

Please leave a message if anyone has made any progress with the rge driver. Cheer.