OpenSolaris – RTL8111/8168B issues

I’ve got an integrated RTL8111 NIC which seemed to work fine under OpenSolaris 2008.11. But if the NIC was put under load for a while (how long varied), it would just drop off the network.

At first I thought it was my SMB service dying, but after a quick ping I realised I had lost the entire TCP/IP stack on that particular card. Hmm…

It does come back online if you are patient and wait for about 5 minutes or so.

The web shows that there are some known issues with some cards dropping under load. Most places recommend getting a certified PCI-e Intel NIC, after which your problems should go away. I’m treating this as the last resort, as I don’t particularly want to spend any more money.

The driver that seems to be at fault is the native rge driver… I have found this bug link that “could” be the issue, but it might be specific to the Realtek 8111C. Add the following at the end of the /etc/system file:

set ip:dohwcksum = 0

This setting is short for “do hardware checksum”. From what I have read, setting this to zero moves the checksum calculations from the network card to your CPU (so checksums are still verified, just in software).
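
If you want to script that edit, something like the sketch below would do it. The function name add_system_setting is made up for illustration, and it assumes a grep that understands -F and -x (GNU grep or /usr/xpg4/bin/grep). It backs the file up first and only appends the line if it is not already there. /etc/system is only read at boot, so a reboot is needed either way.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a system file exactly once.
# $1 = the exact line to add, $2 = target file (defaults to /etc/system).
add_system_setting() {
    setting=$1
    file=${2:-/etc/system}
    # -F: fixed string, -x: match the whole line
    if grep -F -x "$setting" "$file" >/dev/null 2>&1; then
        echo "already present: $setting"
    else
        # keep a backup copy before touching the file
        cp "$file" "$file.bak" &&
        echo "$setting" >> "$file" &&
        echo "added: $setting"
    fi
}

# Example (as root):
#   add_system_setting 'set ip:dohwcksum = 0'
```

Passing a scratch copy of the file as the second argument is a safe way to try it before pointing it at the real /etc/system.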

Update: This seemed to initially fix the problem for me, but the issue still occurred again after some time.

This forum thread also pointed to a similar issue… http://opensolaris.org/jive/thread.jspa?threadID=91282

Another solution may be found at the Sun HCL site. From the HCL I have found the home of the driver for the RTL8111/8168B: http://homepage2.nifty.com/mrym3/taiyodo/eng/

This driver is called gani.

This page has a good bit on moving from the rge to the gani driver: http://schlaepfer.nine.ch/twiki/bin/view/Schlaepfer/SelfMadeNas2. The only problem is that building the gani driver doesn’t seem to be straightforward.

1. In /etc/driver_aliases find rge "pci10ec,8168" and change it to gani "pci10ec,8168"
2. Move /etc/hostname.rge0 to /etc/hostname.gani0
3. Reboot the system
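
The first two steps can be sketched as a small script. This assumes the gani driver is already built and installed, and that the alias line looks like rge "pci10ec,8168" with plain double quotes. The function name switch_rge_to_gani and its two optional arguments are invented here so it can be tried on copies of the files first.

```shell
#!/bin/sh
# Hypothetical helper: rebind the 8168 alias from rge to gani and
# rename the hostname file. Does NOT reboot.
# $1 = driver_aliases file (default /etc/driver_aliases)
# $2 = directory holding hostname.rge0 (default /etc)
switch_rge_to_gani() {
    aliases=${1:-/etc/driver_aliases}
    etcdir=${2:-/etc}
    # keep a backup, then rewrite only the pci10ec,8168 entry
    cp "$aliases" "$aliases.bak" || return 1
    sed 's/^rge\([[:space:]][[:space:]]*"pci10ec,8168"\)/gani\1/' \
        "$aliases.bak" > "$aliases" || return 1
    # step 2: carry the interface config over to the new name
    if [ -f "$etcdir/hostname.rge0" ]; then
        mv "$etcdir/hostname.rge0" "$etcdir/hostname.gani0"
    fi
}

# Example (as root), followed by a reboot:
#   switch_rge_to_gani
```

Other rge aliases in the file are left alone, and the .bak copy makes it easy to go back to rge if gani misbehaves.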

Update2: I thought I had this driver working properly, but after a reboot everything stopped. I couldn’t even ping an IP on the same subnet. The driver was still loaded, as I could ping my own IP.

Now I’m looking into the parameters on the rge driver. To get a list of the parameters the device exposes, type…

ndd -get /dev/rge0 \?

Of the parameters that are listed, only the ones marked read and write can be changed. adv_pause_cap relates to flow control (pause frames) and adv_1000fdx_cap relates to gigabit full duplex. If you disable either of these parameters then that capability is not advertised to your switch during auto-negotiation. Probably not worth touching these unless you want to run a gigabit card at 100 half duplex or something.

I’m experimenting with disabling adv_asym_pause_cap at the moment to see if that helps. By default this is enabled. This can be disabled via..

ndd -set /dev/rge0 adv_asym_pause_cap 0
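
As far as I know, values set with ndd do not survive a reboot, so the command has to be re-run from a startup script. A rough sketch of a wrapper that only writes the value when it is not already 0 is below. The function name disable_asym_pause and the NDD variable are made up for illustration (NDD just lets you swap in a stand-in command when trying it out); the only ndd calls used are the -get and -set forms shown above.

```shell
#!/bin/sh
# NDD can be overridden for a dry run; defaults to the real ndd.
NDD=${NDD:-ndd}

# Hypothetical helper: turn off adv_asym_pause_cap if it is still on.
# $1 = device node (default /dev/rge0)
disable_asym_pause() {
    dev=${1:-/dev/rge0}
    # read the current value; give up if the device cannot be queried
    current=$("$NDD" -get "$dev" adv_asym_pause_cap) || return 1
    if [ "$current" != "0" ]; then
        "$NDD" -set "$dev" adv_asym_pause_cap 0
    fi
}

# Example (as root):
#   disable_asym_pause /dev/rge0
```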

Update3: so far so good? The above seems to have removed the issue. I still have the ip:dohwcksum = 0 setting in the /etc/system file. I might try removing that.

Update4: removing ip:dohwcksum = 0 did not re-create the issue – so leaving it off.

Update5: problem came back (but took much longer to appear). Hmmm…..

Now I’ve got another problem with the rge driver. My CIFS write speed has dropped right back to about 2MB/s. There don’t seem to be any issues with the read speed, which still pulls through about 70MB/s.

It doesn’t seem to be a CPU bottleneck, so again I’m blaming the rge driver… The adv_asym_pause_cap parameter did not seem to make any difference to this particular issue, so I’m not blaming that. I’m currently stuck on this one. Hmmm…

Looks like I’m going to have to give up on this one and get an Intel PCI-e card. I’ll update this post if a new card fixes all the above problems (therefore pointing at the rge driver as the culprit).

Update6: I’ve got the Intel card, and the problems have not re-appeared as of yet. I’ll update if the problem does show itself, but I believe the problem was the rge driver. Hopefully the rge driver is fixed / updated in future releases of OpenSolaris.

Please leave a message if anyone has made any progress with the rge driver. Cheers.

31 thoughts on “OpenSolaris – RTL8111/8168B issues”

  1. Thanks for committing your thoughts and experiments to your blog. I have been building an OpenSolaris based file server myself, and I ran into the exact same issue with the rge and gani drivers. Based on your experiences, I set ndd -set /dev/rge0 adv_asym_pause_cap 0, and it seems to have fixed the issue for me. I do not have ip:dohwcksum = 0 in my /etc/system file.

    • Hi John, the problem came back again for me (tried without the ip:dohwcksum = 0), unsure if it’s worth me putting that setting back on. Hopefully you have better results.

  2. I think I’m seeing the issue come up again for me. Like you said, it’s taking much longer for the issue to crop up, but it is most definitely dying after transferring a large amount of data.

    Have you had a chance to try out an Intel card yet? If so, which one? Looking on Newegg, I see one for $24 and others for a lot more.

    • I’ve placed an order for an “INTEL PRO/1000 PT DESKTOP ADAPTER PCI-E” card, but it was out of stock at the time so I won’t be getting it until Tuesday. I’ll throw the details up as soon as I get it.

  3. Interestingly enough, after a bit (sorry no hard numbers), my file server seems to come to its senses and start responding to the network again. Very peculiar.

    • yeah, same thing for me. Most of the time the server will come back within 5 mins or so. Whoops, I never mentioned that bit (I’d better update the post to include that)

  4. Hi folks,

    I am new to OpenSolaris and tried to build a NAS with an Atom 330 board, which has the Realtek 8111C onboard and only one PCI slot vacant. But I need this for an additional SATA controller, so I need to use the rge. I ran into exactly the same problems, but disabling hardware checksumming in OS 2008.11 led to poor network performance, so I have left it on. netstat -i shows some collisions and errors that should not occur on a full-duplex line, but not many (6-27 or so). I have not tried your tricks yet, but I think I will have to look for another mainboard if the issue is not fixed soon… but many thanks for blogging your stories.

  5. I’ve hit the same problem with my Tranquil PC BBS2, based on the same NIC and have written a bit more about it on my blog. Luckily, Seb at Sun recently posted on the indiana-discuss list about the BBS2 and has since put me in contact with Winson from the kernel engineering team.

    He’s now poking my box remotely and trying to work out why it’s getting into this state. Fingers crossed, he’ll be able to come up with a solution :)

    I’ll put up a post when I have any more news and add another comment here.

  6. Winson’s managed to improve the stall detection considerably, so OpenSolaris can recover the connection within 30 seconds, rather than over 5 minutes now. Unfortunately, the stalls still happen under high load but he’s continuing to look into it.

    He thinks that it’s possibly a power supply issue at the moment. What are your thoughts on this?

    I’m using a Tranquil BBS2 barebones system, which is built on the Intel D945GCLF2. I don’t know what PSU’s in it (yet), so can’t really say whether it might be overloaded until I take a look inside.

  7. It could be a power supply issue, but I bet it is not.

    I have an Intel D945GCLF2 mainboard (Realtek 8111C) in a Chenbro case with a 180W power supply. The wattage actually used is 60 watts (peaking at 100 at startup).

    The Ethernet connection gets lost occasionally even if there is NO load on CPU and Ethernet. On the other hand, I’ve transferred gigabytes without loss, so the behaviour seems to be totally random.

    I’ve disabled the onboard audio and printer port – at least the audio did have some conflicts with its IRQs.

    I’m currently testing the ip:dohwcksum “hack” and it seems to work somehow. But the throughput is not as stable as it should be: CIFS bandwidth while copying shows ups and downs (from 8MB/s to 96 bytes/s). While this was the same as before, I think this is far too slow for a gigabit connection, and the driver quality could be better.

    I will try the gani driver …

  8. @Dominic

    hmmmm, I’ve actually swapped out my PSU for a Seasonic 550W (I had a generic 450W previously) – draws about 100W idle and 130W load. It’s not that powerful a PSU, but I thought it should be enough?

  9. Thanks for the feedback. I agree now that it isn’t a power supply issue after removing four disks from the machine, reducing the consumption by 20W. The power supply has plenty of breathing room now yet the issue still occurs. I’d have thought your PSU is ample Daz!

    Winson thinks he can get the stall detection down to five seconds, which would be a great improvement. I’ve had two instances of permanent lockups though, which I’m trying to recreate at the moment – have any of you seen this (where it doesn’t return after 5 mins)?

  10. @Dominic
    I haven’t had it fail to resolve itself after a period of time. Sometimes it can take up to 10 minutes to come back, but eventually it does. Might be worth trying the driver that Michael has mentioned?

    @Michael Keller
    Thanks Michael – I’ll give it a go some time soon. Is there a link to the older driver on the gani site?

  11. Daz, like you, I bought an Intel EXPI9300PTBLK card to replace the on-board Realtek NIC. It (and the e1000g driver) are rock solid, so far. I’ve used it for nearly a week now with no problems.

  12. I’ve got this issue too :(
    What I’ve found very interesting is that my Mac can hammer the card for all it’s worth, for as long as I want, at high speeds, no problems.
    I’ll play music off the network share, decompress RAR files off the network share, transfer hundreds of gigabytes, no problem.

    But when I try to do that with my Windows machine, it’ll crap out after transferring 10 GB or so.

    Maybe that can help you?

    Keep us posted :)

    Ducky

  13. FWIW, I was having the same issue with the rge driver (board is ASUS M3A78-EH with the RTL8111/8168B PCI Express Gigabit Ethernet controller) running OpenSolaris 2009.06. I’m using it as a NAS via NFS with VMware ESXi as the NFS client. Under heavy load, every hour or two, the network would disappear for five to ten minutes before reappearing.

    I tried all the fixes listed above, and many more, with no success. I switched to the gani driver (2.6.4) and I’ve just run flat-out NFS load on it for 24 hours with no interruption.

    Awesome! Hope this helps someone!

    Nick

  14. Yes, seems like the new driver works. The “-9” variant seems to work for me and is reported to work for others, though I don’t think it will be the final production release.

    Nevertheless, to install it, do:

    1) download it from here: http://homepage2.nifty.com/mrym3/taiyodo/rge-6888015-9.tar.gz
    2) follow these instructions: :)

    (1) unload existing rge:
    unplumb rge port
    # ifconfig rge0 unplumb

    find module id of rge
    # modinfo | grep rge
    if the result is:
    200 fffffffff889e000 d420 320 1 rge (Realtek 1Gb Ethernet)
    then,
    # modunload -i 200

    (2) load the new rge
    if you use 64bit kernel:
    # modload ./amd64/rge

    if you use 32bit kernel:
    # modload ./i386/rge

    ensure the new rge is loaded and running
    # modinfo | grep rge
    200 fffffffff889e000 d420 320 1 rge (Realtek 1Gb Ethernet mcast.3)

    (3) plumb the new rge
    # ifconfig rge0 plumb …….

    (4) then test your applications.


    here was the before/after on my modinfo line:

    before: 155 fffffffff7eb5000 a9d8 66 1 m (Realtek 1Gb Ethernet)
    after: 155 fffffffff7f36000 d868 66 1 rge (Realtek 1Gb Ethernet 6888015-9)
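
The module id lookup in step (1) can be pulled out with awk instead of being read by eye. A minimal sketch, assuming modinfo prints the id in the first column and the module name in the sixth, as in the sample lines above; rge_module_id is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read modinfo-style lines on stdin and print
# the id (column 1) of the module whose name (column 6) is rge.
rge_module_id() {
    awk '$6 == "rge" { print $1 }'
}

# Example (as root), replacing the manual lookup:
#   id=$(modinfo | rge_module_id)
#   [ -n "$id" ] && modunload -i "$id"
```

Matching on the name column rather than a plain grep avoids picking up other modules that merely have "rge" somewhere in their description.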

  15. I ran into this issue with rge hanging, and installing the -9 driver took care of the problems. Just extract and run make; make install.

  16. ACTUALLY — sorry — I take back my comment about the -9 driver not working 100% of the time. Turns out a system reboot resets the driver.

    There are discussions on the aforementioned forum thread about permanently installing the driver.

    Turns out you can skip the modunload / ifconfig stuff if you want to just ‘install’ it rather than just patch it in until the next reboot. So you can install by:

    1) copy rge into /platform/i86pc/kernel/drv (add a /amd64 on the end if you’re on a 64 bit platform)

    2) restart!

    3) verify that you are running the patched driver: modinfo | grep rge, look for a line that ends with -9).
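
A rough sketch of steps 1 and 3 in script form. It assumes isainfo -k reports amd64 or i386 for the running kernel, and that the extracted tarball has amd64/ and i386/ subdirectories as in the modload example earlier in the thread. The helper names rge_driver_dir and is_patched_rge are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map the kernel ISA name to the directory
# the rge driver binary should be copied into.
rge_driver_dir() {
    case "$1" in
        amd64) echo /platform/i86pc/kernel/drv/amd64 ;;
        i386)  echo /platform/i86pc/kernel/drv ;;
        *)     echo "unsupported kernel ISA: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: succeed if a modinfo line on stdin shows
# an rge module whose description ends with -9).
is_patched_rge() {
    grep ' rge (.*-9)$' >/dev/null
}

# Example (as root, from the extracted driver directory):
#   isa=$(isainfo -k)
#   cp "./$isa/rge" "$(rge_driver_dir "$isa")/"
#   ...reboot, then:
#   modinfo | is_patched_rge && echo "patched rge is loaded"
```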


  19. So I had the same problem with the built in NIC on a regular motherboard. Upgrading the drivers didn’t make the problem go away. So I bought an Intel PRO/1000 GT Desktop Adapter for $30 and it’s been working just fine. 900 GB so far over a few days.

    Seems like Intel is always the way to go. I tried building a file server using an AMD motherboard with lots of SATA ports, but got all kinds of problems.

    Thanks for all the help !
