zfs compression and latency

Since I'm using ZFS as NFS storage for some of my VMware environments, I need to ensure that latency on my disks is reduced wherever possible.

There is a lot of talk about ZFS compression being “faster” than a non-compressed pool due to less physical data being pulled off the drives. This of course depends on the system powering ZFS, but I wanted to run some tests focused specifically on latency. Throughput is fine in some situations, but latency is a killer when it comes to lots of small reads and writes (as is the case when hosting virtual machines).

I recently completed some basic tests focusing on the difference in latency when ZFS compression (lzjb) is enabled or disabled. IOMeter was my tool of choice, and I hit my ZFS box via a mapped drive.

I’m not concerned with the actual figures, only the difference between them.

I have run the test multiple times (to eliminate caching as a factor) and can confirm that compression (on my system, at least) increases latency.

Basic results from an “All in one” test suite… (similar results across all my tests; response times are in milliseconds)

ZFS uncompressed:

IOps : 2376.68
Read MBps : 15.14
Write MBps : 15.36
Average Response Time : 0.42
Average Read Response Time : 0.42
Average Write Response Time : 0.43
Average Transaction Time : 0.42

ZFS compressed: (lzjb)

IOps : 1901.82
Read MBps : 12.09
Write MBps : 12.28
Average Response Time : 0.53
Average Read Response Time : 0.44
Average Write Response Time : 0.61
Average Transaction Time : 0.53

As you can see from the results, the average write response time in particular is much higher with compression enabled. I wouldn’t recommend using ZFS compression where latency is a large factor (e.g. hosting virtual machines).

Note: Under all the tests performed, the CPU (dual core) on the ZFS box never hit 100% – eliminating that as a bottleneck.
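For reference, toggling compression on a dataset is a one-liner (the dataset name tank/vmstore below is just an example); note it only affects newly written data:

zfs set compression=lzjb tank/vmstore              # enable lzjb (applies to new writes only)
zfs set compression=off tank/vmstore               # disable compression
zfs get compression,compressratio tank/vmstore     # current setting and the achieved ratio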

OpenSolaris – Samba server

Time to share your newly created ZFS volume via Samba to your Windows clients. There is some CIFS / SMB support built into the kernel now, but I’ve grown used to the Samba server…

Fire up Add Software, click Filesystems and filter for “smb”; there are generally three packages. I grab all three, but you only need the kernel update and the server package. The other is the SMB client.

Once installed, make sure you enable the server in the Services GUI.
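If you’d rather use the command line, the Samba server registers as an SMF service; the exact FMRI can vary by release, so check first:

svcs -a | grep -i samba                             # find the service name on your release
pfexec svcadm enable svc:/network/samba:default     # typical FMRI for the Samba daemon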

Ensure the filesystem does not have any permission issues. I usually run chmod -R 777 /share just to ensure everyone can access the files without issue.

Add some users to the SMB password file (you need to create the users and sync the passwords). I usually create a guest user profile:

useradd guest

smbpasswd -a guest – it should prompt for the password twice (this is the password you use from Windows). Press Enter twice to leave the password blank.

The configuration can be done via /etc/sfw/smb.conf or via the Shared Folders admin GUI.

I prefer doing the admin via the /etc/sfw/smb.conf file as it gives you more control than the basic options available via the GUI. The contents of the file are as follows (note: I have included a lot of the settings as examples, so some may contradict others):

[global] – global settings, the following are obvious

workgroup = workgroup

server string = opensolaris

wins support = yes – lets your server act as a WINS box


[share] – share name

path = /raidz1/share – share path

available = yes – enabled?

browseable = yes

public = yes

valid users = user1, user2 – only these users can access the share

writable = yes – equivalent to read / write in Windows share properties

read only = yes – sets the default permissions to read only

write list = user1, user2 – these users can write to the share. Overrides above “read only” setting.

There are some good examples within /etc/sfw/smb.conf-example. Look there for some tips.
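Since the settings above deliberately mix contradictory examples, here is a minimal, internally consistent version to start from (the path and user names are just the ones used above):

[global]
workgroup = workgroup
server string = opensolaris
wins support = yes

[share]
path = /raidz1/share
browseable = yes
read only = yes
write list = user1, user2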

You also have the option of managing Samba via the web with SWAT (the Samba Web Administration Tool). To get this up and running, enable the swat service svc:/network/swat:default then browse to http://server:901
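That boils down to:

pfexec svcadm enable svc:/network/swat:default
svcs swat     # should report the service as online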

Optimizing SMB

I’ve found that adding this to /etc/sfw/smb.conf helps throughput in some cases. Try it for yourself (it tends to put a higher load on the CPU):

[global]

aio read size = 1
aio write size = 1

Further to this entry, I have discovered that the built-in CIFS / SMB service is much more efficient since it is part of the kernel. See my other posts on setting up CIFS.
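For reference, the kernel CIFS route boils down to something like this (dataset and share names are just examples; see the CIFS posts for the full setup):

pfexec svcadm enable -r smb/server          # kernel SMB/CIFS service plus its dependencies
zfs set sharesmb=name=share tank/share      # share the dataset as \\server\share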

Updated : 9/08/2009

I’ve swapped back to Samba due to the issues I’ve had with CIFS in the later releases. Remember, if you wish to swap back to Samba you need to remove the sharesmb properties from each of your ZFS shares – else on reboot ZFS will re-enable the smb/server service.
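Clearing the property looks something like this, per dataset (the dataset name is an example):

zfs set sharesmb=off tank/share       # or: zfs inherit sharesmb tank/share
pfexec svcadm disable smb/server      # stop the kernel CIFS service so Samba can use the SMB ports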

There are some additional settings to ensure that your file server is the master browser for your workgroup. Put these under [global]:

[global]
domain master = Yes
local master = Yes
preferred master = Yes
os level = 35

Apparently on Windows the OS level only reaches 32 – so setting this to 35 ensures that your file server remains the master browser when an election is performed.

Opensolaris : Citrix XenServer / ESX – Hooking into ZFS

To share your ZFS pool via NFS (in a way that works with Citrix XenServer / ESX) to a host called “esxhost”:

zfs set sharenfs=rw,nosuid,root=esxhost tank/nfs

Note: You MUST have a resolvable name from the OpenSolaris box, i.e. you should be able to ping the host by name. I have tried with IPs only and it will fail. I have edited the /etc/hosts file to include the following line for my config:

# Copyright 2007 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# ident “%Z%%M% %I% %E% SMI”
#
# Internet host table
#
192.168.9.120 esxhost

This also requires that you are using both files and DNS in your /etc/nsswitch.conf file. You should have lines like so:

# You must also set up the /etc/resolv.conf file for DNS name
# server lookup. See resolv.conf(4). For lookup via mdns
# svc:/network/dns/multicast:default must also be enabled. See mdnsd(1M)
hosts: files dns mdns

# Note that IPv4 addresses are searched for in all of the ipnodes databases
# before searching the hosts databases.
ipnodes: files dns mdns

I’ve also run this beforehand (to allow full access):

chmod -R 777 /tank/nfs
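To confirm the export took effect (dataset name as above):

zfs get sharenfs tank/nfs     # should show rw,nosuid,root=esxhost
share                         # lists the active NFS shares on the box
showmount -e                  # the export list clients will see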

Update : check this guide http://blog.laspina.ca/ubiquitous/running-zfs-over-nfs-as-a-vmware-store

Update 2: there are known issues with waiting on sync writes when using NFS and ZFS together… NFS clients issue synchronous writes, which ZFS honours, and without a fast log device that hurts performance. There are reasons why you shouldn’t do this, but in a test environment disabling sync at the ZFS level may help performance (zfs set sync=disabled).

I like the idea of splitting up your SSD too… again, no problems in a test environment, but in production I would dedicate the entire drive to a single task: https://blogs.oracle.com/ds/entry/make_the_most_of_your
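As a rough sketch of that idea (the device / slice names here are made up; in production give the whole SSD to a single role):

pfexec zpool add tank log c5t0d0s0       # slice 0 as a separate intent log (helps NFS sync writes)
pfexec zpool add tank cache c5t0d0s1     # slice 1 as an L2ARC read cache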

opensolaris / zfs – whitebox build

I’ve built a little server for home use, but it pales in comparison to this beast… This type of setup would be perfect for a lab / test environment that requires lots of fast and reliable disk. SCSI drives are fading out, and SATA can perform if it’s set up right. When you look at the price of the entire build, you wonder why corporations continue to spend big bucks on the big storage names.

Check out this build (very nice clear guide)   http://www.stringliterals.com/?p=77

[Photo: the RPC-4020B storage chassis]

Awesome piece of work.

zfs – playing with various configs

If you don’t have the disks available to build a zpool and have a play with ZFS, you can actually just use files created with the mkfile command… The commands are exactly the same.

mkfile 64m disk1

mkfile 64m disk2

mkfile 64m disk3

mkfile 10m disk4

mkfile 100m disk5

mkfile 100m disk6

Now you can create a zpool using the above files… (I’m using raidz for this setup)

zpool create test raidz /fullpath/disk1 /fullpath/disk2 /fullpath/disk3

If you now want to expand this pool using another three drives (files), you can run this command:

zpool add test raidz /fullpath/disk4 /fullpath/disk5 /fullpath/disk6

Check the status of the zpool

zpool status test

NAME                          STATE     READ WRITE CKSUM
test                          ONLINE       0     0     0
  raidz1                      ONLINE       0     0     0
    /export/home/daz/disk1    ONLINE       0     0     0
    /export/home/daz/disk2    ONLINE       0     0     0
    /export/home/daz/disk3    ONLINE       0     0     0
  raidz1                      ONLINE       0     0     0
    /export/home/daz/disk4    ONLINE       0     0     0
    /export/home/daz/disk5    ONLINE       0     0     0
    /export/home/daz/disk6    ONLINE       0     0     0

errors: No known data errors


Now it’s time to replace a drive (perhaps you wish to slowly increase your space). Note: all drives in that particular raidz vdev need to be replaced with larger drives before the additional space is shown.

mkfile 200m disk7

mkfile 200m disk8

mkfile 200m disk9

Check the size of the zpool first;

zpool list test

NAME SIZE USED AVAIL CAP HEALTH ALTROOT
test 464M 349M 115M 75% ONLINE –

Now replace all of the smaller drives with the larger ones…

zpool replace test /export/home/daz/disk1 /export/home/daz/disk7
zpool replace test /export/home/daz/disk2 /export/home/daz/disk8
zpool replace test /export/home/daz/disk3 /export/home/daz/disk9

The space will show up after you bounce the box. I’ve heard that sometimes you may need to export and import the pool, but I’ve never had to do that.
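The export / import route looks like this for the file-backed pool above (-d points the import at the directory holding the files):

zpool export test
zpool import -d /export/home/daz test
zpool list test     # should now report the larger size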

Flashing the si3114 to a SATA-only BIOS

This is the perfect controller for adding additional SATA drives to an OpenSolaris box (well, in terms of price – the bandwidth of the PCI slot is the only negative). The si3114 ships with a default BIOS that supports various RAID configs, but this requires additional drivers to be loaded.

Essentially the “RAID” on the card is fakeraid: it does not actually process any data itself, but hooks into the CPU via a driver and lets your CPU do all the work.

Instead we will flash the BIOS so the card becomes a plain SATA controller (no RAID). If we are using ZFS, it’s better to just present the disks and let the OS take care of the work.

You will need these tools:

bio-003114-x10_5403 – the various BIOS images for the si3114 card

siflashtool – the flashing tool

Note: you must plug a hard drive into the card or the flash will not work.

From the zip above, grab the file “b5403.bin” for the BIOS flash; the other image is for RAID and can be ignored.

Now you’ll need to grab your trusty bootable flash drive / USB stick. If you don’t have one, check out HP’s tool for creating one (or you could use a floppy boot disk if you still have one). Copy the files onto it and boot it up.

The instructions say you can flash from Windows, but I never had any luck with that – I found booting to DOS a much more reliable method. This is the command line to run:

SiFlashTool /File:b5403.bin

Done.
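Once the flashed card is back in the OpenSolaris box, the drives should just show up as plain SATA disks. A quick way to check:

pfexec format     # lists every disk the OS can see (Ctrl-C out of the prompt)
cfgadm -al        # shows the SATA attachment points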

Opensolaris – where has my memory gone?

Use this command in OpenSolaris 2008.11 to get details on where your memory is currently being used…

echo ::memstat | pfexec mdb -k

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     263992              1031   34%
ZFS File Data               91917               359   12%
Anon                       376867              1472   48%
Exec and libs               11484                44    1%
Page cache                   3387                13    0%
Free (cachelist)             9766                38    1%
Free (freelist)             24807                96    3%

Total                      782220              3055
Physical                   782219              3055

Note: ZFS should eat up the remainder of your RAM after a bit of use.

“ZFS File Data” is the one to look at – if it is low, then most of your RAM is being eaten up in other areas of the system.

From the output above you can see that I have 3GB installed. I have a few VirtualBox VMs running on my server which show up as “Anon”; they are consuming almost half of my RAM.
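If you want a figure for the ARC specifically, the kstat counters report it in bytes:

kstat -p zfs:0:arcstats:size      # current ARC size
kstat -p zfs:0:arcstats:c_max     # the limit the ARC will grow to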

zfs – java management gui

It hasn’t made it into OpenSolaris yet, but from what I’ve heard it should be making its way over from Solaris 10 soon. Here is a screenshot of what it looks like…

[Screenshot: the ZFS management GUI]

It should make managing ZFS a bit easier – though it’s already quite easy. Perhaps if you have a lot of zpools / ZFS file systems it will look prettier. ;)

Troubleshooting – Time Slider (zfs snapshots)

1. snapshot complains about no access to cron

I came across this problem after playing with crontab. It looks like the ZFS snapshot service uses an account called “zfssnap”, and if it doesn’t have access to cron it will have issues creating / checking snapshots. Check the file /etc/cron.d/cron.allow and ensure that “zfssnap” is in there. The issues I had looked like this in the log… (check the logs via the Log File Viewer)

Checking for non-recursive missed // snapshots  rpool

Checking for recursive missed // snapshots protected rpool/backup rpool/export rpool/ROOT unprotected

crontab: you are not authorized to use cron.  Sorry.

crontab: you are not authorized to use cron.  Sorry.

Error: Unable to add cron job!

Moving service to maintenance mode.

The actual crontab lives in /var/spool/cron/crontabs/zfssnap. (Don’t edit this manually.)
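Adding the user is a one-liner (needs root privileges):

echo zfssnap | pfexec tee -a /etc/cron.d/cron.allow
grep zfssnap /etc/cron.d/cron.allow     # verify it made it in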

Restart the services by clearing the maintenance status, then enable or restart them if required, like so…

svcadm clear auto-snapshot:frequent

svcadm enable auto-snapshot:frequent

Check that all ZFS snapshot services are running as expected…

svcs -a | grep snapshot

online         22:26:12 svc:/system/filesystem/zfs/auto-snapshot:weekly

online          9:06:36 svc:/system/filesystem/zfs/auto-snapshot:monthly

online          9:11:23 svc:/system/filesystem/zfs/auto-snapshot:daily

online          9:12:00 svc:/system/filesystem/zfs/auto-snapshot:hourly

online          9:23:57 svc:/system/filesystem/zfs/auto-snapshot:frequent

2. snapshot fails with dataset busy error

Seen something similar to this in the logs? …

Checking for recursive missed // snapshots protected rpool/backup rpool/export rpool/ROOT unprotected

Last snapshot for svc:/system/filesystem/zfs/auto-snapshot:frequent taken on Sun Mar 15 22:26 2009

which was greater than the 15 minutes schedule. Taking snapshot now.

cannot create snapshot ‘rpool/ROOT/opensolaris@zfs-auto-snap:frequent-2009-03-16-09:06’: dataset is busy

no snapshots were created

Error: Unable to take recursive snapshots of rpool/ROOT@zfs-auto-snap:frequent-2009-03-16-09:06.

Moving service svc:/system/filesystem/zfs/auto-snapshot:frequent to maintenance mode.

Here is a bit from this site – “This problem is being caused by the old (IE: read non-active) boot environments not being mounted and it is trying to snapshot them. You can’t ‘svcadm clear’ or ‘svcadm enable’ them because they will still fail.”

Apparently this is a bug with the ZFS snapshots on rpool/ROOT/opensolaris-type datasets. Anyhow, to fix it I’ve just used a custom setup in Time Slider: clear all the services set to “maintenance”, then launch time-slider-setup and configure it to exclude the problem file systems.

Update: As per John’s comments below, you can disable the snapshots on the offending ZFS filesystem using the following command…

zfs set com.sun:auto-snapshot=false rpool/ROOT

As above, to clear the “maintenance” status on the affected services, run the following commands…

svcadm clear auto-snapshot:hourly

svcadm clear auto-snapshot:frequent

Now run this to ensure all the SMF services are running without issue…

svcs -x

If all is well you will get no output.

zfs – checking your zpool throughput

This is quite a good diagnostic for checking your disk throughput. Try copying data to and from your zpool while you’re running this command on the host…

zpool iostat -v unprotected 2

               capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
unprotected  1.39T   668G     18      7  1.35M   161K
  c7d0        696G   403M      1      2  55.1K  21.3K
  c9d0        584G   112G      8      2   631K  69.3K
  c7d1        141G   555G      8      2   697K  70.0K
-----------  -----  -----  -----  -----  -----  -----

The above command keeps refreshing every 2 seconds (showing the average over that interval). I’ve used it a few times to ensure that all disks are being used for write operations where they should be. Of course, reads may not be spread across all disks, as it depends on where the data actually lives…

As you can see in the output from my “unprotected” zpool, the disk “c7d0” is nearly full, so fewer write operations land on it. In my scenario most of my reads also come from this disk, because I copied most of the data into the zpool back when it contained only that single disk.

I’ve heard rumours of a future ZFS feature that will re-balance the data across all the disks (unsure if it’s live or on a set schedule).

Another way to show some disk throughput figures is to run the iostat command like so…

iostat -exn 10

extended device statistics
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
cmdk17    1.0    0.0   71.5    0.0  0.0  0.0   10.9   0   1
cmdk18    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
cmdk19    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
cmdk20    0.8    0.0   33.5    0.0  0.0  0.0   13.5   0   1
cmdk21    0.4    0.0    0.5    0.0  0.0  0.0   15.5   0   1
cmdk22    0.8    0.0   66.3    0.0  0.0  0.0    9.0   0   1
cmdk23    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
cmdk24    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0

                     extended device statistics       ---- errors --- 


r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11d1
0.0    7.7    0.0   25.8  0.0  0.0    2.3    4.9   0   3   0   0   0   0 c8d0
0.0   17.6    0.0  238.0  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c9d0
0.0    1.0    0.0    0.8  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t0d0
0.0    1.0    0.0    0.8  0.0  0.0    0.0    0.2   0   0   0   0   0   0 c7t2d0
0.0    1.0    0.0    0.8  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t3d0
0.7   21.1   29.9  315.0  0.0  0.0    0.0    1.1   0   1   0   0   0   0 c7t4d0
0.7   20.9   29.8  314.9  0.0  0.0    0.0    1.7   0   2   0   0   0   0 c7t5d0
0.8   21.0   34.1  315.0  0.0  0.0    0.0    1.2   0   1   0   0   0   0 c7t6d0
0.5   20.8   21.3  314.8  0.0  0.0    0.0    1.1   0   1   0   0   0   0 c7t7d0

This will show all your disks, updating every 10 seconds (the interval given on the command line). Copying data back and forth to your drives will show the various stats changing.