vmware and load balancing NFS trunks

Straight from the post below, this is the best way (currently) to load balance your NFS datastores… No MPIO magic here, unfortunately.

http://communities.vmware.com/message/1466595#1466595

Basically you can set up IP aliases on the NFS side and then mount each datastore on each ESX host using its own unique IP. This works well if you are using a team of nics running IP-hash load balancing…

Static EtherChannel.

My setup is as follows:

ESXi 4.0 U1, Cisco 3750 Switches, and NetApp NFS on the storage side.

I have a total of 8 nics. I divided the nics into 3 groups.

2 nics on vSwitch0 for Mgmt & vMotion
3 nics on vSwitch1 for VMs (multiple port groups, 3 VLANs)
3 nics on vSwitch2 for IP Storage (Mostly NFS, a little iSCSI)
(On vSwitch3 I also have a VM port group for iSCSI access from within the VMs)

Since I have 3 nics on my IPStorage port group I needed a way to utilize all three and not have the server just use one for ingress and egress traffic. This was done by:

Setting up a static EtherChannel on the Cisco switch (port channel).
Configuring the Cisco switch for IP hash load balancing.
Configuring the vSwitch to “Route based on IP Hash” as well (see the config sketch below).
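For reference, the switch side looks roughly like the sketch below on a 3750. Treat it as an example only: the port range, port-channel number and VLAN are placeholders for whatever your IP storage uplinks actually use, and “mode on” is what makes it a static EtherChannel rather than LACP/PAgP.

port-channel load-balance src-dst-ip
!
interface range GigabitEthernet1/0/1 - 3
 description ESX IP storage uplinks
 switchport mode access
 switchport access vlan 100
 channel-group 10 mode on
!
interface Port-channel10
 switchport mode access
 switchport access vlan 100

The global port-channel load-balance src-dst-ip line is what lines up with the vSwitch’s “Route based on IP Hash” policy; the vSwitch side itself is set under vSwitch Properties > NIC Teaming in the vSphere Client.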

The next part is to create multiple datastores on the NFS device. Each of my NFS datastores is about 500GB in size. The reason for this is that my larger LUNs are iSCSI and are accessed directly using the MS iSCSI initiator from within the VM itself.
My NetApp NAS has an address of, let’s say, 192.168.1.50, so all my datastores are accessible via “\\192.168.1.50\NFS-Store#”. On its own that doesn’t help, as the ESX box and the Cisco switch will always use the same nic/port to reach the NAS: the IP-hash algorithm decides which link the traffic goes over, and a single source/destination IP pair always hashes to the same one. To resolve this, I added IP aliases to the NFS box. NetApp allows multiple IP addresses to point at the same NFS export, and I suspect EMC would do the same. So I added two aliases, .51 and .52, and now my NFS datastores are accessible via 192.168.1.50, .51, and .52.
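For what it’s worth, on a 7-mode NetApp filer the aliases can be added along these lines (e0a is a placeholder for whichever interface or VIF serves NFS, and the same lines need to go into /etc/rc to survive a reboot):

ifconfig e0a alias 192.168.1.51 netmask 255.255.255.0
ifconfig e0a alias 192.168.1.52 netmask 255.255.255.0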

So I went ahead and added the datastores to the ESX box using the multiple IP addresses:

Datastore1 = \\192.168.1.50\NFS-Store1
Datastore2 = \\192.168.1.51\NFS-Store2
Datastore3 = \\192.168.1.52\NFS-Store3

If you have more datastores the pattern just repeats: Datastore4 = \\192.168.1.50\NFS-Store4 and so on…
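If you’d rather script it than click through the Add Storage wizard, the same mounts can be added from the ESX(i) console or vMA with esxcfg-nas. The export path /vol/NFS-Store# below is just my guess at the NetApp volume naming, so adjust to suit:

esxcfg-nas -a -o 192.168.1.50 -s /vol/NFS-Store1 Datastore1
esxcfg-nas -a -o 192.168.1.51 -s /vol/NFS-Store2 Datastore2
esxcfg-nas -a -o 192.168.1.52 -s /vol/NFS-Store3 Datastore3
esxcfg-nas -l    # list the NFS mounts to confirm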

With multiple datastores, each on its own address, all 3 nics on the ESX box dedicated to IP storage get utilized. It does not aggregate the bandwidth, but it does use all three to send and receive packets. So the fastest speed you will get is, theoretically, 1Gbit each way per connection, but it is better than trying to cram all the traffic over 1 nic.

I also enabled jumbo frames on the vSwitch as well as the vmkernel nic for IP storage. (Need the best performance!)
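As far as I remember the MTU can’t be changed from the vSphere Client on 4.0, so it has to be done from the command line. A rough sketch, using the vSwitch name from above and a made-up vmkernel IP and port group name:

esxcfg-vswitch -m 9000 vSwitch2                                        # raise the vSwitch MTU to 9000
esxcfg-vmknic -d "IPStorage"                                           # the vmkernel port has to be recreated to change its MTU
esxcfg-vmknic -a -i 192.168.1.10 -n 255.255.255.0 -m 9000 "IPStorage"  # recreate it with MTU 9000
esxcfg-vswitch -l                                                      # confirm the new MTU took

The physical switch ports and the NetApp interface need jumbo frames enabled end to end as well, otherwise this will do more harm than good.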
I should mention that your NFS storage device should have EtherChannel set up on it as well. Otherwise, you’ll be in the same boat, just on the other end of it.

Hope it helps!

Larry B.

I should mention that you should not use different addresses to access the same NFS share (datastore). It is not supported and may cause you issues.

vmware – hp procurve lacp / trunk

Cisco’s EtherChannel and HP’s LACP are very similar – probably why I assumed both were supported by vmware. But as per the KB article below, that is not the case.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004048

From a ProCurve perspective the difference between a “trunk” trunk and a “static lacp” trunk is the total bandwidth per connection. A typical end point on a “trunk” trunk can transmit various connections down both pipes, but only receive down one – sometimes referred to as TLB (transmit load balancing). In vmware’s case, when you are using a “trunk” trunk you will have IP hash set as the load balancing policy, effectively meaning a different nic will be used on the vmware side for each connection a virtual machine makes.

This most probably explains why we hit a 1Gbit limit per connection with vmware, since a switch “trunk” can only receive back down a single interface. Whereas with LACP the interface is the team itself (i.e. it is treated as one larger single interface) and can load balance in both directions. That matches up with what I’ve seen in the live switch statistics.

HP’s LACP – should be used where possible, between switches and servers that support it. LACP is a negotiation protocol spoken between the two ends of the link (which is why both ends need to support it).

“trunk” trunks – have to be used with vmware at the moment – limit each connection’s bandwidth to a single interface (i.e. you will never get more than 1Gbit per connection if your nics are all 1Gbit). The ProCurve config for both flavours is sketched below.
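As an example (the port range 1-3 and the trk1 name are just placeholders for your actual uplink ports):

For a static “trunk” trunk – the type to pair with vmware’s “Route based on IP Hash”:

trunk 1-3 trk1 trunk

For an LACP trunk – between switches, or to servers that genuinely speak LACP:

trunk 1-3 trk1 lacp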

Linux – install vmware tools onto guest

Fire up the VM, then run the following after initiating a VMware Tools install…

mkdir -p /mnt/cdrom                          # create a mount point if it doesn't exist
mount /dev/cdrom /mnt/cdrom                  # the tools ISO attached by the install wizard
cd /tmp
tar zxf /mnt/cdrom/VMwareTools-x.x.x.tar.gz  # version/build number will vary
cd vmware-tools-distrib
./vmware-install.pl

Then just follow the prompts through to the end.

If you’re running Fedora or similar, make sure you’ve got gcc and the kernel headers… (you’ll probably have to update the kernel too)

yum update
shutdown -r now
yum install -y gcc make kernel-devel perl

Ubuntu 12.x

apt-get install open-vm-tools

some notes from fedora 13…

Did you also copy the missing/misplaced include file?

(Having just updated the kernel I am getting the original messages again, so have copied them below as I work around the problem)

= = = First I get:

What is the location of the directory of C header files that match your running
kernel? [/usr/src/linux/include] /usr/src/kernels/2.6.33.5-112.fc13.x86_64/include

The directory of kernel headers (version @@VMWARE@@ UTS_RELEASE) does not match
your running kernel (version 2.6.33.5-112.fc13.x86_64). Even if the module
were to compile successfully, it would not load into the running kernel.

= = = Then over in another session at
/usr/src/kernels/2.6.33.5-112.fc13.x86_64/include

[Tom@tlsf13a include]$ find . -iname '*relea*'
./config/kernel.release
./generated/utsrelease.h
[Tom@tlsf13a include]$ sudo cp -p generated/utsrelease.h linux/

= = = Then back in first session:

What is the location of the directory of C header files that match your running
kernel? [/usr/src/linux/include] /usr/src/kernels/2.6.33.5-112.fc13.x86_64/include

Extracting the sources of the vmmemctl module.

= = = and the vmware-config-tools.pl runs ….
(well, all but vmci builds … :-/ )

vmware srm – replicating over a wan – optimizations

I’ve been working through an SRM setup and have been looking at ways to optimize the amount of traffic that is sent over the WAN. The first obvious move is to get your vmware swap files off the replicated LUNs.

Another way is to adjust the sync window, i.e. how often your replication technology tries to keep the source and destination in sync. Increasing this window can sometimes help you out, but that all depends on your deltas.

For example – in the case of a Windows page file on an active server (SQL etc.) it could “potentially” change the whole file within an hour. If your replication was set to every hour and the page file was 4GB, then you’d be sending at least 4GB every hour. Changing the sync on that replicated LUN to every 8hrs instead would mean you’d only send the ~4GB of “delta” (i.e. blocks that have changed since the last snap) three times a day rather than 24 – roughly 12GB a day over the WAN instead of 96GB, just for the page file.

The problem is that you would typically want to sync your virtual machines on a more frequent schedule than 8hrs. So this is where you need to move your Windows page files onto a separate LUN (also replicated), but on a larger sync window (perhaps only once if your servers are static).

monitoring changed blocks (CBT) for replication of vmware virtual machines

Check this link for a great script to monitor changes made to a virtual machine via the vmware CBT API. This is perfect for finding culprit machines that are generating a lot of replication traffic if you are replicating over a WAN.

http://www.vmguru.com/index.php/articles-mainmenu-62/scripting/105-using-powershell-to-track-block-change-sizes-over-time

If you have some disks on a virtual machine you don’t want this script to capture, just set them as independent disks (so no snapshots can take place). This is handy if you have your Windows page file on a separate disk that you don’t want measured as part of the CBT changes.
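A rough sketch of doing that with PowerCLI rather than the GUI (the VM and disk names are placeholders, and the VM generally needs to be powered off, with no snapshots, before the disk mode can be changed):

Get-HardDisk -VM "MyVM" -Name "Hard disk 2" | Set-HardDisk -Persistence IndependentPersistent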

CBT_Tracker