what exactly does "Networking limits" mean
I was re-reading the Nexenta documentation and saw the page at http://nexenta.com/corp/downloads. In the comparison of the three Nexenta versions, the Community edition lists "Networking limits" under the Other restrictions row...
So what exactly does that mean? Can someone at Nexenta please elaborate (Linda Kately, Andrew Galloway, etc.), and consider updating Nexenta's documentation to reflect this limitation?
I have been looking again at my VMWare-based Nexenta rig and have noticed that esxtop shows about 50% of the packets to/from the Nexenta VM being dropped (this happens whether I am using the storage as iSCSI or NFS). It looks like an artificial limitation set at 1 Gbit/s, but that might just be a coincidence. I would be really peeved if this is the "limitation" and it was never mentioned as a possible cause in all the forum posts out there talking about performance under ESXi.
I added that because I can't get confirmation on the networking limits. I have read in some of our documentation that Community is limited to 1 GbE interfaces, but I haven't been able to confirm it.
Are you seeing a 1 GbE limitation on Community? If so, I will write up the RFE to have it removed.
I just got confirmation that the limit no longer exists. I will take it off the sites.
Yes, I have tested extensively and have (almost) never been able to exceed 1 Gbit/s on my 3.1.2 Community edition when accessing my pool through NFS or iSCSI. I have experimented with multipath SCSI I/O, etc., and always seem to be hitting my head on a glass ceiling. Everything points to an artificial networking limit in Nexenta or excessively high latency that I can't explain.
However, when I SSH into the Nexenta OS and perform similar (but simpler) tests natively, I get better performance. For example, running 'dd if=/dev/zero of=/volumes/tank/VirtualSAN/testfile bs=8192 conv=fdatasync' gives MUCH better sequential read/write results (about 2.5x better). It is hard to measure random IO natively in Nexenta's OS.
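For anyone who wants to reproduce that test, a bounded version of the same command is sketched below; the count value is just an example so the run finishes in a predictable amount of time instead of filling the volume:
dd if=/dev/zero of=/volumes/tank/VirtualSAN/testfile bs=8192 count=1000000 conv=fdatasync
# count=1000000 at bs=8192 writes roughly 8 GB; conv=fdatasync forces the data to disk
# before dd reports its rate, so the MB/s figure reflects pool throughput rather than caching
rm /volumes/tank/VirtualSAN/testfile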
Are you using 10 GbE?
No, I am not using 10 GbE.
This Nexenta build is a virtual machine hosted on an ESXi server. Technically the NIC speed is whatever VMWare/Nexenta is able to handle within the limits of RAM/CPU/Nexenta e1000 driver quality.
This Virtual Nexenta takes my server's hard disks and combines them into a pool of 4 mirrored zvols +cache +zil and then exports that same storage pool back to the SAME VMWare ESXi host for use as iSCSI/NFS. I do this to give my server reliable fault tolerant storage without replacing the raid card/HBA.
VMWare's esxtop utility shows the vswitch and vNIC associated with the Nexenta VM as dropping about half the packets. I also see excessive latency of any VM on that Datastore (using VMWare ESXi 5.1's Performance monitoring tabs), and also when using a VM running IOMeter 2006.
Something in Nexenta or VMWare is introducing excessive packet loss and latency. Keep in mind that this is ALL virtual and never flows through an actual physical NIC.
Unrelated side note: I noticed that logging in to Nexenta's OS and writing to the volume and share as I demonstrated above using dd seems to cause Nexenta's NFS server to malfunction... VMWare stopped recognizing the NFS share and all VMs on that Datastore disappeared; this problem persisted even after I restarted the NFS Server in the Nexenta Web UI. I had to reboot the Nexenta VM entirely to have everything responsive again (I did not have to touch VMware at all, so that points to Nexenta in my mind).
You might want to try the new VSA... it is tuned for VMWare.
It is in beta now, but it is basically the same code that will launch.
I would be happy to try it out, but I'm unsure of how that will solve some of the issues I've been seeing.
I believe the appropriate link is http://info.nexenta.com/VSABetaForm_Lpg.html (the one you sent implies it was already installed on a system)
Hi, I've just noticed the same problem when viewing the networking performance using ESXtop.
The Virtual Machine Guest instance of NexentaStor shows a continuous 48% (+/-3%) %DRPRX on its virtual NIC.
I've looked this up on lots of forums and the standard advice seems to be that this is a known side effect of the E1000 network drivers; the rule of thumb has been to change to the VMXNET3 driver, which, as I understand it, is incompatible with the present version of NS.
This is fairly easy to show, simply by initiating a large copy to/from the NS appliance, and it appears regardless of whether I use CIFS or NFS.
I feel like this surely has to be a false-negative, as if I were actually losing ~50% of my incoming packets, things would feel much slower.
Other guest VMs using the same Vswitch bound to the same physical NIC do not exhibit this behavior.
ESXTOP.jpg - SYCORAX is the NS instance and ITHAQUA is the Windows VM (132.2 KB)
yeah i find that hard to believe too. that high of dropped packets would screw tcp performance into the ground. constant slow start etc...
Dan Swartzendruber wrote:
yeah i find that hard to believe too. that high of dropped packets would screw tcp performance into the ground. constant slow start etc...
Yes, the 50% metric seems very high, but there is most definitely an impact (exactly how much is debatable)... The fact that VMWare ESXi THINKS the Nexenta VM is under heavy CPU/network load affects its ability to schedule resource use appropriately and, in turn, to make the most of the underlying hardware.
Simple tests using dd like what I have shown above demonstrate that the Nexenta VM can push a LOT of data to/from disk natively, but as soon as you try to use that storage over iSCSI/NFS the performance is (roughly) cut in half. I agree that VMWare-reported packet loss of 50% does not necessarily mean that performance will be reduced by 50% as well. I have never seen the data points correlate perfectly to give us a smoking gun, but the numbers are close enough that, to me, the network performance should be placed under a microscope.
Adam: have you performed similar dd tests to read/write from your ZFS pools natively from the Nexenta OS and then via an NFS share? I believe a "false negative" would show that the exact same dd command in both scenarios would be nearly equal.
Thanks for the ideas, folks...
Here's a ~5 GB transfer via NFS, though completely local, so I would assume the data still has to traverse the network stack. Without a proper VMXNET3 driver that means we're going to be limited to wire speed; understandably we're not going to get the 125 MB/s theoretical max, but somewhere between 75 and 90 MB/s would be a conservatively nice target.
root@sycorax:/export/home/admin# mount 192.168.0.20:/volumes/Data/test /mnt/nas01
root@sycorax:/mnt/nas01# dd if=/dev/zero of=sometestfile bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 223.76* seconds, 24.5 MB/s*
*These numbers were averaged over four consecutive runs. Not so good, eh?
So let's do it again, this time keeping it all inside the file system (same zpool, no NFS mount). Admittedly, we should see a lot better performance, since it's not traversing the network stack...
root@sycorax:/mnt/nas01# dd if=/dev/zero of=/volumes/Data/test/sometestfile bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 11.7579 seconds, 446 MB/s
So I guess that's pretty definitive, we're taking a massive hit on network performance for unknown reasons.
Any insight is appreciated! AJM
Sorry to hear you're having problems, but i'm glad to hear i'm not the only one observing these bottlenecks. I agree with your assessment (and what you posted in that other thread). Misery loves company I guess.
RE: what exactly does "Networking limits" mean - Added by Linda Kateley 11 months ago
Can you retry with a block size that would match what ZFS will do? Like 8k, 16k... up to 128k.
One tweak that usually helps is setting up jumbo frames.
If you are running ESX 5, using VAAI has been seen to improve performance. For ZFS, a ZIL device increases performance, as all NFS traffic is sync.
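If it helps anyone run that comparison, here is a rough sketch of a sweep across the block sizes Linda suggests (the dataset path and the count are placeholders; point them at your own pool, and note that with a fixed count the total file size grows with the block size):
for bs in 8k 16k 32k 64k 128k; do
  echo "block size: $bs"
  # conv=fdatasync forces the data to disk so the reported rate isn't just cache
  dd if=/dev/zero of=/volumes/tank/VirtualSAN/ddtest bs=$bs count=100000 conv=fdatasync 2>&1 | tail -1
  rm /volumes/tank/VirtualSAN/ddtest
done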
Linda, the NFS sync thing is for NFS writes initiated by ESXi (on behalf of virtualized guest disk controllers). If a guest is doing a loopback NFS mount, that should not be a factor (unless he mounted sync?). 24 MB/sec is abysmal.
For the record, on my Dell PowerEdge 2900 (ESXi 5.0u1 hosting a Nexenta 3.1.3 CE VM), from the OS's shell:
dd if=/dev/zero of=/volumes/tank/myNFSShare/file bs=8k count=1000000 is able to write about 220MB/s
If I mount that same share (i.e. loopback) on the Nexenta shell, over the vmxnet3 interface
mount myserverip:/volumes/tank/myNFSShare /foo
dd if=/dev/zero of=/foo/file2 bs=8k count=1000000 is able to write about 145MB/s
So clearly the Nexenta / ESXi network is a bottleneck... but just how much of it can I isolate as being "Nexenta's fault"?
I came up with another test that you might want to perform on your own setups. The idea is to test how much I can push over a vmxnet3 interface, removing bottlenecks wherever I can. So I created two Ubuntu Server x64 10.04.4 VMs with vmxnet3 interfaces.
As a "control" here is the first command on host1:
dd if=/dev/zero of=/dev/null bs=8k count=1000000 which resulted in about 2GB/s (because the source and destination are RAM only, there are no disk bottlenecks)
Then to test vmxnet3 I executed these two commands, essentially dd'ing RAM from one VM over a TCP pipe to another VM (the listener on host2 is started first):
On host2: nc -kl 1234 > /dev/null
On host1: dd if=/dev/zero bs=8k count=1000000 | nc host2sip 1234
I was able to push about 175 MB/s between the two vmxnet3-enabled VMs. That's much more of a bottleneck than I expected! If you compare the raw RAM-to-RAM (best case) speed of 175 MB/s with the Nexenta NFS write-to-disk speed of 145 MB/s, it doesn't look too bad: Nexenta + NFS are imposing a ~30 MB/s penalty, which I don't think is too terrible.
I was thinking maybe my old 2900 isn't cutting it anymore, so I performed the same Ubuntu/vmxnet3 VM-to-VM transfer described above on a badass Dell R710 and was able to get about 330 MB/s from one Ubuntu's RAM to the other. I haven't tested any Nexenta builds on this server, but I think it is safe to say it would be faster.
I'm not completely satisfied yet, but I think the main constraint in all cases is the TCP-based transport, which is unfortunately used for all VMWare storage except DAS and FC. I might try a better network tester like iperf to see the true vmxnet3 bottleneck, but that is somewhat of a moot point IMO since we would still be held to NFS or iSCSI performance constraints.
I'd love to hear from anyone who can get better than 330MB/s between two VMs using a simple sequential block transfer like dd (or iometer) over TCP.
I'd also like to hear some of the sequential read/write performance numbers for people who are running a physical nexenta server over 10 GBe. Can you push better than 330MB/s over NFS or iSCSI?
Matt, for inter-VM communications, have you tried enabling the Unrestricted setting for the two VMs in the Hardware => VMCI device entry? I'd be curious how much (if any) that might help.
No, I hadn't enabled VMCI; I will definitely try that. So many variables to keep track of!
Just tried enabling VMCI on both Ubuntu VMs, rebooted, and tried again with no performance improvement. It is almost identical (175 MB/s).
Soft reboot? I know some ESXi settings don't take effect unless you power the VM off and back on. If that doesn't help, you may have just run into a limitation? Although I have gotten anywhere between 3 and 5 Gb/sec raw TCP bandwidth between two VMs with VMCI on.
It was a hard reboot; VMWare wouldn't even let me change that setting without shutting the VM down cold first.
I'm planning on posting something similar to VMware's communities. I am really surprised at the bottleneck I am experiencing... I almost hope I messed up somewhere!
I wonder if this is some nc lossage? I just tried it both ways between an Ubuntu 10 and an Ubuntu 12 VM. I can consistently get about 5 Gb/sec with the e1000 vNIC. Or I guess it could be vmxnet3 lossage?
I'm not sure what you mean by "nc lossage"? Can you share how you tested your setup to achieve that kind of performance?
BTW, this morning I tried iperf on the faster R710 server between two vmxnet3-enabled Ubuntu 10.04.4 VMs. As expected, the TCP performance was MUCH better than my simple dd-over-netcat test... iperf showed 16 Gbit/s over TCP between the VMs, which is pretty awesome. I am fairly certain that iperf is multithreaded/multiprocess, so it is no surprise that it would outperform a single-threaded netcat transport.
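For anyone who wants to repeat the iperf test, it is roughly the stock client/server pair, something like the following (the parallel-stream count, duration, and the host2ip placeholder are example values, not necessarily the exact flags used above):
iperf -s                       # on the receiving VM
iperf -c host2ip -P 4 -t 30    # on the sending VM: 4 parallel streams for 30 seconds
The -P option is what runs multiple streams in parallel, which is presumably why it can fill the pipe where a single netcat stream could not.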
I found a good article that highlights what I have seen hinted at on forums and in documentation a few times:
In short, to maximize single-threaded NFS v3 performance (even on a 10 GbE link) you have to set up multiple NFS shares over multiple IPs and distribute your VMs across the different NFS datastores to get a poor man's load balancing (similar to MPIO on iSCSI). This lets you maximize your network pipe and push your ZFS volumes to the max.
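As a rough sketch of what that looks like on the ESXi 5.x side (the IP addresses, share paths, and datastore names below are invented for illustration), you give the storage VM more than one IP, export a share per datastore, and mount each one separately:
esxcli storage nfs add --host 192.168.0.20 --share /volumes/tank/ds01 --volume-name nfs-ds01
esxcli storage nfs add --host 192.168.0.21 --share /volumes/tank/ds02 --volume-name nfs-ds02
Then spread the VMs across nfs-ds01 and nfs-ds02 so each datastore gets its own NFS session and TCP connection.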
Oh, and by the way, VMWare's documentation states that VMCI is only useful if you program your network-enabled applications to use it instead of the OS's native TCP/IP stack. I am not sure if using VMXNET3 inherently uses VMCI, but my hunch says no (otherwise VMWare's documentation would just say to use VMXNET3).
By 'nc lossage', I meant some kind of suboptimal implementation. Can't say I've ever used it, so I have no idea :( Interesting article you linked. There is one possible way to get some load balancing in a less complicated way. Scott Lowe's 5.0 book says that ESXi will load balance with a single NFS datastore if you use multiple DNS names to reference the multiple IP addresses. Don't remember the specifics, and my book is at home right now. I didn't know iperf is multithreaded. I'm not sure I understand why that would affect a bulk throughput test, though - usually it helps workloads that are IOP-sensitive, no? A single sending thread and a single receiving thread should be able to max out the 'wire' if you have a large enough send window, I'd think? I'm going to retry my iperf test with ubuntu 10/12 only after adding a vmxnet3 VNIC to each one...
I don't have Scott Lowe's book, but a recent blog post I found says one NFS datastore accessed via multiple DNS/IP paths will not result in better performance. To get the extra performance, you have to use multiple datastores the way I suggested above.
So the DNS round-robin trick might give you more path resiliency, but it looks like no additional performance.
I am curious if Scott's book conflicts with the article here:
Holy smokes! I removed the e1000 vNICs on both machines, replacing them with vmxnet3, and re-ran my iperf test both ways. I'm getting just about 23 Gb/sec!
Interesting. I'll have to re-read that part of his book. Maybe I am misremembering. Or maybe he is wrong, but he seems to have a lot of cred.
One thing I am curious about: the first article you linked is specifically about ESXi 4.x, whereas I am referencing the book on 5.0 - it's possible they changed things there. I will check when I get home...
I don't know how good the CPU is on Matt's R710. My host is an X9SCL-F motherboard with an E3-1230 Xeon CPU (quad core, 3.2 GHz) [in other words, a whitebox :)]. I don't know if it helps or not, but I have VMCI unrestricted on both VMs.
I made VMCI unrestricted on the Dell R710 and I am not sure it made a difference. After running iperf over and over, my speed ranged from 17 Gbit/s to 21 Gbit/s. The Dell R710 has 2 x Intel X5680 CPUs (6 cores x 3.33 GHz) and 128 GB RAM... these boxes are in active use running roughly 20 powered-on VMs, so there is bound to be some performance impact.
I gave the iperf/Ubuntu VMs more vCPUs and it didn't seem to improve the speed... but 21 Gbit/s for a single VM with a single vCPU is pretty good :)
Back to the original topic, though: I think my three-generation-old Dell 2900 with 2 x E5420 CPUs (4 cores x 2.5 GHz) just can't push network traffic that fast. Intel has made huge advancements in their CPUs regarding virtualization in the last few years, so this isn't a huge surprise.
Dan - Please let me know if a review of your Scott Lowe book provides new insight. I agree, he is one of my top 10 preferred VMWare bloggers.
Yeah, I'm guessing when you are pumping that much data through a virtual wire, it's all about CPU cycles. I will update you when I re-read that chapter...
I read the section at the bottom of the page in question: page 330 of "Mastering VMware vSphere 5" by Scott Lowe. He explicitly states that using a DNS name with multiple IPs in the NFS config for the share will load balance, using both links and increasing throughput. I infer this is new in 5.0. I don't currently have a spare NIC to test this - I will see if I can scare one up.
The latter part of this URL confirms that NFS performance will improve in ESXi 5.0 if you use round-robin DNS.
He is a senior storage exec at EMC, which owns VMWare, so I believe him :)
So that's good for me at home where I run free ESXi 5, but bad at work where I run ESXi 4.1.
To speed up VMWare I always need to do two things:
- Make sure the MPIO policy is round robin.
- This can be done from the GUI: select the storage device, view its properties, then Manage Paths and set the policy to Round Robin.
- Change the path-switching IOPS parameter from the default of 1000 down to 1. From the CLI:
- esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=naa.xxxxxxx
- To view the available devices: esxcli storage nmp device list
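If you have more than a couple of LUNs, a loose way to apply the iops=1 setting to every naa device is something like the following (a sketch; check the device list output on your own host first, since the awk pattern assumes each device entry begins with "naa."):
for d in $(esxcli storage nmp device list | awk '/^naa\./ {print $1}'); do
  # devices whose path policy is not round robin will simply return an error for that line
  esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=$d
done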
I still see some inconsistent behaviour (latency lags I can't pinpoint). Writing from a VM to a pool of 3 mirror sets gives me up to approx. 180 MB/s, which almost matches local speed. Reads, however, do up to 380 MB/s (I have 4 x 1 Gb available) when cached in RAM, but when not cached in RAM they drop to approx. 160 MB/s, while local speeds are approx. 500 MB/s. These are sequential reads measured with iozone (16 threads simultaneously). I would have expected better reads, but this may be because ZFS does COW and might have caused fragmentation of my data during these benchmarks, resulting in more random reads during the sequential read cycle.
Be careful with the compression setting on iSCSI zvols, as the iozone benchmark data gets reduced enormously by compression (it is easy-to-compress data). The measurements above are with compression disabled.
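For anyone wanting to reproduce these numbers, an iozone run of the kind described is roughly the following (record size, per-thread file size, and thread count are example values; -i 0 is the write/rewrite test and -i 1 the read/reread test):
iozone -i 0 -i 1 -r 128k -s 4g -t 16
That is 16 threads doing sequential writes and then sequential reads with 128k records. Running it once against the local pool and once against the iSCSI/NFS datastore gives a like-for-like comparison, and, as noted above, leave compression off on the test zvol since iozone's default data pattern compresses very well.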