iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5
Added by Ashley Watson about 1 year ago
Hi,
We have presented multiple LUNs to our ESX5i farm via NexentaStor 3.1.1. I was in the process of storage vmotioning around 40 VMs onto the NexentaStor box as part of a test. I believe vSphere was running 8 in parallel and queuing up the rest (as expected). After around 30 minutes, I noticed the svmotions starting to hang (about 6 VMs had completed). Eventually, after more than 30 minutes, I was able to cancel all queued and in-progress svmotions. After that, the iscsi target on the NexentaStor box was unresponsive (in terms of iscsi - everything else worked fine, including the webui). It was only once I rebooted the NexentaStor box that the iSCSI connections to the box from the farm started to operate properly again (but the storage drop out obviously causes multiple issues with our VMs). I see this interesting support note from VMware for vSphere5 in relation to svmotion failures and VAAI - could this be the same issue under NexentaStor? http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027808
On a related note, when things work, we are seeing some extremely impressive IO out of a SuperMicro chassis with 15x 6Gb/s 2TB SATA drives + 1 SSD + 24GB RAM (1GbE and 10GbE interfaces) with USAS2-L8I HBA cards (with a couple of $2 UIO->standard brackets ;-)).
I see that Nexenta was powering VMworld labs but this seemed to be presenting storage as NFS rather than iSCSI - any real world experiences with vSphere5 and 3.1.1 with iscsi and VAAI and high loads?
Replies
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
We again saw a storage hang - this time it was a single storage vmotion coinciding with a VM deletion... it looks suspiciously like a locking issue caused by VAAI. Has VAAI support on NexentaStor 3.1.1 been fully reliability tested with vSphere5/iscsi? Specifically the HardwareAcceleratedLocking feature? Does Nexenta recommend disabling VAAI on each ESXi host as per this URL?
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
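For anyone following along, disabling VAAI on an ESXi 5.x host comes down to three advanced settings, one of which is the HardwareAcceleratedLocking option asked about above. The following is a minimal sketch, assuming shell access to each host. The names DRY_RUN and set_vaai are illustrative and not from this thread, and by default the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: print (or, with DRY_RUN=0, run) the esxcli commands that
# switch the three VAAI primitives off or on for a single ESXi 5.x host.
# DRY_RUN and set_vaai are illustrative names, not part of any VMware tool.
DRY_RUN="${DRY_RUN:-1}"

VAAI_OPTIONS="/DataMover/HardwareAcceleratedMove
/DataMover/HardwareAcceleratedInit
/VMFS3/HardwareAcceleratedLocking"

set_vaai() {
    value="$1"   # 0 = disabled, 1 = enabled
    for opt in $VAAI_OPTIONS; do
        cmd="esxcli system settings advanced set --int-value $value --option $opt"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"      # dry run: show what would be executed
        else
            $cmd             # real run: must be executed on the ESXi host itself
        fi
    done
}

set_vaai 0
```

Calling `set_vaai 1` prints (or runs) the commands to turn the primitives back on. Note that later tests in this thread reproduce the hang with VAAI both enabled and disabled, so this is a diagnostic step rather than a fix.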
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
We disabled VAAI on all our hosts as a precaution. After around 12 hours the iscsi target dropped out to random hosts under high load, and this caused chaos with our vSphere5 cluster. I have spent the last 12 hours recovering from the failure. From all the tests I have run, I can only conclude that the combination of iscsi/vSphere5 under NexentaStor 3.1.1 is just not ready for prime time. I don't believe the issues we are having are specific to vSphere5, as our other iscsi SANs (HP MSA 2000G1s/G2s) are running perfectly. The annoying thing to us is that when Nexenta works, it works really well - but when storage drops out it's a real disaster. We were also considering a shift to 8Gb/s FC infrastructure, but as we understand it the FC plugins are built on top of the Comstar stack - so I guess there is a high chance we'd get the same drop outs via FC. One option we have is to junk iSCSI in favour of NFS - does 3.1.1 fully support VAAI over NFS? Why was storage presented to VMworld labs with NFS rather than iscsi? Are most VMware vSphere5 shops running Nexenta 3.1.1 using NFS until iscsi reliability can be proven?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Oh Dear,
I am in the process of upgrading my ESX hosts to version 5 and then migrating all VMs to a Nexenta Enterprise Edition box running via the Fibre Channel plugin. The things I am reading from you don't get me excited at all.
Yes, Nexenta was powering the VMworld Labs and they had 10Gb + NFS. NFS in my opinion is the best of any storage type (ahead of FC or iSCSI) and I would use it if I had a 10Gb network. The only disadvantage is that it currently doesn't support VAAI.
Ashley, is this version 3.1.1 that you are using installed from scratch from ISO or have you upgraded from 3.0.5 ?
It would be good if someone could look into this closely, as the situation you are describing seems to be rather a stress situation (svMotion of many VMs at the same time) and maybe not one commonly found even by QA.
Thanks
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
hey Fredy, this was a fresh install of 3.1.0 CE and then an upgrade of the OS to 3.1.1. Our farm consists of 6 UCS blades (12 cores, 96gb ram each) + an HP DL360G7 (12 cores,72gb ram - used as our management/backup host). Our Nexentabox is a 24gb ram, 4 core, 15x2tb spindles+1x128gb SSD L2ARC cache using 2xUSAS2-L8I in a SuperMicro 16 bay chassis with a quad 1GbE+a dual 10GbE. The NexentaStor iscsi service again crashed this morning under heavy load. When it drops out, any vSphere host running any VMs with their storage off the Nexenta box becomes unmanageable until the NexentaStor box is rebooted. I saw similar behaviour when we were presenting some ESX4i hosts storage via Openfiler more than a year ago when that also used to drop out under heavy concurrent load (due to issues with SCSI reservations in that case).
NFS sounds great but how do you multipath under VMware vsphere5 other than connecting to the same NFS share on different IP addresses - or do you create a bond of 2 interfaces together?
Going forwards our UCS environment will only be able to run with FC attached storage (not our call) - which means we are going to be very nervous about using Nexenta as we could suffer from these same types of storage drop outs. It's looking likely our future primary storage will be presented from a group of FC presented EVA4400s instead of what I hoped would be a Nexenta cluster.
The other issue that has been pointed out is that on a normal hardware SAN with dual controllers, it's always possible to upgrade/restart individual controllers so the storage hopefully never goes down, whereas the only way to achieve this under Nexenta is via an HA configuration, which adds complexity and cost to a base configuration. I would imagine that in most VMware farms, dual controller and/or HA is a basic requirement.
cheers
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Hi Ashley,
I have another post where my CIFS crashed twice in a month because a single disk failed. I am double checking whether it could be a hardware issue and would ask you if you have considered that possibility as well. In my case it doesn't seem to be, otherwise the whole system would stop, and I can still ping and SSH into the box, although CIFS, NMC, NMS, etc. have crashed and only a reboot brings them back.
With regards to NFS and multipath, my understanding is that on the SAN side you will have either aggregated interfaces (if the network connections go to the same switch or to stackable switches) or, if not, IPMP, which is the same as bond mode 1 on Linux (Active/Passive). With 10GbE you don't really need to use both links as active as you have more than enough bandwidth, but if you have aggregated interfaces and multiple IPs you will end up using both. One thing I would point out is to watch your processor usage. Sometimes NFS can be a bit more CPU consuming than iSCSI/FC, so maybe consider adding a second 4 core processor. The advantages start with its simplicity and the smaller number of Datastores to manage, and the performance is the same if not slightly better. One more thing to point out is the number of spindle disks you have (15) relative to the number of ESXi servers (6). Have you checked that they are not running out of IOPS under heavy load? I see you don't have a ZIL device. I am wondering if under heavy load the system is timing out somewhere, and something similar is happening to what happened with mine when a disk failed recently. In your case disks have not failed, but because of some timeout the system might get confused by the latency. Consider removing one 2TB disk and adding a 32GB SLC SSD for ZIL; you will gain a lot of write performance.
On my environment I have two head nodes and HA. Yes it's the way to do upgrades. Although it works most of the time I have to say that I am not 100% happy with its maturity at the moment. I had to do several manual changes for tuning which I don't feel comfortable at all and would much rather they came out of the box, so in my opinion the HA plug in has to be worked out better yet.
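The spindle-count question above can be put into rough numbers. This is a back-of-envelope sketch only: the 15 spindles and 6 ESXi hosts come from the thread, but the figure of 80 random IOPS per 7200 RPM SATA drive is an assumed rule of thumb, and the helper name estimate_iops is illustrative. It deliberately ignores pool layout (mirror vs raidz), ARC/L2ARC caching and the missing ZIL device, all of which change the real figure considerably.

```shell
#!/bin/sh
# Rough share of raw random IOPS per host.
# Arguments: number of disks, assumed IOPS per disk, number of hosts.
# Integer arithmetic only; ignores RAID layout and caching entirely.
estimate_iops() {
    disks="$1"
    per_disk="$2"
    hosts="$3"
    total=$(( disks * per_disk ))
    echo "total=$total per_host=$(( total / hosts ))"
}

# 15 spindles and 6 hosts are from the thread; 80 IOPS/disk is an assumption.
estimate_iops 15 80 6
```

Under those assumptions each host gets a share of only a couple of hundred raw random IOPS, which is why the question is worth checking even though later tests in the thread point to a locking problem rather than disk saturation.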
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
hi Fredy, I'm 100% sure this is not a hardware issue. When the iscsi traffic dropped out, I was still able to access CIFS/NMC/NMS/SSH etc into the box - there were no errors showing anywhere.
I could understand if the initial crash was caused by I/O overloading, but the second time the iscsi services dropped, there was no significant load. We currently have 3xHP MSA iscsi devices and, while they are not the fastest in the world, we never have these types of drop outs under similar load patterns. The MSA devices are carved up into vdisks of 10 or 12 spindles (typically in raid10). The test I was running was taking a 4TB slice (consisting of around 30 VMs) from one of the MSAs and storage vmotioning it onto the Nexenta box. IOs were rock solid on the MSAs - iscsi dropped out on Nexenta even after the initial storage vmotions.
While it would be great to add a ZIL device to the mix, I'm still really concerned about the maturity and stability of block presented storage via iSCSI/FC - particularly with VSphere5.
Another issue I'm dealing with is that when we reboot an ESX5i host (both DL360G7 and Cisco UCS blades), I see exceptionally long ESXi boot times (around 40 minutes). As soon as I remove the iscsi presentation to the Nexenta host, the boot time returns to normal - around 5 minutes. When I raised this with VMware as part of our VMware support agreement, their response was that as the Nexenta box is not on the HCL they are unable to support us, which unfortunately makes Nexenta a non starter for anything other than a backup target. IMHO, Nexenta need to take the VMware HCL very seriously as it's a show stopper for their target audience - but can understand that this is difficult with open storage. As a minimum, Nexenta should certify a generic SuperMicro chassis + common drives + common chipset controller like SASLSI2008 on GbE/10GbE for iSCSI and NFS.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Hi Ashley,
I share the same concerns as you. I haven't heavily tested iSCSI/FC yet but am trying to. Given I have had the same kind of crash using CIFS, I am concerned that it can affect iSCSI/FC as well. Unless someone can give a clue about the cause, the doubt will remain.
About the VMware support, that's a bit of a problem. Although some of the people there might know Nexenta and know that it isn't on the HCL, depending on the person you get, if he doesn't have much of a clue about the problem he will try to push it back where he can before spending some time finding out whether the issue is actually the storage or vSphere 5. On the other hand, I somewhat understand Nexenta's decision not to spend a lot of money on certifying many solutions. Perhaps if that money is really invested in R&D, that can result in much better features and quicker bug fixes and QA. At the moment it seems that still isn't happening. Maybe Nexenta, along with other software vendors, could push VMware to adjust the compatibility lists (having something similar to an HCL for software solutions) in a way that VMware could not straight away turn the issue back to the customer just based on the fact that he is using Nexenta, depending on the issue of course. I agree with you that for many people that will always be a concern and a stopper.
I would also say that Nexenta could spend a bit more time on QA and on producing up-to-date documentation about integration, recommendations and tuning with the various hypervisors. The last document from Nexenta is about VMware 3.5, which is quite old. That could help avoid many problems, and therefore avoid customers going to VMware support and being turned back.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by David Bond about 1 year ago
Ashley Watson wrote:
hi Fredy, I'm 100% sure this is not a hardware issue. When the iscsi traffic dropped out, I was still able to access CIFS/NMC/NMS/SSH etc into the box - there were no errors showing anywhere.
So do you have iSCSI, CIFS, NFS and management on the same interface? What NIC do you use, an Intel X520?
We are currently using esxi4.1 and red hat servers with 3.04 and now 3.1.1, not fully tested 3.1.1 yet. So far no iSCSI drop outs. What do you consider high load?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
No, there is one box that does only CIFS and had a single disk fail a while ago, which seems to have stopped the whole system, even though the disks are configured as RAIDZ2 (that happened twice). It is supposed to handle the failure smoothly but it only gets fixed by a reboot. Then there is another box that does FC/iSCSI (for VMware stuff (ESXi 4.1)) which I am going to test and stress to see if anything similar happens to it, even when removing a disk intentionally.
I wouldn't say that high CPU load would cause that, but rather something to do with the way the system handles the disks, timeouts and failures. Under heavy IO it might be getting confused.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by David Bond about 1 year ago
I see that you are using an 8 port sas card with 16 SATA drives, so they are on an expander, what SATA drives are they?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
- When the iscsi services crashed, CPU usage on the box was around 30% (according to the web ui). All other services carried on running. The CIFS traffic was not impacted. I could still ping all the NICs (on all interfaces) at the time of the crash.
- I'm using 2x8 port cards (2xUSAS2-L8I - LSISAS2008 chipset) and each of the 16 disks (15xHDD+1xSSD) is attached directly to a separate port. I spec'd it out like this to avoid the use of SAS expanders, to supposedly minimise the issues I'd have (silly me!)
- I'm using Seagate Barracuda XT 2TB drives - they are 6Gb/s. They were at the sweet spot in terms of price/performance (ST32000641AS). The firmware is marked as CC13, the date of manufacture is marked as Aug 2011.
- I have 10GbE on one pair of 10Gbe NICs - this is used for CIFS (AOC-STG-i2 - 82598EB chipset)
- I have 4x1GbE ports (2 nics onboard for front end traffic+management WebUI), 2 nics via an HP NC360T (Intel 82571EB chipset) - for iSCSI.
- Perhaps it was an I/O stall of some sort, but none of our iscsi SANs behave in the same way under load.
- In terms of high loads - I can only say that the problem occurred at least 3 times - the initial one under simultaneous svmotions, the second crash seemed to be LUN locking contention issues (I guess) while delivering low load iscsi traffic to around 20 VMs.
- We are busy attempting to set up a lab environment (the hardware is similar but with different raid controllers/disks etc.) so we can attempt to reproduce this issue without causing any downtime.
As always any advice is appreciated!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Just in case it helps anyone else: we set up a 2nd box with NexentaStor on it, and after updating I saw this:
NMS Version 3.1.1 (r9415)
NMC Version 3.1.1 (r9415)
NMV Version 3.1.0 (r9371)
OS Version 3.1.1
However, on the original box which had the issues, only the OS Version was showing as 3.1.1 - the other components were showing as 3.1.0. Perhaps 3.1.1 CE hadn't been pushed to the repositories at the same time as the other components. I've now updated our original box so the builds match the above.
We are going to run some more tests to see where things go from here on in.
thanks for all your help so far guys - we appreciate it.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by David Bond about 1 year ago
Has the new install been behaving better so far?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
We've been running some tests on a test box, with some very interesting results. The test box is running 3.1.1.
We present say a 2TB LUN to our ESX lab (2xESX5i hosts) and another 2TB LUN to HyperV. The ESXi LUN was formatted as a 1MB block size in VMFS5 format (new native VMFS5 standard). Both Zvols were created on the same ZFS volume. On our test box we don't have an SSD l2arc and we don't have a ZIL device.
We set multiple SVMotions running, moving VMs (thin provisioned) from an MSA 2000 G1 to the NexentaBox. We check vCentre and see that 8 SVmotions are running. All goes well until we attempt to cancel one of the SVmotions. This then causes some sort of blocking issue which blocks all other traffic to that specific LUN, effectively hanging the other svmotions and anything else running on that LUN. This causes LUN timeouts and effectively "breaks" the storage to the ESXi cluster. The other LUN we had presented to the HyperV host (on the same storage pool) continued to run fine throughout this....
We can reproduce this issue with VAAI both disabled and enabled on the ESXi hosts.
When we had the production storage fail in a similar way (see top of thread), it was when we tried to delete a VM on the same LUN that was having multiple storage vmotions pushed to it - again pointing to some form of blocking/deadlock issue.
so this tells me;
- the disks themselves are not the cause. %Busy state never went past 80% on the disks themselves.
- because the HyperV server continued to run from the same Nexenta instance (different LUN), this tells me that the iscsi service under Nexenta didn't technically drop out - at least for the other LUNs anyway... this eliminates any hardware related issues from the mix.
- the fact everything was running fine until an svmotion was cancelled indicates this is definitely some sort of blocking issue.
- These tests were performed under ESXi build 469512, Build 474610 (released yesterday) addressed some VAAI related issues; http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2001075
We'll run some further tests over the next few days (like trying this with VMFS4 filesystem with 8MB block size with ESX5i hosts etc), but my questions in the meantime would be....
- How do I persuade VMware/NexentaStor to attempt to reproduce this issue - even if this turns up bugs in VMware, I'm unable to log tickets with VMware due to NexentaStor's HCL status?
- Is there anyone out there with access to at least one ESX5i host and one NexentaStorCE box (GbE) + one other iscsi SAN who could attempt a similar test?
cheers Ashley
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Matt Mabis about 1 year ago
Hey Ashley,
I wanted to confirm that I have also experienced the iSCSI drop issue with Nexenta 3.1.1, although I experienced it with OpenIndiana in my tests as well. These issues always seem to appear when I start heavy stress testing of the box. The DAVG on the hosts climbs significantly, even though usage on the Storage Server isn't correlating to the issue. With these high DAVG values (latencies from disk), extreme stress like svMotions or provisioning (using VMware View) eventually seems to trigger it for me: the LUNs all go APD on the hosts, and the only way I have ever been able to recover is by restarting the Storage Server. After it comes back up, the hosts recover. I have tried disabling/enabling comstar and iscsi but it didn't seem to work; only a reboot of the storage server ever seems to clear up the issue.
My Storage Server hardware is (X5603 Processor + 8GB ECC DDR3) 4 Nics (2 iSCSI) (2 Main networks with different subnets) all piped through the same Switch on different vLans and Subnets.
I also believe that Solaris has issues with my PCI-Express OCZ Revo card, as I saw a lot of latency issues from it when I tried using it as a LUN as well (now removed), even though it worked fine with other OSes (Openfiler was what I was using before).
I would have used the FC stack, which I was running on OF, but sadly Comstar doesn't play well with the old 2Gb Fibre switch that I have.
To answer your question from the VMware perspective though, I don't think there's much they will do on this issue, as it needs to be presented with hardware on the HCL.
I can confirm what you have been experiencing, though it doesn't make any sense why the iscsi channels seem to drop. But they do, and when they do the only way I have ever been able to recover is a reboot of the Nexenta/OpenIndiana box. I have tried Nexenta Core/Stor as well as OpenIndiana and all 3 gave the same results with Comstar. I can tell you that Openfiler also has issues with iscsi, and what seem to be abort handling issues in the SCST code.
Hope this helps.. Matt
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
We have been able to reproduce the issue in the following way;
- From a single ESX5i host running build 474610, we set 4 storage vmotions running from an MSA 2000 iscsi SAN through to a NexentaStor box with a 2TB LUN formatted as VMFS5 with 1MB block size. Only 2 storage vmotions run at one time.
- We wait for around 5 minutes until both storage vmotions are well under way, then we try and cancel 1 of them.
- We see the NexentaStor LUN "hanging" and the storage vmotion stuck in the cancelling stage. Any VMs running in this same LUN hang. When we try and browse the VMFS5 datastore, the vCentreUI hangs. If we try to connect directly to the ESX5i host, it hangs at the "Loading Inventory" stage. We see the IOs hitting the NexentaStor box drop to 0.
- When we remove the LUN mapping via NexentaStor, the storage vmotions cancel (expected behavior). We then present the LUN again and the VMs on the ESX5i host which were hanging start working again.
However, and here is the interesting thing....
- If we run the same tests to a NexentaStor box, except with the 2TB LUN formatted as VMFS3 with 8MB block size, we can reliably cancel a storage vmotion without hanging the host - however, for around 15 seconds immediately after the cancel, other VMs residing on the same NexentaStor LUN are seen to hang.
Can someone please try to reproduce this issue and advise a path forward - for now it looks like if we want to use NexentaStor and iscsi we should stick with VMFS3 on ESX5i until these types of quirks between NexentaStor/VMware are more widely understood.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Matt Mabis about 1 year ago
Hey Ashley,
I would mention as well that all of my LUNs that experience this issue are VMFS3... running ESXi 5.0 Build 469512
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by David Bond about 1 year ago
Have you tried it with esxi4.1 ? Is the behaviour the same, the hanging for a few seconds after? I will try it also when back at work on Monday with 4.1.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Hi Ashley,
Thanks for sharing these results with us. I am currently in the process of migrating my 4.1 hosts to 5.0 and the storage to Nexenta 3.1.1, so what you are describing certainly concerns me, and I am sure other people too. Before doing that I built a lab with a 5.0 environment with iSCSI to try to reproduce the things you are reporting. I will try that and report back soon on whether it also happens for me. I did a couple of Storage vMotions yesterday (putting a Datastore in maintenance mode, for example) and things seemed to go OK, but I will put more load on it and try a larger number of concurrent svMotions. I will also try to cancel a svMotion while it's running.
Just to clarify, you said you are moving stuff from the MSA to the Nexenta? Have you tried to svMotion stuff inside the MSA from one LUN to another with ESXi 5.0 and VMFS5 and cancel it to find out if that also hangs? Does this MSA support VAAI at all? I suspect it could be something to do with ESXi. As you said in a previous post, ESX build 474610 addressed some VAAI issues. Although it says it is for people using EMC VMAX, perhaps it is related if this is the same problem.
I have ESXi4 hosts connected via Fibre Channel to Nexenta and will also try the same tests in order to compare the results. However, in this configuration there is no VAAI integration. If I can reproduce any of the issues that you described with either ESX4.1 or ESX5.0 I will report back to Nexenta support, as this box is an Enterprise Edition and I don't think there is any difference from the Community version for what we are talking about.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Hello. Sorry for not discovering this thread sooner.
Some VAAI-on-3.1 things. First off, our VAAI is based on the T10 standards, so it will only work with T10-using VMware 5.0 or later.
Second off, if you can reproduce this hang, please, once it hangs, ssh in to the NS box, become root and utter these things:
dumpadm -y -s <a-directory-with-space>
Make sure dumpadm will create a kernel core dump where you have space, for example:
(0)# dumpadm
Dump content: kernel pages
Dump device: swap
Savecore directory: /var/crash/everywhere
Savecore enabled: yes
Save compressed: on
(0)#
My /var/crash/everywhere directory has lots of free space.
Then utter "reboot -d" (it's hanging, right?) and get the kernel core dump while it's hanging. Once I have that, I may be able to debug what's ailing you.
It was hard to get ESX5 working, as we had betas with some bugs. We may have missed something during testing, and if you have a reproducible problem, someone ought to look at it.
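Dan's steps above can be collected into one small script. This is a sketch under stated assumptions: the dumpadm and reboot invocations are the ones quoted in his post, while the df check, the DRY_RUN switch and the function names run and capture_hang_dump are illustrative additions. By default it only prints the commands, since the last step forces a crash dump and reboots the storage box.

```shell
#!/bin/sh
# Sketch only: configure the savecore directory, show the resulting dump
# configuration, then force a kernel crash dump with a reboot.
# DRY_RUN, run and capture_hang_dump are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"        # dry run: print the command instead of executing it
    else
        "$@"
    fi
}

capture_hang_dump() {
    crash_dir="$1"                      # a directory with plenty of free space
    run df -h "$crash_dir"              # added check: confirm there is room
    run dumpadm -y -s "$crash_dir"      # from the thread: set the savecore directory
    run dumpadm                         # from the thread: review the dump configuration
    run reboot -d                       # from the thread: dump the kernel and reboot
}

capture_hang_dump /var/crash/everywhere
```

The script is meant to be run as root on the NexentaStor box while the LUN is still hung, with DRY_RUN=0 only once the printed commands look right.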
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
thanks guys, I've been off for a couple of days - I'll be relooking at some of these things over the next couple of days and I'll run some more tests.
Dan, thanks for that. when I say hanging - the symptoms we see are the Vmotion cancel hanging and other VMs on that same LUN to the same ESX5i host hang. NexentaStor itself is not hanging. NexentaStor is still delivering iscsi traffic to other LUNs. Make sense? Will the kernel dump help in this situation?
I'm not sure if it's directly related, but when we attach NexentaStor iscsi storage to an ESX5i host, the boot time of the ESX5i host increases dramatically (reproducible on multiple hosts, different vendors, even nested ESX5i instances). I've been following a couple of threads here; http://communities.vmware.com/message/1830846#1830846 and http://communities.vmware.com/message/1830801#1830801 The interesting thing here is that the Nexenta target is being picked up as a VMW_SATP_ALUA array - is this correct or should it be presenting itself as a VMW_SATP_DEFAULT_AA? Initially I thought the boot delays were a VMware issue caused by what appeared to be repeated scans of the CDROM device node, but it's looking like it's an iscsi target related issue.
I did notice some strange storage path behaviour, with one of the paths to the storage being dead - perhaps that was related to the initial issue as well? I was only able to fix that issue by overriding the storage type with a VMW_SATP_DEFAULT_AA rule on the LUN.
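For readers wanting to try the same override, the check and the per-LUN rule might look roughly like the sketch below. It assumes ESXi 5.x esxcli syntax; the device identifier naa.EXAMPLE is a hypothetical placeholder, and DRY_RUN, run, show_satp and force_default_aa are illustrative names. It prints the commands by default. The exact rule used in the post above is not given, so this is one plausible form rather than a record of what was actually run.

```shell
#!/bin/sh
# Sketch only: inspect which SATP has claimed a device, then add a
# per-device rule for VMW_SATP_DEFAULT_AA and ask ESXi to reclaim it.
# naa.EXAMPLE is a hypothetical placeholder for the real device ID.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"        # dry run: print the command instead of executing it
    else
        "$@"
    fi
}

show_satp() {
    # The "Storage Array Type" field in the output shows the claiming SATP.
    run esxcli storage nmp device list --device "$1"
}

force_default_aa() {
    run esxcli storage nmp satp rule add --satp VMW_SATP_DEFAULT_AA --device "$1"
    # Reclaiming can fail while the device is in use; a host reboot also applies the rule.
    run esxcli storage core claiming reclaim --device "$1"
}

show_satp naa.EXAMPLE
force_default_aa naa.EXAMPLE
```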
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Correction to my previous post: where I wrote "Vmotion cancel", I meant svmotion. I wish we could edit posts!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
But traffic to the one LUN is hanging, you say, and that means I'd very much like to see the LUN's state in a kernel core dump (especially compared with ones that continue to move traffic, as you observe). I'm not sure about what we're presenting to VMware as what kind of array. (I'm new to iSCSI, having inherited things, so I'm learning as I go along, pardon any slowness or other newbie-signs.) Your communities.vmware.com pointers suggest it might be a VMware bug, and not ours.
Thanks for your patience, and I hope I can help!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Matt Mabis about 1 year ago
Hey all,
So far all the tests I have done indicate an issue with a locked LUN coming from the comstar side. When the issue occurs I see tons of aborts on the ESXi side, and the only way to release the LUN lock is to un-present the LUN and then re-present it by removing the LUN completely from comstar. This would also remove any stored locks on the LUN specifically.
I would recommend checking out that side first, as from the VMware side it looks like comstar is locking the LUN and not releasing it... I attempted to unlock the LUN using the vmkfstools command and it refused. The only way to release the lock was to remove the LUN and then re-add it, or reboot the host... you cannot just stop the service to release the lock... is there a way with comstar to release the lock on a LUN from the storage side?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
hi all, just had another hang. Slightly different set of triggers this time. We had 2 VMs running from the same LUN (VMFS5) presented via NexentaStor 3.1.1. Both VMs run on the same ESX5i host. One VM (DBTest1) is doing a database restore, the other VM (AppServerTest1) is running lightweight tasks. Both VMs are thin provisioned. While AppServerTest1 is running we add a 10GB thin provisioned disk via the vsphere client. The disk appears inside disk manager under the OS (win2008R2). The user then starts to format the disk (quick format - default). AppServerTest1 continues to run for around 15 seconds and then drops out and hangs. Management of the ESX5i host breaks. DBTest1 continues its DB restore.
Eventually after 30 minutes whatever was causing the lock seems to timeout, and I'm then able to manage the host again.
Dan, do you have enough info from this thread to be able to reproduce this type of locking issue in labs on your side or do you still need logs? I'd really like Nexenta to take a look at this (and to look at the slowness on boot, and the target portal group issue with multiple IPs through the webui)?
I have attached the vmkernel log - start at the end and work upwards and you'll see the issue.
vmkernel.log (997.6 KB)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Matt --> you may wish to try "stmfadm offline-lu
Ashley --> does the 30 minutes coincide with the database restore finishing on the DBTest1 vm? And while the vmkernel log is interesting, I'd also be interested in the output of "dmesg" on the NS box.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jarret Lavallee about 1 year ago
Dan,
I have been working on this in my lab. I can replicate this with or without VAAI enabled. I have not tried on ESXi 4.x or 3.x, but I can reproduce this easily on 5.x. I am using NCP, so it may be a little different.
To answer your question about the online and offline, it does not work. The LUN seems to be locked in comstar and we cannot free the lock. On the ESXi 5.0 hosts we try to get a RESERVE (0x16) or any other operation, the host fails the operation with an ABORT (H:0x5). I then presented this LUN to a non ESXi host and found that we failed to do READS (0x28) from the linux iSCSI initiator.
I tried to free up the LUN with the following operations:
- LUN resets from multiple initiators (vmkfstools --lock lunreset /vmfs/devices/disks/naa....)
- stmfadm offline-lu/ online-lu
- stmfadm remove-view / stmfadm add-view
- Restarting the comstar service.
The only way I was able to clear the LUN was to remove it from comstar, then re-add it, and then remap it:
stmfadm delete-lu
stmfadm import-lu
stmfadm add-view
After doing the procedure above, I was able to get the LUN back online and the datastore mounted and working in ESXi and on the linux initiator.
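For anyone wanting to script that remove / re-add / remap sequence, here is a minimal dry-run sketch in Python. It only builds and prints the three commands in the order given above and does not run anything. The GUID and backing-store path are hypothetical placeholders, and the assumption that `import-lu` takes the backing-store path while the other two take the LU GUID should be checked against `stmfadm` on your own box.

```python
import shlex


def lu_readd_commands(lu_guid, backing_store):
    """Return the stmfadm commands, in order, as argument lists.

    Assumption: delete-lu and add-view take the LU GUID, and import-lu
    takes the path of the backing store (for example a zvol device).
    """
    return [
        ["stmfadm", "delete-lu", lu_guid],
        ["stmfadm", "import-lu", backing_store],
        ["stmfadm", "add-view", lu_guid],
    ]


if __name__ == "__main__":
    # Hypothetical GUID and zvol path, for illustration only.
    for argv in lu_readd_commands(
        "600144F0000000000000000000000001",
        "/dev/zvol/rdsk/tank/vmfs-lun0",
    ):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Note that a bare `add-view` exposes the LU to every initiator; if the original view was restricted to a host group or target group, the re-added view would need the same restrictions.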
Let me know if there is anything you would like me to do and I will do it today or tomorrow.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Interesting that you can replicate this without VAAI. Keep in mind that the VAAI in NS3.1 is derived from the T10 standards, so it won't work with ESXi 4.x or 3.x, just 5.x.
What'd be really interesting is whether or not you can replicate this with NexentaStor 3.0.x and no VAAI (since you can replicate this with no-VAAI on NS3.1). IF that's the case, then it's a longstanding issue predating my arrival (which while annoying and needing a fix, at least gives me relief that I didn't break anything ;). If 3.0 works, then something in 3.1 broke things, and a what-changed approach can proceed.
Did the ABORT give a reason-code to go along with it?
Thanks, Dan
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
the 30 minute hang seemed to be unrelated to when the DB restore completed. After the 30 minute time period, the VM that was hanging appeared in a powered off state. When we powered the VM back on, and then tried to repeat the format the operation completed quickly.
I can also re-confirm that I see the same types of issues on ESX5i connecting to 3.1 CE with VAAI both enabled and disabled on the ESX5i host, as per VMware instructions, by setting these to 0 for disable or 1 for enable:
/DataMover/HardwareAcceleratedMove
/DataMover/HardwareAcceleratedInit
/VMFS3/HardwareAcceleratedLocking
(http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665)
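As a rough illustration of toggling those three settings on several hosts, the sketch below generates the `esxcli` command lines for ESXi 5.x. It is a dry run that prints commands only; how you deliver them to each host (SSH, PowerCLI, host profiles) is left open, and the exact syntax should be confirmed against the KB article.

```python
# The three advanced options named in the post above.
VAAI_OPTIONS = (
    "/DataMover/HardwareAcceleratedMove",   # full copy (XCOPY)
    "/DataMover/HardwareAcceleratedInit",   # block zeroing (WRITE SAME)
    "/VMFS3/HardwareAcceleratedLocking",    # ATS (COMPARE AND WRITE)
)


def vaai_commands(enable):
    """Return one esxcli command line per VAAI option (1 = on, 0 = off)."""
    value = 1 if enable else 0
    return [
        f"esxcli system settings advanced set --int-value {value} --option {opt}"
        for opt in VAAI_OPTIONS
    ]


if __name__ == "__main__":
    for line in vaai_commands(enable=False):
        print(line)
```

Disabling only `/VMFS3/HardwareAcceleratedLocking` would be the narrower test if locking is the suspect.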
I'm aware that under Nexenta, VAAI is not supported on ESXi 4.x or 3.x.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Hi Dan, I'm just wondering if Nexenta has been able to reproduce these issues internally and is working on a fix, and what sort of timescales we might be looking at?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Interestingly we just checked for updates on one of our NexentaStor CE boxes... Updates were found: nms-comstar 3.0.7-6495-r445 (and a couple of others). After updating, we now see the following:
NMS Version: 3.1.1 (r9415)
NMC Version: 3.1.1-6488 (r9435)
NMV Version: 3.1.1-6608 (r9456)
Can anyone possibly advise us what has changed in the comstar plugin and do any of the changes relate to any of the issues we have been seeing?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
We haven't been able to reproduce such hangs in our own testing as of yet.
Interesting, however, that you mention both 3.0.7 and 3.1.1, though. I will tell you that one known COMSTAR hang has to do with low-memory situations, and that 6495 (mentioned in your 3.0.7 patch) is the bugid that we used to fix that hang.
I'm not sure if you're on 3.0 or 3.1 now, however. I will tell you that it is possible you now have the fix for the well-known hang, and that you may not see them going forward. Most/all of the hangs we saw addressed by that fix were transient, manifesting as performance blips which subsequently recovered.
Assuming that you're up to date on all of 3.1 AND its patches, do you still see the problem?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
I was running 3.1, then updated to 3.1.1 when it came out, then ran an update last week and it found nothing. Then I ran it again yesterday and it found the comstar ones.
All was going well until we rebooted the storage device. We now don't see the iscsi tabs in there. We see the iscsi target service sticking in offline mode.
If I flick into the expert mode and try and enable the service from the command line to see what is going wrong, then I get an "unsatisfied dependencies" error:
root@NexentaBackup01:/volumes# svcs|grep iscs
online 16:56:26 svc:/network/iscsi/initiator:default
offline 16:55:08 svc:/network/iscsi/target:default
root@NexentaBackup01:/volumes# svcadm enable -s iscsi/target
svcadm: Instance "svc:/network/iscsi/target:default" has unsatisfied dependencies.
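If you want to watch for this state from a monitoring script, a small parser over `svcs` output is enough. The sketch below assumes the three-column STATE / STIME / FMRI layout seen in the output above and uses those two lines as sample input; the function name is made up for illustration.

```python
SAMPLE = """\
online 16:56:26 svc:/network/iscsi/initiator:default
offline 16:55:08 svc:/network/iscsi/target:default
"""


def services_not_online(svcs_output):
    """Return (fmri, state) for every listed service that is not online."""
    result = []
    for line in svcs_output.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and anything unexpected.
        if len(parts) != 3 or parts[0] == "STATE":
            continue
        state, _stime, fmri = parts
        if state != "online":
            result.append((fmri, state))
    return result


if __name__ == "__main__":
    for fmri, state in services_not_online(SAMPLE):
        print(f"{fmri} is {state}")
```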
I try going back to the previous rollback point and then again both running and not running the updates.. same issue. Now I have a storage target without any iscsi, just CIFS. the saga continues..
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
What would be really interesting is to see what dependencies are unsatisfied. Try uttering
svcs -xv iscsi/target
and see why it's complaining. My money's on "stmf" not starting properly.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
One more thing...
Assuming you haven't already, you REALLY REALLY need to upgrade the whole system. One can't patch bits-and-pieces at a time. Please make sure you're up to date on ALL of your NS components, including and especially the kernel. One of my colleagues pointed out that you may be updating selectively, which is always a VERY BAD IDEA.
Thanks.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
thanks for all your help. We never selectively update packages - we always just use "setup appliance upgrade" All we are trying to do is to treat NexentaStor as an appliance - in exactly the same way we treat an MSA 2000 or any other of our iscsi SANs.
yep, the errors were pointing to "stmf" not starting properly.
We were originally on the 3.0.4 then 3.1 release - I believe 3.1 was pulled and re-issued, then we have always just run the upgrade from the CLI, so I don't see how things could have gone bad unless there were issues with the NexentaStor repositories.
To get back to a known standard approach, we went to this page http://www.nexentastor.org/projects/site/wiki/CommunityEdition and downloaded 3.1.1. We reinstalled this onto our SAN, then ran an upgrade after this. This again found the Comstar updates (along with some others), and after updating, the versions read: NMS Version 3.1.1-6584 (r9461), NMC Version 3.1.1-6622 (r9469), NMV Version 3.1.1-6608 (r9456), OS Version 3.1.1.
We imported the existing volumes and now we've got the thing up and running again (after re-adding the iscsi mappings) We'll run some more tests now over the next few days to see if stability has improved.
Where can I find the release notes corresponding to the builds/patches so I can see what has changed when?
I thought the enterprise edition was essentially based on the same code base as the community edition except different usage restrictions/plugins/capacity. If that is the case, then why is the install media different?
A lot of sections of the NexentaStor.org site seem to be out of date/inconsistent, so it's quite difficult to determine the current state of the product or cross reference any bugs/issues etc to any bugtracker accessible to CE users.
What role does Nexenta see NexentaStor CE/NCP users playing in terms of helping identify issues/bugs etc?
From a technical viewpoint, we currently have the following issues with 3.1.1;
- Long start up times of ESX5i hosts (ESx5i stays on "vmwsatpalua loaded successfully" phase for around 30 mins) when connected to NexentaStor CE target. I have attached the vmkernel file. The bulk of the time is spent between; 2011-09-27T22:42:49.526Z and 2011-09-27T23:16:35.010Z. PM me if you need the full VMware support bundle and I'll send you the megaupload URL.
- Target portal group Web UI not working for multiple IPs - have to drop into the CLI to create them; e.g. "itadm create-tpg tpgiscsi10gb 10.42.3.105:3260 10.42.4.105:3260".
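Until the web UI handles multiple IPs, the CLI workaround above is easy to generate for several portals at once. The sketch below just assembles the `itadm create-tpg` command line from a list of IPv4 portals and validates the addresses and ports first; it prints the command rather than running it, and the helper name is hypothetical.

```python
import ipaddress


def create_tpg_command(tpg_name, portals):
    """Build an 'itadm create-tpg' command line from (ipv4, port) pairs."""
    args = ["itadm", "create-tpg", tpg_name]
    for address, port in portals:
        ipaddress.IPv4Address(address)  # raises ValueError if malformed
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        args.append(f"{address}:{port}")
    return " ".join(args)


if __name__ == "__main__":
    # Portal addresses taken from the example in the post above.
    print(create_tpg_command(
        "tpgiscsi10gb",
        [("10.42.3.105", 3260), ("10.42.4.105", 3260)],
    ))
```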
Cheers Ashley
vmkernel.log (310.2 KB)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ryan W about 1 year ago
CE and EE can be installed from the same media. It's not different, the enterprise key is what makes the plugins available to you for use.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Hi, As a matter of interest, we have just seen this link describing issues with SCSI UNMAP functionality - it appears to be impacting all T10 storage vendors ... http://virtualgeek.typepad.com/virtual_geek/2011/10/urgent-vaaithin-provision-reclaim-on-hold-workaround.html
I wonder if any of the issues we were originally seeing were also connected to this? Sounds like quite a major one to me. For the time being we have disabled VAAI unmap on our vSphere5 hosts as per this: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007427
cheers Ashley
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Yesterday I tried to upgrade my ESXi 4.1 U1 to ESXi 5 and it didn't work at all; it just caused problems with the installation which I am still trying to fix now. The reason? These T10 things. First of all the boot took forever. Once I removed the only LUN presented by Nexenta, ESXi booted normally. After that vCenter couldn't even connect to the hosts, and in the logs there were a lot of errors and warnings which are not normal to see. It seems something that is supposed to be a step ahead is causing these headaches. In marketing, "T10 Capable" might look good, but not in practice. What is scary is that nobody hears anything back from Nexenta about this. VMware said something, but there is still no clear solution. Guess we will have to wait until the media catches up on this.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
hey guys, I've been following various forums, and it appears there is progress on the long boot times (VMware specific issue to vsphere5) impacting multiple storage vendors. Looks like another serious VMware issue with vSphere5; http://communities.vmware.com/message/1840515#1840515
NexentaStor not being on the HCL makes logging formal support calls with VMware impossible - believe me I've tried! So I'd be interested to know from Nexenta's viewpoint what we are supposed to do when we have an issue with VMware (regardless if it relates to storage) and we need to log a VMware support request...
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by raphael schitz about 1 year ago
Hi guys,
I guess some of you are testing many things around and i wonder if some of those able to reproduce this issue would try with the /VMFS3/FailVolumeOpenIfAPD setting set to 1 Just an idea...
Good luck
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
A heads up from RyanW (thanks!) shows that VMware have now acknowledged the issue and are busy fixing; http://blogs.vmware.com/vsphere/2011/10/slow-booting-of-esxi-50-when-iscsi-is-configured.html
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Andy Shinn about 1 year ago
Ashley, If I am reading this thread correctly, the issues you were having were resolved by actually installing 3.1.1 from physical media and then doing an appliance update? Just curious because I am about to try the same for an issue I am having with my upgrade from 3.0.4 to 3.1.1. My version numbers in the console showed the OS at 3.1.1 but other components only at 3.1.0 which makes me believe the appliance update actually didn't work correctly...
Thanks for the very descriptive troubleshooting. I have been following this thread for a while and there is a lot of information here.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Yes, when we installed NexentaStor from the 3.1.1 install location of ftp://ftp.nexentastor.org/www/releases/NexentaStor-Community-3.1.1.iso.zip and then did an appliance update (and then a reboot), the versions should be the following: NMS Version:3.1.1-6584 (r9461), NMC Version:3.1.1-6622 (r9469), NMV Version:3.1.1-6608 (r9456), OS Version:3.1.1. If you see any other versions then I suspect that, for whatever reason, the update didn't go through properly.
After this most of our problems disappeared.
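To make that version check repeatable, something like the sketch below could compare whatever text the appliance reports against the four versions listed above. It is only an illustration: the input is assumed to contain "NMS Version:", "NMC Version:", "NMV Version:" and "OS Version:" labels as in this thread, and how you capture that text from the appliance is up to you.

```python
import re

# Versions reported in this thread after a clean 3.1.1 install plus update.
EXPECTED = {
    "NMS": "3.1.1-6584 (r9461)",
    "NMC": "3.1.1-6622 (r9469)",
    "NMV": "3.1.1-6608 (r9456)",
    "OS": "3.1.1",
}

# Matches e.g. "NMS Version:3.1.1-6584 (r9461)" or "OS Version: 3.1.1".
PATTERN = re.compile(
    r"\b(NMS|NMC|NMV|OS) Version:?\s*([\d.]*\d(?:-\d+)?(?: \(r\d+\))?)"
)


def parse_versions(text):
    """Return a {component: version string} dict found in the text."""
    return dict(PATTERN.findall(text))


def mismatches(text):
    """Return {component: (found, expected)} for anything that differs."""
    found = parse_versions(text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in EXPECTED.items()
        if found.get(name) != wanted
    }


if __name__ == "__main__":
    report = ("NMS Version:3.1.1-6584 (r9461), NMC Version:3.1.1-6622 (r9469), "
              "NMV Version:3.1.1-6608 (r9456), OS Version:3.1.1.")
    print(mismatches(report) or "all components match")
```

A component that is missing from the text shows up with a found value of `None`, which is a hint that the output format differs from what this sketch assumes.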
I think most of our problems originally stemmed from the fact we were trying to update from the 3.1.0 media (which is still linked to on the community site) - and I believe 3.1.0 was pulled and replaced with 3.1.1. I think the other issues we had related to the fact it looked like Comstar patches were pushed during the period we were having problems so all should be good now.
However if you are a VMware shop running vSphere5 you might want to be aware of the following (and I suggest you hold back on vSphere5 migrations for at least another few weeks):
- Veeam V5 backup not fully working with vSphere5. A patch is due shortly: http://forums.veeam.com/viewtopic.php?f=2&t=8103
- You also need to disable SCSI UNMAP on the VMware hosts: http://blogs.vmware.com/vsphere/2011/09/vaai-thin-provisioning-block-reclaimunmap-issue.html
- You'll need to be aware of the slow vSphere5 boot process: http://blogs.vmware.com/vsphere/2011/10/slow-booting-of-esxi-50-when-iscsi-is-configured.html
- We are seeing strange behaviour on 1MB block size VMFS5 file systems - deleting VMs often times out while hanging at 95% (we see this both on NexentaStor and other iSCSI SANs). We are still researching this.
On vSphere5 in general, it looks to us like there are still multiple outstanding VMware related issues - e.g. deleting a VM on a VMFS5 file system with 1MB block size seems to hang most of the time (both on NexentaStor and on other non-VAAI iSCSI SANs).
To be perfectly honest, I'd hold off on vSphere5 migrations if you are able to until storage related issues are wider understood (for both NexentaStor and all other SANs) - it's caused us significant lost time and we are only a dev shop.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by pos _ei_don about 1 year ago
VMware posted an update for their issues. ESXi500-201111001
Can someone report if this patch works, before i update my machines from 4.1?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
yes this resolves the long start up time on vSphere 5 - our boot time of our ESX5i hosts dropped from over 30 minutes to around 5 minutes. We have applied the patch to 9 hosts in total - all good. Make sure your NexentaStor instance is patched as well as there have been numerous fixes at the Comstar layer.
There is a Veeam patch to Veeam B&R v5 that also now supports vSphere5 properly.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Adam Breidenbaugh about 1 year ago
I may be experiencing related issues.
I have two ESXi 5 hosts, plus a VEEAM B&R host connecting to an iSCSI datastore running on Nexenta. At times, the datastore becomes unresponsive and the VMs hang. esxtop shows almost 0 throughput, and the latencies are very high (20,000-60,000.) Only rebooting Nexenta seems to get things going again. It has locked up twice on me in the past week. I suspect it may be occurring during the removal of large snapshots (up to 15G-30G) after VEEAM direct SAN backups complete.
I posted a new thread with more details on my environment in the main (General) forum titled "iSCSI datastore becomes unresponsive."
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jose Merchan about 1 year ago
Hi
we still have this issue
ESXi 5.0.0 build 504890, Nexenta Community 3.1.1 with latest upgrades
is there any solution for this bug?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Latest patch release pushes up build number to; 515841 and there are specific changes in here around UNMAP etc, so you might want to switch to this build first then retest. We'll be switching to this build over the next few days.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007680
I'd be interested to know how you get on.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jose Merchan about 1 year ago
Hi Ashley
We don't think that this ESXi patch solves the problem; we tried before with SCSI UNMAP disabled and got the same results. The Nexenta iSCSI server hangs when we run an SVMotion and delete a virtual machine simultaneously.
This ESXi patch seems to simply disable SCSI UNMAP by default, so we have no hope
we will test it anyway, maybe we'll get lucky
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
I wonder if the ATS support is not working. VMware calls it Hardware Accelerated Locking. I need to find my workstation to print the MDB pipeline that can determine if ATS has wedged you. We've found (and can reproduce) VAAI problems with ATS on SCSI over Fibre Channel (but not on iSCSI, oddly enough). If you disable HW Accelerated Locking, Jose, do the hangs go away? If they do, then it's possible you're being bitten by the same bug we're seeing in the lab.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
- Here's the macro. Run "mdb -k" on your NexentaStor box (as root) and utter this:
::stmf_ilus | ::print stmf_i_lu_t ilu_lu | ::print stmf_lu_t lu_provider_private | ::print sbd_lu_t sl_flags sl_name sl_ats_state
If you see a non-zero ats_handle, that LU is locked.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jose Merchan about 1 year ago
Hi DAN
thanks,
ok, we will test with VAAI locking disabled and we will post the results here
I hope this solves the lockings until a patch
we have two questions:
is this a bug in esxi 5 or in nexenta 3.1.1?
and is VAAI locking very important, or can we disable it without noticing any difference?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
If it's the bug we're seeing in the lab, it's an NS 3.1.1 bug. And the ATS operation allows greater concurrency, so you will miss out on potential performance increases if it's disabled. If the MDB pipeline I mentioned above shows non-zero ats_handles after your hang, then it's likely our lab bug.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Hey Dan, would this same issue potentially present itself when connecting to ESX4.1i hosts? Does this issue impact NFS presented storage to ESX5i?
We don't get any of these types of issues on our older MSA 2000 G2 devices presenting iscsi to ESX5i - is this just because they aren't supporting the same T10 standards? If this is the case it would be extremely beneficial to all current/new Nexenta customers if there could be an option within the GUI to disable all the advanced T10 features until such time as these issues are more widely understood/resolved.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
ESX4.1 doesn't implement the T10-specified VAAI primitives. You also probably won't see this issue on NFS either.
I don't know what an MSA 2000 G2 is, unfortunately. As for the disable-VAAI, it's being worked on (per-dataset and per-feature), but I'd really prefer fixing it in-house so people don't have to disable it at all.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
thanks Dan for your on-going commitment to resolve these issues. From my perspective, people expect storage to just work and when it fails, it can hurt the perception on the reliability of the platform (despite the huge efforts that you guys are putting in to make Nexenta the best storage platform available).
From our perspective, to have default settings that disables the features known to have stability issues (and information about what the known issues are) is critical.
We would much rather have stable lower performing storage over high speed storage at risk of random crashing.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Hi Ashley,
Agree with you.
I do expect storage to just work, as it is a pretty critical part of almost any system. So before adding new features and doing a lot of marketing for any storage product, I do expect it to be thoroughly tested and re-tested, and only then made available for production.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
I've made a fascinating discovery in the lab.
If anyone has a reproducible hang, could this person run the following D script on their NexentaStor?
#!/usr/sbin/dtrace -FCs

sbd_dbuf_xfer_done:entry
/((scsi_task_t *)arg0)->task_cdb[0] == 0x89/
{
	stack();
	exit(1);
}
If this produces output prior to your reproducible hang, I'd like to know that. This might indicate a specific ATS codepath problem.
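For readers wondering what the 0x89 in that predicate is: the first byte of a SCSI CDB is the operation code, and 0x89 is COMPARE AND WRITE, the command behind ATS / hardware accelerated locking. A minimal Python sketch of the same test is below. The opcode table lists the T10 commands I understand the VAAI primitives to map onto, and the sample CDBs are made-up zero-filled buffers, not captured traffic.

```python
# SCSI operation codes relevant to this thread (first byte of the CDB).
SCSI_OPCODES = {
    0x16: "RESERVE(6)",
    0x28: "READ(10)",
    0x42: "UNMAP (thin provisioning reclaim)",
    0x83: "EXTENDED COPY (full copy)",
    0x89: "COMPARE AND WRITE (ATS / hardware accelerated locking)",
    0x93: "WRITE SAME(16) (block zeroing)",
}


def is_ats_command(cdb):
    """True if the CDB's opcode byte is COMPARE AND WRITE (0x89)."""
    return len(cdb) > 0 and cdb[0] == 0x89


def describe(cdb):
    """Return a readable name for the CDB's opcode, if known."""
    if not cdb:
        return "empty CDB"
    return SCSI_OPCODES.get(cdb[0], f"opcode 0x{cdb[0]:02x}")


if __name__ == "__main__":
    # Hypothetical 16-byte CDBs with only the opcode filled in.
    for opcode in (0x89, 0x28):
        cdb = bytes([opcode] + [0] * 15)
        print(describe(cdb), "-> ATS" if is_ats_command(cdb) else "-> not ATS")
```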
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Any volunteers to test a replacement kernel module? Our immediate problem in the lab has disappeared with this fix. I'd like others to kick it around before I declare victory.
Please speak up on this thread if there are willing volunteers!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Hello Dan,
I am one of those affected by this and had to disable ATS because I use Fibre Channel. The problem is when adding a new Datastore, not that it locks when in use. It simply doesn't allow me to add it.
Unfortunately our system is in production so I can't make any changes to it. I could however try to do that on a separate ESX host and LUN, provided it won't interfere with the production LUNs (could you confirm that?), but I can't apply any fixes to this system until they are available for upgrade from the repositories.
Ideally it would need another physical server and Fibre Channel card, with another Nexenta built on it to test separately. I might be able to find similar hardware but that won't happen before the first week of January.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
FREDY . wrote:
Hello Dan,
I am one of those affected by this and had to disable ATS because I use Fibre Channel. The problem is when adding a new Datastore, not that it locks when in use. It simply doesn't allow me to add it.
That's part one of the problem. If you access that store again (say a second attempt to add), the request locks. You're describing my lab environment, actually. :)
Unfortunately our system is in production so I can't make any changes to it. I could however try to do that on a separate ESX host and LUN, provided it won't interfere with the production LUNs (could you confirm that?), but I can't apply any fixes to this system until they are available for upgrade from the repositories.
The ATS internal properties are per-LUN. If you've an ESX production host with ATS disabled, my change won't affect it. No ATS request, no bug (or fix).
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
Hey Dan, If you can email me the links+details of the kernel modules that have changed, I can apply these to our backup host that we sometimes load up with limited development load when we feel brave! Ironically I'm currently busy with a load of svmotions off a failing HP SAN right now (failures caused by lack of multi-pathing to one of the I/O enclosures - best not ask!)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jose Merchan about 1 year ago
Hello Dan
We have tested with VAAI locking (ATS) disabled in ESXi 5 and it seems that this resolves the issue.
Now we can run SVMotions and delete, create, cancel SVMotions, etc... without hanging the datastore
Anyway we will stress the system with several SVMotions, deletes, etc... to confirm the workaround
We are volunteers!! :)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Jose Merchan about 1 year ago
Hello Dan
When we try the script you post above, this is the output
dtrace: failed to compile script ./hola: line 3: syntax error near ")"
Sorry but we don't know anything about DTrace programming
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Jose Merchan wrote:
Hello Dan
When we try the script you post above, this is the output
dtrace: failed to compile script ./hola: line 3: syntax error near ")"
Sorry but we don't know anything about DTrace programming
My paste here corrupted the script.
And as for volunteers, I was looking for folks to try the replacement stmf_sbd kernel module. I'll discuss this with Ashley offline.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Adam Breidenbaugh about 1 year ago
We are currently in the process of creating a test environment that mirrors the setup of production. If I can reproduce these errors in our test environment, I'll be more than happy to try the replacement kernel module. Test environment is a couple days away from completion.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by raphael schitz about 1 year ago
Dan McDonald wrote:
Any volunteers to test a replacement kernel module? Our immediate problem in the lab has disappeared with this fix. I'd like others to kick it around before I declare victory.
Please speak up on this thread if there are willing volunteers!
We would be very interested since we had to cancel nfs to iscsi migration because of the VAAI issue (the ATS one from nexenta and the UNMAP one from VMware) Thanks
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson about 1 year ago
hi Dan, we were running the stmf_sbd patch (you sent me on the 22nd December) with hardware accelerated locking on - and we had been solid since we ran with your patch. However, on a virtualised Nexenta test I ran with a fresh 3.1.2 install and then the standard upgrade patches, we are hanging during simultaneous svmotions. Has the stmf_sbd patch been incorporated into the main distribution yet?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
It's not in 3.1.2, due to an additional, harder to get, race. I will have to cut a new workaround stmf_sbd.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
That's interesting. I have been waiting for this to go into the stable distribution for a while. I believe the lack of Hardware accelerated locking is one of the reasons for the higher latency on some LUNs.
What do you mean by cut a new workaround, Dan?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
FREDY -> I have a workaround stmf_sbd that has a fix in place, but the fix has been shown to still have locking problems when placed under certain kinds of load. I've been hard-pressed to get an ESXi 5.0 setup running, so I cannot easily test-and-debug. (Also, I work on other pieces of the system as well.) When I say I want to cut a new workaround, that means I wish to make sure that an stmf_sbd with fixes is compiled from sources that match 3.1.2.
Dan
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi about 1 year ago
Hi Dan,
We are running into the same locking issue with 3.1.2. Can you please update this thread when the patch is included in the repos? Ashley wrote about a patch, could you send it to me too? tcalosiATgmail.com.
Thanks.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Yeah,
I have been waiting for this fix for quite a while as well.
Dan, you mentioned that you had a fix, but that was lacking something else if I understood correctly. Is it just proper QA or test-and-debug or is it something else missing to get this stable ?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Patch 8475 (for bug 6919) was integrated, and I believe it will be available for NS3.1.2 soon
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Dan, does that mean the problem is resolved and has been QA'ed then ?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
FREDY . wrote:
Dan, does that mean the problem is resolved and has been QA'ed then ?
Resolved, yes (modulo one customer's report that multiple concurrent svmotions can cause a similar hang, even with this fix).
QA, it certainly has been tested on the common cases. Most folks here who tried my fixed stmf_sbd reported goodness, and it also fixed the problem in my own vSphere tests.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi about 1 year ago
Hi Dan,
The patch is not in the repos yet. Do you know if we can get it from somewhere else? We really appreciate your effort.
Thanks
Tom
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
3.1.3 is coming soon and will have the initial fix.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Dan, does "initial fix" mean that it is not yet fully fixed?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
FREDY . wrote:
Dan, does "initial fix" mean that it is not yet fully fixed?
I mentioned this earlier. One customer, with this fix in place, has still managed to cause an ATS seizure by performing several concurrent svmotions in a row. I've yet to reproduce this successfully myself (where I can catch our end in the act). Enough customers were happy with the existing fix, however, that we're pushing the fix as-is out now.
If/when this problem still manifests post-3.1.3, we will follow up. I'm trying to get a reproduction of this bug in our lab, but without success as of yet. Obviously I'd like to fix things if I can discover the bug this one customer is seeing.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Peter Kranz about 1 year ago
Please keep updating the status of this issue, I've had multiple iSCSI lockups on our vmware iSCSI based hosts connected to Nexenta 3.1.2-8147, and the only way to un-freeze seems to be a reboot of the Nexenta (or the ESXi host).
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald about 1 year ago
Peter Kranz wrote:
Please keep updating the status of this issue, I've had multiple iSCSI lockups on our vmware iSCSI based hosts connected to Nexenta 3.1.2-8147, and the only way to un-freeze seems to be a reboot of the Nexenta (or the ESXi host).
The patch for this problem is 8475, so you don't have the fix in place yet. 3.1.3 should be coming soon, and it will have the fix I described above.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan Swartzendruber about 1 year ago
yay, thanks :)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Steven Rodenburg about 1 year ago
Hello all,
Any news on 3.1.3 ?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Linda Kateley about 1 year ago
3.1.3 is due very soon
I will post and definitely let everyone know when it's out.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Luke Evans about 1 year ago
Hi all,
I'm still afflicted by this one and it's driving me mad!
Any updates on 3.1.3 or is the patch available at all?
Thanks!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . about 1 year ago
Luke, perhaps you should get in touch with Dan McDonald, who has been involved in getting this fix out. He mentioned it seemed OK now for some people who applied the patch, but if it's not working for you, you should report that back. Thanks.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Steven Rodenburg about 1 year ago
Luke Evans wrote:
Hi all,
I'm still afflicted by this one and it's driving me mad!
Any updates on 3.1.3 or is the patch available at all?
Thanks!
Disable VAAI ATS on your ESXi 5 hosts (to make them behave like 4.1) and the instability is gone. You will lose some performance, but I reckon stability is more important.
When 3.1.3 turns out to be a good release, then you can turn ATS back on again.
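For anyone who has to flip this on a whole farm, here is a minimal Python sketch that just builds the esxcli command lines for the VAAI advanced settings on ESXi 5.x. The option paths are the ones VMware documents for the three block primitives; the helper name `vaai_commands` and the short keys ("ats", "xcopy", "write_same") are my own, and nothing here talks to a host. Run the printed commands yourself in the ESXi shell, one host at a time.

```python
# Advanced-setting paths for the three VAAI block primitives on ESXi 5.x.
# The dictionary keys are hypothetical short names used only in this sketch.
VAAI_OPTIONS = {
    "ats": "/VMFS3/HardwareAcceleratedLocking",        # hardware-assisted locking
    "xcopy": "/DataMover/HardwareAcceleratedMove",     # full copy
    "write_same": "/DataMover/HardwareAcceleratedInit" # block zeroing
}


def vaai_commands(primitives=("ats",), enable=False):
    """Return the esxcli command strings that set each primitive to 0 or 1."""
    value = 1 if enable else 0
    return [
        "esxcli system settings advanced set "
        f"--int-value {value} --option {VAAI_OPTIONS[name]}"
        for name in primitives
    ]


if __name__ == "__main__":
    # Disable only ATS, as suggested above; pass enable=True to turn it back on.
    for command in vaai_commands(("ats",), enable=False):
        print(command)
```

Passing `tuple(VAAI_OPTIONS)` instead of `("ats",)` gives the commands for all three primitives, which is what "completely disable VAAI" amounts to.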
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Chris Ogden 11 months ago
Can anyone confirm that 3.1.3 includes this patch and resolves the issue at hand?
Thanks
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi 11 months ago
Apparently it's not included...
http://info.nexenta.com/rs/nexenta/images/doc_3.1_release_notes_3.1.3.pdf
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Chris Ogden 11 months ago
I had read the release notes but wasn't sure if it fell under a different category or just wasn't mentioned intentionally. We were just about to go live with a system when we came across the problems in this thread and experienced them first hand.
If this patch is not included, we will likely be forced to pursue another solution and move away from using Nexenta anytime soon.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 11 months ago
Wrong. Look in the "Bug Fixes" section --> 6919 is called out. Get 3.1.3 now, folks.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi 11 months ago
Dan McDonald wrote:
Peter Kranz wrote:
The patch for this problem is 8475, so you don't have the fix in place yet. 3.1.3 should be coming soon, and it will have the fix I described above.
That's great Dan, I was expecting the patch to come with id 8475 as you mentioned before in this post. Anyway, great job! We'll update soon.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Marcus Oliveira 11 months ago
hi guys,
Just to inform you that this issue still happens in 3.1.3.
Marcus
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 11 months ago
Marcus,
1.) As frequently as with 3.1.1/3.1.2?
2.) One thing even VMware is suggesting is disabling UNMAP (hardware-assisted erase). Did you try that (but with hardware-assisted locking enabled)?
3.) As I mentioned earlier, there was one report of this prior to the initial fix's integration. If you're a paying customer, Nexenta bug 8499 tracks this, and you should be added to it.
Unlike the initial fix, this one is much harder to track. I'm off on other fires right now, so I might not be able to help as much as I did the first time.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by William Roush 9 months ago
I was getting this issue on 3.1.2 with HardwareAcceleratedLocking turned on (our development LUN would just seize up, all other LUNs doing other things would still operate fine, to get the LUN back online again a reboot of Nexenta was the easiest thing to do).
Turning off HardwareAcceleratedLocking resolved the issue (at least for the past 18 days, I had LUN seizes at least once a week), I patched to 3.1.3, but want to confirm:
Has anyone that experienced this issue prior to 3.1.3 NOT had this issue since upgrading to 3.1.3?
Thanks to the devs on this one, I know a lot of SAN providers had issues with ESXi 5 and VAAI, Dell delayed support for EqualLogic (what we use in production) and we ended up deploying 4.1 due to that.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Steven Rodenburg 9 months ago
I can confirm. Since running 3.1.3 i had all VAAI features turned on (incl. HardwareAcceleratedLocking) and had no problems.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 8 months ago
Steven Rodenburg wrote:
I can confirm. Since running 3.1.3 i had all VAAI features turned on (incl. HardwareAcceleratedLocking) and had no problems.
Cool. There is one additional corner case (MUCH harder to reach) with VAAI's ATS (HW locking) that is fixed in 3.1.4, BTW. It's possible it was squeezed into 3.1.3 as well, but I know it's in 3.1.4.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Larry Smith 8 months ago
Have you guys also installed ESXi 5 U1, which disables UNMAP? I have not had any issues since then. Check out my blog if you want to look into some of this. http://elretardoland.com
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 8 months ago
Larry Smith wrote:
Have you guys also installed ESXi 5 U1, which disables UNMAP? I have not had any issues since then. Check out my blog if you want to look into some of this. http://elretardoland.com
With UNMAP enabled, the aforementioned other ATS race becomes easier to trigger. I'd be VERY interested in hearing how 3.1.4 fares with UNMAP enabled. My customer(s) who were being bitten seemed to no longer have a problem after the fix in 3.1.4.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Chris Ogden 8 months ago
Dan,
You keep mentioning the additional fixes and stability of 3.1.4 but I can't seem to find it available anywhere, is it publicly available yet?
Chris
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 8 months ago
Shoot, I thought it was out already. Sorry for speaking out of turn.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 8 months ago
3.1.4 will be out soon hopefully, you guys will be the first to know ;)
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Julian Z 7 months ago
Looks like I'm a victim of this as well. Disabling VAAI and UNMAP hasn't helped :(
My topic on this: http://nexentastor.org/boards/2/topics/8144
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 7 months ago
It's a day-to-day wait on the next NS release. I'm really hoping it will be this week, fingers crossed.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . 7 months ago
Yeah, 3.1.4 seems to be way behind: many people have said "soon" and there is no sign of it. It seems they always find a last-minute thing, or didn't plan accordingly for this release.
More important than this is to know what exactly 3.1.4 will fix.
Other than these VAAI issues, there are apparently COMSTAR/ZFS problems as well, about which I don't have much information. Perhaps Nexenta can share with us what exactly those are about and whether they have anything to do with the delay of 3.1.4.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 7 months ago
I'm working on it!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Julian Z 7 months ago
My write up on my problem with this: http://nexentastor.org/boards/5/topics/8341
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by FREDY . 7 months ago
Hi Derek,
7 days. Did you get anything from your work on this ?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 7 months ago
It is day to day at this moment. Last I saw from the release team on Monday, it is aimed for mid/end of this week. Let's hope it hits that target.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson 7 months ago
I just did a "setup appliance upgrade" and saw there was a new version.
After upgrading, I get the following versions;
NMS Version 3.1.3.5 (r10429)
NMC Version 3.1.3.5 (r10417)
NMV Version 3.1.3.5 (r10417)
OS Version 3.1.3.5
any chance of a link to the fix list?
Is this the version we have all been waiting for? It's a little strange it's not a 3.1.4 release.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Davide Poletto 7 months ago
Just add ".5" to the end of the 3.1.3 release notes' file name in its URL and you will find the 3.1.3.5 PDF. The Release Notes link on the http://www.nexentastor.org/projects/site/wiki/CommunityEdition page (under the "Getting Started and Registration key" paragraph) still points to 3.1.3, but with that silly little URL hack you will download the 3.1.3.5 PDF.
Regards, Davide.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ryan W 7 months ago
Ashley Watson wrote:
I just did a "setup appliance upgrade" and saw there was a new version.
After upgrading, I get the following versions; [...] any chance of a link to the fix list?
Is this the version we have all been waiting for? It's a little strange it's not a 3.1.4 release.
You must have some magic repo, because I've got nothing new when I run setup appliance upgrade. Is this on one of your CE boxes? I find those get some releases before they are approved for "EE".. effectively making CE users beta testers.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ashley Watson 7 months ago
great stuff. Yes this is the CE version.
thanks for the tip about the release notes;
http://info.nexenta.com/rs/nexenta/images/doc_3.1_release_notes_3.1.3.5.pdf
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 6 months ago
How are you guys doing on this issue since updating to 3.1.3.5?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Steven Rodenburg 6 months ago
Derek Glover wrote:
How are you guys doing on this issue since updating to 3.1.3.5?
So far so good. No hangs or other issues during operation. Ran ESXi 5.0 U1 first, now on 5.1.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Marian Fischer 6 months ago
Hi,
We updated everything to the newest versions, both the Nexenta (3.1.3.5) and the VMware cluster. And still there are iSCSI drops and timeouts!!
There are still massive problems with VAAI (ATS) and nexenta:
snip-
cpu3:328139)ScsiDeviceIO: 2322: Cmd(0x4124403d38c0) 0x89, CmdSN 0x63463 from world 364295 to dev "naa.600144f09caf0c0000005047680e0003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
cpu5:2053)ScsiDeviceIO: 2322: Cmd(0x412440f94fc0) 0x89, CmdSN 0x12a163 from world 151739 to dev "naa.600144f09caf0c0000005047680e0003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
cpu7:2055)ScsiDeviceIO: 2322: Cmd(0x4124403a3f80) 0x89, CmdSN 0x12a168 from world 151741 to dev "naa.600144f09caf0c0000005047680e0003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
cpu4:2052)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x89 (0x4124020d1740, 151742) to dev "naa.600144f09caf0c0000005047680e0003" on path "vmhba34:C2:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
cpu4:2052)ScsiDeviceIO: 2322: Cmd(0x4124020d1740) 0x89, CmdSN 0x12a169 from world 151742 to dev "naa.600144f09caf0c0000005047680e0003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
/snip-
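For readers trying to make sense of lines like these, here is a rough Python sketch that decodes the fields of a vmkernel ScsiDeviceIO failure line. The lookup tables are deliberately tiny and only cover values I am sure of from the SCSI standards (opcode 0x89 is COMPARE AND WRITE, the command behind ATS; device status 0x2 is CHECK CONDITION; sense key 0xe is MISCOMPARE). The function name `decode_scsi_failure` is my own, and the regex is an assumption based on the lines posted in this thread, so it will not match the differently shaped NMP lines.

```python
import re

# Small lookup tables from the SCSI standards (SBC/SPC); not exhaustive.
OPCODES = {0x89: "COMPARE AND WRITE"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x0: "NO SENSE", 0x5: "ILLEGAL REQUEST", 0xE: "MISCOMPARE"}
ADDITIONAL_SENSE = {
    (0x1D, 0x00): "MISCOMPARE DURING VERIFY OPERATION",
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
}

# Pattern assumed from the ScsiDeviceIO lines quoted in this thread.
FAILURE_RE = re.compile(
    r"Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?"
    r"H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) "
    r"(?:Valid|Possible) sense data: "
    r"(?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_scsi_failure(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    fields = {name: int(value, 16) for name, value in match.groupdict().items()}
    return {
        "opcode": OPCODES.get(fields["op"], hex(fields["op"])),
        "host_status": fields["h"],
        "device_status": DEVICE_STATUS.get(fields["d"], hex(fields["d"])),
        "sense_key": SENSE_KEYS.get(fields["key"], hex(fields["key"])),
        "additional_sense": ADDITIONAL_SENSE.get(
            (fields["asc"], fields["ascq"]), "unknown"
        ),
    }
```

Fed the first line of the snippet above, this reports a COMPARE AND WRITE that ended in CHECK CONDITION with a MISCOMPARE sense key. An occasional miscompare can simply mean another host won the lock; whether the volume seen here points at the target-side bug is something the decode alone cannot tell you.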
and many other SCSI write errors ...
So guys, we decided to buy Nexenta because of VAAI... But there are bugs with the locking!! We also wrote a mail to Dan about COMSTAR problems 6 weeks ago, but haven't had any feedback yet. So guys, please fix this feature!! It was the main reason for us to buy Nexenta including 5 years of support.
regards,
Marian
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi 5 months ago
Again same issue here with lun locking with VAAI enabled
NMS Version 3.1.3.5 (r10429)
NMC Version 3.1.3.5 (r10417)
NMV Version 3.1.3.5 (r10417)
This also causes all the connected ESXi hosts to become unresponsive to management.
I wonder how can VMware certify nexenta storages....
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Derek Glover 5 months ago
Is anyone willing to join the Nexenta Beta program to see if they experience similar in the 4.0 release?
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Dan McDonald 5 months ago
Tommaso Calosi wrote:
Again same issue here with lun locking with VAAI enabled
NMS Version 3.1.3.5 (r10429) NMC Version 3.1.3.5 (r10417) NMV Version 3.1.3.5 (r10417)
This also causes all the connected ESXi hosts to become unresponsive to management.
I wonder how can VMware certify nexenta storages....
3.1.3.5 should have the bug fixed (Bug 8499) we'd been seeing in-house with ATS (aka. HW-assisted locking) and svmotions.
Can you take a kernel dump of a system that gets wedged like this? That would be most helpful.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by eMiz0r . 4 months ago
We have been experiencing the same issues since Tuesday of last week. Before that we had this problem about once every 2 months: ESXi hosts randomly losing datastores, causing VMs to crash. The only way to get them back is to reboot the Nexenta box, which leaves Linux VMs with read-only filesystems. Last week has been a real nightmare: every time the Nexenta system hits around 20-30% CPU usage (dual hexacore processors and 96GB RAM) in a brief spike, we lose datastores. We've had around 7 outages last week and cannot sell this to our customers anymore. It's a production environment hosting around 250 VMs. Despite the dual hexacore processors, average load is around 1.8 and at busy moments +/- 2.1.
However, we still use Nexenta 3.1.2 and will be upgrading to Nexenta 3.1.3.5 within a couple of days (preferably tonight), and we can only hope this will fix our problems. If it doesn't, I'm afraid we will have to stop using Nexenta. We do have a support contract, but when we ran a dump of the logs the CPU spiked again and we had another outage, so we're not able to provide any logs :( We have had little sleep the last few nights, and I sincerely hope 3.1.3.5 will be our answer, if only for the sake of our own mental health. We don't use Fibre Channel but InfiniBand, though that doesn't seem to be the problem, as users in this thread report the same troubles with other protocols.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by William Roush 4 months ago
2012-12-26T15:10:21.523Z cpu0:2161)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600144f0bd9c020000004ff1a959000c" state in doubt; requested fast path state update...
2012-12-26T15:10:21.523Z cpu0:2161)ScsiDeviceIO: 2324: Cmd(0x4124007ae1c0) 0x89, CmdSN 0x518dc from world 4654 to dev "naa.600144f0bd9c020000004ff1a959000c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
2012-12-26T15:10:32.407Z cpu1:2161)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600144f0bd9c020000004ff1a959000c" state in doubt; requested fast path state update...
2012-12-26T15:10:32.407Z cpu1:2161)ScsiDeviceIO: 2324: Cmd(0x412400798cc0) 0x89, CmdSN 0x518e1 from world 2066 to dev "naa.600144f0bd9c020000004ff1a959000c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-12-26T15:10:37.528Z cpu1:2161)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.600144f0bd9c020000004ff1a959000c" state in doubt; requested fast path state update...
2012-12-26T15:10:37.528Z cpu1:2161)ScsiDeviceIO: 2324: Cmd(0x412400772a80) 0x89, CmdSN 0x518e5 from world 4654 to dev "naa.600144f0bd9c020000004ff1a959000c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.
Also see this thread:
http://communities.vmware.com/message/2168839#2168839
I'm sort of back on board with suspecting Nexenta, I got the other host to throw errors too till I rebooted NexentaStor, running latest ESXi (5.0.0 914586). I'm trying with VAAI disabled now to see if it's a VAAI issue. I'm on Nexenta 3.1.3, as far as I see, 3.1.3.5 doesn't address any iSCSI/VAAI issues. I'm not sure if it's because of really bad resource contention and a bad lock or what...
Thought my issues were unrelated, but hopped back in here to see the issue I'm having is the same issue here. Completely disabled VAAI now, looking into whether or not it happens again.
I'm tempted to ship my LUN to NFS just to get this to stop!
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Ryan W 4 months ago
William Roush wrote:
I'm sort of back on board with suspecting Nexenta, I got the other host to throw errors too till I rebooted NexentaStor, running latest ESXi (5.0.0 914586).
Not that I think it'll solve your issues, but "Latest ESXi" is 5.1a (5.1.0a 838463) plus any applicable patches there on out.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by William Roush 4 months ago
Ryan W wrote:
William Roush wrote:
I'm sort of back on board with suspecting Nexenta, I got the other host to throw errors too till I rebooted NexentaStor, running latest ESXi (5.0.0 914586).
Not that I think it'll solve your issues, but "Latest ESXi" is 5.1a (5.1.0a 838463) plus any applicable patches there on out.
Grr, sorry, latest 5.0 build, fully patched.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by eMiz0r . 4 months ago
We've changed to 3.1.3.5, but it didn't help. We still lost several LUNs on different ESXi hosts, and the only way to recover was to reboot the SAN. After several more outages in the following days, we decided to move from IB SRP to IPoIB. Since then we see almost no H:0x7 errors in /var/log/messages on the ESXi hosts. Before, these errors came in at a high rate (20-30 within one second), resulting in a timeout where ESXi marks the storage path as dead and raises a "lost path redundancy" alarm.
After switching to IPoIB, we only see H:0x5 errors when putting the Nexenta box under a little stress. It's a dual hexacore SAN with 96GB RAM, where the load under "heavy conditions" is around 5. According to the VMware KB, those H:0x5 errors mean:
"VMK_SCSI_HOST_ABORT = 0x05 or 0x5
This status is returned if the driver has to abort commands in-flight to the target. This can occur due to a command timeout or parity error in the frame."
Now, after Googling, we see these errors popping up on NetApp, EMC, and HP storage equipment with GbE or FC connections. So we don't think it is directly related to Nexenta or InfiniBand. Still, could it have anything to do with some kind of overloading of the SAN which results in aborted SCSI commands?
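To put numbers on "20-30 within one second" versus "almost none", a small Python sketch like the one below can tally the H: (host status) codes per device from the vmkernel log. The status names are the ones I remember from VMware's KB on SCSI host status codes (only 0x5 is quoted in this thread, so please verify the rest against the KB before relying on them); the function name `tally_host_status` and the regex are my own assumptions based on the log lines posted here.

```python
import re
from collections import Counter

# Host status names as I understand VMware's KB; treat anything other than
# 0x5 (quoted above) as an assumption to double-check.
HOST_STATUS = {
    0x0: "VMK_SCSI_HOST_OK",
    0x1: "VMK_SCSI_HOST_NO_CONNECT",
    0x2: "VMK_SCSI_HOST_BUS_BUSY",
    0x3: "VMK_SCSI_HOST_TIMEOUT",
    0x5: "VMK_SCSI_HOST_ABORT",
    0x7: "VMK_SCSI_HOST_ERROR",
    0x8: "VMK_SCSI_HOST_RESET",
}

# Matches the device name and the host status on a failure line.
STATUS_RE = re.compile(
    r'to dev "(?P<dev>naa\.[0-9a-f]+)".*?H:(?P<h>0x[0-9a-f]+)', re.IGNORECASE
)


def tally_host_status(lines):
    """Count (device, host status name) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a SCSI failure line
        code = int(match.group("h"), 16)
        counts[(match.group("dev"), HOST_STATUS.get(code, hex(code)))] += 1
    return counts
```

Typical use would be `tally_host_status(open("/var/log/vmkernel.log"))` on a copy of the log (the path is an assumption and differs between ESXi versions), comparing the counts before and after a change such as the SRP to IPoIB switch.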
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by William Roush 4 months ago
Disabled VAAI, my issues went away, everything is 100% smooth, will roll without it until we hear from everyone here that it's fixed.
Can't plug into production though, they use VAAI :\
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Erik Bussink 4 months ago
Let's hope that this bug will be left behind when we get to move to Nexenta 4.
RE: iscsi target hang during simultaneous svmotions for vsphere5 / NexentaStor 3.1.1 onto vmfs5 - Added by Tommaso Calosi 3 months ago
Dan McDonald wrote:
Tommaso Calosi wrote:
Again same issue here with lun locking with VAAI enabled
NMS Version 3.1.3.5 (r10429) NMC Version 3.1.3.5 (r10417) NMV Version 3.1.3.5 (r10417)
This also causes all the connected ESXi hosts to become unresponsive to management.
I wonder how can VMware certify nexenta storages....
3.1.3.5 should have the bug fixed (Bug 8499) we'd been seeing in-house with ATS (aka. HW-assisted locking) and svmotions.
Can you take a kernel dump of a system that gets wedged like this? That would be most helpful.
There's no kernel dump. Anyway we get this in the hostd.log
~ # cat /var/log/hostd.log |grep naa
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:44:42.732Z [43FF6B90 info 'ha-eventmgr'] Event 1330 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2341 microseconds to 59997 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:44:42.735Z [45D02B90 info 'ha-eventmgr'] Event 1331 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2341 microseconds to 138320 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:54:12.430Z [44EB9B90 info 'ha-eventmgr'] Event 1332 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2343 microseconds to 49458 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:58:24.870Z [FFFC5AC0 info 'ha-eventmgr'] Event 1333 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2350 microseconds to 58455 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:59:08.719Z [FFFC5AC0 info 'ha-eventmgr'] Event 1334 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2352 microseconds to 120104 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T09:59:54.698Z [457E6B90 info 'ha-eventmgr'] Event 1335 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2359 microseconds to 242165 microseconds.
--> value = "naa.600144f092c1050000004fd1fc990001",
2013-01-24T10:00:28.844Z [43FF6B90 info 'ha-eventmgr'] Event 1336 : Device naa.600144f092c1050000004fd1fc990001 performance has deteriorated. I/O latency increased from average value of 2368 microseconds to 123572 microseconds.
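In case it helps anyone compare notes, here is a small Python sketch that pulls the "performance has deteriorated" events out of hostd.log text and works out how far each spike is above the running average. The function name `latency_events` is made up for this post, and the regex is an assumption based only on the wording of the events above; other ESXi builds may phrase them differently.

```python
import re

# Wording assumed from the hostd.log events quoted in this thread.
EVENT_RE = re.compile(
    r"Event (?P<event>\d+) : Device (?P<dev>naa\.[0-9a-f]+) "
    r"performance has deteriorated\. I/O latency increased from "
    r"average value of (?P<avg>\d+) microseconds to (?P<now>\d+) microseconds"
)


def latency_events(text):
    """Return (event id, device, average us, observed us, ratio) tuples."""
    events = []
    for match in EVENT_RE.finditer(text):
        average = int(match.group("avg"))
        observed = int(match.group("now"))
        events.append(
            (int(match.group("event")), match.group("dev"),
             average, observed, observed / average)
        )
    return events


if __name__ == "__main__":
    sample = (
        "Event 1335 : Device naa.600144f092c1050000004fd1fc990001 "
        "performance has deteriorated. I/O latency increased from "
        "average value of 2359 microseconds to 242165 microseconds."
    )
    for event, device, average, observed, ratio in latency_events(sample):
        print(f"event {event} on {device}: {observed} us, {ratio:.0f}x average")
```

Because it uses `finditer` over the whole text, it works whether the events are one per line or run together. In the output above the worst event (1335) is roughly a hundred times the average latency, which fits a LUN that is stalling rather than one that is merely busy.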