iSCSI disconnecting

Added by Elton Smythe over 2 years ago

Hi all, we've got a Nexenta storage box that's been in production for about a month, and we have seen a number of iSCSI disconnects. We are running what we consider to be a pretty good setup for the level of data/performance we're expecting, so we don't feel things are overloaded at all.

The basic setup is:

2x Head Nodes, each with:

  • Dual Quad Core Xeon (E5620, 2.4GHz)
  • 96GB RAM
  • 2x Intel SSD as boot drives
  • 2x Intel onboard gig (for heartbeat and management)
  • Intel 10gig AF DA Dual Port NIC (for iSCSI connectivity, going to two separate Juniper 10G switches)
  • 2x LSI 9200-8e HBAs

2x JBODs, each with:

  • 8x Seagate Constellation 2TB
  • 1x STEC ZeusRAM
  • 1x OCZ Talos

We're running a single zpool, which consists of:

  • 8x Vdevs (each having 2x Seagate 2TB drives)
  • 2x cache (OCZ Talos)
  • 1x ZIL (mirrored STEC ZeusRAM)

It's all Supermicro gear, all the hardware diagnostics come back perfectly, and it's a fresh install. This install has been signed off by Nexenta and certified, and all the hardware is on their HCL.

We're running version Nexenta 3.1.2 with the HA plugin. The only data we store is on ZVols accessed over iSCSI, and in summary our iSCSI config is:

  • Multipath, to each of the two ports on the Intel NIC
  • 8x ZVols, so 8x LUNs presented to our hypervisors
  • Xen and KVM virtualisation running on top of LVM storage
  • About 20 hypervisors connecting to this set up
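
For completeness, here's a stripped-down sketch of the kind of /etc/multipath.conf settings involved on the hypervisor side. The vendor/product strings and timeouts below are illustrative placeholders rather than our exact config; take the real strings from "multipath -ll":

defaults {
    polling_interval     5
    path_grouping_policy multibus
    failback             immediate
    no_path_retry        12          # queue I/O briefly on path loss instead of failing immediately
}
devices {
    device {
        vendor  "NEXENTA"            # placeholder - use the strings reported by "multipath -ll"
        product "COMSTAR"
        path_checker tur
    }
}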

The ZVol config is standard with a few changes:

  • Each ZVol is 8TB, sparse
  • 8KB block size (the default)
  • LZJB compression
  • Write-through cache, so the ZIL gets used for writes
  • Snapping every hour, kept for 24h
  • Snapping every day (at 3am, staggered by 5 minutes), kept for 30d
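
For reference, creating a zvol like that looks roughly like the following on the storage side. The pool and volume names are just examples, and wcd=true is, as I understand it, the COMSTAR property that disables the write cache so writes go through the ZIL:

# sparse 8TB zvol, 8K blocks, LZJB compression (volblocksize can only be set at creation time)
zfs create -s -V 8T -o volblocksize=8k -o compression=lzjb tank/vol01
# expose it over COMSTAR with the write cache disabled, so sync writes hit the ZIL
stmfadm create-lu -p wcd=true /dev/zvol/rdsk/tank/vol01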

We've allocated about 4TB of data now on this system, so 'some' but not 'lots'.

We are not putting a huge demand on it in terms of performance, far from it in fact. We're putting what I'd consider basically 'background load' for a setup of this type. The typical figures are 100-150 read IOPS and 500-700 write IOPS, with occasional bursts to 1k read IOPS and 3k write IOPS.
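
For anyone wanting to compare like-for-like, something along these lines on the storage side shows the same sort of numbers ("tank" is just an example pool name):

zpool iostat -v tank 5    # per-vdev read/write operations and bandwidth, sampled every 5 seconds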

So the error that we're seeing is from the multipath daemon in Linux, specifically this:

Feb 12 14:00:32 hv073 multipathd: checker failed path 8:176 in map mpath1
Feb 12 14:00:32 hv073 multipathd: checker failed path 8:240 in map mpath2
Feb 12 14:00:32 hv073 multipathd: checker failed path 65:16 in map mpath3
Feb 12 14:00:32 hv073 multipathd: checker failed path 65:32 in map mpath4

etc. This is repeated across multiple hypervisors and different switches, so we're sure the problem is the Nexenta box. iSCSI reconnects shortly afterwards, so connectivity to the VMs is restored.
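
The usual initiator-side checks to confirm which paths actually dropped are along these lines:

multipath -ll                        # per-path state (active/failed) and the underlying sd devices
iscsiadm -m session -P 3             # per-session detail, including connection state and target portal
dmesg | grep -i iscsi                # kernel-side iSCSI connection errors around the same timestamps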

We've spent quite a bit of time with Nexenta eliminating hardware as a cause, so I think we're comfortable all the hardware is working correctly.

What we do know is that these disconnects happen on the hour boundary, which lines up with when we do the hourly snapshotting. There's nothing else we can see that happens on the hour.
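
One straightforward way to line the two up is to compare snapshot creation times against the checker failures in syslog, something like:

# on the storage side: when were the recent hourly snaps actually created? ("tank" is just an example pool name)
zfs list -H -t snapshot -o name,creation -s creation tank | tail -10
# on a hypervisor: timestamps of the path failures (may be /var/log/messages depending on distro)
grep 'checker failed path' /var/log/syslog | tail -10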

One of the concerns that has been raised is ARC metadata usage. We've heard elsewhere that it's good to get all of this in RAM, but that doesn't really seem feasible for block storage.

I've got an Illumos dev box here to do some experimentation on, and I can see the total number of blocks in the system with "zdb -b tank". I've copied over some sample VMs: 203.1GB of referenced data at 1.21x actual LZJB compression (so about 245GB of logical data). This system shows an arc_meta_max of 8152MB against a limit of 12GB, so it has never hit the limit, and it has no snapshots. zdb reports a bp_count of 38,024,600, i.e. one piece of metadata for every ~5,749 bytes of referenced data, and the average ARC metadata entry works out to about 225 bytes. Using this as a guide, every 25.5GB of referenced data on ZVols needs about 1GB of ARC metadata space, so on our current system with ~4TB used that means roughly 156GB of RAM, just for ARC metadata.
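
To make the scaling explicit, here's the same back-of-envelope calculation as a small shell sketch, using the dev-box numbers above and assuming they scale linearly to the production box:

# rough scaling sketch - inputs come from "zdb -b tank" and "kstat -p zfs:0:arcstats:arc_meta_max"
bp_count=38024600                                  # block pointers on the dev box
referenced_bytes=$((203 * 1024 * 1024 * 1024))     # ~203GB referenced
arc_meta_bytes=$((8152 * 1024 * 1024))             # arc_meta_max reported: 8152MB

bytes_per_bp=$((referenced_bytes / bp_count))      # ~5,700 bytes of data per block pointer
meta_per_bp=$((arc_meta_bytes / bp_count))         # ~225 bytes of ARC metadata per block pointer
echo "referenced data covered by 1GB of ARC metadata: $((bytes_per_bp / meta_per_bp))GB"   # ~25GB

# scale up to ~4TB (4096GB) of referenced zvol data
echo "estimated ARC metadata needed: $((4096 * meta_per_bp / bytes_per_bp))GB"             # ~160GB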

It would be really good to hear any thoughts from the community on this, especially from anyone using Nexenta for iSCSI storage who has seen similar problems.


Replies

RE: iSCSI disconnecting - Added by Peter Valadez over 2 years ago

Hi Elton,

Have you made any progress with this problem? We are interested in setting up a high-availability cluster with very similar specs. We will be using the Supermicro 847E26-RJBOD1 JBODs with LSI 9205-8e cards in the head nodes. Thanks!

RE: iSCSI disconnecting - Added by Tommy Scherer over 2 years ago

Blow away your current ZVOLs slowly, shrinking 50GB at a time to ensure there are no disruptions to other storage services, then recreate the ZVOLs with 128k blocks. That should reduce the ARC metadata by a factor of 16, bringing it to under 10GB, which fits within the default ZFS limit of 25% of the ARC. We had an account on the east coast with a similar issue; the migration took some work, but it fixed the problem.
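
A minimal sketch of that recreate step, assuming the pool is called "tank" and the volume names are just examples (volblocksize can't be changed on an existing zvol, hence the migration; 128k is 16x the current 8k, which is where the factor of 16 comes from):

# new zvol with 128k blocks - 16x fewer block pointers than 8k for the same amount of data
zfs create -s -V 8T -o volblocksize=128k -o compression=lzjb tank/vol01_128k
# ...copy the data across from the old LUN at the hypervisor/LVM level, then retire the 8k zvol
zfs destroy tank/vol01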
