Choosing the perfect ZIL device for a "HA Cluster-In-A-Box"

Added by Clemens Kalb over 2 years ago

Hello everyone,

I recently had a look at Nexenta and SuperMicro's "HA Cluster-In-A-Box" offering: http://www.nexenta.com/corp/sbb Apparently, it consists of a 16x3.5" SAS2 JBOD connected to the LSI SAS2 controllers of both the left and right nodes of the cluster. The two nodes are also directly interconnected via dual 10GbE LAN interfaces.

To make use of this hardware configuration, one would probably have to employ the NexentaStor "HA Cluster" plugin, which "allows for the use of two NexentaStor instances for active / active failover in front of shared storage".

Fair enough, so far all of this looks like a simple yet powerful, highly-available setup.

One thing I've yet to find out is what kind of ZIL device one would use in this setup. To my understanding of ZFS, it would have to fulfill the following requirements:

  • it needs to be available to both nodes of the cluster so one node can access any data on the ZIL in case of a failure of the other node
  • if it is part of the shared JBOD between the nodes, it needs to be a dual-ported SAS/SAS2 device
  • if it is not part of the JBOD, it needs to be made available to both sides by other means
  • in case of a power failure, data in the device's internal cache must not be lost, or the device will at least have to honor ZFS cache flush requests
  • it needs to be fast

Now what ZIL device would fulfill all of these requirements?

  • A popular choice for a ZIL seems to be the Intel X25-E SSD. To use it in the JBOD, an LSISS9252-based SAS-to-SATA interposer card would be required. However, according to http://opensolaris.org/jive/thread.jspa?threadID=121424, it is unclear whether the X25-E will actually respect cache flush requests. Disabling the cache would be another option, but I am not sure whether this would work reliably with the SAS-to-SATA interposer or how bad the performance impact would be.
  • Another popular and very powerful choice for a ZIL seems to be the DDRdrive X1. However, as it is a PCIe-based device, it is tied to a single node and would not work in a clustered setup. I've had the idea of putting one DDRdrive X1 in each side of the cluster, exporting each DDRdrive as an iSCSI target, and using these iSCSI targets in a mirrored ZIL configuration. This might work (remember: there's 10GbE between the nodes), but it does not really sound like a reliable, fast or supported configuration.
  • Some people have recommended a Sandforce-1500 based SSD, as some of these seem to be backed by a supercapacitor. I've found the "OCZ Vertex 2 EX Series SATA II 2.5" SSD". Again, a SAS-to-SATA interposer would be required. There is also the more expensive "OCZ Deneva Reliability 2.5" SLC SSD", which is a native SAS device but seems to be hard to come by. Both are fast and have a built-in supercapacitor to protect from power outages. So far, this looks like a good choice. However, I've also found a rather recent comment on the zfs-discuss mailing list that says: "In particular, anything Sandforce 1500 based was the worst so avoid those at all costs if you dare to try an SSD ZIL." (http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41336.html).
  • An actual Logzilla from Oracle would probably fulfill all of the requirements if you can somehow fit it into the JBOD. However, it is also the most expensive choice, and I'm not sure whether Larry would actually sell you one of these if you do not already own compatible Sun storage. ;-)
  • Not using a separate ZIL device at all: This would probably work, if the drives in the pool honor cache flush requests. If they don't, you'd probably have other problems to deal with, anyway. ;-) The downside: it would be rather slow unless you employ expensive, low-capacity 15K drives.
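As a rough way to keep track of the options above, the requirements can be written down as a small checklist. This is only a sketch: the `Candidate` class, its field names, and the `verdict` function are made up for illustration, and the True/False/None values are merely a reading of the statements in this post (None meaning "unclear"), not test results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One ZIL option; None means the post leaves the point open."""
    name: str
    shared: Optional[bool]      # reachable from both cluster nodes
    flush_safe: Optional[bool]  # survives power loss or honors cache flushes
    fast: Optional[bool]        # fast enough to be worth using as a log


def verdict(c: Candidate) -> str:
    """Return 'fails' on any known miss, 'unclear' on any open point."""
    checks = (c.shared, c.flush_safe, c.fast)
    if any(v is False for v in checks):
        return "fails"
    if any(v is None for v in checks):
        return "unclear"
    return "meets all"


# Values are assumptions taken from the wording of the post, not measurements.
candidates = [
    Candidate("Intel X25-E via interposer", shared=True, flush_safe=None, fast=None),
    Candidate("DDRdrive X1 (local PCIe)", shared=False, flush_safe=None, fast=True),
    Candidate("No separate log device", shared=True, flush_safe=None, fast=False),
]

for c in candidates:
    print(f"{c.name}: {verdict(c)}")
```

Filling in the open points (the None values) for a given device is exactly the part that needs vendor confirmation or testing.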

So far, none of these options looks like the perfect choice. Or am I just being too picky or uninformed? What would you recommend?


Replies

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Matt Axsom over 2 years ago

Good question; an official response from Nexenta would be nice. The HCL lists only a single supported device. I doubt most customers would understand buying 10-packs of GC-RAMDISKs off eBay. There is also the same issue you mention, with the DDRdrive X1 being tied to the node:

http://nexenta.com/corp/zfs-hardware-corner/Certified-Components/nexenta-certified-components/zfs-cache-devicesc46m120/

One possible option might be along these lines (Hypervisor with VT-D/IOMMU passthrough):

http://blog.laspina.ca/ubiquitous/encapsulating-vt-d-accelerated-zfs-storage-within-esxi

I'm familiar with some of the other Supermicro chassis, and there isn't enough room to fit an interposer in their 2.5" to 3.5" drive sled. Maybe they will release a newer one that pushes the drive forward another 1/2".

-- Matt.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Clemens Kalb over 2 years ago

Just some quick comments since I have just received the chassis this week:

  • The interposers do fit in the (special) drive sled. It also seems to be possible to order sleds w/ interposers directly (SKU: MCP-220-000-71-ON)
  • They work with all Intel SSDs I tried (X25-V, M and E)
  • They do not work with the other SSDs I tried, an OCZ Vertex 2 EX and a Crucial C300. These SSDs show up with a capacity of 0GB when connected via the interposer. The OCZ SSD is also detected as a Hitachi device... I'm not sure what actually causes this issue (HBA, interposer or expander), and it seems almost impossible to get anyone to comment on it. LSI doesn't want to talk to us ("we do not provide direct support for ICs"), and SuperMicro and OCZ have not responded so far.

Anyway, I'm afraid that unless you can source native SAS SSDs from somewhere, it seems like the only safe option left is the X25-E.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by bill zheng over 2 years ago

You can probably use ramdisk-backed iSCSI LUNs from two remote hosts and mirror them.

As long as both remote hosts don't go down at the same time, you will have a ZIL and performance will be OK. As long as the remote hosts and your ZFS server don't go down at the same time, you won't lose data. I have not tested this setup yet.
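To make the two conditions above concrete, here is a small Python sketch that enumerates every up/down combination of the two remote ramdisk hosts and the ZFS server. It only models the claims made in this reply (it is not tested behaviour), and the function names are made up for illustration.

```python
from itertools import product


def log_available(remote_a_up: bool, remote_b_up: bool) -> bool:
    """A mirrored log stays usable while at least one mirror side is up."""
    return remote_a_up or remote_b_up


def sync_data_lost(remote_a_up: bool, remote_b_up: bool, zfs_server_up: bool) -> bool:
    """Per the claim above: data is lost only if all three are down at once.

    Assumption: the ZFS server still holds the pending writes in RAM while
    it is up, and each ramdisk loses its contents when its host goes down.
    """
    return not (remote_a_up or remote_b_up or zfs_server_up)


for a, b, s in product([True, False], repeat=3):
    print(
        f"remote A up={a!s:5} remote B up={b!s:5} ZFS server up={s!s:5} "
        f"-> log available={log_available(a, b)!s:5} "
        f"data lost={sync_data_lost(a, b, s)}"
    )
```

Of the eight combinations, only the one where everything is down at once counts as data loss in this model, which is why a common power source for all three machines matters so much.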

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Gijsbert Janssen van Doorn over 2 years ago

Clemens,

Do you have any updates on this? We are considering buying an SBB and are having trouble picking a good ZIL device. Is the Intel X25-E working correctly with the interposer (no errors)?

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Clemens Kalb over 2 years ago

All Intel devices I tried so far seem to work without problems (X25-V, M and E). However, I'm planning to try the mirrored ramdisk approach first (via iSCSI or AVS) to avoid any write endurance/performance issues that come with any SSD currently on the market.

Depending on your budget, here's a good list of devices that should do the trick: http://www.zstor.de/shop.html?page=shop.browse&category_id=18 If you have the money, the Stec ZEUSRAM 8GB would be the best bet for a shared-disk SAS solution.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Peter Valadez 10 months ago

Hi Clemens,

What is the model of the "special SSD sled" you used with a 2.5 inch SSD? I could not find the MCP-220-000-71-ON tray anywhere (maybe it was discontinued?). We did buy a couple of the MCP-220-00043-0N trays (http://www.provantage.com/supermicro-mcp-220-00043-0n~7SUP910X.htm), and they work fine with our SAS SSDs, but there is no way to fit an interposer in these sleds.

The only idea I've come up with that might work is getting an Icy Dock (http://www.newegg.com/Product/Product.aspx?Item=N82E16817994083), and using that with an interposer and a Supermicro drive tray that will fit a 3.5 inch drive with an interposer.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Tommy Scherer 10 months ago

The Supermicro SBB uses a different carrier than most Supermicro chassis. The SBB carrier is longer, which allows the interposer to be fitted. If you are trying to put a 2.5" Intel SSD in a 4th generation carrier, I think VA Technologies has them.

There was a firmware update to fix the errors with the interposer for the X25-E.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by FREDY . 10 months ago

I see people mentioning the use of a "mirrored ramdisk", but since the boxes are powered from the same power supplies, the data on the ramdisk will be gone if you have a power cut. So why not simplify and just enable write-back on the zvol, which should produce the same result?

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by Peter Valadez 10 months ago

If you have a proper device to use as a ZIL device, or slog, then the device will have capacitors (the DDRdrive has an AC adapter that can be plugged into a UPS), and any DRAM cache the device has will be powered long enough to finish all writes to stable storage. Therefore, a good ZIL device should not lose any data after a power loss. Some people mirror the ZIL device to protect against the ZIL device dying and the server then losing power or crashing within 5 seconds (5 seconds is the default interval for synchronous writes to be committed to pool storage).
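To get a feeling for how much data is exposed during that 5-second window, here is a back-of-the-envelope Python sketch. The 10 Gbit/s ingest rate is an assumed worst case (one saturated 10GbE link, no protocol overhead), the function name is made up, and the calculation ignores that more than one batch of writes may be in flight at a time.

```python
def log_bytes_per_interval(link_gbit_per_s: float, interval_s: float = 5.0) -> float:
    """Upper bound on synchronous write data arriving in one commit interval."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_per_s * interval_s


# Assumed worst case: one 10GbE link running at line rate for 5 seconds.
size = log_bytes_per_interval(10.0)
print(f"{size / 1e9:.2f} GB")  # prints "6.25 GB"
```

Under these assumptions the log device never needs to hold more than a few gigabytes, which is why small but fast devices are discussed as log devices in this thread.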

Also, I think what Clemens originally proposed was that the zil devices would be on two separate servers, so they wouldn't both be on the same power supplies.

Writeback cache allows asynchronous writes to use your system RAM as a write cache, so any writes still in the write cache would be lost in the event of a power loss.

RE: Choosing the perfect ZIL device for a "HA Cluster-In-A-Box" - Added by FREDY . 10 months ago

I see, there is a small difference. But in an SBB system each server has one power supply, and both will likely be fed from the same source, so if the power goes down completely you lose all the uncommitted writes that were on both ramdisks, whereas with a ZIL device they would stay until you can get power back to the system.