How To Configure ZFS for Small File IO? Performance is awful!
Added by Edmond DeMattia about 1 year ago
I have a Dell server with 4 CPUs and 16GB of RAM, controlling an MD1000 array with a PERC 6E controller. The drives are all configured as independent RAID 0 volumes, as this is the only way to present the drives in JBOD mode to the OS.
When I configure NexentaStor Community 3.1.2 for CIFS access, small files are transferred at ~3MB per second while large files transfer at ~40MB per second. The server has only one 1 GigE NIC, but it's not network bound as far as I can tell. I verified it's configured for 1000/full. I configured the drives as follows:
mirror-1 C1T0D0 C1T1D0
mirror-2 C1T2D0 C1T3D0
mirror-3 C1T4D0 C1T5D0
mirror-4 C1T6D0 C1T7D0
Hot Spare C1T8D0
SAS drive for cache
SAS drive for log
I configured the CIFS folder for access time=off and the Log Bias to throughput. I set the block size to 64k and dedupe and compression are off.
I've also tried the volume without a cache and log disk with very little difference in performance.
I've read that disabling the ZIL will increase performance up to 30x but isn't recommended. Any thoughts on how I can make this faster for small file I/O?
Replies
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
Well first, do disable the zil (with sync=disabled) at least to re-run your test. It would be nice to know if this was the issue...
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Jason Litka about 1 year ago
Small file IO is terrible everywhere, it's not just ZFS. If your workload involves a lot of tiny reads and writes then SSDs for the ZIL and L2ARC will help.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
Especially for the l2arc. My crucial M4 (64GB) has almost 500MB/s sustained read, but only 100MB/s sustained write. The trick there is that if you can't write to the l2arc fast enough, you just drop those blocks and don't cache them. And of course, that the SSD has virtually zero latency for said reads is what gives you the big win for small random reads.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Edmond DeMattia about 1 year ago
I disabled sync and it's still very slow with small files. I disabled checksums and no change in performance. I even rebooted for good measure but nothing.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Roman Strashkin about 1 year ago
I think your problem is "I set the block size to 64k". no?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Edmond DeMattia about 1 year ago
What would you recommend?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Edmond DeMattia about 1 year ago
I played with the sizes a bit. Setting it from 64k to 8k, the small files went from a transfer rate of less than 1MB a second to 10MB a second! Now large files are going slower. No happy medium, it appears.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Andrew Galloway about 1 year ago
So first, the environment is by no means preferable. You've introduced a lot of extra layers by having that PERC card in the middle: layers we have no control over, and that we aren't tuned out of the box to really work well with.
Second, I see "SAS drive for cache SAS drive for log". Bad, no. Do not use a spinning disk for L2ARC or log device -- you will almost always be better off just letting the pool handle ZIL and letting RAM handle ARC with no L2ARC than you will be putting any spinning media into play for either of those workloads. They will almost instantly be your bottleneck.
Thirdly - yes, the block size used is going to have a big effect on the performance of small files versus large files. That's just the way it is. 1 MB/s to 10 MB/s being your range, however, to me instantly sounds like "there's something wrong with the testing method, or the hardware environment, or some software settings".
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Edmond DeMattia about 1 year ago
Thank you for the information! I'll be the first to admit I'm not 100% sure what the correct configuration should be. Last week I noticed disabling the cache and log disks improved performance. They have since been removed from the pool.
What would be the optimal block size for hundreds of thousands of small files in the 1-500k range?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
I think it's important to point out that Andrew's advice is specific to Edmond's setup, not a general rule. From what I can see, the MD1000 uses 7200RPM SAS drives. If you had a setup with, say, 15K drives, this is not going to be true, since the limitation there will be the 1gb NIC: 15K SAS drives far exceed the pipe size of the 1gb link and have much lower latency than the main pool drives. I'm also kind of surprised about the "let the main pool handle the ZIL", as this seems to contradict everything I've read anywhere else, due to the supposed conflict between the pool drives having to seek back and forth to handle data writes vs zil writes. I'd appreciate some clarification on this. If I've misunderstood something here, I'd really like to have it corrected. Going back and re-reading the OP, I see he is doing logbias throughput, not latency, which should not be using the ZIL anyway, no?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Andrew Galloway about 1 year ago
Actually that advice is a general rule. Never use spinning media for ZIL or L2ARC.
While it is certainly true to say that a 15K disk will have a bottleneck somewhere significantly above a 7.2K disk, it's also true to say that bottleneck is astronomically lower than an SSD.
For ZIL, consider this: when no specific log device is given, ZIL mechanics utilize the pool itself; when usage is low, either the ZIL on the pool or the ZIL on a 15K disk is likely able to keep up -- when load rises, both a single 15K disk and a pool are liable to start having problems -- the pool will, however, probably be capable of handling more of it before bottlenecking than the single 15K disk would (perhaps the 15K disk lasts a little longer, but not much considering it's just 1 drive [more on that in a moment]). It is true to say that in a case where the ZIL load is insufficient to tax a 15K disk, its use would 'offload' that write traffic from the pool.. but it is conversely true to say that at such a low load, the pool was probably able to handle it and normal load without reaching a tipping point. There are situations where this could end up being untrue, certainly, but as a general rule of thumb -- do not use spinning disks for ZIL.
Also, see this link -- MULTIPLE 15K disks as log devices is not going to improve ZIL performance at all: http://www.nex7.com/node/12
For L2ARC, consider this: why not just put more spinning disks in the pool itself? They can then be utilized for the reads that miss the ARC (RAM), and can be utilized for writes as well (whereas when in L2ARC, they can only be utilized for reads that miss ARC). I'm a big fan of NO L2ARC until RAM in the system is maxed out (max being either motherboard maximum, or maximum that is financially feasible; after all, filling a motherboard with 8 GB DIMMs is fairly cheap, but with 16 GB DIMMs is not).. and once you're at that point, then and only then look at L2ARC (and only if your working set size still isn't fitting in ARC). When you do that, utilizing spinning media is silly; you should have just put them in the pool and let it absorb the reads that way -- if you're going to have a disk L2ARC, it should be SSDs or nothing.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
Interesting article, thanks for the link. To clarify: my assumption was he already had the SAS devices to be used for L2ARC and ZIL. If you are starting from scratch, I agree, SSD is the way to go. On the other hand, I'm not sold on the L2ARC argument. In his case, the network pipe is going to limit him to about 100MB/sec read. So a 15K SAS drive (capable of sustained 150MB/sec) is not going to break a serious sweat, in which case the interesting factor will be latency. In that case, my 3ms SAS drive will do far better than the 7200RPM SATA drive, and putting the 15K drives in the pool will only help 1/Nth of the reads (assuming of course a significant percentage of the working set fits in L2ARC.) I guess it depends on your working set then? My esxi datastore is about 230GB, and the working set fits almost completely in L2ARC, so for me, it's a big win.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Andrew Galloway about 1 year ago
I don't like it. The reason I don't like it is when I answer these things in public forums, I tend to think of all potential audiences who will hit upon it in the future, and as such, I want to give generic advice -- and generic advice, as you agree, is stay the heck away from spinning disks for L2ARC & ZIL.
That said, in a small use case as described, yes -- though I'd be careful assuming the 15K SAS disk is a 3ms, 150 MB/s device; what kind of perf you're going to get out of it when used as L2ARC has a lot to do with average block size and the level of eviction going on (eg: amount of writing impeding on read perf). I've never met a 15K SAS disk that maintains a 3ms average response time with truly random 8K reads hammering it. I've ALSO never met a 15K drive capable of 12,800 random read IOPS -- which is the number of 8K blocks that fit in a 100 MB/s gigabit link. So that 15K disk is still quite capable of being your bottleneck, even assuming the connectivity out is a mere gigabit. Obviously 8K is the average for iSCSI; if the only use-case for the box is NFS, then that average jumps up to close to 128K and is nicer.
Though again, when you then say you've got a pool with only a handful of disks, throwing one more disk into it as opposed to putting those reads into the L2ARC does, in theory, make sense. But also bear this in mind -- ZFS doesn't currently do any sort of intelligent decision making about when NOT to use L2ARC. If it's in L2ARC, it uses the L2ARC -- even if the L2ARC is friggin hammered and taking forever to reply and the actual pool is underutilized and has the IOPS to spare -- ZFS will still wait 30, 50, 100+ms for those L2ARC entries as opposed to going to ask the pool disks instead (this is a potential area of improvement for ZFS).
Where I have seen a single or couple of 15K SAS disks used as L2ARC, the pool had 20+ disks in it, and the IOPS to spare to take over for the L2ARC drives, AND those 15K disks were completely 100% hammered and averaging 50+ms latency responses, so they were terrible bottlenecks.. in that case, pulling them out and putting a few more disks into the pool is an obvious way to go (the pool was mirror pairs). Note that yes, the new mirror pair of disks doesn't just automatically receive all the reads the L2ARC disks were getting, but over time and averaging it out, they DO provide a substantial boon in read IOPS potential for the pool, AND they're no longer a bottleneck in the form of L2ARC. :)
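As a rough check of the 12,800 figure, here is a minimal sketch of the arithmetic. The function name is made up for illustration, and it assumes 1 MB = 1024 KB and ignores protocol overhead, which is what yields the number quoted in the post.

```python
def iops_to_fill_link(link_mb_per_s, block_kb):
    """Random-read IOPS needed to keep a link of the given speed full.

    Assumption: 1 MB = 1024 KB, no protocol overhead.
    """
    return link_mb_per_s * 1024 / block_kb


# 100 MB/s is the usable gigabit throughput assumed in the thread.
for block_kb in (8, 128):
    iops = iops_to_fill_link(100, block_kb)
    print(f"{block_kb}K blocks: {iops:,.0f} IOPS")
# prints:
# 8K blocks: 12,800 IOPS
# 128K blocks: 800 IOPS
```

The 128K case shows why a larger average block size is "nicer": the same link needs sixteen times fewer random reads per second to stay full.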
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Edmond DeMattia about 1 year ago
I set up another MD1000 with the same mirrored drive configuration but this time with an SSD read and log cache. I have to say the performance improvement is amazing! I'm now seeing wire speeds for large file writes and ~15MB per second writes for 10,000+ 1k files, all over CIFS. This is more than acceptable considering an NTFS file server was performing even worse than my original results above.
I left the default block size at 128k. What is the relationship regarding file read/write performance and block size?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Andrew Galloway about 1 year ago
Edmond DeMattia wrote:
I set up another MD1000 with the same mirrored drive configuration but this time with an SSD read and log cache. I have to say the performance improvement is amazing! I'm now seeing wire speeds for large file writes and ~15MB per second writes for 10,000+ 1k files, all over CIFS. This is more than acceptable considering an NTFS file server was performing even worse than my original results above.
Glad the SSD suggestion has paid off. It is not actually unexpected -- ZFS' ability to hide the slowness of spinning media from you relies in large part on SSD's, especially in high IOPS, small block size workloads.
I left the default block size at 128k. What is the relationship regarding file read/write performance and block size?
I wouldn't worry about block size too much in a CIFS/NFS (file level protocol) use-case. Tuning it provides only small incremental improvements in most environments, and often at a cost in other areas that makes it at best a wash, or even net negative. If you're getting decent, acceptable performance at present, I wouldn't advise spending a lot of your time on tuning to eke out a couple extra % points. Down that road lies madness. :)
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Jan Schotsmans about 1 year ago
Dan Swartzendruber wrote:
Especially for the l2arc. My crucial M4 (64GB) has almost 500MB/s sustained read, but only 100MB/s sustained write. The trick there is that if you can't write to the l2arc fast enough, you just drop those blocks and don't cache them. And of course, that the SSD has virtually zero latency for said reads is what gives you the big win for small random reads.
Update the M4's firmware and see the disks fly. With the latest firmware those drives do 550/480 R/W, even higher on some disks. Plus, the updated garbage collection routines and TRIM make the disks stay at peak performance almost indefinitely.
You should update it anyway before you run into the SMART bug those drives have. At exactly X hours of run time, they try to update the SMART counter and lock up entirely until they are power cycled. Then they'll run EXACTLY 1 more hour and lock up again trying to update the counter. Once you update the firmware, they'll operate as normal (and loads faster than they did before).
The reason for the M4's sad performance with the original firmware, is that Crucial doesn't use Sandforce controllers, like all the other SSD vendors do.
Sandforce develops the firmwares FOR the vendors, where the chipmaker Crucial uses doesn't.
With this, it has taken a while for Crucial to get the hang of developing the firmware.
But I'm glad to say that at this point, they are very good at it.
The M4's have throughput similar to the Sandforce based disks (that is, when the Sandforce SSD's are brand new; their performance deteriorates fast and doesn't get better), and thanks to the great garbage collection they also keep that same performance for their entire lifespan, where Sandforce based SSD's deteriorate and keep getting worse with age.
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
What is 'latest firmware'? I thought I had that, and the specs I saw show 400/100 or so, not what you are describing?
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Jan Schotsmans about 1 year ago
Most hardware has firmware: the internal software the device uses to do its internal logic and to communicate with your system. This is done in software so that it can be updated in case of bugs or requested improvements. In the case of SSD's, the firmware does write balancing and garbage collection, supports the SMART and TRIM interfaces, etc.
Every component in a computer has its own firmware, and most modern devices can be updated.
And yes, as I said, with the original firmware those drives were sold with, the performance was indeed as you stated about 400/100, with the newest firmwares (0009 and most importantly 0309, which fixes the SMART bug), they are loads faster.
The SMART bug is described as follows:
Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time.
But since then, Crucial has learned a whole lot about writing the firmware (for the reasons I stated before) and the performance for these drives has increased substantially.
You can download the firmware for Crucial M4 SSD's here: http://www.crucial.com/help/ssd/index.aspx
At "Firmware and Downloads" select the M4. It'll list the available updates (currently the latest is 0309, released in January).
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
I already have the 9 firmware, and as I said, I don't see anything like the numbers you mention. Here is an example (connected to an m1015 HBA), doing 8GB write and then read:
8589934592 bytes (8.6 GB) copied, 74.6749 s, 115 MB/s
8589934592 bytes (8.6 GB) copied, 22.8495 s, 376 MB/s
This jibes pretty closely with crucial's specs for the 64GB SSD. I don't see how we can be talking about the same unit?
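For readers who want to check the two rates quoted above, here is a small sketch of the arithmetic. The function name is hypothetical, and it assumes decimal megabytes (1 MB = 10**6 bytes), which matches the "8.6 GB" shown for 8589934592 bytes.

```python
def throughput_mb_per_s(bytes_copied, seconds):
    """Throughput in decimal megabytes per second (1 MB = 10**6 bytes)."""
    return bytes_copied / seconds / 1e6


# Figures taken from the output quoted in the post (8 GiB = 8 * 2**30 bytes).
write_rate = throughput_mb_per_s(8589934592, 74.6749)
read_rate = throughput_mb_per_s(8589934592, 22.8495)

print(f"write: {write_rate:.0f} MB/s, read: {read_rate:.0f} MB/s")
# prints: write: 115 MB/s, read: 376 MB/s
```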
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Jan Schotsmans about 1 year ago
Have to say I'm sorry and that I was mistaken about the write speed (don't post from memory when dead tired :p). The speeds I noted were from a 2-disk RAID0 on my desktop back when they had 0001 firmware. But that still means yours isn't even doing half of what they should be doing, per disk, for writes.
But I am certain they should be doing way higher than what you report:
Check out this review for a comparison between 0001 and 0009 firmware, then add a tad more read and a tad more write for 0309.
http://www.legitreviews.com/article/1697/2/
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by Dan Swartzendruber about 1 year ago
Now I understand the confusion. As I've said before, I have the 64GB unit, you are testing the 256GB unit. If you read the specs for the crucial SSDs (can't speak to any others), the spec'ed write speeds (at least per newegg) are:
64GB => 95MB/s
128GB => 175MB/s
256GB => 260MB/s
RE: How To Configure ZFS for Small File IO? Performance is awful! - Added by David Bond about 1 year ago
In terms of performance, as recommended, you want to keep the block size at 128KB. This value isn't the block size used for all writes; it is the maximum block size allowed for writes. If you are not using it for block-level access (iSCSI, FC, etc.), then when a file is written to ZFS its block size is determined by the file's size. If you write a 3KB file, it will be put in a 4KB block; if you write a 6KB file, it will go into an 8KB block. A 500KB file goes into 128KB blocks. The reason you would change this value is if you have files that contain structured data and are read and written at specific block sizes, i.e. databases. That way, ZFS wouldn't use 128KB blocks for the 200GB file you have written when you are reading and writing, say, 8KB blocks to it. You can set a max block size of 8KB so that all your writes align to the structure within the file, you don't have to do a read-modify-write on every entry update, and you don't need to read 128KB when you wanted only 8KB.
If you have the CPU available, you want compression. Depending upon your data, this can improve performance, as it could reduce that 6KB file into, say, a 4KB or a 2KB block, reducing the size of the read/write. It could also reduce the number of I/Os needed for larger files, as that 500KB file could be compressed into, say, 300KB, so it now only needs 3 128KB blocks, not the 4 it had before.
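The sizing rule described in this post can be sketched roughly as follows. This only models the description given here (one power-of-two block for small files, fixed-size blocks for larger ones); the function name is invented for illustration, and real ZFS allocation depends on details such as sector size that this ignores.

```python
import math


def block_layout(file_kb, recordsize_kb=128):
    """Return (block_size_kb, block_count) under the simplified model in the post.

    Assumption: a file no larger than recordsize gets a single block rounded up
    to the next power of two; a larger file is split into recordsize blocks.
    """
    if file_kb <= recordsize_kb:
        size = 1
        while size < file_kb:
            size *= 2
        return size, 1
    return recordsize_kb, math.ceil(file_kb / recordsize_kb)


for file_kb in (3, 6, 500):
    print(file_kb, block_layout(file_kb))
# prints:
# 3 (4, 1)
# 6 (8, 1)
# 500 (128, 4)
```

Under the same model, capping the block size at 8KB turns the 500KB file into 63 blocks (`block_layout(500, recordsize_kb=8)`), which fits the earlier observation in the thread that small files sped up while large files slowed down.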
From my understanding, with the ZIL, the data you are sending will only ever go into the ZIL if it is below 64KB in size (zfs_immediate_write_sz). Blocks over 64KB have their data written to the pool directly (not committed, i.e. the COW isn't carried out, just the data saved) and their metadata (location etc.) written into the ZIL. Blocks under 64KB will have both the metadata and the block data written into the ZIL.
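The decision described in this paragraph could be pictured like this. It is a simplified model of the poster's understanding, not ZFS code: the names are hypothetical, the 64KB threshold is the value given in the post (the actual default may differ by platform and version), the handling of a write of exactly 64KB is an assumption, and the effect of logbias is ignored.

```python
from enum import Enum


class ZilPath(Enum):
    DATA_IN_ZIL = "data and metadata logged in the ZIL"
    DATA_IN_POOL = "data written to the pool, only metadata logged in the ZIL"


def zil_path(write_bytes, threshold_bytes=64 * 1024):
    """Pick the logging path for a synchronous write, per the post's description.

    Assumption: a write of exactly threshold_bytes is treated as small.
    """
    if write_bytes > threshold_bytes:
        return ZilPath.DATA_IN_POOL
    return ZilPath.DATA_IN_ZIL


for size in (4 * 1024, 128 * 1024):
    print(f"{size // 1024}KB write: {zil_path(size).value}")
```

With hundreds of thousands of files in the 1-500KB range, this model suggests a mix of both paths, so the speed of whatever holds the ZIL matters for the small end of that range.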
You may want to look at configuring jumbo frames, if you can set them on the clients that will be accessing it.
With the logbias, you may want to change that to latency, as you are working with small files, not large, streaming files that throughput would help with.
With the PERC controller, do you have the onboard cache enabled, if it has one? If so, disable it; it screws up the performance. I tested a few years back with an MD1000 and PERC 5e, and CIFS performance was crap while it was enabled. It would pulse: data would transfer at 100MB/s to the server, then stop for a few seconds, start again, stop, every time slowing down. The same happened with the MD1120 we tested a year ago with a PERC 6 card, but that was testing with iSCSI.