State of write cache on ZFS

A fatal flaw with ZFS is that it has no write cache.

If you have a large SSD and you'd like incoming writes to land on it while it flushes to spinning rust in the background (so you get 500MB/s instead of 150MB/s), you can't. There are a few hacks, like layering bcache under ZFS, but that is not easy to configure, has to be set up before you start using your spinning disks, and is hard to change later.

In 2015 there was a talk on a write cache for ZFS, but nothing has come of it since. https://www.youtube.com/watch?v=MkdrnG7GwdE

Solutions:
Should I use bcache, even though it's very sketchy?
Should I just wait and hope ZFS adds a write cache?
Should I just not be a pussy, buy a license for unRAID and be done with it?

  1. 3 months ago
    Anonymous

    Just a reminder that consumer SSDs will get absolutely destroyed by being used as a cache. The TBW rating is nowhere near enough for anything resembling long-term use.

    • 3 months ago
      Anonymous

      Yes, I prepared for this. I am using an Oracle F80, which is rated for 20 PBW.

    • 3 months ago
      Anonymous

      what about one of those cheapo 16 or 32 GB Optane drives?

      • 3 months ago
        Anonymous

        optane is ideal for this purpose but those small devices suck - they're only PCIe 3.0 x2. The 900p/905p cap out at just over 2GB/s, so that's low if you're pushing through a 40+ Gbps network, which leaves the P5800X - hella expensive, but the best device on the planet

        • 3 months ago
          Anonymous

          I mean in a home server. I'll never go over 1Gbps anyway, so it's not worth the extra hardware for my use case. I just want to know if those consumer Optane modules are up to a level of usage that a regular SSD won't survive. They were sold as cache drives for laptops, but I am unsure about the real-world performance and endurance.

          In an enterprise operation you'll of course need an enterprise-level Optane. I would love to have an enterprise budget, but I don't.
          The shit is that the tech is going to be discontinued...
          Benchmarktards only look at the sustained read speeds, not the IOPS or random reads

          • 3 months ago
            Anonymous

            optane has something like two orders of magnitude higher write endurance than NAND SSDs. If your network is only 1Gbps, then skip the SSD cache and just max out RAM. You can also disable sync (i.e. stop forcing the ack to wait on stable storage) if you so choose, to avoid write bottlenecks
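
            For reference, the knob being talked about here is the per-dataset "sync" property in ZFS; roughly like this (the pool/dataset name "tank/share" is made up):

                # check the current behaviour (standard = honour sync requests)
                zfs get sync tank/share
                # acknowledge sync writes immediately without waiting for stable storage
                # (faster, but you can lose the last few seconds of writes on power loss)
                zfs set sync=disabled tank/share
                # the opposite extreme: treat every write as synchronous (this is what pairs with a SLOG)
                zfs set sync=always tank/share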

            • 3 months ago
              Anonymous

              You can get a 16GB stick for less than 10 euros and a 32GB one for around 15, so it's way cheaper than RAM currently

    • 3 months ago
      Anonymous

      >absolutely destroyed by using it as a cache.
      No they won't. If you have a consumer workload it doesn't matter if you're using it for caching or not.

  2. 3 months ago
    Anonymous

    >A fatal flaw with ZFS is no write cache.
    except for the intent log and SLOG
    did you even google this before you posted shit?

    • 3 months ago
      Anonymous

      Intent Log/SLOG != Write Cache.
      It caches 5 seconds' worth of asynchronous writes, then writes them to spinning rust. ZFS actually does have a read cache with L2ARC.

      unRAID has full write and read SSD caching. You can even create a cache pool, instead of using just one disk.
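
      For reference, bolting a SLOG and an L2ARC onto a ZFS pool is just this kind of thing (pool and device names are made up):

          # dedicated SLOG device for the intent log (only ever helps sync writes)
          zpool add tank log /dev/nvme0n1
          # L2ARC read cache device
          zpool add tank cache /dev/nvme1n1
          # confirm the layout
          zpool status tank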

      • 3 months ago
        Anonymous

        >The purpose of the ZIL in ZFS is to log synchronous operations to disk before it is written to your array.
        Did you even google this?

        • 3 months ago
          Anonymous

          My apologies, I meant to say asynchronous...
          But it is not a full write cache. Please stop defending ZFS, it's a 20-year-old product built around spinning rust.

          • 3 months ago
            Anonymous

            you did say asynchronous. the ZIL does synchronous. seriously just go look this up. you wouldn't have a separate cache drive for async writes

            • 3 months ago
              Anonymous

              Okay. But this still doesn't solve the problem that ZFS doesn't have a write cache.

              I'm looking at unRAID and it seems to be pretty worthwhile for my use case. I think I'll try the trial out.

              • 3 months ago
                Anonymous

                retard.
                also unraid doesn't have read cache so you're like 0/2 for this whole shit.

              • 3 months ago
                Anonymous

                well then, rip.
                I basically want bcache for my zfs pool. I guess I have no other choice than to just use bcache.
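
                If I do go that route, my understanding is it looks roughly like this (device names made up, and it wipes the backing disk, so it really does have to happen before the pool exists):

                    # format one SSD as the cache and one HDD as the backing device
                    make-bcache -C /dev/nvme0n1 -B /dev/sda
                    # switch from the default write-through to write-back caching
                    echo writeback > /sys/block/bcache0/bcache/cache_mode
                    # then build the zpool on top of the bcache device(s)
                    zpool create tank /dev/bcache0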

              • 3 months ago
                Anonymous

                alternatively I can risk it and use Bcachefs

            • 3 months ago
              Anonymous

              async writes would just be buffered in RAM, correct? but this creates a weird situation if your SLOG is much bigger than your available RAM. in that case, sync writes could actually be faster than async writes

              • 3 months ago
                Anonymous

                All writes are buffered in RAM. The SLOG just allows the sync condition to be met ASAP while the written data still resides in RAM and gets written from RAM to disk. SLOG content is only read back if there is a failure before the RAM content can be written to disk, which is why the SLOG can be of very limited size. Data waiting to be written must be flushed to disk before that RAM can be freed for new writes (sync or async).
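
                If you want to see or tune how much dirty data ZFS will actually hold in RAM before throttling writers, the OpenZFS module parameters to look at are roughly these (the 8 GiB figure is just an example):

                    # bytes of dirty (not yet written) data ZFS allows before throttling new writes
                    cat /sys/module/zfs/parameters/zfs_dirty_data_max
                    # how often a transaction group is forced out regardless (seconds, default 5)
                    cat /sys/module/zfs/parameters/zfs_txg_timeout
                    # example: allow up to 8 GiB of dirty data in RAM
                    echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max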

  3. 3 months ago
    Anonymous

    >Google unraid
    >it's an os
    why are you comparing a filesystem with an OS?

  4. 3 months ago
    Anonymous

    Async writes are just cached in RAM until they're written to the disk, there's no reason to write them out faster because the thing that makes them asynchronous is that they don't have to wait for an acknowledgement.

    Pic related is why there's a performance benefit to sync writes from SLOG, obviously it doesn't apply to async.

    • 3 months ago
      Anonymous

      yes, but the data to be written must be completely buffered *somewhere* on the server before the async call can return. if the write is larger than the amount of available RAM, it would actually make more sense to buffer it on a bigger SSD, instead of filling up RAM and then needing to wait for the slow HDDs.

      • 3 months ago
        Anonymous

        [...]
        excellent, i see we have the same understanding. it’s still kind of disappointing that ZFS doesn’t handle this, but it is admittedly a rare case

        https://www.ebay.com/itm/385020026876
        https://www.ebay.com/itm/283816638613
        https://www.ebay.com/itm/364044394767

        384 GB of RAM for under $300, just add chassis and power. The situation you're describing is not a thing that actually exists if you're using hardware suited to your task.

        • 3 months ago
          Anonymous

          my CPU only accepts at most 32GB of RAM.
          I'm using it as a NAS, so all my writes are async and a SLOG will do nothing. I need a real SSD write cache with bcachefs. Why spend money on hardware when a software solution exists? Watch the video I gave you earlier and you'll see bcachefs is just better than ZFS.

          You don't need more RAM. Async or sync, I want my data written to an SSD. Just use bcachefs.

          • 3 months ago
            Anonymous

            >all my writes are async so a slog will do nothing.
            You can force all writes to occur synchronously, but that would be silly because it would be slow and you don't care about data integrity.
            >I need a real ssd write cache with bcachefs.
            Why? Writing twice is always going to be slower than writing once. If we assume that you have 10Gb Ethernet, that's roughly 1200 MB/s coming in, assuming nominal overhead. If you can write 500 MB/s to your HDD array, then you have 700 MB/s filling up your write cache. 700 MB/s will fill up 30 GB of RAM in about 43 seconds. In that 43 seconds, you would have written about 22 GB to disk. So after the first 50-odd GB of a file transfer to your NAS, you'd see your transfer rate fall by about 60% as you're throttled to disk speed. Past that point every additional GB takes about 2 seconds instead of under 1, so you'd lose roughly a minute on a 100 GB transfer. You could mitigate this with SSD caching, sure, HOWEVER, you could also buy an old 2U blade for $120, drop a $5 processor and $100 of RAM into it, and then you'd have 160 GB of memory to cache with. For $225, you'd get redundant PSUs, and full speeds for transfers up to 280 GB. How often do you send more than 280 GB all at once to your NAS? For that matter, how often do you send 50 GB to your NAS? Do you even have 10Gb Ethernet? It's entirely likely that you can't saturate your current system even with only 32 GB of RAM.
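
            The back-of-the-envelope math, if you want to poke at the assumed numbers yourself:

                # assumptions from above: ~1200 MB/s in over 10GbE, ~500 MB/s out to the
                # array, ~30 GB of RAM usable as a write buffer
                awk -v in_rate=1200 -v out_rate=500 -v ram_gb=30 'BEGIN {
                    fill = ram_gb * 1000 / (in_rate - out_rate)      # seconds until RAM is full
                    printf "RAM full after %.0f s, i.e. after %.0f GB received\n", fill, in_rate * fill / 1000
                    printf "past that, each GB takes %.1f s instead of %.1f s\n", 1000 / out_rate, 1000 / in_rate
                }'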

        • 3 months ago
          Anonymous

          Additionally, storing that much data in RAM is a risk for data loss from losing power. bcachefs with an SSD cache would reduce the amount of time data is left unprotected.

          • 3 months ago
            Anonymous

            >His server doesn't even have a UPS
            OP just can't stop sucking cocks, can he?

          • 3 months ago
            Anonymous

            Why the fuck are you so worried about data integrity yet you're installing memeOS and not using backup power and redundancy?

          • 3 months ago
            Anonymous

            you don't care about data loss though

          • 3 months ago
            Anonymous

            Yeah, you're just doing something wrong here buddy.

            Based on you using lame ass hardware (32GB max ram? What? My ZFS array has 256GB and it's a 10 year old garbage ebay server that I paid less than $100 for), I'm gonna go ahead and assume you're using garbage ass disks too instead of faster SAS drives.

            And even then, I'd wager that your network can't actually transfer faster than your garbage ass disks can write. I just don't believe that you're sitting here with a trashcan for a server but running on a 40gbps network.

            I have never had my 48TB ZFS be the bottleneck when transferring data in my office network, even on my 10gbps connected workstations.

            • 3 months ago
              Anonymous

              >running on a 40gbps network
              You don't need a 40Gbps network to write faster than 150MB/s, what the fuck are you even rambling about?
              >My ZFS array has 256GB and it's a 10 year old garbage ebay server
              Not everyone wants to run hot and loud garbage from a decade ago and claiming one should need some huge amount of RAM to write faster than 150MB/s on a bunch of disks is ridiculously stupid.

              • 3 months ago
                Anonymous

                you DO need at least a 10gb connection to write faster than that, though, and I doubt you have that either.

                My 10gbps connections are my limiting factor on my ZFS. I get a 900MB/s write speed without ram caching or any sort of SSD connected to anything. Network writes can only run about 1GB/s because that's just how fast a 10gb network runs.

                The problem is that you're dumb, not the technology.

            • 3 months ago
              Anonymous

              Why are you defending ZFS so much? If ZFS had the option for a real SSD write cache, would you use it? Obviously...

              Bcachefs is an objectively better fs than ZFS. It fixes many other critical flaws ZFS has, like handling multiple disks of varying sizes in one pool and easy removal and replacement of disks. There's no choosing which RAID level you want; instead you choose the number of redundant copies per pool (see the rough sketch below).

              Your argument for why ZFS is still okay is that you can buy old, loud enterprise gear. Sorry, I have a quiet Haswell-based tower NAS with 32GB of RAM. I have 3 hard drives, with 1 as redundancy. But if I also use an Oracle F80 with bcachefs, I can have the write speed of an SSD, with less hardware but smarter software.

              ZFS is 20 years old. Bcachefs is designed by the smartest guy in file systems right now. As I see it, everyone should switch to bcachefs, even if you have a powerful server. Then you might realize you don't need hardware as power-hungry as you think.
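
              That multi-device/replicas setup is roughly this, going by the bcachefs docs (device names made up):

                  # three disks of different sizes in one filesystem, two copies of all data
                  bcachefs format --replicas=2 /dev/sda /dev/sdb /dev/sdc
                  mount -t bcachefs /dev/sda:/dev/sdb:/dev/sdc /mnt/pool
                  # members can be added later without rebuilding the array
                  bcachefs device add /mnt/pool /dev/sdd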

              • 3 months ago
                Anonymous

                https://bcachefs.org/bcachefs_talk_2022_10.mpv

              • 3 months ago
                Anonymous

                >real ssd write cache, would you use?
                I'd use it if I really cared about data integrity, because I'd be forcing synchronous writes and using it as SLOG. The reason that ZFS doesn't have a "real" write cache is because that's a pointless feature that doesn't do anything except cost money. My argument isn't that you should buy new hardware (which, by the way, is also Haswell era and could easily be made just as quiet and low power), it's that you're not actually doing anything with your NAS that would require an SSD, and if you actually had a serious enterprise-level workload for it, you'd be better served by getting actual enterprise-level hardware.

                Newer is not always better, especially when it comes to data storage. Maybe it's faster, who cares? How often do you actually transfer hundreds of gigabytes to your NAS all at once? Nightly or weekly backups don't count, because you schedule those for when you're asleep so it doesn't matter if it takes longer.

                And in case you still somehow haven't figured it out: *an SSD write cache will not make your system faster or more efficient, because it all has to be written to disk anyway.*

              • 3 months ago
                Anonymous

                are you stupid? go to 27:00 of https://www.youtube.com/watch?v=MkdrnG7GwdE. You'll see a real write cache demolishing the stupid ZFS SLOG.

              • 3 months ago
                Anonymous

                >if ZFS had worse SSD solutions wouldn't you use it?
                No, I wouldn't use SSD write caching because it is literally slower than what my array does without it.

                >I have a quiet haswell based tower nas
                Congratulations? Tower servers exist. My setup is probably quieter than yours, was almost certainly cheaper, gives me way more storage, and outperforms your SSD, based on your "500MB/s" estimate earlier.

                >if I also use an oracle f80
                Ok, serious question. What the fuck did you do to that card to make it so slow?

                You apparently have no UPS, probably aren't using ECC memory, likely have no power supply redundancy, and you somehow broke the shit out of your cache card so that it only gets 500MB/s? WHAT? Did you wedge it into a 1x slot?

                You still haven't admitted to only having gigabit ethernet yet, either.

              • 3 months ago
                Anonymous

                The 500MB/s was a generalization for SATA SSDs. The F80 gets 2000MB/s. In either case, if you look at the writeback cache video you'll see there's a huge increase in performance,
                !!!!even for random writes!!!!

              • 3 months ago
                Anonymous

                You still haven't admitted to having a 125MB/s max throughput GbE network connection yet.

                You have no idea why people use NAS vs iSCSI vs local storage, do you?

              • 3 months ago
                Anonymous

                ya got me. I use 1 gig; in the future I may upgrade to 10 gig though. Also, write-cache random writes are faster than SLOG writes, so there is still some benefit at 1 gig speeds.

              • 3 months ago
                Anonymous

                with bcachefs you can even create a cache pool out of several SSDs, which together would be faster than a huge spinning-rust array in sequential speeds

  5. 3 months ago
    Anonymous

    Because that breaks ZFS data integrity, idiot. It's intentional.
    This literally is not an issue at all. If you're using ZFS you've already decided to trade a little performance for data integrity. If you haven't, then you've chosen the wrong filesystem.

    • 3 months ago
      Anonymous

      I've decided to install Arch Linux and use bcachefs on my NAS. Wish me luck. #BleedingEdge.

      • 3 months ago
        Anonymous

        So does bcachefs cache async writes on nonvolatile storage, then? To what end?

        • 3 months ago
          Anonymous

          bcachefs has 3 categories of storage: write cache, permanent storage, and read cache.

          For writing: Data is written to the cache pool before being written to permanent storage. If the write cache fills up, data is immediately written to permanent storage.

          https://wiki.archlinux.org/title/Bcachefs
          https://bcachefs.org/bcachefs_talk_2022_10.mpv

          Basically, bcachefs is really the best new fs out there, but it's still not mainlined yet.
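
          The tiered layout from those links boils down to something like this (labels and device names are made up, flags as documented in the bcachefs principles-of-operation doc):

              # SSD takes writes first (foreground) and serves as read cache (promote),
              # data is moved down to the HDDs (background) over time
              bcachefs format \
                  --label=ssd.ssd1 /dev/nvme0n1 \
                  --label=hdd.hdd1 /dev/sda \
                  --label=hdd.hdd2 /dev/sdb \
                  --foreground_target=ssd \
                  --promote_target=ssd \
                  --background_target=hdd
              mount -t bcachefs /dev/nvme0n1:/dev/sda:/dev/sdb /mnt/nas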

          • 3 months ago
            Anonymous

            Okay, but what's the advantage of writing asynchronously to cache instead of directly to storage? You're just duplicating effort at that point.

            • 3 months ago
              Anonymous

                You can give a NAS a shitton of data really quickly by temporarily storing it on a fast SSD. Eventually, that fast SSD will flush to the slow hard drives. Hopefully you don't overfill the SSD, or else your write speed will go to shit since you're then writing straight to spinning rust.

                For the user, it looks like they're writing to a fast SSD.

              • 3 months ago
                Anonymous

                Async writes are already assumed to be written instantly, it makes no difference whether they're written out to a cache first or held in RAM. The only time an async write cache would help is when you're so flooded with async writes that it overflows your RAM before it can be written out to the disk, in which case the solution is to either install more RAM (you can pick up DDR3 RDIMMs for $0.60/GB, which is less than 10x the cost of a good SSD and orders of magnitude faster), and/or to optimize your workload because in order for this to cause a problem we're talking a huge amount of data sustained over quite a while.

      • 3 months ago
        Anonymous

        >ssd cache
        >on a NAS
        absolute fucking tard

  6. 3 months ago
    Anonymous

    >(so you get 500MB/s instead of 150MB/s)
    Is ZFS really that slow? I have an mdadm RAID6 array with spinning rust and I can easily get several hundred MB/s transfer rates, though it does bounce up and down a fair bit.
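
    For comparison, that kind of array is just something like this (device names made up):

        # 6-disk RAID6: capacity of 4 disks, survives any 2 failures
        mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
        mkfs.ext4 /dev/md0
        cat /proc/mdstat    # watch the initial resync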

    • 3 months ago
      Anonymous

      With 8x 5TB drives, this guy got 700 MB/s without redundancy and up to 670 MB/s with redundancy.

      https://icesquare.com/wordpress/zfs-performance-mirror-vs-raidz-vs-raidz2-vs-raidz3-vs-striped/
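
      The layouts benchmarked there are along these lines (pool and device names made up, one layout or the other on the same 8 disks):

          # plain stripe across 8 disks, no redundancy (the ~700 MB/s case)
          zpool create scratch sda sdb sdc sdd sde sdf sdg sdh
          # raidz2: two disks' worth of parity (one of the "with redundancy" cases)
          zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh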

  7. 3 months ago
    Anonymous

    just use btrfs you fuck.
