Understanding the basics of ZFS architecture

The Starline ZFS FAQ

Learning the essentials of the ZFS architecture will help you achieve two important goals:

  1. Select the best hardware components for your TrueNAS storage server
  2. Configure your server to give you both the performant and the data security you aim for

Common POSIX filesystems serve one major purpose: retaining data, and its relevant meta-data, in a structured and accessible form. And to format any disk using a specific filesystem you need to use a volume manager (think fdisk, gparted, lvm, etc.) to create volumes (think partitions and RAID groups). Then you have to format those volumes into your preferred filesystem (FAT, EXT4, NTFS, XFS, etc.).

ZFS, on the other hand, is not just a filesystem, it is also a volume manager by and for itself. It can create volumes and assemble and manage its own software RAID structures – known as RAIDZ. This also eliminates the need for a hardware RAID controller. Actually, you are generally discouraged from using one – but this does not mean you cannot use one if you wish to. You just need to know which controllers you should use and in which way with ZFS. More on this point later on. Let us now familiarise ourselves with some of ZFS important terminology such as ZPools, VDEVs, Datasets, Volumes, ARC, and ZIL.   

Index

What is a ZPool?

Can increase a ZPool capacity?

Can I use a JBOD to create a ZPool?

What are VDEVs? And which VDEV types does ZFS support?

Can I increase a ZPool capacity?

What is RAIDZ?

Which RAIDZ is faster? Which RAIDZ is safer?

Which RAIDZ level to choose?

Can I use RAIDZ1 if I have backup of my data?

What is a Hot-Spare?

What is a Cache device?

What is ZIL? And what is a SLOG device?

If I have no NVMe I don’t have ZIL?

So, What is SLOG?

Examples of structured ZPools and VDEVs

TL;DR

What is a ZPool? 

In any ZFS storage solution your data is saved to a ZFS pool (ZPool for short). This ZPool consists of one or more virtual device groups (VDEVs), those VDEVs consist of physical disks (devices). If there are more than one VDEV in the pool they will always be striped.  This means that any disk you plug into your ZFS server must belong to a VDEV.

This also translates to two very important pieces of information:

  • In ZFS your redundancy happens at the VDEV level, not at the pool level.
  • If you lose one complete VDEV, you lose your pool
A ZPool consisting of two striped VDEVs
A ZPool consisting of two striped VDEVs

During writes, data is distributed evenly across devices (disks) based on how much free space each device has. There is an exception to this rule and this is when ZFS knows it is best to save slightly more data to one disk because it fits better within the VDEV/pool. More on this later on.

Can increase a ZPool capacity?

  1. You can increase a ZPool capacity by adding VDEVs or growing one, or more, VDEV.
  2. You can increase the capacity of any VDEV by replacing all disks within that VDEV.

This means: to double the capacity of a VDEV built with 4x 2 TB you need to replace all 2 TB disks one by one, and wait for ZFS to finish its recovery process (AKA resilvering). Once ZFS has finished resilvering the fourth disk you will be able to see the new capacity in your TrueNAS > Pools page.

But growing a pool is a one-way-process. You cannot remove a VDEV from a ZPool or decrease its capacity.

Can I use a JBOD to create a ZPool?

You can create a ZPool on a VDEV that is built using disks of different capacities. But, in this case, your total capacity will be calculated using the capacity of the single smallest disk within your VDEV. For instance: if you create a ZPool on a single VDEV using 8 Disks of 7x 4 TB + 1x 500 GB this will be as good as building a pool using 8x 500 GB. You would be wasting more than 24 TB of storage because of that single 500 GB disk. In this case it would be wiser to build your pool using 7x 4 TB disks, or just add one more identical 4 TB disk.

You can also create a ZPool using multiple VDEVs of mismatching raw capacities. For instance, you can have the following VDEV compositions in a single ZPool:

VDEV0 = 4x 4 TB disks = Raw Capacity of 16 TB
VDEV1 = 4x 6 TB disks = Raw Capacity of 24 TB
VDEV2 = 8x 2 TB disks = Raw Capacity of 16 TB

And your pool structure would look like this:

A ZPool consisting of mismatching VDEVs
A ZPool consisting of mismatching VDEVs

What are VDEVs? And which VDEV types does ZFS support?

VDEVs are ZFS Virtual Devices built by grouping one or more physical disk in a ZFS-compatible VDEV type. VDEVs can be treated as disks. You can stripe and mirror them to form a ZPool. ZFS supports seven different types of VDEVs. VDEV can be any of the following:

Physical DiskCan be a HDD, SDD, PCIe NVME, etc.
FileAn absolute path to a pre-allocated file or image
MirrorA standard software RAID1 mirror
ZFS software RAIDZ1/2/3Non-standard Distributed parity-based software RAID
Hot SpareDisks marked as hot spare for ZFS software RAID
CacheA device for ZFS level 2 adaptive read cache – known as L2ARC
LogA separate log device (SLOG) for ZFS Intent Log (ZIL)*
*By default, the ZIL is written on the same data disks in the ZPool.

What is RAIDZ?

RAIDZ is a non-standard software RAID that ZFS can use to construct VDEVs using available devices. By non-standard we mean it is ZFS-specific and does not conform to other common software RAID standards.

RAIDZ supports the following three levels of RAID:

RAIDZ LevelEquivalent toParity DisksDisk loss toleranceSpeed
RAIDZ1RAID5One-1fastest
RAIDZ2RAID6Two-2slower
RAIDZ3RAID7Three-3slowest

Which RAIDZ is faster? Which RAIDZ is safer?

From the table above you can see that while RAIDZ3 is the safest with a disk-loss tolerance of up to -3 disks it is also going to be the slowest of them, because triple parity bits must be calculated and written to the devices.

RAIDZ3 is also the least space efficient because 3 disks are be used for parity.
RAIDZ1, on the other hand, is the fastest and the most space efficient of the three. But it is also the most dangerous.
RAIDZ2 is a balance in between. It offers higher performance than RAIDZ3 and a better protection than RAIDZ2. It is the setup we usually recommend to our ZFS/TrueNAS customers.

Which RAIDZ level to choose?

Only you can answer this question. And to answer it you must consider three requirements:

  • The level of data protection
  • The usable capacity*
  • The level of system performance

You need to find a middle-ground where you have good data protection, satisfying performance, and adequate capacity. Whatever you do, do not compromise when it comes to data integrity.

A diagram of capacity ZFS

From our experience we can only advise against RAIDZ1. In spite of its relatively fast performance, it offers the worst data protection of the three. If you are so unlucky as to lose one more disk during a pool recovery process (resilvering), there is a good chance of losing your entire pool.

* Usable capacity of any ZPool is determined by the underlying VDEVs structure. 

Can I use RAIDZ1 if I have backup of my data?

Only if you have a watertight and contentious backup plan.  TrueNAS Core supports both replication and cloud backup to various storage providers.

Please keep in mind that if your daily operations rely on the immediate availability of your data then replication (having two identical primary and secondary servers) is the way to go. Cloud backup will translate to downtime since you still have to retrieve the data and make it available throughout your network.

What is a Hot-Spare?

A hot-spare is a device that could be used to replace a failed or faulty device in a ZPool. A hot spare device is inactive in a pool until you instruct ZFS/TrueNAS to use it to replaces the faulty device. Hot-spares are not a must, but they are highly recommended for mission-critical servers because they help shorten the overall recovery time. If you have a spare disk ready

Just keep the two following points in mind:

  1. Hot-spares do occupy server slots and this will affect the usable capacity of your server.
  2. Spare disks must be identical to, or larger than, the disks they will replace.

Hot-spares themselves do not offer any data redundancy. They are a static (sleeping) member of the ZPool that does not contribute neither to the storage capacity nor to the performance. They are there just in case a device might fail. They are also most useful when you want to initiate the recovery process remotely – since the spare is already plugged into the chassis.

With that in mind, we advise you not over stuff your server with hot-spares, because every spare disk you plug in replaces a disk that you can actually use to store your data. You also still have your VDEV layout (RAIDZ2, RAIDZ3) Mirroring, etc.) to cover your back.

Also, consider buying your spares with your new server – especially if you cannot wait for back-orders and couriers.

What is a Cache device?

By default, ZFS uses system RAM to cache the read data blocks. This helps accelerate secondary client data requests because ZFS can then serve them directly from the RAM instead of going back to the pool disks. This primary cache is known as the Adaptive Replacement Cache (ARC) and it always lives in the RAM.

Although it is always advised to add more physical RAM modules when you want to increase the amount of cached reads, sometimes you might want to add a Level-2 Adaptive Replacement Cache (L2ARC) to your system, especially when you run out of RAM slots, when RAM is too expensive to obtain, or when you have some blazing fast NVMe drives to use as L2ARC.

Keep in mind that increasing the amount of system physical RAM is always better than using a L2ARC.

What is ZIL? And what is a SLOG device?

ZIL keeps a special log called the ZFS Intent Log (ZIL). This is where ZFS logs synchronous operations before it is written to the ZPool array. A synchronous write is when a client sends a piece of data to ZFS to be written and waits for an acknowledgement signal (ACK) from ZFS that the data has been written to the ZPool disks before it could send any more data. Although this provides better data safety, but it relatively slows things down, especially when your ZPool is built on top of slow spinning disks.

To help speed things up, and avoid data loss in case of power interruption, ZFS does not use the RAM to cache writes, instead it uses faster flash drives as persistent storage for write cache. This cache accelerates writes by, temporarily, hosting the data on a persistent medium while it gets written to the ZPool’s backing disks, allowing a faster acknowledgement to be sent back to the client and decreasing the time to the next patch of data to be sent out.

If I have no NVMe, I don’t have ZIL?

Your TrueNAS Core server has a ZIL, whither you have dedicated NVMes for it or not. By default, ZIL resides on the same set of disks where the data is stored. It is a tiny, hidden, slice of each member disk within your pool.

TrueNAS RAIDZ ZFS ZPool: A diagram of a ZPool with the ZIL integrated into the pool’s disks
A diagram of a ZPool with the ZIL integrated into the pool’s disks

So, What is SLOG?

Independent NVMe/SSD disks can be used to host the ZIL writes and improve the performance of a ZPool. When you choose to use faster flash drives to speed up ZIL operations, you are basically separating the ZIL from the spinning pool disks. SLOG is that separate NVMe disk you use for the ZIL – hence the term Separate Log, because they are no longer the same disks where the data is saved.

TrueNAS RAIDZ ZFS ZPool: A diagram of a ZPool with separate flash ZIL and L2ARC disks
A diagram of a ZPool with separate flash ZIL and L2ARC disks

Finally, we strongly advise you to mirror your SLOG devices. This helps protect from data loss caused by sudden single SSD device failure. Using enterprise-grade NVMe, with Power Failure Protection (PFP) it a must for any mission critical ZFS server. Enterprise hardware has higher manufacturing standards, can survive years of 24/7 operation, comes with extended warranty, and has a significantly higher TDWPD value – compared to consumer-grade hardware.

Examples of structured ZPools and VDEVs

TrueNAS RAIDZ ZFS ZPool: ZPool with a single VDEV on a single device
ZPool with a single VDEV on a single device
TrueNAS RAIDZ ZFS ZPool: ZPool with a single VDEV on a 4x striped devices
ZPool with a single VDEV on a 4x striped devices
TrueNAS RAIDZ ZFS ZPool: ZPool with 2x VDEVs in RAIDZ1
ZPool with 2x VDEVs in RAIDZ1
TrueNAS RAIDZ ZFS ZPool: ZPool with 2x VDEVs in RAIDZ2
ZPool with 2x VDEVs in RAIDZ2
TrueNAS RAIDZ ZFS ZPool: ZPool with 2x VDEVs in RAIDZ3
ZPool with 2x VDEVs in RAIDZ3
TrueNAS RAIDZ ZFS ZPool ZPool with separate NVMe devices for L2ARC and ZIL
ZPool with separate NVMe devices for L2ARC and ZIL

 TL;DR

ZPools and VDEVs
  • ZPools are created using virtual disk groups (VDEVs) which in turn are built using disks (devices)
  • If a ZPool contains more than one vdev, the VDEVs are striped.
  • A ZPool capacity is the sum of the capacities of its underlying VDEVs – depending on internal VDEVs structure (RAIDZ type, Mirroring, Striping, etc.).
  • You can create a ZPool using a JBOD.
  • The capacity of any VDEV = The number of disks X The capacity of the smallest disk – Parity.
  • You should create a ZPool using VDEVs of the same size (built using identical disks).
  • You can grow a ZPool by adding more VDEVs
  • You cannot remove a VDEV from a ZPool.
  • You can grow a VDEV by replacing all member disks one by one and resilvering the pool.
  • Resilvering is the process of ZFS pool rebuild after a disk replacement/upgrade.
  • You cannot remove a disk from a VDEV.
  • ZPool redundancy is defined at the VDEV level.
  • If you lose a complete VDEV you will lose your pool.
  • There are three levels of RAIDZ: 1,2, and 3. The digit represents its parity redundancy.
  • Do not use RAIDZ1. It offers the worst protection level.
  • Again, do not use RAIDZ1.
TrueNAS
  • You can have multiple pools within your TrueNAS Core server.
  • The best pool setup is a middle-ground between capacity, performance, and data protection.
  • The usable capacity of your ZPool is determined by underlying VDEVs structure. 
  • TrueNAS Core supports both replication and cloud backup.
  • Replication is having one primary TrueNAS server clone its contents on a secondary server. It is preferred that both servers are hardware-identical, but it is not a must.
  • A hot-spare is a device that could be used to replace another faulty device in a ZPool.
  • ZFS has two levels of read cache: ARC and L2ARC
  • ARC lives in the RAM.
  • L2ARC can be fast NVMe or SSD.
  • It is always best to add more physical RAM than to add L2ARC.
  • ZFS has Write Intent Log (ZIL) where ZFS logs synchronous operations before it is written to the ZPool array.
  • ZIL, by default, resides on the same disk where the data is stored.
  • To improve synchronous write operations ZFS can use separate devices for the ZIL log. Those are usually fast NVMes/SSDs. Because they are separate they are called SLOGs.
  • SLOG devices do not have to be BIG, but they must be resilient.
  • You must mirror your SLOG devices to protect against single device failure.
  • You should stripe your L2ARC devices to improve performance.

Individual solution

What are your requirements?

Starline contact

Any questions? Please contact us.

This proven expert for Linux and Ceph platforms joined Starline in 2018. His preference is for sophisticated open source solutions and tricky product developments. Therefore, he is also available for enquiries regarding PetaSAN and TrueNAS (formerly FreeNAS). But he also likes ARM servers from Ambedded or Mac operating systems. In his private life, the engineer enjoys working with 3D printers and robots.

Mohammad Ammar
Technical Support