Day: October 25, 2021

ZFS Concept

Pool

A ZFS pool (Zpool) is a collection of one or more virtual devices (vdevs), where a vdev is a group of physical disks. They have the following characteristics.

  • The redundancy level of a vdev can be a single drive, mirror, RAID-Z1, RAID-Z2, or RAID-Z3.
  • After a Zpool is created, it may not be possible to add additional disks to an existing vdev; mirrors are the exception.
  • Additional vdevs can be added to expand the Zpool.
  • The storage space allocated to the Zpool cannot be decreased.
  • The drives in the vdevs that make up the Zpool can be replaced.

If there is a need to change the layout of the Zpool, the data should be backed up and the Zpool destroyed.
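
As a minimal sketch (the pool name tank and the adaX device names are only placeholders), a mirrored pool can be created and later expanded by adding a second mirror vdev:

zpool create tank mirror ada1 ada2
zpool add tank mirror ada3 ada4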

Datasets

A dataset is a space that emulates a regular file system.

Datasets can be nested, and each dataset can have its own settings for snapshots, compression, deduplication and so on.
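
For example (a sketch; the dataset names under tank are placeholders), a nested dataset with its own compression setting:

zfs create tank/dataset1
zfs create tank/dataset1/projects
zfs set compression=lz4 tank/dataset1/projects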

Volumes

A volume (zvol) is a space that emulates a block device.
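
For example (a sketch; the volume name is a placeholder), a 10 GB zvol can be created as follows and then used like any other block device (its device node appears under /dev/zvol/):

zfs create -V 10G tank/vol1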

Data Integrity

No overwriting

The copy-on-write mechanism keeps the old data on disk instead of overwriting it in place.

Checksum

Checksum information is written when data is written to disk and verified when the data is read back. When a checksum mismatch is detected, redundant data is used for correction.

Different checksum algorithms can be used:

  • Fletcher-based checksum
  • SHA-256 hash
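
The checksum algorithm can be set per dataset; for example (a sketch with a placeholder dataset name):

zfs set checksum=sha256 tank/dataset1
zfs get checksum tank/dataset1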

ZFS RAID

  • Single – the Zpool has a vdev consisting of a single disk; striping across such vdevs is similar to RAID0.
  • Mirror – similar to RAID1.
  • RAIDZ1 – similar to RAID5, but without the write hole issue.
  • RAIDZ2 – similar to RAID6, with 2 disks of redundancy.
  • RAIDZ3 – similar to RAID6, but with 3 disks of redundancy.

The RAID write hole in RAID5/RAID1 occurs when one of the member disks no longer matches the others (for example, after a power loss in the middle of a write), and by the nature of single-redundancy RAID5/RAID1 it is impossible to tell which of the disks is bad.
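
For example (a sketch; pool and device names are placeholders), a RAIDZ2 pool with two disks of redundancy can be created as follows:

zpool create tank raidz2 ada1 ada2 ada3 ada4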

Errors

Checksum mismatch

ZFS is a self-healing system. If a checksum mismatch is detected, ZFS tries to retrieve the data from the other disks. If that data is correct, the system repairs the incorrect data and checksum.

Disk failure

If a disk in a Zpool fails, the pool is set to the degraded state. The data on the failed device is then recalculated from the redundant copies and written to the spare disk that replaces the failed one; this process is called resilvering. Once the restoration operation is complete, the status of the Zpool changes back to online. If multiple disks fail and there are not enough redundant devices left, the Zpool changes its state to unavailable.
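
A failed disk can be inspected and replaced with commands like the following (a sketch; pool and device names are placeholders), which triggers resilvering:

zpool status tank
zpool replace tank ada2 ada5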

Migrate to different system

On the old system, export the Zpool, which unmounts the Zpool's datasets and zvols.

On the new system, import the Zpool, which mounts the Zpool's datasets and zvols.
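
A sketch of the two steps (tank is a placeholder pool name):

zpool export tank
zpool import tank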

Maintenance

Scrubbing

Scrubbing is a consistency check operation that also tries to repair corrupted data.
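
A scrub can be started manually and its progress checked afterwards (tank is a placeholder pool name):

zpool scrub tank
zpool status tank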

No defragmentation

There is no online defragmentation in ZFS, so try to keep zpools below 70% utilization instead.

Copy-on-write

On ZFS, changed data is written to a different location on the disk rather than over the original data, and only then is the metadata updated to point to the new location. This mechanism guarantees that the old data is safely preserved in case of a power loss or system crash that would otherwise result in data loss.

Snapshots

A snapshot contains the information needed to retain the original version of the file system. Snapshots initially require no additional disk space within the pool. Once the data referenced by a snapshot is modified, the snapshot starts to consume disk space, because it still points to the old data.
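
Snapshots are created and listed with the zfs command; for example (dataset and snapshot names are placeholders):

zfs snapshot tank/dataset1@snap1
zfs list -t snapshot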

Clones

A clone is a writable version of a snapshot. Overwriting blocks in the cloned volume or file system decrements the reference count on the previous blocks. The original snapshot that the clone depends on cannot be deleted while the clone exists.
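
A clone is created from an existing snapshot; for example (names are placeholders):

zfs clone tank/dataset1@snap1 tank/clone1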

Rollback

The rollback command reverts a dataset or a volume to a previous snapshot. Note that by default the rollback command can only revert to the most recent snapshot; rolling back to an earlier snapshot destroys all intermediate snapshots.
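
For example (names are placeholders), rolling back to the latest snapshot, or to an older one with -r, which also destroys the intermediate snapshots:

zfs rollback tank/dataset1@snap1
zfs rollback -r tank/dataset1@older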

Promote

The promote command replaces an existing file system or volume with its clone by reversing the dependency between the clone and its origin, so that the original can then be destroyed.
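
For example (the clone name is a placeholder):

zfs promote tank/clone1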

References

ZFS Essentials – What is pooled storage?
ZFS Essentials – Copy-on-write & snapshots
ZFS Essentials – Data integrity & RAIDZ
RAID Recovery Guide

Disable Copy-On-Write on BTRFS

The issue with COW (copy on write) is fragmentation, because it always writes to a new device block. This is acceptable on SSDs but not on traditional spinning disks. Even on SSDs, if the block size is large, the amount of data actually written can be much larger than the size of the update. Because of this issue, it is recommended to disable copy-on-write for database and VM file systems.

Methods

Mounting

Disable COW by mounting with the nodatacow option.
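
For example (device and mount point are placeholders), mounting a btrfs file system with COW disabled:

mount -o nodatacow /dev/sdb1 /var/lib/mysql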

The following facts should be considered:

  • This implies nodatasum as well
  • COW may still happen if a snapshot is taken
  • COW will still be maintained for existing files
  • COW status can be modified only for empty or newly created files.

File Attribute

For an empty file, add the NOCOW file attribute (use the chattr utility with +C):

touch file1
chattr +C file1

For a directory with the NOCOW attribute set, new files created in it will inherit this attribute.

chattr +C directory1

For existing files, pre-create an empty file with the NOCOW attribute, copy the original data into it, delete the original, and rename the new file back.

touch vm-image.raw.new
chattr +C vm-image.raw.new
fallocate -l 10G vm-image.raw.new
dd if=vm-image.raw of=vm-image.raw.new bs=1M conv=notrunc
rm vm-image.raw
mv vm-image.raw.new vm-image.raw

Subvolume (Untested)

A subvolume cannot be set to nocow separately; this is the official answer (see the BTRFS FAQ).

However, newly created files inherit the attributes of their directory. If a subvolume is mounted separately onto a directory that has the nocow attribute set, the files created in it will inherit the nocow attribute as well, regardless of the original volume.

Create directory

mkdir /var/lib/nocow
chattr +C /var/lib/nocow

Create subvolume

mount -o autodefrag,compress=lzo,noatime,space_cache /dev/mapper/zpool1 /mnt/zpool1
btrfs subvolume create /mnt/zpool1/nocow

Mount subvolume

/dev/mapper/zpool1     /var/lib/nocow  btrfs       rw,noatime,compress=lzo,space_cache,autodefrag,subvol=nocow  0 0

Drawback

No checksum, no integrity.

Nodatacow bypasses the very mechanisms that are meant to provide consistency in the filesystem: CoW operations are achieved by constructing a completely new metadata tree containing both changes (the references to the data and the csum metadata) and then atomically switching the superblock to point to the new tree.

With nodatacow, the data and its checksum would have to be written to the physical medium in two separate writes. An I/O error between the two could leave the data and the checksum mismatched, and file corruption could result.

References

BTRFS FAQ
Setting up a btrfs subvolume with noCOW

ZFS cache and log

There are two kinds of cache: a read cache and a write cache.

Read cache

Called ARC and L2ARC.

ARC (Adaptive Replacement Cache)

Held in memory, it caches the information that is likely to be required in the near future, while discarding the data that will be needed furthest ahead in time.

Its size can be tuned with kernel module parameters such as zfs_arc_max.
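
For example, on Linux with OpenZFS, the ARC can be capped at roughly 8 GiB either at runtime or persistently (the value is only an example):

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf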

L2ARC (Level 2 ARC)

Held on a dedicated cache device as an extension of the ARC. It can be created using the following command:

zpool add tank cache ada3

Note: tank is the pool name, ada3 is the block device used for caching

Write cache

Called ZIL (ZFS Intent Log).

Asynchronous

By default, ZFS caches write data in memory before writing it to disk; this is called asynchronous mode.

Synchronous

Synchronous mode makes sure data is written to disk before continuing; it can be set using the following command:

zfs set sync=always mypool/dataset1

ZFS Intent Log (ZIL)

This is the temporary space used to store data before it is written to the main disks, which can speed up write operations. A write operation is considered complete once the data has been written to the ZIL device. A dedicated ZIL device is called a SLOG (Separate Intent Log) device and can be defined as follows:

zpool add tank log ada3

Note: tank is the pool name, ada3 is the block device used for slog

If a faulty SLOG device is a concern, it can be mirrored too.

zpool add tank log mirror ada3 ada4

References

Configuring ZFS Cache for High Speed IO
ZFS Performance with Databases

Snapshot and Copy on Write

There are two types of snapshots, and they work in different ways.

Keep original data

The original data is not written to; all new data goes into a delta file.

VMware

All new data is written to a delta file and the original disk file is not changed. This is especially useful in a VDI environment, where all VDI servers can be based on the same image with no impact on the original disk file.

When a snapshot is deleted, the snapshot files are consolidated and written to the parent snapshot disk. If the parent is the base disk, all the data from the delta disk is merged into the virtual machine base disk.

QEMU / KVM: COW mode

COW mode is available for some virtual machine disk formats such as QCOW2. When using COW mode, no changes are applied to the disk image; all changes are recorded in a separate file, preserving the original image. Several COW files can point to the same image, which makes it possible to test several configurations simultaneously without jeopardizing the base system.

QEMU / KVM also allows the changes in a COW file to be merged back into the original image.
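
A sketch with qemu-img (the file names are placeholders): create an overlay backed by a base image, then later commit its changes back into the base image:

qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2
qemu-img commit overlay.qcow2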

Overwrite original disk

The latest data is written to the original disk, and the original data is moved to a delta disk.

RedHat LVM snapshot

The data in the snapshot volume is the original data.
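
A sketch with LVM (volume group and logical volume names are placeholders): create a snapshot of an origin volume, and later merge it back to revert the origin:

lvcreate -s -n snap1 -L 1G /dev/vg0/lv1
lvconvert --merge /dev/vg0/snap1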

ZFS and btrfs

Due to the native copy-on-write feature, the live file reference structure always points to the new data, while the old data remains referenced by the old (snapshot) reference structure.

Compare

Keep original data
  • Pros: supports multiple children without too much performance overhead.
  • Cons: deleting a snapshot takes time; when the delta disk is full, no more writes can be done.

Overwrite original disk
  • Pros: less overhead, since copying only happens the first time data is written to a location; deleting a snapshot is fast.
  • Cons: reverting a snapshot takes time.

Native COW
  • Pros: no overhead; dropping a snapshot is fast; reverting a snapshot is fast; no impact on the service when the snapshot is full (although the snapshot becomes corrupt).
  • Cons: unable to delete files when the disk is full; the disk fragments easily; disk write operations cannot be cached.

References

Deleting Snapshots
QEMU / KVM: Using the Copy-On-Write mode
Why would I want to disable Copy-On-Write while creating QEMU Images?