Blog

ZFS Concept

Pool

A ZFS pool (Zpool) is a collection of one or more virtual devices (vdevs); a vdev is a group of physical disks. Zpools have the following characteristics.

  • The redundancy level of a vdev can be a single drive, mirror, RAID-Z1, RAID-Z2, or RAID-Z3.
  • After creating a Zpool, it is generally not possible to add disks to an existing vdev, except for mirrors.
  • Adding additional vdevs to expand the Zpool is possible.
  • The storage space allocated to the Zpool cannot be decreased.
  • The drives in vdevs that are part of the Zpool can be replaced.

If there is a need to change the layout of the Zpool, the data should be backed up and the Zpool destroyed.
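
As a quick sketch (the pool name tank and the device names below are just examples), a pool can be created from a mirror vdev and later expanded by adding another vdev:

# Create a pool with one mirror vdev
zpool create tank mirror /dev/sda /dev/sdb
# Expand the pool by adding a second mirror vdev
zpool add tank mirror /dev/sdc /dev/sdd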

Datasets

A dataset is a space that emulates a regular file system.

Datasets can be nested, and each dataset can have its own settings for snapshots, compression, deduplication and so on.
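
For example (the dataset names are placeholders), nested datasets can be created with their own property settings:

zfs create tank/data
zfs set compression=lz4 tank/data
zfs create -o compression=off tank/data/archive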

Volumes

A volume (zvol) is a space that emulates a block device.
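
A minimal sketch (pool and volume names are placeholders); the zvol shows up as a block device under /dev/zvol/:

zfs create -V 10G tank/vol1
ls -l /dev/zvol/tank/vol1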

Data Integrity

No overwriting

The copy-on-write mechanism keeps old data on the disk instead of overwriting it in place.

Checksum

Checksum information is written when data is written to disk, then verified when the data is read back. When a checksum mismatch is detected, redundant data is used for correction.

Different checksum algorithms can be used:

  • Fletcher-based checksum
  • SHA-256 hash
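
The checksum algorithm is a per-dataset property; as a sketch (the dataset name is a placeholder):

zfs get checksum tank/data
zfs set checksum=sha256 tank/data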

ZFS RAID

  • Single – the Zpool has a vdev consisting of a single disk, similar to RAID0.
  • Mirror – similar to RAID1.
  • RAIDZ1 – similar to RAID5, but without the write-hole issue.
  • RAIDZ2 – similar to RAID6, with 2 disks of redundancy.
  • RAIDZ3 – similar to RAID6, but with 3 disks of redundancy.

The RAID write hole in RAID5/RAID1 occurs when one of the member disks does not match the others; by the nature of single-redundancy RAID5/RAID1, it is impossible to tell which of the disks is bad.

Errors

Checksum mismatch

ZFS is a self-healing system. If a mismatched checksum is detected, ZFS tries to retrieve the data from other disks. If that data is correct, the system repairs the incorrect data and checksum.

Disk failure

If a disk in a Zpool fails, the pool is set to the degraded state. The data that was on the failed device is then reconstructed from the redundant copies and written to a spare disk, which replaces the failed one. This is called resilvering. Once the restoration operation is complete, the status of the Zpool changes back to online. If multiple disks fail and there are not enough redundant devices, the Zpool changes its state to unavailable.

Migrate to different system

On the old system, export the Zpool, which unmounts the Zpool's datasets and zvols.

On the new system, import the Zpool, which mounts the Zpool's datasets and zvols.
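
A minimal sketch (tank is a placeholder pool name):

# On the old system
zpool export tank
# On the new system
zpool import tank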

Maintenance

Scrubbing

Scrubbing is a consistency-check operation that also tries to repair corrupted data.
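
For example (tank is a placeholder pool name):

zpool scrub tank
zpool status tank     # shows scrub progress and any repaired errors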

No defragmentation

There is no online defragmentation in ZFS, so try to keep zpools below 70% utilization instead.

Copy-on-write

On ZFS, changed data is written to a different location on the disk rather than overwriting the original, and only then is the metadata updated to point to the new location. This mechanism guarantees that the old data is safely preserved in case of a power loss or system crash that would otherwise result in data loss.

Snapshots

A snapshot contains the information needed to retain the original version of the file system. Snapshots initially do not require additional disk space within the pool. Once data captured in a snapshot is modified, the snapshot starts to consume disk space, because it still points to the old data.

Clones

A clone is a writable version of a snapshot. Overwriting blocks in the cloned volume or file system decrements the reference count on the previous blocks. The original snapshot that the clone depends on cannot be deleted while the clone exists.

Rollback

The rollback command reverts a dataset or a volume to a previous snapshot. Note that the rollback command can only revert to the most recent snapshot; to roll back to an earlier one, all intermediate snapshots must be destroyed, which can be done automatically.
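
A minimal sketch (dataset and snapshot names are placeholders):

zfs snapshot tank/data@before-change
# ... make some changes ...
zfs rollback tank/data@before-change
# Roll back past newer snapshots, destroying the intermediate ones
zfs rollback -r tank/data@older-snapshot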

Promote

The promote command is used to replace an existing dataset or volume with its clone, making the clone independent of its origin snapshot.
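
For example (names are placeholders), a clone created from a snapshot can be promoted so it no longer depends on its origin:

zfs clone tank/data@before-change tank/data-new
zfs promote tank/data-new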

References

ZFS Essentials – What is pooled storage?
ZFS Essentials – Copy-on-write & snapshots
ZFS Essentials – Data integrity & RAIDZ
RAID Recovery Guide

Disable Copy-On-Write on BTRFS

The issue with COW (copy-on-write) is fragmentation, because it always writes to new device blocks. This is fine for SSDs, but not good on traditional disks. Even on SSDs, if the block size is big, the amount of data written can be much larger than the actual updated data. Because of this issue, it is recommended to disable copy-on-write for database and VM image filesystems.

Methods

Mounting

Disable COW by mounting the filesystem with the nodatacow option (see the example after the list below).

The following facts should be considered:

  • This implies nodatasum as well
  • COW may still happen if a snapshot is taken
  • COW will still be maintained for existing files
  • COW status can be modified only for empty or newly created files.
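
A minimal sketch of mounting with nodatacow (the device and mount point are placeholders); the same option can be used in /etc/fstab:

mount -o nodatacow /dev/sdb1 /var/lib/mysql
# or in /etc/fstab:
# /dev/sdb1   /var/lib/mysql   btrfs   nodatacow,noatime   0 0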

File Attribute

For an empty file, add the NOCOW file attribute (use the chattr utility with +C):

touch file1
chattr +C file1

For a directory with the NOCOW attribute set, new files created in it will inherit this attribute.

chattr +C directory1

For existing files, copy the original data into a pre-created NOCOW file, delete the original and rename the new file back.

touch vm-image.raw
chattr +C vm-image.raw
fallocate -l10g vm-image.raw
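
Continuing the sketch, assuming the original image was first renamed to vm-image.raw.old (a hypothetical name), the data can then be copied into the NOCOW file and the old copy removed:

dd if=vm-image.raw.old of=vm-image.raw bs=1M conv=notrunc
rm vm-image.raw.old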

Subvolume (Untested)

A subvolume cannot have NOCOW set separately; this is the official answer.

However, newly created files inherit attributes from their directory. If a subvolume is mounted separately on a directory that has the NOCOW attribute, then files newly created in it will inherit the NOCOW attribute as well, regardless of the original volume.

Create directory

mkdir /var/lib/nocow
chattr +C /var/lib/nocow

Create subvolume

mount -o autodefrag,compress=lzo,noatime,space_cache /dev/mapper/zpool1 /mnt/zpool1
btrfs subvolume create /mnt/zpool1/nocow

Mount subvolume

/dev/mapper/zpool1     /var/lib/nocow  btrfs       rw,noatime,compress=lzo,space_cache,autodefrag,subvol=nocow  0 0

Drawback

No checksum, no integrity.

Nodatacow bypasses the very mechanisms that are meant to provide consistency in the filesystem. The CoW operations are achieved by constructing a completely new metadata tree containing both changes (the references to the data and the csum metadata), and then atomically changing the superblock to point to the new tree.

With nodatacow, the data and its checksum would have to be written to the physical medium as two separate writes. An I/O error could leave the data and the checksum mismatched, and file corruption could result.

References

BTRFS FAQ
Setting up a btrfs subvolume with noCOW

ZFS cache and log

There are two kinds of cache, read cache and write cache.

Read cache

The read caches are called the ARC and the L2ARC.

ARC (Adaptive Replacement Cache)

The ARC lives in memory, caching the information that is likely to be required in the near future while discarding the data that will be needed furthest ahead in time.

Its size can be tuned using kernel module parameters such as zfs_arc_max.
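
For example, on Linux the limit can be set via a module option or at runtime through sysfs; the 8 GiB value below is only an illustration:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592
# or at runtime
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max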

L2ARC (Level 2 ARC)

The L2ARC lives on a cache device and is an extension of the ARC. It can be created using the following command:

zpool add tank cache ada3

Note: tank is the pool name, ada3 is the block device used for caching

Write cache

The write cache is called the ZIL (ZFS Intent Log).

Asynchronous

By default, ZFS caches write data in memory before writing it to disk; this is called asynchronous mode.

Synchronous

Synchronous mode makes sure data is written to disk before continuing; it can be set using the following command:

zfs set sync=always mypool/dataset1

ZFS Intent Log (ZIL)

The ZIL is a temporary space where data is stored before being written to the main disks, which can speed up write operations. A write is considered complete once the data has been written to the ZIL device. A dedicated ZIL device is called a SLOG (Separate Intent Log) device and can be defined as follows:

zpool add tank log ada3

Note: tank is the pool name, ada3 is the block device used for slog

If a faulty SLOG device is a concern, it can be mirrored too.

zpool add tank log mirror ada3 ada4

References

Configuring ZFS Cache for High Speed IO
ZFS Performance with Databases (Cached)

Snapshot and Copy on Write

There are two types of snapshots, and they work in different ways.

Keep original data

The original data is never written to; all new data goes into a delta file.

VMware

All new data goes into a delta file; the original disk file is not changed. This is very useful, especially in a VDI environment, where all VDI servers are based on the same image and there is no impact on the original disk file.

When deleting a snapshot, the snapshot files are consolidated and written to the parent snapshot's disk. If the parent is the base disk, all the data from the delta disk is merged into the virtual machine's base disk.

QEMU / KVM: COW mode

COW mode is available for some virtual machine disk formats, such as QCOW2. When using COW mode, no changes are applied to the disk image; all changes are recorded in a separate file, preserving the original image. Several COW files can point to the same image in order to test several configurations simultaneously without jeopardizing the base system.

QEMU / KVM also allows incorporating the changes from a COW file back into the original image.
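
A minimal sketch with qemu-img (file names are placeholders): create an overlay backed by the original image, then merge the changes back later:

qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2
qemu-img commit overlay.qcow2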

Overwrite original disk

The latest data is written to the original disk, and the original data is moved to the delta disk.

RedHat LVM snapshot

The data in the snapshot volume is the original data.

ZFS and btrfs

Thanks to the native copy-on-write feature, the live reference structure always points to the new data, while the old data stays referenced by the old (snapshot) reference structure.

Compare

  • Keep original data
    Pros: supports multiple children without too much performance overhead.
    Cons: deleting a snapshot takes time; when the disk is full, no more writes can be done.
  • Overwrite original disk
    Pros: less overhead (only when writing data to a new location for the first time); deleting a snapshot is fast.
    Cons: reverting a snapshot takes time.
  • Native COW
    Pros: no overhead; dropping a snapshot is fast; reverting a snapshot is fast; no impact on service when a snapshot fills up (though the snapshot becomes corrupt).
    Cons: unable to delete files when the disk is full; the disk becomes fragmented easily; unable to cache disk write operations.

References

Deleting Snapshots
QEMU / KVM: Using the Copy-On-Write mode
Why would I want to disable Copy-On-Write while creating QEMU Images?

Error of txg_sync blocked for more than 120 seconds

The following error was appearing on my dmesg monitoring screen.

txg_sync blocked for more than 120 seconds --> excessive load

If I'm not wrong, it could be caused by slow hard disk speed: the TrueNAS ZFS cache is about 61 GB, which can take a long time to flush back to the hard disks.

Like other filesystems, ZFS has write-back caching (aka write-behind caching), which flushes data back to the hard disk at a specific interval. ZFS has synchronous and asynchronous modes, which are a bit different from read-only, write-through and write-back modes.

Beyond the above, ZFS behaves differently because of copy-on-write (COW), as below.

  • It always writes to a new block due to copy-on-write
  • Big files with random writes, such as VM disk files, can become fragmented
  • The number of write operations cannot be reduced even when the same block is written repeatedly

Therefore, copy-on-write should be disabled for VM images; but if so, the snapshot function could be lost.

Reference

Read-Through, Write-Through, Write-Behind Caching and Refresh-Ahead

Ubuntu grub-efi-amd64-signed error after do-release-upgrade

The following error occurred whenever apt upgrade was run after performing do-release-upgrade:

dpkg: error processing package grub-efi-amd64-signed (--configure):
installed grub-efi-amd64-signed package post-installation script subprocess returned error exit status 32

Solution

Reinstall all GRUB packages using the following commands:

sudo apt-get purge grub\*
sudo apt-get install grub-efi
sudo apt-get autoremove
sudo update-grub

Options to restrict commands to one filesystem

There are quite a number of tasks that you may want to restrict to a single filesystem. This is important during troubleshooting, especially for the root directory (/).

find

To restrict the find command to entries within one filesystem, use the -xdev option:

find /usr -xdev ...

du

To restrict the du command to calculating usage for one filesystem only, use the -x option:

du -cshx /

tar

To restrict the tar command to archiving files within one filesystem only, use the --one-file-system option:

tar --one-file-system -czvf /tmp/root.tgz /

Increase upload file size limit for WordPress and NGINX


There are various ways to do this, but the way that worked was updating .htaccess for WordPress and the NGINX configuration file.

Issue

First, I tried changing functions.php in the theme, but with no luck. Then I updated the .htaccess file, and it worked.

Then the client got the error “Request Entity Too Large” (413). This error is reported by NGINX.

WordPress

Add the following lines to the .htaccess file in the html directory:

php_value upload_max_filesize 64M
php_value post_max_size 64M
php_value max_execution_time 300
php_value max_input_time 300

Then the upload page in WordPress should show the new limit, as below:

Maximum upload file size: 64 MB.

Alternative

These are PHP options, which can also be applied in php.ini as below:

upload_max_filesize = 64M
post_max_size = 64M
max_execution_time = 300

NGINX

Add the following line to the http, server or location context in nginx.conf or conf.d/default.conf:

client_max_body_size 64M;

Then reload the NGINX configuration.

# /usr/local/nginx/sbin/nginx -s reload

This will fix the client error “Request Entity Too Large” (413).

Remove Ubuntu ZFS snapshots

There are a lot of snapshots when using ZFS on Ubuntu.

Issue

When trying to do a release upgrade, the following error occurred:

# do-release-upgrade
...
...
Not enough free disk space 

The upgrade has aborted. The upgrade needs a total of 256 M free 
space on disk '/boot'. Please free at least an additional 91.4 M of 
disk space on '/boot'. You can remove old kernels using 'sudo apt 
autoremove' and you could also set COMPRESS=xz in 
/etc/initramfs-tools/initramfs.conf to reduce the size of your 
initramfs. 
...

This error message had occurred many times before, but those systems had a very small /boot partition or kept many old kernels. In the first case, a complete repartitioning and moving the root filesystem are required.

Space on /boot

Examining the disk space for bpool, I found that ZFS reported 675 MB used in bpool, but the actual usage is only 242 MB.

root@ubuntu:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bpool   960M   675M   285M        -         -    30%    70%  1.00x    ONLINE  -
rpool  17.5G  7.99G  9.51G        -         -    21%    45%  1.00x    ONLINE  -
root@ubuntu:~# zfs list bpool
NAME    USED  AVAIL     REFER  MOUNTPOINT
bpool   675M   157M       96K  /boot
root@ubuntu:~# du -cshx /boot
242M    /boot
242M    total
root@ubuntu:~# 

Then I found many snapshots in both the boot pool and the data pool.

root@ubuntu:~# zfs list -t snapshot | head
NAME                                                               USED  AVAIL     REFER  MOUNTPOINT
bpool/BOOT/ubuntu_e8m8h0@autozsys_ywm1ok                             0B      -      238M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_ms74md                             0B      -      238M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_ugu9z7                            80K      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_r3xqau                            72K      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_nkagbh                             0B      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_xdbwsy                             0B      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_zrt7vi                            72K      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_jbmnwk                            72K      -      242M  -
bpool/BOOT/ubuntu_e8m8h0@autozsys_0e5p2e                            64K      -      242M  -
root@ubuntu:~# 
root@ubuntu:~# zfs list -t snapshot | wc
    301    1505   27701

Too many! Not sure how many snapshots Ubuntu likes to create.

Removing snapshots

List all snapshots for /boot

root@ubuntu:~# df /boot
Filesystem               1K-blocks   Used Available Use% Mounted on
bpool/BOOT/ubuntu_e8m8h0    408192 247808    160384  61% /boot
root@ubuntu:~# zfs list -H -o name -t snapshot bpool/BOOT/ubuntu_e8m8h0
bpool/BOOT/ubuntu_e8m8h0@autozsys_ywm1ok
bpool/BOOT/ubuntu_e8m8h0@autozsys_ms74md
bpool/BOOT/ubuntu_e8m8h0@autozsys_ugu9z7
bpool/BOOT/ubuntu_e8m8h0@autozsys_r3xqau
bpool/BOOT/ubuntu_e8m8h0@autozsys_nkagbh
bpool/BOOT/ubuntu_e8m8h0@autozsys_xdbwsy
bpool/BOOT/ubuntu_e8m8h0@autozsys_zrt7vi
bpool/BOOT/ubuntu_e8m8h0@autozsys_jbmnwk
bpool/BOOT/ubuntu_e8m8h0@autozsys_0e5p2e
bpool/BOOT/ubuntu_e8m8h0@autozsys_b17dwn
bpool/BOOT/ubuntu_e8m8h0@autozsys_uad1rb
bpool/BOOT/ubuntu_e8m8h0@autozsys_mxhvc9
bpool/BOOT/ubuntu_e8m8h0@autozsys_9athz8
bpool/BOOT/ubuntu_e8m8h0@autozsys_61umv1
bpool/BOOT/ubuntu_e8m8h0@autozsys_1q65cz
root@ubuntu:~# 

Then remove them

zfs list -H -o name -t snapshot bpool/BOOT/ubuntu_e8m8h0 | xargs -n 1 zfs destroy

Now it is OK to upgrade.

root@ubuntu:~# zfs list -o space bpool
NAME   AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
bpool   589M   243M        0B     96K             0B       243M
root@ubuntu:~# 

Firewalld Basic

Concept

Some basic firewalld concepts needed to understand the commands:

  • NIC
    Different NICs can have different zones assigned using the nmcli command; if not specified, the default zone is used.
  • Zone
    By default, an interface belongs to the zone configured as the default zone; the default zone can be changed temporarily with the firewall-cmd command.
    To permanently assign an interface to a zone other than the default, the nmcli command is required.
  • Service
  • Port

Start/Stop

# systemctl start firewalld
# systemctl enable firewalld

Default zone

The default zone, public, is used when the --zone option is not specified on the command line.

Display the default zone

# firewall-cmd --get-default-zone
public

Display current settings

# firewall-cmd --list-all
public (default, active)
  interfaces: eno16777736
  sources:
  services: dhcpv6-client ssh
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:

Display all zones defined by default

# firewall-cmd --list-all-zones
block
  interfaces:
  sources:
  services:
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:
  .....
  .....

Display allowed services on a specific zone

# firewall-cmd --list-service --zone=external
ssh

Change default zone

# firewall-cmd --set-default-zone=external
success

Change zone for an interface

Note: the change is not permanent with "change-interface", even if the "--permanent" option is added.

# firewall-cmd --change-interface=eth1 --zone=external
success
# firewall-cmd --list-all --zone=external
external (active)
  interfaces: eth1
  sources:
  services: ssh
  ports:
  masquerade: yes
  forward-ports:
  icmp-blocks:
  rich rules:

To change it permanently, use nmcli as follows:

# nmcli c mod eth1 connection.zone external
# firewall-cmd --get-active-zone
external
  interfaces: eth1
public
  interfaces: eth0

Services

Display services

# firewall-cmd --get-services
amanda-client bacula bacula-client dhcp dhcpv6 dhcpv6-client dns ftp high-availability http https imaps ipp ipp-client ipsec kerberos kpasswd ldap ldaps libvirt libvirt-tls mdns mountd ms-wbt mysql nfs ntp openvpn pmcd pmproxy pmwebapi pmwebapis pop3s postgresql proxy-dhcp radius rpc-bind samba samba-client smtp ssh telnet tftp tftp-client transmission-client vnc-server wbem-https

Service definition files are XML files in /usr/lib/firewalld/services

# ls /usr/lib/firewalld/services
amanda-client.xml      ipp-client.xml   mysql.xml       rpc-bind.xml
bacula-client.xml      ipp.xml          nfs.xml         samba-client.xml
bacula.xml             ipsec.xml        ntp.xml         samba.xml
dhcpv6-client.xml      kerberos.xml     openvpn.xml     smtp.xml
dhcpv6.xml             kpasswd.xml      pmcd.xml        ssh.xml
dhcp.xml               ldaps.xml        pmproxy.xml     telnet.xml
dns.xml                ldap.xml         pmwebapis.xml   tftp-client.xml
ftp.xml                libvirt-tls.xml  pmwebapi.xml    tftp.xml
high-availability.xml  libvirt.xml      pop3s.xml       transmission-client.xml
https.xml              mdns.xml         postgresql.xml  vnc-server.xml
http.xml               mountd.xml       proxy-dhcp.xml  wbem-https.xml
imaps.xml              ms-wbt.xml       radius.xml

Add or remove services temporarily.

# firewall-cmd --add-service=http
success
# firewall-cmd --list-service
dhcpv6-client http ssh
...
...
# firewall-cmd --remove-service=http
success
# firewall-cmd --list-service
dhcpv6-client ssh

Add or remove services permanently

Note: reloading firewalld is required to enable the change.

# firewall-cmd --add-service=http --permanent
success
# firewall-cmd --reload
success
# firewall-cmd --list-service
dhcpv6-client http ssh

Ports

Add or remove ports temporarily.

# firewall-cmd --add-port=465/tcp
success
# firewall-cmd --list-port
465/tcp
# firewall-cmd --remove-port=465/tcp
success
# firewall-cmd --list-port

Add or remove ports permanently

# firewall-cmd --add-port=465/tcp --permanent
success
# firewall-cmd --reload
success
# firewall-cmd --list-port
465/tcp

ICMP

Add or remove ICMP types.

# firewall-cmd --add-icmp-block=echo-request
success
# firewall-cmd --list-icmp-blocks
echo-request
# firewall-cmd --remove-icmp-block=echo-request
success
# firewall-cmd --list-icmp-blocks

Display ICMP types

# firewall-cmd --get-icmptypes
destination-unreachable echo-reply echo-request parameter-problem redirect
router-advertisement router-solicitation source-quench time-exceeded 

References

Firewalld : Basic Operation