Moving to ZFS from Btrfs
Rule #4 states that failure will happen, so you should plan for that eventual reality. The Linux workstations I build and use (if I have any say about it) have at least two hard drives in some mirrored or otherwise redundant configuration. My current pattern is to build workstations with a small (120G or thereabouts) SSD as the boot drive containing the OS install and swap space. /home, scratch, and possibly other areas are mounted from a two-disk mirrored array of spinning rust. I've spent years on Linux's software RAID (md) setups, and the last 5+ years on Btrfs.
My workstation at home had a 7-year-old disk paired with a 4+-year-old drive; with Btrfs, this was my primary storage filesystem. Good life expectancy for spinning rust is about 3 years, so I picked up a couple of new Seagate BarraCuda drives to replace the existing drives before reality had a chance to ruin my day.
It was during this upgrade that I knew I wanted to switch to ZFS. I've been running ZFS on other workstations and on a large-capacity Graphite cluster with a lot of success, as well as just enough failures to know how ZFS handles losing a drive. With Red Hat dropping Btrfs support from RHEL, seemingly no one (other than SuSE) using it in production, Docker adopting ZFS as a storage backend, and the lack of progress on major Btrfs features (encryption, RAID 5/6), I've become more convinced that ZFS is the correct solution. In a way I'm saddened, because I believe the way to make Open Source solutions better is to, well, use them. But Btrfs, while still under very active development, just doesn't seem to be moving in the direction I need.
My upgrade plan was this:
1. Replace one drive in the mirrored pair with a new HDD.
2. Boot, create a zpool on the new drive, and create ZFS filesystems for my users and scratch space.
3. Use rsync to copy data from the Btrfs filesystem to the ZFS filesystems.
4. Replace the last drive with a new HDD.
5. Attach the last new drive to the zpool and convert it to a mirror.
No plan survives contact with the enemy, and Step #2 is where things went awry. My Ubuntu Xenial machine dumped me into emergency mode (without even asking for a password!) because it could not mount /home or /srv from the Btrfs array. I removed these entries from /etc/fstab and was able to boot into graphical mode as normal, albeit without my home directory.
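For context, Btrfs entries in /etc/fstab generally look something like the following; the UUID below is a placeholder, and my real entries may have carried extra options (subvolumes, noatime, and so on):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults  0  0
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /srv   btrfs  defaults  0  0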
A plain mount /dev/sdc /mnt failed with no useful errors. However, dmesg reported the following:
BTRFS: failed to read the system array on sdc
BTRFS: open_ctree failed
However, btrfs filesystem show did appear to see a filesystem available on sdc. So I Googled. I discovered that there's a trove of mount options for Btrfs that don't seem to be documented well, or at least not all in one place. At the suggestion of some old forum posts I tried:
# mount -odegraded /dev/sdc /mnt
This worked. (Although I should have mounted it read-only.) This is also a prime example of why I'm moving away from Btrfs. How is this behavior an acceptable way to tell the administrator that Btrfs is refusing to mount a degraded array? ZFS would have told me exactly that, and its messaging would have explained how to override that safety feature.
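For reference, the safer read-only variant (degraded and ro are both standard mount options for Btrfs) would have been:
# mount -o degraded,ro /dev/sdc /mnt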
With that solved, I moved on to making a fresh zpool on sdb, my new drive.
# zpool create -f ordinary \
/dev/disk/by-id/ata-ST2000DM006-2DM164_XXXXXXXX
I usually name the first zpool after the name of the machine rather than after Matrix references. I also always use the /dev/disk/by-id/ata- name of the device. These names don't reshuffle if the number or order of drives changes, and the ID includes the drive's serial number, so I can match it up with the physical label on the drive if I have to.
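If you need to find these names, listing the by-id directory shows how each stable ID maps to the kernel's sdX device; the output line below is illustrative rather than copied from my machine:
# ls -l /dev/disk/by-id/ | grep ata-
lrwxrwxrwx 1 root root 9 Jul  7 00:00 ata-ST2000DM006-2DM164_XXXXXXXX -> ../../sdb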
Next, create filesystems:
# zfs create ordinary/slack
# zfs create ordinary/slack/Music
# zfs create ordinary/scratch
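By default each dataset mounts under the pool name (for example, ordinary/slack lands at /ordinary/slack), and a quick listing confirms what was created and where it mounted:
# zfs list -r ordinary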
Next, I used rsync to copy the data from /mnt to my ZFS filesystems. This was slower than I expected, but a few hours later I had my data safely in ZFS-land.
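The copy was along these lines; I'm reconstructing the invocation here, and the source paths assume the old data lived in matching directories under /mnt. The -aHAX flags preserve permissions, hard links, ACLs, and extended attributes, and the datasets were still at their default mountpoints under /ordinary at this point:
# rsync -aHAX --info=progress2 /mnt/slack/ /ordinary/slack/
# rsync -aHAX --info=progress2 /mnt/scratch/ /ordinary/scratch/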
I shut down the machine and replaced sdc with the second new HDD. I rebooted, and ZFS was mounted with everything in place; the new drive wasn't otherwise interfering with the system.
Next, attach the second disk to the zpool:
# zpool status
pool: ordinary
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
ordinary ONLINE 0 0 0
ata-ST2000DM006-2DM164_XXXXXXXX ONLINE 0 0 0
errors: No known data errors
# zpool attach ordinary \
ata-ST2000DM006-2DM164_XXXXXXXX \
/dev/disk/by-id/ata-ST2000DM006-2DM164_YYYYYYYY
This took a couple of tries to get the incantation correct, but ZFS's error messages were surprisingly helpful. This command attaches the new device YYYYYYYY to the existing device XXXXXXXX in the zpool ordinary. Resilvering began immediately.
# zpool status
pool: ordinary
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jul 7 00:11:06 2018
792M scanned out of 355G at 52.8M/s, 1h54m to go
788M resilvered, 0.22% done
config:
NAME STATE READ WRITE CKSUM
ordinary ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-ST2000DM006-2DM164_XXXXXXXX ONLINE 0 0 0
ata-ST2000DM006-2DM164_YYYYYYYY ONLINE 0 0 0 (resilvering)
errors: No known data errors
In 1 hour and 2 minutes (faster than the rsync), the zpool had completed resilvering. Just to be safe, I initiated a scrub operation.
# zpool scrub ordinary
...
# zpool status
pool: ordinary
state: ONLINE
scan: scrub repaired 0 in 0h46m with 0 errors on Sun Jul 8 01:10:56 2018
config:
NAME STATE READ WRITE CKSUM
ordinary ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-ST2000DM006-2DM164_XXXXXXXX ONLINE 0 0 0
ata-ST2000DM006-2DM164_YYYYYYYY ONLINE 0 0 0
errors: No known data errors
Finally, I set the mountpoint so my home directory would be mounted where I wanted it.
# zfs set mountpoint=/home/slack ordinary/slack
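The mountpoint is stored as a property on the dataset, so it can be double-checked across the whole pool with:
# zfs get -r mountpoint ordinary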
Rebooting, I had a functional system and, most importantly, my home directory back! Next, I cloned zfs-auto-snapshot and made sure I had daily snapshots enabled. Hourly and daily snapshots of your home directory are really the best feature ever. Snapshots should have worked well on Btrfs too, but I was a little nervous to try them there. To complete the exercise, I ran my Borg backups by hand and confirmed they still worked.
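Once zfs-auto-snapshot has been running for a while, the snapshots it creates show up as regular ZFS snapshots and can be listed per dataset; for example, for my home directory:
# zfs list -t snapshot -r ordinary/slack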