Moving to ZFS from Btrfs
Rule #4 states that failure will happen, so you should plan for that eventual reality. The Linux workstations I build and use (if I have any say about it) have at least two hard drives in some mirrored or otherwise redundant fashion. My current pattern is to build workstations with a small (120 GB or thereabouts) SSD as the boot drive that holds my OS install and swap space. /home, scratch, and possibly other areas are mounted from a two-disk mirrored array of spinning rust. I've spent years on Linux's software RAID (md) setups, and the last 5+ years on Btrfs.
My workstation at home had a 7-year-old disk paired with a 4+-year-old drive, mirrored with Btrfs as my primary storage filesystem. Good life expectancy for spinning rust is about 3 years, so I picked up a couple of new Seagate BarraCuda drives to replace the existing ones before reality had a chance to ruin my day.
It was during this upgrade that I knew I wanted to switch to ZFS. I've been running ZFS on other workstations and with a large-capacity Graphite cluster, with a lot of success, and just enough failures to know how ZFS handles losing a drive. With Red Hat dropping Btrfs support from RHEL, seemingly no one other than SuSE using it in production, Docker adopting ZFS as a storage backend, and the lack of progress on major Btrfs features (encryption, RAID 5/6), I've become more convinced that ZFS is the correct solution. In a way I'm saddened, because I believe the way to make Open Source solutions better is to, well, use them. But Btrfs, while still under very active development, just doesn't seem to be moving in the direction I need.
My upgrade plan was this:
- Replace one drive in the mirrored pair with a new HDD.
- Boot, create a zpool on the new drive, create ZFS filesystems for my users and scratch space.
- Use rsync to copy data from the Btrfs filesystem to the ZFS filesystems.
- Replace the last drive with a new HDD.
- Attach the last new drive to the zpool and convert it to a mirror.
No plan survives contact with the enemy, and Step #2 is where things went
awry. My Ubuntu Xenial machine dumped me into emergency mode (without a
password even!) as it could not mount /home or /srv from the Btrfs array.
I removed these from /etc/fstab and was able to boot into graphical mode
like normal, albeit without my home directory. Running mount /dev/sdc /mnt failed with no useful errors. However, dmesg reported the following:
BTRFS: failed to read the system array on sdc
BTRFS: open_ctree failed
However, btrfs filesystem show did appear to see a filesystem available on sdc. So I Googled. I discovered that there's a trove of mount options for Btrfs that don't seem to be well documented, or at least not documented in one place. At the suggestion of some old forum posts I tried:
# mount -odegraded /dev/sdc /mnt
This worked. (Although I should have mounted read-only.) This is also a prime example of why I'm moving away from Btrfs. How is this behavior an acceptable way to communicate to the administrator that Btrfs is refusing to mount a degraded array? ZFS would have told me exactly that, and how to override the safety feature, right in its error message.
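For the record, the read-only variant I should have used just adds ro to the option list; a minimal sketch:
# mount -o degraded,ro /dev/sdc /mnt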
With that solved, I moved on to making a fresh zpool on sdb, my new drive.
# zpool create -f ordinary \
/dev/disk/by-id/ata-ST2000DM006-2DM164_XXXXXXXX
I usually name the first zpool after the machine itself rather than going with Matrix references. I also always use the /dev/disk/by-id/ata- name of the device. These names don't reshuffle if the number or order of drives changes, and they include the drive's serial number, so I can match each device up with the physical label on the drive if I ever have to.
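If you haven't poked around in /dev/disk/by-id before, a plain ls shows how the ata- names map back to the kernel's sdX names (your listing will obviously differ):
# ls -l /dev/disk/by-id/ | grep ata-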
Next, create filesystems:
# zfs create ordinary/slack
# zfs create ordinary/slack/Music
# zfs create ordinary/scratch
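A quick sanity check (not in my original notes) confirms the new datasets exist and shows their default mountpoints under /ordinary, which is where they live until I set explicit mountpoints later:
# zfs list -r ordinary
# zfs get -r mountpoint ordinary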
Next, I used rsync to copy the data from /mnt to my ZFS filesystems. This was slower than I expected, but a few hours later I had my data safely in ZFS-land (roughly the invocation sketched below). I then shut down the machine, replaced sdc with the second new HDD, and rebooted. ZFS mounted with everything in place, and the new drive wasn't otherwise interfering with the system.
Next, attach the second disk to the zpool:
# zpool status
  pool: ordinary
 state: ONLINE
  scan: none requested
config:

        NAME                               STATE     READ WRITE CKSUM
        ordinary                           ONLINE       0     0     0
          ata-ST2000DM006-2DM164_XXXXXXXX  ONLINE       0     0     0

errors: No known data errors
# zpool attach ordinary \
ata-ST2000DM006-2DM164_XXXXXXXX \
/dev/disk/by-id/ata-ST2000DM006-2DM164_YYYYYYYY
This took a couple of tries to get the incantation correct, but ZFS's error messages were surprisingly helpful. The command attaches the new device YYYYYYYY to the existing device XXXXXXXX in the zpool ordinary, turning the single disk into a mirror, and resilvering began immediately.
# zpool status
  pool: ordinary
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 7 00:11:06 2018
        792M scanned out of 355G at 52.8M/s, 1h54m to go
        788M resilvered, 0.22% done
config:

        NAME                                 STATE     READ WRITE CKSUM
        ordinary                             ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST2000DM006-2DM164_XXXXXXXX  ONLINE       0     0     0
            ata-ST2000DM006-2DM164_YYYYYYYY  ONLINE       0     0     0  (resilvering)

errors: No known data errors
In 1 hour and 2 minutes (faster than the rsync) the zpool had completed resilvering. Just to be safe I initiated a scrub operation.
# zpool scrub ordinary
...
# zpool status
  pool: ordinary
 state: ONLINE
  scan: scrub repaired 0 in 0h46m with 0 errors on Sun Jul 8 01:10:56 2018
config:

        NAME                                 STATE     READ WRITE CKSUM
        ordinary                             ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST2000DM006-2DM164_XXXXXXXX  ONLINE       0     0     0
            ata-ST2000DM006-2DM164_YYYYYYYY  ONLINE       0     0     0

errors: No known data errors
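A scrub is also worth repeating on a schedule. Something like this cron.d entry would handle it; the timing and path are my choice for illustration, not part of this upgrade:
# /etc/cron.d/zfs-scrub -- scrub the pool at 02:00 on the first of each month
0 2 1 * * root /sbin/zpool scrub ordinary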
Finally, I set the mountpoints so everything landed where I wanted it, most importantly my home directory:
# zfs set mountpoint=/home/slack ordinary/slack
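Because child datasets inherit their parent's mountpoint, ordinary/slack/Music automatically ends up at /home/slack/Music. The scratch dataset gets its own mountpoint the same way; the path below is illustrative rather than the exact one I used:
# zfs set mountpoint=/srv/scratch ordinary/scratch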
Rebooting, I had a functional system and, most importantly, my home directory back! Next, I cloned zfs-auto-snapshot and made sure I had daily snapshots enabled. Hourly and daily snapshots of your home directory are really the best feature ever. They should have worked just as well on Btrfs, but I was always a little nervous to try them there. To complete the exercise, I ran my Borg backups by hand and confirmed they still worked.
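With zfs-auto-snapshot in place, listing what you have, or taking a one-off snapshot before doing something risky, is a one-liner each; the snapshot name here is just an example:
# zfs list -t snapshot -r ordinary/slack
# zfs snapshot ordinary/slack@before-something-risky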