CubeFS: A crash course
A new distributed filesystem caught my interest recently - CubeFS. If you know anything about me, you'll know I'm a massive distributed storage nerd, with over 2 petabytes of storage in my lab (RiffLabs Perth) bound together at various times by distributed filesystems such as MooseFS (Pro), SeaweedFS and Ceph.
I've been using and enjoying MooseFS Pro for quite a while now, but I have a capacity-limited licence and over 10x more capacity than the licence covers, which limits how much of my storage I can actually use.
What I want out of a distributed filesystem boils down to a few major factors:
- Storage tiering
- Reliable high availability
- Data reliability and resilience
- Hands off self healing
- Performance
- (Bonus) Erasure coding
- (Bonus) Georeplication
Wait, what's a distributed filesystem?
Good question! A distributed filesystem allows you to combine the storage of one or many machines (usually many, given the word "distributed!") into one filesystem that can be mounted from one or many machines. Usually they expose a POSIX filesystem (traditional storage mount, a la NFS, in other words), but they can also come in the form of a block storage device (such as Ceph RBD) or object storage (S3), or HDFS, or... yeah, there's a lot of variety.
Think of it like taking a bunch of NASes and combining them into one big NAS and having them work together, visible as one unified storage drive.
Enter CubeFS
CubeFS is a fairly new distributed filesystem that is currently a CNCF incubating project, and it supports exabyte-scale installations (OPPO has one!).
Incredibly, CubeFS ticks all of these boxes except the first one - in the fully free and open source version.* (CubeFS is licensed under the Apache-2.0 software licence.)
(Currently erasure coding relies on having a Kafka installation - thus, it won't be covered in today's guide, as I don't have one yet - but it works, it works at scale and it even allows for fine customisation of erasure coding schemes. Very cool!)
As a bonus, it supports:
- Native S3 gateway for object storage
- Kubernetes persistent storage
- Volume creation and management
- Scalable metadata storage that allows for horizontal scaling of metadata without creating very large "master" nodes
What is it missing?
At the moment the main thing I want that CubeFS doesn't have is storage tiering - I want to be able to have SSDs, HDDs and NVMe storage in one cluster, and have data intelligently (or manually) placed onto the various tiers.
I don't necessarily need it to happen automatically - something simpler like MooseFS' storage classes would be fine.
Luckily, the CubeFS team has confirmed that they are working on storage tiering, and it might even be a full implementation with proper multi-level caching. This is huge. 😄
The catch
Great, so CubeFS looks really promising, right?
There's just a small catch - there's basically zero security out of the box. It assumes you're running on a trusted network. It's not that they haven't thought of this issue - they have! It's just that the authentication stuff hasn't made it all the way into a release yet, despite being stable and ready for usage. This means if you deploy a CubeFS filesystem normally, following all the directions on the CubeFS homepage, anyone with network access to the master/resource manager nodes can do anything they want to your filesystem, including deleting volumes, reading user keys and various other mischief.
Fixing this is luckily fairly easy. Unfortunately, the documentation for authentication is lacking, with only hints from the old ChubaoFS (the former name for CubeFS) documentation to help you. It took me weeks of pain to figure this one out - luckily, once you know how to do it, it's easy; thus my motivation for writing this guide.
Eventually I plan to turn this entire guide into a fully automated Ansible playbook that will turn the entire process into a 5-10 minute affair. Stay tuned for that.
Integrating authentication and encryption
Simply put: git clone CubeFS, switch to the release-3.3.1 branch (the latest release at the time of writing), run git cherry-pick 7f43eaa513a579fad99aaade86bc419e75d89981, and fix the conflicts.
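In command form, that looks roughly like this (the repository URL is the upstream CubeFS project on GitHub; resolve the conflicts by hand if the cherry-pick stops):

```bash
git clone https://github.com/cubefs/cubefs.git
cd cubefs
git checkout release-3.3.1
git cherry-pick 7f43eaa513a579fad99aaade86bc419e75d89981
# If there are conflicts: fix them, git add the affected files, then:
git cherry-pick --continue
```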
Then build CubeFS as normal, and deploy it to your nodes.
...Too complicated? 😄 Alright, well, fair enough. I did it for you!
This is a very light set of changes, simply integrating the relevant commit. I have made zero other changes, but have also included the steps I took (above) so you can verify this for yourself, which you should do if you plan to use this software anywhere important. I plan to keep my branch up-to-date. I will not be publishing binaries, as I don't believe the supply chain risks are worth it; if I published infected binaries unwittingly, it would be Pretty Bad.
Don't worry, we'll go over everything you need to use the modified CubeFS software and any caveats you'll want to know about. This is a fairly long guide, but it is worth it - you will end up with a fully resilient, self healing, distributed and clustered filesystem with high availability and all kinds of other fun goodies.
Getting started
You'll want to get started with either a single machine per role, or 3 machines per role. For high availability, I recommend using 3x master nodes and 3x authnodes, and then as many metadata and datanode nodes as you want to use.
In my cluster, I will be deploying to the following machines:
You can, optionally, spin up a node specifically for building the CubeFS software, to avoid having to install an entire build toolchain on one (or more) of your nodes. I'll be skipping this, opting instead to use cubefs-authnode01.per.riff.cc as my buildbox, but in a proper production environment (especially an airgapped one!) this may be more appropriate.
Create your machines
In my case, I'll be using Tinkerbell and tink-flow to deploy my CubeFS nodes, except for the data and metadata nodes which already exist (and are running Proxmox).
I'll be using Debian 12, my standard distribution of choice for a lightweight and stable base, but really any standard-ish Linux distribution will work, from Ubuntu to Rocky Linux and maybe even Alpine Linux.
We won't cover provisioning in this guide, but essentially the process looked like:
- Create VMs for each of the nodes
- Collect MAC addresses from each
- Create a tink-flow manifest with the machines in it
- Let Tinkerbell deploy the machines with Debian 12
Installing Debian by hand is also a valid option.
You'll want to enable passwordless sudo on your CubeFS machines, at least during setup.
You should also create DNS entries for each of your machines. I'm using the hostnames, then .per.riff.cc (to indicate the Perth RiffLab).
Building and installing CubeFS
We'll be building CubeFS on our first authnode, cubefs-authnode01.
Log onto the node, then install the build dependencies.
First, required packages:
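(This package list is my best guess at what a CubeFS build needs on Debian 12 - add anything make later complains about.)

```bash
sudo apt update
sudo apt install -y git make gcc g++ cmake zlib1g-dev libbz2-dev
```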
Then install Go 1.22 (or whatever the latest stable version of Go is at the time of reading).
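One way to do that (check https://go.dev/dl/ for the current version; the profile script matches the path referenced in the next step):

```bash
GO_VERSION=1.22.0
curl -fsSLO "https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz"
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf "go${GO_VERSION}.linux-amd64.tar.gz"
# Put the Go toolchain on everyone's PATH.
echo 'export PATH=$PATH:/usr/local/go/bin' | sudo tee /etc/profile.d/golang.sh
```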
Log out and in to ensure the Go toolchain is on your path (or source /etc/profile.d/golang.sh).
Check out my repository, check out the "Zorlin/authnode" branch, then run make to build CubeFS.
Now, throw together a quick tarball with your binaries to make it easier to transfer it to your nodes.
While you're there, grab the SHA256 hash of the tarball so you can verify its integrity later if needed.
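Something like this does the trick (the tarball name matches the one used later in this guide, and I'm assuming you cloned into ~/cubefs - it packs the build/bin directory so the extract-and-copy one-liner further down works unchanged):

```bash
cd ~
tar czvf cubefs-3.1.1-withauth.tar.gz cubefs/build/bin/
sha256sum cubefs-3.1.1-withauth.tar.gz
```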
Exit back to your local machine, then log back into your authnode using the -A flag when running ssh (for example ssh riff@cubefs-authnode01.per.riff.cc -A). This will allow you to use your local machine's SSH keys on the remote (authnode01) and thus let you copy your tarball to all of your machines.
Now we'll do a similar trick to allow us to install the CubeFS binaries across all the hosts (Note: If you prefer to use Ansible or something more elegant for these steps, please do! This is just a basic example).
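A rough sketch of what that loop might look like - the host list here is made up, so substitute your own machines (and add your data/metadata hosts too):

```bash
HOSTS="cubefs-master01 cubefs-master02 cubefs-master03 cubefs-authnode02 cubefs-authnode03"
for host in $HOSTS; do
  scp cubefs-3.1.1-withauth.tar.gz "riff@${host}.per.riff.cc:"
  ssh "riff@${host}.per.riff.cc" \
    'tar xvf cubefs-3.1.1-withauth.tar.gz && sudo cp cubefs/build/bin/cfs-* /usr/local/bin/ && rm -r cubefs/'
done
# Don't forget to install the binaries locally on the buildbox as well.
sudo cp cubefs/build/bin/cfs-* /usr/local/bin/
```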
Finally, we'll use the same sort of thing to create a CubeFS user and group, and create the various directories we'll want CubeFS to use during its operation.
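Something along these lines on every node (the directory names match the ones used throughout the rest of the guide):

```bash
sudo groupadd --system cubefs
sudo useradd --system -g cubefs -d /var/lib/cubefs -s /usr/sbin/nologin cubefs
sudo mkdir -p /var/lib/cubefs /var/log/cubefs /etc/cubefs
sudo chown -R cubefs:cubefs /var/lib/cubefs /var/log/cubefs /etc/cubefs
```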
With that done, we're ready to start properly setting up the CubeFS cluster.
Setting up the cluster keys
Setting up our cluster is now relatively easy, but will take a few steps. We will set up the authnode cluster first, then the master cluster, then the datanodes and metadata nodes (which may be colocated on the same boxes).
Creating the initial keys (PKI) for CubeFS
Before we can get started, we need to generate the root keys for our cluster. The following instructions are based on the ChubaoFS documentation and LOTS of trial and error, as well as the current CubeFS documentation.
Become root with sudo su - and navigate to /etc/cubefs.
Run cfs-authtool authkey and you should see two files generated, authroot.json and authservice.json.
From here, you'll want to create a new file called authnode.json and fill it out. Replace "ip" with your node's IP address, "id" with a unique ID (we suggest the last octet of your node's IP address, as it is entirely arbitrary but must be unique), and "peers" with a list of your authnodes' IP addresses. The format for "peers" is "$id:$ipv4:8443" for each host, with commas separating each host.
Change "clusterName" to whatever you like, then set "authServiceKey" to the "auth_key" value from authservice.json and "authRootKey" to the "auth_key" value from authroot.json. These keys are VERY sensitive and are fairly painful to change, so treat them appropriately.
Here is a slightly redacted version of my authnode.json file.
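Roughly, it looks like this - the field names follow the upstream authnode configuration, but treat it as a sketch: the second and third authnode IPs and the wal/store paths are placeholders for your environment, and the two key values come from the files you just generated:

```json
{
  "role": "authnode",
  "ip": "10.0.20.91",
  "port": "8443",
  "id": "91",
  "peers": "91:10.0.20.91:8443,92:10.0.20.92:8443,93:10.0.20.93:8443",
  "clusterName": "rifflabs-per",
  "logDir": "/var/log/cubefs/authnode",
  "logLevel": "info",
  "retainLogs": "100",
  "walDir": "/var/lib/cubefs/authnode/raft",
  "storeDir": "/var/lib/cubefs/authnode/store",
  "enableHTTPS": true,
  "authServiceKey": "REDACTED (auth_key from authservice.json)",
  "authRootKey": "REDACTED (auth_key from authroot.json)"
}
```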
Now we'll generate a self-signed TLS certificate, valid for each of the authnodes' IP addresses. Adjust the subject and subjectAltName according to your environment.
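For example (the second and third IPs are placeholders for your other authnodes):

```bash
cd /etc/cubefs
# Self-signed cert/key pair covering all authnode IPs, valid for 10 years.
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -keyout server.key -out server.crt \
  -subj "/CN=cubefs-authnode" \
  -addext "subjectAltName=IP:10.0.20.91,IP:10.0.20.92,IP:10.0.20.93"
```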
Check that the certificate material was generated correctly, then create the folder /app and symlink the material into /app, and finally change /app to be owned by the CubeFS user.
# Check for server.crt and server.key
root@cubefs-authnode01:/etc/cubefs# ls server.*
server.crt server.key
# Create symlinks in /app to the certificates.
root@cubefs-authnode01:/etc/cubefs# mkdir /app ; ln -s /etc/cubefs/server.crt /app/server.crt ; ln -s /etc/cubefs/server.key /app/server.key ; chown -R cubefs:cubefs /app
We'll now install the systemd unit for cubefs-authnode, allowing us to start and run it as a service.
With your favourite editor, open up a new file at /etc/systemd/system/cubefs-authnode.service and fill it with the following contents:
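A minimal sketch, assuming the authnode role is served by the unified cfs-server binary (as the other roles are) and the binary/config paths used earlier in this guide:

```ini
[Unit]
Description=CubeFS AuthNode
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=cubefs
Group=cubefs
ExecStart=/usr/local/bin/cfs-server -c /etc/cubefs/authnode.json
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```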
Then start the service.
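With the unit in place:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now cubefs-authnode
```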
Congratulations, you've set up your authnode. If you only have one, you can skip to the next heading; otherwise, read on.
If you're using a cluster, you'll need to set up your remaining authnodes. To do so, simply copy /etc/cubefs/authnode.json, /etc/systemd/system/cubefs-authnode.service, /etc/cubefs/server.crt, and /etc/cubefs/server.key to your remaining nodes, and also recreate the symlinks in /app:
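(Same commands as on the first node.)

```bash
sudo mkdir /app
sudo ln -s /etc/cubefs/server.crt /app/server.crt
sudo ln -s /etc/cubefs/server.key /app/server.key
sudo chown -R cubefs:cubefs /app
```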
You'll need to adjust the "ip" and "id" fields in authnode.json to match the node that configuration lives on, but other than that all configuration should be identical. You should also make sure the permissions on your TLS key and authnode configuration are correct - cd /etc/cubefs && sudo chmod 600 server.key authnode.json will take care of this. While you're at it, make sure /etc/cubefs/ is owned by cubefs on all of your nodes and that any files with keys in them are also set to chmod 600 (ie: read/write allowed for "cubefs", no permissions for group, no permissions for world).
In case it's helpful, this is what proper permissions should look like:
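(Illustrative rather than literal output - this is roughly what you're aiming for:)

```bash
# Expected shape of /etc/cubefs on each authnode:
#   -rw------- cubefs cubefs  authnode.json
#   -rw-r--r-- cubefs cubefs  server.crt
#   -rw------- cubefs cubefs  server.key
# (authroot.json and authservice.json, where present, should also be 600)
ls -l /etc/cubefs
```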
Once your remaining nodes are configured, start the cubefs-authnode service on each of them, just like you started the first node.
Checking on the state of your authnode cluster
You should be able to examine the logs on one of your authnodes (the first one is probably best for this) and see it electing a leader.
riff@cubefs-authnode01:/etc/cubefs$ sudo cat /var/log/cubefs/authnode/authNode/authNode_warn.log
2024/02/14 06:11:44.783799 [WARN ] authnode_manager.go:36: action[handleLeaderChange] change leader to [10.0.20.91:8443]
The leader will be elected as soon as two of the three (or whatever constitutes quorum in your Authnode cluster) nodes are online.
Creating the admin keys and an admin ticket
Create the following files in /etc/cubefs on your first node. These will be used to generate keys with the appropriate permissions:
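Based on the old ChubaoFS examples, each of these data files is a small JSON blob pairing an id with a role and a capability string. You'll want one each for the admin, master, datanode, metanode and client roles - data_datanode.json, data_metanode.json and data_client.json are referenced below, and I'll assume the other two are named data_admin.json and data_master.json to match. The exact ids and caps here are illustrative only, so adjust them to your needs; as a pattern:

```json
{
  "id": "admin",
  "role": "service",
  "caps": "{\"API\": [\"*:*:*\"]}"
}
```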
Whew, all done with those.
Use the authservice.json file you generated earlier to create an Authnode ticket.
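By analogy with the admin ticket command below, something like this - the output filename ticket_auth.json is my choice, the rest mirrors the other commands in this section:

```bash
cfs-authtool ticket -https -host=10.0.20.91:8443 -keyfile=authservice.json \
  -output=ticket_auth.json getticket AuthService
```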
Now, create an admin key:
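Again by analogy with the createkey commands below, assuming data_admin.json is the admin data file from the previous step and authenticating with the service ticket we just created:

```bash
cfs-authtool api -https -host=10.0.20.91:8443 -ticketfile=ticket_auth.json \
  -data=data_admin.json -output=key_admin.json AuthService createkey
```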
Finally, create an admin ticket:
root@cubefs-authnode01:/etc/cubefs# cfs-authtool ticket -https -host=10.0.20.91:8443 -keyfile=key_admin.json -output=ticket_admin.json getticket AuthService
You can now use the admin ticket to create the rest of the necessary keys.
Creating the master/resource manager keys
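This follows the same pattern as the datanode and metanode keys below, assuming you named the master's data file data_master.json (the output, key_master.json, is what we'll reference when configuring the masters):

```bash
cfs-authtool api -https -host=10.0.20.91:8443 -ticketfile=ticket_admin.json \
  -data=data_master.json -output=key_master.json AuthService createkey
```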
Creating the datanode keys
root@cubefs-authnode01:/etc/cubefs# cfs-authtool api -https -host=10.0.20.91:8443 -ticketfile=ticket_admin.json -data=data_datanode.json -output=key_datanode.json AuthService createkey
Creating the metadata node keys
root@cubefs-authnode01:/etc/cubefs# cfs-authtool api -https -host=10.0.20.91:8443 -ticketfile=ticket_admin.json -data=data_metanode.json -output=key_metanode.json AuthService createkey
Creating the client keys
root@cubefs-authnode01:/etc/cubefs# cfs-authtool api -https -host=10.0.20.91:8443 -ticketfile=ticket_admin.json -data=data_client.json -output=key_client.json AuthService createkey
Creating the master/resource manager nodes
On your master nodes, install the following systemd service file as /etc/systemd/system/cubefs-master.service:
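A sketch - it follows the same pattern as the authnode unit, just pointing at the master configuration:

```ini
[Unit]
Description=CubeFS Master (resource manager)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=cubefs
Group=cubefs
ExecStart=/usr/local/bin/cfs-server -c /etc/cubefs/master.json
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```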
Copy server.crt (but NOT server.key) from your authnode to /etc/cubefs/.
Now, fill out the following and save it as /etc/cubefs/master.json on each of your master nodes.
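A rough template - the ports are the standard master defaults (you'll see 17010 again in the cluster info output later), the log/wal/store paths are just my layout, and your patched build may also want the authnode addresses and certificate configured, so compare against the sample configs shipped with the source:

```json
{
  "role": "master",
  "ip": "10.0.20.101",
  "listen": "17010",
  "prof": "17020",
  "id": "101",
  "peers": "101:10.0.20.101:17010,102:10.0.20.102:17010,103:10.0.20.103:17010",
  "clusterName": "rifflabs-per",
  "retainLogs": "100",
  "logDir": "/var/log/cubefs/master",
  "logLevel": "info",
  "walDir": "/var/lib/cubefs/master/wal",
  "storeDir": "/var/lib/cubefs/master/store",
  "metaNodeReservedMem": "1073741824",
  "masterServiceKey": "REDACTED (auth_key from key_master.json)"
}
```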
You'll want to, as before, replace some of the values.
Specifically, set ip to the IP of each master node, set id to an appropriate unique value for each node (once again, it's probably best to use the same value as the last octet of the node's IP address for simplicity). Set clusterName to the same clusterName that you set before (when setting up the authnodes). Set "peers" in a similar fashion to the authnodes, only this time we're referencing the masters instead.
For masterServiceKey, copy the value from /etc/cubefs/key_master.json on your first Authnode. Here, you must use the value auth_key and not the value from auth_key_id. Confusingly, some components need the former and some need the latter.
Once you've configured all your master nodes, go ahead and start them all.
You should be able to see in the logs that a leader is elected when you have enough nodes for quorum, as before:
Configuring cfs-cli
Now for the exciting part - we finally get to test out the basics of our cluster authentication. (Okay, the really exciting part comes later, but this is still cool).
Open /etc/cubefs/key_client.json and take the value auth_id_key and copy it.
Then, fill out the following on your management machine (whatever machine you want to manage CubeFS with - make sure you install the tarball from earlier onto it), replacing the IP addresses with your master cluster's IP addresses. You'll need to set this value based on auth_id_key (NOT auth_id!) from /etc/cubefs/key_client.json from your first Authnode.
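The CLI reads its configuration from ~/.cfs-cli.json; a sketch is below. The masterAddr entries are my masters, and the clientIDKey field name is an assumption for the patched build - double-check it against the CLI's help if authentication is rejected:

```json
{
  "masterAddr": [
    "10.0.20.101:17010",
    "10.0.20.102:17010",
    "10.0.20.103:17010"
  ],
  "timeout": 60,
  "clientIDKey": "REDACTED (auth_id_key from key_client.json)"
}
```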
Let's first try an unprivileged operation:
wings@blackberry:~$ cfs-cli cluster info
[Cluster]
Cluster name : rifflabs-per
Master leader : 10.0.20.101:17010
Master-101 : 10.0.20.101:17010
Master-102 : 10.0.20.102:17010
Master-103 : 10.0.20.103:17010
Auto allocate : Enabled
MetaNode count : 0
MetaNode used : 0 GB
MetaNode total : 0 GB
DataNode count : 0
DataNode used : 0 GB
DataNode total : 0 GB
Volume count : 0
Allow Mp Decomm : Enabled
EbsAddr :
LoadFactor : 0
BatchCount : 0
MarkDeleteRate : 0
DeleteWorkerSleepMs: 0
AutoRepairRate : 0
MaxDpCntLimit : 3000
This tells us that our master cluster is working, and that things seem somewhat healthy.
Now, let's try something that requires a key:
wings@blackberry:~$ cfs-cli user create test
Create a new CubeFS cluster user
User ID : test
Password : [default]
Access Key: [auto generate]
Secret Key: [auto generate]
Type : normal
Confirm (yes/no)[yes]: yes
Create user success:
[Summary]
User ID : test
Access Key : REDACTED
Secret Key : REDACTED
Type : normal
Create Time: 2024-02-14 07:01:43
[Volumes]
VOLUME PERMISSION
If you instead see an authentication or key-related error, check all your key configuration (and in the absolute worst case, blow away your authnode's /var/lib/cubefs/* data and set it up again - I hit a bug at some point where nothing I did would make certain keys be recognised, and creating a fresh keystore was the only way to fix it. YMMV, and obviously only do that as a drastic measure) and then try again.
🎉 Great job if you've gotten this far! The rest is significantly easier than the previous steps, so you've not got far to go. 😄
Creating the full cluster
Now that we have our base infrastructure, we can add the actual data and metadata nodes that will store everything for us. Then we'll mount our new filesystem and store some data!
Creating the data nodes
On each data node:
Install CubeFS using the tarballs, as before, as well as creating the CubeFS user, group and directories (/var/lib/cubefs, /var/log/cubefs/, /etc/cubefs) with the appropriate permissions.
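Concretely, the binary install is the same as on the other nodes (followed by the same user, group and directory commands from earlier):

```bash
tar xvf cubefs-3.1.1-withauth.tar.gz
sudo cp cubefs/build/bin/cfs-* /usr/local/bin/
rm -r cubefs/
```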
Then, create the following systemd unit file as /etc/systemd/system/cubefs-datanode.service:
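Again, a sketch following the same pattern as the earlier units:

```ini
[Unit]
Description=CubeFS DataNode
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=cubefs
Group=cubefs
ExecStart=/usr/local/bin/cfs-server -c /etc/cubefs/datanode.json
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```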
Now it's time to attach disks. You should have each of your disks mounted as a mountpoint of your choice, but with a single disk per mountpoint. You do not have to use a partition table (but can if you prefer) - you can simply do, for example, mkfs.xfs /dev/sdx (IF /dev/sdx is the disk you are sure you want to format). Make sure your disks are mounted by UUID - you can find the UUID of each disk by doing, for example, blkid /dev/sda (or if using partitions, blkid /dev/sda1).
Once you have all of your disks attached to your nodes, with UUID-based mounting to avoid issues, create a cubefs folder within each mountpoint and change it to be owned by the cubefs user. We'll assume all your disks are named /mnt/brick.something from hereon out, but feel free to change the commands as needed.
Here's an example of creating the relevant cubefs folders in a system with a bunch of bricks mounted as /mnt/brick.$name:
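(Assuming the /mnt/brick.$name convention:)

```bash
for brick in /mnt/brick.*; do
  sudo mkdir -p "${brick}/cubefs"
  sudo chown cubefs:cubefs "${brick}/cubefs"
done
```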
It's a similar process to MooseFS bricks if you've set those up before.
Once you've got your disks attached, it's time to create the datanode.json configuration file on each of your datanodes. Unlike the previous templates, the only things you need to change here are the disks array (replace it with a list of your disks) and the datanode key. The value after the colon (:) is the reserved space for that disk in bytes (space that CubeFS will leave free and assume cannot be used, in other words), useful for ensuring disks don't fill up completely. You'll want to set "serviceIDKey" to the value of auth_id_key from /etc/cubefs/key_datanode.json from your first Authnode.
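A sketch of the file - the ports are the usual datanode defaults, the two example disks and their 10GiB reservations are placeholders, and the raft/log paths are just my layout:

```json
{
  "role": "datanode",
  "listen": "17310",
  "prof": "17320",
  "raftHeartbeat": "17330",
  "raftReplica": "17340",
  "raftDir": "/var/lib/cubefs/datanode/raft",
  "logDir": "/var/log/cubefs/datanode",
  "logLevel": "info",
  "masterAddr": [
    "10.0.20.101:17010",
    "10.0.20.102:17010",
    "10.0.20.103:17010"
  ],
  "disks": [
    "/mnt/brick.a/cubefs:10737418240",
    "/mnt/brick.b/cubefs:10737418240"
  ],
  "serviceIDKey": "REDACTED (auth_id_key from key_datanode.json)"
}
```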
As a hint, here's a script to give you the disks array preformatted as a ready-to-paste block if you have all your disks as /mnt/brick.$name:
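(Adjust the reserved-space figure to taste - 10GiB per disk here:)

```bash
RESERVED=10737418240   # bytes CubeFS should leave free on each disk
echo '"disks": ['
for brick in /mnt/brick.*; do
  echo "  \"${brick}/cubefs:${RESERVED}\","
done
echo '],'
# Remember to drop the trailing comma on the last entry when pasting.
```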
Now that you have each of your datanodes configured, bring them online.
root@monstar:~# systemctl enable --now cubefs-datanode
Created symlink /etc/systemd/system/multi-user.target.wants/cubefs-datanode.service → /etc/systemd/system/cubefs-datanode.service.
root@ambellina:~# systemctl enable --now cubefs-datanode
Created symlink /etc/systemd/system/multi-user.target.wants/cubefs-datanode.service → /etc/systemd/system/cubefs-datanode.service.
root@al:~# systemctl enable --now cubefs-datanode
Created symlink /etc/systemd/system/multi-user.target.wants/cubefs-datanode.service → /etc/systemd/system/cubefs-datanode.service.
root@sizer:~# systemctl enable --now cubefs-datanode
Created symlink /etc/systemd/system/multi-user.target.wants/cubefs-datanode.service → /etc/systemd/system/cubefs-datanode.service.
Success!
(PS: My cluster shows some used space already because the disks backing CubeFS also hold other data.)
Creating the metadata nodes
If your metadata nodes will reside on the same hosts as your datanodes (which I highly recommend*), setting up the metadata nodes is simple and easy. If not, follow the rough directions above for installing CubeFS and creating users and directories and such.
* Note that in a very large system you may want to have dedicated hosts for metadata.
Open /etc/systemd/system/cubefs-metanode.service in an editor, and fill it in as follows:
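Same pattern once more:

```ini
[Unit]
Description=CubeFS MetaNode
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=cubefs
Group=cubefs
ExecStart=/usr/local/bin/cfs-server -c /etc/cubefs/metanode.json
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```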
Now, edit /etc/cubefs/metanode.json and fill it out as follows. You'll want to set "serviceIDKey" to the value of auth_id_key from /etc/cubefs/key_metanode.json from your first Authnode.
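A sketch - the listen port matches what you'll see in the metanode listing later, the raft port names come from the notes quoted below, and totalMem (the memory ceiling the metanode is allowed to use, in bytes) is a placeholder you should size for your own hosts:

```json
{
  "role": "metanode",
  "listen": "17210",
  "prof": "17220",
  "raftHeartbeatPort": "17230",
  "raftReplicaPort": "17240",
  "raftDir": "/var/lib/cubefs/metanode/raft",
  "metadataDir": "/var/lib/cubefs/metanode/data/meta",
  "logDir": "/var/log/cubefs/metanode",
  "logLevel": "info",
  "totalMem": "8589934592",
  "masterAddr": [
    "10.0.20.101:17010",
    "10.0.20.102:17010",
    "10.0.20.103:17010"
  ],
  "serviceIDKey": "REDACTED (auth_id_key from key_metanode.json)"
}
```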
As an aside, you should read the following important notes from the CubeFS docs:
Notes
The configuration options listen, raftHeartbeatPort, and raftReplicaPort cannot be modified after the program is first configured and started.
The relevant configuration information is recorded in the constcfg file under the metadataDir directory. If you need to force modification, you need to manually delete the file.
The above three configuration options are related to the registration information of the MetaNode in the Master. If modified, the Master will not be able to locate the MetaNode information before the modification.
In other words, modifying the listen port or any of the Raft ports used by the node after deployment will cause issues. Basically, don't, and if you have to, be aware of the above.
If you have a large/alternative disk such as a dedicated NVMe drive you wish to use for metadata, mount it as /var/lib/cubefs/metanode/data/meta prior to starting the metanode for the first time. (If you do need to set that up later: stop the metanode, move /var/lib/cubefs/metanode/data/meta to a new location, mkdir /var/lib/cubefs/metanode/data/meta, mount your new drive, and then move the contents of the old folder back to the original location now that your new drive is mounted.)
Finally, start all your metanodes.
If you see something like the following, you've done it - congratulations.
wings@blackberry:~$ cfs-cli metanode ls
[Meta nodes]
ID ADDRESS WRITABLE STATUS
6 10.0.20.15:17210 Yes Active
7 10.0.20.18:17210 Yes Active
8 10.0.20.16:17210 Yes Active
9 10.0.20.14:17210 Yes Active
You should have a fully functioning CubeFS installation.
Creating your first volume
It's time to try it out for real.
Create your first volume (the full command is sketched after this list), specifying that you want:
- A volume called media
- Owned by media
- With follower-read=true (highly recommended for better performance)
- 2 replicas (in other words, 2x replication, two copies)
- A total volume capacity of 5000GB (5TB)
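Here's what that looks like as a command - the flag names below are my best reading of the corresponding cfs-cli options, so check the help output if any of them are rejected:

```bash
cfs-cli volume create media media \
  --follower-read=true \
  --replica-num=2 \
  --capacity=5000
```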
Feel free to tweak any of these as you wish, and run cfs-cli volume create --help for more information.
That's it! No, really, that's it. You're ready to mount it!
Mounting the filesystem
On the machine you wish to use CubeFS with, install CubeFS (you will not need a CubeFS user this time) and then install the following systemd unit file. This one is special - it is a template service unit, which will allow you to create arbitrary mounts using CubeFS as you wish.
You will create it as /etc/systemd/system/cubefs-mount@.service - note the @ in the filename, which is what marks it as a template.
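A sketch of such a template unit - the %i instance name selects which client config to load, matching the client-media.conf we're about to create. Whether you want Type=forking (cfs-client backgrounding itself) or Type=simple depends on your build's behaviour, so adjust if the service flaps:

```ini
[Unit]
Description=CubeFS mount (%i)
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
ExecStart=/usr/local/bin/cfs-client -c /etc/cubefs/client-%i.conf
Restart=on-failure
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
```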
Now that you've got your mount unit file created, you just need a configuration file.
Let's open up /etc/cubefs/client-media.conf and create our first mount config:
You'll need the value of auth_key_id from /etc/cubefs/key_client.json on your first Authnode.
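A sketch of the mount config - the mount point is my choice, the volume and owner match the volume created above, and (as with the CLI config) the exact name of the key field is an assumption for the patched build:

```json
{
  "mountPoint": "/mnt/cubefs/media",
  "volName": "media",
  "owner": "media",
  "masterAddr": "10.0.20.101:17010,10.0.20.102:17010,10.0.20.103:17010",
  "logDir": "/var/log/cubefs/client",
  "logLevel": "warn",
  "clientIDKey": "REDACTED (auth_key_id from key_client.json)"
}
```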
As an aside, you can mint additional client identities if you'd like separate keys for different mounts (and remove old ones by running the same cfs-authtool api command with deletekey instead of createkey). Basically, just create a new version of data_client.json with a new name and a new ID inside it, and then generate a key as before. The contents of auth_id_key contain the ID you want to authenticate with, so you can authenticate with different keys. You can actually do this same trick for most of the components in the cluster, but beware of overcomplicating things.
Without further ado...
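(Using the template unit and the mount point from the config sketch above:)

```bash
sudo systemctl daemon-reload
sudo mkdir -p /mnt/cubefs/media
sudo systemctl enable --now cubefs-mount@media
df -h /mnt/cubefs/media
```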
We have a working mount.
Done!
This intro post covers setting up a complete but basic CubeFS installation. Future posts will cover enabling the erasure coding and object storage subsystems, as well as integrating CubeFS into Kubernetes using the CSI driver.
If you enjoyed this post, check out the rest of my blog and maybe drop me a message via any of my contact details. Cheers!