Unsinkable Container Ships Part 1 - Distributed File Systems

The other day, I experienced an outage in my home data center. One of my Docker container hosts stopped working, and while most of its containers aren’t essential, one of them had an impact on a different system running on another server. The beauty of microservices.
I didn’t even notice until, a day or two later, I wondered why my stuff wasn’t processing anymore.

Lesson 1:
I should have been monitoring that host.

Lesson 2:
I’m paranoid: I’m running two domain controllers providing two DNS servers, a DHCP failover configuration and proper relays in all VLANs, an SQL-AG, and the Orion® Platform in High Availability (HA) mode, but apparently the tiny things cause trouble. 

The Goal: Increase and Ensure Availability

I had various ideas, including Orion Platform actions to change routing to my Docker hosts, until I noticed I’d ignored the most obvious choice: creating a Docker swarm.

While researching, I found a fascinating approach: a Docker swarm running on a distributed file system with persistent storage. Now I wanted to know more.

There are a few different file systems available for such a use case, some open source, some licensed, but even some of the paid ones come with a free developer/community version.
I decided to spend some time with Ceph, as it’s open source, quite mature, in use by some global players, and well documented. 

Ceph is kind of a “dark horse”: at first sight you’d think it’s too messy to be deployed anywhere in production, but some enterprise storage hardware uses Ceph as its underlying architecture. Furthermore, it’s part of Red Hat’s storage offering and supported out of the box by other Linux distributions, too.

Also, I’m considering changing my arrays from TrueNAS to QuantaStor, so digging into Ceph is a two-birds-one-stone thing for me.

At first, the documentation is a bit scary, as it’s very detailed and the subject is, in fact, complicated, but a half-automated approach removes a lot of the complexity.

So, Where to Start?

I prepped an Ubuntu 21.04 Server VM (2 vCPU, 4 GB memory, 16 GB + 40 GB disks) with a pretty basic configuration: SSH/root, SNMP, a fixed time zone and NTP, a manual but standard installation of Docker and docker-compose, and static IPs with DNS forward/reverse entries. To fight paranoia, I edited the hosts file, too.
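The hosts-file entries could be generated with something like the sketch below. The node names swarm1 to swarm5 follow the naming pattern used later in this post, the 192.0.2.x addresses are placeholders from the documentation-only range, the consecutive numbering is an assumption, and print_swarm_hosts is a made-up helper name.

```shell
# Print one hosts-file line per node: swarm1 .. swarmN.
# All three arguments are placeholders you choose yourself.
print_swarm_hosts() {
  prefix="$1"   # first three octets, e.g. 192.0.2
  first="$2"    # last octet used for swarm1
  count="$3"    # number of nodes
  i=1
  while [ "$i" -le "$count" ]; do
    echo "${prefix}.$((first + i - 1)) swarm${i}"
    i=$((i + 1))
  done
}
```

Something like `print_swarm_hosts 192.0.2 11 5 | sudo tee -a /etc/hosts` would then append the five lines on a node.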

I cloned it four times, so I can work with five machines, distributed over four ESXi hosts, of course, as I want to achieve redundancy. By the way, don’t be like me and forget to change the hostnames of the clones!

Swarm1 is going to be my manager, and I’m running the installation on that one. There’s an official Ubuntu package:

apt install -y cephadm

And the next step is to bootstrap the cluster:

cephadm bootstrap --mon-ip

This verifies network connectivity, pulls a container image, and deploys the first node with a shiny dashboard.
Copy the password somewhere before moving on:

But there’s nothing to see yet, so let’s do a few other things first.
In a later step, we need the CLI, so let’s quickly install it:

cephadm install ceph-common

Now we can check the status:

Yep, not much to see.

Let’s Add More Nodes

We can stay on the manager and attach the other nodes remotely. A prerequisite is to exchange a key, which we can do via SSH. Occasionally, Linux can be convenient.
I hear Leon Adato saying, “I told you so.” Ja, ja, ja. 

ssh-copy-id -f -i /etc/ceph/ceph.pub root@
(Repeat for the other nodes)
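The repetition can be wrapped in a small loop. This is only a sketch: copy_ceph_key is a hypothetical helper name, and it assumes root SSH login works on every target, as set up on the base VM.

```shell
# Hypothetical helper: push the cluster's public SSH key to every host given.
# Stops at the first host where ssh-copy-id fails.
copy_ceph_key() {
  for host in "$@"; do
    ssh-copy-id -f -i /etc/ceph/ceph.pub "root@${host}" || return 1
  done
}
```

On the manager it would be called as `copy_ceph_key swarm2 swarm3 swarm4 swarm5`, where swarm3 to swarm5 are assumed names that continue the swarm1/swarm2 pattern.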

Adding the nodes:

ceph orch host add swarm2 --labels _admin

Now what is this label thing?
Labels are free-form tags you put on hosts. They’re mostly meant as internal references, but some are special, like _admin, which marks the nodes that get administrative capabilities.
In my tiny cluster, I want all of them to participate, so I’ll add it to all of them. Apparently the first node is a manager automatically, but it didn’t show the label for me, so I applied it manually:

ceph orch host label add swarm1 _admin

Let’s verify the whole process: 
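The host list comes from `ceph orch host ls`. As a rough sketch, a check like the following reports any node that hasn’t joined yet; missing_hosts is a hypothetical helper, and it assumes the host name is the first column of that command’s output.

```shell
# Hypothetical check: print every expected host the orchestrator does not list.
# Assumes `ceph orch host ls` prints a header row, then one host per line,
# with the host name in the first column.
missing_hosts() {
  listed="$(ceph orch host ls | awk 'NR > 1 { print $1 }')"
  for host in "$@"; do
    printf '%s\n' "$listed" | grep -qxF -- "$host" || echo "$host"
  done
}
```

An empty result from `missing_hosts swarm1 swarm2` would mean both hosts are known to the cluster.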

Great, the cluster has been formed within just a few steps.

Now We Need Storage

Remember I created the machine with two disks? We’re now attaching the second one on each node to Ceph as a storage device, which is called an OSD (Object Storage Daemon) in Ceph terminology.
The documentation shows a promising command to attach everything in one process:

ceph orch apply osd --all-available-devices

Unfortunately, it didn’t do anything for me, so I had to attach the disks manually:

ceph orch daemon add osd swarm1:/dev/sdb
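Since every node is a clone of the same VM, the same command can be looped over the remaining hosts. A minimal sketch, assuming each clone exposes its second disk as /dev/sdb just like swarm1; add_osds is a hypothetical helper name.

```shell
# Hypothetical helper: add one OSD per host, using the same device path everywhere.
# Usage: add_osds <device> <host>...
add_osds() {
  device="$1"
  shift
  for host in "$@"; do
    ceph orch daemon add osd "${host}:${device}" || return 1
  done
}
```

For example, `add_osds /dev/sdb swarm2 swarm3`. Running `ceph orch device ls` beforehand lists the devices the orchestrator considers available, which may hint at why the all-in-one command did nothing.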

Check it: 

Well, that’s a milestone reached. The next step is to create a file system, and that’s a quick one:

ceph fs volume create data

Verify the overall status:
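The overall status is available via `ceph status`, and `ceph health` gives the one-line summary. Right after creating OSDs and a file system, the cluster may need a moment to settle, so a polling loop along these lines can help. The helper name and the default retry values are arbitrary assumptions, not recommendations from the Ceph docs.

```shell
# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# Usage: wait_for_health_ok [tries] [delay_seconds]
wait_for_health_ok() {
  tries="${1:-30}"
  delay="${2:-10}"
  while [ "$tries" -gt 0 ]; do
    case "$(ceph health)" in
      HEALTH_OK*) return 0 ;;   # healthy, stop waiting
    esac
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1                      # gave up: still not HEALTH_OK
}
```

A typical call would be `wait_for_health_ok && ceph status`.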


Mounting the Pool

Unfortunately, this requires a little work, and it took me a while to get the essential information from the documentation. We need to create a folder and copy the configuration from the manager to the other nodes.
So, I log in to swarm2 and shoot these lines:

mkdir -p -m 755 /etc/ceph
ssh root@ "sudo ceph config generate-minimal-conf" | sudo tee /etc/ceph/ceph.conf
chmod 644 /etc/ceph/ceph.conf

We created a folder, placed the copy of a “minimal configuration” inside, and adjusted the permissions. The documentation mentions another step to copy the key-file:

ssh root@ "sudo ceph fs authorize data client.admin / rw" | sudo tee /etc/ceph/ceph.client.admin.keyring

But it failed for me as it only created an empty file, so I used an external SSH client to copy the ceph.client.admin.keyring from the manager /etc/ceph/ to the nodes.
Obviously, all other nodes require the same preparation.
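Instead of logging in to each node, the same preparation could be pushed from the manager in one go. This is a sketch of an alternative, not the procedure from the documentation: push_ceph_client_files is a hypothetical helper, and it assumes root SSH access to the nodes and an existing /etc/ceph/ceph.client.admin.keyring on the manager.

```shell
# Hypothetical helper, run on the manager: write a minimal ceph.conf on each
# host and copy the admin keyring next to it.
push_ceph_client_files() {
  for host in "$@"; do
    # Stream the minimal config over SSH; the remote side creates the folder,
    # writes the file, and sets the same permissions as in the manual steps.
    ceph config generate-minimal-conf |
      ssh "root@${host}" "mkdir -p -m 755 /etc/ceph && cat > /etc/ceph/ceph.conf && chmod 644 /etc/ceph/ceph.conf" || return 1
    # -p preserves the keyring's restrictive file mode.
    scp -p /etc/ceph/ceph.client.admin.keyring "root@${host}:/etc/ceph/" || return 1
  done
}
```

It would be called as `push_ceph_client_files swarm2 swarm3`, and so on for the remaining nodes.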

The next step is to create a mount point and attach it. These steps apply to all nodes including the first one:

mkdir /var/data
apt install ceph-fuse -y
ceph-fuse --id admin /var/data

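To make it persistent, we need to add a line to /etc/fstab. These are two variants of the same entry, and the second one names the configuration file explicitly, so pick one rather than adding both:

none /var/data fuse.ceph ceph.id=admin,_netdev,defaults 0 0
none /var/data fuse.ceph ceph.id=admin,ceph.conf=/etc/ceph/ceph.conf,_netdev,defaults 0 0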
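Editing /etc/fstab by hand on five nodes invites duplicate entries, so an idempotent append may be worth a few lines. A minimal sketch; ensure_line is a hypothetical helper name.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs don't create duplicates.
ensure_line() {
  file="$1"
  line="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

As root, `ensure_line /etc/fstab 'none /var/data fuse.ceph ceph.id=admin,_netdev,defaults 0 0'` can then be run as often as you like, and `mount -a` is a quick way to test the entry without rebooting.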

Testing the Black Magic

Let’s create a file; I’m on swarm1 now:

touch /var/data/mylittlepony

Switch to a different node and verify it’s properly replicated:
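On the other node, a plain `ls -l /var/data` is enough. To check all nodes at once from swarm1, something like this sketch could be used. It assumes root SSH access and CephFS mounted at /var/data everywhere; hosts_missing_file is a hypothetical helper name.

```shell
# Hypothetical check: print every host on which the given path is NOT visible.
# A host that can't be reached over SSH is reported as well.
hosts_missing_file() {
  path="$1"
  shift
  for host in "$@"; do
    # -n keeps ssh from reading (and swallowing) this shell's stdin
    ssh -n "root@${host}" "test -e '${path}'" || echo "$host"
  done
}
```

An empty result from `hosts_missing_file /var/data/mylittlepony swarm2 swarm3` means the file shows up on both nodes.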


I think now is the right time to jump into the dashboard, as we should be able to see a few bits.
There’s both an HTTP and an HTTPS version. The HTTP one uses port 8080, which is unfortunate for a later step, so let’s change it:

ceph config set mgr mgr/dashboard/server_port 8081
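If the dashboard keeps answering on the old port, reloading the dashboard module may be needed. The sketch below bundles the port change with a disable/enable cycle and then prints the URLs the manager currently serves. That the reload is required on every release is an assumption, and set_dashboard_port is a hypothetical helper name.

```shell
# Hypothetical helper: change the dashboard's HTTP port and reload the module.
# The disable/enable cycle is an assumption about how to make the manager
# pick up the new port; check the dashboard docs for your release.
set_dashboard_port() {
  port="$1"
  ceph config set mgr mgr/dashboard/server_port "$port" || return 1
  ceph mgr module disable dashboard || return 1
  ceph mgr module enable dashboard || return 1
  ceph mgr services   # shows the endpoints the manager modules expose
}
```

It would be called as `set_dashboard_port 8081`.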

The first thing we see is this:

I’m using open source at home, so I’m okay with providing telemetry data, but you probably won’t enable it in a corporate environment.
Anyway, it’s nice that they ask, not like some others…but let’s not go there. Instead, let’s go here:

We can check the cluster members and their functions, and the file system we created earlier: 

One thing bothers me, and it’s that certain functions aren’t redundant, or not redundant enough for me. I’m using the GUI to upscale monitors and metadata servers, just in case I lose one again. 
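The same scaling should also be possible from the CLI through the orchestrator’s placement counts. A minimal sketch, assuming the file system is the one called data created earlier; scale_mon_mds is a hypothetical helper, and the counts are whatever suits your cluster.

```shell
# Hypothetical helper: request a number of monitors and a number of
# metadata servers for one file system from the orchestrator.
# Usage: scale_mon_mds <mon_count> <fs_name> <mds_count>
scale_mon_mds() {
  mons="$1"
  fs_name="$2"
  mds="$3"
  ceph orch apply mon --placement="$mons" || return 1
  ceph orch apply mds "$fs_name" --placement="$mds" || return 1
}
```

For example, `scale_mon_mds 3 data 3`, followed by `ceph orch ls` to see how many daemons of each service are running.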

The documentation is full of switches to tune the environment, but it wasn’t clear to me where to apply them. If you check the “Configuration” tab, it shows six pages of values to edit, which isn’t much. At the bottom there’s a suspicious line:

And I was like, “Hey, where are the others?” until I found a pulldown in the top-right corner:

Dev mode = unlimited power!


Anyway, for my current use case, that’s it—there’s shared storage in HA mode on multiple Docker hosts. Perfect!

To make the system useful in production, you need to enable other protocols like iSCSI and NFS.

There’s also a Windows client available, but I haven’t tested it yet.

Ceph is a very interesting bit of technology to play with, and I’ll surely revisit it in another form.

But my next project is what I initially intended to do: building a Docker swarm. That’s going to be easy, and I’ll write a quick blog post on how to build and monitor it.

Take care!