Unsinkable Container Ships Part 3 - DNS to the Rescue

saschg over 2 years ago 7 minute read time

Here we are again!

This is part 3 of 3 on my adventure to create a highly available Docker environment.

What did we do so far?
- Created a distributed filesystem with Ceph
- Set up a Docker Swarm and deployed Traefik and Portainer

What’s left to do?

As you might have noticed, my design still has a single point of failure, and that's how the hosts themselves are addressed.
To fix it, we need to create a virtual IP and make sure the nodes are listening to it.

As usual, research is fun, as you find really creative ideas:
Some folks add a load balancer in front of the reverse proxy, some add a load balancing reverse proxy in front of the reverse proxy. Doh.

That’s too complicated. I want something simple.

Introducing: Keepalived.

Keepalived is a Linux package using VRRP. The network guys reading this are likely familiar with the protocol. It’s often used to create a link between routers or firewalls to establish a HA cluster, and it’s similar to CARP on FreeBSD or even HSRP for Cisco.

Before installing it, we need to sort a prerequisite to allow our manager nodes to bind external IP addresses:

nano /etc/sysctl.conf
net.ipv4.ip_nonlocal_bind=1
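
With three managers to touch, it can help to make that edit repeatable. The following is a minimal sketch, not what I actually ran: ensure_line is a made-up helper name, and it only appends the line when the file does not contain it yet.

```shell
# ensure_line FILE LINE (hypothetical helper)
# Appends LINE to FILE unless an identical line is already present,
# so running it twice does not duplicate the entry.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

Run as root, the call would be ensure_line /etc/sysctl.conf net.ipv4.ip_nonlocal_bind=1, and once the setting is active, sysctl -n net.ipv4.ip_nonlocal_bind should print 1.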

I rebooted each manager (swarm1-3 in my example) to apply this, although running sysctl -p as root should load the setting without a reboot.
Once they are back, we quickly install the package:

apt install keepalived -y

And create a config file:

nano /etc/keepalived/keepalived.conf

Now, this is the configuration for swarm1, my primary manager:

vrrp_script chk_docker {
    script "pgrep dockerd"
    interval 1
    weight 20
}

vrrp_script chk_traefik {
    script "pgrep traefik"
    interval 30
    weight 10
}

vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 12
    priority 200
    advert_int 1
    unicast_src_ip 10.0.10.31
    unicast_peer {
        10.0.10.32
        10.0.10.33
    }

    authentication {
        auth_type PASS
        auth_pass pass1234
    }

    virtual_ipaddress {
        10.0.10.39/24
    }

    track_script {
        chk_docker
        chk_traefik
    }
}

Let Me Explain What Happens Here

The first elements (scripts) check the availability of processes; in the Windows ecosystem this could probably be a simple file share quorum in the Windows Cluster Manager, but VRRP works a little differently.

In the “instance” element we describe the actual failover condition.
The state is either MASTER or BACKUP, but as I'm using MASTER on all three nodes, I let the nodes elect the actual master dynamically based on availability and priority. The priority setting is the base value, and the higher it is, the more important the node; the weight of each check script is added on top as long as that check succeeds. unicast_src_ip is the local IP and the peers are, well, the peers.
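
To make the interplay of priority and weight concrete, here is a small sketch of the arithmetic as I understand it from the keepalived documentation: with positive weights, each passing check adds its weight to the node's base priority. The function name effective_priority is made up for this example; 200, 20 and 10 come from the config above, and 100 is the priority I use on swarm2.

```shell
# effective_priority BASE [WEIGHT...]   (hypothetical helper)
# Adds the weight of every passing check to the base priority.
# Assumption: all weights are positive, as in the config above.
effective_priority() {
    total=$1
    shift
    for w in "$@"; do
        total=$((total + w))
    done
    echo "$total"
}

# swarm1 with dockerd and traefik both running: 200 + 20 + 10
effective_priority 200 20 10
# swarm1 with both checks failing: only the base priority is left
effective_priority 200
# swarm2 with both checks passing: 100 + 20 + 10
effective_priority 100 20 10
```

This prints 230, 200 and 130. If that reading is right, swarm1 with both checks failing (200) still outranks a healthy swarm2 (130), so with these numbers the VIP only moves when swarm1 stops sending advertisements altogether. Closer base priorities or larger weights would be needed to let a failed check alone trigger a failover.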

I'm using simple password authentication, and finally comes the most important bit: the virtual IP, which is shared between the nodes.

On the second node, swarm2 in my case, I'm using the same config but lower the priority to 100 and change the IP addresses to src=.32 with peers .31 and .33. And guess what: on swarm3 I'm going even lower with priority 50 and adjust the IP addresses accordingly.
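
Since only the priority and the addresses differ between the nodes, the instance block could also be generated instead of edited by hand. This is just a sketch of that idea: render_instance is a hypothetical helper, everything else is copied from my swarm1 config, and the two vrrp_script blocks are left out because they are identical on every node.

```shell
# render_instance PRIORITY SRC_IP PEER...   (hypothetical helper)
# Prints the vrrp_instance block for one node; only the priority,
# the local address and the peer list change between nodes.
render_instance() {
    priority=$1
    src_ip=$2
    shift 2
    cat <<EOF
vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 12
    priority $priority
    advert_int 1
    unicast_src_ip $src_ip
    unicast_peer {
$(printf '        %s\n' "$@")
    }
    authentication {
        auth_type PASS
        auth_pass pass1234
    }
    virtual_ipaddress {
        10.0.10.39/24
    }
    track_script {
        chk_docker
        chk_traefik
    }
}
EOF
}
```

For swarm2 the call would be render_instance 100 10.0.10.32 10.0.10.31 10.0.10.33, and for swarm3 render_instance 50 10.0.10.33 10.0.10.31 10.0.10.32.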

Each node requires a:

sudo service keepalived restart

Let’s verify the status:

But let’s look at the local IP addresses as well:

The VIP .39 is currently assigned to the node with the highest priority, my friend swarm1.
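
If you want to repeat that check on each node without eyeballing the output, a tiny filter is enough. The sketch below assumes the usual output format of the ip command, where every IPv4 address shows up as "inet ADDRESS/PREFIX"; holds_vip is a made-up name.

```shell
# holds_vip VIP   (hypothetical helper)
# Reads the output of "ip -4 addr show" on stdin and succeeds
# if VIP is assigned to one of the listed interfaces.
holds_vip() {
    grep -qF "inet $1/"
}
```

On a manager this would be used as: ip -4 addr show dev ens160 | holds_vip 10.0.10.39 && echo "VIP is here". The keepalived side can be inspected with systemctl status keepalived or journalctl -u keepalived.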

DNS Tricks

Before we can test failover, we need to deal with DNS.

In the previous post, we deployed Traefik and Portainer and created A records that point to the individual hosts. Now we need to point each of these records to the VIP.

Attention: There’s one major rule when testing DNS changes, or any action that depends on DNS in general, and that goes: cache is a beast.
So, flush your caches, my friends. DNS Server cache, update data files, local machine cache, browser cache.
So many great, working concepts died a premature death because the cache hadn't been cleared and they were deemed unsuccessful.
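
A quick way to see whether a stale answer is still floating around is to compare what each name resolves to with the VIP. The following is only a sketch: stale_records is a made-up helper, and the hostnames in the usage comment are placeholders for whatever records you created.

```shell
# stale_records VIP   (hypothetical helper)
# Reads "hostname address" pairs on stdin and prints every hostname
# whose address is not the VIP yet (including names without an answer).
stale_records() {
    vip=$1
    while read -r name addr; do
        [ "$addr" = "$vip" ] || echo "$name"
    done
}

# Usage, assuming dig is installed and these placeholder names exist:
# for h in traefik.example.com portainer.example.com; do
#     echo "$h $(dig +short "$h" | tail -n 1)"
# done | stale_records 10.0.10.39
```

An empty result means every name already points to the VIP from where you are asking; anything printed is a candidate for another round of cache flushing.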

Here’s my current swarm2:

And here again, after I cut the network on swarm1:

Ping is still fine, and all dashboards, too. That’s what I wanted to see!

Here's some behind-the-scenes information:

At 17:39, swarm2 took over the responsibility. Now I’m enabling networking again on swarm1.

Swarm2 properly remembered its place and returned to backup state, making swarm1 automatically the master. Wunderbar.

We have now made the trinity of storage (Ceph), compute (Docker Swarm), and networking (VRRP) highly available.
In theory, all of this could run on a few Raspberries with attached SSDs, which would make this a very, very inexpensive hyperconverged cluster.

Look at the last time stamp in the screenshot. Finally, it’s time for a beer to celebrate!

Some Housekeeping

I tend to run my containers with the “latest” tag for the image, but the container/service re-creation requires a trigger somehow. Sure, there are cronjobs, but there’s a more elegant variant: a container to update other containers.

Previously I was using watchtower, but I learned it doesn’t work in Swarm mode. An alternative is shepherd.

Let’s do this:

mkdir /var/data/containers/shepherd
cd /var/data/containers/shepherd
nano docker-compose.yml

The file itself is simple:

version: "3"
services:
  shepherd:
    image: mazzolino/shepherd
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      placement:
        constraints:
          - node.role == manager

Kick it with docker stack deploy shepherd -c docker-compose.yml

That’s it. By the way, we could use Portainer to deploy it, too, if you prefer a GUI.

In the default setting, shepherd checks for new versions every five minutes and updates the service when it finds one.
Less manual labour, more time for beer. But wait—there’s more.

Over time, tools like watchtower or shepherd pile up old and unused images unless configured otherwise.

Similar to watchtower, shepherd has a switch to get rid of unused images automatically, --env IMAGE_AUTOCLEAN_LIMIT="x", where x is the number of images you would like to keep. It's still not flexible enough for me, or, more precisely, it lacks one feature: exclusions.

I’m using a container called docker-cleanup, which is no longer maintained but gets the job done just fine.

First, I create a file to describe the exclusion(s):

nano docker-cleanup.env
KEEP_IMAGES=heimdall,xxx,xxx
DEBUG=1

I’ve lost my Heimdall dashboard configuration twice after an update and don’t want that happening again.
The compose file:

version: "3"
services:
  docker-cleanup:
    image: meltwater/docker-cleanup:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker:/var/lib/docker
    networks:
      - internal
    deploy:
      mode: global
    env_file: /var/data/containers/cleanup/docker-cleanup.env

networks:
  internal:
    driver: overlay

And off you go:

docker stack deploy cleanup -c docker-compose.yml

Cleanup checks for unused volumes and will delete anything not in use for 30 minutes, which gives us a convenient timeframe for maintenance work.

What’s Left? The Bonus Level!

That’s the base for my (hopefully) unsinkable container ship, and I will now start filling it with services.

A further option could be to tell Docker, or at least specific containers, to send syslog messages to the Orion® Platform. Unfortunately, Docker sends the container ID instead of the name, which makes identifying the source a bit difficult.

If you're lucky, you have a tool with tagging capabilities like SolarWinds® Log Analyzer at your disposal, and you can create simple rules to add an explanation tag to the message.
To enable syslog on Docker, follow the documentation here.
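
Docker's syslog logging driver also has a tag option that accepts templates such as {{.Name}}, which may help with the container ID problem. I haven't wired this into my setup, so take the following as a sketch: syslog_opts is a made-up helper, and the address and service name in the usage note are placeholders.

```shell
# syslog_opts HOST PORT   (hypothetical helper)
# Prints the logging flags for a UDP syslog target, one word per line.
# The tag template asks the driver to use the container name
# instead of the default short container ID.
syslog_opts() {
    printf '%s\n' \
        --log-driver syslog \
        --log-opt "syslog-address=udp://$1:$2" \
        --log-opt 'tag={{.Name}}'
}
```

Applied to an existing service, that could look like: docker service update $(syslog_opts 10.0.10.50 514) portainer_portainer. The same flags work with docker service create and docker run.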

On top of that, Docker comes with an API, and “GET /info” sounds like an invitation:

Yes, it’s basic, but took just two minutes. Feel free to create a nice template yourself and share with the community. Maybe I’ll do it myself as long as this project runs.
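
For anyone who wants to see what that endpoint returns before building a template, the API can be queried locally through the Docker socket. The sketch below pulls single numeric fields out of the JSON with standard tools; json_number is a made-up helper, and jq would be the nicer choice if it's installed.

```shell
# json_number KEY   (hypothetical helper)
# Reads JSON on stdin and prints the integer value stored under "KEY".
# This is a crude text match, good enough for flat numeric fields
# such as Containers, ContainersRunning or Images.
json_number() {
    tr -d ' \n' | sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p"
}

# Usage on a manager node (needs access to the Docker socket):
# curl -s --unix-socket /var/run/docker.sock http://localhost/info \
#     | json_number ContainersRunning
```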

But, knowing myself, I guess in a few weeks it will get boring, and I'll be trying something else.
Or maybe I return to single nodes with proper monitoring and snapshots if I’m tired of the complexity.

Mid-term, I'm considering creating an API Poller template to monitor Ceph, as there are a few requests here in THWACK® looking for a way to retrieve metrics.
They have a well-documented API, and instead of tinkering with Linux scripts like a caveman, why not use the more advanced options provided by our platform?

But that’s something for the future.

Right now, I have enough of this project and need to find something else to play with. Literally—it’s Stellaris. Where else would you go from the Orion Platform?

I hope it was informative, take care!