Unsinkable Container Ships Part 2 - Building and Managing the Swarm

I wanted to pick up where I left off here, so this time, I’m building a Docker Swarm and adding a little management and monitoring on top.

What Is Docker Swarm?

Just in case some folks aren’t exactly sure what a Swarm is, it’s basically a cluster of individual Docker hosts. Together, they create what’s called a routing mesh, and instead of containers, you’re deploying services. It’s the same concept, really, with the difference that you tell the system how many replicas/instances you need and where to put them. The Swarm takes care of spinning up a new instance in case a host no longer responds.
For a while now, Swarm mode has been built into the basic Docker package; previously, it was an external add-on.
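
If that sounds abstract, here’s a minimal sketch (the service name and image are just examples): you declare how many replicas you want, and the Swarm decides where to run them and keeps them running.

# ask the Swarm for three replicas of an example service, then see where they landed
docker service create --name web --replicas 3 nginx
docker service ps web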

A Swarm doesn’t give you the flexibility, scalability, and orchestration of Kubernetes, but on the other hand, a Swarm doesn’t require 20 years of studying or the blood of your firstborn.

Ideally, or let’s say “in production,” you would run at least three to seven manager nodes and multiple worker nodes per cluster. Currently, I’m running three managers and two workers, and it’s possible to scale up or down at any point. It’s easy: just add nodes and promote or demote them as required.

By the way, a manager is always a worker at the same time. Again, in production, we would probably separate them and make sure not to deploy services on managers. The mechanism for this is setting a node’s availability to “drain.”
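
For reference, that’s a single command per node (the node name below is just an example), and setting the availability back to “active” puts it back into the rotation:

docker node update --availability drain swarm1
docker node update --availability active swarm1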

Let’s deploy it.

On my manager node, swarm1, I run the following command:

docker swarm init --advertise-addr 10.0.10.31

Here’s the result, including the next step:

I’m running the docker swarm join command shown above on all other nodes and promoting swarm2 and 3 to managers:

docker node promote swarm2
docker node promote swarm3

Once done, we can verify the success:

A command like docker system info will show more details, but even the lines above prove the cluster works.
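
If you’re following along at home, the quickest check from the CLI is a node listing on any manager; all five nodes should report as Ready, and the three managers show up as Leader or Reachable:

docker node ls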

That’s it, job’s done, we can go home.
Is it too early for a beer? Well, yes, as I want to do a little more than just that!

Approaching the Orion Platform

While the Orion® Platform has had the ability to monitor containers for about three or four years (the documentation is available here), I’m following a slightly different approach.
I’m adding all five nodes with both the agent and SNMP:

Once they’re alive, we go to Settings -> All Settings, and under Node & Group Management, select Manage Container Services.
In the popup, we provide a creative name, select Docker Swarm as the environment, and select a manager node.

Now we need to create New Credentials—this is a feature we changed in late 2020. But it’s still simple; just add a name and click generate:

Attention: the credential name isn’t important, but copy the token and paste it somewhere safe. Do it immediately, before clicking anywhere else!

Clicking “Create Service” will generate a few commands to run on the manager. Let’s have a look at them.

The first one pulls the deployment file from the local Orion Platform instance. What I find interesting here is that the IP shown is my currently active Orion machine. In one of my tests, I manually changed it to the HA VIP out of curiosity and can confirm it still works. Hello, QA, you can thank me later for testing this scenario.

curl -o cman-swarm.yaml --insecure --pinnedpubkey sha256//5VZMA0yX8N7itY+J7eEn/0xOWheabqfAW0i/qVtqH6o= https://10.0.20.22:38012/orion/container-management/monitoring/deploymentfile?guid=48f90bbb-6644-4ab2-9b58-28b78b60b9ee

The second one finalizes that deployment file by inserting the local hostname:

sed -i "s/%HOSTNAME%/$(hostname)/g" cman-swarm.yaml

It’s a stack deployment, so you can take a look at the YAML file if you’re curious, but it gets the job done, so I’ll move on.
Number three is the important one, as it’s going to ask for the token, which you hopefully saved somewhere:

SCOPE_PASSWORD=$(head -c 32 /dev/urandom | base64 -w 0) && read -sp 'Please enter SolarWinds Token: ' SOLARWINDS_TOKEN && echo -n $SOLARWINDS_TOKEN | base64 -w 0 > ./solarwinds_token_secret && echo -n $SCOPE_PASSWORD >> ./scope_password_secret && echo -n 'BASIC_AUTH_PASSWORD='$SCOPE_PASSWORD >> ./.env_scope && unset SCOPE_PASSWORD && unset SOLARWINDS_TOKEN && echo

In my first attempt, I didn’t save it, so I was stuck here and had to start over. Be better than me!
Now the deployment happens:

sudo docker stack deploy -c cman-swarm.yaml sw

And we do a little cleanup. I love it! IT-OCD is a thing, remember?

rm solarwinds_token_secret scope_password_secret .env_scope

Once everything is done, we need to confirm it:

Now the system waits for the containers to come up and connect.
That might take a while, so maybe now is the right time for a beer? Meh, the clock says no, so let’s check the progress after a refreshing coffee instead:

SolarWinds and Weaveworks are our new containers. But remember, we’re in Swarm mode with its services, so fire off this one instead:
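
For reference, “sw” is the stack name from the deployment earlier, so on any manager the service and task views are just:

docker service ls
docker stack ps sw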

If something goes wrong and you need to re-deploy, follow the steps as outlined here. What the documentation doesn’t tell you is the token is still alive on the Swarm and, depending on the scenario, might need to be removed before you start the next attempt. Here’s how:
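
Assuming the token was stored as a Swarm secret (which would match the solarwinds_token_secret file created earlier), listing and removing it looks roughly like this; the exact name comes from the output of the first command:

docker secret ls
docker secret rm <name-of-the-token-secret>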

Oh, and if you need to recreate the token in the Orion Platform, you can delete the old one under API Credentials.
Anyway, it’s nice and green here:

And in the AppStack environment, too:

There’s traffic on the distributed virtual switch as well:

I inspected the traffic, and it’s mostly Ceph-based. A lot of it, actually, but hey, it’s a distributed file system running over the network, so that’s to be expected.

And, finally, the coolest bit, Orion Maps:

Yes, it’s red, but only because swarm1 is a bit of a drama queen, and Datastore 11 is filling up. I’m aware, thanks.

So far, it’s been a nice trip. Let’s add some excitement!

To correctly address multiple containers, we need to add a reverse proxy.
There are a couple of options available (Nginx, HAProxy, etc.), but Traefik is the perfect match for a container environment.

Unfortunately, depending on the required features, the configuration isn’t that trivial and could replace any medication against low blood pressure.
As this is more a proof of concept than a production environment, I’ll keep things simple.

To work with a reverse proxy, we need a DNS entry for each container upfront, pointing to the host. For this example, I create traefik.sub.domain.tld with the same IP as swarm1.
You’re probably spotting a weakness in my design now, but we’re addressing that later.
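
If you want to double-check the record before moving on, a quick lookup should return swarm1’s address:

dig +short traefik.sub.domain.tld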

On swarm1, I create a folder which will be replicated thanks to Ceph:

mkdir -p /var/data/containers/traefik

I’m running Traefik flat in a single folder. For my current config, it doesn’t matter, but I’m not sure if it’s best practice in more complex deployments. My config is very, very basic and ignores features like automated certificate creation and HTTP-to-HTTPS redirects.

Stuff like that can always be added later. It’s the same with cooking—add the salt last.
But stop—before deploying anything, we need an overlay network:

docker network create --driver=overlay proxy
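
If you want to confirm the network exists (and that it’s swarm-scoped) before referencing it, a quick look doesn’t hurt:

docker network ls --filter driver=overlay
docker network inspect proxy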

Now I’m creating a good old docker-compose.yml file.

version: '3.3'

services:
  traefik:
    image: traefik:latest
    command:
      - "--providers.docker.endpoint=unix:///var/run/docker.sock"
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.swarmModeRefreshSeconds=30"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.network=proxy"
      - "--entrypoints.web.address=:80"
      - "--api.dashboard=true"
      - "--api.insecure=true"
    ports:
      - 80:80
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - proxy
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
 

networks:
  proxy:
    external: true

What trickery is this?

Everything is basic: the file grabs the latest version, which at the time of writing is 2.4.13. I’m using the CLI approach, just putting the arguments straight into the compose file. There are two other flavors (YAML and TOML), but in my situation, this one is the fastest. See the “swarmMode=true” setting, and the automated deployment on each manager.
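
Just to show the salt I’m leaving out for now: in this CLI flavor, the HTTP-to-HTTPS redirect and automatic Let’s Encrypt certificates mentioned earlier would be a handful of extra entries in the command list, roughly along these lines (the entrypoint and resolver names are my own choice, and the email address is obviously a placeholder):

      # sketch only: extra entries for the Traefik command list
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
      - "--certificatesresolvers.le.acme.email=admin@domain.tld"
      - "--certificatesresolvers.le.acme.storage=/letsencrypt/acme.json"
      - "--certificatesresolvers.le.acme.tlschallenge=true"

On top of that, port 443 would need to be published, a volume mounted for acme.json, and each router would need a certresolver label, which is exactly the kind of complexity I’m skipping today.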

Since version 3 of the compose file format, the files are compatible with Swarm, which is quite convenient, as it’s easier to “upgrade” from a single, isolated deployment.

Speaking of deployment:

docker stack deploy traefik -c docker-compose.yml

Verify the success:
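
On any manager, the same can be verified with (traefik_traefik is simply the stack name plus the service name):

docker stack services traefik
docker service ps traefik_traefik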

Looks good—the dashboard should be running already, and it’s accessible at http://fqdn:8080/dashboard/.

Not much to see yet, so let’s deploy another service.

Add Portainer to the mix.

Portainer is a management add-on compatible with both Docker modes and Kubernetes. It’s very popular and probably doesn’t need much of an introduction.

The deployment is pretty easy; the vendor provides an almost turnkey file here.
Still, it requires a little customization. Here’s mine:

version: '3.2'

services:
  agent:
    image: portainer/agent:latest
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainer/portainer-ce:latest
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - portainer_data:/data
    networks:
      - agent_network
      - proxy
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.portainer.rule=Host(`portainer.sub.domain.tld`)"
        - "traefik.http.routers.portainer.entrypoints=web"
        - "traefik.docker.network=proxy"
        - "traefik.http.services.portainer.loadbalancer.server.port=9000"
        - "traefik.http.routers.portainer.service=portainer"

networks:
  agent_network:
    driver: overlay
    attachable: true
  proxy:
    external: true

volumes:
  portainer_data:

 

As opposed to Traefik, this is a real stack, as it deploys two services: the agent and the Portainer server itself.

Look at the deploy mode and placement constraints of both services, and the concept begins to make sense.

There are two networks involved: agent_network is created automatically and is purely for Portainer’s internal communication, and I added the proxy network created earlier.
Now, there are two important bits of information. Here’s one:

"traefik.http.services.portainer.loadbalancer.server.port=9000"

Usually, Traefik tries to automatically discover the ports used by applications, but that doesn’t work in Swarm mode. Instead, we need to set the port via a label on each service we deploy. As the Portainer dashboard listens on 9000, that’s the port to add.

The other element is the network label:

 - "traefik.docker.network=proxy"

Without this, we could get a “bad gateway” error as Traefik might not know what to do with Portainer traffic.
Time to deploy:

docker stack deploy portainer -c portainer-agent-stack.yml

Good:

There’s one thing I’ve noticed in my attempts: while Portainer is snappy on a single host, it’s a little laggy in Swarm mode, at least in my deployment.

It takes a while to retrieve all the information from its agents, so don’t rush into the dashboard right away. Maybe we check Traefik first:

Nice. And a few minutes later, Portainer collected enough information to provide it in its own dashboard:

Don’t get confused if the Swarm shows as down or if nothing happens when you click around. Just give it a few more minutes, really.
Once the data collection finishes, it’s cool being able to scale out containers just using the GUI:
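
For comparison, the same action from the CLI is a one-liner (the service name below is only a placeholder, and global-mode services like the Traefik one above can’t be scaled this way):

docker service scale <stack>_<service>=5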

Alright, this concludes part 2. What’s left to do?

There are still availability concerns. Sure, the Swarm will automatically replicate a service in case something happens, but what if the manager node goes down and we can no longer address it?
There’s a solution for that!

So long, take care.
