Mattermost’s Kubernetes Operator spins up and manages Mattermost instances running on Kubernetes based on a ClusterInstallation Custom Resource (CR). Mattermost Operator 1.0 has evolved a lot since its release, along with the ClusterInstallation CR in the v1alpha version. As time went by, as with any software, the Operator gained more features, configuration options, and technical debt. These changes resulted in a more complicated code base and a configuration of Mattermost instances that was confusing for both our users and us.
After using the Operator extensively ourselves and hearing about some hurdles experienced by customers, we decided to migrate CRs to the new v1beta1 version.
This blog post covers some challenges we encountered during the migration to the new Kubernetes Custom Resources and describes how we solved them.
As part of the Mattermost Cloud offering, we have thousands of Mattermost instances running in our Kubernetes clusters. Each of them is represented as a CR managed by the Mattermost Operator.
Mattermost Operator is also one of the recommended ways to install Mattermost for self-managed customers.
Those facts put some constraints on our migration story:
- No downtime of the Mattermost instance during the migration
- As little manual intervention as possible
- Ability to run the migration for several installations at the same time
Changing the Custom Resource
After using Mattermost Operator for quite some time, we’ve decided to simplify some things. However, removing some features and adding new ones was not the most significant change in the migration process.
Our initial CRs representing Mattermost installations were called ClusterInstallation, which made sense from the domain perspective of Mattermost Cloud but didn’t follow best practices of the Kubernetes ecosystem. With the v1beta1 specification, we decided to change the whole CR to simply Mattermost.
Changing Resource Kind
The usual way of migrating CRs to a new version is by using Conversion Webhooks. This approach converts the CR specification “on the fly” through a webhook called by the Kubernetes API server. This mechanism allows users to operate on several versions of the resource while the etcd database stores only one version.
Conversion webhooks are designed to handle the migration from one version to another. However, the Custom Resource name and group can’t be changed during this process, as those are the identifying properties of the Custom Resource Definition (CRD).
Unfortunately, conversion webhooks did not apply to our case: we changed both the resource Group and Kind, effectively creating a whole new CR.
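To illustrate why a conversion webhook could not help here, below is a minimal Python sketch of the identity check involved. The apiVersion strings are assumptions used for illustration (check them against your own manifests), and `can_use_conversion_webhook` is a hypothetical helper, not part of any Kubernetes API.

```python
from typing import NamedTuple


class ResourceIdentity(NamedTuple):
    group: str
    version: str
    kind: str


def parse_identity(api_version: str, kind: str) -> ResourceIdentity:
    """Split an apiVersion such as 'example.com/v1' into group and version."""
    group, _, version = api_version.rpartition("/")
    return ResourceIdentity(group=group, version=version, kind=kind)


def can_use_conversion_webhook(old: ResourceIdentity, new: ResourceIdentity) -> bool:
    """A conversion webhook converts between versions of a single CRD,
    so group and kind must be identical on both sides."""
    return old.group == new.group and old.kind == new.kind


# Assumed values for illustration only.
old_cr = parse_identity("mattermost.com/v1alpha1", "ClusterInstallation")
new_cr = parse_identity("installation.mattermost.com/v1beta1", "Mattermost")

print(can_use_conversion_webhook(old_cr, new_cr))  # False: group and kind differ
```

Only the version component may differ for a webhook-based conversion; once the group or kind changes, the API server treats the two as unrelated resources.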
Migrating to the New Resource
When a ClusterInstallation CR is created, the Mattermost Operator spins up a new instance of the application. This instance includes Kubernetes Deployments that run the actual Mattermost application. When the CR is deleted, all the resources are deleted with it, thanks to Owner References.
This process prevents us from simply running a script that would delete the ClusterInstallation and create a Mattermost in its place, as it would cause downtime of the Mattermost application (unless we chose to orphan all resources).
To “exchange” the existing resources between different CRs and support both resources for some time, we decided to run two Controllers as part of the Mattermost Operator and make them perform the desired migration. As we wanted to retain some control over which Mattermost instances are migrated, we decided to introduce a new field to the ClusterInstallation spec that signals the Operator to start the migration.
migrate: "true" # New field added to ClusterInstallation. Setting it to 'true' instructs the controller to start the migration.
With this field in place, the migration proceeds as follows:
- When the controller for ClusterInstallation sees that spec.migrate is set to true, it stops regular reconciliation of the CR and starts the migration by converting the old resource to the new one.
- Immediately after the Mattermost resource is created, the controller for Mattermost sees it and starts adjusting existing resources, such as Services and Deployments, to the new CR, making small changes and overriding owner references.
- When we’re sure that the controller for Mattermost has successfully finished its work and the Mattermost pods are ready to serve traffic, we delete the old ClusterInstallation, and voila!
- Because in the second phase the controller only checks the health of the new pods, it can check once, requeue the resource for later reconciliation, and start migrating the next one in the meantime. This lets us migrate multiple resources at a time.
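The steps above can be sketched with a small in-memory model. This is not the Operator's actual code: the `store` layout, the `installation` and `ready` keys, and the returned step names are all hypothetical, and owner references are reduced to a kind and a name.

```python
def owner_ref(cr: dict) -> dict:
    """Build a simplified owner reference pointing at a custom resource."""
    return {"kind": cr["kind"], "name": cr["metadata"]["name"]}


def reconcile_cluster_installation(old_cr: dict, store: dict) -> str:
    """Old controller: once spec.migrate is true, stop regular reconciliation,
    create the new resource, and delete the old one when the new one is stable."""
    if not old_cr["spec"].get("migrate"):
        return "reconciled"  # regular reconciliation would happen here
    name = old_cr["metadata"]["name"]
    key = ("Mattermost", name)
    if key not in store["crs"]:
        spec = {k: v for k, v in old_cr["spec"].items() if k != "migrate"}
        store["crs"][key] = {
            "kind": "Mattermost",
            "metadata": {"name": name},
            "spec": spec,
            "status": {"state": "reconciling"},
        }
        return "migrating"
    if store["crs"][key]["status"]["state"] == "stable":
        # Owner references already point at the new CR, so deleting the old
        # one no longer cascades to the Deployment or Service.
        del store["crs"][("ClusterInstallation", name)]
        return "migrated"
    return "requeue"  # check again later, work on other resources meanwhile


def reconcile_mattermost(new_cr: dict, store: dict) -> str:
    """New controller: adopt existing resources, then check health once."""
    name = new_cr["metadata"]["name"]
    owned = [r for r in store["resources"] if r["installation"] == name]
    for res in owned:
        res["ownerReferences"] = [owner_ref(new_cr)]
    if all(r["ready"] for r in owned):
        new_cr["status"]["state"] = "stable"
        return "stable"
    return "requeue"


old = {
    "kind": "ClusterInstallation",
    "metadata": {"name": "mm-abcd"},
    "spec": {"replicas": 2, "migrate": True},
}
store = {
    "crs": {("ClusterInstallation", "mm-abcd"): old},
    "resources": [
        {"kind": "Deployment", "installation": "mm-abcd", "ready": True,
         "ownerReferences": [owner_ref(old)]},
        {"kind": "Service", "installation": "mm-abcd", "ready": True,
         "ownerReferences": [owner_ref(old)]},
    ],
}

steps = [reconcile_cluster_installation(old, store)]
steps.append(reconcile_mattermost(store["crs"][("Mattermost", "mm-abcd")], store))
steps.append(reconcile_cluster_installation(old, store))
print(steps)  # ['migrating', 'stable', 'migrated']
```

Returning "requeue" instead of blocking is what allows one controller to interleave many migrations: each reconcile call does a single health check and moves on.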
The migration process has another advantage: if anything goes wrong, we can easily revert it by setting spec.migrate back to false and removing the newly created Mattermost CR. The controller for ClusterInstallation will then claim the resources back and continue to monitor them.
Dealing with Immutable Fields
“Exchanging” the ownership of Kubernetes resources between two Custom Resources worked fine in most cases, but not in all of them.
Old Mattermost Deployments created by the Operator used the following selector:
When we began migrating to version v1beta, we wanted to change all references to the old version, including those in the Deployment selector.
Kubernetes Deployments have some awesome features, such as rolling updates, that we use all the time when we change environment variables, versions, or other configurations of our installations. As a result, we can update Pods sequentially, keeping some running while others are being updated.
However, the Deployment’s spec.selector field is immutable, so we couldn’t just update the Deployment. We also can’t have two Deployments with the same name in the same namespace, and we didn’t want to change the names of resources created by the Operator.
Simply running the client-go equivalent of kubectl delete deployment… and creating it from scratch wasn’t an option either, as it would cause the deletion of all Pods running Mattermost. This action would result in a brief downtime of the Mattermost instance managed by the Mattermost Operator, violating our “no downtime” constraint.
Recreating Deployments without Downtime
However, we can still delete the Deployment without deleting the Pods by using a proper deletion propagation policy and orphaning them.
The resources we orphan are not directly Pods but rather Replica Sets that manage Pods on behalf of the Deployment.
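The difference between a cascading delete and an orphaning delete can be modeled in a few lines of Python. This is a simplified sketch of garbage-collector behavior, not real cluster code: `delete_object` and the `owners` lists are hypothetical stand-ins for the API call and for owner references. On a real cluster the same choice is made with the `propagationPolicy` delete option (`Orphan`, `Background`, or `Foreground`), or with `kubectl delete --cascade=orphan`.

```python
def delete_object(objects: dict, uid: str, propagation_policy: str) -> None:
    """Delete one object and treat its dependents the way the garbage
    collector would in this simplified model.

    "Orphan" removes the owner reference from dependents and keeps them.
    Any other policy deletes dependents that are left without an owner.
    """
    objects.pop(uid)
    dependents = [u for u, obj in objects.items() if uid in obj["owners"]]
    for dep in dependents:
        if dep not in objects:
            continue  # already removed by an earlier cascade
        objects[dep]["owners"].remove(uid)
        if propagation_policy != "Orphan" and not objects[dep]["owners"]:
            delete_object(objects, dep, propagation_policy)


def build_cluster() -> dict:
    """Deployment -> Replica Set -> Pods, linked through owner references."""
    return {
        "deploy/mm-abcd": {"owners": []},
        "rs/mm-abcd-9df98d568": {"owners": ["deploy/mm-abcd"]},
        "pod/mm-abcd-9df98d568-q8z6g": {"owners": ["rs/mm-abcd-9df98d568"]},
        "pod/mm-abcd-9df98d568-vdxjx": {"owners": ["rs/mm-abcd-9df98d568"]},
    }


cascaded = build_cluster()
delete_object(cascaded, "deploy/mm-abcd", "Background")

orphaned = build_cluster()
delete_object(orphaned, "deploy/mm-abcd", "Orphan")

print(sorted(cascaded))  # []
print(sorted(orphaned))  # the Replica Set and both Pods survive
```

In the orphaned case the Replica Set keeps managing its Pods on its own, which is why traffic is unaffected while the Deployment object is gone.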
Although the Replica Set’s (RS) spec.selector field is also immutable, and we’ll have to delete the Replica Set eventually, Replica Set names, unlike Deployment names, are not fixed. They contain a hash suffix, derived from the Pod template, attached after the name of the owning Deployment, for example, my-mattermost-8599f77fcb (the suffix is also part of the names of the Pods managed by the RS).
kubectl get deployments.apps
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
mm-abcd   2/2     2            2           22h
kubectl get replicasets.apps
NAME                 DESIRED   CURRENT   READY   AGE
mm-abcd-6fccc4f76f   0         0         0       22h   <- Old Replica Set is scaled to 0
mm-abcd-9df98d568    2         2         2       16m
kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
mm-abcd-9df98d568-q8z6g   1/1     Running   0          16m   <- Pods share part of the suffix with the RS they belong to
mm-abcd-9df98d568-vdxjx   1/1     Running   0          16m
We recreate the Deployment, and for a while Replica Sets for both the old Deployment and the new one exist side by side. As a result, our installations experience no downtime whatsoever.
There are some minor drawbacks to this approach. After the migration, we lose previous Deployment revisions and, for a short period of time, twice as many Pods run for the Mattermost installation being migrated. However, none of those drawbacks was a deal-breaker for us.
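The recreate sequence, and the temporary doubling of Pods it causes, can be traced with a small Python model. Everything here is a hypothetical simplification: the `cluster` dictionary, the label values, and the `new_hash` suffix are invented for illustration, and the new Replica Set is treated as ready immediately.

```python
def running_pods(cluster: dict) -> int:
    """Count ready Pods across all Replica Sets in the model."""
    return sum(rs["ready"] for rs in cluster["replicasets"].values())


def recreate_deployment(cluster: dict, name: str, selector: dict, new_hash: str) -> list:
    """Replace a Deployment whose selector must change.

    Returns the number of running Pods observed after each step.
    """
    counts = [running_pods(cluster)]

    # Step 1: delete the Deployment with orphan propagation. Its Replica Sets
    # lose their owner but keep running their Pods.
    old = cluster["deployments"].pop(name)
    orphans = [n for n, rs in cluster["replicasets"].items() if rs["owner"] == name]
    for rs_name in orphans:
        cluster["replicasets"][rs_name]["owner"] = None
    counts.append(running_pods(cluster))

    # Step 2: create a Deployment with the same name and the new selector.
    # Its Replica Set gets a different suffix, so the names do not collide.
    cluster["deployments"][name] = {"replicas": old["replicas"], "selector": selector}
    cluster["replicasets"][f"{name}-{new_hash}"] = {"owner": name, "ready": old["replicas"]}
    counts.append(running_pods(cluster))

    # Step 3: once the new Pods are ready, remove the orphaned Replica Sets.
    for rs_name in orphans:
        del cluster["replicasets"][rs_name]
    counts.append(running_pods(cluster))
    return counts


cluster = {
    "deployments": {"mm-abcd": {"replicas": 2, "selector": {"app": "old-label"}}},
    "replicasets": {
        "mm-abcd-6fccc4f76f": {"owner": "mm-abcd", "ready": 0},
        "mm-abcd-9df98d568": {"owner": "mm-abcd", "ready": 2},
    },
}

counts = recreate_deployment(cluster, "mm-abcd", {"app": "new-label"}, "7c9d8b5f64")
print(counts)  # [2, 2, 4, 2]
```

The count never drops to zero, which matches the "no downtime" constraint, and it peaks at twice the replica count while old and new Replica Sets overlap. The old revision history is gone afterward because the orphaned Replica Sets are deleted.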
Managing the Migration
We already have all the building blocks for managing those Custom Resources, for example for version updates. All we had to do was add some code that performs the update on the ClusterInstallation resource (setting spec.migrate to true) and waits for the new Mattermost resource to reach the stable state while Mattermost Operators perform their magic.
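A minimal sketch of such a "patch, then wait" loop is shown below. The client interface is entirely hypothetical (`patch_cluster_installation` and `get_mattermost_state` are invented names, and `FakeClient` merely simulates a resource that becomes stable after a few polls); real tooling would talk to the Kubernetes API instead.

```python
import time


class FakeClient:
    """Stand-in for a Kubernetes client; reports 'stable' after a few polls."""

    def __init__(self, polls_until_stable: int):
        self.patches = []
        self._remaining = polls_until_stable

    def patch_cluster_installation(self, name: str, patch: dict) -> None:
        self.patches.append((name, patch))

    def get_mattermost_state(self, name: str) -> str:
        self._remaining -= 1
        return "stable" if self._remaining <= 0 else "reconciling"


def migrate_installation(client, name: str, timeout_s: float = 60.0,
                         interval_s: float = 0.01) -> bool:
    """Set spec.migrate to true, then poll the new resource until it is stable.

    Returns True on success and False if the timeout expires first.
    """
    client.patch_cluster_installation(name, {"spec": {"migrate": True}})
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.get_mattermost_state(name) == "stable":
            return True
        time.sleep(interval_s)
    return False


client = FakeClient(polls_until_stable=3)
print(migrate_installation(client, "mm-abcd"))  # True
```

Returning False on timeout leaves the decision to the caller, which could then revert by setting spec.migrate back to false.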
For most customers using the Mattermost Operator directly, the migration is as easy as running one command.
The migration to Kubernetes Custom Resources went reasonably smoothly. We had a few minor bumps along the way, but nothing caused downtime for customers or headaches for us.
Mattermost Custom Resources provide us with broader functionality and better management of Mattermost instances, with a much clearer definition at the same time. The long-awaited change removes some technical debt gathered around the old CR. It makes the Mattermost configuration easier for users to define and brings us closer to following Kubernetes Custom Resources best practices.
While we will not add new features to the ClusterInstallation CR, the Operator will support it until version 2.0. If you manage your Mattermost with the Mattermost Operator and still use ClusterInstallation, check out this guide on migrating to Kubernetes Custom Resources.