Automate EKS Node Rotation for AMI Releases

In the daily life of a Site Reliability Engineer, a main goal is to reduce the work we call toil. But what is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly as a service grows. This blog post describes our journey to automate our node rotation process for new AMI releases and the open source tools we built along the way.

Overcoming Kubernetes and EKS Limitations

Apart from eliminating toil, we had specific problems that required building our own tooling. The first was the limited way in which Kubernetes Operations (kops) rolls out node changes: sequentially, one node at a time, and within a specific window of time. We needed a much more flexible way to rotate nodes, one that avoids straining our workloads or hitting limits on our infrastructure. Each environment might need different handling to make the process as efficient and reliable as possible.

In addition, we could not initially adopt the managed nodes solution that AWS EKS offers because of some limitations, so we needed our own custom rotation mechanism. The ability to rotate nodes is essential, especially for releases that contain security patches, which should be in place as soon as possible. Once the AWS EKS managed nodes solution improved, we adopted it, which simplified our workflow.
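
To illustrate the difference between rotating nodes strictly one by one and a more configurable approach, here is a minimal Python sketch that splits a cluster's nodes into batches of a configurable size. It is not the rotator library's code or API: the names RotationConfig, batch_size, and plan_batches are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class RotationConfig:
    """Hypothetical per-environment setting, not a real rotator option."""
    batch_size: int = 1  # 1 reproduces strict one-by-one rotation


def plan_batches(nodes: Sequence[str], config: RotationConfig) -> Iterator[List[str]]:
    """Yield nodes in their original order, batch_size at a time."""
    if config.batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), config.batch_size):
        yield list(nodes[start:start + config.batch_size])
```

With a batch size of 1 the plan degenerates to sequential rotation. A larger value lets an environment with spare capacity replace several nodes at once, while a sensitive environment can keep the batch small.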

Our Open Source AMI Node Rotation Solution 

To solve the problems mentioned above, we combined our existing tooling, such as our cloud provisioner and our GitLab Pipelines, with the new tools we implemented. Below are the steps we took to achieve this.

  • Implemented a new library (rotator) that improved node rotation. It gave us a much more flexible, configurable, and reliable way to rotate nodes.
  • Added rotator as a module to the cloud provisioner that manages our workload (kops) clusters.
  • Created a CLI tool for rotator (rotatorctl) so that we can rotate EKS clusters from our local machines and from our GitLab pipelines.
  • Created GitLab pipelines that release new AMIs for kops clusters using the cloud provisioner with the rotator module and its improved capabilities.
  • Created GitLab pipelines that release new AMIs on EKS clusters using rotatorctl.
  • After we adopted AWS EKS managed nodes, we simplified our GitLab pipelines by dropping our custom rotation mechanism, as managed nodes natively offer their own.

The resulting flow of node rotation for our kops clusters is as follows:

  1. When there is a change, a new AMI is built
  2. Manually trigger the deploy new image pipeline
  3. Block new installations from being placed in the cluster and trigger an upgrade event with the provisioner
  4. Wait until all nodes are replaced with nodes running the new AMI and the cluster becomes stable
  5. Re-enable installations and continue to the next cluster
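
Steps 3 to 5 of the kops flow can be sketched as a small orchestration function. This is an illustrative Python sketch under our own assumptions, not the provisioner's or the pipeline's actual code: the four callables stand in for whatever the provisioner exposes, and the function name, polling interval, and timeout are hypothetical.

```python
import time
from typing import Callable


def rotate_cluster(
    disable_installations: Callable[[], None],
    trigger_upgrade: Callable[[], None],
    is_stable: Callable[[], bool],
    enable_installations: Callable[[], None],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 8 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run one cluster through the disable, upgrade, wait, enable sequence."""
    disable_installations()   # step 3: keep new installations out
    trigger_upgrade()         # step 3: start replacing nodes
    waited = 0.0
    while not is_stable():    # step 4: poll until the cluster settles
        if waited >= timeout_seconds:
            # Installations stay disabled so an operator can investigate.
            raise TimeoutError("cluster did not become stable in time")
        sleep(poll_seconds)
        waited += poll_seconds
    enable_installations()    # step 5: reopen the cluster
```

Passing the steps in as callables keeps the sequence testable without a real cluster. Leaving installations disabled on timeout is a choice made for this sketch; a real pipeline might prefer a different failure policy.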

And the flow of node rotation for our AWS EKS clusters is:

  1. When there is a change, a new AMI is built
  2. Manually trigger the deploy new image pipeline
  3. Terraform automatically applies the new AMI to the launch configuration of the EKS cluster's Auto Scaling group (ASG)
  4. AWS EKS managed nodes pick up the new AMI and roll out new nodes. In our initial solution, this step was handled by our open source tool rotatorctl
  5. Fresh instances running the new AMI are in place
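
The end state in step 5 amounts to a simple check: no node is still running an old AMI. The Python sketch below shows that check on plain data. It is only an illustration: the Node type, the function names, and the AMI IDs used with it are hypothetical, and it does not call AWS or reflect how rotatorctl or EKS managed nodes are implemented.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Node:
    name: str
    ami_id: str  # AMI the instance was launched from


def nodes_needing_rotation(nodes: Iterable[Node], target_ami: str) -> List[Node]:
    """Return the nodes whose AMI differs from the newly released one."""
    return [node for node in nodes if node.ami_id != target_ami]


def rollout_complete(nodes: Iterable[Node], target_ami: str) -> bool:
    """True once every node runs the target AMI."""
    return not nodes_needing_rotation(nodes, target_ami)
```

For example, given nodes on "ami-old" and "ami-new" with "ami-new" as the target, nodes_needing_rotation returns only the nodes still on "ami-old", and rollout_complete stays false until that list is empty.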

How Node Rotation Reduces Toil

Automating and improving these processes saved the SRE team a lot of valuable time. Before these new processes were in place, some cases required two or three people to participate in and closely monitor these tasks. The process was especially time-consuming for kops clusters, where rotating the nodes took 2 to 8 hours depending on the cluster size and the environment.

This automation gave our team the ability to roll out AMI changes more regularly, which has resulted in a more secure and better-performing underlying infrastructure. We can focus on what matters: delivering a reliable and more secure cloud offering to our customers. These tools are useful not only for our team but also for the wider community, as they solve a problem that many Operations and SRE teams face. Giving tooling for managing infrastructure and workloads back to the open source community is a core principle of our team.

Stavros Foteinopoulos is a Site Reliability Engineer at Mattermost. He joined the team in November 2020. He holds an MSc in Information Systems Security and an MEng in EECE from Democritus University of Thrace.