Public Holiday Alert: Troubleshooting a Kubernetes ETCD Server

Picture this: you are spending an evening with your friends when you receive an alert from your Kubernetes control plane:

etcdserver: mvcc: database space exceeded

“Great, just what we needed on a public holiday,” you think to yourself. But what does it mean? And why now? Of course, a public holiday is exactly when all the odd alerts pop up! etcdserver: something seems to be wrong with the Kubernetes control plane. mvcc: no idea what that is. Database space exceeded: probably something hit a limit.

Let’s dig into how to troubleshoot a Kubernetes etcd server issue, so you can get back to dinner, your holiday, or simply business as usual.

Start digging

The first thing you do is check the Kubernetes cluster to see if anything seems off. Since everything looks normal, you remove the cluster from your production line to prevent any new deployments, a.k.a. Mattermost workspaces. Then you follow the troubleshooting guide for kOps clusters, but nothing seems out of place. So what’s going on?

It turns out that etcd, the distributed key-value store that Kubernetes uses to store all of its cluster data, has hit a limit and gone into maintenance mode. It’s a self-protection mechanism that kicks in before it completely fills the storage and starts having issues. In this case, the solution was to extend the limit using this guide for etcd maintenance.
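
What this means in practice is that the keyspace has become effectively read-only: reads and deletes still go through, but any write via the API server gets rejected with the same message from the alert. A quick illustrative check (the configmap name here is arbitrary, and the exact error wrapping may vary):

kubectl create configmap space-probe --from-literal=foo=bar
# Error from server: etcdserver: mvcc: database space exceeded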

Getting into the issue

So you get all the etcd pods:

kubectl get pods -n kube-system | grep etcd-manager-main

The response is something like:

etcd-manager-main-XXX         1/1     Running   0             27d
etcd-manager-main-YYY         1/1     Running   0             27d
etcd-manager-main-ZZZ         1/1     Running   0             27d

Open a shell into each etcd main pod (starting with the first one, etcd-manager-main-XXX):

kubectl exec -it etcd-manager-main-XXX -n kube-system -- /bin/sh

After accessing the pod, run the following command:

etcdctl --write-out=table endpoint status

Then the response is:

/bin/sh: 1: etcdctl: not found

Drat! etcdctl is not on the PATH (the binary ships inside the pod under /opt, it is just not wired up). So now what? This guide helped shape a list of commands for setting it up on kOps, which in our case looks like this:

ETCD_VERSION=3.5.4 # Match the etcd version your control plane is running
ETCDDIR=/opt/etcd-v$ETCD_VERSION-linux-amd64 # Replace amd64 with arm64 if you are running an arm control plane
CERTDIR=/rootfs/srv/kubernetes/kube-apiserver/ # Client certificates mounted into the etcd-manager pod
alias etcdctl="ETCDCTL_API=3 $ETCDDIR/etcdctl --cacert=$CERTDIR/etcd-ca.crt --cert=$CERTDIR/etcd-client.crt --key=$CERTDIR/etcd-client.key --endpoints=https://127.0.0.1:4001"
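
As a quick sanity check that the alias resolves and the bundled binary runs, you can ask for its version first (the exact version strings depend on your cluster):

etcdctl version
# etcdctl version: 3.5.4
# API version: 3.5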

Then re-running the command will show the status of this machine’s etcd.

etcdctl --write-out=table endpoint status

The output of the command should return a table like this one:

$ etcdctl --write-out=table endpoint status
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| https://127.0.0.1:4001 | 8706a9w9c6f27c16 |   3.5.3 |  2.0 GB |     false |      false |        42 |  260623554 |          260623554 |   memberID:972964623457455978918 |
|                        |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+

The output shows that the DB size has reached the limit and there’s an alarm for NOSPACE. This means that two things need to be checked:

  • DB SIZE, which in our case was 2.0 GB (etcd’s default quota), and
  • The ERRORS section, to see whether it contains alarm:NOSPACE. If this section is empty, there is no issue with this etcd member.

Redemption aka Remediation

To solve this problem, first a compaction is needed and then a defragmentation. Compaction discards old revisions of the etcd keyspace: etcd keeps a history of every key (that is the MVCC, multi-version concurrency control, from the alert), so compacting up to a given revision throws away the superseded versions. Defragmentation then rewrites the database file so that the space freed by compaction is actually released back to the operating system and the DB size shrinks. Both steps are needed: compaction alone only marks space as reusable and does not reduce the size of the database file on disk.

Compaction

To perform a compaction, you need to find the revision number by running:

etcdctl endpoint status --write-out="json"

The response is:

[{"Endpoint":"https://127.0.0.1:4001","Status":{"header":{"cluster_id":14676166452064326842,"member_id":2443613209544298837,"revision":262894465,"raft_term":88},"version":"3.5.3","dbSize":2147483648,"leader":5478996075897959807,"raftIndex":293758007,"raftTerm":88,"raftAppliedIndex":293758007,"dbSizeInUse":623867176}}]

The revision number needed is "revision":262894465.
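
If you would rather not fish the number out of the JSON by eye, it can be extracted with a bit of shell. This is only a convenience sketch that assumes the pod’s /bin/sh has a POSIX sed (busybox is fine); copying the number by hand works just as well:

REVISION=$(etcdctl endpoint status --write-out="json" | sed 's/.*"revision":\([0-9]*\).*/\1/')
echo $REVISION   # should print something like 262894465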

With that revision number, we can run the command for compaction:

etcdctl --endpoints https://localhost:4001 compact 262894465

Defragmentation

After the compaction, a defragmentation is needed. Note that defragmenting a live member blocks it from serving reads and writes while it rebuilds its state, so run it on one member at a time:

etcdctl --insecure-skip-tls-verify --endpoints https://localhost:4001 defrag

To check whether it had any impact, run the status command again:

etcdctl --write-out=table endpoint status

There should be a noticeable reduction in the DB SIZE compared to the original 2.0 GB.
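
For a more precise check than eyeballing the table, the JSON output also exposes how much of the database file is actually in use; after a successful compaction and defragmentation the physical size should come down close to that value. A small sketch, assuming the pod’s grep supports -o (busybox does):

etcdctl endpoint status --write-out="json" | grep -o '"dbSize[A-Za-z]*":[0-9]*'
# prints the "dbSize" (physical file size) and "dbSizeInUse" (logically used space) fields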

Connect to the rest of the etcd members

The above compaction and defragmentation steps should also be executed on the rest of the etcd members, i.e., etcd-manager-main-YYY and etcd-manager-main-ZZZ.

Open a shell into each one of them with the command used before:

kubectl exec -it etcd-manager-main-YYY -n kube-system -- /bin/sh
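
If you prefer to check all members without opening an interactive shell in each pod, the status query can also be scripted from your workstation. This is a sketch under the same assumptions as the alias above (etcd version, certificate paths, and port may differ in your cluster); the compaction and defragmentation themselves are still run per member as described:

for pod in $(kubectl get pods -n kube-system -o name | grep etcd-manager-main); do
  echo "--- $pod"
  kubectl exec -n kube-system "$pod" -- sh -c \
    'ETCDCTL_API=3 /opt/etcd-v3.5.4-linux-amd64/etcdctl \
      --cacert=/rootfs/srv/kubernetes/kube-apiserver/etcd-ca.crt \
      --cert=/rootfs/srv/kubernetes/kube-apiserver/etcd-client.crt \
      --key=/rootfs/srv/kubernetes/kube-apiserver/etcd-client.key \
      --endpoints=https://127.0.0.1:4001 \
      endpoint status --write-out=table'
done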

Disarming the alarm

As seen in the ERRORS section of the endpoint status table, there was an alarm:NOSPACE, which needs to be disarmed after compacting and defragmenting all the etcd members:

etcdctl --endpoints https://localhost:4001 alarm disarm
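
To confirm the alarm is gone, list the active alarms; the command should return nothing, and the ERRORS column in the endpoint status table should now be empty as well:

etcdctl --endpoints https://localhost:4001 alarm list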

In the meantime…

The fire was put out, and the above solution bought us some time to figure out what is really causing the `etcd` storage exhaustion that made us hit the quota.

There were a few things that needed to be done while hunting for the root cause, just to stay safe (and relaxed) until the public holiday was over:

  • Observability and alerting
  • Increasing the storage quota 

Observability and Alerting

Two new Prometheus alerts were deployed that will ping us if something goes wrong again in the next few hours.

(etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes) > 0.9

This one fires when the total etcd DB size covers more than 90% of the `etcd` quota. The second alert applies the same ratio to a linear prediction 24 hours into the future, based on the previous 6 hours of data, so it catches the trend of the time series before the limit is actually hit:

predict_linear(etcd_mvcc_db_total_size_in_bytes[6h], 24 * 3600)/etcd_server_quota_backend_bytes > 0.95
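
Before wiring expressions like these into alerting rules, it can be handy to test them against Prometheus directly. A rough sketch using the HTTP query API; the Prometheus address here is a placeholder for wherever yours is reachable:

curl -s "http://prometheus.example:9090/api/v1/query" \
  --data-urlencode 'query=(etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes) > 0.9'
# an empty "result" array means the alert condition is not currently firing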

Increasing the storage quota

Phew, alerts done. Now, increase the etcd storage quota a bit so we don’t run into the same issue again. The etcd documentation recommends not raising the storage limit above 8 GB, so bumping the quota from the default 2 GB to 4 GB is enough to stay on the safe side for now. The next question was how to increase the limit: the etcd documentation, of course, describes the environment variable that controls the storage quota. To add that environment variable to the kOps cluster, an edit of the cluster is needed.

kops edit cluster my-cluster-kops.k8s.local 

Here is an example of how we added the environment variable ETCD_QUOTA_BACKEND_BYTES, set to 4 GB (4294967296 bytes).

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-kops.k8s.local
spec:
  <additional kubernetes configurations>
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-east-1a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-east-1b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-east-1c
      name: c
    manager:
      env:
      - name: ETCD_QUOTA_BACKEND_BYTES
        value: "4294967296"
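
Editing the spec alone does not change the running control plane; the change still has to be applied and the control-plane nodes rolled so that etcd-manager picks up the new environment variable. Roughly, with the usual kOps workflow (exact flags can differ between kOps versions):

kops update cluster my-cluster-kops.k8s.local --yes
kops rolling-update cluster my-cluster-kops.k8s.local --yes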

Job done, back to business as usual

Our Kubernetes cluster was put back into production, and what needs to be done now is to find the root cause of the increasing `etcd` DB size. For the time being we are safe, and we are heading back to continue business as usual, or rather, the business of actually relaxing as if it were a public holiday.

Want to learn more about configuring, managing, and monitoring your Kubernetes cluster? Read more about Kubernetes on the Mattermost blog.

Angelos Kyratzakos is a Site Reliability Engineer at Mattermost. Prior to joining Mattermost, he worked as a DevOps engineer at Onfido and a cloud systems developer at Cloudreach. Angelos holds a master's degree in spatial data science and visualization from University College London and a master's degree in engineering from the University of Thessaly.