How to Make Your Incident Response Plan with Mattermost
For teams who deploy software to users around the world, every second counts when responding to outages and other incidents. It’s important that you have tools in your arsenal that are up to the challenge. Service monitoring, alerting, collaboration, and visibility are all essential components of a well-implemented incident response plan.
In this article, we will examine a turnkey incident response plan in Mattermost that leverages the power of real-time communication and automation to help your team respond more quickly and efficiently to service outages. Once you’re up and running, you’ll unlock developer-centric features to put your incident response workflow in the fast lane. The best part? Everything in this guide is open source, and the demo is designed so that you can easily integrate your preferred toolchain into the workflow.
- Demo Overview
- How to Launch the Demo
- Set Up Your Incident Response Plan in Mattermost
- Configure Healthchecks to Report Outages
- Configure the API Service for Health Reporting
- Your Incident Response Plan in Action: Simulate an Incident
- Next Steps
Demo Overview
To get started, you will need to install Docker for your operating system. Ensure you have Docker Compose included or installed separately. You’ll use the mattermost-incident-resolution branch of the osc-workshop-2022 repository.
The demo consists of four Docker containers:
- mattermost
- postgres_mattermost
- healthchecks
- deckofcards
You can learn more about how they are configured in the repository’s docker-compose.yml. Let’s take a look at each container and learn what role it plays in the demo.
mattermost
The developer collaboration platform that will serve as the hub in the demo. It’s accessible via https://localhost:8065/
postgres_mattermost
The database that backs the Mattermost container above. You can ignore it during the demo because you never interact with it directly.
healthchecks
The service monitoring application that will track the uptime of deckofcards and report any outages to Mattermost. It’s accessible via https://localhost:8000/
deckofcards
The example API forked from crobertsbmw/deckofcards on GitHub. This API simulates decks of cards and could be used to build games or other features. It also has an OpenAPI specification. In a production environment, other applications might rely on this kind of API so you want to strive for 100% uptime. It’s accessible via https://localhost:8005/
The deckofcards and healthchecks containers are the most interchangeable parts of the demo. Feel free to replace them with something else or complement them with additional services once you understand the concepts.
How to Launch the Demo
First, enter the following commands in your local terminal:
git clone -b mattermost-incident-resolution https://github.com/azigler/osc-workshop-2022
cd osc-workshop-2022
sh init.sh
sh start.sh
You will need to wait a minute or two for everything to come fully online, and you may be asked for your device’s admin password so that Docker can run with superuser privileges. If Mattermost fails to come online, you can troubleshoot by viewing the logs with the docker logs -f mattermost command.
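If you would rather script the wait than watch the logs, a minimal Python sketch along these lines can poll Mattermost until it answers. It assumes the server's /api/v4/system/ping endpoint and the Mattermost URL used in this guide; the function names, attempt count, and delay are illustrative and not part of the demo repository.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(url, attempts=30, delay=5.0, probe=is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False after `attempts` failed tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Assumption: the Mattermost URL from this guide; adjust the scheme or port if yours differs.
    ready = wait_until_ready("https://localhost:8065/api/v4/system/ping")
    print("Mattermost is ready" if ready else "Mattermost did not come online; check the logs")
```

The probe and sleep functions are passed in as parameters so the polling logic can be exercised without a running server.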
Next, follow the directions for each individual Docker container:
Set Up Your Incident Response Plan in Mattermost
You can access your Mattermost instance by visiting https://localhost:8065 in your web browser.
Follow the instructions to create your admin user, name your organization, and confirm that your server’s URL is https://localhost:8065. You can configure the remaining steps however you see fit.
This demo comes with a pre-configured Mattermost instance, where “enable integrations to override usernames” and “enable integrations to override profile picture icons” are both set to true. This allows integrations like Healthchecks to post in Mattermost under their own names. The instance also allows insecure (http) connections from the hostname host.docker.internal for the purposes of this demo. If you want to use Mattermost in a secure environment, you should disable this setting. This demo is also pre-configured so that “enable personal access tokens” is true.
Create a Playbook
Navigate to Playbooks in the top left of Mattermost and click the Create playbook button on the Incident Resolution template. On the next page, you can customize the name and description of the playbook along with its checklists and actions (which we’ll look at later). For the sake of demonstration, you can simply use the default template as-is, but the screenshots below will feature a customized playbook.
Take note of the playbook’s ID in the URL (localhost:8065/playbooks/playbooks/<ID>/preview). You will need this in a moment to set up Healthchecks. For example, my playbook’s ID is ay6moato6b8wzfu9dhftusuabw.
Make Investigations Easier with Slash Commands
Next, let’s enable the team to quickly query the API service without leaving Mattermost. You’ll do this by creating a slash command that pings the deckofcards API on demand. Click the button in the top left corner again, click Channels, and then re-open the top left corner menu. This time, select Integrations.
Click Slash Commands, then click the Add Slash Command button. Give this command a title and description, set the Command Trigger Word to ping, and set the Request URL to https://host.docker.internal:8005/ping (notice we’re not using localhost) before saving and creating your command. The remaining fields are optional.
You might be wondering why we changed localhost to host.docker.internal. This is a special hostname provided by Docker that allows a container to access ports on its host machine. Inside the Mattermost container, localhost refers to that container itself, so it has no awareness of deckofcards. You could set all of the containers up on a Docker network to resolve this, but using host.docker.internal is a simpler method for the sake of this demo.
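Because this hostname swap comes up several times in the guide, a small helper can make the rule explicit. The following is only a sketch: the function name to_container_url is hypothetical, and it drops any user:password part of the URL, which none of the demo's URLs use.

```python
from urllib.parse import urlsplit, urlunsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def to_container_url(url, docker_host="host.docker.internal"):
    """Rewrite a localhost URL so it can be reached from inside a container."""
    parts = urlsplit(url)
    if parts.hostname not in LOCAL_HOSTS:
        return url  # Already points somewhere other than the local machine.
    # Keep the port (if any) and swap only the hostname.
    netloc = docker_host if parts.port is None else f"{docker_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_container_url("https://localhost:8005/ping"))
```

Any URL you copy from your browser and paste into a container's configuration can go through the same rewrite.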
Now go back to a channel and try out your new slash command by typing /ping and submitting the command.
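To see what happens on the other end of a slash command, here is a minimal stand-in for a /ping endpoint written with Python's standard library. It is not the deckofcards implementation. It only illustrates that Mattermost calls the Request URL and renders a JSON reply containing response_type and text. The handler name, the reply text, and the choice of port are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_ping_response(service_name="deckofcards"):
    """Build the JSON body that Mattermost renders as the slash command's reply."""
    return {
        # "in_channel" is visible to everyone; "ephemeral" shows the reply only to the caller.
        "response_type": "in_channel",
        "text": f"{service_name} is up",
    }


class PingHandler(BaseHTTPRequestHandler):
    def _reply(self):
        # Consume any form data Mattermost sent with the request before answering.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path.split("?")[0] != "/ping":
            self.send_error(404)
            return
        body = json.dumps(build_ping_response()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    # A slash command can be configured to call its Request URL with POST or GET.
    do_GET = _reply
    do_POST = _reply


if __name__ == "__main__":
    # Hypothetical stand-in on port 8005; stop the demo's deckofcards container first.
    HTTPServer(("0.0.0.0", 8005), PingHandler).serve_forever()
```

When the service behind the Request URL is offline, Mattermost has nothing to render, which is what makes the command useful as a quick health probe.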
You’re now done configuring Mattermost and can move on to Healthchecks!
Configure Healthchecks to Report Outages
You can access this container at: https://localhost:8000
Log in with the credentials included in the docker-compose.yml:
- Username: [email protected]
- Password: healthchecks
Set Up an Incoming Webhook in Mattermost
In Healthchecks, go to the Integrations tab. You can click the trash button to delete the pre-existing email notification or leave it in place, since it’s non-functional for the purpose of this demo. Under Add More, click on the Mattermost integration.
Follow the instructions presented to create the webhook over on your Mattermost instance. You can name it Healthchecks and assign it to the Town Square channel. Copy the provided webhook URL once you create the webhook.
Now, back in the Healthchecks window, paste the URL into the field at the bottom of the page, and change localhost to host.docker.internal (e.g., https://localhost:8065/hooks/zr8ubryxnbbk3q59ktrnombkhw becomes https://host.docker.internal:8065/hooks/zr8ubryxnbbk3q59ktrnombkhw).
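If you want to confirm the incoming webhook works before Healthchecks uses it, a short script along these lines can send a test message. It is a sketch that assumes the standard incoming webhook payload fields text and username; the script and function names are made up for illustration. Run it from your host machine with the localhost form of the webhook URL.

```python
import json
import sys
import urllib.request


def build_webhook_request(webhook_url, text, username="Healthchecks"):
    """Build a POST request carrying a Mattermost incoming webhook payload."""
    # The username override only takes effect because this demo enables it.
    payload = {"text": text, "username": username}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Usage: python test_webhook.py WEBHOOK_URL
    request = build_webhook_request(sys.argv[1], "Test message from the webhook sketch")
    with urllib.request.urlopen(request, timeout=10) as response:
        print("Mattermost answered with HTTP", response.status)
```

A successful call should make the test message appear in the channel you assigned to the webhook.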
Next, you will create a custom webhook integration that will launch the playbook you created in Mattermost. On the Integrations tab, click the appropriate button to add a Webhook. You will be presented with a page with two forms: one webhook for when the check goes down, and one webhook for when it comes back up.
Automate Playbooks With Custom Webhooks
You can use Healthchecks to send any kind of webhook to the Mattermost server, in addition to the standard notification message that you just configured. Let’s use this opportunity to launch the playbook in response to an API outage. To do this, compose just the one POST request for when the check goes down (the form on the left). Change the dropdown to POST and set the URL to be the API endpoint for launching a playbook, again changing localhost to host.docker.internal. This means the URL is https://host.docker.internal:8065/plugins/playbooks/api/v0/runs.
In the Request Body, provide the required properties to launch the playbook. According to the API, you need to provide a name for the playbook run, an owner_user_id, a team_id, and a playbook_id. You already have the playbook_id: it’s the ID you noted from the playbook’s URL in Mattermost (mine is ay6moato6b8wzfu9dhftusuabw).
Create a Personal Access Token
You’ll need to find your own user ID and set it as the owner_user_id, and then you’ll do the same thing for your own team_id. First, though, let’s create a user token so Healthchecks has permission to access the local Mattermost API. In Mattermost, click your avatar in the top right corner and select Profile.
In the modal window, select Security on the left-hand side and click the Create Token button.
Give the token a name (e.g., healthchecks) and click Save. A pop-up will notify you that you’re creating a token with advanced permissions, since we’re using the superuser account. This is fine for the sake of our demo, but keep in mind that any token has the same permissions as the user who created it. Go ahead and select Yes, Create to complete the process.
Now copy the provided Access Token before closing the modal because this will be the only time you will see the newly-created token. You’ll need this token to authenticate Healthchecks in the Mattermost API so it can launch the playbook. My token is qjnraai84b8ozcmrsjj3e3b55c.
Fetch User and Team IDs
Now you need to retrieve your user ID for the Playbooks API call. To do this, select the top-left menu button and click System Console. Search for “user” and select the Users page under User Management. You’ll see yourself listed here as a user, along with your User ID. For example, mine is aw7fubh46fyk8k6dkoju47s8jy.
Next, search for “team” and select the Teams page under User Management. Select your team (mine is named OSC) and you’ll find your Team ID in the URL like it was for the playbook (localhost:8065/admin_console/user_management/teams/<ID>). Mine is 7ndwqbdzmtb9pf9qnps98rstcy.
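As an alternative to the System Console, the same IDs can be read from the Mattermost REST API with a personal access token. The sketch below assumes the v4 endpoints users/me and users/me/teams; the helper names and the command-line arguments are illustrative only.

```python
import json
import sys
import urllib.request


def api_request(base_url, path, token):
    """Build an authenticated GET request for a Mattermost API v4 path."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v4/{path.lstrip('/')}",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_json(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Usage: python fetch_ids.py BASE_URL ACCESS_TOKEN
    base_url, token = sys.argv[1], sys.argv[2]
    me = fetch_json(api_request(base_url, "users/me", token))
    print("user id:", me["id"])
    for team in fetch_json(api_request(base_url, "users/me/teams", token)):
        print("team:", team["name"], team["id"])
```

Run it from your host machine with the Mattermost URL from this guide as BASE_URL.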
You now have all of the information to finalize the playbook-launching integration in Healthchecks. Let’s finish that now!
Enter the following payload in the Request Body field on Healthchecks for the integration, using the information you just gathered. For example, here’s mine:
{"name": "deckofcards is down", "owner_user_id": "aw7fubh46fyk8k6dkoju47s8jy", "team_id": "7ndwqbdzmtb9pf9qnps98rstcy", "playbook_id": "ay6moato6b8wzfu9dhftusuabw"}
In the Request Headers field, provide the access token you created. Here’s mine:
Authorization: Bearer qjnraai84b8ozcmrsjj3e3b55c
You can now click Save Integration, since we’re not going to add a webhook for when the API comes back up (the form on the right). Healthchecks will already notify us of that event via the other Mattermost integration.
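To try the same request by hand before relying on Healthchecks, you could send it from your host machine with a short script. This sketch reuses the endpoint, body properties, and Authorization header described above; the script name, function name, and argument order are assumptions.

```python
import json
import sys
import urllib.request

RUNS_PATH = "/plugins/playbooks/api/v0/runs"


def build_run_request(base_url, token, name, owner_user_id, team_id, playbook_id):
    """Build the POST request that starts a playbook run."""
    payload = {
        "name": name,
        "owner_user_id": owner_user_id,
        "team_id": team_id,
        "playbook_id": playbook_id,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + RUNS_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Usage: python launch_run.py BASE_URL TOKEN OWNER_USER_ID TEAM_ID PLAYBOOK_ID
    base_url, token, owner, team, playbook = sys.argv[1:6]
    request = build_run_request(base_url, token, "deckofcards is down", owner, team, playbook)
    with urllib.request.urlopen(request, timeout=10) as response:
        print("Mattermost answered with HTTP", response.status)
```

If the call succeeds, a new run of the playbook should appear in Mattermost, just as it will when Healthchecks detects an outage.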
With the new integrations created, navigate to the Checks tab and click the ellipsis button at the end of the My First Check row. Under Schedule, set both the Period and Grace Time to 1 minute.
Finally, under the How To Ping section at the top, copy the provided URL. It will look something like this: https://localhost:8000/ping/b0c822c9-b757-4134-8ff1-0f4683543064.
Configure the API Service for Health Reporting
You can access this container at https://localhost:8005
Now you want to configure deckofcards to ping Healthchecks every minute so you can track uptime. To do this, open the ./deckofcards/spades/settings.py file in an editor and set HEALTHCHECKS_PING_URL on line 18 to the URL you copied above from the How To Ping section. Replace localhost with host.docker.internal again (e.g., https://host.docker.internal:8000/ping/b0c822c9-b757-4134-8ff1-0f4683543064) for the same reason as before.
Finally, rebuild deckofcards to use the new URL:
docker-compose up -d --build
After a moment, Healthchecks will start receiving pings every minute from deckofcards. This will allow Healthchecks to monitor its uptime!
You’ve now successfully set up everything to demonstrate incident resolution in Mattermost! Let’s take advantage of the demo to simulate an outage and learn how these tools work.
Your Incident Response Plan in Action: Simulate an Incident
You can easily simulate an outage of your deckofcards API with the following command:
docker stop deckofcards
Make sure that in Healthchecks you have set both the Period and Grace Time for the ping to be 1 minute. After the initial minute, Healthchecks will enter its grace period.
After the grace period has passed, Healthchecks will report the outage to Mattermost and launch the playbook.
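The timing can be reasoned about with a simplified model: a check is up until the Period has elapsed since the last ping, in its grace period until the Grace Time has also elapsed, and down after that. The function below is a sketch of that model only. The exact boundary handling inside Healthchecks is an assumption here, and the name check_status is hypothetical.

```python
from datetime import datetime, timedelta


def check_status(last_ping, now, period=timedelta(minutes=1), grace=timedelta(minutes=1)):
    """Classify a check as 'up', 'grace', or 'down' from the time since its last ping."""
    elapsed = now - last_ping
    if elapsed <= period:
        return "up"
    if elapsed <= period + grace:
        return "grace"  # A ping is overdue, but no alert has been sent yet.
    return "down"  # This is the point where notifications and webhooks fire.


if __name__ == "__main__":
    last_ping = datetime(2022, 1, 1, 12, 0, 0)
    for seconds in (30, 90, 121):
        print(seconds, check_status(last_ping, last_ping + timedelta(seconds=seconds)))
```

With both values set to 1 minute, the model predicts an alert roughly two minutes after the last successful ping.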
You can click the deckofcards is down channel on the left to see the new playbook in action.
On the right side, you can see the preset actions included with this playbook to help you resolve the outage efficiently. In a real-world scenario, you would want to identify the severity of the outage and investigate suspected causes. This checklist helps you stay on track, collaborate with other engineers, provide effective status updates, and reflect on the event afterward.
You can try the /ping slash command again to verify the outage. This time the command fails, which confirms that the deckofcards API is offline without even leaving the channel in Mattermost! You can include that tidbit of information with the first update, and a minute later you will be asked for another follow-up. During a real incident, every second counts!
To simulate resolving the outage, use the following command:
docker start deckofcards
After another minute, Healthchecks will report the restored service to Mattermost and the /ping slash command will work again, so you can finish the remaining steps and mark the playbook as finished!
Next Steps
You’ve successfully set up your first incident response plan in Mattermost! Let’s recap everything you accomplished in these few simple steps:
- Set up Mattermost to centralize developer collaboration and communication
- Learned how to authenticate and fetch data from the Mattermost API
- Initialized an API (deckofcards) to serve as a starting point for new ideas
- Created a Mattermost slash command to communicate with deckofcards
- Configured Healthchecks to monitor deckofcards for any outages and report them to Mattermost, along with launching a playbook to resolve the outage!
You can use this setup as a starting point for your own projects that require uptime monitoring and incident resolution. Mattermost can serve as the platform for your entire developer workflow, and this demo is barely scratching the surface. For example, you could use this deckofcards API to create slash commands or even an entire Mattermost app! What will you create?