TechOps is a mess. Here’s why we’re committed to fixing it.

June 15, 2022


Building software is hard. Building cloud software is even harder because things move much faster and the systems demand mission-critical reliability and availability. To build software effectively in the cloud, engineering teams need observability, CI/CD, reporting, and lots of tooling.

At every organization I’ve worked at, we’ve needed a system of tools that lets us:

Know what’s happening in real-time without having to do a lot of digging through dashboards;
Create and maintain an automatic timeline during events;
Provide meaningful integrations that consolidate relevant data, making it visible and actionable.

But the tools available to engineering teams never quite fit together or matched our specific processes. So, when things went wrong with our systems or services, there was a mad scramble to monitor and respond using disparate systems and data.

In trying to solve these issues, we always ended up with custom integrations that tried to pull together all of the necessary tooling to make our lives functional. This sort of worked. But it still included a fair amount of manual work to get the data out of each tool, and it also came with a significant maintenance burden.

Why building and maintaining manual integrations is so painful

Getting the complete picture of what was happening during an incident or outage was VERY painful. Bringing the monitoring and observability data together is a step in the right direction, but it’s not even close to a proper solution.

That data needs to be paired with operational tools — like messaging, playbooks (automated runbooks), and agile boards — so we can see what’s happening, make good decisions quickly, take decisive actions with confidence, and track and audit what we’ve done.

To solve this problem, we cobbled together existing OSS and commercial software with internal tools. Our goal was simple: to create a cohesive toolset that allowed us to execute fast, see what was happening in real-time, and create a detailed timeline of systems data and actions we took.

Perhaps the most frustrating part was that these integrations were brittle. They often fell over when we could least afford for them to be down.

We had customer obligations for things like blameless post-mortems and public root cause analyses (RCAs). But there wasn’t a system of record that provided a full picture of what happened or how we responded.

We literally had people manually typing or pasting data into spreadsheets to create a consolidated timeline for service events. Of course, this data then had to be manually edited, validated, and enriched to make it useful for a post-mortem. The same information also had to be massaged in a completely different way for the public RCA. The result was a system that largely worked, but was painful to use and expensive to maintain.
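As a rough illustration of the consolidation described above, here is a minimal Python sketch that merges events from several tools into one chronological timeline and then derives a separate public view for an RCA. Everything in it is hypothetical and exists only for illustration: the `Event` fields, the `internal` flag, and the function names are assumptions, not part of any Mattermost API or of the tooling we actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from itertools import chain


@dataclass(frozen=True)
class Event:
    """One timeline entry. All fields are hypothetical, for illustration only."""
    timestamp: datetime
    source: str             # e.g. "monitoring", "chat", "deploys"
    message: str
    internal: bool = False  # assumed flag: keep this entry out of the public RCA


def merge_timeline(*feeds):
    """Combine events from any number of feeds into one chronological list."""
    return sorted(chain.from_iterable(feeds), key=lambda e: e.timestamp)


def public_view(timeline):
    """Drop internal-only entries, as one might when preparing a public RCA."""
    return [e for e in timeline if not e.internal]


def render(timeline):
    """Format each event as a single line of text."""
    return [f"{e.timestamp.isoformat()} [{e.source}] {e.message}" for e in timeline]


if __name__ == "__main__":
    start = datetime(2022, 6, 15, 12, 0, tzinfo=timezone.utc)
    alerts = [Event(start, "monitoring", "error rate above threshold")]
    chat = [
        Event(start + timedelta(minutes=5), "chat", "rolling back release", internal=True),
        Event(start + timedelta(minutes=2), "chat", "incident declared"),
    ]
    timeline = merge_timeline(alerts, chat)
    print("\n".join(render(timeline)))               # full post-mortem view
    print("\n".join(render(public_view(timeline))))  # public RCA view
```

The point of the sketch is that the post-mortem view and the public view come from the same merged record, so nothing has to be typed or pasted twice.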

Systematically eliminating friction in TechOps

The story above has been repeated over and over throughout my career. Each time I changed companies, I left behind the work we had done and had to start integrating and streamlining the tooling for the same problem all over again, often rebuilding the same integrations with much of the same tooling.

And that’s why I came to Mattermost in 2020.

Mattermost’s mission appealed to me because the things we’re doing here are things I had done repeatedly to meet the needs of every other place I’ve worked. I had seen this movie before.

For engineering teams to truly master digital operations, they need a system that is designed to work cohesively and is flexible enough to adapt to whatever technical needs they might have. Solving this problem means removing the integration and maintenance burden from technical teams, while giving them an out-of-the-box process that enables them to get going quickly, with a proven system of integrated collaboration tooling.

This is why I’m so excited about Mattermost v7.0, the next evolution of our platform, which gives engineering teams the tools they need to work with purpose, including:

Pre-built workflow templates for Playbooks that offer structure and collaboration and can be easily customized for specific team operations;
A new serverless App Framework that enables devs to quickly develop integrations and apps in any language that supports HTTP;
Calls, a native, secure voice communication solution for 1:1 calls, group conversations, and screen-sharing that can be launched in one click;
Collapsed reply threads, enabling teams to organize and focus threaded conversations and reduce channel clutter.

By giving developers back the time they spend building and maintaining their TechOps stack, Mattermost can change the game for your teams.

What comes next for digital operations

Our vision is a central collaboration suite that pushes the capabilities of digital operations teams forward with tooling, runbooks, and integrations that bring all of your operational data together in a single view — without context switching. We will continue building new features that give developer teams better capabilities to create and manage cross-organizational digital workflows at scale.

To learn more about what the future of developer collaboration looks like, try Mattermost today. While you’re at it, check out our roadmap to see what we’re working on next.
