User story: How a global media company reduced costly outages by implementing a secure DevSecOps collaboration platform
Catastrophic failures, such as a security breach or a complete outage that makes a product or service unavailable, are classified as Sev0 incidents. On a severity scale that otherwise runs from Sev1 to Sev3, Sev0 is dire: it brings business to a complete standstill and may lead to lost revenue and a damaged reputation. A Sev0 incident usually has no quick workaround; it requires a coordinated effort beyond the engineering team to diagnose, correct, and manage.
Contending with a Sev0 outage at enterprise scale
For one global media giant lacking a Sev0 business continuity plan, the day finally came when a catastrophic outage knocked out an engineering division’s infrastructure and communications. The company lost more than $100,000 in revenue per minute of downtime, and the outage dragged on because the team couldn’t communicate without their regular systems. Meanwhile, end users, advertisers, and other internal teams endured hours of downtime while the recovery floundered.
The challenges extended beyond a single catastrophic incident to generally chaotic and slow response processes:
- The team’s backup system, IRC, lacked robust features and couldn’t scale to perform satisfactorily over a multi-hour international outage
- Data leakage was significant as teams, cut off from their standard systems, resorted to texts and even Google Docs to discuss sensitive data
- Actions and responses were incorrect or untracked, and retrieving information for a retrospective was onerous
- Automations and bots core to engineering workflows needed to be migrated from the old IRC environment to the response center, so the team needed a flexible, customizable solution to support that migration
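To make the bot migration in the last point concrete, here is a minimal sketch of how an IRC-style bot message might be relayed through a Mattermost incoming webhook. It is an illustration under stated assumptions, not the customer's actual integration: the webhook URL, the bot name, the channel name, and the helper functions `build_payload` and `post_message` are all hypothetical. It relies only on the incoming-webhook convention of POSTing a JSON body with a `text` field and an optional `channel` field.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical placeholder: a real URL is generated when an incoming
# webhook is created on the Mattermost server.
WEBHOOK_URL = "https://mattermost.example.com/hooks/your-generated-key"


def build_payload(nick: str, message: str, channel: Optional[str] = None) -> dict:
    """Translate one IRC-style bot message into an incoming-webhook payload."""
    payload = {"text": f"**{nick}**: {message}"}
    if channel:
        # IRC channel names start with '#'; webhook channel names do not.
        payload["channel"] = channel.lstrip("#")
    return payload


def post_message(payload: dict, url: str = WEBHOOK_URL) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the payload; call post_message() with a real
    # webhook URL to actually send it.
    print(build_payload("deploybot", "rollback started", "#sev0-response"))
```

Keeping payload construction separate from the network call means the translation logic can be checked without a live server, which matters when dozens of existing bots have to be moved at once.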
Leveraging Mattermost for business continuity
The media giant’s engineering team evaluated and ultimately adopted Mattermost for incident response. After product availability, the most important factor in their decision-making process was business continuity. After all, there’s no way to fix an outage if you’re isolated from your teams and systems. This Mattermost customer removed single points of failure from the incident response process to dramatically reduce the likelihood and extent of a Sev0 incident.
With Mattermost, they gained:
- Built-in incident collaboration playbooks with clear steps, robust integrations, and automations for key engineering processes
- Customizations and integrations with internal tools built in-house
- High availability and redundancy to ensure business continuity and stability
- On-premises deployment to protect IP and sensitive data and to meet FTC compliance requirements
- Modern tools purpose-built for technical and operational teams, streamlining collaboration, workflow management, and communication
An incident response command center that DevOps teams love
Today, over 15,000 engineers at this global media organization use Mattermost for DevOps, digital operations, and business continuity on a platform completely separate from their primary infrastructure. They’ve moved away from IRC and Slack and made Mattermost their incident response home base, with custom integrations to their own messaging systems. Developer satisfaction has increased (being able to name their Mattermost server with a custom URL was an added bonus), while downtime, response time, and resolution time have all decreased because they added safety and reliability to their incident response systems. Of course, saving $100,000 per minute doesn’t hurt either.