Connecting people, systems, and data with team messaging
The Keys to DevOps Success
Organizations choose DevOps tools and practices based on two factors: the best available information and their internal culture. Long-term DevOps success is based on using tools that facilitate an ample amount of good information and nurturing a change-positive, trusting environment. With both these factors in place, better decisions get made throughout the DevOps lifecycle.
About this document
We surveyed and interviewed dozens of users of Mattermost and other messaging applications. We spoke to experts throughout the technical world to see how DevOps was changing, and the various ways ChatOps was influencing organizations to think about their end-to-end DevOps lifecycle. Finally, we researched how users were taking advantage of ChatOps at each phase, from planning to retrospective.
Been there, done that. Five times, in fact!
Information is more than just logs. It includes awareness of best practices, understanding of engineering goals and organizational strategy, and an accurate assessment of both team skills and any particular tool’s abilities. For example, it’s easy to imagine how implementing a microservice-based infrastructure would have a higher chance of success if a team has built a similar one before. DevOps is largely driven by the ability to replicate models and processes, or conversely, abandon existing processes based on steadily improving information.
Because DevOps teams consume information at multiple strata (operational, tactical, and strategic), data sources must span an equally wide range, from log files to internal team feedback to consumer response (that’s money, honey). Consistently improving the quantity and availability of good information is vital.
DevOps success is a long game: while there may be cases of massive overnight switchovers, DevOps processes are usually a series of small and steady decisions, based on the best information at hand.
Top 4 Reasons for Developer Messaging
Automate manual work
Unify disparate tools
Culture needs air to breathe. Not hot air, just air
Culture is an equally important part of a long-term DevOps success strategy. A few aspects that are particularly relevant to those in DevOps are:
How does your organization view change?
How does it foster productive conversation?
How does it handle conflict or criticism?
How does it make decisions?
Is there trust in decisions made across the organization?
Making DevOps strategy a reality requires a culture that embraces change, while safeguarding stability. Being able to understand this balance is a shared responsibility, and failing to achieve the balance is a common cause of chaos. When stability-loving customers, and the sales teams who love them, are adversely affected by technical teams who are eager for change, but unwilling to protect performance and uptime, it becomes a tug-of-war that can destroy an organization from within. A team that can discuss these differing viewpoints and arrive at a consensus is best suited to make mature decisions around tools and process.
Trust (…) is our willingness to be vulnerable to the actions of others because we believe they have good intentions and will behave well toward us.
Sandra J. Sucher and Shalene Gupta, The Trust Crisis, July 2019, Harvard Business Review
Having the ability to make mistakes and change direction along the way is a key aspect of DevOps-friendly culture. Automation, testing, continuous delivery, and other principles lead to overall quality, user experience, and developer happiness, which creates a positive feedback loop of growing trust.
The other important aspect of culture is alignment. Organizations spend millions to ensure executive teams are explaining and sharing their vision with clarity. HR and Engineering managers spend hours building onboarding and training programs to teach staff how to use the company’s mission and fundamental values to make decisions. The goal is alignment across all teams. This means fewer silos, a faster flow of information, and quicker resolution of challenges.
“The stronger alignment we have, the more autonomy we can afford to grant.”
One of the most widely accepted frameworks for DevOps implementation is the CAMS model, which states that success is based on attention to Culture, Automation, Measurement, and Sharing. Successful ChatOps empowers these: it facilitates collaboration, makes manual work easier, and unifies disparate tools into a central dashboard and team command center, fostering greater access to information. Compared to other tools, ChatOps alone plays a malleable and powerful role that adds value across the entire DevOps lifecycle.
Section I: Plan
Project planning and discovery is the single most important step in the product lifecycle. Survey after survey has shown lack of adequate discovery to be the primary reason for failure to launch a desired product or feature at the forecasted time.
Initially, software development planning was done before development work started, and these planning decisions were rarely changed. Agile development came along and has attempted to change that rigid process. Agile demands more incremental planning that is responsive to new information. Today, engineers make planning a continuous process, rather than working from a set of assumptions made prior to writing a line of code.
If a tree fails fast and no one hears it, is it Agile?
No matter how much or little you adhere to Agile or other software development methodologies, planning always means understanding the problem space and having a clear idea of how to create a solution. Being able to gather information accurately and quickly is crucial. It helps teams surface edge cases sooner, so there are fewer surprises at launch. Sharing data and arriving at a more complete, yet flexible product vision together keeps teams both aligned and grounded in reality.
Agile and DevOps are popular because they promise to deliver better results more efficiently. In fact, teams adopt Agile practices to better manage changing priorities and have better project visibility. This cannot happen in a siloed team environment. Developers and DevOps teams need a clear conduit to communicate.
Gartner recommends frequent status review meetings as a way to guarantee alignment. Conversely, many engineers report that “frequent status meetings” are one of the bigger obstacles to their productivity and happiness. In fact, Stack Overflow’s 2019 Developer Survey found meetings to be the second most common challenge to productivity, with 36.6% of respondents citing meetings as one of their top 3 challenges to productivity. Our research and interviews acknowledge that visibility is important. But even more efficiency can be gained by removing rote updates from status meetings, automating individual or team KPI retrieval, and using chat or face-time to have meaningful discussions on blockers or urgent issues. This keeps teams focused on overall progress and maximizes expensive “together time.”
Some best practices to include in your DevOps implementation are:
At Project Kickoffs, include channels with a standard naming convention for each new product or feature. Examples of those we surveyed follow a logical, easily understandable nomenclature:
New Feature #12201 Integration Product Team – Product, Engineers
New Feature #12201 User Feedback – Product, Customers, Engineers
Integrate file storage (Dropbox, Google Drive, etc.) with your messaging tool and pin the folder to the room. When docs are updated, an update can be posted to the channel, keeping the team informed.
Record all-hands meetings and make them accessible in the project feed (or in a Meetings channel). Hold remote teams accountable for watching. For best results, team members should watch together.
Planning sets the stage for the success of the project in multiple ways. Kick off a project with a clear pipeline across teams, time zones, and roles. Don’t worry about “doing Agile perfectly,” but instead aim for a fast, accurate, feedback loop and react to information appropriately.
Section II: Code
There is a saying: “Proper planning prevents poor performance.” When it comes to writing code a team is proud of, planning is only part of the equation. In addition, coding standards, effective use of tests, and time for periodic refactoring all help reduce bugs and maintain a stable codebase that developers aren’t terrified to change.
Coding isn’t a solitary, ascetic practice. It’s creative and collaborative
Developers use messaging for collaborative coding, enabling them to easily ask questions, discuss creative angles to solve problems, and review each other’s work. Additionally, progress on a feature can be automatically communicated on commit, creating an implicit awareness of team momentum. Hooking code repositories into chat is an established best practice, allowing the team to view and offer feedback on each other’s changes.
DevOps engineers may focus on server automation and orchestration, but they often also write a considerable amount of code. Because a key feature of ChatOps is accessing incoming data, they should integrate as much of this data as possible into a real-time data stream.
And since DevOps engineers often own ChatOps, the integrations they implement are one of the most creative ways they can positively affect collaboration. By understanding where teams spend time, the number of clicks it takes to access a control panel to run a command, or the total mental CPUs to run reports to get the same answers to the same questions, they can identify opportunities to optimize.
“I believe that prescriptive instruction (like coding standards) should happen not by default, but only to address a gap or shortcoming.
Has the team consistently introduced bugs and wasted troubleshooting time because half of the team uses an underscore prefix to mean one thing and half another? Do newer members of the team consistently make a correctness or performance mistake in the code? Have folks consistently introduced dependencies between projects that create deployment problems?”
Build a System
“If you must have coding standards, you need to automate them as much as possible. Automating them provides so many wins and inoculates against some of the traps.
A colleague once told me that it’s best to break bad news by building a system and blaming things on the system.”
Brainstorming should be friction-free, and it should happen often. Talking through a problem or getting a quick code review are essential ingredients in a successful team, and it is everyone’s responsibility to create an atmosphere of collaboration. If you are doing it right, conversations via chat are ongoing, and a video chat is just a click away. And DevOps and ChatOps integrations allow tasks to be marked completed via git commit or through simple chat commands.
of teams surveyed use some kind of pair programming.
of teams are writing unit tests!
Almost any project management tool can integrate with messaging. Atlassian and Asana are common tools used by development teams.
Section III: Builds
The software build phase compiles code from the source code repository and creates an executable artifact or bundle. There are dozens of popular tools for various languages and coding environments. Automating builds is essential to most DevOps strategies, as an application isn’t considered stable unless it can be rebuilt over and over on demand.
Always be building
Most teams use Jenkins or another continuous integration tool to kick off a build whenever new code is added to the codebase. All required code is checked out from the source repository, and as the job runs through the pipeline, notifications of whether the build passes are piped into the appropriate channel. Be sure to adjust the flow of notifications to reduce noise as teams grow and builds increase.
Build scripts themselves are checked into version control to ensure that they are versionable, recoverable, and testable. A single compile or packaging problem can have real performance and security implications, so builds need to be reproducible and every change to a build script needs review.
Teams also need to be able to manually start or stop a build. Developers trigger a build from within a Mattermost channel, and the entire team gets notifications of its success or failure.
Integrations with the two most common build tools, GitLab and Jenkins, can get you up and running in minutes.
Send notifications from Jenkins to Mattermost channels
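As a rough illustration of what such a notification involves, the sketch below builds a message and posts it to a Mattermost incoming webhook, which accepts a JSON body with a `text` field and an optional `username`. The job name, build number, and URLs are hypothetical placeholders, and the `username` override only takes effect if the server is configured to allow it. This is a minimal sketch of the idea, not the mechanism the Jenkins integration itself uses.

```python
import json
import urllib.request


def build_notification(job: str, build_number: int, status: str, url: str) -> dict:
    """Build the JSON body for a Mattermost incoming webhook post."""
    icon = ":white_check_mark:" if status == "SUCCESS" else ":x:"
    return {
        # Display name override; assumes the server allows integrations to override usernames.
        "username": "jenkins",
        "text": f"{icon} **{job}** build #{build_number} finished: {status} ([details]({url}))",
    }


def post_to_mattermost(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical job details; only the payload is built here, nothing is sent.
payload = build_notification(
    "checkout-service", 128, "SUCCESS",
    "https://jenkins.example.com/job/checkout-service/128/",
)
print(json.dumps(payload, indent=2))
```

To actually send the message, call `post_to_mattermost` with the webhook URL generated when the incoming webhook is created on your server.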
More recently, containerization has made it easier to release software bundled with the specific software resources needed. Users can include the container recipe to better ensure the full infrastructure is manageable as code. During a build, container events such as build completion or system start can be sent to a channel.
Build dependencies are not deceptively named. Should there be upstream issues, pipelines can break. As a result, some companies ensure their builds are fully reproducible even if their primary repositories are down, and have alternate copies of dependencies to pull during outages. Identifying and minimizing dependencies is a best practice.
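The idea of keeping alternate copies of dependencies can be sketched as a simple ordered fallback: try the primary repository first, then each mirror in turn. Everything named here (the function, the source URLs, the package name, and the `fake_fetch` stand-in that simulates an outage) is hypothetical and exists only to keep the example runnable offline.

```python
from typing import Callable, Iterable, Tuple


def fetch_with_fallback(
    package: str,
    sources: Iterable[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each source in order; return (source, artifact) from the first that works."""
    failures = []
    for source in sources:
        try:
            return source, fetch(source, package)
        except OSError as error:  # network and file errors share this base class
            failures.append(f"{source}: {error}")
    raise RuntimeError(f"no source could provide {package}: " + "; ".join(failures))


def fake_fetch(source: str, package: str) -> bytes:
    """Hypothetical stand-in for a real download; the primary is 'down'."""
    if "primary" in source:
        raise ConnectionError("repository unreachable")
    return f"{package} from {source}".encode("utf-8")


SOURCES = [
    "https://primary.example.com/packages",
    "https://mirror.example.internal/packages",
]
used_source, artifact = fetch_with_fallback("left-pad-1.3.0.tgz", SOURCES, fake_fetch)
print(f"fetched from {used_source}")
```

Passing the download function in as a parameter keeps the fallback logic independent of any particular package manager.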
As applications get larger, they become exponentially more difficult to manage. As a result, engineers doing planning should consider using smaller, independent microservices. Microservices are not the answer to every infrastructure question, but they allow engineers to work on part of the machine, rather than the entire beast. This means builds are faster, and code gets shipped quicker. Ideally, each microservice has a separate deployment pipeline so that it can be separately built and shipped.
“We have to be able to automate not only the build, but the entire verification process for the behavior we build in the system. This is what it means to be agile in software development and to be able to respond to last-minute changes.”
– David Bernstein, Developer/Coach/Trainer, To Be Agile
Section IV: Test
Testing code is a core part of continuous integration (CI). Before blindly writing tests, teams must commit to understanding what building testable software means. This usually involves a simple four-step life cycle: understanding the value and trade-offs, researching and choosing tools and a test approach, implementing, and finally, measuring and adjusting.
How much do your bugs cost; your best cash price
First, the overall return on investment (ROI) of testing should be carefully considered, researched, and quantified. Okay, you aren’t going to do that; no one does. However, it is important to understand the costs and benefits, as well as provide more than anecdotal proof of the effectiveness of testing. Conveying a nuanced, but accurate overview of testing is hard. Dedicating an extra 15% to 20% of engineering to testing sets an expectation that there will be a commensurate reduction of bugs, and puts engineers in a defensive stance to prove this.
Churn, morale, and company reputation are hard to measure, but those we spoke to all agree there is a correlation between rigorous testing and these less quantifiable measures. A customer may give up on your service after untested software gets pushed to production, but they may not actually leave until the end of a contract or until an alternative service is found. A good rule is to agree on KPIs that show whether increased testing is catching bugs and other performance-affecting issues before they are pushed to production. There should also be an overall decrease in issues.
It is generally accepted that writing tests results in cleaner, more supportable code, as edge cases have been considered and logic has been devised using a more measured approach. Users report that tests written months or years ago can shed light on the expectations of the code at the time of creation. As a result, implementing new changes requires less reverse engineering.
To a man with a hammer, every bug is easily squashed
Once the strategic aspects are understood, tools and testing methodologies should be selected. There are hundreds of ways to test. Complete coverage may include:
Integration testing: external API, etc.
Performance testing: load testing or PageSpeed Insights
Security tests, such as penetration testing
Visual regression, cross-browser, and mobile device testing.
Beyond these automated tests, there are still valid use cases where manual quality assurance (QA) is best. This may require working with remote team members. The most important goal is to maximize the efficiency of the testing you do. If regression testing alone exposes most issues, don’t overdo it with an overly complex workflow. There are diminishing returns on running and maintaining large suites of tests.
After these tests are defined and built, discipline is required to keep testing a high priority. Working smarter is key to long-term success. This is where ChatOps helps.
How ChatOps makes testing better
It is likely that your application will have multiple layers of testing, each executed on different environments at different stages: locally executed unit tests, smoke tests, integration tests, penetration tests, compliance tests, regression tests, and more.
Share your toys
Reusing test patterns is an important best practice: rather than rewriting similar tests, extend or call existing test functionality for similar features. This increases overall coverage in an efficient manner.
Measuring and making visible the accumulated number of defects each test has caught shines a light on the most useful tests, as well as the tests which may need review. Show these totals every time the tests are run.
Even the broken toys
Testing code as it is merged into production is important as a final check before deployment. When these tests fail, it may not be as simple as rolling back. Therefore failed tests should hold a greater sense of urgency when they happen in production. To sound off a code blue situation, tie testing alerts to your messaging system, which alerts engineers, support, and other key stakeholders.
Testing requires reporting and feedback loops to disseminate results and facilitate communication. When tests fail, developers or operators should know quickly, then understand what, when, and why it failed. There are several ways this is accomplished.
Example: Posting load test results using a slash command
A team member issued a slash command from within a channel. That command triggered the CI/CD pipeline to spin up a load test, and posted the results back into the channel for the team to review. The test data is richly formatted in a table for easy consumption.
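A sketch of the reply side of such a command might look like the following. A Mattermost custom slash command can respond with JSON containing a `response_type` (`in_channel` makes the reply visible to everyone) and a `text` field, and Markdown tables render in the channel. The endpoints and numbers below are made-up sample data, and the function names are illustrative assumptions; how the pipeline is triggered and how results are collected is left out.

```python
import json


def results_table(results: list) -> str:
    """Render load test results as a Markdown table."""
    header = (
        "| Endpoint | Requests | Mean (ms) | p95 (ms) | Errors |\n"
        "|:---|---:|---:|---:|---:|"
    )
    rows = [
        f"| {r['endpoint']} | {r['requests']} | {r['mean_ms']:.1f} | {r['p95_ms']:.1f} | {r['errors']} |"
        for r in results
    ]
    return "\n".join([header, *rows])


def slash_command_response(results: list) -> dict:
    """Build the JSON reply a slash command endpoint returns to Mattermost."""
    return {
        "response_type": "in_channel",  # visible to the whole channel, not only the caller
        "text": "#### Load test results\n\n" + results_table(results),
    }


# Hypothetical sample data for illustration only.
sample_results = [
    {"endpoint": "/api/login", "requests": 1200, "mean_ms": 84.3, "p95_ms": 210, "errors": 3},
    {"endpoint": "/api/search", "requests": 5400, "mean_ms": 132.75, "p95_ms": 398.2, "errors": 0},
]
response = slash_command_response(sample_results)
print(json.dumps(response, indent=2))
```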
Example: Rich messages and log files
Corey uploaded some Nginx logs to a channel. It’s a text file, but when users click it, they can see the logs in place and start scrolling through them. This gives the team immediate access to deeper information about an incident or bug from within the channel, so they can better collaborate and take faster action.
Then you put your toys away
A testing suite evolves. Tests can become redundant or add diminishing value over time. Current tests may eventually be repurposed for regression testing. Additionally, as new testing technology emerges, consider refactoring the test suite periodically.
Your team can bring clarity to testing by tagging and aggregating tests by area. Make the resulting defect category (front end, accessibility, etc) accessible via messaging. This makes the process of determining how and where to improve testing immediately apparent.
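One lightweight way to picture this is a tally of caught defects grouped by the area tag on each test, formatted as a short message for a channel. The test names, area tags, and counts below are hypothetical sample data, and how tests get tagged will depend on your test framework.

```python
from collections import Counter

# Hypothetical records: one entry per defect caught, tagged with the test's area.
caught_defects = [
    {"test": "test_checkout_total", "area": "front end"},
    {"test": "test_cart_badge", "area": "front end"},
    {"test": "test_nav_collapse", "area": "front end"},
    {"test": "test_form_labels", "area": "accessibility"},
    {"test": "test_contrast_ratio", "area": "accessibility"},
    {"test": "test_rate_limit", "area": "api"},
]


def defects_by_area(defects: list) -> Counter:
    """Count caught defects per area tag."""
    return Counter(defect["area"] for defect in defects)


def summary_message(counts: Counter) -> str:
    """Format the counts as a Markdown list, most defects first."""
    lines = [f"- {area}: {count}" for area, count in counts.most_common()]
    return "Defects caught by area:\n" + "\n".join(lines)


counts = defects_by_area(caught_defects)
print(summary_message(counts))
```

Areas with many caught defects point to where tests earn their keep, while areas with none may be either healthy or under-tested.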
Users indicate that showing test coverage and sharing incremental, constant metrics is a way to stress the value of the testing work being done. Several report that the percentage of critical defects is a team-wide KPI.
of teams report doing unit testing
use test-driven development
Test code is code and requires constant design and refactoring. The need for rapid test results helps this happen. Michael Nygard once told me that he created a test that would fail the test suite if it ran for longer than 15 minutes to force refactoring of the tests.
Staff Software Engineer at Walmart Labs
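The time-budget idea in the quote above can be sketched as a final check that fails when the whole suite runs too long. The quote does not say how the original test was written, so the function name and structure here are assumptions; only the 15-minute limit comes from the text.

```python
import time

MAX_SUITE_SECONDS = 15 * 60  # the 15-minute budget mentioned in the quote


def assert_suite_within_budget(
    started_at: float,
    finished_at: float,
    budget: float = MAX_SUITE_SECONDS,
) -> float:
    """Raise AssertionError if the suite ran longer than the budget; else return elapsed seconds."""
    elapsed = finished_at - started_at
    if elapsed > budget:
        raise AssertionError(
            f"test suite took {elapsed:.0f}s, over the {budget:.0f}s budget; refactor slow tests"
        )
    return elapsed


# Usage sketch: record the start, run the suite, then check the budget last.
suite_started = time.monotonic()
# ... the rest of the test suite would run here ...
elapsed_seconds = assert_suite_within_budget(suite_started, time.monotonic())
print(f"suite finished in {elapsed_seconds:.3f}s")
```

Using a monotonic clock avoids false failures when the system clock is adjusted mid-run.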
Section V: Releasing and Deploying
There are a few simple guiding principles at the heart of DevOps strategy. Code needs to exist in a state where changes can be made without impacting performance. Also, team members cannot work in isolation, unaware of each other’s changes. Version control takes care of executing the changes, but team members need to work together to reduce the risk of impact.
Rather than making massive changes to their codebase, developers make small additions throughout the day, while pulling in others’ changes. By regularly integrating with the team’s changes, fixing any conflicts or issues becomes more economical and less time-consuming.
Each time code changes are added to the codebase, automated builds and tests are kicked off. If there are no issues, these changes may be automatically deployed to production or held back for final release. Theoretically, however, since the entire codebase with new changes has been fully tested, the software should run as expected.
“Attaining agility in the delivery of applications gives you the ability to be flexible and to pivot directly on feedback from end users. The overhead of making changes should have an insignificant impact on the overall effort.”
Following the code to production, the DevOps team receives further notifications and data via Mattermost. They can easily track which servers received the code and view any relevant stats. By using simple commands to trigger a deployment, ChatOps provides relevant next step commands: View Status, SSH into the server, Rollback, Deploy to next environment in the pipeline, etc.
Managing feedback is essential to successful continuous delivery. As newly merged code gets closer to production, problems that arise become more expensive and have greater potential impact. Therefore, knowing when issues occur is vital. Equally necessary is a relentless commitment to reducing risk (and maintaining trust) while maintaining agility. Teams must be retrospective: Was there a root cause that should be addressed to prevent a bug from being released? Could a test be re-written? Could additional monitoring be put in place?
This means having the ability to review the effect of deployments. Here, ChatOps is a useful tool in creating a central timeline. When deployments are followed by error messages, dozens of eyes are immediately aware, and rollbacks can happen before customers are impacted. ChatOps can prevent worst case scenarios. As any Sales Engineer will attest, they would appreciate this level of team awareness, as opposed to being the one giving the client demo when a deployment breaks production.
With a commitment to ChatOps, making metrics visible periodically and on demand is easy. Aggregated values, such as average time to complete tests and average build times, are harbingers of problems to come. Along with test coverage and other KPIs important to the organization, having a finger on the pulse of the infrastructure doesn’t require logging into Grafana or another dashboard.
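As an illustration of how an aggregated value can act as an early warning, the sketch below compares the average of the most recent builds with the average of earlier ones and produces a one-line summary suitable for a channel. The window size, the 20% tolerance, and the sample durations are all assumptions chosen for the example, not recommendations from the survey.

```python
from statistics import mean


def build_time_report(durations_s: list, window: int = 5, tolerance: float = 1.2) -> dict:
    """Compare the mean of the last `window` builds with the mean of all earlier builds."""
    if len(durations_s) <= window:
        raise ValueError("need more builds than the window size to form a baseline")
    baseline_mean = mean(durations_s[:-window])
    recent_mean = mean(durations_s[-window:])
    return {
        "baseline_mean": baseline_mean,
        "recent_mean": recent_mean,
        # Flag when recent builds are more than `tolerance` times slower than the baseline.
        "slowing": recent_mean > baseline_mean * tolerance,
    }


def report_message(report: dict) -> str:
    """Format the report as a single chat-friendly line."""
    flag = ":warning: builds are slowing down" if report["slowing"] else "build times are steady"
    return (
        f"Average build time: {report['recent_mean']:.0f}s "
        f"(baseline {report['baseline_mean']:.0f}s), {flag}"
    )


# Hypothetical build durations in seconds, oldest first.
durations = [300, 310, 305, 295, 300, 290, 310, 400, 420, 410, 430, 440]
report = build_time_report(durations)
print(report_message(report))
```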
Smart companies use ChatOps as the captain’s chair for continuous delivery. This ensures they have the relevant information and quick access to the next steps and enables them to roll out changes faster and more safely.
use continuous integration
use continuous delivery
The companies we surveyed deploy between 5 and 50 times a day
GitLab helps 100,000+ organizations embrace DevOps by being a single source of truth for the organization. In our journey, we’ve learned that surfacing the best available information and building internal culture is key to DevOps success. To achieve this, the best teams unify their tools and workflows to be more transparent, efficient, and well-governed.
Director of Technical Evangelism at GitLab
Section VI: Provisioning and Management
Infrastructure provisioning, container orchestration, and configuration management tools such as Puppet, Chef, and Ansible use scripts to provide consistent configurations across your environments. These can operate on public or private clouds, internal servers, or a combination of all three. It is common to manage many types of resources across multiple cloud platforms. Often, merely keeping operating systems updated can be a full-time job. For people who manage these environments, DevOps isn’t an option. It’s a given.
More tools are available than most DevOps teams can handle, with more being invented on a daily basis. Containers seem almost ancient compared to function-as-a-service, serverless, or edge computing technology. One of the keys to reining in the change is to use messaging as the central input/output for managing as many tools as often as you can. If you can perform the top 80% of commands from a single interface, that reduces a significant amount of the team’s time spent switching contexts. How companies apply this varies, but a common pattern is to group components by departmental functional area: dashboard, application servers, etc.
As systems expand through organic growth or architectural changes, as is often the case with microservices, adopting tools that can predictively scale and provide actionable alerts and information is vital. Without improving the flow of information to the team, gains acquired by optimization can quickly be lost in administration.
“As the number of microservices grow and the complexity of the processes increases, getting visibility into these distributed workflows becomes difficult without a central orchestrator.”
When everything, from applications to infrastructure to networking, is code, management is easier and results become more predictable. Provisioning is no exception. When it is repeatable, easily editable, and revertable, it becomes easier to manage.
ChatOps makes cloud provisioning less abstract
Connect Ansible, Puppet, or your preferred provisioning tool into your messaging system. This creates low-effort visibility into spinup, configuration and deployment progress. You needn’t be glued to status windows, but knowing when an issue arises gives you the ability to respond quickly, before users are impacted.
Example: Securely deploy, monitor, and control Kubernetes in Mattermost with BotKube
BotKube can execute kubectl commands on a Kubernetes cluster without giving access to Kubeconfig or the underlying infrastructure. Additionally, you can debug your deployment, services, or anything about your cluster right from your chat window.
Section VII: Security and Monitoring
Launching code is wonderful. Then it tanks in production. Then your manager shows up at your anniversary dinner with your laptop and a wi-fi hotspot (yes, they broke into your house earlier to get your computer). Keeping track of application health and performance on production isn’t just about performance either. Bad actors may not have any reason to attack your applications or services specifically, but they may go after your cloud provider, upstream services, or APIs you use. Once the code is in production, protecting it is the next full-time job.
Monitoring without sleep deprivation
DevOps and System Reliability teams all have access to monitoring tools to stay on top of security and performance in a fast-moving Agile environment. Nagios, Logstash, Sensu, and others can be very precise in filtering, allowing engineers to create exception-based reporting based on error codes, IP ranges, or a variety of other factors. New Relic and other performance monitoring tools can provide alerts on traffic spikes or unexpected changes in backend performance after a deploy.
A well-integrated messaging solution can provide a canonical timeline across many applications and infrastructure components. During performance debugging or post-event forensic analysis, having a channel with notifications and alerts, even if silenced, can provide a concise timeline of how events transpire.
“When I talk about production monitoring, I mean the ability to get answers that matter to the programming team. That means the time a request lives on the server, the number of requests, the number of 400, 500, and 504 errors. This, along with debugging information on which microservices are called how often and how long they take to run. In complex environments I want to see a dependency map of which services call which other services.”
The mean API request time is the average amount of time a REST API request to the Mattermost app server takes to complete. If an app server starts to perform poorly, you’ll likely see a rise in the mean request time as it takes longer to complete requests. This could also happen if your database can’t sustain the load from the app servers. It may also be indicative of an issue between the app servers and your proxy.
You’ll want to set the alert threshold a little above what the mean request time is during your peak load times.
Here’s how it’s set on our community server:
Security without cold sweats
Automation is true neutral, neither good nor evil. As you read this, millions of automated attacks are occurring, applications are being scanned by bots, and networks are being tapped for vulnerabilities. For anyone responsible for security, sleeping at night depends on having a solid strategy to prevent data breaches and other hacks.
The best approach is implementing security across the lifecycle, from planning and commitment to organizational training and regular reviews. Coders should be trained on secure guidelines and reviews should include a security component. Automation such as static analysis, along with manual testing by internal staff or external black or grey-box testing teams, helps keep security consistently top of mind.
Having security built into the DevOps workflow is critical in spotting security vulnerabilities early in the build process. In GitLab’s recent Global Developer Survey, 45% of security respondents agreed that security vulnerabilities are mostly discovered by the security team after code is merged and in a test environment. Being able to spot a vulnerability early in the build cycle will help reduce costs associated with remediation later in the lifecycle.
Sr. Director of Security at GitLab
Running applications securely is also part of this multi-tier strategy, which means maintaining tight control of user access and permissions. All tools, especially those that use secrets or passwords, need to be locked down.
Luckily, even security testing can be a part of the DevOps CI/CD process. Load testing, penetration testing, and security testing help users identify weaknesses, allowing time to address issues before bad actors can exploit them.
Larger tech companies are moving from simple public keys to using SSH certificates and certificate authorities (CA) as a more controllable “Zero Trust” approach. We also recommend using federated authentication, such as LDAP.
For security-focused organizations, ChatOps is a part of building a strong chain of trust. Integration with a federated identity solution, such as LDAP/Active Directory, ensures that access by both humans and machines is controlled. Furthermore, many organizations reduce the risk of data spills by keeping certain levels of communication and notifications off email servers and within on-premise messaging platforms.
TL;DR: Real-Time Collaborative Success Punch List
People & Process
DevOps engineers are unique in that they are often individually responsible for turning hand-waving philosophy around building software into a conduit for software to which they might not contribute a line of code. They build, test, deploy, and monitor performance, using feedback to continuously improve the delivery process and streamline the product life cycle. Often the only insight the customer has into this crucial role is that the software they purchased “just works.”
Successful DevOps engineers we work with have the trust of their managers to use and implement the best available tools and information. They have the freedom to adjust, adapt, make mistakes, and learn. The best DevOps environments nurture these liberties. Organizations with healthy DevOps stress shared accountability, communication, and commitment to innovation. ChatOps helps foster the communication and information needed to expand this type of collaboration.
Mattermost is the open source collaboration platform built for development teams to drive innovation. Its on-premise and private cloud deployments provide the benefits of modern communication without sacrificing privacy. Mattermost gives enterprises the autonomy and extensibility they need to be more productive while meeting the requirements of IT and security teams. Download Mattermost and start collaborating across the DevOps lifecycle today.