
May
How Digital Operations Can Use Slack to Transform Their Incident Response – even more so in this remote world
Site Reliability Engineers, IT Operations and DevOps teams already use Slack to communicate and collaborate as they resolve issues. This is the story of evolving new capabilities that are essential in the quest to speed up response to, and resolution of, technical incidents affecting business-critical services, right through Slack.
For many companies with technical teams staying on top of their digital operations, whether it be ecommerce, marketing, finance or travel platforms, a simple technical issue can bring on-line business to a halt. The signature of that failure or slowdown is often hard to find amid all the time series monitoring and log monitoring that is common nowadays. In many ways each problem is a snowflake, slightly different from the last one, given all the changes to the deployed application, network and configuration. Our world of on-line services is tightly linked through a set of macro and micro services providing capabilities that are “assembled” into a working web application. Add on top of that specific customer or user context and the great variation of load that comes with world changes, and you have a perfect storm hitting the technical support teams.
Slack changed the game for Incident Response
The introduction of Slack into these technical environments has already transformed incident operations and response. Previously, the crux of incident response was a “war room” and/or a conference call (by phone or over Webex/Zoom) with dozens of technical people present, along with many others who just needed to know what was going on. Add to that the inevitable confusion and haze while just a few key people were trying to troubleshoot, triage and restore the downed service. In companies with sizeable operations, there were often multiple incidents going on at once, some of which had critical status and were treated as major incidents with incident commanders.
Where Slack really turned established incident response methods on their head was the use of dedicated channels (and threads) for specific incidents. For the first time, teams had a collaborative environment for sharing ideas, as well as the data for the troubleshooting and resolution of the incident. With the extensive adoption of Slack across many groups in the enterprise, not only did the technical team have a central hub to share, collaborate, troubleshoot and drive the restoration of services but there came with that a comprehensive record and human perspective of what happened during the incident.
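As a rough illustration of the dedicated-channel idea, here is a minimal Python sketch that derives a channel name for an incident and builds the arguments for Slack's `conversations.create` Web API method. The `inc-<date>-<id>-<slug>` naming scheme, the function names and the sample incident are assumptions made for this example only; they are not described in the article and are not RigD's implementation. The sketch makes no network call.

```python
import re
from datetime import date

# Slack channel names must be lowercase, at most 80 characters, and may
# contain only letters, digits, hyphens and underscores.
MAX_CHANNEL_NAME_LENGTH = 80


def incident_channel_name(incident_id: int, summary: str, opened: date) -> str:
    """Build a channel name of the form 'inc-<date>-<id>-<slug>'.

    This naming scheme is a hypothetical convention, not something
    prescribed by Slack or by the article.
    """
    # Collapse every run of disallowed characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"inc-{opened.isoformat()}-{incident_id}-{slug}"
    # Trim to the length limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")


def create_channel_request(name: str) -> dict:
    """Arguments for Slack's conversations.create Web API method."""
    return {"name": name, "is_private": False}


if __name__ == "__main__":
    # Sample incident data, invented for the example.
    channel = incident_channel_name(42, "Checkout errors (EU region)", date(2020, 5, 14))
    print(channel)
    print(create_channel_request(channel))
```

A predictable name keeps each incident's discussion, data and timeline in one place that responders can find quickly, and the channel history then serves as the record of what happened.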
Saving incident response time and speeding up resolution while working remotely
Customers all over the world tell the same story when it comes to Slack and incident response:

“I can’t mention enough the need for speed in starting up an incident response. Sometimes automated monitoring can take a few minutes to send in an alert. Both our internal users and customers can see an issue in just a few seconds. Being able to use Slack to start up the incident response and then assemble the team to escalate as quickly as possible to the SRE who can identify, troubleshoot and fix the problem on the spot is so important to our customers.”
“Any SRE team lead will be the first to tell you that their way of doing things, their process, the tools they use to do their job is different than the outfit down the street. Being able to use Slack to customize how we respond and the flexibility there for our team is so important in our DevOps culture.”
It is more than just the tools
Besides culture, SRE and DevOps teams work with, on average, two dozen other tools that help them monitor, triage, track incidents, perform automation and handle other tasks. The structure of an incident response system needs to consider people, process and tools. Slack sits at the nexus of that system, allowing people to work remotely.
People interact through Slack; processes are executed within Slack, either through manual interaction or through a workflow; and typical SRE/DevOps tools, including systems of record, can provide two-way interactions through Slack Apps. Having the right Slack App to help structure the team and integrate with relevant tools can be key.
The natural evolution of incident response handling led to the development of RigD. RigD’s Slack App utilizes Slack’s position at the center of operations to provide capabilities that users can consume directly through Slack to automate many of the activities of incident response. Without leaving the context of Slack, they can speed up response no matter where they are and no matter who is involved. No special machine learning or AI skills are needed. It’s proven.
“With RigD’s Slack App, we save critical minutes with every incident, getting the incident response started, engaging the right resources to fix the problem. Integration to our other tools allows us to save time by eliminating context switching. Slack and RigD has become our digital command center for our technical operations at Tripactions.”