SRE Incident Response for the New Decade

05 Jan

RigD Platform for Automated Response – A Four Part Series for SRE and DevOps Managers

Whether you are a DevOps/SRE individual contributor, a team lead or a manager in charge of a group, incident response is a major part of your job. Incident response has become critical as applications and services have become more distributed and complex. Whether your organization is more business or IT centric, having the right culture, processes and tooling to streamline incident response and, as some say, "make on call suck less" needs your attention now.

When I talk to DevOps managers who are focusing their teams on the right process and tooling, I find that the area is ripe for new approaches. It is no different than when I was in charge of HP’s operational tool set, or when I worked closely with hundreds of customers doing automation for operations: we all, as managers, spend way too much money on too many tools that don’t make enough of a difference to our teams’ bread and butter of running the operations of our digital businesses.

The Problem with SRE and DevOps Incident Tooling

The problem, as I discussed with a few of our customers, is that the tools we use solve a problem or two really well, and then the rest of the product is a set of features that kind of help the team but are hard to use in our organization. I am paying 30 bucks a user and using perhaps a quarter of that value in my team’s day-to-day operations. Let’s face it, incident management has been around for a while and we keep reinventing it. The problem is that my people are moving faster and innovating at a higher pace, and our operational collaboration to solve issues is just not represented well in the existing tools out there. It is time for a completely new way to look at and solve the problem of SRE and DevOps teams resolving operational issues fast. For sure, the process best practices published by Google or PagerDuty are great places to start thinking about the process aspect of this. What is the best new approach to solve this in the new decade?

Focus on Human Collaboration over Machine Alerts

My assertion is that we as an industry have over-pivoted toward AI-based analysis of machine alerts, looking for the deus ex machina of lights-out operation. (Deus ex machina, literally “god from the machine”, is a plot device whereby a seemingly unsolvable problem in a story is suddenly and abruptly resolved by an unexpected and seemingly unlikely occurrence, typically so much as to seem contrived.) This has for the most part failed for the vast majority of teams and companies. It’s time to recognize that. We need to get back to human operations. Humans are the solution to the challenge: our staff who know the customers’ needs, the architects who designed the system, and the team members who run the site reliability tests, triage complex issues, run post-mortems and build the automation that improves operations one step at a time.

A Platform for Incident Response with RigD

I wrote a four-part series on the RigD platform. These blogs focus on what an SRE or DevOps manager would want to look at when seeking a new way to help their teams respond to incidents. The RigD philosophy for solving this problem differently (and hence getting different results than the established model of paying a lot for on-call incident solutions and still not really solving the big problems) consists of:

Go beyond machine alerts creating incidents; help the team get work done fast
Provide a structure for team collaboration, and do this in Slack (or your collaboration tool of choice)
Create a “container” for the work and make it visible to all, above all to the machine intelligence in RigD, so it can identify patterns in human behavior and reuse them in the future
Build task management and post-mortem activities into that container from the get-go
Leverage flows everywhere: triage sequences, tool integrations to get and push data around, simple task automation, reminders, updates and anything else the team needs
Make the RigD Bot part of the team via SlackOps
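
To make the “container” idea in the list above a bit more concrete, here is a minimal Python sketch of one way such a container could group tasks, a shared timeline and a post-mortem draft. This is an illustration only and not RigD’s actual data model or API; the names IncidentContainer, Task, add_task and postmortem_draft, as well as the sample incident, channel and owners, are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Task:
    """One unit of work inside an incident (hypothetical structure)."""
    title: str
    owner: str
    done: bool = False


@dataclass
class IncidentContainer:
    """Hypothetical record grouping everything a team does for one incident."""
    title: str
    channel: str  # assumed: the chat channel opened for this incident
    tasks: list[Task] = field(default_factory=list)
    timeline: list[tuple[datetime, str, str]] = field(default_factory=list)

    def log(self, actor: str, note: str) -> None:
        # Every action is timestamped so it can be reviewed or mined later.
        self.timeline.append((datetime.now(timezone.utc), actor, note))

    def add_task(self, title: str, owner: str) -> Task:
        task = Task(title, owner)
        self.tasks.append(task)
        self.log(owner, f"task opened: {title}")
        return task

    def complete_task(self, task: Task) -> None:
        task.done = True
        self.log(task.owner, f"task done: {task.title}")

    def open_tasks(self) -> list[Task]:
        return [t for t in self.tasks if not t.done]

    def postmortem_draft(self) -> str:
        # The post-mortem starts from the recorded timeline, not from memory.
        lines = [f"Post-mortem draft: {self.title}", "Timeline:"]
        lines += [f"- {actor}: {note}" for _, actor, note in self.timeline]
        lines.append("Open follow-ups:")
        lines += [f"- {t.title} ({t.owner})" for t in self.open_tasks()]
        return "\n".join(lines)


# Example usage with made-up data.
incident = IncidentContainer("Checkout latency spike", "#inc-checkout")
restart = incident.add_task("Restart cache tier", "dana")
incident.add_task("Add latency alert", "lee")
incident.complete_task(restart)
print(incident.postmortem_draft())
```

The point of the sketch is that tasks and post-mortem material live in the same record from the start, so the human activity during the incident is captured as it happens rather than reconstructed afterwards.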

With this in mind, take a look at these four blogs:

RigD Platform for Automated Response Part 1 – Use Cases and Response Steps

RigD Platform for Automated Response Part 2 – Core Capabilities for ChatOps

RigD Platform for Automated Response Part 3 – Natural Language and Machine Learning

RigD Platform for Automated Response Part 4 – DevOps Integrations

Tags: incident response, RigD Platform for Automated Response, Slack, SlackOps