I recently had a very interesting conversation with a respected friend who works in incident management. The company in question is a giant in a traditional sector, with mostly established, linear structures and processes. The company recognises that it is directly facing the impact of the digital revolution and is gradually transforming itself to be prepared.
However, due to its long history, size and diverse set of business areas, this transformation is only gaining traction after 3 or so years. Initiatives such as process redesign and streamlining became the norm, and cultural changes around unified purpose and shared values were introduced. Parts of IT and Operations even attempted to bring agile methodology into the scene. Not to mention that the company also flexed its muscles and took over a handful of tech startups for both vertical and horizontal integration.
To be perfectly honest, I have only rudimentary knowledge of incident management and operations management overall (though I completed ITIL greenbelt years ago), but I have summarised some key attributes below:
- It’s extremely time sensitive – when an incident occurs, standby teams kick in to respond, restore and recover, all the while keeping internal and (if necessary) external communication going to ensure impacted stakeholders are ‘in the know’
- It’s extremely unpredictable – incidents may occur anywhere in the company’s value chain. Sometimes, because of collateral damage or flow-on effects, an incident that occurred outside the company’s remit may still be considered significant enough to warrant action
- It’s extremely demanding – in a complex environment, the process needed to respond to, restore and recover from an incident must be coordinated across different stakeholder teams in different functions and perhaps in different geographical locations. This collaboration and coordination require years of team building, fine-tuning of processes, and clear roles and responsibilities established through wargaming and tackling real-life events
- It’s extremely long-tailed as a process – from initial incident detection, severity assessment and escalation, to allocated response, containment and communication, to incident closure, root cause analysis and problem fixing. A properly set up incident management function is usually the thread connecting an otherwise disjointed set of functions
- It’s extremely data rich – what went wrong says a lot about the integrity and robustness of a process. The nature and volume of incidents are a great source for risk assessment, business cases, cost-benefit analysis and ultimately decision making on what to fix, why and how
As our conversation went along, the agile-inspired part of me took over and joined forces with my risk management brain. We started exploring how agile and agile thinking could potentially help the incident management team be better at what they do. Here are some of the ideas:
- A core principle of agile methodologies is to limit ‘work in progress’. Teams agree to take on a small subset of work from the pipeline within a timeboxed period. By limiting the team’s focus and attention to what is most important, you enable them to complete work to the appropriate quality standards. In an incident management context, it’s usually about completing the root cause analysis and introducing permanent fixes so that similar incidents do not recur.
- Another similar agile concept is to focus on the most important task at hand. This is highly relevant to an incident management team that has to reprioritise constantly as new incidents come in. The team can gain a lot of focus by understanding the pipeline: where do the incidents come from, and how do they arrive in the team? Visualising the pipeline with an agile tool like LeanKit or Trello can help identify which items are the most important and challenge the team to focus on and finish those items. Reducing the scope of work could result in a change in behaviour, delivery time and quality.
- Another great attribute of agile is the empowerment of the team, which is usually self-organised and cross-functional. This may be an easy concept to adopt, as the team is already receptive to the idea of working collaboratively across teams. They understand that it is usually the key to successfully dealing with an incident.
- An extension of cross-team collaboration is the concept of DevOps – continuous and dynamic change management. The company still operates under the traditional change management model, where limited release windows are predetermined and production is otherwise locked down. Dedicated layers of testing and management approval must be passed before a release becomes possible. Historical data confirms that this model is a big pain point – a significant spike in incident volume follows each release window. DevOps could perhaps change this paradigm, or at least improve the situation, but I agree it sits beyond the remit of the incident management team and requires a wholesale change in how IT operations (software development, change management, testing, service management) work.
- Agile teams work on the principle that plans will change, that understanding of the work improves as we go along, and that no amount of planning really prepares us for the road ahead. This is true for incident management too. By working in iterations and delivering incremental value, an incident team can handle the distractions and uncertainties associated with investigations much better. Hypotheses and assumptions made at the beginning are often wrong, and being flexible in delivering small fixes can help the team solve the problem piecemeal and bring business stakeholders along on the journey. For the team, this might mean a short-term focus on a set of metric goals to solve a particular business problem. Just having the routine of sitting with the business and reviewing priorities is a great first step.
- Adopt the concept of the “Definition of Done” for common activities. This is not an operations manual that no one will ever read, but a collection of one-page definitions of what it means to be done with regard to a problem ticket. Make the definition of done visible and easy to use, and incident managers will know when they are finished with a piece of work before moving on.
- The best architectures, requirements and designs emerge from self-organising teams: teams that are not controlled but enabled, and that trust each other enough to have passionate debate and disagreement without destroying the team’s culture. The worst experience for an incident manager would be to be presented with a piece of work that has no scope for flexibility or creativity and, worst of all, to be told how long it will take. Imagine an incident team that is self-organising within the constraints of the organisation. They receive requirements that describe the incident (why) and the ideal acceptance criteria (what), and they, as a team, determine the solution (how). With that trust, the team knows when it has spare capacity and pulls more work into its queue.
- Short time-boxes focus teams on an objective they have to meet. Particularly in this highly unpredictable incident environment, the most valued attribute of the incident team may not lie in its velocity, the volume of work or the number of fixes delivered. It may just be that the rest of the organisation really appreciates the predictability introduced by working in set time-boxes, or sprints, so they know incidents and their fixes are being taken care of.
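For readers who think in code, three of the ideas above (limiting work in progress, pulling the most important item first, and a visible Definition of Done) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class names, the three checklist items, the severity scale (1 = most severe) and the WIP limit of 2 are all hypothetical and do not describe how my friend's team or any particular tool works.

```python
from dataclasses import dataclass, field

# Hypothetical one-page "Definition of Done" for a problem ticket.
DEFINITION_OF_DONE = (
    "root cause documented",
    "permanent fix released",
    "stakeholders informed",
)


@dataclass
class Ticket:
    title: str
    severity: int  # assumed scale: 1 = most severe
    done_items: set = field(default_factory=set)

    def is_done(self):
        # A ticket is done only when every checklist item is ticked off.
        return all(item in self.done_items for item in DEFINITION_OF_DONE)


class IncidentBoard:
    def __init__(self, wip_limit):
        self.wip_limit = wip_limit
        self.backlog = []
        self.in_progress = []
        self.closed = []

    def add(self, ticket):
        # Keep the backlog ordered so the most severe ticket is first.
        self.backlog.append(ticket)
        self.backlog.sort(key=lambda t: t.severity)

    def pull(self):
        # The team pulls work itself, but only while under the WIP limit.
        if len(self.in_progress) >= self.wip_limit or not self.backlog:
            return None
        ticket = self.backlog.pop(0)
        self.in_progress.append(ticket)
        return ticket

    def close(self, ticket):
        # Refuse to close anything that does not meet the Definition of Done.
        if not ticket.is_done():
            return False
        self.in_progress.remove(ticket)
        self.closed.append(ticket)
        return True


board = IncidentBoard(wip_limit=2)
board.add(Ticket("Payment gateway timeout", severity=1))
board.add(Ticket("Stale dashboard data", severity=3))
board.add(Ticket("Login errors after release", severity=2))

first = board.pull()   # most severe ticket
second = board.pull()  # next most severe ticket
third = board.pull()   # None: the WIP limit of 2 has been reached

first.done_items.update(DEFINITION_OF_DONE)
board.close(first)     # frees capacity on the board
board.pull()           # the remaining ticket can now be pulled
```

The point of the sketch is the behaviour rather than the tooling: nothing new starts until something is genuinely finished, and "finished" is checked against a short, visible list rather than left to individual judgement.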
I’d love to hear from people who work in IT Operations, service desks, incident and problem management and engineering who can compare the way they currently work against these ideas – all of which are simple and cheap to implement. If you have experience in overlaying agile onto existing service operations, even better – please share.
Ultimately, the ideas of focus, alignment, self-organisation and predictable rhythm promote a culture of learning – about the work the team handles, about how the team performs and about how the team interacts with the business. What are your thoughts?
#agile, #agile thinking, #incident management, #service management, #ITIL, #devops, #collaboration, #Rispeak