Don't panic! 10 ways to manage major IT incidents

For many IT departments, the default reaction to a major incident is to shift into fire-fighting mode. So here are ten best-practice steps to resolve major incidents – without manning the panic stations.

Step one: Distinguish between high priority and major incidents

A major incident is any issue that has a huge business impact on several users and forces an organisation to deviate from existing incident management processes.

With no clear ITIL guidelines, high-priority incidents are often wrongly identified as major incidents. To avoid confusion, distinguish between high-priority and major incidents based on factors such as urgency, impact and severity.
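As a minimal sketch of how that distinction might be encoded in a help desk tool, the Python below classifies an incident from its impact, urgency and scale. The `Incident` fields, the 1 to 3 scoring scale and the `MAJOR_IMPACT_USERS` threshold are all illustrative assumptions, not ITIL-defined values; each organisation would set its own.

```python
from dataclasses import dataclass

# All thresholds below are illustrative assumptions, not ITIL-mandated values.
MAJOR_IMPACT_USERS = 50  # hypothetical cut-off for "several users"


@dataclass
class Incident:
    summary: str
    impact: int                    # 1 = low, 2 = medium, 3 = high business impact
    urgency: int                   # 1 = low, 2 = medium, 3 = high
    users_affected: int
    needs_process_deviation: bool  # the normal incident process cannot cope


def classify(incident: Incident) -> str:
    """Return 'major', 'high-priority' or 'standard' for an incident."""
    is_high = incident.impact == 3 and incident.urgency == 3
    is_major = (
        is_high
        and incident.users_affected >= MAJOR_IMPACT_USERS
        and incident.needs_process_deviation
    )
    if is_major:
        return "major"
    if is_high:
        return "high-priority"
    return "standard"


if __name__ == "__main__":
    outage = Incident("Email down company-wide", 3, 3, 800, True)
    vip_laptop = Incident("Executive laptop will not boot", 3, 3, 1, False)
    print(classify(outage))      # major
    print(classify(vip_laptop))  # high-priority
```

In this sketch, an urgent fault affecting a single user stays high-priority, while a widespread outage that forces a departure from the normal process is treated as major.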

Step two: Have clear and separate major incident workflows

To restore a disrupted service quickly, implement a robust process with separate workflows for major incidents.

Focus on automating and simplifying processes, including: identifying the major incident; communicating with the impacted staff or business stakeholders; assigning the right people; tracking the major incident throughout its lifecycle; escalating upon breach of SLAs; resolving and closing the incident; and generating and analysing reports.

To ensure the fastest possible resolution, adopt a no-approval process for solving major incidents.

Step three: Get the best team on the job

Ensure that your best resources are working on major incidents with clearly defined roles and responsibilities. Some organisations have a dedicated major incident team headed by a major incident manager, whereas others have a dynamic, ad hoc team of experts from various departments.

Your primary objective must be to keep your resources engaged and avoid conflicts of time and priorities.

Step four: Train and equip staff with the right tools

No one can predict when a major IT incident will strike. However, the first step to handling it is being prepared. Divide your major incident management team into sub-teams and train them in major incident management. Assign responsibilities by mapping skills to requirements.

Run simulation tests on a regular basis to identify strengths, evaluate performance and address gaps as needed. This will also help your team to cope with stress and be prepared when facing real-time scenarios.

Equip your team with the right tools such as smartphones and tablets with seamless connectivity so that they can work from anywhere during an emergency.

Step five: Follow pre-defined SLAs with additional resource on standby

Define stringent SLAs for major incidents. Set up separate response and resolution SLAs with clear escalation points for any breach of the process. Follow a manual escalation process if the assigned technician lacks the expertise to resolve the incident and ensure that a backup technician is always available.
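The separate response and resolution SLAs described above could be checked along the following lines. This is a sketch under stated assumptions: the 15-minute and 4-hour targets, the `MajorIncident` record and the escalation points named in `escalation_target` are hypothetical placeholders for whatever your own SLA definitions specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical targets; real values come from your own SLA definitions.
RESPONSE_SLA = timedelta(minutes=15)
RESOLUTION_SLA = timedelta(hours=4)


@dataclass
class MajorIncident:
    opened_at: datetime
    responded_at: Optional[datetime] = None  # None until first response
    resolved_at: Optional[datetime] = None   # None until resolved


def breached_slas(incident: MajorIncident, now: datetime) -> List[str]:
    """Return the names of the SLAs the incident has breached as of `now`."""
    breaches = []
    # An incident with no response yet is measured against the current time.
    response_end = incident.responded_at or now
    if response_end - incident.opened_at > RESPONSE_SLA:
        breaches.append("response")
    resolution_end = incident.resolved_at or now
    if resolution_end - incident.opened_at > RESOLUTION_SLA:
        breaches.append("resolution")
    return breaches


def escalation_target(breaches: List[str]) -> Optional[str]:
    """Map breaches to an escalation point (role names are assumptions)."""
    if "resolution" in breaches:
        return "IT director"
    if "response" in breaches:
        return "major incident manager"
    return None
```

Running a check like this on a schedule gives each breach a clear escalation point; the manual escalation for a technician who lacks the expertise would still be triggered by a person.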

Step six: Keep relevant people informed

Throughout the lifecycle of major incidents, send announcements, notifications, and status updates to relevant stakeholders. Announcements in the self-service portal will prevent end users from raising duplicate tickets and overloading the help desk.

Also, send hourly or bi-hourly updates during service downtime caused by major incidents. Have a dedicated line to respond to major incidents immediately and offer support to stakeholders. Use the fastest means of communication, such as telephone calls, direct walk-ins, live chat, and remote desktop control, instead of relying on email.

Step seven: Review major incidents to avoid future repetition

After a major incident is resolved, perform a root cause analysis using problem management methods. Then, follow the change management process to implement organisation-wide changes that prevent similar incidents from recurring.

Speed up the entire incident, problem and change management process by providing detailed information about the assets involved using asset management.

Step eight: Add major incident intelligence to your knowledge base

Formulate simple knowledge base article templates that capture critical details. These might include the type of major incident the article relates to, the latest issue resolved using the article, the owner of the article and the resources needed to implement the solution. Create and track solutions separately for major incidents so that you can access them quickly with little effort.

Step nine: Review and report on major incidents

Document and analyse all major incidents so that you can identify areas of improvement. This will help your team efficiently handle similar issues in the future. Also, generate major incident-specific reports for analysis, evaluation and decision-making.

These might include: the number of major incidents raised and closed each month; the average resolution time for major incidents; the percentage of downtime caused by major incidents; and the problems and changes linked to major incidents.
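Most of these figures can be computed from a simple export of incident records. The following is a minimal sketch, assuming a hypothetical `MajorIncidentRecord` with raised and closed timestamps and a downtime figure per incident; field names and the source of the total downtime value would depend on your help desk tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MajorIncidentRecord:
    raised_at: datetime
    closed_at: Optional[datetime]  # None while the incident is still open
    downtime: timedelta            # service downtime attributed to it


def monthly_counts(records):
    """Count incidents raised and closed per (year, month)."""
    raised = Counter((r.raised_at.year, r.raised_at.month) for r in records)
    closed = Counter(
        (r.closed_at.year, r.closed_at.month) for r in records if r.closed_at
    )
    return raised, closed


def average_resolution_time(records) -> Optional[timedelta]:
    """Mean time from raised to closed, ignoring incidents still open."""
    durations = [r.closed_at - r.raised_at for r in records if r.closed_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def downtime_share(records, total_downtime: timedelta) -> float:
    """Percentage of all recorded downtime caused by major incidents."""
    if total_downtime <= timedelta(0):
        return 0.0
    major = sum((r.downtime for r in records), timedelta())
    return 100.0 * (major / total_downtime)
```

Incidents that are still open count towards the number raised but are left out of the average resolution time, so the average is not skewed by unfinished work.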

Step ten: Evaluate major incident processes for continual service improvement

It is best practice to document major incident processes and workflows for ready reference by IT and other business stakeholders. This could include details such as the number of personnel involved, their roles and responsibilities, communication channels, tools used for the fix, approval and escalation workflows and the overall strategy, along with baseline metrics for response and resolution.

Management must evaluate processes on a regular basis to check if targeted performance levels in major incident management are met. This should help rectify flaws and contribute to continuous service improvement.

Sourced from Prithiv RajKumar, Marketing Analyst, ManageEngine

Ben Rossi
