
Incident Management

An ‘Incident’ is any event which is not part of the standard operation of the service and which causes, or may cause, an interruption or a reduction of the quality of the service.

The objective of Incident Management is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.

Inputs for Incident Management mostly come from users, but can also come from other sources, such as management information or detection systems. The outputs of the process are RFCs (Requests for Change), resolved and closed Incidents, management information, and communication to the customer.

Incident management (IcM) is a term describing the activities of an organization to identify, analyze, and correct hazards to prevent a future recurrence. Within a structured organization, these incidents are normally dealt with by either an Incident Response Team (IRT) or an Incident Management Team (IMT). These teams are often designated beforehand or during the event, and are placed in control of the organization while the incident is dealt with, in order to restore normal functions.

There are three basic types of events:

Terminologies

The goal of incident management is to restore normal service operation as quickly as possible following an incident, while minimizing impact to business operations and ensuring quality is maintained.

End users log incidents by

Types of Incidents

Incidents are classified as

Security Incidents

A security incident can be

The “Incident Lifecycle”

ITIL recommends that incidents be managed through a lifecycle, or process, that includes a number of steps, activities, or sub-processes – from the initial identification or reporting of the incident through to its resolution and the closure of the incident record. This is the order:

Continuous ownership, monitoring, tracking, and communication are involved throughout each step as per the diagram below.

The benefits of incident management include, but are not restricted to:

Categorizing and Prioritization

Categories are used by the incident management system to create automatic assignment rules or notifications like the incident state. This allows the service desk to track how much work has been done and what the next step in the process might be. ITIL uses three metrics for determining the order in which incidents are processed: impact, urgency, and priority. All three are supported by Incident forms.

ITIL suggests that priority be made dependent on impact and urgency. In the base system, this is true on Incident forms. Priority is generated from impact and urgency according to the following data lookup rules:

Impact     | Urgency    | Priority
1 – High   | 1 – High   | 1 – Critical
1 – High   | 2 – Medium | 2 – High
1 – High   | 3 – Low    | 3 – Moderate
2 – Medium | 1 – High   | 2 – High
2 – Medium | 2 – Medium | 3 – Moderate
2 – Medium | 3 – Low    | 4 – Low
3 – Low    | 1 – High   | 3 – Moderate
3 – Low    | 2 – Medium | 4 – Low
3 – Low    | 3 – Low    | 5 – Planning
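
The lookup rules above can be expressed as a small mapping from (impact, urgency) to priority. The following is a minimal sketch in Python, not the incident management system's own implementation; the names PRIORITY_RULES, lookup_priority, and priority_label are hypothetical and chosen only for illustration.

```python
# Numeric levels for impact and urgency, as in the table: 1 = High, 2 = Medium, 3 = Low.
# Priority labels, as in the table's third column.
PRIORITY_LABELS = {1: "Critical", 2: "High", 3: "Moderate", 4: "Low", 5: "Planning"}

# (impact, urgency) -> priority, copied row by row from the lookup table.
PRIORITY_RULES = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}


def lookup_priority(impact: int, urgency: int) -> int:
    """Return the numeric priority for an impact/urgency pair."""
    try:
        return PRIORITY_RULES[(impact, urgency)]
    except KeyError:
        raise ValueError(
            f"impact and urgency must each be 1, 2 or 3, got {impact!r}, {urgency!r}"
        ) from None


def priority_label(impact: int, urgency: int) -> str:
    """Return the priority as 'number - name' (hypothetical display format)."""
    priority = lookup_priority(impact, urgency)
    return f"{priority} - {PRIORITY_LABELS[priority]}"


if __name__ == "__main__":
    print(priority_label(1, 2))
```

Keeping the rules in a plain table rather than a formula makes it easy to change a single row if an organization decides, for example, that a low-impact but high-urgency incident deserves a different priority.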

Escalation of Incidents

Escalation is done on the basis of

Investigation and Diagnosis of Incidents

As with the initial diagnosis, investigation and diagnosis are largely human processes. The service desk can continue to use the information provided by the Incident form and the CMDB to solve the problem. Work notes can be appended to the incident as it is being evaluated, which facilitates communication between all of the concerned parties. These work notes and other updates can be communicated to the concerned parties through email notifications.

Resolution and Recovery of Incidents

Once the incident is considered resolved, the service desk should set the incident state to Resolved. The escalators will be stopped and the service desk may review the information within the incident. After a sufficient period of time has passed, assuming that the user who opened the incident is satisfied, the incident state may be set to Closed.

If an incident’s cause is understood but cannot be fixed, the service desk can easily generate a problem from the incident, which will be evaluated through the problem management process. If the incident creates the need for a change in IT services, the service desk can easily generate a change from the incident, which will be evaluated through the change management process.

Closure of Incidents

Closed incidents will be filtered out of view, but will remain in the system for reference purposes. A closed incident can be reopened if the user or the service desk believes that it needs to be reopened. Incidents that are on the Related Incidents list of a problem can be configured, through business rules, to close automatically when the problem is closed.
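
The resolution, closure, and reopening rules described above amount to a small state machine. The sketch below is one way to model them in Python under stated assumptions: the state name "Open" is a hypothetical stand-in for any state before resolution, and the Incident and Problem classes, the move_to and close methods, and the auto_close_incidents flag are all invented for illustration rather than taken from any particular product.

```python
from dataclasses import dataclass, field

# Allowed state transitions. "Open" is an assumed placeholder for every
# pre-resolution state; "Resolved" and "Closed" come from the text.
ALLOWED_TRANSITIONS = {
    "Open": {"Resolved"},
    "Resolved": {"Closed"},
    "Closed": {"Open"},  # a closed incident can be reopened
}


@dataclass
class Incident:
    number: str
    state: str = "Open"

    def move_to(self, new_state: str) -> None:
        """Change state, rejecting transitions the lifecycle does not allow."""
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot move {self.number} from {self.state} to {new_state}")
        self.state = new_state


@dataclass
class Problem:
    number: str
    related_incidents: list = field(default_factory=list)
    closed: bool = False

    def close(self, auto_close_incidents: bool = True) -> None:
        """Close the problem and, if configured, its related incidents."""
        self.closed = True
        if not auto_close_incidents:
            return
        for incident in self.related_incidents:
            # Walk each incident through the normal lifecycle to Closed.
            if incident.state == "Open":
                incident.move_to("Resolved")
            if incident.state == "Resolved":
                incident.move_to("Closed")


if __name__ == "__main__":
    first, second = Incident("INC-1"), Incident("INC-2")
    second.move_to("Resolved")
    Problem("PRB-1", [first, second]).close()
    print(first.state, second.state)
```

The auto_close_incidents flag plays the role of the configurable business rule: when it is switched off, closing the problem leaves the related incidents untouched.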

Incident Team

The security incident coordinator manages the response process and is responsible for assembling the team. The coordinator will ensure the team includes all the individuals necessary to properly assess the incident and make decisions regarding the proper course of action. The incident team meets regularly to review status reports and to authorize specific remedies. The team should utilize a pre-allocated physical and virtual meeting place.

Incident Triage

The name triage comes from a French medical term, which describes a situation in which you have limited resources and have to decide on the priorities of your actions based on the severity of particular cases.

In the incident handling process, the triage phase consists of three sub-phases: verification, initial classification and assignment.

To implement triage in your incident handling process, you can consider your incidents in the same way a doctor thinks about patients. You will need to complete the triage process to prioritise the incident and progress it to diagnosis and resolution. The triage should determine the:

The basic questions that should be answered in this phase are:

This information allows you to decide what to do. Should you reject the incident, take immediate action, or handle the incident later? You may also decide that advice alone is enough for now, for example because the incident was reported by an experienced user whom you know. Finally, you may skip handling completely if your triage process tells you the issue is not important at all.
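
The triage decision described above can be sketched as a short function. This is an illustrative outline only: the Report fields, the three-level severity scale, the urgent_threshold parameter, and the order in which the checks run are all assumptions made for the example, and a real triage process would weigh more information than this.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REJECT = "reject the incident"
    IMMEDIATE = "take immediate action"
    ADVISE = "give advice only"
    LATER = "handle later"


@dataclass
class Report:
    summary: str
    verified: bool  # verification: is this a genuine incident within scope?
    severity: int   # initial classification: 1 (highest) to 3 (lowest), assumed scale
    experienced_reporter: bool = False


def triage(report: Report, urgent_threshold: int = 1) -> Action:
    """Decide what to do with a reported incident."""
    if not report.verified:
        return Action.REJECT
    if report.severity <= urgent_threshold:
        return Action.IMMEDIATE
    if report.experienced_reporter:
        # An experienced reporter can often act on advice alone.
        return Action.ADVISE
    return Action.LATER


if __name__ == "__main__":
    report = Report("mail server unreachable", verified=True, severity=1)
    print(triage(report).value)
```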

Root Cause Analysis

Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring, whereas a causal factor is one that affects an event's outcome but is not a root cause. Although removing a causal factor can benefit an outcome, it does not prevent its recurrence with certainty.

For example, imagine an investigation into a machine that stopped because it overloaded and the fuse blew. Investigation shows that the machine overloaded because it had a bearing that wasn’t being sufficiently lubricated. The investigation proceeds further and finds that the automatic lubrication mechanism had a pump which was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there isn’t an adequate mechanism to prevent metal scrap getting into the pump. This enabled scrap to get into the pump, and damage it. The root cause of the problem is therefore that metal scrap can contaminate the lubrication system. Fixing this problem ought to prevent the whole sequence of events recurring. Compare this with an investigation that does not find the root cause: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply recur, until the root cause is dealt with.
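
The machine example can be written down as a chain of "effect, caused by" links and followed back until no further cause is recorded. The sketch below, in Python, assumes exactly one recorded cause per effect, which fits this example but not every investigation; the names CAUSED_BY and trace_root_cause are hypothetical.

```python
# Each key is an observed effect; each value is the cause the investigation found for it.
CAUSED_BY = {
    "machine stopped": "fuse blew",
    "fuse blew": "machine overloaded",
    "machine overloaded": "bearing insufficiently lubricated",
    "bearing insufficiently lubricated": "lubrication pump not pumping sufficiently",
    "lubrication pump not pumping sufficiently": "pump shaft worn",
    "pump shaft worn": "metal scrap can get into the pump",
}


def trace_root_cause(event: str, caused_by: dict) -> list:
    """Follow the causal links from an event back to the last known cause.

    Returns the full chain, starting with the event and ending with the
    factor for which no deeper cause has been recorded.
    """
    chain = [event]
    seen = {event}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:
            raise ValueError("causal chain contains a cycle")
        chain.append(cause)
        seen.add(cause)
    return chain


if __name__ == "__main__":
    for depth, step in enumerate(trace_root_cause("machine stopped", CAUSED_BY)):
        print("  " * depth + step)
```

The last element of the returned chain is the candidate root cause. Fixing any earlier link, such as replacing the fuse or the pump, corresponds to removing a causal factor while leaving the rest of the chain in place.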

The primary aim of root cause analysis is: to identify the factors that resulted in the nature, the magnitude, the location, and the timing of the harmful outcomes (consequences) of one or more past events; to determine what behaviors, actions, inactions, or conditions need to be changed; to prevent recurrence of similar harmful outcomes; and to identify lessons that may promote the achievement of better consequences. (“Success” is defined as the near-certain prevention of recurrence.)

To be effective, root cause analysis must be performed systematically, usually as part of an investigation, with conclusions and root causes that are identified backed up by documented evidence. A team effort is typically required.

There may be more than one root cause for an event or a problem, so the difficult part is sustaining the persistence and effort required to determine them all. The purpose of identifying all solutions to a problem is to prevent recurrence at the lowest cost and in the simplest way. If there are alternatives that are equally effective, then the simplest or lowest-cost approach is preferred.

The root causes identified will depend on the way in which the problem or event is defined. Effective problem statements and event descriptions (as failures, for example) are helpful and usually required to ensure the execution of appropriate analyses.

One logical way to trace root causes is to use hierarchical clustering data-mining solutions (such as graph-theory-based data mining). In that context, a root cause is defined as "the conditions that enable one or more causes". Root causes can then be deduced from the higher-level groups that contain a specific cause.

To be effective, the analysis should establish a sequence of events or timeline for understanding the relationships between contributory (causal) factors, root cause(s) and the defined problem or event to be prevented.

Root cause analysis can help transform a reactive culture (one that reacts to problems) into a forward-looking culture (one that solves problems before they occur or escalate). More importantly, RCA reduces the frequency of problems occurring over time within the environment where the process is used.

Root cause analysis as a force for change is a threat to many cultures and environments. Threats to cultures are often met with resistance. Other forms of management support may be required to achieve effectiveness and success with root cause analysis. For example, a “non-punitive” policy toward problem identifiers may be required.

 
