Practical Guide to IT Problem Management

Andrew Dixon

  1. 88 pages
  2. English

About This Book

Some IT organisations seem to expend all their energy firefighting – dealing with incidents as they arise and fixing, or patching over, the breakage. In organisations like this, restarting computers is seen as a standard method to resolve many issues. Perhaps the best way to identify whether an organisation understands problem management is to ask what they do after they have restarted the computer. If restarting the computer fixes the issue, it is very tempting to say that the incident is over and the job is done. Problem management recognises that things do not improve if such an approach is taken. Such organisations are essentially spending their time running to stay in the same place.

Written to help IT organisations move forward, Practical Guide to IT Problem Management presents a combination of methodologies including understanding timelines and failure modes, drill down, 5 whys and divide and conquer. The book also presents an exploration of complexity theory and how automation can assist in the desire to shift left both the complexity of the problem and who can resolve it. The book emphasises that establishing the root cause of a problem is not the end of the process as the resolution options need to be evaluated and then prioritised alongside other improvements. It also explores the role of problem boards and checklists as well as the relationship between problem management and Lean thinking. This practical guide provides both a framework for tackling problems and a toolbox from which to select the right methodology once the type of problem being faced has been identified. In addition to reactive methods, it presents proactive activities designed to reduce the incidence of problems or to reduce their impact and complexity should they arise.

Solving problems is often a combination of common sense and methodologies which may either be learnt the hard way or may be taught. This practical guide shows how to use problem solving tools and to understand how and when to apply them while upskilling IT staff and improving IT problem solving processes.

Information

Year: 2022
ISBN: 9781000586626
Edition: 1

Chapter 1 Getting Your Priorities Right

DOI: 10.1201/9781003119975-2
This book is not designed to help you pass an exam in problem management, although it may help you set up a problem management process within your organisation (Chapter 10 looks at formal processes). Above and beyond that, this book looks at the bigger picture of how problem management adds value to an organisation and why it is important.
Take, for example, the Apollo 13 moon mission. This is an oft-quoted example because of the famous expression:
Okay, Houston, we’ve had a problem here.1
We could, at this point, concern ourselves with the nature of a problem. Rather, I want to initially focus on what was important at that moment as the three astronauts and Mission Control tried to understand what was happening. They needed to know the impact of what had happened. They did not need to know what had happened, or why it had happened. For some time, they did not need to know what they would do about it – that came later. The first question in problem management (and indeed in the related discipline of Major Incident Management) is:

What Is the Impact?

Let us consider the event from the view of the astronauts (Table 1.1):
Table 1.1 Status in the initial minutes2
Incident:
  ‘pretty large bang’
Readings:
  Main B Bus undervolt
  Oxygen tank 2 was empty and tank 1’s pressure slowly falling
  The computer on the spacecraft had reset
  The high-gain antenna was not working
Observations:
  ‘a gas of some sort’ venting into space
  The volume surrounding the spacecraft was filled with myriad small bits of debris from the accident
In Chapter 5 we will explore how to ask the right questions at this point. The impact was assessed as:
Oxygen is required for power, heating and breathing. Oxygen has been lost and is still being lost. Without sufficient oxygen, the astronauts will die.
Normally, in such a situation, they would have used the Service Module’s main engine to return to Earth, but they determined that there was a significant risk that it had been damaged in the explosion.

Workarounds

In problem management, a solution which addresses the immediate impact without addressing the underlying causes is known as a workaround. In the case of Apollo 13, they realised that they had a spare source of oxygen (the Lunar Module) and a spare source of propulsion (the gravity of the moon). The astronauts lived in the Lunar Module for the next four days whilst the spacecraft travelled to the moon and back, using the Lunar Module’s propulsion system to guide the whole craft. This preserved the resources in the Command Module, so that it could be used for re-entry. The astronauts survived and were hailed as heroes, as were the staff of Mission Control who had assessed the impact and provided the workaround.
ITIL 4 defines an incident to be
An unplanned interruption to a service or reduction in the quality of service.3
The explosion and subsequent readings and observations amounted to a serious incident.
ITIL 4 defines a problem to be
A cause, or potential cause, of one or more incidents.4
The incident was over once the mission was over and the astronauts were safe. The tank which had exploded was somewhere in space – so it couldn’t be repaired.
The problem remained. Before another Apollo mission could take place, they needed to understand what had happened and how they could remove or reduce the risk of it happening again. This is called root cause analysis.
Note that although evidence was gathered, this analysis did not need to be done until after the workaround had brought the astronauts home. In any problem, mitigating the impact is the first priority. Sometimes, this can only be done by identifying the root cause and addressing it. However, that is not always the case, and deciding which applies is a judgement call.
The review board determined that Oxygen Tank 2 was faulty before the mission and that activating a fan within the tank caused an electric arc which caused the fire and explosion.5 There were a number of contributing factors. The tank was later redesigned to remove the risk from all of the contributing factors. Performing the review was critical to the success of later Apollo missions – any one of which could have ended in disaster if the root cause analysis had not been done correctly.
The root cause analysis identified both a sequence of events which led to the accident and a design fault:
  1. Tank 2 was originally in Apollo 10, but was removed to fix a fault. It was dropped when it was removed.
  2. There were thermostats which were designed to operate at 28 volts, but were powered with 65 volts – they failed to operate correctly.
  3. The temperature gauge was only rated up to 29° Celsius (84° Fahrenheit), so failed to detect the failed thermostats.
  4. During testing, tank 2 needed to be emptied and the drain system didn’t work, so they boiled off the oxygen. Without the functioning thermostats, temperatures may have reached 540° Celsius (1004° Fahrenheit).
  5. The high temperatures appear to have damaged the Teflon insulation.
Tests on similarly configured tanks produced telemetry readings which were in accord with the telemetry readings captured during Apollo 13’s flight, which gave the investigators confidence that this is what had happened.
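The order of priorities described in this section (impact first, then mitigation, then root cause) can be sketched as a small data model. This is only an illustration, not a formal ITIL schema: the class names, the field names and the next_step method are my own assumptions, and the docstrings paraphrase the ITIL 4 definitions quoted above. The final step reflects the point that establishing the root cause is not the end of the process.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """An unplanned interruption to a service or reduction in its quality."""
    summary: str


@dataclass
class Problem:
    """A cause, or potential cause, of one or more incidents."""
    summary: str
    incidents: List[Incident] = field(default_factory=list)
    impact: Optional[str] = None       # what is at stake right now
    workaround: Optional[str] = None   # mitigates impact, leaves the cause in place
    root_cause: Optional[str] = None   # established by root cause analysis

    def next_step(self) -> str:
        """Return the next priority (hypothetical labels, illustrative only)."""
        if self.impact is None:
            return "assess impact"
        # Mitigation comes before analysis, unless the root cause is
        # already known and can be addressed directly.
        if self.workaround is None and self.root_cause is None:
            return "mitigate impact"
        if self.root_cause is None:
            return "analyse root cause"
        return "evaluate resolution options"


# Walk the Apollo 13 example through the stages described in the text.
problem = Problem(
    summary="Oxygen Tank 2 failure",
    incidents=[Incident("pretty large bang")],
)
steps = [problem.next_step()]
problem.impact = "Oxygen is being lost; without it the astronauts will die"
steps.append(problem.next_step())
problem.workaround = "Live in the Lunar Module and travel around the moon"
steps.append(problem.next_step())
problem.root_cause = "Electric arc inside a faulty oxygen tank"
steps.append(problem.next_step())
print(steps)
# ['assess impact', 'mitigate impact', 'analyse root cause', 'evaluate resolution options']
```

Keeping the workaround and the root cause as separate fields mirrors the distinction drawn above: the incident can be closed once the workaround has succeeded, while the problem stays open until the cause has been understood.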

Preventing Problems

Problem management does not occur in a vacuum. When I trained to do First Aid at Work, one of the things I was taught was that it was better to avoid an accident than to pick up the pieces afterwards. If I saw a trip hazard, I could remove it or wait until someone tripped and then administer first aid. If I saw a drawing pin on the floor, then I could pick it up and put it back on the noticeboard, or I could treat someone with a drawing pin in the foot.
The cost of the Apollo series of missions is estimated at $25.4 billion, so it can be argued that this mission cost in excess of $1 billion and failed to achieve its primary objective of reaching the moon. The mistakes which led up to this were, therefore, very expensive mistakes.
The thermostatic switches used in Oxygen Tank 2 should have been replaced when the operating specifications were changed.
When the tank was dropped, it should have been fully tested in an end-to-end lifecycle test.
Oxygen Tank 2 was filled during a countdown demonstration test. When it could not be emptied using the correct procedure, a workaround was applied of boiling off the oxygen (which would normally be stored in liquid form).
At each point, if a different decision had been taken then this disaster may not have happened and a $1 billion mission may not have failed.
Problem management exists in the context of providing an end-to-end service and needs to operate alongside enterprise architecture, continual improvement and risk management.
Workarounds should not be used to pass the problem further down the line. If the drain pipe did not work, this should have indicated that there was a more serious issue in existence. Just removing the oxygen ignored the issue.

A No Blame Culture

There is no suggestion in this case that people covered up a story, but it is good practice in problem management, and in its sister discipline of major incident management, to operate a no blame culture. People make mistakes. This is part of human nature. We all make mistakes. If someone is doing their job and makes a mistake, there should be no blame attributed to them. Clearly, if they wilfully avoid safety rules or if they persistently fail to follow process, then that is a different situation. However, it is not helpful to blame someone for a genuine mistake. The first reason why it is not helpful is that people will conceal information if they believe that they will get the blame – valuable time will be lost trying to gather data which people could provide but which would incriminate them. The second reason is that tomorrow we all have to work together. I once accidentally deleted an entire web site. I had double-checked that I was only deleting a backup copy of it, but still managed to delete the live site. If I had pretended that it wasn’t me, it would have taken ages to conduct fault diagnosis in order to understand what had happened. Because I immediately owned up to it, we recovered 90% of the site in under an hour and the complete site by the close of the day.
The Apollo 13 mission is a useful case study because it was a complex problem with a number of causes which could have been avoided. Once the incident had occurred, the immediate need was for a workaround, which was successfully applied. Afterwards a full analysis identified the root causes, which could then be addressed. Above all, it is an example of a team which worked well together under pressure and were clear as to what their priorities were.

Summ...

Table of contents

  1. Cover Page
  2. Half Title Page
  3. Series Page
  4. Title Page
  5. Copyright Page
  6. Contents
  7. Biography
  8. Introduction
  9. Chapter 1 Getting Your Priorities Right
  10. Chapter 2 Timelines
  11. Chapter 3 Failure Modes
  12. Chapter 4 Complexity Theory
  13. Chapter 5 Automation and Artificial Intelligence
  14. Chapter 6 Drill Down
  15. Chapter 7 Divide and Conquer
  16. Chapter 8 Cause and Effect
  17. Chapter 9 Resolution Evaluation Methods
  18. Chapter 10 ITIL Problem Management
  19. Chapter 11 Problem Boards and Problem Records
  20. Chapter 12 The Drive for Efficiency
  21. Chapter 13 Applying the Principles to the World Outside of IT
  22. Chapter 14 Using Checklists
  23. Conclusion
  24. Appendix A Glossary
  25. Appendix B Sample Checklists
  26. Index