Although we rarely think of it, reliability and maintenance are part of our everyday lives. The equipment, manufactured products, and fabricated infrastructure that contribute substantively to the quality of our lives have finite longevity. Most of us recognize this fact, but we do not always fully perceive the implications of finite system life for our efficiency and safety. Many, but not all, of us also appreciate the fact that our automobiles require regular service, but we do not generally think about the fact that roads and bridges, smoke alarms, electricity generation and transmission devices, and many other machines and facilities we use also require regular maintenance.
We are fortunate to live at a time in which advances in the understanding of materials and energy have resulted in the creation of an enormous variety of sophisticated products and systems, many of which (1) were inconceivable 100 or 200 or even 20 years ago; (2) contribute regularly to our comfort, health, happiness, efficiency, or success; (3) are relatively inexpensive; and (4) require little or no special training on our part. Naturally, our reliance on these devices and systems is continually increasing and we rarely think about failure and the consequences of failure.
Occasionally, we observe a catastrophic failure. Fatigue failures of aircraft fuselages [1], the loss of an engine by a commercial jet [1], the Three Mile Island [1] and Chernobyl [1] nuclear reactor accidents, and the Challenger [2] and Columbia [3] space shuttle accidents are all widely known examples of catastrophic equipment failures. The relay circuit failure at the Ohio power plant that precipitated the August 2003 power blackout in the northeastern United States and in eastern Canada [4] is an example of a system failure that directly affected millions of people. When these events occur, we are reminded dramatically of the fallibility of the physical systems on which we depend.
Nearly everyone has experienced less dramatic product failures such as that of a home appliance, the wearing out of a battery, and the failure of a light bulb. Many of us have also experienced potentially dangerous examples of product failures such as the blowout of an automobile tire.
Reliability engineering is the study of the longevity and failure of equipment. Principles of science and mathematics are applied to the investigation of how devices age and fail. The intent is that a better understanding of device failure will aid in identifying ways in which product designs can be improved to increase life length and limit the adverse consequences of failure. The key point here is that the focus is upon design. New product and system designs must be shown to be safe and reliable prior to their fabrication and use. A dramatic example of a design for which the reliability was not properly evaluated is the well-known case of the Tacoma Narrows Bridge, which collapsed into the Puget Sound in November 1940, a few months after its completion [1].
A more recent example of a design fault with significant consequences is the 2013 lithium-ion battery fire that occurred on a new Boeing 787 aircraft while it was parked at the Boston airport [5]. Fortunately, the plane was empty, so no one was injured, but the fire and two subsequent fires of the same type resulted in all 787s being grounded until a modification to the battery containment was made. The cost to the airlines using the planes was estimated to be $1.1 million per day.
The study of the reliability of an equipment design also has important economic implications for most products. As Blanchard [6] states, 90% of the life cycle costs associated with the use of a product are fixed during the design phase of a product's life.
Similarly, an ability to anticipate failure often provides the opportunity to plan for an efficient repair of equipment when it fails or, even better, to perform preventive maintenance in order to reduce failure frequency.
There are many examples of products for which system reliability is far better today than it was previously. One familiar example is the television set, which historically experienced frequent failures and which, at present, usually operates without failure beyond its age of obsolescence. Improved television reliability is certainly due largely to advances in circuit technology. However, the ability to evaluate the reliability of new material systems and new circuit designs has also contributed to the gains we have experienced.
Perhaps the best-recognized system for which preventive maintenance is used to maintain product reliability is the commercial airplane. Regular inspection, testing, repair, and even overhaul are part of the normal operating life of every commercial aircraft. Clearly, the reason for such intense concern for the regular maintenance of aircraft is an appreciation of the influence of maintenance on failure probabilities and thus on safety.
On a personal level, the products for which we are most frequently responsible for maintenance are our automobiles. We are all aware of the inconvenience associated with an in-service failure of our cars and we are all aware of the relatively modest level of effort required to obtain the reduced failure probability that results from regular preventive maintenance.
It would be difficult to overstate the importance of maintenance and especially preventive maintenance. It is also difficult to overstate the extent to which maintenance is undervalued or even disliked. Historically, repair and especially preventive maintenance have often been viewed as inconvenient overhead activities that are costly and unproductive. Very rarely have the significant productivity benefits of preventive maintenance been recognized and appreciated. Recently, there have been reports [7–9] that suggest that it is common experience for factory equipment to lose 10%–40% of productive capacity to unscheduled repairs and that preventive maintenance could drastically reduce these losses. In fact, the potential productivity gains associated with the use of preventive maintenance strategies to reduce the frequency of unplanned failures constitute an important competitive opportunity [9]. The key to exploiting this opportunity is careful planning based on cost and reliability.
This book is devoted to the analytical portrayal and evaluation of equipment reliability and maintenance. As with all engineering disciplines, the language of description is mathematics. The text provides an exploration of the mathematical models that are used to portray, estimate, and evaluate device reliability and those that are used to describe, evaluate, and plan equipment service activities. In both cases, the focus is on design. The models of equipment reliability are the primary vehicle for recognizing deficiencies or opportunities to improve equipment design. Similarly, using reliability as a basis, the models that describe equipment performance as a function of maintenance effort provide a means for selecting the most efficient and effective equipment service strategies.
These examples of various failures share some common features and they also have differences that are used here to delimit the extent of the analyses and discussions. Common features are that (1) product failure is sufficiently important that it warrants engineering effort to try to understand and control it and (2) product design is complicated, so the causes and consequences of failure are not obvious.
There are also some important differences among the examples. Taking an extreme case, the failure of a light bulb and the Three Mile Island reactor accident provide a defining contrast. The Three Mile Island accident was precipitated by the failure of a physical component of the equipment. The progress and severity of the accident were also influenced by the response by humans to the component failure and by established decision policies. In contrast, the failure of a light bulb and its consequences are not usually intertwined with human decisions and performance. The point here is that there are very many modern products and systems for which operational performance depends upon the combined effectiveness of several of the following: (1) the physical equipment, (2) human operators, (3) software, and (4) management protocols.
It is both reasonable and prudent to attempt to include the evaluation of all four of these factors in the study of system behavior. However, the focus of this text is analytical and the discussions are limited to the behavior of the physical equipment.
Several authors have defined analytical approaches to modeling the effects of humans [10] and of software [11] on system reliability. The motivation for doing this is the view that humans cause more system failures than does equipment. This view seems quite correct. Nevertheless, implementation of the existing mathematical models of human and software reliability requires the acceptance of the view that probability models appropriately represent dispersion in human behavior. In the case of software, existing models are based on the assumption that probability models effectively represent hypothesized evolution in software performance over time. The appropriateness of both of these points of view is subject to debate. It is considered here that the human operators of a system do not comprise a homogeneous population for which performance is appropriately modeled using a probability distribution. Similarly, software and operating protocols do not evolve in a manner that one would model using probability functions. As the focus of this text is the definition of representative probability models and their analysis, the discussion is limited to the physical devices.
The space shuttle accidents serve to motivate our focus on the physical behavior of equipment. The 1986 Challenger accident has been attributed to the use of the vehicle in an environment that was more extreme than the one for which it was designed. The 2003 Columbia accident is believed to have been the result of progressive deterioration at the site of damage to its heat shield. Thus, the physical design of the vehicles and the manner in which they were operated were incompatible and it is the understanding of this interface that we obtain from reliability analysis.
The text is organized in four general sections. The early chapters describe in a stepwise manner the increasingly complete models of reliability and failure. These initial discussions include the key result that our understanding of design configurations usually implies that system reliability can be studied at the component level. This is followed by an examination of statistical methods for estimating reliability. A third section comprises five chapters that treat increasingly complicated and realistic models of equipment maintenance activities. Finally, several advanced topics are treated in the final chapter.
It is hoped that this sequence of discussions will provide the reader with a basis for further exploration of the topics treated. The development of new methods and models for reliability and maintenance has expanded our understanding significantly and is continuing. The importance of preventive maintenance for safety and industrial productivity is receiving increased attention. The literature reporting new ideas is expanding rapidly. This book is intended to prepare the reader to understand and use the new ideas as well as those that are included here.
As a starting point, note that it often happens that technical terms are created using words that already have colloquial meanings that do not correspond perfectly with their technical usage. This is true of the word reliability. In the colloquial sense, the word reliable is used to describe people who meet commitments. It is also used to describe equipment and other inanimate objects that operate satisfactorily. The concept is clear but not particularly precise. In contrast, for the investigations we undertake in this text, the word reliability has a precise technical definition. This definition is the departure point for our study.