Risk Engineering and Murphy’s Law

There are two fundamental aspects of risk and its control that are literally household words.  The first is the safety valve. This small item of risk engineering came to prominence in the early days of steam and it remains in common use today – we talk of safety valves in social, psychological and political situations, so much so that we have almost forgotten its origin in stopping pressure vessels from exploding.   The second is good old Murphy’s Law – what can go wrong will go wrong, or one of its many variations.  Wikipedia has a fascinating article on it, in which we read that it could equally have come to be known as Strauss’s law (after the US Atomic Energy Commission chairman – “If anything bad can happen, it probably will”) or Reilly’s law.  Everyone in the known universe now knows it by the name of Captain Ed Murphy (an engineer working on research at what became the now famous US Edwards Air Force Base), originally reported as saying about someone: “if there is any way to do it wrong, he will”.  The Wikipedia entry gives various versions of the law and even quotes a source that calls it the fourth law of thermodynamics, probably tongue-in-cheek.

In other words, everybody at work, in your house, amongst your friends, in the shops you visit, on the aeroplanes in which you fly and so on knows this adage and has no difficulty recalling or understanding it.   

This saying, however it is formulated, is a statement of simple truth.  It makes a statement about the probability of things going wrong – if something is possible, then it is also probable (in the sense of having a probability associated with it).  There are only two certain things in the world, death and taxes, so everything else we deal with is uncertain and hence subject to this simple principle.  As engineers we are called on to design things to suit a required function efficiently, so it is natural for us to attend to what our designs are intended to do and to pay less attention to what they could do.  The mature engineer must also attend to the adverse potential of their designs.  This means risk engineering is something all designers should do, and our role as professed risk engineers is both to promote the science and application of the subject and to assist designers ourselves.

For example, when a vessel is pressurised there is a probability that the safety valve (SV) will fail to operate when called on to do so (SV failures per demand).  If we include the probability of an overpressure situation arising (demands per pressurisation) and also the Exposure (the number of times a year the vessel is pressurised, or the period of time in a year for which the vessel is pressurised), some very simple arithmetic allows us to calculate the Frequency with which vessel overpressure is not relieved by the valve.   The inverse of that, of course, is the Mean Time Between cases of unrelieved overpressure.  The existing mathematics of reliability engineering is all we need to understand this.
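
As an illustration only, here is a minimal sketch of that arithmetic in Python.  The numbers are invented for the example and are not drawn from any real plant; the point is the structure of the calculation (Exposure × demand rate × probability of failure on demand).

```python
# Sketch of the unrelieved-overpressure frequency calculation described above.
# All figures below are illustrative assumptions, not data from a real vessel.

pressurisations_per_year = 50                     # Exposure: how often the vessel is pressurised
overpressure_demands_per_pressurisation = 0.01    # chance a pressurisation produces an overpressure demand
sv_failures_per_demand = 0.001                    # probability the safety valve fails to lift on demand

# Frequency of unrelieved overpressure (events per year)
frequency = (pressurisations_per_year
             * overpressure_demands_per_pressurisation
             * sv_failures_per_demand)

# Mean Time Between cases of unrelieved overpressure (years)
mtb_unrelieved_overpressure = 1.0 / frequency

print(f"Unrelieved overpressure frequency: {frequency:.2e} per year")
print(f"Mean time between cases: {mtb_unrelieved_overpressure:,.0f} years")
```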

These calculations are useful in that they give the engineer a sense of perspective on the significance of the Outcome when things go wrong, and they make it possible to promote efficient control measures aimed at reducing the adverse effects when that does happen.

The Time Sequence Model (Chapter 3) clearly and simply shows the logic of the process – there are reasons the Event can happen (called Event Mechanisms) and there are Outcomes after the Event has happened.  As engineers we are mostly interested in the damaging potential of energy sources and the Event can be defined in this context – hence if an energy source exists, it is only a matter of time before control over its damaging properties is lost and an Outcome with potentially adverse Consequences occurs.
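
Purely as an illustration of the structure of the model (the field names and the pressure-vessel example below are my own assumptions, not taken from Chapter 3), an Occurrence can be thought of as a simple record linking Mechanism, Event, Outcome and Consequence:

```python
from dataclasses import dataclass

# Illustrative sketch only: one way to represent the Time Sequence Model as data.
@dataclass
class Occurrence:
    energy_source: str   # the damaging energy whose control is in question
    mechanism: str       # how control over the energy source comes to be lost
    event: str           # the moment control is lost
    outcome: str         # what happens after the Event
    consequence: str     # the resulting damage (the Likely Worst Consequence is of most interest)

# Hypothetical pressure-vessel example expressed in these terms
unrelieved_overpressure = Occurrence(
    energy_source="stored pressure energy in the vessel",
    mechanism="overpressure demand arises and the safety valve fails to lift",
    event="vessel pressure exceeds its design limit",
    outcome="vessel ruptures, releasing contents and blast energy",
    consequence="injury to nearby people and damage to surrounding plant",
)
```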

Logic, as well as moral duty, makes it clear that the more extreme the Likely Worst Consequence of the Outcome, the more must be done to minimise it by controlling the Outcome process as well as directing it towards less damaging Consequences.  Sometimes this means running around panicking and calling the emergency services, but the response should have been included in the design process and should continue to be managed over the years of the operational phase.  I have seen many examples of the lack of this in my practice.

The logic of this is not complex, so I’m sure you too will feel frustration and curiosity when faced with reports of serious cases where it has not been taken into account in the system design.  I’ll mention three cases here.  The first and last attracted a lot of publicity, the second comparatively little.  The first case(s) really did kill a lot of people, in both 2015 and 2019.  The second case nearly killed a lot of people. The third case killed two plane loads of people.  There are relatively few examples (Exposure, maybe some hundreds) of the first case to be found in the world.  The potential (Exposure) for the second case exists one hundred thousand times every day, and for the third case less often than that, as it involves only one type of aircraft.

First case:  Tailings dam failures in Brazil.  See https://www.youtube.com/watch?v=sKZUZQytads

Whether the ‘cause’ (i.e. the Mechanism) of the failure (the Event) is known or not, what is known is that such dams do fail.  See https://en.wikipedia.org/wiki/List_of_tailings_dam_failures.  There is engineering knowledge of the Mechanisms too, see https://www.hindawi.com/journals/ace/2019/4159306/.  It is clearly not fanciful to imagine such dam failures.  What is amazing, to me at least, is that the designers were content to design and build dams with outflow paths that would produce such destruction, and that Governments were content to approve the designs.  As a quick web search will show, the costs to the communities, the environment and the companies involved were nothing short of spectacular.  Or is it just that people, companies, Governments etc. can’t think past what is intended to happen?  What I find disturbing (and unhelpful) is the way Governments subsequently assume that they are innocent and that massive punishment of the operator by imposing huge fines will somehow solve the problem.

Second case:  An airliner that almost failed to achieve take-off from Melbourne airport.  See https://en.wikipedia.org/wiki/Emirates_Flight_407.  This flight failed to reach take-off safety speed at the point along the runway where it should have, resulting in the need to apply emergency thrust.  Damage was done to approach path lighting structures at the end of the runway and to the aeroplane itself.  The Event here is something that is possible (of course it is – it should not even be necessary to point this out). It occurred when the engines were spooled up to what was assumed to be the required take-off thrust level – full-power take-off settings are the exception these days. The inadequacy of the thrust only became evident towards the end of the take-off run, when an attempted rotation failed to produce sufficient lift to allow the aircraft to become airborne.  This is well into the Outcome pathway, and it was only then that the flight crew became aware of the problem. How did they become aware?  By realising that rotation had not lifted the aircraft off the runway.  This is seat-of-the-pants stuff (and the year was 2009 and the technology the pinnacle of mankind’s capability).  You may be sure this was not a unique case – read also this:  https://www.avweb.com/flight-safety/technique/when-engine-instruments-lie/.  In that story the Captain thought they should have been going faster by the time they passed the terminal building and applied maximum thrust in time, so they did not die.  Also seat-of-the-pants stuff.   I find it amazing that in every case pilots monitor take-off performance by setting a thrust level based on take-off performance charts, assuming they have done the sums or entered the data correctly and that the engines are doing what is expected of them.  Note that up to the Emirates case at Melbourne airport there was no independent monitoring of acceleration achieved versus acceleration expected.  That is, no attention was paid by aeronautical engineers to designing the Outcome pathway.  I understand this case prompted manufacturers to begin designing monitoring systems to give the flight crew information about what was actually happening, versus what they hoped was happening, on the take-off run.
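
Nothing below describes any real avionics system; it is purely a sketch of the kind of independent cross-check being argued for.  One way acceleration achieved could be compared with acceleration expected during the take-off run might look like this (all names, figures and thresholds are my own illustrative assumptions):

```python
# Illustrative sketch of an acceleration cross-check during the take-off run.
# Thresholds and figures are invented for the example, not from any certified system.

def takeoff_acceleration_alert(groundspeeds_mps, sample_interval_s, expected_accel_mps2,
                               tolerance=0.85):
    """Return True if measured acceleration falls below the tolerated
    fraction of the acceleration predicted by the performance calculation."""
    if len(groundspeeds_mps) < 2:
        return False
    # Average acceleration over the sampled portion of the run
    measured_accel = (groundspeeds_mps[-1] - groundspeeds_mps[0]) / (
        sample_interval_s * (len(groundspeeds_mps) - 1))
    return measured_accel < tolerance * expected_accel_mps2

# Hypothetical example: groundspeed samples taken once per second early in the roll
samples = [0.0, 1.5, 3.1, 4.6, 6.0, 7.5]      # m/s
if takeoff_acceleration_alert(samples, 1.0, expected_accel_mps2=2.2):
    print("ACCELERATION BELOW EXPECTED - crew alert")
```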

The Mechanism here was an error in entering the aircraft weight data used for the take-off performance calculation.  In 1979 an error in entering route data led to the crash of an Air New Zealand aircraft into Mt Erebus, with the loss of all on board.  This involved the same Mechanism type (a data entry error) but different circumstances.  In other words, the same Occurrence Process, different conditions and circumstances.  Conditions and circumstances are mostly the only differences between Occurrences. 

The third, topical, case is the two Boeing 737 MAX losses of control and total losses.  I have been following this in aviation newsletters and it appears that the aeroplane relies on automated pitch control because its fundamental aerodynamics tend towards instability in pitch, due to the relatively forward and upward location of the engines compared with the usual underwing engine location.   The usual location is the same for all conventional underwing aircraft because it maintains pitch stability while minimising the drag of the installation. The design incorporated this unusual positioning, I understand, because of the desire to put bigger-diameter engines on an old airframe. The automated pitch control relies, not surprisingly, on input from an angle of attack sensor.  Angle of attack is closely related to fuselage pitch angle.  Curiously, it appears the engineers designed the system to use only one of these sensors for this purpose.  If the sensor fails (one case appears to have resulted from a bird strike on the sensor) the pitch controller is left without valid input.  A natural response of pilots to an autopilot pitch control error is to turn off the pitch function of the autopilot, but in the case of the 737 MAX this presents them with an aeroplane that is not inherently stable in pitch.  Nothing I have yet read suggests the design engineers had designed Outcome pathway control measures as a back-up to this apparently very obvious and possible Mechanism.    
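
As a sketch of the kind of back-up measure being described – nothing here reflects Boeing's actual design or any subsequent fix – a second sensor and a simple disagreement check are one obvious control for a single-sensor Mechanism (names and limits below are invented for the example):

```python
# Illustrative sketch only: cross-checking two angle-of-attack sensors before
# allowing an automated pitch command. Names and limits are invented for the example.
from typing import Optional

DISAGREEMENT_LIMIT_DEG = 5.0   # assumed threshold beyond which the readings are distrusted

def automated_pitch_command_allowed(aoa_left_deg: Optional[float],
                                    aoa_right_deg: Optional[float]) -> bool:
    """Permit the automated pitch function only when both sensors are reporting
    and broadly agree; otherwise hand control back to the crew with an
    annunciation (represented here by returning False)."""
    if aoa_left_deg is None or aoa_right_deg is None:
        return False                      # a sensor has failed outright
    if abs(aoa_left_deg - aoa_right_deg) > DISAGREEMENT_LIMIT_DEG:
        return False                      # sensors disagree: trust neither
    return True

# Hypothetical readings: one sensor damaged and reading an absurd angle
print(automated_pitch_command_allowed(4.2, 22.8))   # False - automated pitch inhibited
```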

I should conclude by saying that this is by no means the case in the design of chemical process plant, where designs sensitive to overpressure, underflow, reverse flow and the like are produced as a matter of course.  Perhaps there are other fields of engineering where this is also true, but why not in all fields?  Does our profession lack routinely applied principles of design?  I suspect that, as a generalisation, engineers of many or most disciplines are not taught the science of risk.

Always remember Murphy! 

Reference (flights per day): https://www.quora.com/How-many-airplanes-fly-each-day-in-the-world.  Viewed 2nd February 2020.

(This article was first published in the Australian Risk Engineering Society newsletter in 2019)
