Tuesday, February 04, 2020

Deepwater Horizon - Lessons we still haven't learned

What structure of management, motivations, mental models, and culture allowed such a "bug" to pass through to production? And aside from fixing the instance that failed, how do we "close that barn door" so this "kind of thing" won't recur? Can we learn enough from this?
(Reposted from May 2, 2010, because it still hasn't been fixed.)

The Deepwater Horizon disaster in the Gulf was described as an equipment failure, but the cause runs much deeper, and we need to "go there" to see how to stop more things like this from happening.

=======================

Friday, April 30, 2010
Blowout preventer failed on Gulf rig
New York Times:

As cleanup crews struggled Friday to cope with the massive oil slick from a leaking well in the Gulf of Mexico, dozens of engineers and technicians ensconced in a Houston office building were still trying to solve the mystery of how to shut down the well after a week of brainstorming and failed efforts.

They have continued to focus their attention on a 40-foot stack of heavy equipment 5,000 feet below the surface of the gulf — and several hundred miles from Houston. Known as a blowout preventer, or B.O.P., the steel-framed stack of valves, rams, housings, tanks and hydraulic tubing, painted industrial yellow and sitting atop the well in the murky water, is at the root of the disaster.

When an explosion and fire crippled the deepwater drilling rig on April 20, workers threw a switch to activate the blowout preventer, which is designed to seal the well quickly in the event of a burst of pressure. It did not work, and a failsafe switch on the device also failed to function.

Since then, the group of experts in deep-sea oil operations has been working out of a BP office, grappling with the intractable puzzle of how to activate the device.

“It’s a mystery, a huge Apollo 13-type mystery,” as to why the blowout preventer did not work, said a person familiar with the efforts to activate it, who requested anonymity because he was not authorized to speak on the subject.

Like Apollo-program engineers, who 40 years ago (and also in Houston) cobbled together a long-distance fix to save the crippled spacecraft and its crew, these experts are trying something far beyond routine: shutting down an underwater out-of-control well by remote control. And at a mile below the surface, the work site might as well be halfway to the moon.

The effort involves a half-dozen remotely operated robotic submersibles hovering around the blowout preventer, along with surface support ships. The submersibles, designed for drilling work, are equipped with video cameras and tools like wire cutters and “hot stabs,” metal connectors that can plug into hydraulic systems in an effort to operate them.

So far the efforts have not been successful. “They seem to be having hydraulic issues,” said the person familiar with the effort.

In computing, when a "bug" surfaces in a program, it is always a good idea not only to fix the bug, but also to try to understand which barn door was left open such that the bug could ever have made it into production code without being caught. There is usually a "failure of imagination" or a "failure of the mental model" that left the process vulnerable to factors that, from now forward at least, should be taken into account.

It is, in that sense, not so much the "code" that has failed as the process that produced and tested the code, and it is the process that needs to be fixed.
Almost always, this ultimately points to the management and culture that allowed the problem to exist in the first place.
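
To make that concrete, here is a minimal, hypothetical sketch in Python (the function and test names are invented for illustration): the one-line fix is the easy part; the regression test and the process question are the "barn door" part.

# Hypothetical sketch: fixing the bug vs. closing the barn door.
def safe_divide(a, b):
    # The original "bug": dividing without checking for zero.
    # The fix handles the case the mental model missed.
    if b == 0:
        return None  # caller must handle the "cannot divide" case
    return a / b

def test_safe_divide_handles_zero():
    # Regression test: encodes the failure of imagination so the
    # same class of bug cannot silently recur.
    assert safe_divide(10, 0) is None
    assert safe_divide(10, 2) == 5

test_safe_divide_handles_zero()

# The META question -- why did no test for b == 0 exist before the
# failure? -- points at the review process and culture, not the code.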

Consider the case of the blowout preventer ("BOP") on the Deepwater Horizon well. What does its failure to operate tell us about the management and culture that produced it?

"The work of the hands always reveals the true nature of the heart."
Facts:
  • The BOP was triggered (told to close the well) by the surface crew when there was a burst of pressure, exactly the kind of undesired and unintended event that is the whole point of having a BOP in the first place. The trigger failed to close the well.
  • Remotely operated underwater vehicles (ROVs) went down to the BOP and manually activated various emergency fall-back triggers, which also failed to function.
  • BOPs have failed before.
Analysis -- or "What were you thinking?!!"

  • Apparently, there were multiple ways to trigger the BOP to close, but all of them relied on a single hydraulic system to operate, making that system a single point of failure.
  • The design of the BOP did not use the design principle of air brakes on trains, trucks, and buses, namely, that when the pressure fails, the brakes LOCK. The pressure holds the brakes UNLOCKED, so a predictable failure in the hydraulics results in a known safe condition (a pattern sketched in code below).
  • The design of the BOP did not borrow the wisdom of airplane landing gear, which backs up the hydraulic or electrical lowering mechanism with a manual crank for when all else fails.
In other words, there are at least three major, obvious design flaws in the BOPs currently in use. The design, if submitted to a sophomore engineering class as a project, would probably get at best a grade of "C".
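
To make the air-brake principle concrete, here is a minimal sketch, with invented names and numbers (this is not the actual BOP design): hydraulic pressure is required to hold the well OPEN, so losing pressure defaults the valve to the safe, closed state.

# Hypothetical fail-safe valve, modeled on air brakes: pressure
# holds the unsafe (open) state; any loss of pressure closes it.
# The threshold and class name are invented for illustration.
MIN_HOLD_OPEN_PSI = 3000.0  # hypothetical pressure threshold

class FailSafeValve:
    def __init__(self):
        self.open = False  # the safe state is the default

    def update(self, hydraulic_psi, open_requested):
        # Open only while the operator requests it AND pressure is
        # available; a hydraulic failure closes the well rather
        # than leaving it stuck open.
        self.open = open_requested and hydraulic_psi >= MIN_HOLD_OPEN_PSI
        return self.open

valve = FailSafeValve()
print(valve.update(4500.0, open_requested=True))  # True: normal operation
print(valve.update(0.0, open_requested=True))     # False: pressure lost, well closes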

If you add to the equation that the cost of a failure is the ecosystem of the Gulf, plus well over a billion dollars out of pocket, the best grade it could get is an "F": unacceptable for the purpose.

META-Analysis -- How could this have happened? 

What structure of management, motivations, mental models, and culture allowed such a "bug" to pass through to production? And aside from fixing the instance that failed, how do we "close that barn door" so this "kind of thing" won't recur? Can we learn enough from this?

It's hard to imagine that the engineers who designed the BOP did not raise the issues above; doing so would be simple, standard engineering practice. This is a major manufacturer (Cameron, the BOP's maker), not some backwoods shop with inexperienced people. We have to conclude that the engineers at the bottom of the food chain saw these problems, but that their design recommendations were rejected by their management.

Apparently, to middle management, assuming something like a 5% failure rate of the equipment, a failure to keep pumping was something to avoid at all costs, but a failure to stop a blowout was an "acceptable cost of doing business."

It's not clear that the math is correct, and, on the whole, by the time the lawsuits from this blowout are done, it will almost certainly have cost the company far more in profits than it would have cost to redesign the BOPs to be closer to fail-safe.
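
As a back-of-the-envelope check on that math, here is a minimal sketch with loudly invented placeholder numbers (none of them are from the actual case): once the cost of a blowout is in the billions, even a tiny failure probability justifies the redesign.

# Expected-cost comparison; every number is a hypothetical
# placeholder, NOT data from the actual Deepwater Horizon case.
p_blowout = 0.05                  # assumed chance the BOP fails when needed
cost_of_blowout = 20_000_000_000  # assumed total cost of one blowout, $
cost_of_redesign = 50_000_000     # assumed cost of a fail-safe redesign, $

expected_loss = p_blowout * cost_of_blowout
print(f"expected blowout loss: ${expected_loss:,.0f}")    # $1,000,000,000
print(f"cost of redesign:      ${cost_of_redesign:,.0f}")  # $50,000,000

# The redesign pays for itself whenever
# p_blowout > cost_of_redesign / cost_of_blowout:
print(f"break-even failure probability: {cost_of_redesign / cost_of_blowout:.4f}")  # 0.0025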

So we may have the same old story of short-run profits dominating decision-making over long-run profits. Why investors put up with that is a whole branch of investigation in itself.

What can be done in the future?

It would be interesting if proposed safety mechanisms, such as the BOP, were posted online and opened to public comment early in the design phase. All of the issues above would surely have surfaced.

Regulatory agencies may be hard-pressed to issue regulations covering the myriad features of millions of safety-critical designs, but they could simply demand that such designs be posted online early in the design stage, so that public wisdom could be incorporated. In cases such as the current one, we could then leave it to the legal system to pursue why, given that issues X were raised, they were never addressed.
===============
