Sunday, September 23, 2007

Honey, I lost the nuclear weapons - Bent Spear


According to the Washington Post, the US military lost track of 6 nuclear weapons for 36 hours while transporting them under very light security from North Dakota to Louisiana. My interest in this is again in the general problem of what it takes to produce a reliable system (of any type) that protects us against dangerous errors, and how such systems break down.

As in the classic "Swiss cheese" model of protection, there were multiple layers of simultaneous failures here. And, as usual, the common thread was a strong belief in a mistaken mental model that, once launched, managed to obscure contrary details despite systems carefully designed to catch exactly this kind of error.

The question I have, as with my Comair 5191 analysis, is what we can learn from this, not which person in the chain deserves the blame for what is certainly a "system-level" failure. The common point of failure here is most likely the power of a belief - in this case, that the missiles carried dummy warheads - and the corresponding sense that surely the people before me did their jobs. It is exactly the problem the copilot of Comair 5191 faced: handed the controls while the plane was already picking up speed down the wrong runway, he would have had to question the prior actions of a superior officer.
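
To make the "Swiss cheese" point concrete, and to show why a shared belief is so much more dangerous than several independent oversights, here is a minimal sketch in Python (the failure probabilities are invented, purely for illustration). When the inspection layers are genuinely independent, their miss rates multiply into something tiny; when they all share the same mistaken mental model ("these are dummies"), they fail together and most of the protection evaporates.

    import random

    # Illustrative numbers only: assume each of 4 checks misses a problem
    # 5% of the time when it is genuinely looking for one.
    P_MISS = 0.05
    LAYERS = 4
    TRIALS = 1_000_000

    def independent_layers():
        # Every layer checks on its own; all must miss for the error to get through.
        return all(random.random() < P_MISS for _ in range(LAYERS))

    def shared_belief_layers(p_belief=0.10):
        # With probability p_belief the whole crew shares the mistaken mental
        # model ("these are dummy warheads") and every layer waves the cargo on.
        if random.random() < p_belief:
            return True                      # common-cause failure: all layers blind at once
        return independent_layers()          # otherwise the layers behave independently

    for name, layer_fn in (("independent layers", independent_layers),
                           ("shared mistaken belief", shared_belief_layers)):
        slips = sum(layer_fn() for _ in range(TRIALS))
        print(f"{name:22s}: error slips through {slips / TRIALS:.5f} of the time")

    # Roughly 0.05**4 = 0.00000625 for the independent case, versus roughly
    # 0.1 once all the layers can be blinded by the same belief.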

It is precisely the kind of accident that requires "mindfulness" of the type Karl Weick warned us about (see my post with links to the high-reliability organization literature). It is why the Army has gone to a more open model of management than you would expect (see FM 22-100, the US Army Leadership Field Manual). The Army is a "learning" organization, and it has learned, the hard way, that only a strong culture of safety can overcome the forces that suppress the eyes-open mindfulness required to see that something is wrong and question it. All of the top-down command-and-control discipline in the world cannot make the eyes at the bottom work that well.


Missteps in the Bunkers
By Joby Warrick and Walter Pincus, Washington Post Staff Writers
Sunday, September 23, 2007; A01
Excerpts (emphasis added):

The airmen attached the gray missiles to the plane's wings, six on each side. After eyeballing the missiles on the right side, a flight officer signed a manifest that listed a dozen unarmed AGM-129 missiles. The officer did not notice that the six on the left contained nuclear warheads, each with the destructive power of up to 10 Hiroshima bombs.

That detail would escape notice for an astounding 36 hours, during which the missiles were flown across the country to a Louisiana air base that had no idea nuclear warheads were coming. It was the first known flight by a nuclear-armed bomber over U.S. airspace, without special high-level authorization, in nearly 40 years.

Three weeks after word of the incident leaked to the public, new details obtained by The Washington Post point to security failures at multiple levels in North Dakota and Louisiana, according to interviews with current and former U.S. officials briefed on the initial results of an Air Force investigation of the incident.

The warheads were attached to the plane in Minot without special guard for more than 15 hours, and they remained on the plane in Louisiana for nearly nine hours more before being discovered. In total, the warheads slipped from the Air Force's nuclear safety net for more than a day without anyone's knowledge.

A simple error in a missile storage room led to missteps at every turn, as ground crews failed to notice the warheads, and as security teams and flight crew members failed to provide adequate oversight and check the cargo thoroughly. An elaborate nuclear safeguard system, nurtured during the Cold War and infused with rigorous accounting and command procedures, was utterly debased, the investigation's early results show.

The Air Force's account of what happened that day and the next was provided by multiple sources who spoke on the condition of anonymity because the government's investigation is continuing and classified.

Air Force rules required members of the jet's flight crew to examine all of the missiles and warheads before the plane took off. But in this instance, just one person examined only the six unarmed missiles and inexplicably skipped the armed missiles on the left, according to officials familiar with the probe.

"If they're not expecting a live warhead it may be a very casual thing -- there's no need to set up the security system and play the whole nuclear game," said Vest, the former Minot airman. "As for the air crew, they're bus drivers at this point, as far as they know."

'What the Hell Happened Here?'

The news, when it did leak, provoked a reaction within the defense and national security communities that bordered on disbelief: How could so many safeguards, drilled into generations of nuclear weapons officers and crews, break down at once?

Military officers, nuclear weapons analysts and lawmakers have expressed concern that it was not just a fluke, but a symptom of deeper problems in the handling of nuclear weapons now that Cold War anxieties have abated.

"When multiple layers of tight internal nuclear weapon control procedures break down, some bad guy may eventually come along and take advantage of them," said a former senior administration official who had responsibility for nuclear security.

A similar refrain has been voiced hundreds of times in blogs and chat rooms popular with former and current military members. On a Web site run by the Military Times, a former B-52 crew chief who did not give his name wrote: "What the hell happened here?"

A former Air Force senior master sergeant wrote separately that "mistakes were made at the lowest level of supervision and this snowballed into one of the biggest mistakes in USAF history. I am still scratching my head wondering how this could [have] happened."

Actually, the right question isn't how this particular event happened. As James Reason noted, things will break. The more complex a system is, the more ways it can break.

The right question is, why would such events happen more than once? Why isn't there a learning curve at least as strong as the forgetting curve? How can people at a low level raise their hands hundreds of times on web-logs and not be seen or heard once at higher levels? Was this actual incident preceded by at least one "near-miss" that could have been attended to and learned from, that wasn't?

Lloyd Dumas' book "Lethal Arrogance - Human Fallibility and Dangerous Technologies" and Charles Perrow's classic "Normal Accidents - Living with High-Risk Technologies" have the same message: the social side of managing risk is where the problems must be stopped. As Perrow says, "systems complexity makes failures inevitable." As the cover of that book notes,
[Perrow's] research undermines the promises that "better management" and "more operator training" can eliminate catastrophic accidents.
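
A quick back-of-the-envelope illustration of why complexity outruns procedures (my own sketch, not a calculation from either book): even counting only the ways two or three components can interact, the possibilities grow far faster than any fixed inspection budget.

    from math import comb

    # For a system of n interacting components, count how many distinct 2-way
    # and 3-way interactions are even possible. Illustrative counts only.
    for n in (10, 50, 200, 1000):
        pairs = comb(n, 2)     # n*(n-1)/2 possible pairwise interactions
        triples = comb(n, 3)   # three-way interactions grow faster still
        print(f"{n:5d} components: {pairs:>9,} pairs, {triples:>13,} triples")

    # 10 components:        45 pairs,           120 triples
    # 1000 components:  499,500 pairs,  166,167,000 triples
    # No fixed checklist can enumerate every combination once n gets large.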

Dumas has a more extreme analysis. He concludes that, as everything becomes interdependent in unexpected ways, it will never be possible to stop every combination of events that could lead to certain accidents, and he recommends an astounding thing for such cases - that society should voluntarily and proactively get rid of the technologies that cannot be fully controlled.

Reading Laurie Garrett's description of solitary 16-year-old soldiers protecting the Soviet Union's abandoned bio-warfare facilities, you can see how unexpected events, such as the collapse of a government, can remove in one step all the safeguards we had been counting on.

So, maybe, "We trust we won't have to use these...." isn't the world's safest strategy. That's a whole different debate.

For the technologies already in play, we need to bring our social systems up to the level of our technical systems, or we face a long list of risks that are certain in everything but the date they will occur. Point-wise "control" systems that depend on message-passing through formal channels need to be supplemented with diffuse cultural controls if they are to be reliable - and if someone is to be able to suddenly stop, say "Wait! Something doesn't fit right here!", and be heard.

This larger pattern - people lower in the chain of command seeing things and not being heard, or not being willing to voice what they see - is the real problem behind many of these accidents. And after they occur, it's all too common to hear "I kept telling them that was going to happen one of these days, but no one would listen!"

The response from that person's superior officer or manager will predictably be "You never told me that!" or, more precisely, "I never heard you say that!" What's subtle here is that the superior officer is correct and not falsifying anything - they literally never did HEAR what their subordinate said, or was trying to say. Analysis of the cockpit voice recordings of numerous accidents shows clearly that the pilots literally did not hear what their copilots were trying to tell them. Dr. Peter Pronovost's work in ICUs in Michigan hospitals showed that it is surprisingly hard for a lower-status person (a nurse) to be heard, literally, by a higher-status person (a surgeon), and that it takes a special step to make that work at all.

My whole post on "What I learned at Johns Hopkins last week" describes a similar phenomenon: the co-dependent suppression of dissent in classrooms.

People simply don't realize how powerful social suppression is between someone with authority and someone who depends on that authority for their survival or career. Once subordinates realize they are hitting resistance, they back off, and the superiors, literally, un-hear what was said; it vanishes from their long-term memory. Magicians use this technique all the time.
Even top generals seem to collapse into Jell-O when trying to contradict the President, by some accounts. This is far harder than it "seems", which means our intuition is terrible and our mental models need a lot of work and retuning. (That, of course, is what System Dynamics "flight simulators" are for.)
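
For readers who have not seen one, a System Dynamics "flight simulator" is just an interactive simulation built on a stock-and-flow model, so you can feel how feedback and delay defeat intuition. Here is a deliberately tiny sketch of my own (every parameter value is invented for illustration, not calibrated to anything): vigilance quietly erodes while nothing goes wrong, which is exactly what raises the odds that something eventually will.

    import random

    # Toy stock-and-flow model. The "stock" is organizational vigilance, which
    # erodes during quiet months and jumps back up after a visible incident.
    # All parameter values are invented for illustration.
    vigilance = 1.0        # 1.0 = Cold War-level attention, 0.0 = none
    EROSION = 0.01         # fraction of vigilance lost per quiet month
    SHOCK = 0.6            # vigilance regained after an incident
    BASE_RISK = 0.02       # monthly incident probability at zero vigilance

    random.seed(7)
    incidents = 0
    for month in range(1, 241):                       # simulate 20 years
        if random.random() < BASE_RISK * (1.0 - vigilance):
            incidents += 1
            print(f"month {month:3d}: incident, vigilance had drifted to {vigilance:.2f}")
            vigilance = min(1.0, vigilance + SHOCK)   # everyone pays attention again
        else:
            vigilance *= (1.0 - EROSION)              # quiet month: attention drifts

    print(f"incidents in 20 simulated years: {incidents}; final vigilance: {vigilance:.2f}")

The point of a real simulator is that you get to turn the knobs yourself and discover how badly your gut misjudges the delayed consequences.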

That's where we need to develop expertise, spend research dollars, and focus research attention. One more set of rules, or Standard Operating Procedures, or one more layer of the same kind of thing we have now will just fail the same way the existing ones failed here.

We were lucky this time.

As Michael Osterholm points out in his book "Living Terrors - What America Needs to Know to Survive the Coming Bioterrorist Catastrophe",
You have to be lucky all the time - we have to be lucky just once.
Irish Republican Army
This needs more attention.
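
The arithmetic behind that asymmetry is worth running once (the per-week probability below is purely illustrative, not an estimate of real risk): a defender who has to be lucky every single time loses that bet almost surely, given enough time.

    # Chance of getting through n weeks with no serious handling failure, if
    # each week independently carries probability p of one slipping through.
    # p is an assumed, purely illustrative number.
    p = 0.001                                  # one chance in a thousand per week
    for years in (1, 10, 25, 50):
        weeks = 52 * years
        print(f"{years:2d} years: P(lucky every single week) = {(1 - p) ** weeks:.3f}")

    #  1 year:  ~0.95    10 years: ~0.59    50 years: ~0.07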
