
Sunday, September 23, 2007

Honey, I lost the nuclear weapons - Bent Spear


According to the Washington Post, the US military lost track of 6 nuclear weapons for 36 hours while transporting them under very light security from North Dakota to Louisiana. My interest in this is again in the general problem of what it takes to produce a reliable system (of any type) that protects us against dangerous errors, and how such systems break down.

As with the classic "swiss cheese" model of protection, in this case there were multiple layers of simultaneous failures. And, as usual, the common thread was a strong belief in a mistaken mental model that, once launched, managed to obscure contrary details despite systems carefully designed to catch just such an error.
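To make that point concrete, here is a minimal sketch in Python (the numbers are purely hypothetical, not figures from the investigation) of why layered defenses only help when their failure modes are independent: a single shared mistaken belief can defeat every layer at once.

```python
import random

# Minimal sketch with made-up numbers: each of several independent
# checks misses a problem 5% of the time. If the checks are truly
# independent, an error has to slip past all of them to get through.
# A shared mistaken belief ("these are dummy warheads") correlates the
# failures, so one wrong assumption can defeat every layer at once.

LAYERS = 4                 # number of defensive layers (checks)
P_MISS = 0.05              # chance any single check misses the problem
P_SHARED_BELIEF = 0.05     # chance the whole crew starts from the wrong assumption
TRIALS = 200_000

def breach(correlated):
    """Return True if the error gets past every layer."""
    if correlated and random.random() < P_SHARED_BELIEF:
        # The mistaken mental model is shared: every check "confirms" it.
        return True
    return all(random.random() < P_MISS for _ in range(LAYERS))

independent = sum(breach(False) for _ in range(TRIALS)) / TRIALS
shared = sum(breach(True) for _ in range(TRIALS)) / TRIALS
print(f"independent layers:       breach rate ~{independent:.6f}")
print(f"with shared false belief: breach rate ~{shared:.6f}")
```

With these invented numbers, four independent 95%-reliable checks let an error through only a few times per million, while the shared-belief case lets it through about five times per hundred - the layers protect only to the degree that their failures are uncorrelated.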

The question I have, as with my Comair 5191 analysis, is what we can learn from this, not which person to blame was last in the chain in what is certainly a "system-level" failure. The common point of failure here is most likely the power of a belief - in this case, that the missiles had dummy warheads - and the corresponding sense that surely the people before me did their jobs. The copilot of Comair 5191, handed the controls while the plane was already picking up speed down the wrong runway, faced the same serious problem in questioning the prior actions of a superior officer.

It is precisely the kind of accident that requires "mindfulness" of the type Karl Weick warned us about. (See my post with links to the high-reliability organization literature.) It is why the Army has gone with a more open model of management than you would expect. (See FM 22-100, the US Army Leadership Field Manual.) The Army is a "learning" organization, and it has learned, the hard way, that only a strong culture of safety is enough to overcome the forces that suppress the eyes-open mindfulness required to see that something is wrong and question it. All of the command-and-control, top-down discipline in the world cannot make the eyes at the bottom work as well.


Missteps in the Bunkers
Sept 23, 2007
Washington Post
By Joby Warrick and Walter Pincus
Washington Post Staff Writers
Sunday, September 23, 2007; A01
excerpts: (emphasis added)

The airmen attached the gray missiles to the plane's wings, six on each side. After eyeballing the missiles on the right side, a flight officer signed a manifest that listed a dozen unarmed AGM-129 missiles. The officer did not notice that the six on the left contained nuclear warheads, each with the destructive power of up to 10 Hiroshima bombs.

That detail would escape notice for an astounding 36 hours, during which the missiles were flown across the country to a Louisiana air base that had no idea nuclear warheads were coming. It was the first known flight by a nuclear-armed bomber over U.S. airspace, without special high-level authorization, in nearly 40 years.

Three weeks after word of the incident leaked to the public, new details obtained by The Washington Post point to security failures at multiple levels in North Dakota and Louisiana, according to interviews with current and former U.S. officials briefed on the initial results of an Air Force investigation of the incident.

The warheads were attached to the plane in Minot without special guard for more than 15 hours, and they remained on the plane in Louisiana for nearly nine hours more before being discovered. In total, the warheads slipped from the Air Force's nuclear safety net for more than a day without anyone's knowledge.

A simple error in a missile storage room led to missteps at every turn, as ground crews failed to notice the warheads, and as security teams and flight crew members failed to provide adequate oversight and check the cargo thoroughly. An elaborate nuclear safeguard system, nurtured during the Cold War and infused with rigorous accounting and command procedures, was utterly debased, the investigation's early results show.

The Air Force's account of what happened that day and the next was provided by multiple sources who spoke on the condition of anonymity because the government's investigation is continuing and classified.

Air Force rules required members of the jet's flight crew to examine all of the missiles and warheads before the plane took off. But in this instance, just one person examined only the six unarmed missiles and inexplicably skipped the armed missiles on the left, according to officials familiar with the probe.

"If they're not expecting a live warhead it may be a very casual thing -- there's no need to set up the security system and play the whole nuclear game," said Vest, the former Minot airman. "As for the air crew, they're bus drivers at this point, as far as they know."

'What the Hell Happened Here?'

The news, when it did leak, provoked a reaction within the defense and national security communities that bordered on disbelief: How could so many safeguards, drilled into generations of nuclear weapons officers and crews, break down at once?

Military officers, nuclear weapons analysts and lawmakers have expressed concern that it was not just a fluke, but a symptom of deeper problems in the handling of nuclear weapons now that Cold War anxieties have abated.

"When what were multiple layers of tight nuclear weapon control internal procedures break down, some bad guy may eventually come along and take advantage of them," said a former senior administration official who had responsibility for nuclear security.

A similar refrain has been voiced hundreds of times in blogs and chat rooms popular with former and current military members. On a Web site run by the Military Times, a former B-52 crew chief who did not give his name wrote: "What the hell happened here?"

A former Air Force senior master sergeant wrote separately that "mistakes were made at the lowest level of supervision and this snowballed into the one of the biggest mistakes in USAF history. I am still scratching my head wondering how this could [have] happened."

Actually, the right question isn't how this particular event happened. As James Reason noted, things will break. The more complex a system is, the more ways it can break.

The right question is, why would such events happen more than once? Why isn't there a learning curve at least as strong as the forgetting curve? How can people at a low level raise their hands hundreds of times on web-logs and not be seen or heard once at higher levels? Was this actual incident preceded by at least one "near-miss" that could have been attended to and learned from, that wasn't?

Lloyd Dumas' book "Lethal Arrogance - Human Fallibility and Dangerous Technologies" and Charles Perrow's classic "Normal Accidents - Living with High-Risk Technologies" have the same message: the social side of managing risk is where the problems must be stopped. As Perrow says, "systems complexity makes failures inevitable." As the cover of that book notes,
[Perrow's] research undermines the promises that "better management" and "more operator training" can eliminate catastrophic accidents.

Dumas has a more extreme analysis. He concludes that as everything becomes more interdependent in unexpected ways, it will never be possible to stop every combination of events that could lead to certain accidents, and he recommends an astounding thing for such cases - that society should voluntarily and proactively get rid of the technologies that cannot be fully controlled.

Reading Laurie Garrett's description of solitary 16-year-old soldiers protecting the Soviet Union's abandoned bio-warfare facilities, you can see how unexpected events, such as the collapse of a government, can remove all the safeguards we had been counting on in one step.

So, maybe, "We trust we won't have to use these...." isn't the world's safest strategy. That's a whole different debate.

For the technologies already in play, we need to bring our social systems up to the level of our technical systems, or we face a long list of risks that are certain in all but the date they will occur. Point-wise "control" systems that depend on message-passing through formal channels need to be supplemented with diffuse cultural controls in order to be reliable - and in order to be able to suddenly stop and say, "Wait! Something doesn't fit right here!" and be heard.

This larger problem of people lower in the chain of command seeing things and not being heard, or not being willing to voice what they see, is the real problem behind many of these accidents. And after they occur, it's all too common to hear, "I kept telling them that was going to happen one of these days, but no one would listen!"

The response from the person's superior officer or manager will predictably be "You never told me that!" or, more precisely, "I never heard you say that!" What's subtle here is that the superior officer is correct and not falsifying information - they literally never did HEAR what their subordinate said, or was trying to say. Analysis of the cockpit voice recordings of numerous accidents shows clearly that the pilots literally did not hear what the copilots were trying to tell them. Dr. Peter Pronovost's work in ICUs in Michigan hospitals showed that it is actually surprisingly hard for a lower-status person (a nurse) to be heard, literally, by a higher-status person (a surgeon), and that it takes a special step to make that work at all.

My whole post on "What I learned at Johns Hopkins last week" is about the co-dependent suppression of dissent in classrooms, which is a similar phenomenon.

People simply don't realize how powerful social suppression is between someone with authority and someone who is dependent on that person for their survival or career. Once they realize they are hitting resistance, the subordinates back off, and the superiors, literally, un-hear what was said; it vanishes from their long-term memory. Magicians use this technique all the time.
Even top generals seem to collapse to Jello when trying to contradict the President, by some accounts. This is far harder than it "seems," which means our intuition is terrible and our mental model needs a lot of work and retuning. (That, of course, is what System Dynamics "flight simulators" are about.)

That's where we need to develop expertise, spend research dollars, and focus research attention. One more set of rules, or standard operating procedures, or one more layer of the same kind of thing we have now will just fail the same way the existing layers failed here.

We were lucky this time.

As Michael Osterholm points out in his book "Living Terrors - What America Needs to Know to Survive the Coming Bioterrorist Catastrophe",
You have to be lucky all the time - we have to be lucky just once.
- Irish Republican Army
This needs more attention.

Friday, June 15, 2007

More on foreclosures - from the Baltimore Sun


The Baltimore Sun had a top front page story this morning
AT-RISK LOANS RISING IN STATE (in caps in the original)
by Jamie Smith Hopkins - 6/14/07

Here are some highlights.

Geographic and socioeconomic distribution:
Rates in the suburbs are rising faster than in Baltimore, with defaults of pricey suburban homes, condo-conversion projects and even an undeveloped section of a new-home community in Harford County - which went back to the lenders at an auction this week.... loan problems are particularly focused in a handful of states, ones with persistent job losses - Ohio, Indiana and Michigan - and ones that had a lot of real estate speculation, including California and Florida.

The number of homes in the foreclosure process in Ohio, Indiana and Michigan is so high that together the three states account for about 20 percent of all U.S. foreclosures. Ohio tops the country, its share of homes in the foreclosure process more than six times Maryland's. [note: this is the Big Three auto industry downsizing effect]

Trend:
"I think there is a clear indication that the number of foreclosures is only going to increase," said Phillip R. Robinson, executive director of Civil Justice Inc., a Baltimore legal-help group. "The concern that I have as a public-interest advocate is, what do we do to help people in that pipeline save their home ... and how do we prevent people from getting into inappropriate loans?" [emphasis added]


Root cause and solutions:

Now, I find it interesting that the "solutions" to this problem all seem to involve some kind of governmental legal or policy action.

One "solution" in Maryland and elsewhere is to use taxpayer money to bail out those in trouble and, well, reward those who got this whole thing rolling in the first place which, one might think, would only reinforce that behavior in the future. ("unintended consequence"?)
The state said this week it has commitments for $100 million to refinance Maryland homeowners from such ARMs into fixed-rate mortgages so borrowers aren't overwhelmed. "We're going to stand up ... to protect that building block of wealth for the middle class that is homeownership," Gov. Martin O'Malley said as he announced the initiative.
Another "solution" involves "going after" those people who oversold these loans in the first place, people who say in self defense "why blame us?" Congressmen are threatening to change which agency "regulates" such loans if much stronger rules are not put in place to "bring this under control." [ Hmm... sounds like a regulatory feedback process to me...]

What is, to me, conspicuously absent from those solutions is raising the effective economic IQ of the people who fell for this very bad idea in the first place.

Trying to "regulate" this once it's at full throttle is like trying to control the flow of smoke with huge billboards, instead of putting out the fire.



We keep trying to come up with "foolproof" ways to, well, make life safe for fools. It's very expensive, and it doesn't work very well, if at all. It also results in being a real pain for those who were responsible in the first place, who end up bearing the costs for those who were irresponsible.

But the irresponsible claim immunity on account of stupidity. "The car was going so fast that there was nothing I could do to stop on the ice! It's not my fault!"

Right. And the media feed this concept. I bounce off the walls every time I see headlines like "ice causes pileup on freeway" or "fog causes 27-car crash - 5 dead."

In aviation, there is no such thing as a crash caused by "bad weather." There is "a decision to continue operations into weather beyond the skill and experience of the pilot." This gets back to people. To us.

In my mind, that's what needs to be fixed. We need to overcome collective stupidity and greed, by changing the story, and working together, and trying to have a group IQ that is at least as large as the largest individual IQ among us, if not larger.

That used to be one of the points of civilization, literature, history, science, and government.
We've abandoned that in favor of downstream damage-control efforts, the same way our health care system focuses money on heroic repair instead of prevention.

Curious. Why do we do that?

------- post script

I wanted to clarify one point. I'm not saying that people as individuals are stupid, but that the way we're interacting and interconnected (or not) is what's stupid and what's broken.

This is why understanding that basic concept about "systems" is so critical, or we can't even begin to see "where" this is broken. The problem isn't with what's between our ears, because humans are pretty smart animals. The problem is at a different "level".

There are "levels" and each one has properties that are mostly independent in the short run from other levels. I'm talking about a "systems" problem in that the way otherwise-smart people INTERACT and INTERCONNECT is what's broken here.

And therefore, that's what needs to be "fixed."

This is not a problem that just affects "dumb" people, or that has anything to do with native intelligence. This happens to doctors, scientists, airline pilots, CEOs. There's even a book called "Why Do Smart People Do Dumb Things?" or some similar title. Individuals can make just a little dent in trying to keep order around themselves, and this can be totally undone by a larger tilt to the playing field at a neighborhood or cultural level.

In the short run, "gangs" or groups of teenagers go off and do things that are incredibly stupid, almost incomprehensible, and that no single one of them would have done if left to himself. There's a "group effect." It can be for the good or for the bad - it's neutral.




(Picture from my post on The Toyota Way viewed as feedback control).

For high-reliability, critical operations, like an intensive care unit, an aircraft cockpit, or a nuclear reactor control room, the literature shows that you can't get the results you need unless both levels are engaged and working well: the individual level and the group/team level. Individuals aren't strong enough to manage alone, regardless of how bright they are.

Bryan Sexton presented data showing that 74% of commercial airline accidents happened on the very first day a new team was formed out of people who used to be on other teams. The pilots are still 20,000-hour professionals, but the "TEAM" has not yet gelled, and that leaves a gap between levels that the accident can leak through.

(See my clever cartoons on accidents leaking through "swiss cheese" that's not well interconnected, here in "The road to error".)

Loose people are like "dust in the wind" and can be blown anywhere. Interconnected people are like mountains and can defy the wind.

In that metaphor, I'm less interested in "reimbursing the dust" than I am in understanding why the mountain has turned to dust, and whether that is reversible, and if so, how and when can we start?

Wednesday, May 30, 2007

The road to error - illustrated

There are many different kinds of errors that organizational systems of humans can make, but one of the trickiest is directly related to the questions of "integrity", "transparency", and "prejudice." I want to relate these to the classic "swiss cheese" multi-layered defense system that James Reason made famous:


[ source of that slide: ..? ]

Instead of looking at the layers the way he does, let's just use one slice of cheese as a model, and examine what can happen when an organization, initially one person, has a base fully covered but then the organization starts to grow and add people.



The problem is that, as the organization spreads out one conceptual task over more and more people, gaps start to occur in the coverage. They occur particularly in the area where it's a little fuzzy which person or team's job it is to handle that task.





This seems to me to be an intrinsic failure mode for organizations. It turns out that, regardless of how good a job anyone in a company can do, if they don't actually do it, their skill level doesn't matter. Furthermore, a very common way for people not to do a job is for them not to realize that it's their job to do. In some organizations this might be accompanied by a twinge of remorse, but then a resigned "It's not my job!" and the task is forgotten.

So, when a task that used to be something one person does gets divided up among many people, there is a risk that none of those people will decide the task is theirs to do, regardless of how well-intentioned or skilled they are. This effect can completely neutralize years of effort spent getting skilled at a task. Things, almost literally, "fall through the cracks."
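As a toy illustration of that risk (the probabilities and the "dilution" rule below are invented assumptions, not measurements), compare the chance a task goes unclaimed when one person clearly owns it versus when it is spread across a growing group:

```python
# Toy model of responsibility diffusion. Hypothetical assumption: when
# n people share an ambiguous task, each individual's chance of
# treating it as "their job" shrinks roughly with group size,
# p_claim(n) = 0.99 / n.

def p_nobody_does_it(n):
    """Probability that none of the n people picks up the task."""
    p_claim = 0.99 / n
    return (1 - p_claim) ** n

for n in (1, 2, 5, 10, 50):
    print(f"{n:3d} people sharing the task: "
          f"{p_nobody_does_it(n):.1%} chance it falls through the cracks")
```

Under that assumption, a task with one clear owner is dropped about 1% of the time, while the same task split among ten people is dropped more than a third of the time, even though every individual is behaving reasonably.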

And the cracks almost always appear, if the task and organization keep growing and growing and adding more and more people to distribute a single conceptual task among. Soon, the organization looks like the following, with entire "silos" of separate groups, and each silo broken into a pecking order of elites, middle class, and bottom rung workers of some kind. Now there are a lot of gaps, but still, the gaps are fairly small.



But, as the organization continues to grow and evolve more specialized skills in each local area, the people in each box start to spend more time talking to each other than they do talking to people outside their own little box. It's more convenient, and the language is more directly relevant. We all speak the same language. It begins to become "us" here in this box, versus "them" out there in other boxes.



Still, the teams may be cooperating, but that won't last. Sooner or later, messages are missed, or silence itself becomes interpreted as a hostile message. Something falls through the cracks, there is a storm of blame and recrimination, and a deadly spiral sets in of becoming more and more convinced that all problems are due to the people in the other boxes, who are surely idiots or else have evil intent. The boxes draw away from each other in a mild form of disgust. The "us" becomes fractured into many different kinds of "us".



As the communication between teams becomes more hostile, "management" may decide to simplify the problems by having all communications go through them. The number of connections going into any one box is now at most two: one from above, and one going to a box below in the pecking order. This allows the fabric of the cheese to twist around the thin connecting segments, as if around an axle. Within each section of cheese, this goes unnoticed, because their world is still fine, locally.



Then, the layer of cheese may start to warp and become a curved surface, not a flat surface. Again, seen from within that section, everything is fine, because the observers in that "flat land" are measuring a curved surface with curved rulers, and it looks just fine. Even simple facts and reasoning from other sections, however, don't seem to make sense anymore, because they don't line up correctly. This is attributed to the other group losing touch with reality.




Finally, the fabric of the organization is so frayed and fragmented that whole pieces fall off, unnoticed from within. Now you can "drive a small truck" through the gaps and holes, but again this is not visible from inside each segment, because it spends zero time pondering the middle territory or white space. That space is "not our job" but is "someone else's job".

This condition of an organization is now somewhat stable. Life goes on, and a number of errors come and go, with everyone attributing the errors to everyone else, and shaking their heads at how those "others" aren't doing their jobs. Other groups are seen as actively hostile enemies, blaming us for things we didn't do. Relations deteriorate. Errors abound.

Now the amazing thing is that this can occur even though each team is doing an almost perfect job of managing what they see as their own turf.

The error occurs in a place we are so unfamiliar with that we don't even have a name for it. I call it the M.C. Escher Waterfall Error, after this work of Escher. At first glance, and even on close inspection, the image seems a little strange, but harmless.


A closer inspection reveals that the water, however, is following an impossible path.

It flows down a waterfall, then flows down a zigzag of channels, and finds itself back at the top of the waterfall, so it falls down the waterfall... and so on, forever. It's a perpetual motion machine.

The vertical columns in the middle tier in front have something terribly wrong with them too.

And yet, if you look at any small part of this lithograph, nothing seems wrong.

This is a problem we are simply not used to encountering - the detail level is correct, but the larger global level is clearly absurd and wrong.

We have "emergent error", sort of the opposite of synergy.

The swiss cheese and waterfall pictures are meant to illustrate that organizations break down in a funny way, where all the pieces continue to work but the overall integrity falls apart, in a very subtle and unnoticed way. In fact, it is generally hard to get anyone to pay attention to the fact that something serious is wrong, because anyone can see, from inside, that everything (that you can see from inside) is correct. (We have run into Gödel's Theorem as a problem.)
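Here is a small, hypothetical sketch of the same failure in code: every local link in a "flows downhill" relation passes inspection, but only a check that spans the whole structure finds the impossible loop. The segment names are invented for the example.

```python
# Toy model of the Escher-waterfall error: every local link looks fine
# ("this segment drains into the next one"), but the global path is
# impossible because the links form a loop. No single edge is wrong;
# the wrongness exists only at the level of the whole structure.

flows_to = {                           # hypothetical channel segments
    "waterfall_top": "waterfall_bottom",
    "waterfall_bottom": "channel_1",
    "channel_1": "channel_2",
    "channel_2": "waterfall_top",      # locally plausible, globally absurd
}

def find_cycle(graph):
    """Follow the links from each node and report a loop if one exists."""
    for start in graph:
        seen, node = [start], graph.get(start)
        while node is not None:
            if node in seen:
                return seen[seen.index(node):] + [node]
            seen.append(node)
            node = graph.get(node)
    return None

# Every local inspection passes...
for a, b in flows_to.items():
    print(f"local check OK: {a} -> {b}")

# ...but the global inspection does not.
print("global check finds a loop:", " -> ".join(find_cycle(flows_to)))
```

The algorithm itself is beside the point; what matters is that the defect is invisible to any check whose scope is a single box, and appears only to a check that looks at the whole sheet.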

Conclusions:
1) Just because everything locally measures as fine does not mean things are fine.
2) Even if everyone can do a perfect job, that won't matter if they don't do it.
3) They won't do it if it's not perceived as "their job".
4) This mode of breakdown is very insidious, but I think it is also very common.

This kind of expansion and condensation and specialization needs to be balanced with a corresponding effort at reintegration, although it may seem a minor and non-urgent task.

Then, something huge comes through the gap, and everyone is astounded that such a thing could happen.

Another post will deal with ways to address it. This post is just to document that there is a type of problem that organizations can suffer, a malady or disorder or disease, that is very difficult to trace locally. It always seems to be coming from "over there", but if you go "over there" you see that it isn't coming from "over there" either. It locks itself down with blame, stereotyping, and sullen bitterness about having to put up with "those idiots" in the other departments who keep messing things up. It is hard to decipher because the simplest messages from other departments don't even make sense and you have to wonder if they've remembered to take their medications lately. The more errors go through the hole, the more people lock into blaming each other, and the more the subsections curl up to avoid touching the other sections and withdraw into their own comfortable world where people talk sense and behave rationally.

No one is doing anything wrong, and everyone is doing something wrong, but the wrongness is subtle. It has something to do with whether everyone is OK with not being clear whose job a task might be, and not being able to find out whose job it is. If people are "responsibility seeking", this may be less likely than if they are "responsibility avoiding" as an ethic. If people feel an error is "not my problem" or "someone else's problem" this can worsen.

If the world is divided into "us" and "them", there is always a middle ground that is very confusing and not clearly us and not clearly them. Errors flow to that ground, like pressurized gas trying to escape. If there are cracks between teams, errors seem eerily capable of finding them. The errors are remarkably resilient to efforts to track them down and fix them, and seem to keep happening, as if those idiots over there have no learning curve at all.

But, it is a very dangerous wrongness, if this problem occurs on a global scale, and teams don't just get annoyed at each other and fight figurative wars, but actually start dropping explosive devices on each other in order to stop the continual assault they feel they are under.

It may also be that the effort to reduce animosity by controlling all communications between hostile teams, routing them through management, is well intended, but is based on a model of communication that is single-channel, explicit, context-independent, and rooted deeply in processing linear strings of symbols, where one mistake can throw off everything. The communication that takes place before the body is fractured and fragmented, however, is more like image processing: it is multi-channel, implicit, context-dependent, and not based on symbol processing, so it is robust and fairly immune to point noise. In fact, generally, changing a single pixel in an image has zero effect on the contained communication.

It may be that what is needed is a lot more socializing, and sloppy, many-to-many, uncontrolled interactions, as a kind of glue to keep the pieces from falling apart. As Daniel Goleman notes in his book Social Intelligence, humans have a great many different ways to synchronize and coordinate with each other, most of which are non-verbal, very fast, and intrinsically sloppy and prone to pointwise error. Those errors are made up for by having massively parallel communications, not by reducing communications to a single, very tightly regulated channel. There is not enough bandwidth in a single channel to synchronize two disparate groups at all points. The groups can "twist" and "rotate" around that channel and move out of synch. Best efforts mysteriously fail.
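A rough sketch of that contrast, using a toy encoding rather than any real protocol: one flipped symbol destroys the single-channel message exactly, while a handful of sloppy parallel copies recover it with a simple per-position majority vote.

```python
import random

# Toy comparison (not a real protocol): a single-channel, symbol-by-
# symbol message is broken by one flipped character, while many noisy
# parallel copies of the same message survive point noise via a simple
# majority vote at each position.

message = "WARHEADS ARE LIVE"

def corrupt(text):
    """Flip one character at a random position (point noise)."""
    i = random.randrange(len(text))
    return text[:i] + chr((ord(text[i]) + 1) % 128) + text[i + 1:]

# Single channel: one bad symbol and the exact message is gone.
single = corrupt(message)
print("single channel :", repr(single), "| intact:", single == message)

# Many parallel channels: each copy takes its own hit, but a
# per-position majority vote across the copies recovers the original.
copies = [corrupt(message) for _ in range(9)]
voted = "".join(max(set(chars), key=chars.count) for chars in zip(*copies))
print("majority voted :", repr(voted), "| intact:", voted == message)
```

The redundancy looks wasteful by single-channel standards, but it is exactly that sloppy parallelism that keeps the shared picture intact when any one message is missed or garbled.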

References
James Reason. Human error: models and management. BMJ 2000;320:768-770 (18 March).