Wednesday, May 30, 2007

The road to error - illustrated

There are many different kinds of errors that organizations of humans can make, but one of the trickiest is directly related to the questions of "integrity", "transparency", and "prejudice". I want to relate these to the classic "Swiss cheese" multi-layered defense model that James Reason made famous:


[Figure: Reason's "Swiss cheese" model of layered defenses, each layer with holes; see the Reason paper in the references below.]

Instead of looking at the layers the way he does, let's use just one slice of cheese as a model, and examine what can happen when an organization, initially one person, has its base fully covered but then starts to grow and add people.



The problem is that, as the organization spreads one conceptual task over more and more people, gaps start to occur in the coverage. They occur particularly in the areas where it's a little fuzzy which person's or team's job it is to handle that task.





This seems to me to be an intrinsic failure mode of organizations. It turns out that, regardless of how good a job anyone in a company can do, if they don't actually do it, their skill level doesn't matter. Furthermore, a very common way for people not to do a job is for them not to realize that it's their job to do. In some organizations this might be accompanied by a twinge of remorse, but then comes a resigned "It's not my job!" and the task is forgotten.

So, when a task that used to be something one person does gets divided up among many people, there is a risk that none of those people will decide the task is theirs to do, regardless of how well-intentioned or skilled they are. This effect can completely neutralize years of effort spent getting skilled at a task. Things, almost literally, "fall through the cracks."
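
A toy back-of-the-envelope model can make this concrete. The 90% figure and the assumption that felt ownership dilutes evenly as 0.9/n are mine, purely for illustration, not anything from the post or from organizational research:

```python
def p_task_dropped(n_people, total_ownership=0.9):
    # Assume a lone owner picks the task up 90% of the time, and that
    # splitting the task dilutes each person's felt ownership to
    # total_ownership / n_people. The task is dropped only if nobody
    # independently decides it is theirs to do.
    p_each = total_ownership / n_people
    return (1 - p_each) ** n_people

for n in (1, 2, 5, 10, 100):
    print(f"{n:3d} people -> task dropped {p_task_dropped(n):.1%} of the time")
```

Since (1 - 0.9/n)^n approaches e^-0.9, about 41%, a task that a lone owner handled reliably ends up dropped roughly two times in five once ownership is spread thin enough, no matter how skilled each individual is.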

And the cracks almost always appear if the task and the organization keep growing, adding more and more people among whom a single conceptual task is distributed. Soon the organization looks like the following, with entire "silos" of separate groups, each silo broken into a pecking order of elites, middle class, and bottom-rung workers of some kind. Now there are a lot of gaps, but still, the gaps are fairly small.



But, as the organization continues to grow and to evolve more specialized skills in each local area, the people in each box start to spend more time talking to each other than to people outside their own little box. It's more convenient, and the language is more directly relevant: "We all speak the same language." It begins to become "us" here in this box, versus "them" out there in other boxes.



Still, the teams may be cooperating, but that won't last. Sooner or later, messages are missed, or silence itself becomes interpreted as a hostile message. Something falls through the cracks, there is a storm of blame and recrimination, and a deadly spiral sets in, with each group becoming more and more convinced that all problems are due to the people in the other boxes, who are surely idiots or else have evil intent. The boxes draw away from each other, in a mild form of disgust. The "us" becomes fractured into many different kinds of "us".



As the communication between teams becomes more hostile, "management" may decide to simplify the problems by having all communications go through them. The number of connections going into any one box is now at most two: one from above, and one going to a box below in the pecking order. This allows the fabric of the cheese to twist around the thin connecting segments, as if around an axle. Within each section of cheese this goes unnoticed, because locally the world still looks fine.
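
As a rough sketch of how thin that fabric becomes (the counting below is my own illustration, not from the post): a pure chain of command forms a tree, where every link is a single point of failure, while free many-to-many contact leaves redundant paths between any two teams.

```python
def hierarchy_links(n_teams):
    # A strict chain of command is a tree: n - 1 links in total, and
    # cutting any single one splits the organization in two.
    return n_teams - 1

def mesh_links(n_teams):
    # Everyone-talks-to-everyone: n * (n - 1) / 2 links, so losing any
    # one link still leaves a path between every pair of teams.
    return n_teams * (n_teams - 1) // 2

for n in (5, 10, 20):
    print(f"{n:2d} teams: hierarchy {hierarchy_links(n):3d} links, "
          f"mesh {mesh_links(n):3d} links")
```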



Then the layer of cheese may start to warp and become a curved surface rather than a flat one. Again, seen from within a section, everything still measures as fine, because the observers in that "flatland" are gauging a curved surface with curved rulers. Even simple facts and reasoning from other sections, however, no longer seem to make sense, because they don't line up correctly. This is attributed to the other group's having lost touch with reality.




Finally, the fabric of the organization is so frayed and fragmented that whole pieces fall off, unnoticed from within. Now you can "drive a small truck" through the gaps and holes, but again this is not visible from inside each segment, because each segment spends zero time pondering the middle territory or white space. That space is "not our job"; it is "someone else's job".

This condition of an organization is now somewhat stable. Life goes on, and a number of errors come and go, with everyone attributing the errors to everyone else, and shaking their heads at how those "others" aren't doing their jobs. Other groups are seen as actively hostile enemies, blaming us for things we didn't do. Relations deteriorate. Errors abound.

Now the amazing thing is that this can occur even though each team is doing an almost perfect job of managing what they see as their own turf.

The error occurs in a place we are so unfamiliar with that we don't even have a name for it. I call it the M.C. Escher Waterfall Error, after Escher's lithograph Waterfall. At first glance, and even on close inspection, the image seems a little strange, but harmless.


A closer inspection reveals, however, that the water is following an impossible path.

It flows down a waterfall, runs along a zigzag of channels, and finds itself back at the top of the waterfall, so it falls down the waterfall again, and so on, forever. It's a perpetual motion machine.

The vertical columns in the middle tier in front have something terribly wrong with them too.

And yet, if you look at any small part of this lithograph, nothing seems wrong.

This is a problem we are simply not used to encountering - the detail level is correct, but the larger global level is clearly absurd and wrong.

We have "emergent error", sort of the opposite of synergy.

The Swiss cheese and waterfall pictures are meant to illustrate that organizations break down in a peculiar way: all the pieces continue to work, but the overall integrity falls apart, subtly and unnoticed. In fact, it is generally hard to get anyone to pay attention to the fact that something serious is wrong, because anyone can see, from inside, that everything (that you can see from inside) is correct. (We have run into Gödel's Theorem as a problem.)

Conclusions:
1) Just because everything locally measures as fine does not mean things are fine.
2) Even if everyone can do a perfect job, that won't matter if they don't do it.
3) They won't do it if it's not perceived as "their job".
4) This mode of breakdown is very insidious, but I think it is also very common.

This kind of expansion, condensation, and specialization needs to be balanced with a corresponding effort at reintegration, even though reintegration may seem a minor and non-urgent task.

Otherwise, one day, something huge comes through the gap, and everyone is astounded that such a thing could happen.

Another post will deal with ways to address it. This post is just to document that there is a type of problem that organizations can suffer, a malady or disorder or disease, that is very difficult to trace locally. It always seems to be coming from "over there", but if you go "over there" you see that it isn't coming from "over there" either. It locks itself down with blame, stereotyping, and sullen bitterness about having to put up with "those idiots" in the other departments who keep messing things up. It is hard to decipher because the simplest messages from other departments don't even make sense and you have to wonder if they've remembered to take their medications lately. The more errors go through the hole, the more people lock into blaming each other, and the more the subsections curl up to avoid touching the other sections and withdraw into their own comfortable world where people talk sense and behave rationally.

No one is doing anything wrong, and everyone is doing something wrong, but the wrongness is subtle. It has something to do with whether everyone is OK with its being unclear whose job a task might be, and with being unable to find out whose job it is. If people are "responsibility seeking" as an ethic, this may be less likely than if they are "responsibility avoiding". If people feel an error is "not my problem" or "someone else's problem", the condition worsens.

If the world is divided into "us" and "them", there is always a middle ground that is very confusing and not clearly us and not clearly them. Errors flow to that ground, like pressurized gas trying to escape. If there are cracks between teams, errors seem eerily capable of finding them. The errors are remarkably resilient to efforts to track them down and fix them, and seem to keep happening, as if those idiots over there have no learning curve at all.

But it is a very dangerous wrongness if the problem occurs on a global scale, and the teams don't just get annoyed at each other and fight figurative wars, but actually start dropping explosive devices on each other in order to stop the continual assault they feel they are under.

It may also be that the effort to reduce animosity between hostile teams by routing all of their communications through management is well intended, but it is based on a model of communication that is single-channel, explicit, context-independent, and rooted deeply in processing linear strings of symbols, where one mistake can throw off everything. The communication that takes place before the body is fractured and fragmented, however, is more like image processing: it is multi-channel, implicit, context-dependent, and not based on symbol processing, so it is robust and fairly immune to point noise. In fact, changing a single pixel in an image generally has zero effect on the communication it contains.
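
A minimal sketch of the contrast (the message format and the checkerboard "image" are invented for illustration): flipping one character in a strictly parsed symbolic message can make the whole thing unreadable, while flipping one pixel leaves the picture effectively unchanged.

```python
import json
import random

# Single-channel, symbolic communication: one flipped character can
# invalidate (or silently alter) the entire message.
packet = json.dumps({"from": "team-a", "to": "team-b", "status": "all clear"})
i = random.randrange(len(packet))
garbled = packet[:i] + chr(ord(packet[i]) ^ 1) + packet[i + 1:]
try:
    json.loads(garbled)
    print("message parsed, but its content may have silently changed")
except json.JSONDecodeError:
    print("one flipped character made the whole message unparseable")

# Image-like, massively parallel communication: one flipped pixel is
# invisible at the level of the picture as a whole.
w, h = 100, 100
image = [[(x + y) % 2 for x in range(w)] for y in range(h)]
image[random.randrange(h)][random.randrange(w)] ^= 1
print(f"image changed by 1 pixel out of {w * h} ({1 / (w * h):.2%})")
```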

It may be that what is needed is a lot more socializing, and sloppy, many-to-many, uncontrolled interactions, as a kind of glue to keep the pieces from falling apart. As Daniel Goleman notes in his book Social Intelligence, humans have a great many different ways to synchronize and coordinate with each other, most of which are non-verbal, very fast, and intrinsically sloppy and prone to pointwise error. Those errors are made up for by massively parallel communication, not by reducing communication to a single, tightly regulated channel. There is not enough bandwidth in a single channel to synchronize two disparate groups at all points; the groups can "twist" and "rotate" around that channel and move out of synch. Best efforts mysteriously fail.
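
Here is a small sketch of that redundancy argument (the 20% per-channel error rate is an arbitrary assumption): each individual channel is sloppy, but a simple majority vote across many parallel channels drives the overall error toward zero, which no amount of care on a single channel can match.

```python
import random

def receive(bit, n_channels, per_channel_error=0.2):
    # Each sloppy channel independently flips the bit with some
    # probability; the receiver takes a majority vote over all channels.
    votes = sum(bit ^ (random.random() < per_channel_error)
                for _ in range(n_channels))
    return int(votes > n_channels / 2)

trials = 20_000
for n in (1, 3, 9, 25):
    wrong = sum(receive(1, n) != 1 for _ in range(trials))
    print(f"{n:2d} parallel channels -> wrong {wrong / trials:.2%} of the time")
```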

References
James Reason, "Human error: models and management", BMJ 2000;320:768-770 (18 March).

2 comments:

Wade said...

Lest we forget - all USA coins and currency carry the phrases "In God we trust" and "E pluribus unum" (out of many, one). (Even if the new US dollar coin hides them on the edge of the coin, instead of on either face.)

The question of how to make many people into one people was clearly a very important subject to the founders of this country. They and their legacy heirs tried to put reminders where every person could find them.

Academia doesn't put nearly enough energy into studying this question in my view.

They probably would, except that so many local industries are falling apart, and other people are making so many errors that there's no time left to focus on this problem.

It's a classic "we have no time to drain the swamp - we're too busy fighting alligators!" dilemma. There's no money for this, but there's a trillion dollars (US) available for "fighting a war" against bad people "over there" who are trying to harm us.

Maybe, some of the Homeland Security funds should be used to study how and why our corporate endeavors seem to be crumbling from within, and how to protect them from that failure mode.

Anonymous said...

This looks great!