Sunday, October 30, 2011

Architecture and the Principle of Least Authority

There is one particular model of system that might look great on the director's PowerPoint slide as a couple of boxes joined together with arrows, but in many cases becomes a death march weighed down by a huge codebase and a management process akin to Soviet-era planning, just without the vodka. And that is the large-scale integration project.

These integration-style projects are the ones I seem to have worked on more than most in my investment banking career. They usually consist of a common middle-tier architecture platform that integrates existing business-specific services to provide a supposedly new and improved experience for the end users.

Lots of books and articles have been written about large-scale issues in development and the organisational and development anti-patterns that lead to such dysfunctional outcomes. However, I think many of these are really descriptions of the effects of a much more widespread cognitive bias that affects many walks of life (most notably politics and economics). It is most commonly manifested in the fallacy of central planning.

Central planning is a siren song for humans: the assumption that we can aggregate enough information and knowledge to overcome complexity and failure through upfront planning and control by a group of superhumanly knowledgeable and experienced individuals (usually ourselves), if only we make the plans detailed enough. From such good intentions spring totalitarianism and fragile systems.

This simple bias is in most cases shared by the developers and the management team, and often unconsciously defines the starting assumptions we take forward on the project. The most common is that the framework needs to understand and exert control over the data that passes between the two parties you are ultimately writing the system for: the end users and the business systems.

In essence you are setting yourself up as the controller and mediator of all such interactions in the system, in the same way that central government planners cannot stop themselves from trying to assert more and more control over all societal transactions.

While appearing sensible and logical, this initial mindset starts to spread its tentacles into framework concepts such as (and this list is by no means comprehensive):
  • large common data models - which have to account for every team's slightly different requirements and rapidly become a bottleneck both in development velocity (as the model grows, change becomes more difficult) and in complexity (it becomes more fragile to the unintended consequences of change).
  • a common transport format that must be used by all teams (such as XML, JSON, or Protocol Buffers), so backend and client data must be transformed at many stages into a "better" format as it passes through the system. This leads to growing binding and transformation layers that produce processing bottlenecks at the data boundaries, through the garbage generated and the impedance mismatch of data (see the sketch after this list).
  • granular access rules, based on knowledge of the data structure, that attempt to unify multiple downstream permissioning models. This spawns a monster that is usually some variant of runtime introspection plus rule-engine or static code-rule overhead on every method invocation (even if hidden in a third-party security framework).
  • re-implementation of business rules that already exist downstream, as you find yourself pulling more and more of the locus of control over the data up into the common layers. These have to be kept in sync, as those systems are seldom static and are usually a big tangle of rules that have evolved organically over time. This also leaves the framework with an (at best) linear order-n problem of overlapping rules as more systems are integrated.
  • caching a growing set of our synthetic data locally, because it takes ages to fetch the data, transform it, enrich it with common structures, apply permissions and so on. We now have an expanding cache (in many cases one that must be distributed) that somehow has to stay in sync with the backend master system or refresh itself regularly. In many cases performance pressure leads to backend data being replicated locally into DBs, which in turn suffer from data-mastering issues and schema fragility on downstream change.
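
To make the transformation point concrete, here is a rough sketch in Java (with Jackson as one example of a binding library; the class and field names are purely hypothetical, not taken from any real project) of the kind of per-boundary binding layer that tends to accumulate:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;

    // Hypothetical illustration: every hop deserialises the payload, remaps it
    // into the "common" model and re-serialises it, generating copies and
    // garbage for data the middle tier never actually needed to understand.
    public final class TradeBindingLayer {

        private final ObjectMapper mapper = new ObjectMapper();

        public byte[] toClientFormat(byte[] backendPayload) throws IOException {
            // copy 1: parse the backend's JSON into a generic map
            Map<?, ?> backendView = mapper.readValue(backendPayload, Map.class);

            // copy 2: remap field names into the "canonical" common model
            Map<String, Object> canonical = new HashMap<String, Object>();
            canonical.put("tradeId", backendView.get("trade_ref"));
            canonical.put("notional", backendView.get("amt"));

            // copy 3: re-serialise for the client transport
            return mapper.writeValueAsBytes(canonical);
        }
    }

Multiply that by every message, every boundary and every team's slightly different model, and the bottlenecks described above follow naturally.
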
From a management perspective, the growing complexity and scale bring with them the usual bedfellows of central planning:
  • More and more effort must be put into tracking and planning all the interdependencies and infrastructure, and new "processes" and document requirements start springing up that attempt to contain the growing complexity but in reality are a huge and mostly pointless exercise in hindsight bias.
  • Planning starts to take on heroic proportions as detailed 3-month and 6-month plans for the "platform features" are repeatedly developed by a team of overworked PMs, only to be thrown away every 3 weeks as real forces and (to the development team) random changes in business direction make any detailed planning irrelevant beyond the near term.
Stepping back from the practical effects of this mindset for a moment, the question is: what can we do about it?

While we may like to think that system design is so new and different, we are really talking about individuals and their biases and how structures arise in organisations. This is nothing new and has been a subject for philosophical writers from John Locke onwards.

However, we shall limit ourselves to one particular aspect of this from which we can derive some practical value, namely how we can inhibit the unconscious tendency towards central over-planning and the fragility it entails, but first a bit of background on the idea. (I shall examine the logical fallacy of forecasting the future disguised as project planning in a later article).

Locke, in his Second Treatise of Government, addressed the idea behind the "principle of least authority": essentially that we should give up only as much of our individual rights as is necessary for a society to preserve those rights. "The right to do whatever one thought fit to preserve oneself is given up to be regulated by society so far forth as the preservation of himself and others shall require. When any rights are given up, it is only with an intention in every one to better preserve himself, his liberty, and his property."

In computing, the principle of least authority is often used as a security model to isolate processes to only the data and resources they need to perform a particular task. Apologies to the security guys for overloading this meaning, but that is not the sense in which I am using it.

Instead, I am using it as a simple principle to inhibit our natural planning and centralisation biases when designing common framework systems: limit how much unintended direction and control we exert over other actors in the system while providing them with enough commonly useful runtime artefacts (and in the process greatly simplify the scope and complexity of our own code).

In effect, try to imagine that you are one of the other teams using your framework, and at each stage ask yourself what is the minimum it is reasonable for them to give up in order to operate together in the minimally restrictive society that is the overall project, while still taking advantage of commonality.

For most middle-tier frameworks the minimal set of functionality is termination of client connections, routing data efficiently to and from the client, and performing some minimal authentication. In many ways this sounds a lot like a bare-bones messaging server in concept!
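
As a rough sketch of what that minimal contract might look like in Java (the interface and names here are illustrative assumptions, not taken from any particular framework):

    import java.nio.ByteBuffer;

    // The whole job of the middle tier, reduced to its minimal form:
    // terminate connections, route opaque payloads, answer a coarse
    // authentication question. Everything else belongs to the teams.
    public interface MinimalMiddleTier {

        // coarse authentication: may this user attach to this channel at all?
        boolean mayConnect(String userId, String channel);

        // connection lifecycle: the framework owns termination, nothing more
        void onClientConnected(String sessionId, String userId);
        void onClientDisconnected(String sessionId);

        // routing of opaque payloads in both directions
        void routeToBackend(String channel, ByteBuffer opaquePayload);
        void routeToClient(String sessionId, ByteBuffer opaquePayload);
    }

Note that nothing in this contract knows or cares what the payload contains.
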

Some computing idioms implicitly practise this approach, and it is especially prevalent in areas like communication protocols and messaging systems, where it is understood that the purpose of the system is to facilitate other processes while imposing only enough restriction to enable systems to interact freely upon it. See how similar this is to Locke's idea: we give up just enough of our computing freedom to ensure that we can communicate easily with another entity, while at the same time enabling almost any data or higher-level behaviour to be modelled between the parties.

Many projects do not even consider, at this level of abstraction, just what they are trying to achieve: whether the common framework should philosophically demand total obedience of all participants at a micro level and centralise behaviour (usually the default starting assumption), or provide a minimal platform on which individual participants can build behaviours as simple or as complex as is appropriate. Often the choice merely reflects the worldview of the individual developers and managers, and it is not an introspection people readily undertake.

Even many projects that start out with a messaging-like paradigm often find themselves re-implementing downstream functionality as a result of centralisation forces pulling them in that direction like an undertow.

All that said, what about the practical outcome of this? After all, it is all about building a running system, irrespective of the philosophical approach used to get there. I shall take a real example of a system (although for simplicity and business reasons I cannot go into too much detail) on which I had the fortunate opportunity to design a significant portion of the framework, put these ideas into practice, and see how the approach worked out in the real world.

Project A is an integration project that will ultimately integrate functionality from 20-30 teams (at peak probably > 300 developers & PMs) in the organisation, and accordingly result in 20-30 separate development streams. Teams from each area will be tasked with providing specific functionality built on a common framework, and the framework developers themselves will be tasked with providing a common middle-tier infrastructure for each team to leverage. From a standing start the project will have at least a handful of business teams working in parallel, and the others will be brought on quickly.

Keeping in mind how centralisation creeps up on us, an early decision was made to impose as few restrictions as possible on the other teams, using our simple principle as a guide for the framework design. This resulted in the following design practices:
  • All data passed between the clients and the backend systems would be treated as opaque to the framework as far as possible.
  • As little transformation as possible (and ideally none) should be needed in the framework paths. A major aim was that data only needed to be understood at the point where a process needed to act upon it.
  • No transformation of format should be mandated as data passed between layers.
  • An asynchronous and a synchronous path would be provided between the clients and the downstream systems, but these would be modelled with no business interfaces at all and offer no further abstraction than a single un-typed Object parameter (a rough sketch of this contract follows the list).
  • Simple client termination handling would be offered, with some minimal abstraction from the connection lifecycle boilerplate.
  • The client modules that deal with a particular team are the only part of the system that needs to understand a particular data item in any detail. This effectively makes a contract between the client module and the downstream system, and no other process cares about this data. As both tend to be written by the same team, the dev teams in effect have a contract with themselves that sits outside the framework's control.
  • Pluggable communication protocols would be used as appropriate for each downstream system and would be flexible enough to deal with messaging, bespoke point-to-point, multicast, REST and HTTP and whatever else came up. 
  • Some standards would be encouraged but not mandated for data types and communication, mainly to give teams a simple starting point from which to evolve their thinking.
  • Runtime instances would be specific for each team, but minimally managed by the framework so no team could introduce side-effects or problems for other teams.
  • Authentication and authorisation at any granular level were managed by the existing downstream auth systems and entities.
  • A very strong push back against duplicating downstream data.
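
As an illustration of the un-typed contract mentioned above, a minimal sketch in Java (the names and shape of these interfaces are assumptions for illustration, not the actual project code):

    // The only abstraction offered between client modules and downstream
    // systems: opaque payloads over a synchronous and an asynchronous path.
    public interface DownstreamChannel {

        // synchronous path: opaque request in, opaque reply out
        Object request(Object opaquePayload) throws Exception;

        // asynchronous path: opaque payload out, replies via callback
        void send(Object opaquePayload, ResponseListener listener);

        interface ResponseListener {
            void onResponse(Object opaquePayload);
            void onError(Exception error);
        }
    }

    // Each downstream system plugs in its own transport (messaging, bespoke
    // point-to-point, multicast, REST over HTTP, ...) behind the same contract.
    interface TransportPlugin {
        DownstreamChannel open(String channelName, java.util.Properties config);
    }

Only the client module for a given team ever casts or interprets the payload; the framework simply moves it.
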
While this sounds like a recipe for a free-for-all and goes against the grain of many of the enterprise design books you have read, in practice things turned out a bit differently. Let's examine the actual outcome in the light of our hypothesis and statements:

Positives:
  • The framework has not (yet) suffered from massive complexity and bloat. Most features that are accepted centrally are driven bottom-up by more than one team needing the same abstract function. Even so, we are parsimonious in accepting these, as some have turned out to be transient.
  • Re-use is not as prevalent as one would think given the copious literature.
  • All teams were able to get up and running pretty quickly and develop in parallel their own specific functions without too much assistance.
  • Pass-through is a lot more useful than it first appears. Many teams can deliver most of the functionality they require with a mostly pass-through framework.
  • Security introspection of data can be performed surprisingly well under this model with little direct overhead (especially if you do not transform or deserialise the data as a side effect).
  • We did not need to scale the infrastructure team to tens of people as one commonly finds on such projects as the code base stayed relatively small. 
  • Individual teams do not need to care about a common model and can create abstractions that suit themselves. 
  • It is far easier to perform optimisations when the framework is very bounded in its behaviour.
  • Business interfaces in the middle tier did not, surprisingly, arise spontaneously in each team's code base - in fact they hardly arose at all. Most of the business-specific functions have stayed down in the backend systems where they belong and have not been dragged up.
  • Most teams got along easily when they realised that they do not need to follow the usual abstraction/transformation/aggregation activities that normally characterise the middle tier.
  • Middle-tier authentication is very coarse (e.g. can user X connect to a particular channel of data; a sketch of this check follows the list).
  • We have so far not had a compelling need for the over-arching big giant cache pattern, or much local DB replication, apart from a couple of business-specific cases.
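
To show just how coarse that middle-tier check can stay when fine-grained entitlements remain downstream, a small sketch in Java (the entitlement loading and names are hypothetical):

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // The framework only ever answers "may user X connect to channel Y?".
    // No payload inspection, no per-method rules - just a set lookup,
    // populated from the existing downstream entitlement systems.
    public final class ChannelAuthoriser {

        private final Map<String, Set<String>> allowedChannels =
                new ConcurrentHashMap<String, Set<String>>();

        public void grant(String userId, Set<String> channels) {
            allowedChannels.put(userId, channels);
        }

        public boolean mayConnect(String userId, String channel) {
            Set<String> channels = allowedChannels.get(userId);
            return channels != null && channels.contains(channel);
        }
    }
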
Negatives:

  • Teams are sometimes unwilling to accept more ownership and responsibility.
  • The top-down process burden has not reduced as much as one would expect. It seems planning is too much of a security blanket for some people to let go of just yet despite all the evidence to the contrary.
  • Deployment and testing have more complexity as each team's application has a certain number of unique behaviours.
  • Network design is more complicated.
  • No matter how simple the framework there are always unexpected uses it gets put to.
  • Having only a few restrictions is sometimes mistaken for having no restrictions.
  • In true whack-a-mole style, some irreducible complexity ends up being inside the framework to keep the interfaces simple.

I am not saying that this is a silver bullet for all your design needs, nor am I saying that the dogmatic idea of no planning at all is always right. Rather, we can plan to encourage self-direction in development teams by being extremely considered about the freedoms we take away from other participants. Encouraging bottom-up feature origination can in practice result in quite a few positive benefits. As in life, the areas that become successful are often the ones that were not even on the radar at the planning stage.

So what can we take from this experiment? One lesson is that subtle starting assumptions can have very large effects; a second is that coherent systems can be built upwards without detailed planning (as in the real world); and a third is that many things we are conditioned to think of as required may not in practice be all that important.

Such a small alteration in your thinking can produce very large outcome differences if you can challenge your own biases and act to limit your inner central planner. And as a bonus you might find that you are freed just that little bit from the gulag experience that characterises many large scale projects.

If you find yourself reading this and thinking that this is a quick-and-dirty approach, and that once we have it working we could just define a new all-seeing plan to refactor and centralise the functions to make it cleaner and more sensible, I invite you to start again at the top of the article.

