
Faster reconciliation doesn't fix drift

Configuration management asks what each value should be. Governance asks whether anyone decided. Drift accumulates in the gap between the two.

A classification of configuration decisions. A governed decision is recorded in version control, or deliberately deferred so upstream chooses. With no recorded decision, the configuration is drift in waiting.

Configuration drifts silently.

Configuration management asks what each value should be. A harder question comes first: whether anyone decided what the value should be. That question is governance, and drift accumulates in the gap between the two.

Drift is what configuration management produces when governance is missing. Every deployment check can pass while half the parameters carry values nobody confirmed. The configuration is consistent with itself and inconsistent with intent, because intent was never recorded.

Governance is the layer that establishes whether intent was recorded. Classification is how it does that: sorting every parameter by whether a decision exists.

Every parameter has a decision state

Three states cover the working set.

Tracked. Someone decided what this value should be. The decision is recorded in a source of truth the operator owns, whether in version control or in a runtime mechanism. The intended state is whatever that source of truth specifies.

Defaulted. Someone decided to defer to upstream. The deferral can be pinned to upstream's default at install time, or it can float with future upstream defaults. The current state is whatever the chosen variant produces. Whether future upstream changes can land automatically is a separate decision, made by the default-value allowlist described later.

Undecided. No recorded decision exists. The current state is whatever the system happens to be running.

The first two are governed. The third is drift in waiting.
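One way to make the three states concrete is a small data model. What follows is a minimal Python sketch, not a reference implementation: `DecisionState`, `Parameter`, `is_governed`, the `decision_ref` field, and the example parameter names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DecisionState(Enum):
    TRACKED = "tracked"      # explicit value recorded in an operator-owned source of truth
    DEFAULTED = "defaulted"  # recorded deferral to upstream (pinned or floating)
    UNDECIDED = "undecided"  # no recorded decision exists


@dataclass(frozen=True)
class Parameter:
    name: str
    state: DecisionState
    # Hypothetical free-text pointer to where the decision is recorded.
    decision_ref: Optional[str] = None


def is_governed(param: Parameter) -> bool:
    """Tracked and defaulted parameters are governed; undecided ones are not."""
    return param.state is not DecisionState.UNDECIDED


# Illustrative catalog; the parameter names are made up for this sketch.
catalog = [
    Parameter("log_level", DecisionState.TRACKED, "value recorded in the config repository"),
    Parameter("http_keepalive_seconds", DecisionState.DEFAULTED, "recorded deferral, pinned at install"),
    Parameter("pool_size", DecisionState.UNDECIDED),
]

drift_in_waiting = [p.name for p in catalog if not is_governed(p)]
```

Here `drift_in_waiting` contains only `pool_size`, the one parameter with no recorded decision behind it.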

What governance adds is a system that knows which state each parameter is in and refuses to mutate the third automatically. The taxonomy fits in a paragraph. The enforcement layer takes months to build and longer to maintain.

Refuse to act on the unknown

The safety property is the same in every toolchain: when automation encounters an undecided parameter, it refuses to mutate it automatically. Only the form varies. Automation can halt the deployment for review, block promotion until the parameter is classified, or surface the parameter to an operator while other changes proceed.

The tempting alternative is to skip the undecided parameter quietly and carry on. Skipping makes the parameter invisible, and the next deployment skips it again. The state nobody chose persists across every run, and drift compounds because the automation has been told to look the other way.

Stopping makes the undecided parameter a forcing function. Someone has to record a decision: either set an explicit value (tracked) or defer to upstream's default (defaulted). The set of unknowns shrinks under deliberate work.

Failing closed on the undecided is the property that closes the loop. Without it, governance is a wish list.
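The difference between skipping and stopping can be sketched in a few lines of Python. This is an illustration under assumptions, not a prescribed implementation: the function names, the `UndecidedParameterError` exception, and the parameter names are hypothetical, and only the "halt the deployment" form is shown.

```python
from typing import Dict, List

GOVERNED = {"tracked", "defaulted"}


class UndecidedParameterError(Exception):
    """Raised when a deployment touches a parameter with no recorded decision."""

    def __init__(self, names: List[str]):
        super().__init__("no recorded decision for: " + ", ".join(names))
        self.names = names


def plan_skipping(touched: List[str], decisions: Dict[str, str]) -> List[str]:
    """Fail-open variant: undecided parameters are dropped without a trace."""
    return [n for n in touched if decisions.get(n) in GOVERNED]


def plan_fail_closed(touched: List[str], decisions: Dict[str, str]) -> List[str]:
    """Fail-closed variant: any undecided parameter halts the whole plan."""
    missing = sorted(n for n in touched if decisions.get(n) not in GOVERNED)
    if missing:
        raise UndecidedParameterError(missing)
    return list(touched)


# Illustrative inputs; pool_size has no recorded decision.
decisions = {"log_level": "tracked", "http_keepalive_seconds": "defaulted"}
touched = ["log_level", "pool_size", "http_keepalive_seconds"]

skipped_plan = plan_skipping(touched, decisions)  # pool_size silently vanishes

try:
    plan_fail_closed(touched, decisions)
    halted_on: List[str] = []
except UndecidedParameterError as err:
    halted_on = err.names  # the operator now has to classify these
```

The skipping plan succeeds on every run and never mentions `pool_size`. The fail-closed plan names it and stops, which is what turns the gap into a decision someone has to record.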

Switch this on against a brownfield catalog and every deployment stops on the first run. Fail-closed needs a bootstrap phase: catalog the existing parameters, classify the obvious ones, then enforce against the residue. How long that takes depends on parameter count, the research cost per parameter, and how much of the original reasoning is still recoverable. The forcing function works on inflow first; older parameters get classified in stages, in the order the next deploy or reconcile forces the question.
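A bootstrap pass might look like the Python sketch below. It assumes that "obvious" means a parameter already has a value in version control or a recorded deferral to upstream; that reading, the `bootstrap` function, and every name in the example are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def bootstrap(
    running: Set[str],
    in_version_control: Set[str],
    recorded_deferrals: Set[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Classify the obvious parameters and return the unclassified residue."""
    classified: Dict[str, str] = {}
    residue: List[str] = []
    for name in sorted(running):
        if name in in_version_control:
            classified[name] = "tracked"
        elif name in recorded_deferrals:
            classified[name] = "defaulted"
        else:
            # No decision can be located: left for staged, deliberate classification.
            residue.append(name)
    return classified, residue


# Illustrative brownfield catalog with made-up parameter names.
classified, residue = bootstrap(
    running={"log_level", "pool_size", "http_keepalive_seconds", "gc_pause_target_ms"},
    in_version_control={"log_level"},
    recorded_deferrals={"http_keepalive_seconds"},
)
```

Enforcement then applies to the residue only, so the first fail-closed run stops on two parameters here rather than on the whole catalog.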

The dual allowlist separates "decided" from "safe to auto-apply"

Governance establishes that someone decided. Some decided parameters still need intentional friction before the change lands: confirmation, alarm, abort, or any mechanism that slows the change down. That's a second governance decision: whether the recorded value is safe to auto-apply.

A dual allowlist makes the second decision explicit:

Tracked-value allowlist. Which tracked parameters can be reconciled automatically when running state diverges from the recorded value.
Default-value allowlist. Which defaulted parameters can be auto-applied when upstream defaults change.

A parameter outside its allowlist doesn't auto-apply. Automation reports the divergence (running state versus recorded value for tracked parameters, or previously-applied default versus new upstream default for defaulted parameters) and routes the change through whatever friction the system enforces.

Take a connection pool size. A new service ships with the parameter unspecified, leaving it undecided, and automation refuses to carry it forward silently. An operator measures peak concurrency and records an explicit value (tracked), or accepts the framework's default and records the deferral (defaulted). The dual allowlist then decides whether future changes apply automatically or surface for operator review.
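The routing logic behind the dual allowlist, applied to the connection pool example, can be sketched as follows. This is a hedged illustration: the `Action` enum, the `route` function, and the allowlist contents are hypothetical, and "review" stands in for whatever friction a given system enforces.

```python
from enum import Enum
from typing import Set


class Action(Enum):
    AUTO_APPLY = "auto_apply"  # reconcile without operator involvement
    REVIEW = "review"          # report the divergence and route it through friction
    REFUSE = "refuse"          # undecided: never mutate automatically


def route(name: str, state: str, tracked_allowlist: Set[str],
          default_allowlist: Set[str]) -> Action:
    """Decide what automation may do with a divergence on one parameter."""
    if state == "tracked":
        return Action.AUTO_APPLY if name in tracked_allowlist else Action.REVIEW
    if state == "defaulted":
        return Action.AUTO_APPLY if name in default_allowlist else Action.REVIEW
    # Anything else, including an unknown state, fails closed.
    return Action.REFUSE


# Illustrative allowlists; pool_size is in neither of them yet.
tracked_allowlist = {"log_level"}
default_allowlist: Set[str] = set()

# New service: pool_size is unspecified, so automation refuses to carry it forward.
at_launch = route("pool_size", "undecided", tracked_allowlist, default_allowlist)

# The operator measures peak concurrency and records an explicit value (tracked).
after_decision = route("pool_size", "tracked", tracked_allowlist, default_allowlist)

# Later, with enough confidence, pool_size is moved into the tracked-value allowlist.
after_allowlisting = route("pool_size", "tracked", tracked_allowlist | {"pool_size"},
                           default_allowlist)
```

The three results are `REFUSE`, `REVIEW`, and `AUTO_APPLY` in that order: recording the decision moves the parameter out of the refused set, and only a second, separate decision makes it eligible for automatic reconciliation.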

The split exists because "we know what the value should be" and "this is safe to auto-apply" are different claims. Some parameters depend on traffic patterns. Capacity envelopes set others. A few have invariants the configuration schema can't express. Auto-apply a known-correct value on one of those, and a recorded decision still produces a production incident. A traffic-dependent value misfires during a spike, a capacity-bounded one after a fleet resize, a semantics-bearing one during a version bump when the name stays stable but its meaning drifts.

The dual allowlist also makes scope expansion an evolutionary action. New parameters start outside the allowlist for their category. As confidence grows, parameters move in. When a parameter moves back out, the move is recorded too. The allowlists themselves are governed.

Trust topology decides the scheme's shape

The dual allowlist is a coordination instrument. It exists because the second decision (is this safe to auto-apply?) can be answered differently by different operators. When two parties don't share the same view of the parameter's behavior, the allowlist forces the disagreement into the open before the change lands.

Where decision authority is unified, the allowlist's coordination role collapses: there's no second party to disagree about whether a value is safe to auto-apply. Its safety role survives. A known-correct value whose semantics drift on a version bump still produces an incident for a single operator, so the auto-apply judgment stays even when one person holds it.

Where authority is divided across operators or organizational boundaries, the second decision has to be coordinated. The allowlist returns.

A 200-service operation run by a single operator collapses to the same shape as a small one. A 5-service production system run by three operators across two time zones still needs the allowlist. Scale tracks trust topology at the extremes, but division of decision authority is the structural cause.

Trust topology changes the shape. The property holds at every shape: refuse to mutate the undecided automatically.

The schema covers what it claims

The three states cover every parameter the governance layer can inspect and act on. Apparent edge cases collapse back into the schema.

A parameter set by orchestration or service discovery is still tracked. The runtime mechanism is the source of truth, and the operator owns the mechanism that produces it. The schema doesn't need a separate class for it.

A "default that isn't safe" should be tracked explicitly. If you don't trust upstream's default, the right move is to record an explicit value the operator stands behind. The alternative is a ticking time bomb: a parameter that defers to upstream while flagged as risky.

A parameter whose semantics shift between versions is the operator's re-validation burden at upgrade. The default-value allowlist already handles unexpected default changes: parameters whose defaults might shift unsafely stay out of the auto-apply allowlist, and any change surfaces for operator review.

Governance is the answer-existence layer

Configuration management is the answer-execution layer. Whatever toolchain enforces the recorded values (push-based orchestration, pull-based GitOps, a control plane that reconciles fleet-wide), its job is to make applied state match recorded state. Declarative-system sync status (out-of-sync, missing, drifted) lives here. Sync answers "does applied state match recorded state." It never looks at what the process is actually running, which is where an undecided value hides. Policy-as-code asks a related question: "does this change violate a rule." Both presuppose governance. A value can be in sync, pass every policy check, and still be undecided.
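The claim that a value can be in sync, pass policy, and still be undecided can be shown with three independent checks. The Python sketch below is illustrative only: the configuration values, the single policy rule, and the function names are invented, and real sync and policy engines are far richer than this.

```python
from typing import Dict, List

# What version control says, what the tool applied, and what the process runs.
recorded: Dict[str, object] = {"log_level": "info", "timeout_seconds": 30}
applied: Dict[str, object] = {"log_level": "info", "timeout_seconds": 30}
running: Dict[str, object] = {"log_level": "info", "timeout_seconds": 30, "pool_size": 10}

# Recorded decision states; pool_size has none.
decisions: Dict[str, str] = {"log_level": "tracked", "timeout_seconds": "tracked"}


def in_sync(applied: Dict[str, object], recorded: Dict[str, object]) -> bool:
    """Sync check: does applied state match recorded state?"""
    return applied == recorded


def passes_policy(config: Dict[str, object]) -> bool:
    """Policy check with one hypothetical rule: timeouts stay at or under 60 seconds."""
    timeout = config.get("timeout_seconds", 0)
    return isinstance(timeout, int) and timeout <= 60


def undecided(running: Dict[str, object], decisions: Dict[str, str]) -> List[str]:
    """Governance check: which running parameters have no recorded decision?"""
    return sorted(name for name in running if name not in decisions)


sync_ok = in_sync(applied, recorded)
policy_ok = passes_policy(running)
gap = undecided(running, decisions)
```

Both `sync_ok` and `policy_ok` are true, yet `gap` still lists `pool_size`. Neither of the first two checks is wrong; they just answer a different question.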

Governance sits one layer above. It establishes that a recorded state exists, by classifying every parameter into a decision state and refusing to mutate undecided parameters automatically. It also decides how much operator review each recorded decision requires before it's applied. Sync depends on governance; without it, reconciliation just makes drift look intentional.

When the two layers collapse into one, drift gets treated as a configuration management failure: stronger reconciliation, faster cadence, better tooling. None of those fix the gap. The values that drift are the ones with no recorded decision behind them. With no recorded target, reconciliation has nothing to converge toward. Where a value defers to a floating upstream default, faster reconciliation just spreads that default more uniformly. That's consistency without governance: the original failure mode at higher fidelity.

The diagnostic question is the load-bearing one: did anyone decide what this value should be? Pick ten parameters from your most recent deploy. For each one, point to where the decision is recorded: the commit, the runtime source of truth the operator owns, or the recorded deferral to upstream. (The allowlist entry records a different decision, whether to auto-apply, not that the value was decided.) The parameters whose decision you can't locate are the gap your configuration management tool isn't auditing against.

Governance answers whether. Management answers what.