Faster reconciliation doesn't fix drift
Configuration management asks what each value should be. Governance asks whether anyone decided. Drift accumulates in the gap between the two.
Configuration drifts silently.
Configuration management asks what each value should be. A harder question comes first: whether anyone decided what the value should be. That question is governance, and drift accumulates in the gap between the two.
Drift is what configuration management produces when governance is missing. Every deployment check can pass while half the parameters carry values nobody confirmed. The configuration is consistent with itself and inconsistent with intent, because intent was never recorded.
Governance is the layer where intent is recorded. Classification is how the layer is implemented when more than one person decides.
Every parameter has a decision state
Three states cover the working set.
- Tracked. Someone decided what this value should be. The decision is recorded in a source of truth the operator owns, whether in version control or in a runtime mechanism. The current state is whatever that source of truth produces.
- Defaulted. Someone decided to defer to upstream. The deferral can be pinned to upstream's default at install time, or it can float with future upstream defaults. The current state is whatever the chosen variant produces. Whether future upstream changes can land automatically is decided by the default-value allowlist, introduced below.
- Undecided. Nobody has formally researched what the value should be. There's no recorded decision. The current state is whatever the system happens to be running.
The first two are governed. The third is drift in waiting.
What governance adds is a system that knows which state each parameter is in and refuses to mutate the third automatically. The taxonomy fits in a paragraph. The enforcement layer takes months to build and longer to maintain.
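As a rough illustration of "a system that knows which state each parameter is in", the sketch below models the three states and treats the absence of a record as undecided. The names (`DecisionState`, `Decision`, `classify`, the parameter names, and the `git:` source strings) are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DecisionState(Enum):
    TRACKED = "tracked"      # explicit value recorded in an operator-owned source of truth
    DEFAULTED = "defaulted"  # recorded decision to defer to upstream
    UNDECIDED = "undecided"  # no recorded decision


@dataclass(frozen=True)
class Decision:
    state: DecisionState
    source: str                    # where the decision is recorded (hypothetical format)
    value: Optional[str] = None    # explicit value, for tracked parameters
    pinned: Optional[bool] = None  # for defaulted: pinned at install time vs floating


def classify(name: str, registry: dict[str, Decision]) -> DecisionState:
    """Return the recorded state, or UNDECIDED when no decision is on record."""
    decision = registry.get(name)
    return decision.state if decision else DecisionState.UNDECIDED


# Hypothetical registry: two recorded decisions, nothing for cache.ttl_s.
registry = {
    "db.pool_size": Decision(DecisionState.TRACKED, "git:config/db.yaml", value="40"),
    "http.keepalive_s": Decision(DecisionState.DEFAULTED, "git:decisions/http.md", pinned=True),
}

for name in ("db.pool_size", "http.keepalive_s", "cache.ttl_s"):
    print(name, classify(name, registry).value)
```

The design choice worth noting is that undecided is never written down. It is what a lookup returns when nothing else is on record.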
Refuse to act on the unknown
The safety property holds across every toolchain: when automation encounters an undecided parameter, it refuses to mutate it automatically. The form varies by toolchain: halt the deployment for review, block promotion until the parameter is classified, or surface the parameter to an operator while other changes proceed.
Silently skipping the parameter is not one of those forms. Skipping makes the parameter invisible, and the next deployment skips it again. The state nobody chose persists across every run, and drift compounds because the automation has been told to look the other way.
Stopping makes the undecided parameter a forcing function. Someone has to record a decision: either set an explicit value (tracked) or defer to upstream's default (defaulted). The set of unknowns shrinks under deliberate work.
Failing closed on the undecided is the property that closes the loop. Without it, governance is a wish list.
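A minimal sketch of the "halt the deployment" form of that property follows. It assumes a deployment can be reduced to a set of parameter names and that recorded decisions are available as another set; `check_fail_closed` and `UndecidedParameterError` are hypothetical names, not part of any real tool.

```python
class UndecidedParameterError(Exception):
    """Raised when a deployment touches parameters with no recorded decision."""


def check_fail_closed(deploy_params: set[str], decided: set[str]) -> None:
    """Stop instead of skipping: any undecided parameter aborts the run."""
    undecided = sorted(deploy_params - decided)
    if undecided:
        raise UndecidedParameterError(
            "no recorded decision for: " + ", ".join(undecided)
        )


# Hypothetical set of parameters that are tracked or defaulted.
decided = {"db.pool_size", "http.keepalive_s"}

check_fail_closed({"db.pool_size"}, decided)  # passes silently

try:
    check_fail_closed({"db.pool_size", "cache.ttl_s"}, decided)
except UndecidedParameterError as err:
    print(f"deployment halted: {err}")
```

The other forms (blocking promotion, or surfacing the parameter while other changes proceed) would swap the exception for a different signal. The check itself stays the same.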
Switch this on against a brownfield catalog and every deployment stops on the first run. Fail-closed needs a bootstrap phase: catalog the existing parameters, classify the obvious ones, then enforce against the residue. How long that takes depends on parameter count, the research cost per parameter, and how much of the original reasoning is still recoverable. The forcing function works on inflow first; older parameters get classified in stages, in the order their next deployment forces the question.
A dual allowlist separates "decided" from "safe to auto-apply"
Governance establishes that someone decided. Some decided parameters still need intentional friction before the change lands: confirmation, alarm, abort, or any mechanism that slows the change down. That's a second governance decision: whether the recorded value is safe to auto-apply.
A dual allowlist makes the second decision explicit:
- Tracked-value allowlist. Which tracked parameters can be reconciled automatically when running state diverges from the recorded value.
- Default-value allowlist. Which defaulted parameters can be auto-applied when upstream defaults change.
A parameter outside its allowlist doesn't auto-apply. Automation reports the divergence (running state versus recorded value for tracked parameters, or previously-applied default versus new upstream default for defaulted parameters) and routes the change through whatever friction the system enforces.
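The routing rule can be sketched in a few lines. This is an illustration under assumptions: the `Divergence` shape, the `plan` function, and the allowlist contents are all hypothetical, and "review" stands in for whatever friction a real system enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Divergence:
    name: str
    state: str     # "tracked" or "defaulted"
    current: str   # running state, or the previously applied default
    proposed: str  # recorded value, or the new upstream default


# Hypothetical allowlist contents, for illustration only.
TRACKED_ALLOWLIST = {"log.level"}
DEFAULT_ALLOWLIST = {"http.keepalive_s"}


def plan(d: Divergence) -> str:
    """Return "auto-apply" or "review" for one divergence."""
    if d.state == "tracked":
        allowlist = TRACKED_ALLOWLIST
    elif d.state == "defaulted":
        allowlist = DEFAULT_ALLOWLIST
    else:
        # Undecided parameters never reach auto-apply.
        raise ValueError(f"{d.name}: state {d.state!r} is not governed")
    return "auto-apply" if d.name in allowlist else "review"


divergences = [
    Divergence("log.level", "tracked", current="debug", proposed="info"),
    Divergence("db.pool_size", "tracked", current="25", proposed="40"),
    Divergence("http.keepalive_s", "defaulted", current="60", proposed="75"),
]
for d in divergences:
    print(f"{d.name}: {d.current} -> {d.proposed} [{plan(d)}]")
```

Here `db.pool_size` has a recorded value but sits outside the tracked-value allowlist, so the divergence is reported and routed to review rather than reconciled.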
Take a connection pool size. A new service ships with the parameter unspecified, leaving it undecided, and automation refuses to carry it forward silently. An operator measures peak concurrency and records an explicit value (tracked), or accepts the framework's default and records the deferral (defaulted). The dual allowlist then decides whether future changes apply automatically or surface for operator review.
The split exists because "we know what the value should be" and "this is safe to auto-apply" are different claims. Some parameters depend on traffic patterns. Capacity envelopes set others. A few have invariants the configuration schema can't express. Auto-apply a known-correct value on one of those, and a recorded decision still produces a production incident, most often during a version bump, when the parameter's name stays stable but its semantics drift.
The dual allowlist also makes scope expansion incremental. New parameters start outside the allowlist for their category. As confidence grows, parameters move in. When a parameter moves back out, the move is recorded too. The allowlists themselves are governed.
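One way to keep the allowlists themselves governed is to treat membership as the result of replaying a log of recorded moves, so that every entry and every exit has a reason and an owner. The sketch below assumes that design; `AllowlistChange`, `members`, and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowlistChange:
    allowlist: str   # "tracked" or "default"
    parameter: str
    action: str      # "add" or "remove"
    reason: str
    decided_by: str


def members(log: list[AllowlistChange], allowlist: str) -> set[str]:
    """Replay the log in order to derive current membership of one allowlist."""
    current: set[str] = set()
    for change in log:
        if change.allowlist != allowlist:
            continue
        if change.action == "add":
            current.add(change.parameter)
        elif change.action == "remove":
            current.discard(change.parameter)
    return current


# Hypothetical history: db.pool_size moved in, then back out.
log = [
    AllowlistChange("tracked", "log.level", "add", "low blast radius", "alice"),
    AllowlistChange("tracked", "db.pool_size", "add", "stable for two quarters", "alice"),
    AllowlistChange("tracked", "db.pool_size", "remove", "semantics changed in driver upgrade", "bob"),
]

print(members(log, "tracked"))  # {'log.level'}
```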
Trust topology decides the scheme's shape
The dual allowlist is a coordination instrument. It exists because the second decision (is this safe to auto-apply?) can be answered differently by different operators. When two parties don't share the same view of the parameter's behavior, the allowlist forces the disagreement into the open before the change lands.
Where decision authority is unified, the second decision merges with the first. The same person who recorded the value will be present when it lands, and the allowlist collapses to a single recorded value per parameter.
Where authority is divided across operators or organizational boundaries, the second decision has to be coordinated. The allowlist returns.
A 200-service operation run by a single operator collapses to the same shape as a small one. A 5-service production system run by three operators across two time zones still needs the allowlist. Scale correlates with trust topology only at the extremes; division of decision authority is the structural cause.
Trust topology changes the shape. The property holds at every shape: refuse to mutate the undecided automatically.
The schema covers what it claims
The three states cover every parameter the governance layer can inspect and act on. Apparent edge cases collapse back into the schema.
A parameter set by orchestration or service discovery is still tracked. The runtime mechanism is the source of truth, and the operator owns it. The schema doesn't need a separate class for it.
A "default that isn't safe" should be tracked explicitly. If you don't trust upstream's default, the right move is to record an explicit value the operator stands behind. The alternative is a ticking time bomb: a parameter that defers to upstream while flagged as risky.
A parameter whose semantics shift between versions has to be re-validated by the operator at upgrade. The default-value allowlist already handles unexpected default changes: parameters whose defaults might shift unsafely stay out of the auto-apply allowlist, and any change surfaces for operator review.
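The default-change half of that can be sketched as a diff between two sets of upstream defaults. This assumes upstream defaults can be exported as name-to-value mappings per version; the function name, parameter names, and values are hypothetical. A diff like this catches changed default values only. A parameter whose name and default stay the same while its meaning changes will not show up, which is why re-validation stays with the operator.

```python
def changed_defaults(
    old: dict[str, str], new: dict[str, str], defaulted: set[str]
) -> dict[str, tuple[str, str]]:
    """Defaulted parameters whose upstream default differs between two versions."""
    return {
        name: (old[name], new[name])
        for name in sorted(defaulted)
        if name in old and name in new and old[name] != new[name]
    }


# Hypothetical upstream defaults before and after an upgrade.
old_defaults = {"http.keepalive_s": "60", "tls.min_version": "1.2", "gc.threads": "4"}
new_defaults = {"http.keepalive_s": "75", "tls.min_version": "1.3", "gc.threads": "4"}

defaulted = {"http.keepalive_s", "tls.min_version", "gc.threads"}
default_allowlist = {"http.keepalive_s"}  # hypothetical contents

changes = changed_defaults(old_defaults, new_defaults, defaulted)
auto_apply = {n: c for n, c in changes.items() if n in default_allowlist}
needs_review = {n: c for n, c in changes.items() if n not in default_allowlist}

print("auto-apply:", auto_apply)
print("needs review:", needs_review)
```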
Governance is the answer-existence layer
Configuration management is the answer-execution layer. Whatever toolchain enforces the recorded values (push-based orchestration, pull-based GitOps, a control plane that reconciles fleet-wide), its job is to make running state match recorded state. Declarative-system sync status (out-of-sync, missing, drifted) lives here. Sync answers "does running state match recorded state." Policy-as-code asks a related question: "does this change violate a rule." Both presuppose governance. A value can be in sync, pass every policy check, and still be undecided.
Governance sits one layer above. It establishes that a recorded state exists, by classifying every parameter into a decision state and refusing to mutate undecided parameters automatically. It also decides how much operator review each recorded decision requires before it's applied. Sync depends on governance; without it, reconciliation just makes drift look intentional.
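The three questions can be put side by side in a short sketch. Everything here is illustrative: the function names, the TTL rule standing in for policy-as-code, and the sample data are assumptions, not any real tool's behavior.

```python
def in_sync(name: str, running: dict[str, str], recorded: dict[str, str]) -> bool:
    """Sync: does running state match recorded state?"""
    return name in recorded and running.get(name) == recorded[name]


def passes_policy(name: str, value: str) -> bool:
    """Policy-as-code: does this value violate a rule? (hypothetical rule)"""
    if name.endswith("ttl_s"):
        return 0 < int(value) <= 3600
    return True


def is_decided(name: str, decisions: dict[str, str]) -> bool:
    """Governance: is there a recorded decision behind this value?"""
    return name in decisions


# Hypothetical data: the value is recorded and running, but nobody chose it.
recorded = {"cache.ttl_s": "300"}
running = {"cache.ttl_s": "300"}
decisions = {"db.pool_size": "git:decisions/db.md"}  # nothing for cache.ttl_s

name = "cache.ttl_s"
print(in_sync(name, running, recorded))     # True
print(passes_policy(name, running[name]))   # True
print(is_decided(name, decisions))          # False
```

The first two checks pass because they compare values against values or rules. Only the third asks whether a decision exists.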
When the two layers collapse into one, drift gets treated as a configuration management failure: stronger reconciliation, faster cadence, better tooling. None of those fix the gap. The values that drift are the ones with no recorded decision behind them. Faster reconciliation forces a default into a more consistent shape across the fleet, which is consistency without governance, the original failure mode at higher fidelity.
The diagnostic question is the load-bearing one: did anyone decide what this value should be? Pick ten parameters from your most recent deploy. For each one, point to where the decision is recorded: the commit, the allowlist entry, the policy. The parameters whose decision you can't locate are the gap, and your configuration management tool isn't auditing it.
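The exercise can be run by hand, but a small script makes it repeatable. The sketch below assumes you can list the parameters in a deploy and keep a mapping from parameter name to the place its decision is recorded; `audit_sample` and the sample data are hypothetical.

```python
import random
from typing import Optional


def audit_sample(
    deployed: list[str],
    decision_locations: dict[str, str],
    k: int = 10,
    seed: Optional[int] = None,
) -> dict[str, Optional[str]]:
    """Sample up to k deployed parameters and look up where each decision lives."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(deployed), min(k, len(deployed)))
    return {name: decision_locations.get(name) for name in sample}


# Hypothetical deploy and decision records.
deployed = ["db.pool_size", "http.keepalive_s", "cache.ttl_s", "log.level"]
decision_locations = {
    "db.pool_size": "git:config/db.yaml",
    "http.keepalive_s": "git:decisions/http.md",
}

report = audit_sample(deployed, decision_locations)
gap = sorted(name for name, location in report.items() if location is None)
print("no recorded decision:", gap)
```

With only four parameters deployed, the sample covers all of them, and the gap is `cache.ttl_s` and `log.level`.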
Governance answers whether. Management answers what.