When the design system and the code disagree

Two definitions of gray-500

The reconciliation was supposed to be cleanup. Instead it found a bug we'd been shipping in production for months.

On a recent client project (Nuxt + Tailwind v4), we had two gray scales. One lived in tailwind.config.js and held the values actually rendering in users' browsers. The other lived in tokens.css, ported from a sister project to bootstrap the design system rebuild. Both files claimed to represent the same palette. Neither team had manually verified that claim.

They were close. Most values matched exactly. Two didn't.

| Token | Live (`tailwind.config.js`) | Ported (`tokens.css`) | Visible difference? |
| --- | --- | --- | --- |
| gray-200 | #E1DCD6 | #F8EFE6 | Yes, noticeable |
| gray-500 | #8A7F73 | #8C7B6F | No, imperceptible |
| All other stops | matched | matched | – |

gray-200 had drifted toward a lighter, warmer tone in the ported file. Put them side by side on screen and you'd see it immediately. The gray-500 difference was 2 to 4 levels per channel, out of 255. You would not see it on a monitor. But this is where "imperceptible on screen" and "inconsequential" part ways.
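A comparison like the one in the table is easy to script. Here is a minimal sketch in Python, assuming both files have already been parsed into plain token-to-hex dictionaries. The parsing step and the function names are illustrative, not taken from the project.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, ...]:
    """Parse '#RRGGBB' into three integer channels (0-255)."""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def diff_tokens(live: dict[str, str], ported: dict[str, str]) -> dict[str, tuple[int, ...]]:
    """Return per-channel absolute deltas for every shared token whose values differ."""
    drift = {}
    for name in sorted(live.keys() & ported.keys()):
        a, b = hex_to_rgb(live[name]), hex_to_rgb(ported[name])
        if a != b:
            drift[name] = tuple(abs(x - y) for x, y in zip(a, b))
    return drift


# Values copied from the table; a real script would read the two files instead.
live = {"gray-200": "#E1DCD6", "gray-500": "#8A7F73"}
ported = {"gray-200": "#F8EFE6", "gray-500": "#8C7B6F"}

for token, (dr, dg, db) in diff_tokens(live, ported).items():
    print(f"{token}: R {dr}, G {dg}, B {db}")
# gray-200: R 23, G 19, B 16
# gray-500: R 2, G 4, B 4
```

Per-channel deltas are a crude proxy for what the eye sees, but they are enough to separate "look at this one" from "probably fine".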

The reconciliation question

We had three options and tried two wrong ones first.

The first attempt: update tailwind.config.js to match tokens.css. Make the code conform to the design system, which is the right long-term direction. In practice, this meant shifting gray-200 visibly across hundreds of existing pages. We built it, looked at it in the browser, and pulled it back within the hour. Visual continuity on existing pages is not optional.

The second idea: token bifurcation. Keep both scales, scope one to legacy templates and one to new components. This sounds principled. It immediately creates two sources of truth, two files to maintain, and a naming problem for every developer who touches the system afterward. We talked ourselves out of it in one conversation.

The third option: update tokens.css to match the live values. Adapt the design system to the existing state, not the other way around.

“Live values usually win. Adapt the new system to the existing state, not the other way around.”

The reasoning is straightforward. Hundreds of pages were rendering gray-200 as #E1DCD6. Changing that value would produce a visible, unexplained shift across the site. For gray-500, the math is simpler: no user would notice either way, so the shipped value wins by default. You don't introduce risk to correct something no one can see.

What we changed

Two values changed in tokens.css. The CHANGELOG got an entry:

## [Unreleased]
 
### Changed
- tokens.css: gray-200 updated from #F8EFE6 to #E1DCD6
  (align with live tailwind.config.js; visual continuity on existing pages)
- tokens.css: gray-500 updated from #8C7B6F to #8A7F73
  (align with live tailwind.config.js; imperceptible delta, live value wins)

That felt like a clean close. Then we ran the contrast checks.

The accessibility surprise

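WCAG contrast ratios are math. You take a foreground color and a background color, compute relative luminance for each, and divide the lighter by the darker (both offset by 0.05). In the formula below, R, G, and B are the linearized sRGB channel values on a 0 to 1 scale, not the raw hex values: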

L = 0.2126 × R + 0.7152 × G + 0.0722 × B
contrast = (L_lighter + 0.05) / (L_darker + 0.05)

For AA-level compliance, normal body text requires 4.5:1. Large text (18pt or 14pt bold) requires 3:1.

We ran #8A7F73 (the new, now-canonical gray-500) against white.

The result: 3.92:1.

That passes for large text. It fails for normal body text. The gap to AA is 0.58 points. Small enough that it looks fine on a bright monitor. Large enough that it fails the standard.
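The check is small enough to script. The sketch below follows the WCAG definition of relative luminance, including the sRGB linearization step that comes before the weighted sum. The function names are our own for illustration. This is not the tool we used on the project.

```python
def linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG definition."""
    c = channel / 255
    # Some versions of the WCAG text list 0.03928 as the cutoff; for 8-bit
    # channels both cutoffs produce identical results.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Weighted sum of the linearized R, G, B channels of '#RRGGBB'."""
    value = hex_color.lstrip("#")
    r, g, b = (linearize(int(value[i:i + 2], 16)) for i in (0, 2, 4))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """(L_lighter + 0.05) / (L_darker + 0.05), regardless of argument order."""
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(ratio: float, large_text: bool = False) -> bool:
    """AA thresholds: 4.5:1 for normal text, 3:1 for large text."""
    return ratio >= (3.0 if large_text else 4.5)


ratio = contrast_ratio("#8A7F73", "#FFFFFF")
print(f"{ratio:.2f}:1")
print("AA normal text:", passes_aa(ratio))
print("AA large text:", passes_aa(ratio, large_text=True))
```

For gray-500 against white the ratio lands at roughly 3.9:1, so the normal-text check fails and the large-text check passes.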

“Accessibility math doesn't lie. 'It looks fine' is not WCAG.”

The problem went beyond the number. gray-500 mapped to the semantic token foreground-muted. And foreground-muted had been applied to body paragraphs on several page templates. Not captions. Not timestamps. Full, sustained reading text.

The deeper issue

Here is where it gets worse.

We checked the value we had just replaced. The ported #8C7B6F scores 4.06:1 against white. Also below 4.5:1. Also a fail for AA normal text.

Both values failed. The old one and the new one. Neither ever passed the threshold for body text. We had been shipping a WCAG failure on body paragraphs for months, across both this project and the sister project the tokens were originally ported from. No one had caught it because no one had run the math. The value looked reasonable. The semantic name suggested a legitimate use case. On a high-brightness display in normal indoor lighting, 3.92:1 is not obviously broken. It's just broken.

The token name should have been the tell. Muted colors sit at the low end of contrast by design. Applying a muted token to sustained reading text was always going to be borderline at best. But when a token is named, shipped, and rendering without complaints, it accumulates a kind of implicit approval. No one flags what no one measured.

What we did about it

We didn't recolor gray-500. The value is correct for its intended purpose: captions, timestamps, metadata labels. In those roles, text is usually at larger sizes and the 3.92:1 passes the large-text threshold. The color is fine. The usage rule was wrong.

The accessibility spec now reads:

“Use 'foreground-muted' for captions, timestamps, and metadata only. Do not use 'foreground-muted' for body paragraphs or sustained reading text below 18pt. For body paragraphs, use 'foreground-secondary' (gray-700), which scores 7.94:1 against white.”
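A rule like this is easier to keep when it also exists as data. As a rough sketch, the spec can be encoded as a map of allowed contexts per token and checked against a list of usages. The token names come from the spec. The context labels, template names, and function name are hypothetical placeholders.

```python
# Hypothetical encoding of the usage rule: the contexts each semantic token
# may appear in. Context labels are illustrative, not from the project.
ALLOWED_CONTEXTS = {
    "foreground-muted": {"caption", "timestamp", "metadata"},
    "foreground-secondary": {"caption", "timestamp", "metadata", "body"},
}


def find_violations(usages):
    """Return the usages whose context is not allowed for their token.

    Each usage is a (template, token, context) tuple. Tokens missing from
    ALLOWED_CONTEXTS are skipped rather than flagged.
    """
    return [
        (template, token, context)
        for template, token, context in usages
        if token in ALLOWED_CONTEXTS and context not in ALLOWED_CONTEXTS[token]
    ]


# Made-up template names, standing in for the output of a real template scan.
usages = [
    ("article.vue", "foreground-muted", "body"),
    ("article.vue", "foreground-muted", "timestamp"),
    ("landing.vue", "foreground-secondary", "body"),
]

for template, token, context in find_violations(usages):
    print(f"{template}: {token} used for {context} text")
# article.vue: foreground-muted used for body text
```

The hard part in practice is producing the usage list, which is the audit itself. The check only makes the rule enforceable once that list exists.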

The CHANGELOG got a second entry:

## [Unreleased]
 
### Fixed
- Accessibility: foreground-muted (#8A7F73) scores 3.92:1 against white
  Fails WCAG AA for normal body text (threshold: 4.5:1)
  Previous value (#8C7B6F) also failed at 4.06:1
  Use foreground-secondary (gray-700, 7.94:1) for body paragraphs
 
### Added
- Accessibility spec: foreground-muted restricted to captions/metadata contexts
 
### Deferred
- Full audit of foreground-muted usage across all page templates (ticket #42)

We opened a ticket for the full audit and deferred it. A complete sweep of every template was out of scope for a token alignment pass. We documented the finding precisely, locked in the rule, and left the audit to a focused follow-up. Expanding the scope of every cleanup task until it's fully resolved is how cleanup tasks never close.

Why this matters

Design system rebuilds do something structural audits rarely do: they force direct comparison between what was designed and what shipped.

The token reconciliation surfaced this issue because it required looking at each value individually, validating it, and making a deliberate choice. That is different from working with a system already in place. When everything is set and running, you tend to accept values that look reasonable. When you are explicitly migrating and aligning, each value becomes a decision point. Decision points invite scrutiny. Scrutiny finds things.

“Reconciliation is a forcing function. Rebuilds expose bugs you weren't looking for.”

We found one accessibility bug. We almost certainly have more. The deferred audit ticket exists because we know that now. Before the rebuild, we didn't know what we didn't know.

“If your rebuild surfaces no issues, you didn't look hard enough.”

This is not a criticism of the original work. Gray-500 at 3.92:1 against white is not an obvious failure. It's close enough to look fine and far enough from the threshold to matter. These are exactly the issues that live undetected until someone explicitly runs the math.

When to reconcile vs. document drift

Not every difference between two token sources needs the same response. We now use four categories:

Imperceptible. Differences of no more than 3-4 units per channel (out of 255). No visual impact. Default to the live value and reconcile silently.

Example: gray-500 (#8A7F73 vs #8C7B6F). Two to four units per channel. Reconcile to live, document in CHANGELOG, move on. But still run the contrast check.

Noticeable. Visible difference on a calibrated monitor in normal conditions. Requires a deliberate decision. Live values usually win for continuity. If the ported value is intentionally improved (contrast-corrected, brand-aligned), document the rationale and flag for QA review.

Example: gray-200 (#E1DCD6 vs #F8EFE6). Reconciled to live after confirming no accessibility improvement in the ported alternative.

Symptomatic. The difference, or the act of checking it, reveals a problem in the live value: failing contrast, incorrect semantic usage, an undocumented deviation. Don't reconcile without addressing the symptom. Document the issue, create a ticket, defer if the fix is out of scope. But do not silently accept a live value you now know is wrong.

Example: gray-500 as foreground-muted on body paragraphs. The reconciliation process triggered the check that found the WCAG failure.

Intentional. The ported value was changed deliberately, with a reason. Treat it as a proposal. Evaluate against the live context before accepting or rejecting, and document the outcome either way.

The default is always to reconcile toward live. The exception is when looking closely at the live value reveals you should not have been shipping it in the first place.
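The rubric can be sketched as a small classifier. Two things in the code are assumptions rather than rules from our process: the numeric cutoff for "imperceptible" and the order in which the categories take precedence. The two boolean flags stand in for human judgment, since nothing in the hex values tells you whether a change was deliberate or whether a contrast check failed.

```python
def channel_deltas(a: str, b: str) -> tuple[int, ...]:
    """Absolute per-channel difference between two '#RRGGBB' colors."""
    pa, pb = a.lstrip("#"), b.lstrip("#")
    return tuple(abs(int(pa[i:i + 2], 16) - int(pb[i:i + 2], 16)) for i in (0, 2, 4))


# Assumed cutoff. The rubric describes "imperceptible" loosely, so treat this
# number as a placeholder to tune, not a perceptual guarantee.
IMPERCEPTIBLE_MAX_DELTA = 4


def classify(live: str, ported: str,
             live_has_known_issue: bool = False,
             ported_is_deliberate: bool = False) -> str:
    """Sort a drifted token into one of the four categories."""
    # Assumed precedence: a known problem in the live value outranks the rest.
    if live_has_known_issue:
        return "symptomatic"
    if ported_is_deliberate:
        return "intentional"
    if max(channel_deltas(live, ported)) <= IMPERCEPTIBLE_MAX_DELTA:
        return "imperceptible"
    return "noticeable"


print(classify("#E1DCD6", "#F8EFE6"))                             # noticeable
print(classify("#8A7F73", "#8C7B6F"))                             # imperceptible
print(classify("#8A7F73", "#8C7B6F", live_has_known_issue=True))  # symptomatic
```

The last two calls show why the contrast check matters: the same pair of values moves from "imperceptible" to "symptomatic" as soon as someone measures it.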

Frequently asked questions

What does it mean when a design system and the code disagree?

Drift. Two parallel sources of truth defining the same thing (e.g., gray-500) end up with subtly different values. Often a Tailwind config holds the values that actually render in browsers while a separate tokens.css holds values ported from a sister project. Both files claim to represent the same palette, but nobody has manually verified the claim.

How do you reconcile drifting design tokens?

Use a four-category rubric. Imperceptible differences default to the live value silently. Noticeable differences require a deliberate choice, usually toward the live value for continuity. Symptomatic differences (where checking reveals a problem like a contrast failure) get documented and ticketed, not silently reconciled. Intentional differences are treated as proposals and evaluated against the live context.

Should design system values or live code values win during reconciliation?

Live values usually win. Hundreds of rendered pages depend on the values currently shipping. Changing them produces visible, unexplained shifts across the site. Adapt the new design system to the existing state, not the other way around. The exception is when looking closely reveals you should not have been shipping the live value in the first place.

What is the WCAG AA contrast ratio for normal body text?

4.5:1 minimum against the background. Large text (18pt or 14pt bold) only needs 3:1. A gray that scores 3.92:1 against white passes for large text but fails for body paragraphs. That gap is small enough to look fine on a bright monitor and large enough to fail the standard.

How do design system rebuilds expose accessibility bugs?

Rebuilds force direct comparison between what was designed and what shipped. Working with an in-place system, you accept values that look reasonable. Migrating and aligning makes each value a decision point, and decision points invite scrutiny. Scrutiny finds things, including contrast failures nobody had measured because nobody had reason to look.

Francois Brill

Designer + Builder, Clearly Design, Inc.

Francois has been designing and shipping software products since 2002, with a focus on the messy middle between design and the codebases that consume them. He runs Clearly Design, a product design subscription studio for SaaS founders, and writes about how design systems hold up under AI-assisted development.
