Jun 5, 2026

8 min read

AI-Ready Design Systems (6/7)

When the design system and the code disagree

Francois Brill

Designer + Builder

TL;DR

For engineers: the reconciliation pattern we run when two token sources drift. Live values usually win, but the act of comparing them surfaces accessibility bugs you weren't looking for. Includes the four-category rubric for reconcile vs. document, and the exact WCAG math that flagged a gray-500 failure.

For founders: rebuilds expose debt that has been shipping invisibly. Our reconciliation work on a recent project found a contrast bug nobody had flagged in months. The principle matters; the math is detail for your team. If you want the context first, Article 3 covers the audit phase that made this reconciliation possible.

Two definitions of gray-500

The reconciliation was supposed to be cleanup. Instead it found a bug we'd been shipping in production for months.

On a recent client project (Nuxt + Tailwind v4), we had two gray scales. One lived in tailwind.config.js, the actual values rendering in users' browsers. The other lived in tokens.css, ported from a sister project to bootstrap the design system rebuild. Both files claimed to represent the same palette. Neither team had manually verified that claim.

They were close. Most values matched exactly. Two didn't.

| Token | Live (tailwind.config.js) | Ported (tokens.css) | Visible |
| --- | --- | --- | --- |
| gray-200 | #E1DCD6 | #F8EFE6 | Yes, noticeable |
| gray-500 | #8A7F73 | #8C7B6F | No, imperceptible |
| All other stops | matched | matched | |

gray-200 had drifted toward a lighter, warmer tone in the ported file. Put them side by side on screen and you'd see it immediately. The gray-500 difference was a few units per channel: 2 in red, 4 in green, 4 in blue, each out of 255. You would not see it on a monitor. But this is where "imperceptible on screen" and "inconsequential" part ways.
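A comparison like this is easy to script. The sketch below is one possible way to do it in Python, assuming both scales have already been loaded into plain dictionaries. The helper names (`hex_to_rgb`, `channel_deltas`, `diff_tokens`) are hypothetical, and only the two mismatched stops from the table are included.

```python
def hex_to_rgb(value: str) -> tuple:
    """Convert '#RRGGBB' to an (r, g, b) tuple of 0-255 integers."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def channel_deltas(a: str, b: str) -> tuple:
    """Absolute per-channel difference between two hex colors."""
    return tuple(abs(x - y) for x, y in zip(hex_to_rgb(a), hex_to_rgb(b)))


def diff_tokens(live: dict, ported: dict) -> dict:
    """Return per-channel deltas for every shared token whose values differ."""
    return {
        name: channel_deltas(live[name], ported[name])
        for name in live
        if name in ported and live[name].lower() != ported[name].lower()
    }


# Values taken from the table above; a real run would load the full scales.
live = {"gray-200": "#E1DCD6", "gray-500": "#8A7F73"}
ported = {"gray-200": "#F8EFE6", "gray-500": "#8C7B6F"}

drift = diff_tokens(live, ported)
# drift == {"gray-200": (23, 19, 16), "gray-500": (2, 4, 4)}
```

Raw channel deltas are a crude measure of visibility, but they are enough to separate a shift of about 20 units per channel from one of 2 to 4.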

The reconciliation question

We had three options, and the first two we considered were wrong.

The first attempt: update tailwind.config.js to match tokens.css. Make the code conform to the design system, which is the right long-term direction. In practice, this meant shifting gray-200 visibly across hundreds of existing pages. We built it, looked at it in the browser, and pulled it back within the hour. Visual continuity on existing pages is not optional.

The second idea: token bifurcation. Keep both scales, scope one to legacy templates and one to new components. This sounds principled. It immediately creates two sources of truth, two files to maintain, and a naming problem for every developer who touches the system afterward. We talked ourselves out of it in one conversation.

The third option: update tokens.css to match the live values. Adapt the design system to the existing state, not the other way around.

Live values usually win. Adapt the new system to the existing state, not the other way around.

The reasoning is straightforward. Hundreds of pages were rendering gray-200 as #E1DCD6. Changing that value would produce a visible, unexplained shift across the site. For gray-500, the math is simpler: no user would notice either way, so the shipped value wins by default. You don't introduce risk to correct something no one can see.

What we changed

Two values changed in tokens.css. The CHANGELOG got an entry:

## [Unreleased]
 
### Changed
- tokens.css: gray-200 updated from #F8EFE6 to #E1DCD6
  (align with live tailwind.config.js; visual continuity on existing pages)
- tokens.css: gray-500 updated from #8C7B6F to #8A7F73
  (align with live tailwind.config.js; imperceptible delta, live value wins)

That felt like a clean close. Then we ran the contrast checks.

The accessibility surprise

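WCAG contrast ratios are math. You take a foreground color and a background color, compute relative luminance for each, and divide the lighter by the darker (both offset by 0.05). R, G, and B below are the linearized sRGB channel values on a 0 to 1 scale, not the raw hex values. The formula: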

L = 0.2126 × R + 0.7152 × G + 0.0722 × B
contrast = (L_lighter + 0.05) / (L_darker + 0.05)

For AA-level compliance, normal body text requires 4.5:1. Large text (18pt or 14pt bold) requires 3:1.

We ran #8A7F73 (the new, now-canonical gray-500) against white.

The result: 3.92:1.

That passes for large text. It fails for normal body text. The gap to AA is 0.58 points. Small enough that it looks fine on a bright monitor. Large enough that it fails the standard.
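For anyone who wants to reproduce the number, here is a minimal Python sketch of the WCAG 2.x calculation. The function and constant names are illustrative, not from any project file. Note that each channel is linearized before the weights are applied.

```python
def srgb_to_linear(channel: int) -> float:
    """Linearize one 8-bit sRGB channel using the WCAG 2.x definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (srgb_to_linear(int(h[i:i + 2], 16)) for i in (0, 2, 4))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG AA threshold for normal body text
AA_LARGE_TEXT = 3.0   # WCAG AA threshold for large text

ratio = contrast_ratio("#8A7F73", "#FFFFFF")  # roughly 3.92
passes_large = ratio >= AA_LARGE_TEXT          # True
passes_normal = ratio >= AA_NORMAL_TEXT        # False
```

Different tools round or truncate the second decimal differently, so a checker that reports 3.91 for the same pair is computing the same thing.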

Accessibility math doesn't lie. 'It looks fine' is not WCAG.

The problem went beyond the number. gray-500 mapped to the semantic token foreground-muted. And foreground-muted had been applied to body paragraphs on several page templates. Not captions. Not timestamps. Full, sustained reading text.

The deeper issue

Here is where it gets worse.

We checked the value we had just replaced. The ported #8C7B6F scores 4.06:1 against white. Also below 4.5:1. Also a fail for AA normal text.

Both values failed. The old one and the new one. Neither ever passed the threshold for body text. We had been shipping a WCAG failure on body paragraphs for months, across both this project and the sister project the tokens were originally ported from. No one had caught it because no one had run the math. The value looked reasonable. The semantic name suggested a legitimate use case. On a high-brightness display in normal indoor lighting, 3.92:1 is not obviously broken. It's just broken.

The token name should have been the tell. Muted colors sit at the low end of contrast by design. Applying a muted token to sustained reading text was always going to be borderline at best. But when a token is named, shipped, and rendering without complaints, it accumulates a kind of implicit approval. No one flags what no one measured.

What we did about it

We didn't recolor gray-500. The value is correct for its intended purpose: captions, timestamps, metadata labels. In those roles, text is usually at larger sizes and the 3.92:1 passes the large-text threshold. The color is fine. The usage rule was wrong.

The accessibility spec now reads:

Use 'foreground-muted' for captions, timestamps, and metadata only. Do not use 'foreground-muted' for body paragraphs or sustained reading text below 18pt. For body paragraphs, use 'foreground-secondary' (gray-700), which scores 7.94:1 against white.
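A rule like this can also be checked mechanically. The sketch below is one hypothetical way to express it in Python: the contrast figures are the ones quoted in this article, while the usage list, the "normal" and "large" size labels, and the `find_violations` helper are invented for illustration.

```python
# WCAG AA minimum contrast by text size class.
AA_THRESHOLDS = {"normal": 4.5, "large": 3.0}

# Contrast against white, as quoted in the article.
CONTRAST_ON_WHITE = {
    "foreground-muted": 3.92,
    "foreground-secondary": 7.94,
}


def find_violations(usages):
    """Return every (location, token, size) whose token fails AA at that size."""
    return [
        (location, token, size)
        for location, token, size in usages
        if CONTRAST_ON_WHITE[token] < AA_THRESHOLDS[size]
    ]


# Hypothetical usages; a real audit would collect these from templates.
usages = [
    ("article body paragraph", "foreground-muted", "normal"),
    ("article body paragraph, corrected", "foreground-secondary", "normal"),
    ("large metadata label", "foreground-muted", "large"),
]

violations = find_violations(usages)
# violations == [("article body paragraph", "foreground-muted", "normal")]
```

The check only covers text on a white background; any other surface color would need its own contrast figures.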

The CHANGELOG got a second entry:

## [Unreleased]
 
### Fixed
- Accessibility: foreground-muted (#8A7F73) scores 3.92:1 against white
  Fails WCAG AA for normal body text (threshold: 4.5:1)
  Previous value (#8C7B6F) also failed at 4.06:1
  Use foreground-secondary (gray-700, 7.94:1) for body paragraphs
 
### Added
- Accessibility spec: foreground-muted restricted to captions/metadata contexts
 
### Deferred
- Full audit of foreground-muted usage across all page templates (ticket #42)

We opened a ticket for the full audit and deferred it. A complete sweep of every template was out of scope for a token alignment pass. We documented the finding precisely, locked in the rule, and left the audit to a focused follow-up. Expanding the scope of every cleanup task until it's fully resolved is how cleanup tasks never close.

Why this matters

Design system rebuilds do something structural audits rarely do: they force direct comparison between what was designed and what shipped.

The token reconciliation surfaced this issue because it required looking at each value individually, validating it, and making a deliberate choice. That is different from working with a system already in place. When everything is set and running, you tend to accept values that look reasonable. When you are explicitly migrating and aligning, each value becomes a decision point. Decision points invite scrutiny. Scrutiny finds things.

Reconciliation is a forcing function. Rebuilds expose bugs you weren't looking for.

We found one accessibility bug. We almost certainly have more. The deferred audit ticket exists because we know that now. Before the rebuild, we didn't know what we didn't know.

If your rebuild surfaces no issues, you didn't look hard enough.

This is not a criticism of the original work. Gray-500 at 3.92:1 against white is not an obvious failure. It's close enough to look fine and far enough from the threshold to matter. These are exactly the issues that live undetected until someone explicitly runs the math.

When to reconcile vs. document drift

Not every difference between two token sources needs the same response. We now use four categories:

Imperceptible. Each channel differs by only a few units, roughly 4 or fewer out of 255. No visual impact. Default to the live value and reconcile without further review.

Example: gray-500 (#8A7F73 vs #8C7B6F). The channels differ by 2, 4, and 4. Reconcile to live, document in CHANGELOG, move on. But still run the contrast check.

Noticeable. Visible difference on a calibrated monitor in normal conditions. Requires a deliberate decision. Live values usually win for continuity. If the ported value is intentionally improved (contrast-corrected, brand-aligned), document the rationale and flag for QA review.

Example: gray-200 (#E1DCD6 vs #F8EFE6). Reconciled to live after confirming no accessibility improvement in the ported alternative.

Symptomatic. The difference, or the act of checking it, reveals a problem in the live value: failing contrast, incorrect semantic usage, an undocumented deviation. Don't reconcile without addressing the symptom. Document the issue, create a ticket, defer if the fix is out of scope. But do not silently accept a live value you now know is wrong.

Example: gray-500 as foreground-muted on body paragraphs. The reconciliation process triggered the check that found the WCAG failure.

Intentional. The ported value was changed deliberately, with a reason. Treat it as a proposal. Evaluate against the live context before accepting or rejecting, and document the outcome either way.

The default is always to reconcile toward live. The exception is when looking closely at the live value reveals you should not have been shipping it in the first place.
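As a rough illustration, the rubric can be written down as a small decision function. This Python sketch is an assumption-heavy reading of the four categories, not the team's actual tooling: the threshold of 4 units per channel, the order in which the checks run, and every name in it are hypothetical.

```python
from enum import Enum


class Drift(Enum):
    IMPERCEPTIBLE = "imperceptible"
    NOTICEABLE = "noticeable"
    SYMPTOMATIC = "symptomatic"
    INTENTIONAL = "intentional"


# Assumed cutoff: largest per-channel difference still treated as invisible.
IMPERCEPTIBLE_MAX_DELTA = 4


def classify_drift(
    max_channel_delta: int,
    live_has_known_issue: bool,
    ported_change_documented: bool,
) -> Drift:
    """Map one token mismatch to a rubric category.

    Symptomatic is checked first so a known problem in the live value is
    never reconciled away just because the delta is small.
    """
    if live_has_known_issue:
        return Drift.SYMPTOMATIC
    if ported_change_documented:
        return Drift.INTENTIONAL
    if max_channel_delta <= IMPERCEPTIBLE_MAX_DELTA:
        return Drift.IMPERCEPTIBLE
    return Drift.NOTICEABLE


# gray-500 before the contrast check: small delta, nothing known to be wrong.
before_check = classify_drift(4, False, False)   # Drift.IMPERCEPTIBLE
# gray-500 after the contrast check flagged its use on body text.
after_check = classify_drift(4, True, False)     # Drift.SYMPTOMATIC
# gray-200: large delta, no documented reason for the ported value.
gray_200 = classify_drift(23, False, False)      # Drift.NOTICEABLE
```

The same token can move between categories as checks run, which is why gray-500 appears as an example under two of them.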

Rebuilds surface the bugs you've been shipping.

Clearly Design rebuilds design systems to be AI-ready and finds the accessibility and consistency debt hiding in your codebase along the way. Part of a product design subscription.