Core Francisco Park

A vision language model sees the world the same way your monitor draws it: as three numbers per pixel (red, green, blue), each an 8-bit integer between 0 and 255. In principle, that gives it machine precision: a VLM could tell (120, 80, 200) apart from (121, 80, 200) without breaking a sweat. The two colors differ by a single unit, in a space where the diagonal is over 440 units long.

In practice it can't. Show a frontier VLM two colored circles a single integer apart and it'll call them the same. Push the difference out to twenty or thirty units and it starts to get it right, but only sometimes, and only for certain colors. There is a fuzzy decision boundary in these models that bears no obvious relation to their raw input. What sets that boundary?

That is the question we set out to answer. The answer turned out to be a little uncanny.

The puzzle

Three hypotheses for why a VLM's color discrimination doesn't follow its input geometry:

1. Input statistics. Maybe similar pixels just produce similar activations. The model is fundamentally an L2-distance calculator over sRGB; perception is a side effect of that geometry. Nothing more is going on.
2. Mechanistic. Maybe the early layers of the vision encoder warp the input in some specific, characterizable way: a learned transfer function that defines the model's "sensory system."
3. Human inheritance. Training data is curated by humans, captured by cameras built for humans, labeled with words that humans coined for distinctions humans can see. If the model's discrimination contours line up with human perceptual contours (the ones color scientists have been measuring since MacAdam, 1942), that's a story about the data, not the architecture.

These aren't mutually exclusive. The point is to find out which fraction is which.

A diversion: how human-perceptual is human-perceptual?

Before the experiments, a quick demo of the thing we're going to measure VLMs against. The CIE ΔE₀₀ metric is the gold-standard color-difference formula in the color-science world. It's been fit to decades of psychophysical data: you can think of it as "the distance between two colors in the geometry that a typical human eye-brain uses." A ΔE₀₀ of about 1 is the human just-noticeable difference (JND) — colors below that are indistinguishable.
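For readers who want to see what is inside the formula, here is a minimal sketch of ΔE₀₀ in plain Python. It follows the standard published CIEDE2000 equations with the usual weights kL = kC = kH = 1 and takes CIELAB triples as input. The function name `ciede2000` is ours, for illustration only; this is not the widget's code.

```python
import math

def ciede2000(lab1, lab2):
    """CIEDE2000 color difference between two CIELAB triples (kL = kC = kH = 1)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2

    # Rescale a* so that near-neutral colors are handled more uniformly.
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar**7 / (c_bar**7 + 25**7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360 if c1p else 0.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360 if c2p else 0.0

    # Differences in lightness, chroma, and hue.
    d_l = L2 - L1
    d_c = c2p - c1p
    d_h_deg = 0.0
    if c1p * c2p != 0:
        d_h_deg = h2p - h1p
        if d_h_deg > 180:
            d_h_deg -= 360
        elif d_h_deg < -180:
            d_h_deg += 360
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2))

    # Means used by the weighting functions.
    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    if c1p * c2p == 0:
        h_bar = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        h_bar = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        h_bar = (h1p + h2p + 360) / 2
    else:
        h_bar = (h1p + h2p - 360) / 2

    t = (1
         - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t

    # Rotation term that couples chroma and hue in the blue region.
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p**7 / (c_bar_p**7 + 25**7))
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    lt, ct, ht = d_l / s_l, d_c / s_c, d_h / s_h
    return math.sqrt(lt * lt + ct * ct + ht * ht + r_t * ct * ht)

# Two saturated blues that differ slightly in hue and chroma.
blue_pair = ciede2000((50.0, 2.6772, -79.7751), (50.0, 0.0, -82.7485))
print(round(blue_pair, 2))
```

The example pair is two saturated blues, and the result should come out at about 2.04, roughly two JNDs. Most of the machinery above (the `s_l`, `s_c`, `s_h` weights and the `r_t` rotation) exists to correct places where plain Euclidean distance in CIELAB disagrees with human judgments.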

The slider widget below computes ΔE₀₀ in your browser. Try the presets. Notice that for the same sRGB distance, two yellows can be almost identical while two blues are clearly distinct. The mapping from "what the model sees on its wire" to "what looks different" is deeply non-uniform.

[Interactive widget: in-browser ΔE₀₀ calculator with R, G, B sliders for two colors. Example preset: color 1 #ebdc46, rgb(235, 220, 70); color 2 #ffeb46, rgb(255, 235, 70). sRGB L2 = 25.0 (in 0–255 space); CIE ΔE₀₀ = 3.84 (human-perceptual; JND ≈ 1 ΔE₀₀); perceptual call: clearly different. Preset note: big shift in sRGB, barely visible; the yellow region is perceptually compressed.]

That non-uniformity is what we'd expect the VLM to ignore. Its input is sRGB. The clean, boring hypothesis says it should discriminate by sRGB distance.
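To make "sRGB distance" versus "perceptual distance" concrete, here is a small sketch that converts 8-bit sRGB to CIELAB and compares the two for the yellow preset from the widget. It uses the standard sRGB (D65) matrix and transfer function, and plain Euclidean distance in CIELAB (ΔE₇₆) as a simple perceptual stand-in. The helper names are ours, not the code behind the post.

```python
import math

# Standard linear-sRGB -> XYZ matrix (D65) and D65 reference white.
M = ((0.4124564, 0.3575761, 0.1804375),
     (0.2126729, 0.7151522, 0.0721750),
     (0.0193339, 0.1191920, 0.9503041))
WHITE = (0.95047, 1.00000, 1.08883)

def srgb_to_linear(code):
    """Undo the sRGB gamma encoding for one 8-bit channel value."""
    v = code / 255.0
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def srgb_to_lab(rgb):
    """8-bit sRGB triple -> CIELAB (L*, a*, b*)."""
    lin = [srgb_to_linear(c) for c in rgb]
    xyz = [sum(m * c for m, c in zip(row, lin)) for row in M]

    def f(t):
        # Cube-root compression with a linear segment near black.
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip(xyz, WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

yellow_1, yellow_2 = (235, 220, 70), (255, 235, 70)
srgb_l2 = math.dist(yellow_1, yellow_2)                               # distance on the wire
delta_e76 = math.dist(srgb_to_lab(yellow_1), srgb_to_lab(yellow_2))   # distance in CIELAB
print(f"sRGB L2 = {srgb_l2:.1f}, dE76 = {delta_e76:.2f}")
```

Swapping in other color pairs at the same sRGB L2 is an easy way to see how unevenly the perceptual distance moves across the gamut.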

The setup

We ran psychophysics on two VLMs: Gemini 3 Flash (frontier proprietary) and Qwen3-VL-8B-Instruct (open-weight). Three forced-choice tasks, designed to strip out everything except the color judgment itself:

Odd-one-out (4AFC). Four circles on a 2×2 grid. Three share a base color; one doesn't. The model picks the odd one. Chance is 25%.
Same/different (2AFC). Two circles. Identical or not. Chance is 50%.
Triplet matching (2AFC). A reference plus two candidates; pick the match. Chance is 50%.
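As a rough illustration of the bookkeeping behind the odd-one-out task, the sketch below builds one trial and scores a response. It covers only the logic, not the image rendering or the prompt, and every name in it (`make_odd_one_out_trial`, `CHANCE`, and so on) is hypothetical rather than taken from our pipeline.

```python
import random

# Chance accuracy for each task format described above.
CHANCE = {"odd_one_out": 1 / 4, "same_different": 1 / 2, "triplet": 1 / 2}

def make_odd_one_out_trial(base_rgb, odd_rgb, rng):
    """Return (colors, answer): four circle colors in grid order and the odd index."""
    answer = rng.randrange(4)      # randomize position so location carries no signal
    colors = [base_rgb] * 4
    colors[answer] = odd_rgb
    return colors, answer

def accuracy(answers, responses):
    """Fraction of trials where the model picked the odd circle."""
    hits = sum(a == r for a, r in zip(answers, responses))
    return hits / len(answers)

rng = random.Random(0)
colors, answer = make_odd_one_out_trial((120, 80, 200), (121, 80, 200), rng)
print(colors, answer)
```

Randomizing the odd position per trial is what makes the 25% chance floor meaningful: a model with a position bias and no color sensitivity would still land at chance on average.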

Color pairs were sampled at controlled distances along controlled directions, both in the 2D CIE xy chromaticity plane (48 base points × 8 directions × 5 radii) and in full 3D CIELAB space (48 base points × 26 directions × 5 radii). Total: ~68,000 trials.
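One way to get exactly 8 directions in 2D and 26 in 3D is to take every nonzero sign pattern in {-1, 0, 1}ⁿ and normalize it. The sketch below does that and counts the resulting color pairs. Treat the direction scheme as an assumption for illustration (the text above gives the counts, not the construction), and `unit_directions` as a hypothetical helper.

```python
import math
from itertools import product

def unit_directions(dim):
    """All nonzero vectors in {-1, 0, 1}^dim, scaled to unit length."""
    dirs = []
    for v in product((-1, 0, 1), repeat=dim):
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0:                       # skip the all-zero vector
            dirs.append(tuple(c / norm for c in v))
    return dirs

N_BASE, N_RADII = 48, 5
pairs_2d = N_BASE * len(unit_directions(2)) * N_RADII   # chromaticity plane
pairs_3d = N_BASE * len(unit_directions(3)) * N_RADII   # full CIELAB
print(pairs_2d, pairs_3d)
```

Under this construction the 3D set is the 6 axis directions, 12 edge diagonals, and 8 corner diagonals of a cube, which gives 1,920 color pairs in 2D and 6,240 in 3D before they are expanded into trials.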

Figure: Overview of the chromaticity diagram, the three tasks, and the psychometric curve.

For each metric — sRGB L2, linear RGB L2, XYZ L2, ΔE₇₆, ΔE₀₀, and two learned power-law variants — we fit a two-parameter logistic psychometric function predicting accuracy from distance under that metric. Then we compared fits by log-likelihood and BIC. If the model's behavior is best explained by distance-in-X, then X is the metric the model is secretly using.
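The comparison above can be sketched in a few lines. The parameterization here (a threshold and a slope, with the curve rising from the task's chance level to 1) is one common choice and an assumption on our part for this illustration. The coarse grid search stands in for a proper optimizer, and the eight-trial dataset is a toy, not our data.

```python
import math

def psychometric(d, threshold, slope, chance):
    """Accuracy as a function of distance d: rises from `chance` toward 1."""
    z = max(-50.0, min(50.0, slope * (d - threshold)))   # clamp to avoid overflow
    return chance + (1 - chance) / (1 + math.exp(-z))

def log_likelihood(threshold, slope, distances, correct, chance):
    """Bernoulli log-likelihood of the observed right/wrong responses."""
    ll = 0.0
    for d, c in zip(distances, correct):
        p = min(max(psychometric(d, threshold, slope, chance), 1e-9), 1 - 1e-9)
        ll += math.log(p) if c else math.log(1 - p)
    return ll

def fit(distances, correct, chance):
    """Grid-search the two parameters; returns (best log-likelihood, threshold, slope)."""
    thresholds = [0.5 * i for i in range(1, 21)]
    slopes = [0.25, 0.5, 1, 2, 4, 8]
    return max((log_likelihood(t, s, distances, correct, chance), t, s)
               for t in thresholds for s in slopes)

def bic(ll, n_params, n_trials):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_trials) - 2 * ll

# Toy data: the same eight responses, described by two candidate metrics.
correct = [False, False, False, True, True, True, True, True]
informative = [1, 2, 3, 4, 5, 6, 7, 8]      # distance tracks correctness
uninformative = [4, 4, 4, 4, 4, 4, 4, 4]    # distance says nothing

ll_good, _, _ = fit(informative, correct, chance=0.25)
ll_bad, _, _ = fit(uninformative, correct, chance=0.25)
print(bic(ll_good, 2, len(correct)), bic(ll_bad, 2, len(correct)))
```

The responses are identical in both fits; only the distance assigned to each trial changes. A metric whose distances line up with when the model gets it right earns a higher likelihood and a lower BIC, which is the sense in which a metric "explains" the behavior.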

Result 1: ΔE₀₀ wins, and the loser order is weird

Figure: How well does each color-distance metric explain VLM judgments? Log-likelihood relative to ΔE₀₀ (0 = ΔE₀₀ baseline; higher is better). Every input-space metric loses to ΔE₀₀, the engineered perceptual metric. Note that sRGB L2 beats linear RGB L2: undoing gamma encoding actively hurts the fit.

Across both VLMs, both color spaces, and all three task formats, the perceptual metric ΔE₀₀ explains the data better than any input-space metric. The margin over sRGB L2 is 228 log-likelihood units on Gemini 2D, 510 on Gemini 3D, and even larger on Qwen3-VL. With sample sizes in the thousands, that's not a tie.

But look at the ordering of the input-space metrics. The standard color-science transform chain goes sRGB → linear RGB → XYZ → Lab → ΔE₀₀, each step closer to human perception. If the VLM were doing some partial perceptual computation, you'd expect the ordering to mirror the chain.

It doesn't. sRGB beats linear RGB and XYZ. Undoing the gamma encoding — moving toward physical light intensity — actively hurts the fit. The biggest single jump along the chain happens at the XYZ → CIELAB step, where the cube-root compression and perceptual weighting enter. ΔE₀₀ adds a final modest polish on top of plain CIELAB.

This rules out the naive "VLM is a pixel-space distance calculator" hypothesis decisively. But it does more than that. The gamma curve itself acts as a weak perceptual prior — sRGB happens to be roughly perceptually graded for lightness, by accident of CRT physics meeting the human visual system. The VLM keeps that prior and discards the rest of the linearization.
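The "weak perceptual prior" point is easy to check numerically for grays. The sketch below compares CIE lightness L* against two candidate scales, the gamma-encoded sRGB value and the linear light intensity, each stretched to 0–100. The transfer function and the L* formula are the standard ones; the function and variable names are ours.

```python
def srgb_decode(v):
    """Gamma-encoded sRGB value in [0, 1] -> linear light intensity in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def cie_lightness(y):
    """CIE L* (0-100) from relative luminance y in [0, 1]."""
    return 116 * y ** (1 / 3) - 16 if y > (6 / 29) ** 3 else (29 / 3) ** 3 * y

grays = [code / 255 for code in range(256)]

# Worst-case disagreement with L* across all 256 gray levels.
gap_encoded = max(abs(100 * v - cie_lightness(srgb_decode(v))) for v in grays)
gap_linear = max(abs(100 * srgb_decode(v) - cie_lightness(srgb_decode(v))) for v in grays)
print(f"encoded sRGB vs L*: {gap_encoded:.1f}   linear vs L*: {gap_linear:.1f}")
```

For grays, the encoded value stays within a few L* units of perceptual lightness everywhere, while linear intensity is off by roughly 30 units in the midtones. This only covers the lightness axis, which is the part of the story the gamma curve can plausibly account for.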

Is ΔE₀₀ enough?

In the 2D chromaticity plane: yes. We fit augmented ΔE₀₀ models with extra parameters for each base point or each direction — the idea being that if the VLM has some local bias not captured by the universal metric, those intercepts would catch it. The likelihood ratio test detects effects (p < 0.001), but they're tiny: not worth the parameters.

In 3D, where lightness enters, the picture cracks. A single extra parameter that lets lightness directions (L*) act differently from chromatic directions (a*, b*) wins on BIC. Both VLMs find a* (red ↔ green) easiest and b* (blue ↔ yellow) hardest. ΔE₀₀ does not fully equalize their sensitivity across the three CIELAB axes.

So: ΔE₀₀ is approximately right, with a residual axis-dependent bias we don't yet understand.

Result 2: It isn't there at the input

The ΔE₀₀-wins result is suggestive, but it doesn't tell us where in the model the perceptual structure lives. Qwen3-VL is open-weight, so we can look inside.

We sent solid-color images through the network and recorded activations at every layer. For each layer, we computed pairwise L2 distance in representation space, and asked: how well does that layer's geometry correlate with sRGB? With ΔE₀₀? Higher R² means the representation distance behaves like that metric.
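The per-layer score can be sketched as follows: take a set of colors, compute all pairwise distances under a color metric and under a layer's representation, and take the squared Pearson correlation between the two lists. Treating R² as squared Pearson correlation is our assumption for this illustration, and the two toy "embeddings" below are hypothetical stand-ins for real activations.

```python
import math
from itertools import combinations

def pairwise(items, dist):
    """Distances between every unordered pair of items."""
    return [dist(a, b) for a, b in combinations(items, 2)]

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

colors = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255),
          (128, 128, 128), (255, 255, 0), (30, 60, 90)]

def toy_linear_embed(rgb):
    # Stand-in for a layer that preserves pixel geometry (scaling plus padding).
    return tuple(2.0 * c for c in rgb) + (0.0,)

def toy_warped_embed(rgb):
    # Stand-in for a layer that compresses each channel nonlinearly.
    return tuple(math.sqrt(c) for c in rgb)

metric_d = pairwise(colors, math.dist)   # sRGB L2 between colors
r2_linear = r_squared(metric_d, pairwise([toy_linear_embed(c) for c in colors], math.dist))
r2_warped = r_squared(metric_d, pairwise([toy_warped_embed(c) for c in colors], math.dist))
print(round(r2_linear, 3), round(r2_warped, 3))
```

A representation that is just a rescaled copy of the pixels scores an R² of 1 against sRGB L2, and any nonlinear warp pulls it below that. Running the same comparison against ΔE₀₀ distances instead of sRGB distances gives the second curve in the layerwise analysis.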

The patch embedding is where pixels first meet the network — a single Conv3d projection that maps 14×14×3 patches into a 1152-dim space. If perceptual structure were hard-coded by the architecture, it should be visible already here.

It isn't.

Figure: Patch embedding distance plotted against sRGB distance (R² = 0.97) and against ΔE₀₀ distance (R² = 0.46).

The patch embedding is sRGB. R² = 0.97 against sRGB L2, R² = 0.46 against ΔE₀₀. The input projection is a near-perfect linear preservation of the model's input space — exactly as the boring hypothesis would predict.

The interesting thing happens deeper:

Figure: Where does perceptual structure live? Qwen3-VL-8B, layer by layer. R² of pairwise representation distance against each color metric, with the crossover at ViT layer 10. At the patch embedding, sRGB explains 97% of representational variance; the input projection preserves pixel geometry almost perfectly. By layer 10, ΔE₀₀ overtakes sRGB. At the merger (entry to the LLM), ΔE₀₀ R² = 0.80 and sRGB R² = 0.73.

Across the 27 ViT layers of Qwen3-VL, sRGB R² falls and ΔE₀₀ R² rises. They cross at layer 10. By the merger (the 4×1152 → 4096 projection into the LLM), ΔE₀₀ is the better predictor by a clear margin. The perceptual structure isn't at the gate. It's built inside the vision transformer, layer by layer.

This matters for the three hypotheses. Input statistics (hypothesis 1) would predict perceptual alignment at the input, since that's where the input lives — it's not there. A mechanistic architectural prior (hypothesis 2) would predict it in fixed early layers, the same across models — but it builds up gradually and we can watch it accumulate. What's left is hypothesis 3: it's learned, from data.

The encoder zoo

To test the "learned from data" story, we ran the same layerwise analysis on eight standalone vision encoders, picked to span training objectives:

Image–text contrastive: CLIP-B/16, CLIP-L/14, SigLIP-B/16, SigLIP2-SO400M
Self-supervised: DINOv2-B/14, DINOv2-L/14 (self-distillation), MAE-B/16 (pixel reconstruction)
Supervised classification: ViT-B/16-IN21k

Everyone starts with sRGB at the patch embedding (R² > 0.90). The question is where they land.

Figure: Training objective predicts how perceptual the final representation is. Final-layer ΔE₀₀ R² across 8 vision encoders. All patch embeddings are sRGB-aligned (>0.90); only deeper representations diverge. Image–text contrastive (CLIP, SigLIP) and supervised classification (ViT-IN21k) land highest. Self-supervised pixel reconstruction (MAE) lands lowest and stays close to sRGB. Any objective that asks the network to honor human-defined categories pushes the representation toward human perceptual structure.

The pattern is clean. Pixel reconstruction (MAE) stays sRGB-flavored — its training signal is "predict the missing pixels," so honoring pixel geometry is rewarded. Self-distillation (DINOv2) lands in the middle. Anything trained against human-defined categories — image-text pairs (CLIP, SigLIP) or image-class labels (ViT-IN21k) — lands at the top.

Notice that supervised classification matches contrastive image–text training almost exactly. Text isn't uniquely special. ImageNet class labels are enough to push the representation toward human perceptual structure. The shared ingredient is "a learning signal that respects human categorical boundaries."

(The SigLIP2-SO400M outlier — high patch alignment but low final ΔE₀₀ — is interesting and we don't fully understand it. The model is much larger than the others and trained on a different data distribution. Worth coming back to.)

What this means

VLMs receive raw sRGB and could discriminate at machine precision. They don't, because they've been trained on humans. The metric they end up using is not the metric on the wire; it's a learned approximation to the human perceptual metric. The approximation isn't hand-coded into the architecture — it accumulates through the vision transformer during training, with the slope of the accumulation set by the training objective.

The strong form of the claim is: neural networks inherit perceptual structure from the humans who curated their training data. Color is the cleanest possible domain to demonstrate that, because we have a quantitative ground truth (ΔE₀₀, fit to a century of psychophysics). But there's no obvious reason the same story shouldn't apply to other perceptual dimensions where humans are non-uniform — texture sensitivity, auditory pitch, ambiguous-figure preferences, social-stimulus saliency. Wherever a model is trained on human-labeled or human-curated data, the model is being shaped by human sensory biases, whether the model designer intended it or not.

Caveats

A few things to keep skeptical about:

BIC scales with N. With ~14,000 trials per dataset, a ΔBIC of 500 looks dramatic but works out to only a few hundredths of a nat per trial (500 / 14,000 ≈ 0.036 in BIC units, or about 0.018 nats of log-likelihood when the models have equal parameter counts): real, but modest. The qualitative ordering (ΔE₀₀ > ΔE₇₆ > sRGB ≫ linear RGB > XYZ) replicates across six dataset × VLM cells, which is more meaningful than any single number.
Our color pairs sit on a structured grid; trials aren't truly i.i.d. The effective sample size is smaller than the literal one.
We tested two VLMs. Replication of the behavioral finding on Gemma 3 and Llama 3.2 Vision is in progress. Generalization to all VLMs is conjecture, not yet established.
"Perceptual" here means "as measured by ΔE₀₀ under the CIE 1931 standard observer." Color perception varies across humans and across cultures. The VLM has been fit to whatever blend of perceptions its training distribution captured.

Where this is going

The natural next question is whether you can induce perceptual structure by manipulating the training data, or knock it out by training against a non-human signal. We're also looking at whether the residual axis-dependent bias in 3D CIELAB is a hue-naming artifact (the b* axis is harder where the model has weaker color vocabulary — this is testable). And we're running a separate track asking whether natural image color statistics alone could account for the ΔE₀₀ advantage, without needing to invoke "learning a perceptual prior" at all. (Early answer: probably not, but it's a real experiment, not a hand-wave.)

For now, the headline stands. A VLM is given pixels and learns to see colors the way people see them, because people made everything it was ever trained on.

This post is a blog version of our SciForDL @ ICLR 2026 workshop submission. Full paper: Vision Language Models Inherit Human Color Perception. Code and data release coming with the camera-ready version.
