A vision language model sees the world the same way your monitor draws it: as three numbers per pixel (red, green, blue), each an 8-bit integer between 0 and 255. In principle, that gives it machine precision. A VLM could tell (120, 80, 200) apart from (121, 80, 200) without breaking a sweat. The difference is one unit, in a space whose diagonal is over 440 units long.
In practice it can't. Show a frontier VLM two colored circles a single integer apart and it'll call them the same. Push the difference out to twenty or thirty units and it starts to get it right, but only sometimes, and only for certain colors. There is a fuzzy decision boundary in these models that bears no obvious relation to their raw input. What sets that boundary?
That is the question we set out to answer. The answer turned out to be a little uncanny.
Three hypotheses for why a VLM's color discrimination doesn't follow its input geometry: input statistics, a mechanistic prior built into the architecture, or structure learned from data.
These aren't mutually exclusive. The point is to find out which fraction is which.
Before the experiments, a quick demo of the thing we're going to measure VLMs against. The CIE ΔE₀₀ metric (CIEDE2000) is the gold-standard color-difference formula in the color-science world. It's been fit to decades of psychophysical data: you can think of it as "the distance between two colors in the geometry that a typical human eye-brain uses." A ΔE₀₀ of about 1 is roughly the human just-noticeable difference (JND): pairs closer than that are, for most observers, indistinguishable.
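For readers who want the metric outside the browser, here is a minimal Python sketch of ΔE₀₀, written from the published CIEDE2000 definition with the usual unit weighting factors (kL = kC = kH = 1). It takes CIELAB triples as input. The function name `delta_e00` and the example colors are illustrative choices, and this is not the code behind the widget or the experiments.

```python
import math

def _cosd(deg):
    return math.cos(math.radians(deg))

def delta_e00(lab1, lab2):
    """CIEDE2000 color difference between two CIELAB triples (kL = kC = kH = 1)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2

    # Rescale a* (the G factor) to correct CIELAB near the neutral axis.
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar**7 / (c_bar**7 + 25**7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360
    h2p = math.degrees(math.atan2(b2, a2p)) % 360

    # Differences in lightness, chroma, and hue.
    d_l = L2 - L1
    d_c = c2p - c1p
    d_h_deg = 0.0
    if c1p * c2p != 0:
        d_h_deg = h2p - h1p
        if d_h_deg > 180:
            d_h_deg -= 360
        elif d_h_deg < -180:
            d_h_deg += 360
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2))

    # Pair means that drive the weighting functions.
    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    if c1p * c2p == 0:
        h_bar = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        h_bar = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        h_bar = (h1p + h2p + 360) / 2
    else:
        h_bar = (h1p + h2p - 360) / 2

    t = (1 - 0.17 * _cosd(h_bar - 30) + 0.24 * _cosd(2 * h_bar)
         + 0.32 * _cosd(3 * h_bar + 6) - 0.20 * _cosd(4 * h_bar - 63))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t

    # Rotation term: couples chroma and hue differences in the blue region.
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p**7 / (c_bar_p**7 + 25**7))
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    return math.sqrt((d_l / s_l) ** 2 + (d_c / s_c) ** 2 + (d_h / s_h) ** 2
                     + r_t * (d_c / s_c) * (d_h / s_h))

# Two illustrative pairs with the same plain CIELAB (ΔE₇₆) distance of 5:
neutral = delta_e00((50, 0, 0), (50, 5, 0))       # near gray
saturated = delta_e00((50, 60, 0), (50, 65, 0))   # already vivid
print(f"near-neutral pair: {neutral:.2f}   saturated pair: {saturated:.2f}")
```

The two example pairs are the same Euclidean distance apart in CIELAB, yet ΔE₀₀ scores the near-neutral step as the larger difference, because the chroma weighting grows with how saturated the colors already are.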
The slider widget below computes ΔE₀₀ in your browser. Try the presets. Notice that for the same sRGB distance, two yellows can be almost identical while two blues are clearly distinct. The mapping from "what the model sees on its wire" to "what looks different" is deeply non-uniform.
(Widget preset: a big shift in sRGB that is barely visible. The yellow region is perceptually compressed.)
That non-uniformity is what we'd like the VLM to ignore. Its input is sRGB. The clean, boring hypothesis says it should discriminate by sRGB distance.
We ran psychophysics on two VLMs: Gemini 3 Flash (frontier proprietary) and Qwen3-VL-8B-Instruct (open-weight). Three forced-choice tasks, designed to strip out everything except the color judgment itself:
Color pairs were sampled at controlled distances along controlled directions, both in the 2D CIE xy chromaticity plane (48 base points × 8 directions × 5 radii) and in full 3D CIELAB space (48 base points × 26 directions × 5 radii). Total: ~68,000 trials.
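The sampling design can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the base-point grids, the radii, and the choice of directions as the normalized nonzero offsets in {-1, 0, 1}ⁿ (which happens to give 8 directions in 2D and 26 in 3D, matching the counts above). The actual points, radii, and gamut handling in the study may differ.

```python
import itertools
import math

def unit_directions(dim):
    """Normalized nonzero offsets in {-1, 0, 1}^dim: 8 in 2D, 26 in 3D."""
    dirs = []
    for offsets in itertools.product((-1, 0, 1), repeat=dim):
        norm = math.sqrt(sum(o * o for o in offsets))
        if norm > 0:
            dirs.append(tuple(o / norm for o in offsets))
    return dirs

def sample_pairs(base_points, radii):
    """Return (base, probe, radius) triples: one probe per base x direction x radius."""
    dim = len(base_points[0])
    triples = []
    for base in base_points:
        for direction in unit_directions(dim):
            for r in radii:
                probe = tuple(b + r * u for b, u in zip(base, direction))
                triples.append((base, probe, r))
    return triples

# Hypothetical 48-point grids and radii, for illustration only.
base_xy = [(0.25 + 0.03 * i, 0.25 + 0.03 * j) for i in range(8) for j in range(6)]
base_lab = [(30.0 + 20 * i, -30.0 + 20 * j, -30.0 + 30 * k)
            for i in range(4) for j in range(4) for k in range(3)]

pairs_2d = sample_pairs(base_xy, radii=(0.002, 0.004, 0.008, 0.016, 0.032))
pairs_3d = sample_pairs(base_lab, radii=(1, 2, 4, 8, 16))
print(len(pairs_2d), len(pairs_3d))
```

A real pipeline would also need to discard or clip probes that fall outside the displayable sRGB gamut, which this sketch does not attempt.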

For each metric — sRGB L2, linear RGB L2, XYZ L2, ΔE₇₆, ΔE₀₀, and two learned power-law variants — we fit a two-parameter logistic psychometric function predicting accuracy from distance under that metric. Then we compared fits by log-likelihood and BIC. If the model's behavior is best explained by distance-in-X, then X is the metric the model is secretly using.
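The model-comparison logic above can be sketched as follows. This is a minimal, stdlib-only illustration on synthetic data, not the analysis code: it assumes the psychometric function has the form P(correct) = sigmoid(a + b · distance), fits it by damped Newton steps, and scores each candidate metric with BIC. The exact parameterization used in the study (for example, how the chance floor is handled) is not stated here, and all names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_lik(a, b, dist, correct):
    ll = 0.0
    for d, y in zip(dist, correct):
        p = min(max(sigmoid(a + b * d), 1e-12), 1 - 1e-12)
        ll += math.log(p) if y else math.log(1 - p)
    return ll

def fit_psychometric(dist, correct, iters=50):
    """Maximum-likelihood fit of P(correct) = sigmoid(a + b * d). Returns (a, b, log-lik)."""
    a = b = 0.0
    ll = log_lik(a, b, dist, correct)
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d, y in zip(dist, correct):
            p = sigmoid(a + b * d)
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * d
            h00 += w
            h01 += w * d
            h11 += w * d * d
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break
        # Newton direction: solve the 2x2 system H * step = gradient.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        scale = 1.0
        while scale > 1e-8 and log_lik(a + scale * da, b + scale * db, dist, correct) < ll:
            scale /= 2  # back off if the full step would lower the likelihood
        a, b = a + scale * da, b + scale * db
        new_ll = log_lik(a, b, dist, correct)
        converged = new_ll - ll < 1e-9
        ll = new_ll
        if converged:
            break
    return a, b, ll

def bic(ll, n_trials, n_params=2):
    return n_params * math.log(n_trials) - 2 * ll

# Synthetic observer whose accuracy truly depends on `true_d`.
rng = random.Random(0)
true_d = [rng.uniform(0, 20) for _ in range(2000)]
correct = [int(rng.random() < sigmoid(0.4 * (d - 8))) for d in true_d]
# A competing metric: the same distances, distorted trial by trial.
distorted_d = [d * rng.uniform(0.3, 1.7) for d in true_d]

fit_true = fit_psychometric(true_d, correct)
fit_distorted = fit_psychometric(distorted_d, correct)
bic_true = bic(fit_true[2], len(correct))
bic_distorted = bic(fit_distorted[2], len(correct))
print(f"BIC under the true metric: {bic_true:.1f}, under the distorted one: {bic_distorted:.1f}")
```

Because both fits have the same two parameters, the BIC gap here is just twice the log-likelihood gap. The penalty term only starts to matter when the models being compared differ in size.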
Figure: every input-space metric loses to ΔE₀₀, the engineered perceptual metric. Note that sRGB L2 beats linear RGB L2: undoing gamma encoding actively hurts the fit.
Across both VLMs, both color spaces, and all three task formats, the perceptual metric ΔE₀₀ explains the data better than any input-space metric. The margin over sRGB L2 is 228 log-likelihood units on Gemini 2D, 510 on Gemini 3D, and even larger on Qwen3-VL. With sample sizes in the thousands, that's not a tie.
But look at the ordering of the input-space metrics. The standard color-science transform chain goes sRGB → linear RGB → XYZ → Lab → ΔE₀₀, each step closer to human perception. If the VLM were doing some partial perceptual computation, you'd expect the ordering to mirror the chain.
It doesn't. sRGB beats linear RGB and XYZ. Undoing the gamma encoding — moving toward physical light intensity — actively hurts the fit. The biggest single jump along the chain happens at the XYZ → CIELAB step, where the cube-root compression and perceptual weighting enter. ΔE₀₀ adds a final modest polish on top of plain CIELAB.
This rules out the naive "VLM is a pixel-space distance calculator" hypothesis decisively. But it does more than that. The gamma curve itself acts as a weak perceptual prior — sRGB happens to be roughly perceptually graded for lightness, by accident of CRT physics meeting the human visual system. The VLM keeps that prior and discards the rest of the linearization.
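To make the transform chain concrete, here is a small Python sketch that measures one color pair at each stage: sRGB, linear RGB, XYZ, and CIELAB (where plain Euclidean distance is ΔE₇₆). It assumes the standard sRGB transfer function and the commonly published sRGB-to-XYZ matrix with a D65 white point. The function names and example colors are illustrative, not taken from the experiment code.

```python
import math

# Commonly published linear-sRGB -> XYZ matrix and D65 reference white.
M = ((0.4124564, 0.3575761, 0.1804375),
     (0.2126729, 0.7151522, 0.0721750),
     (0.0193339, 0.1191920, 0.9503041))
WHITE = (0.95047, 1.0, 1.08883)

def srgb_to_linear(c8):
    """Undo the sRGB gamma encoding for one 8-bit channel."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_xyz(rgb_lin):
    return tuple(sum(m * c for m, c in zip(row, rgb_lin)) for row in M)

def xyz_to_lab(xyz):
    delta = 6 / 29
    def f(t):
        # Cube-root compression, with a linear segment near black.
        return t ** (1 / 3) if t > delta**3 else t / (3 * delta**2) + 4 / 29
    fx, fy, fz = (f(v / w) for v, w in zip(xyz, WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def srgb_to_lab(rgb8):
    return xyz_to_lab(linear_to_xyz(tuple(srgb_to_linear(c) for c in rgb8)))

def stage_distances(rgb_a, rgb_b):
    """Euclidean distance between two 8-bit sRGB colors at each stage of the chain."""
    lin_a = tuple(srgb_to_linear(c) for c in rgb_a)
    lin_b = tuple(srgb_to_linear(c) for c in rgb_b)
    xyz_a, xyz_b = linear_to_xyz(lin_a), linear_to_xyz(lin_b)
    return {
        "sRGB L2": math.dist(rgb_a, rgb_b),
        "linear RGB L2": math.dist(lin_a, lin_b),
        "XYZ L2": math.dist(xyz_a, xyz_b),
        "dE76 (Lab L2)": math.dist(xyz_to_lab(xyz_a), xyz_to_lab(xyz_b)),
    }

# Same 30-unit sRGB step in the blue channel, applied to a yellow and to a blue.
for name, a, b in [("yellow", (240, 230, 40), (240, 230, 70)),
                   ("blue", (40, 60, 200), (40, 60, 230))]:
    print(name, {k: round(v, 3) for k, v in stage_distances(a, b).items()})

# Lightness along the gray axis: sRGB code value vs. linear intensity vs. L*.
for code in (64, 119, 192):
    print(code, round(code / 255, 3), round(srgb_to_linear(code), 3),
          round(srgb_to_lab((code, code, code))[0], 1))
```

The gray-axis loop illustrates the "weak perceptual prior" point: a mid-gray code value of 119 sits at about 47% of the sRGB range and lands near L* = 50, whereas its linear intensity is only about 18% of full scale. The stages live on different numeric scales (0 to 255 versus 0 to 1), which is harmless for the psychometric fits because the fitted slope absorbs any overall scale factor.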
Is a single, universal ΔE₀₀ the whole story? In the 2D chromaticity plane: yes. We fit augmented ΔE₀₀ models with extra parameters for each base point or each direction, the idea being that if the VLM has some local bias not captured by the universal metric, those intercepts would catch it. The likelihood ratio test detects effects (p < 0.001), but they're tiny: too small to justify the extra parameters.
In 3D, where lightness enters, the picture cracks. A single extra parameter that lets lightness directions (L*) act differently from chromatic directions (a*, b*) wins on BIC. Both VLMs find a* (red ↔ green) easiest and b* (blue ↔ yellow) hardest. ΔE₀₀ does not fully equalize their sensitivity across the three CIELAB axes.
So: ΔE₀₀ is approximately right, with a residual axis-dependent bias we don't yet understand.
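It can look odd that an effect is significant at p < 0.001 and still "not worth the parameters." The sketch below shows how that happens when a nested model adds many parameters on a large dataset. The numbers are made up for illustration and are not results from the study. The chi-square tail uses a closed form that holds only for even degrees of freedom, to stay within the standard library; with SciPy installed, `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form needs a positive even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def compare_nested(ll_small, ll_big, extra_params, n_trials):
    """Likelihood ratio test and BIC difference for a model nested in a bigger one."""
    lr_stat = 2 * (ll_big - ll_small)
    p_value = chi2_sf_even_df(lr_stat, extra_params)
    # BIC(big) - BIC(small): positive means the smaller model is preferred.
    delta_bic = extra_params * math.log(n_trials) - lr_stat
    return lr_stat, p_value, delta_bic

# Hypothetical numbers: 46 extra intercepts buy 45 log-likelihood units on 10,000 trials.
lr_stat, p_value, delta_bic = compare_nested(
    ll_small=-5000.0, ll_big=-4955.0, extra_params=46, n_trials=10_000)
print(f"LR = {lr_stat:.1f}, p = {p_value:.2e}, BIC(big) - BIC(small) = {delta_bic:.1f}")
```

In this toy case the likelihood ratio test rejects the smaller model, while BIC still prefers it, because each extra parameter has to earn ln(n) units of deviance and these do not.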
The ΔE₀₀-wins result is suggestive, but it doesn't tell us where in the model the perceptual structure lives. Qwen3-VL is open-weight, so we can look inside.
We sent solid-color images through the network and recorded activations at every layer. For each layer, we computed pairwise L2 distance in representation space, and asked: how well does that layer's geometry correlate with sRGB? With ΔE₀₀? Higher R² means the representation distance behaves like that metric.
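The per-layer measurement can be sketched like this. It assumes R² means the squared Pearson correlation between the vector of pairwise representation distances and the vector of pairwise metric distances, which is one natural reading of the description above rather than a confirmed detail. The two "layers" are toy stand-ins, not real activations: one is a linear rescaling of the colors, the other a per-channel nonlinear warp.

```python
import math
from itertools import combinations

def pairwise_l2(vectors):
    """Euclidean distance for every unordered pair, in a fixed order."""
    return [math.dist(u, v) for u, v in combinations(vectors, 2)]

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# A 6 x 6 x 6 grid of solid colors and their pairwise sRGB distances.
colors = [(r, g, b) for r in range(0, 256, 51)
          for g in range(0, 256, 51) for b in range(0, 256, 51)]
reference = pairwise_l2(colors)

# Toy "activations" per layer; in the real analysis these come from the network.
toy_layers = {
    "linear stand-in": [tuple(2.0 * c for c in rgb) for rgb in colors],
    "warped stand-in": [tuple((c / 255) ** 3 for c in rgb) for rgb in colors],
}

r2_by_layer = {name: r_squared(pairwise_l2(acts), reference)
               for name, acts in toy_layers.items()}
for name, r2 in r2_by_layer.items():
    print(f"{name}: R^2 vs sRGB L2 = {r2:.3f}")
```

Swapping `reference` for pairwise ΔE₀₀ distances over the same colors gives the second curve. Comparing the two R² values layer by layer is what locates the crossover.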
The patch embedding is where pixels first meet the network — a single Conv3d projection that maps 14×14×3 patches into a 1152-dim space. If perceptual structure were hard-coded by the architecture, it should be visible already here.
It isn't.


The patch embedding is sRGB. R² = 0.97 against sRGB L2, R² = 0.46 against ΔE₀₀. The input projection is a near-perfect linear preservation of the model's input space — exactly as the boring hypothesis would predict.
The interesting thing happens deeper:
Figure: at the patch embedding, sRGB explains 97% of representational variance, so the input projection preserves pixel geometry almost perfectly. By layer 10, ΔE₀₀ overtakes sRGB. At the merger (entry to the LLM), ΔE₀₀ R² = 0.80, sRGB R² = 0.73.
Across the 27 ViT layers of Qwen3-VL, sRGB R² falls and ΔE₀₀ R² rises. They cross at layer 10. By the merger (the 4×1152 → 4096 projection into the LLM), ΔE₀₀ is the better predictor by a clear margin. The perceptual structure isn't at the gate. It's built inside the vision transformer, layer by layer.
This matters for the three hypotheses. Input statistics (hypothesis 1) would predict perceptual alignment at the input, since that's where the input lives, and it's not there. A mechanistic architectural prior (hypothesis 2) would predict it in fixed early layers, the same across models, but it builds up gradually and we can watch it accumulate. What's left, by elimination, is hypothesis 3: it's learned, from data.
To test the "learned from data" story, we ran the same layerwise analysis on eight standalone vision encoders, picked to span training objectives:
Everyone starts with sRGB at the patch embedding (R² > 0.90). The question is where they land.
Figure: image-text contrastive (CLIP, SigLIP) and supervised classification (ViT-IN21k) land highest. Self-supervised pixel reconstruction (MAE) lands lowest and stays close to sRGB. Any objective that asks the network to honor human-defined categories pushes the representation toward human perceptual structure.
The pattern is clean. Pixel reconstruction (MAE) stays sRGB-flavored — its training signal is "predict the missing pixels," so honoring pixel geometry is rewarded. Self-distillation (DINOv2) lands in the middle. Anything trained against human-defined categories — image-text pairs (CLIP, SigLIP) or image-class labels (ViT-IN21k) — lands at the top.
Notice that supervised classification matches contrastive image–text training almost exactly. Text isn't uniquely special. ImageNet class labels are enough to push the representation toward human perceptual structure. The shared ingredient is "a learning signal that respects human categorical boundaries."
(The SigLIP2-SO400M outlier — high patch alignment but low final ΔE₀₀ — is interesting and we don't fully understand it. The model is much larger than the others and trained on a different data distribution. Worth coming back to.)
VLMs receive raw sRGB and could discriminate at machine precision. They don't, because they've been trained on data that humans produced. The metric they end up using is not the metric on the wire; it's a learned approximation to the human perceptual metric. The approximation isn't hand-coded into the architecture. It accumulates through the vision transformer during training, with the slope of the accumulation set by the training objective.
The strong form of the claim is: neural networks inherit perceptual structure from the humans who curated their training data. Color is the cleanest possible domain to demonstrate that, because we have a quantitative ground truth (ΔE₀₀, fit to decades of psychophysical data). But there's no obvious reason the same story shouldn't apply to other perceptual dimensions where humans are non-uniform: texture sensitivity, auditory pitch, ambiguous-figure preferences, social-stimulus saliency. Wherever a model is trained on human-labeled or human-curated data, the model is being shaped by human sensory biases, whether the model designer intended it or not.
A few things to stay skeptical about:
The natural next question is whether you can induce perceptual structure by manipulating the training data, or knock it out by training against a non-human signal. We're also looking at whether the residual axis-dependent bias in 3D CIELAB is a hue-naming artifact (the b* axis is harder where the model has weaker color vocabulary — this is testable). And we're running a separate track asking whether natural image color statistics alone could account for the ΔE₀₀ advantage, without needing to invoke "learning a perceptual prior" at all. (Early answer: probably not, but it's a real experiment, not a hand-wave.)
For now, the headline stands. A VLM is given pixels and learns to see colors the way people see them, because people made everything it was ever trained on.
This post is a blog version of our SciForDL @ ICLR 2026 workshop submission. Full paper: Vision Language Models Inherit Human Color Perception. Code and data release coming with the camera-ready version.