
Do LLMs have dopamine?

Exploring whether large language models have causal internal representations of reward.

Most of this post is semi-placeholder text generated with AI assistance.

Overview

This project explores whether large language models (LLMs) have causal internal representations of reward. LLMs have no neurotransmitters, but their behavior is shaped during training by reward signals (most directly via RLHF), and those signals may give rise to internal structures that serve an analogous function.

Key Questions

  • Do LLMs develop causal internal representations that track reward signals?
  • Can we identify neural circuits within LLMs that encode reward prediction and value?
  • How do reinforcement learning techniques (like RLHF) shape these internal representations?
  • What are the implications for understanding agency and goal-directed behavior in LLMs?

Research Approach

Mechanistic Interpretability

Investigating the model's internal mechanisms through:

  • Probing hidden states for reward-related representations (a minimal probing sketch follows this list)
  • Analyzing attention patterns during value-laden decisions
  • Identifying causal pathways that mediate reward-based choices
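
As a concrete illustration of the probing step, here is a minimal sketch of a linear probe over hidden states. It assumes GPT-2 (via Hugging Face transformers) as a stand-in model, and the prompts and their binary "reward-relevance" labels are invented for illustration; a real experiment would need a much larger, carefully controlled dataset.

```python
# Minimal sketch: linear probe for reward-related information in hidden states.
# Assumptions: GPT-2 as a stand-in model; toy prompts and labels made up for
# illustration only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy dataset: prompts paired with a hypothetical "reward-relevant" label.
prompts = [
    ("You receive a large bonus for finishing early.", 1),
    ("The weather today is mild and overcast.", 0),
    ("Winning the lottery changed her plans entirely.", 1),
    ("The report lists the quarterly figures.", 0),
]

features, labels = [], []
with torch.no_grad():
    for text, label in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Use the final layer's hidden state at the last token as the feature.
        features.append(outputs.hidden_states[-1][0, -1].numpy())
        labels.append(label)

# Fit a linear probe: if it separates the classes on held-out data, the hidden
# states carry (at least linearly decodable) reward-related information.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("Probe training accuracy:", probe.score(features, labels))
```

High probe accuracy only shows that the information is linearly decodable, which is correlational evidence; the interventions described under Behavioral Analysis are needed to test whether it is causally used.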

Behavioral Analysis

Studying LLM behavior and its internal correlates to identify:

  • Whether internal representations causally influence outputs (see the activation-patching sketch after this list)
  • How reward signals propagate through the network
  • Emergent goal-seeking behaviors and their neural correlates
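
To make the causal test concrete, the sketch below uses activation patching: an activation captured from a "high reward" prompt is written into the corresponding position during a run on a contrasting prompt, and the resulting shift in next-token predictions is measured. GPT-2, the patched layer (block 6), and the prompt pair are hypothetical choices made purely for illustration.

```python
# Minimal sketch: activation patching to test causal influence on outputs.
# Assumptions: GPT-2 via Hugging Face transformers; the patched layer and the
# prompt pair are arbitrary illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block to patch (an arbitrary choice here)

def last_token_activation(prompt):
    """Return the chosen block's output activation at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of block index LAYER.
    return out.hidden_states[LAYER + 1][0, -1].clone()

source_act = last_token_activation("The reward for this action is very high.")

def patch_hook(module, inputs, output):
    # Overwrite the last-token activation with the source activation.
    hidden = output[0]
    hidden[0, -1] = source_act
    return (hidden,) + output[1:]

base_prompt = "The reward for this action is very low."
inputs = tokenizer(base_prompt, return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**inputs).logits[0, -1]
handle.remove()

# A large shift in next-token predictions is evidence that the patched
# representation causally influences the output rather than merely
# correlating with it.
shift = (patched_logits - clean_logits).abs().max().item()
print("Max next-token logit shift after patching:", shift)
```

In a fuller experiment one would patch across layers and positions and use a behaviorally meaningful metric rather than a raw logit shift.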

Theoretical Framework

Developing a framework for distinguishing genuine causal models of reward from merely correlational patterns in LLMs.

Preliminary Findings

So far, the investigation has focused on three aspects:

  1. Internal Representations: Evidence that LLMs may develop structured representations of reward and value during training
  2. Causal Influence: Testing whether these representations causally influence model outputs, not just correlate with them
  3. Emergent Circuits: Identifying potential "reward circuits" within transformer architectures that process value-related information (see the head-ablation sketch below)
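
One simple way to start looking for such circuits is to ablate attention heads one at a time and measure how much a value-laden prediction degrades, as in the sketch below. GPT-2, the prompt, and the target token (" happy") are illustrative assumptions, not claims about where reward-related circuitry actually lives.

```python
# Minimal sketch: brute-force attention-head ablation to look for candidate
# "reward circuits". Assumptions: GPT-2; the prompt and target token are
# illustrative stand-ins for a value-laden prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "After the big win, she felt very"
inputs = tokenizer(prompt, return_tensors="pt")
target_id = tokenizer(" happy", add_special_tokens=False).input_ids[0]

def target_logit(head_mask=None):
    """Next-token logit for the target word, optionally with heads masked."""
    with torch.no_grad():
        out = model(**inputs, head_mask=head_mask)
    return out.logits[0, -1, target_id].item()

baseline = target_logit()
n_layers, n_heads = model.config.n_layer, model.config.n_head

# Zero out one head at a time and record how much the target logit drops.
effects = []
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0  # silence this head's contribution
        effects.append((baseline - target_logit(mask), layer, head))

effects.sort(reverse=True)
print("Heads whose ablation most reduces the target logit:")
for drop, layer, head in effects[:5]:
    print(f"  layer {layer}, head {head}: drop {drop:.3f}")
```

Single-head ablations are a crude first pass; heads that only matter in combination will be missed, so grouped ablations or path-level interventions would be a natural next step.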

Implications

Understanding causal reward representations in LLMs could:

  • Reveal whether LLMs have genuine goal-directed behavior or merely simulate it
  • Inform more controllable and aligned AI systems
  • Provide new tools for mechanistic interpretability research
  • Shed light on the nature of agency and intentionality in artificial systems

Current Status

This is an ongoing theoretical investigation combining insights from:

  • Computational neuroscience
  • Machine learning theory
  • Cognitive science
  • Philosophy of mind

The project aims to bridge the gap between our understanding of biological and artificial intelligence systems.