
Do LLMs have dopamine?

Exploring whether large language models have causal internal representations of reward.

Most of this post is semi-placeholder text generated with AI assistance.

Overview

This project explores whether large language models (LLMs) have causal internal representations of reward. LLMs have no neurotransmitters, but their behavior is shaped during training by reward signals (most directly via RLHF), and those signals may give rise to internal structures that serve an analogous function.

Key Questions

  • Do LLMs develop causal internal representations that track reward signals?
  • Can we identify neural circuits within LLMs that encode reward prediction and value?
  • How do reinforcement learning techniques (like RLHF) shape these internal representations?
  • What are the implications for understanding agency and goal-directed behavior in LLMs?

Research Approach

Mechanistic Interpretability

Investigating the model's internal mechanisms through:

  • Probing hidden states for reward-related representations (a minimal probing sketch follows this list)
  • Analyzing attention patterns during value-laden decisions
  • Identifying causal pathways that mediate reward-based choices
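
As a concrete illustration of the probing step, here is a minimal sketch of a linear probe over hidden states. It assumes GPT-2 (via Hugging Face transformers) as a stand-in model, and the prompts and their binary "reward-relevance" labels are invented for illustration; a real experiment would need a much larger, carefully controlled dataset.

```python
# Minimal sketch: linear probe for reward-related information in hidden states.
# Assumptions: GPT-2 as a stand-in model; toy prompts and labels made up for
# illustration only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy dataset: prompts paired with a hypothetical "reward-relevant" label.
prompts = [
    ("You receive a large bonus for finishing early.", 1),
    ("The weather today is mild and overcast.", 0),
    ("Winning the lottery changed her plans entirely.", 1),
    ("The report lists the quarterly figures.", 0),
]

features, labels = [], []
with torch.no_grad():
    for text, label in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # Use the final layer's hidden state at the last token as the feature.
        features.append(outputs.hidden_states[-1][0, -1].numpy())
        labels.append(label)

# Fit a linear probe: if it separates the classes on held-out data, the hidden
# states carry (at least linearly decodable) reward-related information.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("Probe training accuracy:", probe.score(features, labels))
```

High probe accuracy only shows that the information is linearly decodable, which is correlational evidence; the interventions described under Behavioral Analysis are needed to test whether it is causally used.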

Behavioral Analysis

Studying LLM behavior and its internal correlates to identify:

  • Whether internal representations causally influence outputs (see the activation-patching sketch after this list)
  • How reward signals propagate through the network
  • Emergent goal-seeking behaviors and their neural correlates
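
To make the causal test concrete, the sketch below uses activation patching: an activation captured from a "high reward" prompt is written into the corresponding position during a run on a contrasting prompt, and the resulting shift in next-token predictions is measured. GPT-2, the patched layer (block 6), and the prompt pair are hypothetical choices made purely for illustration.

```python
# Minimal sketch: activation patching to test causal influence on outputs.
# Assumptions: GPT-2 via Hugging Face transformers; the patched layer and the
# prompt pair are arbitrary illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block to patch (an arbitrary choice here)

def last_token_activation(prompt):
    """Return the chosen block's output activation at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of block index LAYER.
    return out.hidden_states[LAYER + 1][0, -1].clone()

source_act = last_token_activation("The reward for this action is very high.")

def patch_hook(module, inputs, output):
    # Overwrite the last-token activation with the source activation.
    hidden = output[0]
    hidden[0, -1] = source_act
    return (hidden,) + output[1:]

base_prompt = "The reward for this action is very low."
inputs = tokenizer(base_prompt, return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**inputs).logits[0, -1]
handle.remove()

# A large shift in next-token predictions is evidence that the patched
# representation causally influences the output rather than merely
# correlating with it.
shift = (patched_logits - clean_logits).abs().max().item()
print("Max next-token logit shift after patching:", shift)
```

In a fuller experiment one would patch across layers and positions and use a behaviorally meaningful metric rather than a raw logit shift.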

Theoretical Framework

Developing a framework for distinguishing genuine causal models of reward from merely correlational patterns in LLMs.

Preliminary Findings

So far, the investigation has focused on three aspects:

  1. Internal Representations: Evidence that LLMs may develop structured representations of reward and value during training
  2. Causal Influence: Testing whether these representations causally influence model outputs, not just correlate with them
  3. Emergent Circuits: Identifying potential "reward circuits" within transformer architectures that process value-related information (see the head-ablation sketch below)
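
One simple way to start looking for such circuits is to ablate attention heads one at a time and measure how much a value-laden prediction degrades, as in the sketch below. GPT-2, the prompt, and the target token (" happy") are illustrative assumptions, not claims about where reward-related circuitry actually lives.

```python
# Minimal sketch: brute-force attention-head ablation to look for candidate
# "reward circuits". Assumptions: GPT-2; the prompt and target token are
# illustrative stand-ins for a value-laden prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "After the big win, she felt very"
inputs = tokenizer(prompt, return_tensors="pt")
target_id = tokenizer(" happy", add_special_tokens=False).input_ids[0]

def target_logit(head_mask=None):
    """Next-token logit for the target word, optionally with heads masked."""
    with torch.no_grad():
        out = model(**inputs, head_mask=head_mask)
    return out.logits[0, -1, target_id].item()

baseline = target_logit()
n_layers, n_heads = model.config.n_layer, model.config.n_head

# Zero out one head at a time and record how much the target logit drops.
effects = []
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0  # silence this head's contribution
        effects.append((baseline - target_logit(mask), layer, head))

effects.sort(reverse=True)
print("Heads whose ablation most reduces the target logit:")
for drop, layer, head in effects[:5]:
    print(f"  layer {layer}, head {head}: drop {drop:.3f}")
```

Single-head ablations are a crude first pass; heads that only matter in combination will be missed, so grouped ablations or path-level interventions would be a natural next step.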

Implications

Understanding causal reward representations in LLMs could:

  • Reveal whether LLMs have genuine goal-directed behavior or merely simulate it
  • Inform more controllable and aligned AI systems
  • Provide new tools for mechanistic interpretability research
  • Shed light on the nature of agency and intentionality in artificial systems

Current Status

This is an ongoing theoretical investigation combining insights from:

  • Computational neuroscience
  • Machine learning theory
  • Cognitive science
  • Philosophy of mind

The project aims to bridge the gap between our understanding of biological and artificial intelligence systems.