Architecture of Claude — R² Consulting

01

Input Layer

Text · Vision · Documents · History

Claude accepts multiple input modalities simultaneously: raw text, images (processed by a vision encoder), PDFs and documents, prior conversation turns, system instructions from operators, and tool results injected back from previous agent steps. All inputs are assembled into a single flat sequence before tokenisation.

TextVisionPDFsTool resultsSystem prompt

Why it exists

Real-world tasks rarely arrive as plain text. Multimodal inputs let Claude assist with images, analyse documents, and operate inside agent pipelines without lossy format conversion.

02

Tokeniser & Embedding Layer

Input Encoding

A Byte-Pair Encoding (BPE) tokeniser splits text into sub-word units and maps each to an integer token ID from a vocabulary of 100K+ tokens. A learned embedding matrix converts each ID into a high-dimensional dense vector (~8,000 dimensions). Positional encodings are added to preserve token order. Images pass through a separate vision encoder and are projected into the same embedding space, enabling unified multimodal reasoning.

BPE100K vocabDense vectorsPositional encoding

Why it exists

Neural networks operate on continuous numbers. Tokenisation and embeddings are the essential bridge between human language and mathematical computation — without them, the transformer has nothing to process.

03

Transformer Decoder Stack

Core Reasoning Engine

The heart of Claude — dozens to hundreds of stacked decoder layers. Each layer runs multi-head self-attention (every token attends to every earlier token simultaneously), followed by a position-wise feed-forward network. Residual connections and layer normalisation make training stable across depth. The stack encodes world knowledge from pre-training on hundreds of billions of tokens and performs the actual language understanding and generation.

Self-attentionFFNLayer normResiduals100B+ params

Why it exists

Self-attention allows Claude to model long-range dependencies — linking a pronoun at token 4,000 to its referent at token 12. No prior architecture achieved this at scale.

04

RLHF & Reward Model

Human Preference Alignment

Human annotators rank pairs of Claude responses across helpfulness, accuracy, and safety. These comparisons train a reward model that learns to predict human preference. Proximal Policy Optimisation (PPO) then fine-tunes the entire language model to maximise this reward — iteratively shifting Claude's behaviour toward responses humans genuinely prefer, across tens of thousands of diverse tasks.

Human feedbackReward modelPPOPreference ranking

Why it exists

Pre-training produces a model that predicts text, not one that follows instructions. RLHF is what makes Claude genuinely helpful — oriented toward user intent rather than statistical next-token prediction.

05

Constitutional AI

Principle-Based Safety

Anthropic's Constitutional AI method gives Claude an explicit written constitution — a set of principles about harmlessness, honesty, and helpfulness. During training, Claude is prompted to critique its own outputs against these principles and revise them. A second RLAIF stage uses an AI critic trained on these principles as the reward signal, scaling safety alignment without requiring human labelling of every harmful category.

Self-critiquePrinciplesRLAIFHarm avoidance

Why it exists

Human labelling is costly and inconsistent for nuanced harm. CAI provides a scalable, auditable mechanism to bake values into the model — making safety a first-class architectural concern.

06

Context Window & Memory

Working Memory

Claude's context window is its working memory — a fixed-length buffer (up to 200,000 tokens in Claude 3.5) holding the full conversation, system prompt, injected documents, and tool results. Unlike humans, Claude has no persistent memory across sessions unless explicitly provided. Every token is attended to simultaneously — longer contexts do not degrade recall. RAG (Retrieval-Augmented Generation) patterns can extend effective memory by dynamically injecting relevant chunks.

200K tokensIn-contextNo degradationRAG-ready

Why it exists

Without a context window Claude would be stateless — unable to hold conversations, follow documents, or track instructions. Large high-fidelity context enables complex reasoning and long-form tasks.

07

Tool Use & Agentic Layer

Action & Planning

Claude can emit structured JSON tool-call outputs that trigger external actions — web searches, code execution, file reads, database queries, or any API endpoint. Results are fed back into the context window and Claude reasons over them, enabling multi-step agentic loops. The Model Context Protocol (MCP) standardises tool registration and invocation. Claude plans, acts, observes results, and revises until the task is complete — turning it from a conversational model into an autonomous agent.

Function callingWeb searchCode executionMCPAgent loops

Why it exists

Text generation is limited to context-window knowledge. Tool use breaks this ceiling — giving Claude access to live data, the ability to run code, and the capacity to take real-world actions at runtime.

08

Output Decoding & Sampling

Response Generation

The transformer stack produces a probability distribution over the vocabulary for the next token. Sampling strategies — temperature (controls randomness), top-p nucleus sampling, and top-k — determine how a token is selected. Temperature 0 gives deterministic, factual outputs; higher values produce creative, varied text. Tokens are generated auto-regressively: each new token is appended to the context and the full forward pass repeats until an end-of-sequence token is emitted or max-length is reached. Streaming delivers tokens to users as they are generated rather than waiting for the complete response.

TemperatureTop-p samplingAutoregressiveStreamingEOS token

Why it exists

Without a controlled decoding strategy responses would be incoherently random or boringly repetitive. Temperature and sampling parameters give operators fine-grained control over the creativity-accuracy trade-off across every use-case.

How Claude works

Every component explained