Clear AI thinking, grounded in business reality

Making sense of AI for business. Industry analysis, clear explanations, and insights that cut through the noise.

Nov 15 • 6 min read

Summa Intelligentiae: Token by Token


Quaestio Quinta


How Large Language Models Work

"Write a short email to my team explaining that our AI launch is delayed until 2026."

It's the kind of prompt to an LLM that Polymarket might soon run an over-under line on.

You press enter. One second later, a polished, empathetic, professional-sounding note appears.

What just happened? Every mechanical principle we’ve examined previously fired in sequence. A statistical engine translated your sentence into vectors, activated relevant patterns, and predicted the next most likely token until the answer "looked complete."

This is not magic. It is architecture (The Mechanics of Intelligence), data (Data, Patterns, and Feedback), and inference (The Closed-Book Test) working in concert.

Now we examine the process in motion: how a large language model transforms a single prompt into coherent prose.

Insights to Expect

  • How tokenization splits text into processable units, and why splits differ across models
  • How embeddings turn tokens into geometry where meaning becomes spatial proximity
  • Why attention mechanisms determine which parts of your prompt influence each prediction
  • How context windows define memory boundaries, not reasoning depth
  • Why autoregression builds responses one token at a time without knowing the final answer
  • What sampling parameters and architectural choices shape tone, risk, and reliability


Input — What Happens the Moment You Press Enter

The first transformation reflects The Closed-Book Test's core insight: interpretation determines outcome. Now we watch the real mechanics.

Tokenization: The Model's First Reading

The model doesn't see words. It segments your text into tokens, the smallest units it processes (Kudo & Richardson, 2018).

The prompt becomes: ["Write", "a", "short", "email", "to", "my", "team", "explaining", "that", "our", "AI", "launch", "is", "delayed", "until", "202", "6", "."]

Notice "2026" split into "202" and "6." This is where models differ. GPT-4 uses byte-pair encoding, which handles common words efficiently but may split rare terms awkwardly. Claude uses a different tokenizer optimized for coding syntax. Llama 3 prioritizes multilingual coverage.

If tokenization handles your prompt cleanly, the model starts with a faithful representation. If it splits meaning awkwardly—compound terms, specialty jargon, non-English text—misinterpretation carries through the entire forward pass. Unlike humans, the model cannot reconsider or reread. It simply proceeds.
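The splitting behavior can be sketched with a greedy longest-match tokenizer. This is a toy illustration, not any vendor's real BPE; the vocabulary below is invented, and real tokenizers learn their merge rules from data:

```python
# A toy greedy longest-match tokenizer. The vocabulary is hypothetical;
# real models learn theirs from a training corpus.
TOY_VOCAB = {"write", "a", "short", "email", "delayed", "until", "202", "20", "6"}

def toy_tokenize(text: str) -> list[str]:
    """Split on spaces, then greedily match the longest known piece."""
    tokens = []
    for word in text.lower().split():
        word = word.rstrip(".")
        while word:
            # Try the longest prefix that exists in the vocabulary.
            for end in range(len(word), 0, -1):
                if word[:end] in TOY_VOCAB:
                    tokens.append(word[:end])
                    word = word[end:]
                    break
            else:
                tokens.append(word)  # unknown piece: emit as-is
                word = ""
    return tokens

print(toy_tokenize("Write a short email delayed until 2026."))
# "2026" is absent from the vocabulary, so it falls apart into "202" and "6"
```

The same mechanism explains why rare jargon, compound terms, and non-English text fragment into awkward pieces: the vocabulary simply never learned them as units.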

Embedding: Turning Tokens into Geometry

Each token ID now maps to a high-dimensional vector the model learned during training.

This is where Data, Patterns, & Feedback's lessons reappear:

  • "email" clusters near "message," "write," "communication"
  • "delayed" lives near "postponed," "pushed back," "rescheduled"
  • "202" clusters near other numeric patterns, and the model must associate it with "6" to reconstruct "2026"

These relationships are statistical fingerprints from the training corpus, not comprehension. The vector for "team" learned its position by appearing in contexts like "colleagues," "staff," "group." Your prompt now exists as a sequence of position-aware vectors floating in mathematical space.
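Proximity in this space is typically measured with cosine similarity. The sketch below uses invented 3-dimensional vectors (real models use thousands of dimensions) purely to show the geometry:

```python
import math

# Hand-made "embeddings" for illustration only; the numbers are invented,
# not taken from any real model.
EMB = {
    "delayed":   [0.9, 0.1, 0.0],
    "postponed": [0.8, 0.2, 0.1],
    "email":     [0.0, 0.9, 0.4],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "delayed" sits far closer to "postponed" than to "email" in this toy space.
print(cosine(EMB["delayed"], EMB["postponed"]))
print(cosine(EMB["delayed"], EMB["email"]))
```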

Position encodings are added so the model distinguishes "Write a short email" from "Email a short write" (Vaswani et al., 2017). Order matters.
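The original sinusoidal scheme from Vaswani et al. (2017) can be sketched directly. The formula follows the paper; `d_model` stands for the embedding width:

```python
import math

def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal position encoding (Vaswani et al., 2017):
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Each position gets a distinct vector, so the same tokens in a different
# order produce different model inputs.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```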

At this point, your prompt has been transformed from words into geometry. The actual prediction process begins.

Processing — How the Model Activates Patterns Without Thinking

This is where transformer architecture and inference mechanics come alive.

Attention: Identifying What Matters

Self-attention determines which vectors influence each other.

For our email prompt:

  • "email" attends strongly to "team"
  • "AI launch" attends strongly to "delayed" and "2026"
  • "write" attends to "short," shaping tone and length

Multiple attention heads evaluate the prompt through different conceptual lenses: syntax, intent, chronology, phrasing typical of professional emails, sentiment patterns learned from similar training data (Vaswani et al., 2017).

This is not comprehension. It is weighted relevance, based entirely on historical patterns. The model learned that when "delayed" appears near "launch," certain linguistic structures tend to follow. It activates those pathways.
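The core operation, scaled dot-product attention, fits in a few lines. The query, key, and value vectors below are invented toy numbers; real models learn them during training:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query (Vaswani et al., 2017):
    weights = softmax(q . k / sqrt(d)); output = weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Toy 2-d vectors: the query aligns with the first key ("launch"),
# so that token receives the larger attention weight.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical keys for "launch", "team"
vals = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(q, keys, vals)
print(weights)  # first weight dominates
```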

Attention also enables the model to reference earlier tokens when predicting later ones. When generating "2026," the model attends back to "delayed" and "AI launch" to maintain coherence. This is why prediction seems to happen everywhere at once: each token prediction considers the full sequence through attention.

The Context Window: The Boundaries of Memory

Everything the model considers must exist inside the context window.

In our short prompt, this is trivial. In long conversations, however, older content disappears or must be summarized, affecting relevance and accuracy. Techniques like Retrieval-Augmented Generation (RAG), which we will discuss in detail in future articles, address this by pulling external information into the window at inference time.

A larger window increases available reference material, but not understanding. If your prompt mentioned a product specification three paragraphs ago, a small window might forget it by the time the model reaches "delayed." A larger window may retain it, allowing attention to connect distant references.

But size alone doesn't guarantee improvement. Recent research reveals that models weight information at the beginning and end of context windows more heavily than material in the middle (Liu et al., 2023). Larger windows help, but placement within that window matters.

Think of the window as working memory, not intelligence. More memory helps, but it doesn't create reasoning.
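The memory boundary can be sketched as naive truncation: once tokens fall outside the window, they are simply gone. The `fit_to_window` helper below is hypothetical; production systems typically summarize or retrieve rather than silently drop content:

```python
def fit_to_window(tokens: list[str], window: int) -> list[str]:
    """Naive truncation: keep only the most recent `window` tokens.
    Everything older is invisible to the model at inference time."""
    return tokens[-window:]

# With a 3-token window, the early product-spec context disappears.
conversation = ["spec", "says", "v2", "...", "launch", "is", "delayed"]
print(fit_to_window(conversation, 3))
```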

Layer-by-Layer Pattern Reconstruction

As vectors pass upward through dozens of transformer layers:

  • Early layers capture structure: "write email explaining..."
  • Mid-layers activate relationships: "AI launch delayed" → timeline shift, team communication
  • Late layers synthesize style: empathetic update, leadership tone, clear timeline

This is Data, Patterns, & Feedback's "representation learning" in real time. The model reconstructs a likely pattern it has seen across millions of similar emails. It is not retrieving a template. It is assembling one from learned fragments.

Output — How the Model Builds the Email, One Token at a Time

This is where the machinery becomes visible.

Autoregression: Constructing the Email

The model computes a probability distribution over its entire vocabulary for the first output token. Candidates might include: "Hi," "Team," "Hello," "Everyone," "I."

These probabilities derive from millions of team-update emails in training data. Suppose the model selects "Team," as the first token. It appends this to the prompt and repeats:

"Team," → "We" → "wanted" → "to" → "update"...

The model writes in a strictly sequential chain, recomputing attention and predicting the next token at each step. It never knows the full email in advance. It only knows the next most probable continuation based on everything that came before.

This explains how the ending works: the model doesn't plan conclusions. At each step, it can predict a special end-of-sequence token (often written EOS). When that token becomes the most probable continuation, generation simply stops.
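The loop can be sketched with a hypothetical bigram table standing in for the full network; greedy selection picks the most probable continuation at each step until an end-of-sequence token appears:

```python
# A toy autoregressive loop. The probability table is invented for
# illustration; a real model computes a distribution over ~100k tokens.
NEXT = {
    "<s>":    [("Team,", 0.6), ("Hi", 0.4)],
    "Team,":  [("we", 1.0)],
    "we":     [("wanted", 0.7), ("need", 0.3)],
    "wanted": [("to", 1.0)],
    "to":     [("update", 1.0)],
    "update": [("<eos>", 1.0)],
}

def greedy_generate(start: str = "<s>", max_tokens: int = 10) -> str:
    """Pick the most probable continuation at each step; stop at <eos>."""
    token, out = start, []
    for _ in range(max_tokens):
        candidates = NEXT.get(token, [("<eos>", 1.0)])
        token = max(candidates, key=lambda c: c[1])[0]
        if token == "<eos>":
            break
        out.append(token)
    return " ".join(out)

print(greedy_generate())  # "Team, we wanted to update"
```

Note that the loop never sees the finished email; each step only knows the prefix built so far.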

Sampling: The Controls That Shape Style and Risk

Temperature controls randomness. Low values favor predictable, conservative choices. High values permit creative or informal phrasing (Holtzman et al., 2020). In enterprise settings, these parameters are risk controls, not aesthetic preferences.
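Mechanically, temperature is just a divisor applied to the scores before the softmax. The logits below are invented for illustration:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by temperature before softmax. Low T sharpens the
    distribution (conservative picks); high T flattens it (more variety)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for "Team,", "Hi", "Hello"
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 2.0)
print(low)   # top choice dominates
print(high)  # probability mass spreads out
```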

Top-k limits the candidate pool to the k most likely tokens, increasing stability. Top-p (nucleus sampling) selects from the smallest set of tokens whose cumulative probability exceeds p (Huyen, 2023).
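Nucleus sampling reduces to a cumulative-probability cutoff. The probabilities below are invented; in practice the model supplies them at each step:

```python
def nucleus_candidates(probs: dict[str, float], p: float) -> list[str]:
    """Top-p: keep the smallest set of tokens whose cumulative probability
    meets or exceeds p; sampling then happens only inside that set."""
    cumulative, kept = 0.0, []
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"Team,": 0.55, "Hi": 0.25, "Hello": 0.12, "Yo": 0.08}  # hypothetical
print(nucleus_candidates(probs, 0.9))  # "Yo" falls outside the nucleus
```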

Streaming simply outputs each token as soon as it's selected. The model still writes one token at a time; streaming just reveals the process in real time.

Final Thoughts

  • Context windows define reference scope, not reasoning capability. More memory does not equal deeper analysis. It only increases available material during inference. And remember: models pay less attention to the middle.
  • Training data shapes everything. If the model saw polished team-update emails during training, it produces polished team-update emails. If it didn’t, you get improvisation. Know your vendor's training corpus.
  • Larger models do not magically gain comprehension. They generate more robust patterns, but they remain pattern machines, not reasoning engines. Size buys fluency, not understanding.
  • Sampling parameters are business levers. Temperature and top-k settings determine whether your customer-facing AI sounds conservative or creative, compliant or risky. These aren't IT decisions—they're policy decisions that require executive oversight.

In our next article, we’ll conclude our exploration of how AI works, examining where these systems reach their edges—where scaling stops helping, where attention breaks down, and what architectural frontiers researchers are pursuing. Understanding these limits clarifies not just how AI works, but where it may not be the best tool for the job.


Did you enjoy this article?

Share it with a friend and don't forget to subscribe.

Links and References

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. ICLR.
Huyen, C. (2023). Building LLM applications for production. huyenchip.com.
Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. EMNLP.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv:2307.03172.
Vaswani, A., et al. (2017). Attention is all you need. NeurIPS.

Questions, suggestions, or future topics you'd like to see covered? Let me know.




