Clear AI thinking, grounded in business reality

Making sense of AI for business. Industry analysis, clear explanations, and insights that cut through the noise.

Nov 08 • 5 min read

Summa Intelligentiae: The Closed-Book Test


Quaestio Quarta


The Closed-Book Test

How AI Models Make Predictions

The room goes quiet. Books closed, notes away, pencils ready. Every student knows this moment: when preparation ends and performance begins. The test booklet opens, and everything depends on what you've managed to absorb.

For AI systems, every inference is this same moment of truth. All the training, feedback loops, and parameter adjustments fade into the background. The model faces each new input with only what it learned, no references allowed.

In The Mechanics of Intelligence, we explored how models learn through feedback. In Data, Patterns, and Feedback, we examined what they consume during training. Now comes performance: the closed-book exam where learned patterns meet new questions.

When the test begins, how does a model read the question, recall what it knows, and decide what to answer? And how does it judge its own confidence without an answer key?

Insights to Expect

  • How models transform inputs into mathematical representations they can process
  • Why inference generates multiple possible answers, then selects one through probability
  • How confidence and accuracy diverge, creating the calibration paradox
  • Why thresholds determine when statistical guesses become operational decisions
  • How drift erodes reliability when real-world patterns shift from training data


Reading the Question

Every exam begins with interpretation. Misread the question, and even perfect knowledge leads nowhere.

Models don't read text or see images directly. They first break inputs into tokens: words, sub-words, or pixel patches.

Tokenization is like diagramming a sentence: it segments the input and identifies the role each word and element plays in conveying the intended idea. Tokens are then converted into numerical vectors through embedding. This mathematical mapping places similar concepts near each other. "Cat" and "dog" cluster together, while "cat" and "airplane" sit far apart (Goodfellow, Bengio & Courville, 2016).
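That clustering can be sketched with toy vectors. The three-dimensional values below are invented for illustration, not taken from any real embedding model; cosine similarity is a standard way to measure how closely two vectors point in the same direction.

```python
import math

# Toy 3-dimensional embeddings (illustrative values, not from a real model)
embeddings = {
    "cat":      [0.90, 0.80, 0.10],
    "dog":      [0.85, 0.75, 0.20],
    "airplane": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))       # high: near 1
print(cosine_similarity(embeddings["cat"], embeddings["airplane"]))  # low
```

Real embeddings have hundreds or thousands of dimensions, but the geometry works the same way: proximity encodes relatedness.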

This translation is where comprehension begins. It's also where many failures start. If tokenization splits "New York" into "New" and "York," or if embeddings misplace relationships, the model effectively misreads the question.

Unlike a student who can pause to reconsider, the model proceeds automatically through its forward pass. One misencoded input cascades through the entire reasoning chain, yielding confident but wrong answers (Szegedy et al., 2014).

The stakes are higher than a missed exam question. In production systems, these interpretation errors manifest as garbled translations, biased associations, or adversarial vulnerabilities. Tiny input changes trigger massive mistakes.

Summoning What You Know

With the question parsed, memory stirs. The model begins its search through learned patterns, seeking the best match between input and stored knowledge.

This process, the forward pass, activates pathways shaped during training. If the input closely matches something seen before, the answer emerges almost instantly along a well-worn path. When no exact match exists, the model reconstructs from fragments, blending partial patterns into coherent responses. This is generalization: handling new inputs by combining known structures (LeCun et al., 2015).

Generalization enables both brilliance and error. It allows models to reason beyond their training examples, but it can also misfire. The model applies the right pattern to the wrong problem (Zhang et al., 2017). It's the student who perfectly recalls a formula, but for the wrong question type.

Inference is structured recall under pressure. The model isn't retrieving stored facts; it's assembling answers from learned relationships, transforming stored patterns into active predictions.

Composing the Answer

With patterns activated and possibilities emerging, the model must commit. Like weighing multiple answers on a test, it evaluates which response best fits the input.

Models don't produce single answers. They generate probability distributions over all possible outputs. Each potential response receives a score representing how well it matches the input patterns. Selecting the final answer means picking the highest-scoring option. It's the statistical equivalent of filling in the most likely bubble.
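The scoring-and-selection step can be sketched with the standard softmax function, which converts raw scores into a probability distribution. The candidate answers and scores below are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for three candidate responses
logits = {"approve": 2.0, "review": 1.0, "decline": 0.1}
probs = dict(zip(logits, softmax(list(logits.values()))))

# "Filling in the most likely bubble": pick the highest-probability option
answer = max(probs, key=probs.get)
print(answer)   # approve
```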

But that selection isn't binary. The threshold, the cutoff where probability becomes action, determines how decisive the model will be. A fraud detection model might block transactions only above 95% confidence. Or it might cast a wider net at 80%, trading more false positives for fewer missed cases. Business systems constantly tune these thresholds, balancing caution against coverage.
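A minimal sketch of that tuning, with invented numbers: the same model score produces different operational decisions depending on the threshold a business chooses.

```python
def decide(fraud_probability, threshold=0.95):
    """Block a transaction only when model confidence clears the threshold."""
    return "block" if fraud_probability >= threshold else "allow"

score = 0.90  # the model's confidence that this transaction is fraudulent
print(decide(score, threshold=0.95))  # cautious: fewer false positives
print(decide(score, threshold=0.80))  # wider net: more false positives
```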

Some systems add controlled randomness through temperature settings. Low values yield consistent, safe answers; high values permit creative risks (Holtzman et al., 2020). Image generation tools such as Midjourney expose similar variability controls, letting users dial randomness up or down in the designs an input produces.
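Temperature works by rescaling scores before they become probabilities. In the sketch below, with hypothetical scores, a low temperature sharpens the distribution toward the top option while a high temperature flattens it, giving riskier options real weight.

```python
import math

def temperature_softmax(logits, temperature):
    """Softmax with scores divided by temperature first:
    low T sharpens the distribution, high T flattens it."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for three options
print(temperature_softmax(logits, 0.5))  # top option dominates: safe, consistent
print(temperature_softmax(logits, 2.0))  # flatter: creative risks become possible
```

Sampling then draws an option in proportion to these probabilities, which is where the controlled randomness actually enters.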

These mechanisms transform statistical fit into decisions. Distribution, threshold, and sampling together turn what the model knows into what it does.

When Confidence Meets Reality

After answering, every student wonders: Did I get it right? Models face the same uncertainty, but with a dangerous twist.

Each prediction carries internal confidence: the highest probability from its output distribution. This number expresses statistical fit, not truth. Modern neural networks consistently overstate their reliability. Guo et al. (2017) demonstrated that state-of-the-art models grow more miscalibrated as they improve. They produce fluent but wrong outputs with maximum confidence. This is the calibration paradox: mathematical precision, persuasive delivery, potential disaster.

Calibration techniques like temperature scaling adjust probabilities to align confidence with observed accuracy. Bayesian approaches estimate uncertainty through multiple internal passes (Gal & Ghahramani, 2016). Both serve one goal: ensuring models know when they don't know. Yet calibration is fixed before deployment. By test time, it's baked in.
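The gap between confidence and accuracy is easy to illustrate. The prediction log below is invented; a positive gap means the model claims more than it delivers, which is the overconfidence Guo et al. documented.

```python
# Toy calibration check: does average confidence match observed accuracy?
# Each entry is (model confidence, whether the prediction was actually correct).
predictions = [(0.99, True), (0.97, False), (0.95, True), (0.92, False), (0.90, True)]

avg_confidence = sum(c for c, _ in predictions) / len(predictions)
accuracy = sum(ok for _, ok in predictions) / len(predictions)

gap = avg_confidence - accuracy   # positive gap = overconfident model
print(f"confidence {avg_confidence:.2f} vs accuracy {accuracy:.2f}, gap {gap:.2f}")
```

A model averaging 95% confidence while being right 60% of the time is exactly the kind of persuasive-but-wrong behavior the calibration paradox describes.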

Even well-calibrated models face another challenge: drift. The world changes, but trained models don't. Data drift occurs when inputs shift: new terminology, evolved user behavior, updated regulations. Concept drift happens when relationships between inputs and outcomes change. Yesterday's risk signals no longer apply.

Both create the same problem. Models answer confidently based on outdated material. That's why governance frameworks like NIST's AI Risk Management Framework (2023) mandate continuous monitoring and retraining cycles. Models need regular re-examination, like professionals whose certifications expire.
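A monitoring cycle can start very simply. The sketch below flags data drift when a feature's live mean wanders too far from its training-time baseline; the feature, numbers, and tolerance are all illustrative assumptions, and production systems would use richer statistical tests.

```python
import statistics

def drift_alert(baseline, live, tolerance=0.25):
    """Flag drift when the live mean moves more than `tolerance`
    baseline standard deviations away from the training-time mean."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(live) - base_mean) / base_std
    return shift > tolerance

training_amounts = [20, 25, 22, 30, 28, 24, 26]   # transaction sizes at training time
todays_amounts   = [45, 50, 48, 52, 47, 49, 51]   # today's traffic looks different
print(drift_alert(training_amounts, todays_amounts))  # True: retraining warranted
```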

Without this maintenance, inference becomes a test taken with last decade’s textbook. The result is confident, fluent, and wrong.

Applied Relevance

For business leaders, inference is where investment meets accountability. Each prediction represents a decision under uncertainty, shaped by thresholds you control (depending on the model) and confidence you must monitor.

Understanding inference mechanics clarifies three operational imperatives. First, interpretation errors compound. Garbage in means garbage out, regardless of model sophistication. Second, confidence scores are risk metrics, not truth meters. High confidence with poor calibration creates liability. Third, drift is inevitable. Today's accurate model is tomorrow's liability without continuous governance.

Treat inference performance as you would any critical KPI. Monitor precision and recall, track calibration curves, and schedule regular retraining. The model's exam never ends.
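Precision and recall can be computed directly from logged predictions and observed outcomes. The labels below are invented for illustration.

```python
def precision_recall(predicted, actual):
    """Precision: of everything flagged, how much was real?
    Recall: of everything real, how much was caught?"""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

flagged = [True, True, False, True, False, False]   # model said "fraud"
fraud   = [True, False, False, True, True, False]   # what actually happened
p, r = precision_recall(flagged, fraud)
print(f"precision {p:.2f}, recall {r:.2f}")
```

Tracked over time, a falling curve on either metric is an early warning that the threshold needs retuning or the model needs retraining.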

Final Thoughts

  • Interpretation determines everything. Misread questions guarantee wrong answers.
  • Every output is a weighted bet dressed as certainty.
  • Confidence measures fit to patterns, not factual accuracy.
  • Thresholds define risk tolerance between precision and recall.
  • Reliability requires vigilance. The test material keeps changing.

In our next article, we'll examine what happens when the exam rules change entirely. The books open, the internet connects, and the model can reference vast libraries during the test itself. Through the lens of Large Language Models, we'll explore how architectures built for open-book inference transform probability into prose, using context windows as extended memory and attention mechanisms as dynamic focus.


Did you enjoy this article?

Share it with a friend and don't forget to subscribe.

Questions, suggestions, or future topics you'd like to see covered? Let me know.




