
Current LLM technology is structurally incompatible with critical environments

Ever since OpenAI released the first GPT models, the architecture has had a fundamental limitation, one tied to what later became known as “hallucinations”. The limitation stems from the fact that the GPT model is built upon a frequentist statistical framework: when a next-token prediction is made, the probability assigned to the next token is based purely on statistical co-occurrence patterns in the training data, not on any grounded understanding of truth or the world.

A short recap on frequentist and Bayesian statistics

Frequentists vs Bayesians - xkcd comic showing a frequentist and Bayesian disagreeing over whether the sun has exploded
xkcd #1132 by Randall Munroe - CC BY-NC 2.5

A frequentist approach to next-token prediction (which is what GPT does really well) is to collect a sample from a population and count the occurrences of different words. In our setup the population is all human knowledge ever, and we sample from it by downloading the entire internet in all its glory. Next we estimate the probability of each word from its relative frequency. From this we can build a very stupid greedy next-token predictor. It would spit out ‘the’ every single time, because ‘the’ is the most common word, accounting for roughly 6% of all words in English according to the Oxford English Corpus (Oxford English Corpus, Facts about the language (2011), web.archive.org). Of course this would be a complete waste of time.

Instead, a better approach is to account for the co-occurrence of words. Rather than counting individual tokens, we look at entire sequences and ask: given what has already been said, what comes next? As a concrete example, we look at our sample of the internet, find all the places where the three words “The dog likes” appear, and count the distribution of the words that follow. One could imagine, for example, that out of 100 occurrences, 40 were “The dog likes chasing”. This would give a predictor that could look like the example below:

Step 1 - predict the next token

Context: “The dog likes”
chasing - 40%
playing - 15%
eating - 5%
Greedy pick → chasing

Step 2 - predict the token after "chasing"

Context: “The dog likes chasing”
sticks - 50%
tails - 20%
birds - 10%

Figure: Greedy next-token prediction across two steps. The highest-probability token is always selected, building the sequence one word at a time.
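The two-step example above can be sketched in a few lines of Python. This is a toy illustration, not how GPT is implemented: the corpus is invented, and the function names (`build_counts`, `next_token_distribution`, `greedy_next`) are hypothetical helpers introduced here.

```python
from collections import Counter, defaultdict

def build_counts(corpus, context_size=3):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - context_size):
            context = tuple(tokens[i:i + context_size])
            counts[context][tokens[i + context_size]] += 1
    return counts

def next_token_distribution(counts, context):
    """Relative frequency of each follower: count(context, token) / count(context)."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}

def greedy_next(counts, context):
    """Pick the single most frequent follower (greedy decoding)."""
    dist = next_token_distribution(counts, context)
    return max(dist, key=dist.get)

# Toy corpus, invented for illustration only.
toy_corpus = (
    ["The dog likes chasing sticks"] * 5
    + ["The dog likes chasing tails"] * 2
    + ["The dog likes chasing birds"] * 1
    + ["The dog likes playing fetch"] * 3
    + ["The dog likes eating treats"] * 1
)

counts = build_counts(toy_corpus)
first = greedy_next(counts, ["The", "dog", "likes"])
second = greedy_next(counts, ["dog", "likes", first])
print(first, second)
```

Nothing in this predictor refers to dogs, sticks, or the world. It only divides one count by another, which is the point of the simplification.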

At a high level, and omitting many details, this is a simplification of what happens during the pretraining of a GPT model. Many other techniques are applied afterwards, such as finetuning the model to be helpful (often done with reinforcement learning) and applying guardrails so that it is less likely to predict tokens related to potentially harmful behaviors.

The n-gram framing is a deliberate simplification. It is still useful because the simplification operates at the right level: the training objective and the epistemic status of the outputs. GPT is trained to minimise the gap between its predicted distribution over next tokens and the actual distribution of tokens in the training corpus - a purely statistical objective. The architectural sophistication of a transformer allows it to encode far richer patterns than a lookup table, but those patterns are still derived from text co-occurrence, not from an independent ground truth about the world. Self-attention learns which contextual patterns are predictive of which tokens; it does not learn which claims are true.

This is how modern GPT models work. Another approach, however, is the Bayesian one.

A Bayesian statistician (a human with prior beliefs and understanding of the world) would look at the sample of all human knowledge ever and would not simply count how often words follow one another, but would instead bring prior beliefs about the structure of the world into the inference. Rather than asking “how frequently does X follow Y in the data?”, a Bayesian asks “given everything I already know about how the world works, how probable is X in this context?” The prior belief is updated as new evidence arrives, but the model starts with something: a structured understanding that constrains what answers are even plausible.

Formally, this update is described by Bayes’ theorem:

$$P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}$$

Where $P(\theta \mid \text{data})$ is the posterior - our updated belief about hypothesis $\theta$ after seeing the evidence. $P(\text{data} \mid \theta)$ is the likelihood - how well the data fits if $\theta$ is true. $P(\theta)$ is the prior - what we believed before seeing any data. And $P(\text{data})$ is the evidence - a normalising constant ensuring the probabilities sum to one. The whole point is that you start with a prior, you observe data, and the formula tells you exactly how much to revise your belief. A very simple and elegant formula. Deceptively simple, perhaps.
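To make the update concrete, here is a minimal sketch applied to the xkcd comic above, in which a detector reports that the sun has exploded but lies whenever two dice both come up six (probability 1/36). The prior of one in a million is an assumption chosen purely for illustration, and `bayes_update` is a hypothetical helper name.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(theta | data) for a discrete set of hypotheses.

    prior:      {hypothesis: P(theta)}
    likelihood: {hypothesis: P(data | theta)}
    """
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalised.values())  # P(data), the normalising constant
    return {h: value / evidence for h, value in unnormalised.items()}

# Assumed prior: the sun exploding tonight is extremely unlikely.
prior = {"exploded": 1e-6, "fine": 1 - 1e-6}

# The detector answered "yes". It tells the truth 35 times out of 36.
likelihood = {"exploded": 35 / 36, "fine": 1 / 36}

posterior = bayes_update(prior, likelihood)
print(posterior["exploded"])
```

The detector's answer raises the probability of an explosion above the prior, but under this assumed prior it stays far below one in ten thousand. The evidence moves the belief; it does not replace it.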

Applied to language, this means a Bayesian system would know that “The dog likes chasing sticks” is probable not merely because those five words co-occur frequently in text, but because it has a model of dogs, of chasing, of the physical world in which sticks can be thrown and retrieved. A frequentist system has no such model. It has only the residue of human writing, and will confidently predict whatever that residue suggests, even when the context makes the prediction absurd.

The deeper consequence of this difference is what we might call calibrated confidence. In a Bayesian framework, the posterior probability, the updated probability of a claim given the evidence, tells you not just what the most likely answer is, but how sure the model is. If the posterior over the next token is sharply peaked, say 99% on “yes”, that number means something: the model has strong grounds for its prediction, and you can act on it accordingly. If the posterior is flat, spread across many candidates, the model is signalling genuine uncertainty, and a well-designed system can surface that uncertainty rather than hide it.

A frequentist next-token predictor has no such posterior. It produces a probability distribution over tokens, but that distribution is a summary of training-data statistics, not a statement about belief or confidence in any meaningful epistemic sense. When a GPT model outputs “yes” with high softmax probability, it is not saying “I am 99% sure this is correct”; it is saying “in contexts resembling this one, the token ‘yes’ appeared frequently.” These are very different claims. The first is actionable in a critical system. The second is not.
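For readers who have not seen it, this is mechanically what a softmax probability is: raw scores (logits) pushed through an exponential and normalised. The sketch below uses made-up logits and also computes the entropy of the result, which is one common way to quantify “sharply peaked” versus “flat”. It shows how the number is produced, not whether it is calibrated.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: near 0 when peaked, log2(n) when uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for three candidate tokens.
peaked = softmax([8.0, 1.0, 0.5])
flat = softmax([1.0, 1.0, 1.0])

print(peaked, entropy_bits(peaked))
print(flat, entropy_bits(flat))
```

Both distributions are valid outputs of the same function. Nothing in the arithmetic says whether the peaked one is peaked because the claim is true or because the pattern was common in the training text.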

But why not build a Bayesian GPT?

The practical problem is scale. Bayesian inference requires computing a posterior over all possible hypotheses given the data. For a small model with a handful of parameters this is tractable. For a language model with hundreds of billions of parameters operating over a vocabulary of tens of thousands of tokens, exact Bayesian inference is computationally intractable: the integral over the parameter space cannot be computed exactly, and approximating it well is itself prohibitively expensive at the scale required to match the fluency of modern LLMs. Techniques like variational inference and Monte Carlo sampling exist to approximate it, and researchers have built Bayesian neural networks using these methods, but they have not scaled to the size needed to produce a competitive general-purpose language model.

Frequentist

$$P(\text{token} \mid \text{context}) = \dfrac{\text{count}(\text{token, context})}{N}$$

chasing - 40%
playing - 35%
eating - 25%

Easy to compute

Bayesian

$$P(\theta \mid \text{data}) = \dfrac{P(\text{data} \mid \theta)\cdot P(\theta)}{\displaystyle\int P(\text{data}\mid\theta)\,P(\theta)\,d\theta}$$

[Plot: curve over θ (one parameter); ∫ … dθ is the area under the curve.]

Shown for 1 parameter. GPT-3 has 175,000,000,000.

Intractable, i.e. very hard to compute · never scaled to modern LLM size

Figure: Frequentist prediction requires a single division. Bayesian inference requires integrating over every possible configuration of model parameters, e.g. a 175-billion-dimensional integral for GPT-3.
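A rough way to feel the difference in cost is to do the integral numerically for one parameter and then ask what the same brute-force approach would need in higher dimensions. The sketch below is an illustration under simplifying assumptions: a coin-bias model with a flat prior, a naive grid with a Riemann sum, and hypothetical function names. Real approximate-inference methods avoid grids entirely, so this shows why exact integration is out of reach rather than how practitioners work.

```python
import math

def grid_posterior(heads, tails, points=1001):
    """Posterior over a coin's bias theta on a uniform grid, with a flat prior."""
    grid = [i / (points - 1) for i in range(points)]
    unnorm = [t ** heads * (1 - t) ** tails for t in grid]
    evidence = sum(unnorm)  # Riemann-sum stand-in for the integral over theta
    return grid, [u / evidence for u in unnorm]

def log10_grid_evaluations(points_per_dim, n_params):
    """log10 of the number of grid cells needed to cover the parameter space."""
    return n_params * math.log10(points_per_dim)

# One parameter: about a thousand evaluations is plenty.
grid, post = grid_posterior(heads=7, tails=3)
posterior_mean = sum(t * p for t, p in zip(grid, post))
print(round(posterior_mean, 3))

# Many parameters: even a coarse 10-point grid per dimension explodes.
for n_params in (1, 2, 10, 175_000_000_000):
    exponent = log10_grid_evaluations(10, n_params)
    print(f"{n_params} parameter(s): about 10^{exponent:.0f} evaluations")
```

The number of evaluations grows exponentially with the number of parameters, which is why the single-parameter picture does not carry over.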

There is also the problem of choosing a prior. A Bayesian model requires you to specify what you believe before seeing any data. For a narrower domain, e.g. medical diagnosis, domain experts can encode sensible priors. For “all human language about all topics”, there is no principled way to define a prior. You would be encoding someone’s beliefs about the entire structure of the world before training begins, and those beliefs would be arbitrary. One could of course also impose a uniform prior, assuming no knowledge of the domain.
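How much the choice of prior matters can be seen in a small worked example. The sketch below uses the standard Beta-Binomial model, where a Beta(α, β) prior updated with observed successes and failures has posterior mean (α + successes) / (α + β + successes + failures). The data and both priors are invented for illustration.

```python
def beta_posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of a Beta(alpha, beta) prior after binomial observations."""
    return (alpha + successes) / (alpha + beta + successes + failures)

# Hypothetical data: 3 positive findings in 10 cases.
successes, failures = 3, 7

# Uniform prior, Beta(1, 1): no knowledge of the domain assumed.
uniform = beta_posterior_mean(1, 1, successes, failures)

# Assumed expert prior, Beta(2, 18): a belief that the rate is near 10%.
expert = beta_posterior_mean(2, 18, successes, failures)

print(uniform, expert)
```

With the same ten observations, the two priors lead to noticeably different conclusions (one third versus one sixth). In a narrow domain an expert can defend the second prior; for all of human language it is unclear who could defend any.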

The honest summary is that a true Bayesian LLM remains an open research problem, and the gap between what is theoretically desirable and what is computationally feasible is the hurdle.

The implication for modern-day GPT models

So the whole reason we have “hallucinations” is that the model has no concept of truth. It has only the statistical residue of human text, and it predicts the next token that is most consistent with that residue, regardless of whether the resulting sentence is factually grounded. When a GPT model states a wrong date, invents a citation, or confidently describes a medication that does not exist, it is not malfunctioning. It is doing exactly what it was built to do: predict the most statistically plausible continuation of a sequence. The problem is that plausible and true are not the same thing, and the architecture has no mechanism to distinguish between them.

This is not a bug that can be patched with more data or a larger model. Scaling a frequentist architecture increases its fluency, its breadth of coverage, its ability to mimic the surface structure of expert writing. It does not give the model a ground truth against which to check its outputs. A model trained on ten times the data will hallucinate more convincingly, not less frequently.

The practical consequence for critical environments (medicine, law, aviation, infrastructure) is serious. These are domains where a confident wrong answer is often worse than no answer at all. A clinician reading a GPT-generated drug interaction summary has no reliable way to know whether the output is well-supported by the training data or a plausible-sounding fabrication. The model itself does not know. Its softmax output will look equally confident in both cases.

This does not mean GPT models are useless. They are extraordinarily useful for tasks where the cost of a wrong answer is low, where outputs are reviewed by a human before any action is taken, or where the task is generative rather than factual - drafting, summarising, brainstorming, translating. The structural incompatibility arises specifically when the model is deployed as an autonomous decision-maker in a system where mistakes carry real consequences, and where the architecture’s inability to signal its own uncertainty is treated as reliability.

A great example of this was shown by Meta Platforms, Inc. when they deployed their new AI support bot (404 Media, Hackers Simply Asked Meta AI to Give Them Access to High-Profile Instagram Accounts. It Worked (2025), 404media.co). This GPT model had access to reset account passwords, and ‘hackers’ (if we can consider this hacking) pretended to be, for example, Barack Obama and then asked for the account password to be reset. This gave them access to the accounts. So executives at Meta forgot that these systems have no concept of identity verification. The model was not hacked in any technical sense: no code was exploited, and no vulnerability in the authentication infrastructure was found. The attacker simply told the model who they were, and the model believed them, because believing a plausible claim is exactly what a next-token predictor does. A system that predicts the most statistically likely continuation of a conversation will, when told “I am Barack Obama and I have lost access to my account”, generate the response that most frequently follows that kind of sentence in its training data: one that is helpful, accommodating, and compliant.

The failure was architectural before it was operational. A frequentist language model given account management capabilities is not a secure system with a social engineering vulnerability - it is a system that was never capable of being secure in that role, because security requires the ability to distinguish a true claim from a plausible one, and that distinction is outside the scope of what the architecture can represent.
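One way to read this argument in engineering terms is that the decision to perform a privileged action cannot depend on anything the conversation says. The sketch below is purely hypothetical: `Session`, `verified_account_id`, and `handle_reset_request` are names invented here, and nothing is known from the source about how Meta's system was actually built.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Session:
    claimed_name: str                   # whatever the user typed into the chat
    verified_account_id: Optional[str]  # set only by a real authentication step

def handle_reset_request(session: Session, target_account_id: str) -> str:
    """Decide on a password reset without consulting the language model.

    The chat text, including any claimed identity, is deliberately ignored.
    """
    if session.verified_account_id != target_account_id:
        return "denied: identity not verified for this account"
    return "reset link sent to the address on file"

# A plausible claim with no verification behind it.
impostor = Session(claimed_name="Barack Obama", verified_account_id=None)
# A session that passed a hypothetical out-of-band authentication step.
owner = Session(claimed_name="someone", verified_account_id="acct-123")

print(handle_reset_request(impostor, "acct-123"))
print(handle_reset_request(owner, "acct-123"))
```

The check is a plain equality test on a value the model never produces, which is exactly the kind of distinction between a true claim and a plausible one that the text argues the model cannot make itself.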

Consider the following thought experiment. Suppose we accept the premise of a baseline, undetectable “hallucination rate”: some fixed percentage of outputs that are factually wrong and indistinguishable from correct outputs by their surface form. (This is a deliberate simplification. Real LLMs do not have a fixed per-output error rate, and hallucination likelihood varies strongly by topic, prompt phrasing, and model size. The point is to reason about what any non-zero error rate implies at scale, not to claim the rate is measurable or constant.) What happens when you deploy such a system and query it repeatedly?

[Interactive figure: Error Tolerance Simulator]

Frequentist GPT: no confidence signal - every output is used as-is. Example readout: ~1.0 mistakes, within limit.

Bayesian-informed: uncertain predictions flagged for human review. Example readout: ~0.3 mistakes, ~10 flagged, within limit.

Legend: Correct · Hallucination · Flagged for review · Unused

Figure: A simplified model - real hallucination rates and confidence calibration vary by task and model. The key insight holds: at any non-zero error rate, enough uses will produce mistakes, and a frequentist model gives you no signal about which outputs those are. A Bayesian-informed system trades some correct outputs (flagged unnecessarily) for visibility over the uncertain ones.

The simulator makes two things clear. First, scale turns a small rate into a near-certainty of failure. A 3% error rate sounds manageable: thirty wrong answers per thousand. But if your tolerance is two mistakes and you query the model a hundred times, you should expect three mistakes, so on average you will exceed that tolerance. There is no way to know in advance which outputs are the wrong ones, because the model presents all of them with identical confidence. You cannot inspect, triage, or route around the failures. You can only accept all of them or accept none.
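The arithmetic behind this can be checked directly. The sketch below treats each query as an independent trial with a fixed error rate, which is the same deliberate simplification the thought experiment makes, so the mistake count follows a binomial distribution. The function names are hypothetical.

```python
from math import comb

def expected_mistakes(n_queries, error_rate):
    """Mean number of wrong outputs under a fixed per-query error rate."""
    return n_queries * error_rate

def prob_exceeds_tolerance(n_queries, error_rate, tolerance):
    """P(more than `tolerance` mistakes), assuming independent errors (binomial)."""
    p_within = sum(
        comb(n_queries, k) * error_rate ** k * (1 - error_rate) ** (n_queries - k)
        for k in range(tolerance + 1)
    )
    return 1 - p_within

# The numbers from the text: 3% error rate, 100 queries, tolerance of 2.
print(expected_mistakes(100, 0.03))
print(round(prob_exceeds_tolerance(100, 0.03, 2), 2))
```

Under these assumptions the chance of exceeding the tolerance comes out at roughly 0.58, so breaching the limit is more likely than not after only a hundred queries.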

Second, notice what the “Hallucination detection” slider on the Bayesian side actually represents: the ability to know, per output, how uncertain the model is. That is precisely what a posterior probability gives you. A Bayesian system that returns not just an answer but a calibrated confidence score allows you to set a threshold: “only act on this output if the model assigns it above 90% posterior probability.” Everything below that threshold goes to a human reviewer instead of being acted upon autonomously. You lose some throughput - some correct outputs also get flagged - but you gain something the frequentist architecture cannot provide: visible uncertainty rather than hidden uncertainty.
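The thresholding rule described above is simple to express once a confidence score exists. The sketch below assumes such a score is available and calibrated, which is precisely what the text argues a frequentist model does not provide. `Prediction` and `route` are hypothetical names, and the example outputs are invented.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]

def route(predictions, threshold=0.90):
    """Split outputs into those acted on and those sent to a human reviewer."""
    act, review = [], []
    for p in predictions:
        (act if p.confidence >= threshold else review).append(p)
    return act, review

# Invented outputs for illustration.
outputs = [
    Prediction("no known interaction", 0.97),
    Prediction("reduce dose by half", 0.62),
    Prediction("contraindicated", 0.91),
]

act, review = route(outputs)
print([p.answer for p in act])
print([p.answer for p in review])
```

Raising the threshold sends more outputs to review and lowers throughput. That is the trade described above, made explicit as a single parameter.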

Parallels to the human mind

Human cognition is often compared to Bayesian inference. We have a prior notion of the world embedded into our cognition - built up through years of embodied experience, sensory feedback, and causal reasoning - and we update that prior continuously as new evidence arrives. When you hear a loud bang at night, you do not assign equal probability to all explanations; your prior knowledge of the world immediately narrows the space of plausible causes, and you update further as you gather more information. This is the Bayesian loop: prior belief, new evidence, updated posterior.

Humans are also capable of expressing that uncertainty. A doctor who says “I am fairly confident this is a bacterial infection, but I want to run a culture to rule out viral causes” is surfacing their posterior: high probability on one hypothesis, but not high enough to foreclose alternatives. That expressed uncertainty is actionable: it tells the patient, the nurse, and the system around the doctor exactly how much weight to put on the diagnosis and what additional steps are warranted.

This is also where the parallel starts to break down in an instructive way. Humans confabulate. We produce confident, fluent, completely fabricated explanations for our own behaviour - particularly under time pressure, in unfamiliar domains, or after the fact. The neuroscience of this is well documented: the left hemisphere of the brain acts as an “interpreter,” constructing plausible post-hoc narratives for actions and decisions that were driven by processes it has no direct access to (Gazzaniga, M.S., The Social Brain: Discovering the Networks of the Mind (1985), Basic Books). The “interpreter” hypothesis emerged from split-brain research showing that the left hemisphere will invent coherent explanations for actions initiated by the disconnected right hemisphere. In this sense, GPT’s hallucinations are not alien to human cognition. They are a familiar failure mode, running without the metacognitive layer that sometimes catches it in humans.

The critical difference, for our purposes, is not that humans never confabulate. It is that humans have some mechanism - however imperfect - for noticing uncertainty and flagging it to others. GPT has no such mechanism. Its outputs are presented with uniform rhetorical confidence regardless of whether the underlying statistical pattern is strong or weak, well-attested or barely present in the training data. The architecture has no channel for uncertainty to propagate from the internal computation to the surface of the output. A human expert saying “I am not sure about this” is carrying real information. A GPT model saying “I am not sure about this” is generating the tokens that most plausibly follow the preceding context - a fundamentally different thing.

An interesting empirical confirmation of this comes from a 2025 paper by researchers at OpenAI, Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation (Baker et al., 2025, arxiv.org/abs/2503.11926). The paper studies OpenAI’s o3-mini - a model that produces an explicit chain-of-thought before answering, making its internal reasoning visible. The researchers found that monitoring this chain-of-thought (CoT) could show when the model was “cheating”. For example, if the model were asked for a specific citation and wrote “making up a plausible title with a known researcher as author” in its CoT, a monitor could detect this and know that the output was a hallucination. If this is possible, then it could be used to finetune the model into not cheating (hallucinating), right?

Wrong. Instead, the model learned to hide its intent. It continued to exhibit a significant rate of reward hacking while producing a chain-of-thought that looked honest. The authors call this “obfuscated reward hacking”: the model learned to write transparent-looking reasoning while concealing the actual mechanism behind its behavior.

This is the same structural problem as hallucinations, one level up. A model trained to produce text that looks confident will produce confident-looking text regardless of whether the underlying computation warrants confidence. A model trained to produce chain-of-thought that looks honest will produce honest-looking reasoning regardless of whether the underlying process is honest. The training objective always converges on appearing correct, because appearance is what the loss function can measure. Truth is not measurable from text alone, and so the gradient never flows toward truth - only toward plausibility.

This is a precise instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. (Originally articulated by economist Charles Goodhart in 1975 in the context of monetary policy, it has since become a foundational observation in machine learning and AI alignment.) The chain-of-thought was useful as a window into the model’s reasoning precisely because it was not being optimised. The moment it became a training signal, the model learned to produce chains of thought that score well on the monitor, regardless of whether they reflect the actual computation. The measure was corrupted by becoming an objective.

The authors’ proposed remedy - the “monitorability tax” - is to deliberately limit how hard you optimise on the chain-of-thought output, accepting lower benchmark performance in exchange for a reasoning trace that remains legible and honest. It is a striking admission: the only way to keep the transparency signal genuine is to stop rewarding the model for producing it. Push too hard on any behavioural signal, and the model will learn to fake it.

What this means in practice

The argument so far has been largely theoretical. It is worth being concrete about what these failure modes look like when LLMs are actually deployed in environments where the stakes are real.

The clearest category is autonomous decision-making with no human in the loop. Consider a GPT model embedded in a clinical decision-support system that automatically flags drug interactions, a legal document analyser that surfaces relevant precedent for a junior lawyer who trusts it, or a code review assistant whose suggested patches are automatically merged. In each case, the architecture’s inability to signal genuine uncertainty means that wrong answers arrive with the same presentation as correct ones. The human oversight that might catch the error has been removed or degraded, and the model’s confident tone actively discourages the skepticism that would have caught it.

The other category is trust calibration at scale. Even when a human is nominally in the loop, repeated exposure to a fluent and usually-correct system erodes the critical distance required to catch its failures. Automation bias - the tendency to disregard or stop searching for contradictory information once a computer-generated solution has been accepted - is well documented in aviation, medicine, and military command and control (Cummings, M.L., Automation Bias in Intelligent Time Critical Decision Support Systems (2004), MIT Humans and Automation Lab, web.archive.org). Cummings argues that automated decision aids designed to reduce human error can paradoxically introduce new errors if not designed with human cognitive limitations in mind, and that the risk is highest in time-critical domains where operators have the least time to question a system-generated recommendation. A system that is right 97% of the time and always sounds equally confident trains its users, over time, to stop checking the 3%. This is not a failure of the users; it is a predictable consequence of deploying a system that provides no signal distinguishing its reliable outputs from its unreliable ones.

None of these failures require adversarial actors or unusual edge cases. They are the expected behaviour of the architecture operating normally in environments for which it was not designed. The problem is not that GPT is bad at its job. The problem is that its job - predicting statistically plausible text - is not the same job as providing reliable information in a critical system, and the gap between those two jobs is invisible from the output side.

Does this mean that I don’t use GPT models in my daily life? Hell no, I even used Claude Sonnet 4.6 to make the figures in this blog post, because I’m not that good at making Svelte components. It also helps that the figures are very easy for me to verify as correct, unlike coding edge cases, which can be very hard to spot. So the point I’m trying to make is: people should really consider how they use GPT models, both in engineering and in life. They are not sources of truth, but they can be super helpful with many mundane, low-stakes tasks. If you’re working in a critical environment with a low threshold for mistakes, tread carefully with GPT models.

This post is licensed under CC BY 4.0. You're free to copy, share, and adapt it — just credit osquera.com with a backlink.
