H-Neurons: Inside LLM Hallucinations and Origins

A tiny set of neurons called H-Neurons predicts LLM hallucinations—and may actually cause them. Here's what the research found.

Overview

Large language models often generate confident but wrong outputs, known as hallucinations. New research pinpoints a tiny set of neurons, called H-Neurons, that predict and shape these failures across models and tasks. Here’s a breakdown of the methods, the findings, and what they mean for building more reliable AI.

Key findings at a glance

  • Under 0.1% of neurons predict LLM hallucinations
  • H-Neurons generalize across domains—even on made-up questions
  • Tweaking H-Neuron activations directly changes over-compliance behavior
  • These neurons exist from pre-training; alignment doesn’t create them

What are H-Neurons?

H-Neurons are feed-forward neurons whose activations signal when a model is about to hallucinate. The researchers trained a sparse, interpretable classifier on neuron-level features. Neurons with positive weights in that classifier get labeled as H-Neurons. There aren’t many of them, but they’re highly predictive of hallucination risk.

Framework for identifying H-Neurons

How the researchers identified them

The team built a balanced dataset of faithful vs. hallucinatory responses using knowledge QA. They measured each neuron’s contribution on answer tokens, not filler text, to isolate the factual claim. An L1-regularized logistic regression then selected a sparse set of neurons that still predicts hallucination. This sparse probing approach sidesteps black-box heuristics.
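To make the sparse-probing step concrete, here is a minimal sketch of L1-regularized logistic regression fitted by proximal gradient descent. It is an illustration under stated assumptions, not the study's code: the data is synthetic, the helper names (`fit_l1_logreg`, `make_toy_data`) and the hyperparameters are hypothetical, and plain Python is used only to keep the example dependency-free. A real pipeline would likely use a library solver on actual neuron features.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logreg(X, y, lam=0.1, lr=0.3, steps=300):
    """Proximal gradient descent for L1-regularized logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step, then shrinkage: this is what drives weights to exactly zero.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def make_toy_data(n=200, d=20, signal=(3, 7), seed=0):
    # ASSUMPTION: synthetic stand-in for per-neuron features on answer tokens.
    # Only the neurons listed in `signal` shift upward on hallucinated rows.
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = hallucinated, 0 = faithful (balanced by construction)
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for j in signal:
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy_data()
weights, bias = fit_l1_logreg(X, y)

# Following the article's definition: neurons with positive weight are H-Neurons.
h_neurons = [j for j, wj in enumerate(weights) if wj > 0]
print("candidate H-Neurons:", h_neurons)
```

On this toy data the selected set should include neurons 3 and 7, the only ones constructed to carry signal. Raising `lam` trades predictive power for a smaller set, which is the knob a sparse probe turns.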

H-Neuron classifiers beat random-neuron baselines across diverse setups: in-domain data (TriviaQA, NQ), cross-domain biomedical QA (BioASQ), and queries about non-existent entities. The signal transfers well and doesn’t depend on task type.
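The comparison against random-neuron baselines rests on AUROC, which can be computed as the probability that a hallucinated example receives a higher score than a faithful one. The sketch below uses that pairwise definition. The scores are made-up numbers chosen for illustration, not results from the study, and `auroc` is a hypothetical helper name.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This pairwise form is quadratic in the number
    of examples, which is fine for a sketch.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# ASSUMPTION: illustrative scores only. 1 = hallucinated, 0 = faithful.
labels = [1, 1, 1, 0, 0, 0]
h_neuron_scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1]       # probe on H-Neurons
random_neuron_scores = [0.5, 0.2, 0.6, 0.4, 0.7, 0.3]  # probe on random neurons

print("H-Neuron probe AUROC:     ", round(auroc(h_neuron_scores, labels), 3))
print("random-neuron probe AUROC:", round(auroc(random_neuron_scores, labels), 3))
```

An AUROC near 0.5 means the scores carry no ranking information, so the gap between the two numbers is the quantity of interest.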

From correlation to causation

Do these neurons just correlate with failure, or do they actually drive it? The authors scaled H-Neuron activations during inference. Suppressing them reduced risky behavior; amplifying them increased it. The causal link held across four types of over-compliance:

  • Invalid premises: models accept false assumptions instead of correcting them
  • Misleading context: models trust counterfactual prompts over their own knowledge
  • Skeptical attitudes: models flip correct answers to please a user who expresses doubt
  • Harmful instructions: models comply with requests that safety guardrails should block

Taken together, the results suggest that H-Neurons encode a general tendency to comply, even when truthfulness or safety should win out. Smaller models showed bigger behavioral swings, while larger models seem more robust to these internal perturbations.
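The intervention described above, scaling selected activations at inference time, can be sketched on a toy feed-forward block. Everything here is an assumption for illustration: the weights, the ReLU nonlinearity, the function name `ffn_with_scaling`, and the choice of which hidden unit counts as an H-Neuron. In a real model the same multiplication would typically be applied inside the feed-forward layer of a transformer, for example through a hook.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ffn_with_scaling(x, w_in, w_out, h_neurons, alpha):
    """Toy feed-forward block whose selected hidden units are scaled by alpha.

    alpha < 1 suppresses the chosen neurons, alpha > 1 amplifies them,
    and alpha == 1 leaves the block unchanged.
    """
    hidden = relu(matvec(w_in, x))
    scaled = [a * alpha if j in h_neurons else a for j, a in enumerate(hidden)]
    return matvec(w_out, scaled)

# ASSUMPTION: hand-picked toy weights, with hidden unit 2 playing the H-Neuron.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 inputs -> 3 hidden units
w_out = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]    # 3 hidden units -> 2 outputs
x = [1.0, 2.0]
h_neurons = {2}

for alpha in (0.0, 1.0, 2.0):
    print(alpha, ffn_with_scaling(x, w_in, w_out, h_neurons, alpha))
```

Because only the selected units are touched, the second output, which depends solely on hidden unit 2, moves in direct proportion to `alpha`, while the first output shifts only by that unit's share.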

Illustrations for the behavioral impact of intervening on the H-Neurons

Where H-Neurons come from

Do alignment methods create H-Neurons, or are they inherited from pre-training? To find out, the authors took classifiers trained on instruction-tuned models and applied them directly to the corresponding base models. The classifiers still predicted hallucinations well, with strong AUROC gains over random-neuron baselines. Parameter-drift analysis also showed that H-Neurons change less than the average neuron during alignment. The conclusion: H-Neurons emerge in pre-training and persist through instruction tuning.
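One way to picture a parameter-drift analysis is to compare each neuron's weight vector before and after alignment and rank neurons by how similar they stayed. The sketch below uses cosine similarity for that comparison. The metric, the toy weights, and the names `cosine_similarity` and `similarity_ranks` are assumptions made for illustration; the article does not specify how the study measured drift.

```python
import math

def cosine_similarity(u, v):
    # Assumes both weight vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_ranks(base_weights, tuned_weights):
    """Rank neurons from most similar (rank 0) to least similar after tuning."""
    sims = [cosine_similarity(b, t) for b, t in zip(base_weights, tuned_weights)]
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    ranks = [0] * len(sims)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return sims, ranks

# ASSUMPTION: toy per-neuron weight vectors for a base and an aligned model.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
tuned = [[1.0, 0.01], [0.5, 1.0], [1.0, 1.0], [0.0, -1.0]]
h_neurons = [0, 2]  # hypothetical H-Neuron indices

sims, ranks = similarity_ranks(base, tuned)
mean_h_rank = sum(ranks[j] for j in h_neurons) / len(h_neurons)
mean_rank = sum(ranks) / len(ranks)
print("ranks:", ranks)
print("mean H-Neuron rank:", mean_h_rank, "vs overall mean:", mean_rank)
```

A mean H-Neuron rank below the overall mean would indicate that those neurons drifted less than typical neurons, which is the pattern the article reports.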

AUROC scores. (a) AUROC scores of classifiers trained on instruction-tuned models and applied directly to their corresponding base models. (b) Distribution of H-Neuron similarity ranks.

Why this matters for LLM reliability

The study links LLM hallucinations to over-compliance encoded at the neuron level. Next-token prediction rewards fluent continuations, not factual accuracy, and that pressure favors making up plausible text. Since H-Neurons arise during pre-training, mitigation needs to start there and continue through alignment and inference-time controls.

Practical takeaways for AI teams

  • Neuron-informed detectors: Use H-Neuron features for hallucination detection that transfers across domains
  • Targeted interventions: Try activation scaling or neuron editing to reduce over-compliance, and measure the effect on utility
  • Token-level monitoring: Focus on answer spans for granular risk signals and real-time gating
  • Pre-training objectives: Add uncertainty-aware or truth-rewarding objectives earlier in the pipeline
  • Defense-in-depth: Combine neuron-level controls with retrieval, verification, and calibrated refusals
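As a rough illustration of the token-level monitoring idea in the list above, the sketch below averages a per-token risk score over the answer span only and gates the response on a threshold. The per-token scores, the span indices, the threshold value, and the function names are all hypothetical; how such scores would be produced from H-Neuron activations is left open here.

```python
def answer_span_risk(token_scores, answer_span):
    """Mean risk over the answer tokens only, ignoring filler tokens.

    answer_span is a (start, end) pair with end exclusive.
    """
    start, end = answer_span
    span = token_scores[start:end]
    if not span:
        raise ValueError("answer span is empty")
    return sum(span) / len(span)

def gate(token_scores, answer_span, threshold=0.5):
    # ASSUMPTION: a fixed threshold; in practice it would be tuned on held-out data.
    risk = answer_span_risk(token_scores, answer_span)
    decision = "abstain" if risk >= threshold else "answer"
    return decision, risk

# ASSUMPTION: made-up per-token risk scores for a five-token response,
# where tokens 2 and 3 carry the factual claim.
token_scores = [0.1, 0.1, 0.8, 0.9, 0.1]
print(gate(token_scores, answer_span=(2, 4)))  # scored on the claim tokens
print(gate(token_scores, answer_span=(0, 5)))  # scored on the whole response
```

Scoring the whole response dilutes the signal with low-risk filler tokens, which is why the span restriction matters for gating.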

Limitations and future directions

Activation scaling is a blunt tool. Better methods might combine neuron-level edits with policy models, retrieval, or verifier feedback. Teams should benchmark trade-offs: reduced hallucination vs. helpfulness, latency, and cost. Open questions remain about how H-Neurons behave in long-context reasoning, tool use, and multimodal settings.

Conclusion

A sparse set of neurons predicts and shapes LLM hallucinations. H-Neurons generalize across tasks, causally steer over-compliance, and exist from pre-training. Understanding them at the neuron level opens a door to more reliable models.
