---
type: Glossary Term
title: Natural Language Autoencoder (NLA)
description: A natural language autoencoder (NLA) is an interpretability technique from AI safety research that translates a language model's internal activations into a pla
resource: "https://www.contextstudios.ai/glossary/natural-language-autoencoders"
category: safety
language: en
timestamp: "2026-06-20T12:05:01.125Z"
---

# Natural Language Autoencoder (NLA)

A natural language autoencoder (NLA) is an interpretability technique from AI safety research that translates a language model's internal activations into a plain-text description and then reconstructs the original activation from that text. Where a conventional autoencoder squeezes data through a numerical latent bottleneck, an NLA deliberately uses human-readable language as the bottleneck. Because the reconstruction has to work from the text alone, the description must carry the information the activation encodes. The result is a readable view of the concepts a model is representing at a given moment, rather than an opaque vector of numbers.
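
The round trip described above (activation to text, text back to activation) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the 4-dimensional activation space, the `CONCEPTS` dictionary, and the helper names are hypothetical, and a real NLA would presumably use learned language models rather than a fixed lookup table. The sketch only shows the shape of the idea: the decoder sees nothing but the text, and a similarity score measures how much of the activation survived the language bottleneck.

```python
import math

# Hypothetical concept dictionary: each word maps to a direction in a toy
# 4-dimensional activation space. This stands in for learned components.
CONCEPTS = {
    "testing": [1.0, 0.0, 0.0, 0.0],
    "math": [0.0, 1.0, 0.0, 0.0],
    "refusal": [0.0, 0.0, 1.0, 0.0],
    "poetry": [0.0, 0.0, 0.0, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode_to_text(activation, k=2):
    """Verbalize: describe the activation by its k strongest concepts."""
    scores = {name: dot(activation, d) for name, d in CONCEPTS.items()}
    top = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:k]
    return ", ".join(f"{name}:{scores[name]:.1f}" for name in top)

def decode_from_text(text):
    """Reconstruct: rebuild an activation using only the text."""
    recon = [0.0] * 4
    for item in text.split(", "):
        name, weight = item.split(":")
        recon = [r + float(weight) * d for r, d in zip(recon, CONCEPTS[name])]
    return recon

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

activation = [0.9, 0.1, 0.0, 0.4]
text = encode_to_text(activation)           # the human-readable bottleneck
reconstruction = decode_from_text(text)     # built from the text alone
score = cosine(activation, reconstruction)  # how much information survived
print(text, reconstruction, round(score, 3))
```

In this toy setup the weak "math" component is dropped by the two-concept description, so the reconstruction is close to the original but not identical. That gap is the price of forcing the information through readable text.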

Anthropic applied the approach in its interpretability work to understand how a model frames a situation internally, for instance whether it recognizes that it is currently being tested. In this way an NLA bridges mechanistic interpretability (reverse-engineering a model's internal circuits) and explanations a person can read directly. Instead of painstakingly decoding individual neurons, the method delivers a compact linguistic summary of the representations that are active.

This matters for AI safety because it lets researchers probe behaviors such as evaluation awareness or sandbagging (deliberate underperformance) at the level of internal processing, not just the final output. The natural-language reconstruction makes it testable whether an explanation captures model behavior causally or merely sounds plausible, an important step toward trustworthy, auditable AI systems.
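
One way such a causal test could look, as a minimal sketch under stated assumptions: edit the explanation, decode the edited text back into an activation, and check whether downstream behavior changes. The concept directions, the linear `READOUT` standing in for the rest of the model, and the function names are all hypothetical. An actual experiment would patch the reconstructed activation into a real model rather than a fixed linear readout.

```python
# Hypothetical concept directions in a toy 3-dimensional activation space.
CONCEPTS = {
    "being_tested": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
    "formal_tone": [0.0, 0.0, 1.0],
}

# Stand-in for the rest of the model: a fixed linear readout whose value we
# treat as "how cautiously the model behaves". Purely illustrative.
READOUT = [2.0, 0.0, 0.5]

def decode(explanation):
    """Rebuild an activation from a {concept: strength} explanation."""
    vec = [0.0, 0.0, 0.0]
    for name, strength in explanation.items():
        vec = [v + strength * d for v, d in zip(vec, CONCEPTS[name])]
    return vec

def behavior(activation):
    """Downstream behavior as a linear function of the activation."""
    return sum(w * a for w, a in zip(READOUT, activation))

def causal_effect(explanation, concept):
    """Change in behavior when one concept is deleted from the explanation."""
    edited = {k: v for k, v in explanation.items() if k != concept}
    return behavior(decode(explanation)) - behavior(decode(edited))

explanation = {"being_tested": 0.8, "arithmetic": 0.6}
effect_tested = causal_effect(explanation, "being_tested")
effect_arith = causal_effect(explanation, "arithmetic")
print(effect_tested, effect_arith)
```

In this toy, removing "being_tested" shifts the behavior while removing "arithmetic" does not, so only the first part of the explanation is causally relevant to this particular behavior. A term that reads plausibly but produces no change under this kind of edit would be a candidate for "merely sounds plausible".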
