---
type: Glossary Term
title: Transformer
description: "A Transformer is a neural-network architecture, introduced by Vaswani et al. in the 2017 paper \"Attention Is All You Need,\" that processes sequences using a mec"
resource: "https://www.contextstudios.ai/glossary/transformer"
category: tech
language: en
timestamp: "2026-07-01T13:51:56.422Z"
---

# Transformer

A Transformer is a neural-network architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," that processes sequences using a mechanism called self-attention instead of the step-by-step recurrence of earlier models. It is the foundational architecture behind virtually every large language model (LLM) in production today.

Its core innovation is self-attention: for every token in the input, the model computes how relevant every other token is and weights them accordingly. This lets the network capture long-range relationships — the link between a pronoun and a noun 500 words earlier — in a single, parallelizable operation. The original design had an encoder (reads and represents the input) and a decoder (generates output token by token). Modern generative LLMs are typically decoder-only; translation and embedding models often keep the encoder.

The Transformer replaced RNNs and LSTMs, which processed tokens one at a time — slow to train and prone to "forgetting" over long sequences. Because self-attention processes all tokens simultaneously, it became feasible to train models on trillions of tokens using GPUs at scale.

Every major 2026 frontier model is a Transformer: GPT-5.5 (OpenAI), Claude Opus 4.8 / Sonnet 4.6 (Anthropic), and Gemini 3 (Google). The "T" in GPT stands for Transformer. The same architecture also powers multimodal systems (image, audio, video) by converting those inputs into token sequences the attention mechanism can process.

Practical caveat: self-attention's cost grows quadratically with sequence length, making very long contexts expensive. This drove the 2026 rise of hybrid architectures — models like Jamba, Nemotron-H and Zamba2 that interleave attention layers with state-space models (SSMs) such as Mamba/Mamba-2. SSMs scale roughly linearly and run far faster on long inputs but still trail on short-context reasoning. The 2026 consensus: the Transformer remains the default; hybrids are the pragmatic answer for long-context and latency-sensitive workloads, not a wholesale replacement.
