AI Glossary 2026

Clear definitions for the era of Agentic AI and Spatial Intelligence.

Inference & Engineering

In-Context Learning (ICL)

In-Context Learning (ICL) is the ability of large language models to solve new tasks directly from examples provided in the input prompt, without updating model weights and without any additional training. The model infers the task's pattern from the provided examples and applies that logic to the actual query. The mechanism operates through prompt structure: when input-output pairs (called shots) are prepended to the query, the model implicitly picks up the task format and the expected output logic. Zero-shot ICL provides no examples at all; few-shot ICL typically provides two to eight demonstrations. ICL is a defining capability of modern foundation models: it enables flexible adaptation to new tasks without expensive fine-tuning. For organizations, this means that many use cases, from classification and extraction to translation and summarization, can be solved through carefully designed prompts alone. The quality and representativeness of the in-prompt examples directly determine output accuracy.
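
As a minimal illustration, the sketch below builds a three-shot classification prompt in plain Python. No particular provider API is assumed; the demonstration pairs and the query are illustrative placeholders, and the resulting string can be sent to any completion or chat endpoint.

# Build a few-shot (3-shot) prompt for in-context learning.
# Each demonstration pair shows the model the task format; the final line
# leaves the label blank for the model to complete.
demonstrations = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
    ("Shipping was fast and the box arrived intact.", "positive"),
]
query = "The manual is confusing and support never answered."

lines = ["Classify the sentiment of each review as positive or negative.", ""]
for text, label in demonstrations:
    lines += [f"Review: {text}", f"Sentiment: {label}", ""]
lines += [f"Review: {query}", "Sentiment:"]  # the model continues with the label

prompt = "\n".join(lines)
print(prompt)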

Economics & Scale

Inference Cost

Inference cost refers to the financial expenditure incurred when operating an AI language model, i.e. the cost of processing every user request. Unlike training costs (one-time and very high), inference costs accrue continuously with every request and are the dominant AI cost factor in ongoing operations. Inference is typically priced per token, with separate rates for input and output. As of 2026: GPT-4o runs approximately $2–5 per million input tokens and $8–15 per million output tokens; Claude Sonnet is at $3/M input and $15/M output; more affordable models such as Claude Haiku or Gemini Flash range from $0.25–1/M tokens. Output tokens are more expensive than input tokens because they must be generated sequentially, one forward pass at a time, so cost-efficient systems actively optimize output length. Cost drivers include: model size (more parameters means higher cost), context length (longer contexts increase input token costs disproportionately), output length, provider hardware, peak vs. off-peak usage, and licensing model (API vs. self-hosted). Inference costs have fallen over 100× since 2023: GPT-4-equivalent performance now costs roughly 1% of its 2023 price, driven by hardware advances and competition, and this trend continues with Blackwell and Vera Rubin deployments. Key optimization strategies: model routing (cheap models for simple tasks), batch inference (typically 50–75% discounts), prompt optimization (request shorter outputs), and caching of frequent requests.
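
The arithmetic is simple enough to sketch. The back-of-the-envelope estimate below uses the illustrative Sonnet-class prices quoted above (not authoritative list prices); the workload figures (requests per day, tokens per request) are made-up assumptions.

# Rough monthly inference-cost estimate from per-token prices.
# Prices are the illustrative figures from the entry above ($ per 1M tokens).
PRICE_IN_PER_M = 3.00    # input price, Sonnet-class model (illustrative)
PRICE_OUT_PER_M = 15.00  # output price, Sonnet-class model (illustrative)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in dollars."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Assumed workload: 50,000 requests/day, ~1,500 input and ~400 output tokens each.
per_request = request_cost(1_500, 400)
monthly = per_request * 50_000 * 30
print(f"~${per_request:.4f} per request, ~${monthly:,.0f} per month")
# Note how output dominates: 400 output tokens cost more here than 1,500 input
# tokens, which is why trimming output length is a standard optimization.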

Agentic Infrastructure

Inference Optimization

Inference optimization encompasses all techniques and strategies employed to improve the performance (latency, throughput) and/or cost efficiency of AI inference systems without significantly degrading the quality of generated outputs. The key optimization layers are: (1) Model level: quantization (reducing numerical precision from FP16 to INT8 or FP4), pruning (removing low-importance model weights), and distillation (training smaller models on the outputs of larger ones); (2) Serving level: continuous batching (dynamically grouping requests), KV-cache optimization, and PagedAttention (efficient memory management for long contexts); (3) Hardware level: tensor parallelism, Flash Attention, and kernel fusion; (4) System level: speculative decoding, model routing, and response caching. Speculative decoding deserves special mention: a small "draft model" proposes several candidate tokens, which the larger target model then verifies in a single forward pass, accepting matching tokens and rejecting the rest. With a good draft model, this can increase effective generation speed by 2–4×. Frameworks like vLLM, TensorRT-LLM, and DeepSpeed-Inference have become the standard for optimized serving. They implement many of these techniques automatically and can achieve 10–20× better throughput than naive Hugging Face Transformers serving. In cloud deployments, model routing, i.e. automatically directing simpler queries to cheaper, faster models and complex queries to more capable ones, is often the highest-leverage optimization available without requiring infrastructure changes.
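
To make the speculative-decoding idea concrete, here is a schematic sketch of the simplified greedy variant. draft_model and target_model are placeholder callables that return a next-token id, not a real library API; production systems verify all draft positions in one batched forward pass and use rejection sampling over token probabilities rather than exact greedy matching.

# Simplified (greedy) speculative decoding loop.
# draft_model(tokens)  -> next token id from the small draft model   (placeholder)
# target_model(tokens) -> next token id from the large target model  (placeholder)
def speculative_decode(prompt_tokens, draft_model, target_model,
                       k: int = 4, max_new_tokens: int = 64):
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_new_tokens:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The target model checks each proposal; in a real system this is
        #    a single batched forward pass over all k draft positions.
        accepted, ctx = [], list(tokens)
        for t in draft:
            expected = target_model(ctx)
            if expected != t:
                accepted.append(expected)  # keep the target's token at the first mismatch
                break
            accepted.append(t)
            ctx.append(t)
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[:len(prompt_tokens) + max_new_tokens]

Each round keeps at least the target model's own token, so the loop always makes progress and the output matches what the target model would have produced on its own; whenever the draft model guesses well, the expensive model advances several tokens per verification step, which is where the 2–4× speedup comes from.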

Agentic Infrastructure

Inference Chip

An inference chip is a specialized semiconductor processor optimized for efficiently running AI models during inference. Unlike general-purpose CPUs or training-optimized GPUs, inference chips prioritize throughput (tokens per second, TPS), energy efficiency, and low latency for already-trained models. The three dominant categories: GPUs such as NVIDIA's H100 and Blackwell B200, excelling through massive parallel compute and specialized Tensor Cores; TPUs (Tensor Processing Units) from Google, purpose-built for the matrix multiplications at the heart of neural networks; and ASICs (Application-Specific Integrated Circuits) for single-task optimization, including Groq's LPU achieving 500+ TPS, Cerebras' CS-3, and Amazon's Inferentia chips. NVIDIA's Blackwell generation (GB200, B200) has reshaped the inference landscape: native FP4 support enables roughly 4× more operations per watt versus the H100, and 192GB of HBM3e per GPU is enough to hold many frontier models entirely in VRAM, especially when quantized. The GB200 NVL72 rack (72 Blackwell GPUs, roughly 13.5TB of total HBM3e) is advertised at up to 30× higher inference throughput than comparable H100 systems. The right chip selection profoundly influences cost, latency, and maximum model size: smaller models run efficiently on a single H100, while frontier models require multi-GPU nodes or clusters of accelerators. As model quantization (FP4, INT8) becomes standard, ASICs increasingly outperform GPUs for fixed-workload inference at dramatically lower power.
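
A quick way to see why chip memory bounds model size: weight memory is parameter count times bytes per parameter, plus overhead for the KV cache and activations. The sketch below uses a crude flat 20% overhead (an assumption; real overhead depends on context length and batch size) and the per-GPU capacities cited above.

# Rough check: do a model's weights fit in a single accelerator's memory?
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "FP4": 0.5}
GPU_MEMORY_GB = {"H100": 80, "B200": 192}

def memory_needed_gb(params_billions: float, precision: str) -> float:
    # weights + ~20% overhead for KV cache / activations (crude assumption)
    return params_billions * BYTES_PER_PARAM[precision] * 1.2

for name, size_b in [("8B model", 8), ("70B model", 70), ("405B model", 405)]:
    for prec in ("FP16", "FP4"):
        need = memory_needed_gb(size_b, prec)
        fits = [gpu for gpu, cap in GPU_MEMORY_GB.items() if need <= cap]
        print(f"{name} @ {prec}: ~{need:.0f} GB -> {', '.join(fits) or 'multi-GPU required'}")

Under these assumptions a 70B model at FP4 comfortably fits a single H100, while a 405B model still needs multiple GPUs even at FP4, illustrating how precision and chip choice jointly set the maximum servable model size.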
