AI Glossary 2026

Clear definitions for the era of Agentic AI and Spatial Intelligence.

Inference & Engineering

Terminal-Bench (AI Coding Benchmark)

Terminal-Bench is an evaluation framework for measuring the performance of AI coding agents in real-world development environments. Unlike traditional code benchmarks that test isolated snippets, Terminal-Bench evaluates the full development cycle: agents must autonomously execute code in a terminal, debug errors, navigate file systems, and solve complex multi-step engineering problems. The framework measures the capabilities of modern coding agents such as Claude Code, GitHub Copilot Workspace, and similar systems under authentic working conditions. On Terminal-Bench 2.1, the current version, Anthropic's Mythos Preview achieved a score of 92.1% with a 4-hour timeout, well above the previous best of 82%. A key insight from Terminal-Bench is its sensitivity to compute time: the more time a model is given to work on a task, the higher its success rate tends to be. This suggests that many modern AI coding agents are limited less by capability than by their compute-time budget, a distinction that matters greatly for how teams design, budget, and scale AI-assisted development workflows.

Inference & Engineering

Test-Time Compute Scaling

Test-time compute scaling (also called inference-time compute scaling) is the strategy of giving an AI model more computational resources when answering a query, rather than only investing more compute during training. Traditional language models spend a fixed amount of compute per query: they generate each output token with a single forward pass and return the answer immediately. Test-time compute scaling breaks with this pattern: the model is allowed to spend more time and resources exploring multiple solution paths, checking intermediate results, or self-correcting before producing a final answer. In practice, this means simple tasks get a quick pass while complex problems (multi-step code debugging, strategic analysis, autonomous task execution) can achieve dramatically better results with a larger compute budget. Claude Mythos Preview demonstrated this powerfully, scoring 92.1% on Terminal-Bench 2.1 with a 4-hour timeout, compared to significantly lower scores under tighter time constraints. Test-time compute scaling is closely related to chain-of-thought reasoning and modern AI agent architectures, both of which leverage iterative thinking to improve output quality. For businesses, this means model 'intelligence' is no longer a fixed property: it can be actively tuned by allocating compute resources to match task complexity.
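A minimal sketch of one such strategy, best-of-n sampling with a verifier. The functions `generate_candidate` and `score_candidate` are hypothetical placeholders for a model call and a checker (unit tests, a reward model, or majority voting); the point is that answer quality tends to rise with `n_samples`, i.e. with the inference budget.

```python
import random

def generate_candidate(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder: one sampled completion from the model.
    return f"candidate answer to {prompt!r} (seed={random.random():.3f})"

def score_candidate(prompt: str, candidate: str) -> float:
    # Placeholder: a verifier score; higher is better.
    return random.random()

def answer_with_budget(prompt: str, n_samples: int) -> str:
    """Spend more inference compute (larger n_samples) to raise quality."""
    candidates = [generate_candidate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda c: score_candidate(prompt, c))

# A simple query might use n_samples=1; a hard debugging task n_samples=32.
print(answer_with_budget("Fix the failing test in parser.py", n_samples=8))
```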

Agentic Infrastructure

Third-party Harness

A third-party harness is a software layer that lets external developers use and extend AI models beyond the official APIs or authorized interfaces. The term refers to frameworks that act as intermediaries between AI models (such as Claude, GPT, or Gemini) and end users, providing additional capabilities like multi-model orchestration, enhanced tool integration, or custom workflows. A prominent example is OpenClaw, an open-source harness that extends Anthropic's Claude models with advanced features including background processes, cron jobs, and integration with external tools. Harnesses differ from official APIs in that they often piggyback on subscription-based access rather than metered API billing, offering a cost-effective alternative for developers building experimental or production-ready AI applications. Using third-party harnesses raises important questions about long-term stability: providers like Anthropic can restrict subscription access at any time, which can cause sudden service disruptions. Companies should therefore use harnesses only for non-critical workflows, or migrate to official API contracts with SLA guarantees once they reach production maturity.
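A rough illustration of the harness idea in Python. Every class and method name here is hypothetical, not the API of OpenClaw or any real harness; it only shows the shape of the intermediary layer: one interface over several model backends, plus extra plumbing such as scheduled jobs.

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """Common interface the harness exposes over different providers."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class ClaudeBackend(ModelBackend):
    def complete(self, prompt: str) -> str:
        return f"[claude] response to: {prompt}"  # placeholder, no real API call

class GeminiBackend(ModelBackend):
    def complete(self, prompt: str) -> str:
        return f"[gemini] response to: {prompt}"  # placeholder

class Harness:
    """Routes requests to backends and layers on orchestration features."""
    def __init__(self, backends: dict[str, ModelBackend]):
        self.backends = backends
        self.scheduled_jobs: list[tuple[str, str, str]] = []  # (cron, model, prompt)

    def run(self, model: str, prompt: str) -> str:
        return self.backends[model].complete(prompt)

    def schedule(self, cron: str, model: str, prompt: str) -> None:
        # The kind of cron-job feature the entry attributes to harnesses.
        self.scheduled_jobs.append((cron, model, prompt))

harness = Harness({"claude": ClaudeBackend(), "gemini": GeminiBackend()})
print(harness.run("claude", "Summarize today's error logs"))
harness.schedule("0 7 * * *", "gemini", "Draft the morning status report")
```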

Agentic Business

Tool Calling

Tool calling is the ability of AI language models to invoke external functions, APIs, or services to accomplish tasks that go beyond text generation. Rather than relying solely on trained knowledge, a model with tool calling can access real-time data, execute code, perform calculations, or control external systems. The mechanism works like this: the model receives a list of available tools with descriptions and parameter schemas; when it needs one, it emits a structured call, which the host system executes before returning the result to the model; the model then either issues additional tool calls or generates its final answer. Tool calling is a prerequisite for real AI agents: it is what allows models to interact with the outside world, automate workflows, and solve complex multi-step tasks autonomously. Modern standards like the Model Context Protocol (MCP) define how tools are registered and called, making it easier to connect AI systems to existing enterprise infrastructure. Tool calling differs from retrieval in that it is fully bidirectional: the model can both read from and write to external systems, enabling truly agentic behavior.
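The loop described above can be sketched in a few lines of Python. The model side is mocked here; in a real system the structured tool call would come back from the LLM API as JSON conforming to the advertised parameter schema, and `get_weather` stands in for any registered tool.

```python
# Tool registry: descriptions and parameter schemas the model is shown,
# plus the actual function the host system executes.
TOOLS = {
    "get_weather": {
        "description": "Current weather for a city",
        "parameters": {"city": {"type": "string"}},
        "fn": lambda city: {"city": city, "temp_c": 18, "sky": "overcast"},
    }
}

def mock_model(prompt: str, tool_result=None) -> dict:
    # Placeholder for the model: first turn emits a structured tool call,
    # second turn uses the returned result to produce a final answer.
    if tool_result is None:
        return {"type": "tool_call", "name": "get_weather",
                "arguments": {"city": "Berlin"}}
    return {"type": "final",
            "text": f"It is {tool_result['temp_c']} °C and "
                    f"{tool_result['sky']} in {tool_result['city']}."}

def run_agent(prompt: str) -> str:
    response = mock_model(prompt)
    while response["type"] == "tool_call":       # host executes the call...
        tool = TOOLS[response["name"]]
        result = tool["fn"](**response["arguments"])
        response = mock_model(prompt, tool_result=result)  # ...and feeds back the result
    return response["text"]

print(run_agent("What's the weather in Berlin?"))
```

The `while` loop is the agentic part: nothing limits the model to a single call, so it can chain tools until the task is done.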

Reasoning & Reliability

Text-to-Video

Text-to-video is a category of generative AI technology in which models produce video sequences directly from natural language descriptions, without traditional filming, animation, or manual editing. Text-to-video models parse a text prompt and synthesize temporally consistent video frames that match the described scenes, camera motions, lighting conditions, and subjects — a process that compresses hours of conventional production into seconds. The field has advanced rapidly since OpenAI's Sora captivated the world with its physically plausible, minute-long cinematic clips in early 2024. Today's leading text-to-video systems include Google's Veo 3, ByteDance's Seedance 2.0, Runway ML's Gen-3 Alpha, Stability AI's Stable Video Diffusion, and Kling AI from Kuaishou. Most state-of-the-art text-to-video models combine large-scale video diffusion architectures with language encoders derived from models like CLIP or T5, enabling rich semantic grounding. Key capability dimensions include video duration, resolution, motion realism, prompt adherence, character consistency, and support for camera control commands such as pan, zoom, and dolly. Text-to-video is transforming marketing, entertainment, education, and e-commerce by enabling AI-native video content creation at a fraction of traditional production costs. Brands can now generate product demos, explainer videos, and social media content programmatically at scale. Context Studios integrates text-to-video generation into client content pipelines, using models like Veo 3, Seedance 2.0, and Sora for short-form social content, product visualization, and automated video production workflows.

Agentic Infrastructure

Tokens Per Second (TPS)

Tokens Per Second (TPS) is the primary throughput metric for evaluating AI language model inference performance. It measures how many tokens a model generates per second once generation has begun. TPS and Time-to-First-Token (TTFT) jointly determine the overall user experience. A token roughly corresponds to 0.75 words in English, or 0.5–0.6 words in many other languages. Typical TPS benchmarks: Groq's LPU achieves 500–800 TPS for 7B-parameter models; Anthropic's Claude API delivers 30–100 TPS depending on model tier; self-hosted open-source models on a single H100 GPU achieve 50–200 TPS depending on model size. TPS influences UX in two distinct ways. For short responses (up to ~500 tokens), TTFT dominates perceived responsiveness. For long outputs such as documents, code, and analyses, TPS becomes the determining factor: a 3,000-word document is roughly 4,000 tokens, which takes ~133 seconds to generate at 30 TPS but only ~20 seconds at 200 TPS. For voice AI systems, a minimum of 100 TPS is necessary for speech synthesis without perceptible gaps. Factors affecting TPS: model size (larger means lower TPS per request), quantization level (FP4 > FP8 > BF16 in throughput), batch size (larger batches increase aggregate TPS but lower per-request TPS), hardware, and KV-cache utilization patterns.
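The latency arithmetic above is easy to reproduce. A small Python helper, using the entry's own conversion of roughly 0.75 English words per token; TTFT is kept as a separate parameter, since it is the other half of perceived latency:

```python
WORDS_PER_TOKEN = 0.75  # rough English-language conversion from the entry

def generation_seconds(words: int, tps: float, ttft_s: float = 0.0) -> float:
    """Total wait = time to first token + tokens / throughput."""
    tokens = words / WORDS_PER_TOKEN
    return ttft_s + tokens / tps

for tps in (30, 100, 200):
    secs = generation_seconds(words=3_000, tps=tps)
    print(f"3,000 words at {tps:>3} TPS: ~{secs:,.0f} s")

# 3,000 words ≈ 4,000 tokens: ~133 s at 30 TPS, ~40 s at 100 TPS, ~20 s at 200 TPS.
```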
