Karpathy Autoresearch: A Prompt Replaces the Paper

Andrej Karpathy released Karpathy Autoresearch — a framework where AI agents autonomously run LLM training experiments. 110+ runs in 12 hours on 8×H100.

On March 7, 2026, Andrej Karpathy published Karpathy Autoresearch — a minimal GitHub repository that demonstrates what happens when autonomous agents run LLM training experiments unsupervised through the night. In practical terms, Andrej Karpathy turned a Markdown prompt into a lightweight orchestration layer for AI research, model optimization, and repeated LLM training loops. Karpathy Autoresearch produced 110+ commits in 12 hours across 8 NVIDIA H100 GPUs. No researcher pulled an all-nighter. The Karpathy Autoresearch agents did.

Karpathy Autoresearch is not a product. It's not a polished platform. It's three files and a proof of concept, and a signal that raises an uncomfortable question: if a Markdown prompt lets Andrej Karpathy replace a night of manual AI research with autonomous agents, what does that mean for how we build AI systems? Seen that way, "a prompt replaces the paper" is less a catchy headline than a literal description of how Karpathy is restructuring AI research around prompts, autonomous agents, and measurable LLM training loops.

What Is Karpathy Autoresearch?

Karpathy Autoresearch is an experimental framework in which AI agents autonomously conduct LLM training experiments. The human writes a goal in Markdown. The agent reads it, modifies the training code, runs a 5-minute training job, evaluates the results, and iterates. In other words, Andrej Karpathy is using autonomous agents as a thin research organization for LLM training rather than as a chat interface.

The Karpathy Autoresearch repository contains exactly three files:

  • prepare.py — data preparation, fixed, untouched by the agent
  • train.py — the actual training code; the agent may modify this freely
  • program.md — the control file; where the human writes research objectives

The optimization target is deliberately simple: val_bpb (validation bits per byte) — lower is better. No complex evaluation framework. No human judgment in the loop at runtime. The agent optimizes what it can measure.
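The loop this describes can be sketched as a minimal orchestrator. This is a hypothetical reconstruction, not the repository's actual code: the log format, `parse_val_bpb`, and the `propose_edit`/`run_training` callables are all assumptions made for illustration.

```python
def parse_val_bpb(log: str) -> float:
    """Extract the last reported val_bpb from a training log.
    Assumes lines like 'step 100 val_bpb: 1.234' (a hypothetical format)."""
    values = [float(line.split("val_bpb:")[1])
              for line in log.splitlines() if "val_bpb:" in line]
    return values[-1]

def research_step(propose_edit, run_training, best_bpb):
    """One loop iteration: the agent edits train.py, we run a capped
    training job, and keep the result only if val_bpb improved."""
    propose_edit()             # agent rewrites train.py guided by program.md
    log = run_training()       # e.g. a 5-minute training job; returns its log
    bpb = parse_val_bpb(log)
    improved = bpb < best_bpb  # lower bits per byte is better
    return (bpb if improved else best_bpb), improved
```

Everything the human cares about lives outside this loop, in the goal file; the loop itself only needs a single scalar to decide whether to keep an edit.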

Technically, Karpathy Autoresearch builds on nanochat, Andrej Karpathy's simplified single-GPU LLM training implementation. In the published Karpathy Autoresearch experiment, 8 agents ran simultaneously — 4 Claude instances (Anthropic) and 4 Codex instances (OpenAI) — in different organizational structures. At 5 minutes per experiment, that's roughly 12 experiments per hour, 100+ over a night.

Three Files, One Night, 110 Experiments

The numbers from this single overnight run are worth sitting with: 110+ Git commits, 12 hours, 8 GPUs running in parallel. Karpathy puts it plainly in the README:

"One day, frontier AI research used to be done by meat computers in between eating, sleeping... Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies."

— Andrej Karpathy, Autoresearch README

This isn't hyperbole. It's a status update.

The multi-agent structure was deliberate. Andrej Karpathy used Karpathy Autoresearch to test different organizational configurations for autonomous agents — some agents ran in parallel, others in hierarchies. The parallelization dividend is substantial. Instead of one researcher occupying 12 hours, 8 agents work simultaneously, share results, and keep iterating. The wall-clock time for covering a search space collapses.
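The fan-out pattern itself is simple enough to sketch with Python's standard library. The agent is stubbed with a toy scoring function here, so the config tuples and the formula are illustrative only, not anything from the experiment:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(config):
    """Stand-in for one autonomous agent working a slice of the search space.
    A real agent would edit train.py and launch a GPU job; this just scores
    the config with a toy val_bpb proxy (lower is better)."""
    lr, width = config
    return config, 1.5 - 0.01 * width / 64 - lr

def explore(configs, workers=8):
    """Fan configs out across parallel workers and keep the best result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_agent, configs))
    return min(results, key=lambda r: r[1])

best_config, best_score = explore([(0.001, 64), (0.01, 128), (0.1, 256)])
```

The design point is the same as in the overnight run: each worker is independent, so wall-clock time for covering the search space shrinks roughly linearly with the number of agents.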

What Karpathy Autoresearch does not claim: delivering breakthroughs. The agents optimize within a well-scoped search space. They find local improvements. But they don't invent the search space.

What the Agents Get Right — and Where They Fall Short

Glen Rhodes published a detailed analysis of Karpathy Autoresearch that nails the core finding: agents are "very good at implementing any given well-scoped idea but they don't creatively generate them."

The experiment confirms two things simultaneously about autonomous agents and LLM training automation:

What works: Parallelization. When the human defines the right search space, Karpathy Autoresearch agents can explore it at a speed and endurance no human team matches. 12 experiments per hour, overnight, without coffee or context switches.

What doesn't work: Scientific judgment. One Karpathy Autoresearch agent "discovered" that bigger networks reduce loss — a trivially confounded result that Karpathy had to manually correct. The agent was technically correct but intellectually empty: it had no idea why the result was worthless. It couldn't distinguish a real finding from a confounder.

The bottleneck lives upstream: which LLM training experiments are worth running? That question remains human. Karpathy Autoresearch makes this explicit through its architecture — program.md is where human intelligence deploys. Everything downstream, the agents own.

"You're programming an organization. The source code is the collection of prompts, skills, tools, and processes."

— Andrej Karpathy
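In that spirit, the control file for a run like this might look like the following (a hypothetical sketch, not the file from the repository):

```markdown
# Goal
Minimize val_bpb on the held-out validation set. Budget: 5 minutes per run.

# You may
- Modify train.py freely: architecture, optimizer, learning-rate schedule.

# You may not
- Touch prepare.py or change the evaluation metric.

# Report
- Commit after every run, with the resulting val_bpb in the commit message.
```

Note how much of the file is about boundaries rather than instructions: the "may not" section is where the human's scientific judgment is encoded.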

How We Live This Paradigm at Context Studios

When Karpathy Autoresearch landed, the recognition was immediate: Andrej Karpathy is running the same human-plus-autonomous-agents architecture we use daily, just in a different domain.

We operate 16+ autonomous cron jobs every day, a pattern that Claude Code /loop now makes accessible to any developer. The results mirror what autonomous AI agents achieved at Polsia: compounding overnight output. Each job is essentially a program.md: Mike defined once what the agent should pursue, what quality standards apply, and what constraints hold. The agent executes, iterates, and logs, night after night, without supervision.

Concrete examples from our daily operations:

  • Content Pipeline: An agent researches relevant AI topics daily, writes drafts in four languages (DE, EN, FR, IT), generates hero images, publishes blog posts, and distributes to LinkedIn, X, and Facebook — all without human intervention in the process itself.
  • SEO Healer: An agent scans all published posts for missing meta descriptions, empty keyword arrays, and broken translation links. It repairs what it can, escalates what it can't.
  • Social Engagement: An agent comments daily on relevant LinkedIn posts in our domain — not as spam, but as curated perspective aligned with our positioning.

What the Karpathy Autoresearch experiment measures with val_bpb, we measure with traffic, engagement rate, and publish quality score. What Karpathy writes in program.md, we write in cron task prompts. The architecture is identical.

And the core Karpathy Autoresearch finding holds for us too: the agents execute brilliantly. But the decision about what is worth executing — which topics matter, which audiences to prioritize, which quality standards to enforce — stays human. Every day. Without exception.

This isn't a limitation to work around. It's the correct division of labor.

If you're thinking about building your own AI agent systems, the Karpathy Autoresearch pattern is a useful mental model — even if you're not training LLMs. The architecture of goal-setting (human) + execution (agent) + metric optimization (agent) applies to almost any knowledge work domain.

The Real Shift: What Does "Programming" Mean Now?

Karpathy Autoresearch is also a comment on how the meaning of "programming" is shifting. Traditionally, programming meant writing code that instructs a computer what to do. In the Karpathy Autoresearch model, programming means writing prompts that instruct an organization how to research.

That's not a metaphor. The "codebase" of Karpathy Autoresearch is program.md. The configuration file is a natural-language Markdown document. This is a real shift in abstraction layer.

For developers and agencies, this has concrete implications. Anyone building parallel AI agent systems today needs to understand how to write organizational prompts — not just how to build the agents technically. The skill of writing a great program.md is as important as the technical implementation of the agents themselves.

We recognized this early at Context Studios. Our approach to AI agent development starts not with technical architecture, but with the question: What should this agent know? What should it be able to do? And critically: What should it not decide on its own?

Getting that third question right is what separates useful automation from expensive noise. Our analysis of how AI agents disrupted SaaS models covers the practical techniques for writing agent instructions that produce reliable, scalable outputs.

What Karpathy Autoresearch Means for AI Development

Karpathy Autoresearch surfaces three insights that matter for anyone working with AI systems, autonomous agents, and LLM training pipelines:

1. The parallelization case for agents is real. 8 agents, 12 hours, 110 Karpathy Autoresearch experiments — this isn't hype. It's demonstrated throughput. What used to take a researcher a week now takes a night. That changes the economics of R&D fundamentally — much like how multi-agent PR review is changing code quality assurance — not just for AI research but for any domain where you can define a clear optimization target.

2. Prompt quality equals output quality. A weak program.md produces confounded results nobody can use. A strong one produces actionable findings. Prompt engineering is no longer a soft skill; it's the engineering discipline of the decade.

3. The researcher/engineer boundary is dissolving. Karpathy Autoresearch is simultaneously a research framework and a production system. Running Karpathy Autoresearch requires being a scientist, an engineer, and an organizational designer. That convergence is not reversing.

For organizations looking to integrate AI agents into development pipelines, Karpathy Autoresearch is an excellent mental model. Not as a blueprint to copy, but as a reference point: this is what human-machine collaboration in knowledge work looks like when it's working. Our GEO and AEO optimization guide applies the same loop-based approach to search optimization.

FAQ

What exactly is Karpathy Autoresearch?

Karpathy Autoresearch is an open-source framework by Andrej Karpathy in which autonomous agents run LLM training experiments. Humans define goals in a Markdown file (program.md); agents modify the training code, run 5-minute experiments, and iterate. In one test, 110+ experiments ran in 12 hours on 8 H100 GPUs.

How many experiments per hour can Karpathy Autoresearch run?

With a fixed 5-minute budget per experiment, each agent completes roughly 12 experiments per hour. Over a night (8-12 hours), that's 100+ autonomous training runs, far more than any human research team could manage in the same window.

Which AI models were used in the Karpathy Autoresearch experiment?

The published experiment used 8 agents: 4 Claude instances (Anthropic) and 4 Codex instances (OpenAI), in various organizational structures, some parallel, some hierarchical.

Can AI agents genuinely do independent research?

Karpathy Autoresearch shows that agents are excellent executors within clearly defined search spaces, but not independent scientists. One agent "discovered" that larger networks perform better, a confounded result Karpathy had to correct manually. The question of which experiments are worth running stays human.

What is val_bpb and why is it the metric?

val_bpb stands for "validation bits per byte", a measure of how well the language model compresses the validation dataset. Lower is better. Karpathy Autoresearch uses it because it's automatically computable and requires no human judgment, which makes it suitable for an autonomous optimization loop.
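As a sketch of the arithmetic, assuming the loss is summed cross-entropy in nats (the usual convention): divide by ln 2 to convert nats to bits, then normalize by the byte count of the validation set.

```python
import math

def bits_per_byte(total_loss_nats: float, num_bytes: int) -> float:
    """Summed cross-entropy over the validation set (in nats),
    converted to bits and normalized by the number of bytes."""
    return total_loss_nats / math.log(2) / num_bytes

# A model spending exactly ln(2) nats per byte scores 1.0 bits per byte.
```

Because this is a pure function of quantities the training script already computes, an agent can evaluate it after every run with no human in the loop.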

How does Karpathy Autoresearch differ from standard AutoML?

Standard AutoML searches over predefined hyperparameter grids. Karpathy Autoresearch agents can modify the training code itself: trying new architectures, changing data processing logic, experimenting with entirely new approaches. That's a qualitatively different degree of freedom.


Sources: Karpathy Autoresearch GitHub | Glen Rhodes Analysis
