GPT Sherpa

The rubric is the new prompt

Wed, 10 Jun 2026 00:00:00 GMT

The most important change in the AI tooling landscape in the first half of 2026 is not a model. It is a shift in what the tools accept as input: not just a task, but a definition of done — and a harness that iterates against it without you.

Three examples, all shipped between April and June 2026:

Anthropic Managed Agents — Outcomes. You send a define_outcome event with a task description and a rubric. A separate grader model (its own context, not the agent’s) scores each iteration against the rubric and feeds per-criterion gaps back. The loop runs iterate → grade → revise until satisfied, capped at a max iteration count.
Claude Code — /goal. A fresh evaluator model checks a completion condition after every turn and keeps the agent working — potentially for days — until the condition holds.
Claude Fable 5 (June 9, 2026 — the first generally available Mythos-class model, a tier above Opus). Anthropic’s own migration guidance is explicit: give the full task specification up front in a single well-specified turn and run at high effort. The model’s headline capability is long-horizon autonomy against a stated goal.
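
The iterate → grade → revise control flow that these tools share can be sketched in a few lines of Python. This is an illustration of the loop shape only, not Anthropic's API: run_outcome_loop, make_toy_agent, toy_grader, and max_iterations are hypothetical names, and the toy agent and grader stand in for real model calls.

```python
from typing import Callable

# Hypothetical signatures, for illustration only.
Agent = Callable[[str, list[str]], str]          # (task, open gaps) -> new draft
Grader = Callable[[str, list[str]], list[str]]   # (draft, rubric) -> unmet criteria


def run_outcome_loop(task: str, rubric: list[str], agent: Agent, grader: Grader,
                     max_iterations: int = 5) -> tuple[str, list[str], int]:
    """Iterate -> grade -> revise until no gaps remain or the cap is hit.

    Returns (last draft, remaining gaps, iterations used).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    gaps: list[str] = []
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = agent(task, gaps)       # first pass, then revisions against gaps
        gaps = grader(draft, rubric)    # grading is a separate step from drafting
        if not gaps:                    # every criterion met
            return draft, gaps, iteration
    return draft, gaps, max_iterations  # cap reached with gaps still open


def make_toy_agent() -> Agent:
    """Toy agent: closes at most one reported gap per iteration."""
    closed: list[str] = []

    def agent(task: str, gaps: list[str]) -> str:
        closed.extend(gaps[:1])
        return task + " | " + ", ".join(closed)

    return agent


def toy_grader(draft: str, rubric: list[str]) -> list[str]:
    """Toy grader: a criterion counts as met if its text appears in the draft."""
    return [criterion for criterion in rubric if criterion not in draft]


rubric = ["has header", "has price column"]
draft, gaps, used = run_outcome_loop("write csv", rubric, make_toy_agent(), toy_grader)
print(used, gaps)
```

The cap matters as much as the loop: with max_iterations=2 the same toy run stops with one criterion still open, and the caller has to decide what to do with an unfinished result.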

Grok Build runs the same play from the other side: its plan mode produces a plan you approve, comment on, or rewrite before execution, and post-approval deviations surface as reviewable diffs. Different mechanism, same contract — the human’s contribution moves from steering turns to specifying success.

Why this favors the meta-thinker

This site has argued that the meta-thinker’s skill — describe the problem well enough that someone good could solve it — becomes the entire job. That was a working style. As of mid-2026 it is a literal interface: the spec is no longer something you hold in your head while you steer; it is the artifact the harness consumes.

Which means rubric quality is now load-bearing. The graders are competent but literal. They verify what you stated, nothing else:

Checkable beats aspirational. “The output CSV has a numeric price column for every SKU” is gradeable. “The data looks good” is not — a vague rubric produces a confident loop converging on the wrong thing, with the grader cheerfully reporting satisfied.
Independent criteria beat bundled ones. Graders score per criterion. Five separate checkable statements give the agent five separate gaps to close; one paragraph gives it a vibe.
The rubric can’t see what you didn’t mount. It grades the artifact in the session, not your live schema or last quarter’s column semantics. The checks that need the real world stay outside the loop.
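
As a rough sketch of what "checkable" and "independent" mean in practice, the rubric below is written as three separate predicates over a CSV. The column names (sku, price), the expected SKU list, and the function names are assumptions made for the example; a real grader is a model reading prose criteria, not this code.

```python
import csv
import io


def _parses_as_float(text):
    """True if float() accepts the value (note: float() also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def check_price_numeric(rows):
    return all(_parses_as_float(row.get("price")) for row in rows)


def check_every_sku_present(rows, expected_skus):
    return {row.get("sku") for row in rows} >= set(expected_skus)


def check_no_duplicate_skus(rows):
    skus = [row.get("sku") for row in rows]
    return len(skus) == len(set(skus))


def grade(csv_text, expected_skus):
    """Score each criterion on its own, so every failure is a separate gap."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "price is numeric on every row": check_price_numeric(rows),
        "every expected SKU appears": check_every_sku_present(rows, expected_skus),
        "no SKU appears twice": check_no_duplicate_skus(rows),
    }


sample = "sku,price\nA1,9.99\nB2,n/a\n"
for criterion, passed in grade(sample, ["A1", "B2", "C3"]).items():
    print("PASS" if passed else "FAIL", criterion)
```

Each failing line names one gap the agent can close. "The data looks good" has no equivalent function, which is the point.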

What stays human

Two things, and they are the same two things as before.

First, the rubric itself. Writing one forces the question every verification habit on this site comes back to: what must be true for this to be safe? If you can’t write the rubric, you weren’t ready to delegate the task — to a model or to anyone.

Second, the acceptance. A grader’s satisfied is a claim like any other model output. Before the change lands in the permanent record — the DAG, the registry, the paper — you run the actual command, read the actual table, check the actual decision log. The loop got cheaper and longer; the last step didn’t move.

Put the rubric in the registry next to the decision it produced. Six months from now, “what did we ask the agent to achieve, exactly?” will be the most useful provenance question you can answer.
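
One possible shape for such a registry record, as a sketch: the field names, the JSON-lines layout, and the example-001 identifier are all hypothetical, and llm_proposed is used only as an example provenance tag. The one design choice worth copying is that the grader's verdict alone never marks the entry accepted.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RegistryEntry:
    """Hypothetical registry record: the decision plus the rubric that produced it."""
    decision_id: str
    decision: str
    rubric: list[str]
    provenance: str                      # e.g. "llm_proposed"
    grader_verdict: str                  # what the grader claimed
    human_checks: list[str] = field(default_factory=list)  # what a person actually ran

    def accepted(self) -> bool:
        # A grader verdict is a claim; acceptance also needs recorded human checks.
        return self.grader_verdict == "satisfied" and bool(self.human_checks)


def to_registry_line(entry: RegistryEntry) -> str:
    """Serialize one entry as a single JSON line with a derived 'accepted' flag."""
    record = asdict(entry)
    record["accepted"] = entry.accepted()
    return json.dumps(record, sort_keys=True)


entry = RegistryEntry(
    decision_id="example-001",
    decision="price column stored as numeric",
    rubric=["price is numeric on every row", "every expected SKU appears"],
    provenance="llm_proposed",
    grader_verdict="satisfied",
)
entry.human_checks.append("read the actual table")
print(to_registry_line(entry))
```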

The hammer and the lens

Thu, 23 Apr 2026 00:00:00 GMT

There are two ways people approach a hard problem. Some reach for code — they see the problem as a target for craft, and the question is which algorithm, which library, which pattern. Call that the hammer. Others reach for framing — they see the problem as an object to decompose, and the question is what is actually being asked, what would a good answer look like, where does judgment enter. Call that the lens.

Neither is better. Both solve real problems. But they’re different kinds of leverage, and AI amplifies them differently.

The coder gets a faster hammer

A skilled coder using AI gets a tighter loop: the model drafts, the coder reviews, the coder ships. Impressive but continuous — the same work they were doing before, just more of it per hour. The skills that made them valuable before (taste, language fluency, debugging intuition) are the skills that make the AI partnership valuable now.

The meta-thinker gets a different job

A skilled meta-thinker using AI gets something stranger. The AI does the typing. The human stays at the level where framing, noticing, and verification happen — which is exactly the level where the AI is still unreliable. The pairing isn’t a productivity boost; it’s a role swap. The meta-thinker’s old job (describe the problem well enough that someone good could solve it) becomes the entire job. The “someone good” on the other end is now a tireless collaborator that types faster than you read.

It shows up in small moves

Asked to write a Claude Code skill, the coder opens the docs and studies the schema. The meta-thinker says “generate a skill that does X, consistent with my CLAUDE.md” and reviews the output.

Asked to migrate 200 records, the coder writes a careful migration script. The meta-thinker writes a spec of the migration — invariants, edge cases, what “success” means — and lets the tool produce the script.

Both finish the job. What differs is where the human’s attention was spent, and what the human has learned by the time it’s over. The coder got stronger at coding. The meta-thinker got stronger at specifying.
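
A migration spec of the kind described above might look like this when written down as executable invariants. It is a minimal sketch: the record shape (id, year_of_birth) and the three invariants are assumptions chosen for the example, not a claim about any particular dataset.

```python
def check_migration(source, migrated):
    """Return the list of violated invariants; an empty list means success as specified."""
    violations = []

    # Invariant 1: no records gained or lost.
    if len(migrated) != len(source):
        violations.append("record count changed")

    # Invariant 2: ids are unique and are the same set as before.
    new_ids = [record["id"] for record in migrated]
    if len(set(new_ids)) != len(new_ids):
        violations.append("duplicate ids after migration")
    if set(new_ids) != {record["id"] for record in source}:
        violations.append("ids lost or invented")

    # Edge case: a missing value must stay missing, not turn into 0 or a default.
    source_by_id = {record["id"]: record for record in source}
    for record in migrated:
        old = source_by_id.get(record["id"])
        if (old is not None and old.get("year_of_birth") is None
                and record.get("year_of_birth") is not None):
            violations.append(f"record {record['id']}: missing value was filled in")

    return violations


# 200 toy records; every even id has no recorded birth year.
source = [{"id": i, "year_of_birth": 1980 if i % 2 else None} for i in range(200)]
faithful = [dict(record) for record in source]
print(check_migration(source, faithful))
```

Whichever tool writes the migration script, this is the part the human keeps: the script is disposable, the invariants are the record of what success meant.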

Who this site is written for

If your hammer is code, plenty of sites will sharpen it — prompt libraries, syntax guides, “my favorite tricks.” That’s not what’s here.

These posts are written for the meta-thinker using AI. The advice posts are about habits that make framing hold up: verify before trusting, one idea per prompt, keep the spec small enough to fit in your head. The posts practice the move — not “write this function” but “decide what should be written, then have it written.”

In 2026 the same distinction applies to agentic tools (computer use, dynamic workflows, Artifacts that persist and call models). The coder gets a faster pair of hands on the keyboard and screen. The meta-thinker gets a collaborator that can traverse a targets DAG, read schema contracts, propose a tar_target, and even click through the verification steps — provided the human still owns the framing (“what must be true for this change to be safe?”), the provenance tag, and the final decision recorded in the registry. The guides on this site practice directing and verifying exactly that kind of work.

The mid-2026 models sharpen the distinction rather than blur it. Claude Fable 5 (June 2026) is explicitly tuned to take a full task specification up front and run autonomously for hours against it; Claude Code’s /goal keeps an agent working until a stated completion condition holds; managed agents accept a gradeable rubric and iterate against it. Every one of these rewards exactly the meta-thinker’s skill — the quality of the upfront spec and the checkability of the success criteria — and punishes the habit of steering turn by turn. The spec is the input now.

Grok Build (xAI’s terminal coding agent, launched May 2026) is particularly well-suited to the lens side of this equation. With zero configuration, it reads the same project rule formats (CLAUDE.md, .claude/rules, skills, hooks, MCP configs) that power rigorous setups like ClinicalDataProject, alongside its native AGENTS.md convention. It spawns typed subagents in parallel (with git worktree isolation), maintains todo lists for multi-step provenance work, and orchestrates verification across tools (web search, terminal for targets tar_* commands and schema checks, file operations for _targets.R and data dictionaries). You can use its plan mode for safe read-only exploration of a DAG before any changes, and its subagent system for parallel reviews: one agent checks assumptions, another validates against the live registry, another searches for recent agentic-pattern gotchas. This turns the meta-thinker into a conductor of specialized AI collaborators rather than a solo reviewer of one model’s output.

If that’s your leverage, you’re in the right place.

Verify before trusting

Thu, 23 Apr 2026 00:00:00 GMT

The single most valuable habit when working with a language model is this: never act on a factual claim from the model without checking it. Not “usually.” Not “for important claims.” Every one.

That sounds exhausting. In practice it isn’t, because most of what you get from a model in real work isn’t a factual claim — it’s a draft, a restructuring, a pattern, a plausible next step. Those things you can evaluate on the page. The dangerous output is the confident factual statement: a column name, a function signature, a citation, a package version, a regulation, a date. That’s what costs you an hour of debugging or a retracted claim.

What this looks like

Model tells you a function exists → grep for it before you call it.
Model tells you a column is named year_of_birth → read the actual schema before writing the join.
Model cites a paper → open the paper, or at least confirm the DOI resolves.
Model gives you a shell command that “should work” → read the flags against --help before running it with sudo.
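
Two of these checks are cheap enough to script. The sketch below uses only the Python standard library; the patients table and its columns are made up for the example, and the helper names are hypothetical. It assumes an SQLite build recent enough to expose pragma_table_info as a table-valued function.

```python
import importlib
import sqlite3


def function_exists(module_name, function_name):
    """Check that a module really exposes a callable before writing code that calls it."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, function_name, None))


def column_exists(conn, table, column):
    """Read the live schema instead of trusting a column name from a model."""
    rows = conn.execute("SELECT name FROM pragma_table_info(?)", (table,)).fetchall()
    return column in {row[0] for row in rows}


# Toy schema standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, birth_year INTEGER)")

print(column_exists(conn, "patients", "year_of_birth"))  # the name the model claimed
print(column_exists(conn, "patients", "birth_year"))     # the name in the schema
print(function_exists("json", "dumps"))
```

The checks take seconds and run against the real thing, which is the whole habit: the model supplies the candidate name, the schema supplies the answer.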

Why it works

Language models are extremely good at generating text that reads like it came from someone who knew. They are not, by construction, reliable sources about the world. The verification habit closes that gap cheaply: you get the speed of the draft and the reliability of the reference.

The people who get burned by AI output are not the people who use it a lot. They are the people who use it a lot and skip this step.

When the tool can act (computer use, agents, 2026+)

Anthropic’s agentic surfaces have matured fast in 2026: computer use is in the Claude Code CLI and Desktop app (research preview), Claude in Chrome is in beta for all paid plans, Opus 4.8 (May 2026) shipped alongside dynamic workflows that run fleets of parallel subagents, and Fable 5 (June 2026, the first generally available Mythos-class model) is explicitly built to work autonomously for longer than any prior Claude. The model can read your screen, click, type, run commands, edit files, and drive multi-step workflows across apps and browsers. This is powerful leverage for pipeline work — inspecting a _targets.R, proposing a new tar_target for a schema change, running a dry validation, opening the resulting artifact.

The verification rule becomes stricter and more operational:

The model proposes a change to a target or a join → you still run tar_visnetwork() or tar_manifest() and read the actual column definitions before accepting.
The agent “reads the schema” via screen or tool call → you still cross-check against the project’s data-design-decisions registry (or equivalent) and the generated API reference for any shared extract/transform packages.
The run produces new outputs or figures → you still open them and apply the same gates you would for human-written code (rendering integrity, event counts, assumption status).

Real projects that already live this way (targets-based DAGs with explicit data-design registries, 100% documented shared package APIs, provenance-tagged decisions, and pre-push validation) give you a ready-made contract the agent can help you maintain — but only if you keep the human in the loop at the “does this match the recorded rule and the live schema?” step.

The 2026 agentic tools make the draft cheaper and the verification surface larger. The habit that scales is the same one that worked for static prompts: treat every factual or structural claim as something that must be checked against the actual DAG, the actual table, and the actual decision log before it becomes part of the permanent record.

When the harness verifies for you (it doesn’t, quite)

The newest twist is that the tools now offer machine-side verification loops. Anthropic’s Managed Agents accept an Outcome — you state what “done” looks like as a gradeable rubric, and a separate grader model iterates the agent against it (user.define_outcome, iterate → grade → revise). Claude Code’s /goal does the same in the terminal: a fresh evaluator model checks a completion condition after every turn and keeps the agent working until it holds.

These are genuinely useful — but read what they verify. The grader checks the artifact against your stated criteria. It cannot check what you didn’t state, and it cannot see the live schema, the production table, or last quarter’s meaning of a column. Two consequences:

Writing the rubric is now part of the verification habit. “CSV has a numeric price column per SKU” is checkable; “the data looks right” is not. A vague rubric produces a confident loop converging on the wrong thing.
The final acceptance step is still yours. The grader saying satisfied is a claim like any other — check it against the actual run, the actual table, the actual decision log before it lands in the permanent record.

Grok Build (xAI) as a complementary lens tool

Grok Build — xAI’s terminal coding agent, launched May 2026 — brings a different but highly compatible set of strengths for exactly this kind of work. Claude Code compatibility is official and zero-config: it automatically reads the same CLAUDE.md, CLAUDE.local.md, .claude/rules/, skills, hooks, and MCP configurations that ClinicalDataProject and other rigorous setups use (its own native convention is the AGENTS.md family plus a .grok/ directory). This means SCDC-style patterns — typed assumptions, provenance tags like llm_proposed, validation gates, data-design-decision registries, code-reuse review against generated API docs — transfer directly without translation.

Grok Build’s native advantages for meta-thinkers and pipeline work:

Subagents and decomposition: spawn_subagent launches typed subagents (general-purpose, explore, plan) in parallel, with git worktree isolation. This suits parallel verification: one subagent audits the _targets.R graph, another checks live table schemas via terminal/file tools, another searches for recent gotchas with the agentic pattern in use.
Structured tooling + todo tracking: Built-in todo_write maintains a live task list across complex workflows (e.g., “1. Read current _targets.R 2. Propose schema change 3. Update DD- registry with provenance 4. Run tar dry-run + human verification”). Tools for web search and fetch, terminal execution (run tar_visnetwork, grep schemas), file read/edit, and MCP discovery give it full agentic reach.
Plan mode: enter_plan_mode gives read-only exploration of a codebase or DAG before any edits — write tools are blocked, and you approve or rewrite the plan before execution.
Cross-vendor review: Because both tools read the same project rules, a Grok Build review of a Claude proposal (or vice versa) happens under the same contract — a reviewer that shares no weights or session state with the proposer.

In practice: let one agent hold the session and do the hands-on work; use the other’s subagents and todo system for framing, decomposition, and independent review. Always close the loop with the same human verification you would apply to either model alone: run the actual targets command, inspect the live schema, update the registry with a clear provenance tag, and only then accept the change.

The meta-thinker advantage is amplified: the agents help you stay at the level of “what must be true for this pipeline change to be safe and traceable?” while they do the traversal and typing. The verification habit remains non-negotiable — but now you have better tools for distributing and auditing the work across models.

One idea per prompt

Wed, 22 Apr 2026 00:00:00 GMT

A common mistake when people get comfortable with a model is to stuff more and more into a single prompt: the task, the constraints, the edge cases, the output format, three examples, a style guide, and a reminder not to hallucinate. The prompt gets longer; the output gets worse.

The fix is almost always the same: ask for less in one turn.

Why long prompts fail

Models attend to their whole context, but not evenly. Long prompts bury the actual ask under scaffolding, and the model quietly optimizes for whichever part of the instruction was most salient — often not the part you cared about.

A tighter prompt gives the model a single target. You can add the next constraint on the next turn, after you’ve seen that the first one stuck.

The pattern that works

State the task in one sentence.
Let the model respond.
Correct or extend: “good, now also handle X,” “that’s wrong because Y, try again.”

This turns prompting into a conversation you can steer, instead of a specification you have to get right on the first try. It’s also usually faster in wall-clock time, because the first response comes back sooner and you catch misunderstandings earlier.

When to break the rule

Batch work — “process these 200 records the same way” — does want the full spec up front, because you are not going to iterate. But for any task where you are the reader of the output, start small.