Muscle Memory For CUAs

Two Systems

A single model trying to reason about what to do and execute the motor actions to do it is fighting itself. Reasoning needs big context windows, world knowledge, and flexible planning. Motor execution needs speed, spatial precision, and tight feedback loops. These are fundamentally different computational profiles.

The human brain figured this out. System 2 is the prefrontal cortex, slow and deliberate, for novel situations. System 1 is the cerebellum, fast and automatic, for learned motor patterns. You do not reason through each finger movement when you type your password. Your hands just know.

The harness is this split made concrete. Claude Sonnet is System 2. It looks at a screenshot, understands the page, decides what needs to happen. When it encounters something repetitive, a popup it has dismissed a hundred times, it delegates to a CNN skill. The CNN is System 1. A tiny ConvNeXt model that takes a screenshot, produces a spatial heatmap, and clicks. No reasoning, no context window, no API call. Pixels in, coordinates out, 5 milliseconds.

Hypertuned Small CNNs

Each CNN skill is an independently trained checkpoint. ConvNeXt-Tiny backbone pretrained on ImageNet, a Feature Pyramid Network at 1/8 stride, frozen CLIP text encoder for subgoal conditioning, and a diffusion-based action head outputting 8 action type logits and 2 normalized coordinates.
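The spatial-heatmap interface is the important part: pixels in, a coarse grid of scores out, one cell argmaxed into a click. A minimal numpy sketch of that interface, where block-mean pooling stands in for the learned FPN head (the real scores come from the trained model, not from pixel brightness):

```python
import numpy as np

def heatmap_to_click(screenshot: np.ndarray, stride: int = 8):
    """Stand-in for the spatial action head: score the screen at 1/8
    resolution, argmax one cell, return normalized (x, y) coordinates."""
    h, w = screenshot.shape[:2]
    gray = screenshot.mean(axis=2) if screenshot.ndim == 3 else screenshot
    # pool to an (h/stride, w/stride) heatmap by block averaging
    hm = gray[: h - h % stride, : w - w % stride].reshape(
        h // stride, stride, w // stride, stride
    ).mean(axis=(1, 3))
    iy, ix = np.unravel_index(hm.argmax(), hm.shape)
    # centre of the winning cell, normalized to [0, 1]
    return (ix + 0.5) * stride / w, (iy + 0.5) * stride / h

# a 1280x720 "screenshot" with one bright button
screen = np.zeros((720, 1280, 3))
screen[300:320, 600:640] = 255.0
x, y = heatmap_to_click(screen)
```

The 1/8 stride means a 1280 by 720 screenshot collapses to a 160 by 90 grid, which is why the forward pass stays in the millisecond range.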

The key word is small. 120MB per checkpoint. Loads in under a second. 5ms forward pass on an M-series Mac. Each one is trained exclusively for a single repetitive task with a two-phase pipeline.

Phase 1 is behavioral cloning. The model watches Claude handle the task correctly 200+ times and learns to imitate. That gets clicks within about 300 pixels of the target: the right region, but not reliably on it. 30 minutes on an A100. Phase 2 is REINFORCE. Starting from the SFT checkpoint, the model runs live in a browser with DOM-based rewards. Curriculum learning ramps difficulty. Sharpens the heatmap from “right region” to “exact pixel.” 2 to 3 hours on an A100.
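The SFT-then-RL handoff is easy to see in one dimension. A toy REINFORCE sketch, assuming a Gaussian click policy: the warm start plays the role of the SFT checkpoint, the shrinking reward band plays the role of the curriculum, and the 1/σ² factor of the true gradient is folded into the step size:

```python
import numpy as np

rng = np.random.default_rng(0)

target = 0.62            # button centre, normalized x
mu, sigma = 0.40, 0.10   # "SFT checkpoint": right region, not exact

for step in range(5000):
    tol = max(0.02, 0.2 * (1 - step / 5000))        # curriculum: shrink reward band
    x = rng.normal(mu, sigma)                       # sample a click
    reward = 1.0 if abs(x - target) < tol else 0.0  # DOM-style binary reward
    mu += 0.1 * reward * (x - mu)                   # REINFORCE step on the mean
    sigma = max(0.01, sigma * 0.9995)               # anneal exploration

# mu has sharpened from "right region" (0.40) to near the exact target
```

Without the warm start, early rewards are too rare for the gradient to get traction; without the RL loop, mu never moves off the imitation estimate. The toy shows the same dependency the text describes.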

Neither phase alone works. SFT without RL clicks roughly the right area but misses small targets. RL without SFT has no starting policy and the action space is too large for exploration. Together they reliably click a 12 pixel wide green button on a 1280 by 720 screen. Same SFT-then-RL pattern that robotics labs use for manipulation policies. Behavioral cloning gets you “close enough” for RL to refine.

The Skill Router

The SkillRouter sits between Claude and the CNN checkpoints. Claude calls a tool, dismiss_popups or solve_radio_modal, and the router lazy-loads the checkpoint, preprocesses the screenshot, tokenizes the CLIP subgoal, and runs inference.
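A minimal sketch of that dispatch layer. The `load_checkpoint` callable is a hypothetical stand-in for the real model loading and preprocessing:

```python
from typing import Any, Callable, Dict

class SkillRouter:
    """Maps tool names to CNN checkpoints, loading each lazily on first use."""

    def __init__(self, registry: Dict[str, str],
                 load_checkpoint: Callable[[str], Callable[..., Any]]):
        self.registry = registry                  # skill name -> checkpoint path
        self.load_checkpoint = load_checkpoint
        self._cache: Dict[str, Callable[..., Any]] = {}

    def run(self, skill: str, screenshot: Any, subgoal: str) -> Any:
        if skill not in self.registry:
            raise KeyError(f"unknown skill: {skill}")
        if skill not in self._cache:              # lazy load (~1 s, once)
            self._cache[skill] = self.load_checkpoint(self.registry[skill])
        return self._cache[skill](screenshot, subgoal)  # ~5 ms forward pass

# stub loader: counts loads, returns a fake model
loads = []
def fake_load(path):
    loads.append(path)
    return lambda shot, goal: ("click", 0.5, 0.5)

router = SkillRouter({"dismiss_popups": "ckpt/popups.pt"}, fake_load)
out1 = router.run("dismiss_popups", None, "close the popup")
out2 = router.run("dismiss_popups", None, "close the popup")
```

Lazy loading matters because checkpoints are per-task: a benchmark run touches only the skills it needs, and each 120MB load happens at most once.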

The execution loop: take screenshot, run CNN, execute action, take another screenshot, compute SSIM. If similarity is above 0.99, the action had no visible effect. Three consecutive no-effect actions and it stops.
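The loop above, sketched with `screenshot`, `execute`, `model`, and `ssim` as stand-ins for the real harness functions:

```python
def run_skill(model, screenshot, execute, ssim, max_steps=20):
    """Run CNN actions until three consecutive no-effect steps.
    SSIM above 0.99 between consecutive frames means nothing changed."""
    before = screenshot()
    no_effect = 0
    for _ in range(max_steps):
        execute(model(before))
        after = screenshot()
        no_effect = no_effect + 1 if ssim(before, after) > 0.99 else 0
        if no_effect == 3:
            break
        before = after
    return after

# stub harness: two clicks change the screen, then it goes stable
frames = iter([0, 1, 2, 2, 2, 2, 2, 2])
actions = []
final = run_skill(
    model=lambda s: "click",
    screenshot=lambda: next(frames),
    execute=actions.append,
    ssim=lambda a, b: 1.0 if a == b else 0.0,
)
```

The no-effect counter is the whole safety story: a skill that stops changing the screen gives control back to Claude instead of clicking forever.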

This runs at about 2.7 Hz in the legacy path, bottlenecked by Playwright screenshots at 160ms and settle waits at 150ms. The forward pass at 5ms is free. A motor thread using CDP screencast at 30 fps should push it to 11 or 12 Hz. Still in progress.

Cost difference: Claude handling popups manually costs about $0.02 per popup across 3 API turns. With the CNN it is one API turn plus one free inference, roughly $0.007. Across 30+ popups per benchmark run, that is $0.40 saved on popups alone. Modals save more at 5 to 8 API turns each.
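The arithmetic behind the $0.40 figure, using the per-popup costs above:

```python
popups = 30
manual = 0.02    # ~3 Claude API turns per popup
skilled = 0.007  # 1 API turn plus a free local CNN inference
savings = popups * (manual - skilled)
print(f"${savings:.2f} saved per run on popups")  # roughly $0.40
```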

In practice, Claude sees a challenge with two popups. Calls dismiss_popups. The CNN clicks both green buttons in 1.5 seconds, confirms the screen is stable via SSIM, returns a clean screenshot. Claude reasons about the actual challenge. Cognitive work stays with Claude. Mechanical work goes to the CNN. Claude sees skills as tools in its system prompt, same as any other tool. It does not need to know they are neural networks.

The Self-Improvement Loop

The natural next question is whether this can close the loop on its own. The system already knows when it is doing something repetitive, the pattern detector sees it. And we already know how to train a CNN skill once we have the data. What if the system just did that automatically? Detect a pattern, collect demonstrations, train a skill, deploy it, move on.

I think this is roughly how it will work. The agent runs with Claude handling everything. After enough repetitions of the same motor pattern, it flags it as a candidate for automation, collects training data from Claude’s own actions, ships it off to a GPU, and gets back a checkpoint that handles that task going forward. The agent gets faster and cheaper the more it works because it keeps converting expensive reasoning into cheap muscle memory.

That said, this is mostly still in my head. The pieces exist independently but I have not run the full loop end to end. There are open questions about how much data you actually need, how you handle edge cases in collection, how you decide when a skill is good enough to deploy.

One thing I keep coming back to is whether the SFT phase is even necessary. Right now the pipeline does behavioral cloning first to get a rough policy, then RL to sharpen it. But the SFT phase is the slow part, hundreds of demonstrations, 30 minutes of training, and it only gets you to “roughly the right area” anyway. The RL phase is where the real precision comes from.

What if you skipped SFT entirely and went straight to RL, but gave the agent deterministic expert trajectories to learn from? Like a leader arm in robotics, where you physically guide the robot through the correct motion and let it learn from that. Here, Claude is the leader arm. You record a handful of perfect trajectories, maybe 10 to 20, and use those as the expert signal for RL directly, the way MuJoCo environments let you bootstrap RL from expert demonstrations without a separate imitation phase. The CNN learns the right policy in minutes because it is not exploring from scratch; it has perfect examples showing exactly what success looks like. No SFT warmup, no hundreds of demonstrations, just expert trajectories straight into RL.
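A toy sketch of that idea, in the same one-dimensional Gaussian click setting as before but with no warm start: a handful of recorded expert clicks get mixed directly into the policy update, in the spirit of demo-augmented policy gradient methods like DAPG. All the numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

target = 0.62                                  # button centre, normalized x
demos = rng.normal(target, 0.005, size=15)     # 10-20 "perfect" Claude clicks
mu, sigma = 0.10, 0.15                         # cold start: no SFT phase

for step in range(3000):
    x = rng.normal(mu, sigma)                  # on-policy sample
    reward = 1.0 if abs(x - target) < 0.02 else 0.0
    mu += 0.05 * reward * (x - mu)             # sparse REINFORCE term
    mu += 0.01 * (demos[step % len(demos)] - mu)  # expert-trajectory term
    sigma = max(0.01, sigma * 0.999)           # anneal exploration
```

The demo term keeps pulling the policy toward the expert clicks even while rewards are still sparse, which is exactly the role the SFT phase played before, just without the separate imitation stage.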

What Comes Next

The current skill registry has two entries. Popup dismissal is trained and working. Radio modal solving is in progress, blocked on a submit button visibility issue in the training environment. Form fill is detected but not yet collecting data.

The immediate roadmap is straightforward. Fix the modal training environment, complete that skill, build the form fill skill, and run the full benchmark with all three. I think three skills covers roughly 70% of the mechanical actions in the 30 step benchmark, which should bring the cost per run from $5 down to around $1.50 and the completion rate up significantly.

But the more interesting question is what happens when you point this system at real enterprise software. SAP transactions that require 40 clicks through nested menus. Oracle forms where you tab through 25 fields entering the same data format every time. Mainframe green screens where the interaction pattern is always type-tab-type-tab-enter. These are exactly the kind of repetitive, visually consistent, motor-heavy tasks that the CNN skills are designed for. And because the system is vision-only, it does not matter that SAP and Oracle have completely different underlying technologies. They are both just pixels.

The self-improvement loop means you do not need to anticipate every workflow in advance. You point the agent at a task, let it work through it with Claude a few times, and the patterns that should be automated identify themselves. The system figures out what is cognitive and what is motor, and it builds the motor skills on its own.

This is, I think, the right level of abstraction for building agents that operate in the real world. Not a single model that tries to be both the brain and the hands. Not a brittle pipeline of specialized components that fail at every junction. A reasoning core that can delegate learned motor patterns to cheap, fast, independently trained specialists. The same architecture nature converged on, built with ConvNets and API calls instead of neurons and synapses.

“We become what we behold. We shape our tools, and thereafter our tools shape us.” —Marshall McLuhan