
One Brain, Many Hats

Hitting the Ceiling

The CNN skills from the last post work. A tiny ConvNeXt model can dismiss popups at 5ms per forward pass, pixels in, coordinates out. But when I tried to build the next skill, a modal solver that scrolls through radio options and clicks the right one, the architecture started fighting me.

CNNs cannot read text. The modal skill needs to find a specific option label inside a scrollable list, but ConvNeXt spatial features at 8 pixels per cell cannot resolve the difference between “Option 7” and “Option 8.” I had to inject target coordinates from an external OCR step, which coupled the perception and motor systems back together. The whole point of the CNN approach was pixels to actions with nothing in between.

On top of that, each CNN skill loads its own full backbone and text encoder. Two skills means two copies of everything in VRAM. The muscle memory idea is that the system keeps building new skills over time, but if every new skill costs 4GB you run out of room fast.

Sharing a Brain

The fix is to stop duplicating the perception stack and share it. One frozen VLM backbone, Qwen3-VL, handles all the visual understanding. It reads the text on screen natively and produces rich feature representations. Then instead of one large model per skill, you attach tiny expert heads that each specialize in a single task.

Each expert head is about 2 million parameters. An 8MB checkpoint. It takes the backbone’s vision and text features, runs cross-attention so it can read the subgoal to decide where to look, produces a spatial heatmap over the screen, and classifies the action type. The backbone already understands what text looks like on screen because it is a vision-language model. The expert just learns where to direct that understanding.
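A sketch of what such a head could look like in PyTorch. The dimensions, names, and action set here are my assumptions, not the actual implementation; the shape of the idea is what matters: vision patches query the subgoal text via cross-attention, then two small output heads produce a spatial heatmap and an action-type classification.

```python
import torch
import torch.nn as nn

class ExpertHead(nn.Module):
    """Tiny per-skill head on a frozen VLM backbone (hypothetical dims)."""
    def __init__(self, feat_dim=1024, hidden=256, n_actions=4, grid=32):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden)   # project vision patch tokens
        self.proj_t = nn.Linear(feat_dim, hidden)   # project subgoal text tokens
        # cross-attention: each screen patch attends over the subgoal tokens,
        # so the subgoal decides where on screen the head should look
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.heatmap = nn.Linear(hidden, 1)         # per-patch click logit
        self.action = nn.Linear(hidden, n_actions)  # e.g. click / scroll / type / noop
        self.grid = grid                            # assumes grid*grid patches

    def forward(self, vision_tokens, text_tokens):
        v = self.proj_v(vision_tokens)              # (B, P, H), P = grid*grid
        t = self.proj_t(text_tokens)                # (B, T, H)
        ctx, _ = self.attn(v, t, t)                 # patches conditioned on subgoal
        heat = self.heatmap(ctx).squeeze(-1)        # (B, P) spatial logits
        act = self.action(ctx.mean(dim=1))          # pooled action-type logits
        return heat.view(-1, self.grid, self.grid), act
```

With these default sizes the head comes out around a million parameters; the real 2M figure presumably reflects larger hidden dimensions, but either way it is a rounding error next to the backbone.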

Adding a new skill in the CNN world meant training a fresh 200MB model with its own backbone, its own text encoder, its own everything. In the expert head world it means training an 8MB adapter on top of a backbone that is already loaded. Dozens of experts in memory simultaneously. Hot-swapping is instant.

The Router

When Claude delegates a task, something needs to decide which expert handles it. Right now the router is simple. Each expert registers with a text description of what it does. When a subgoal comes in, the router computes cosine similarity between the subgoal embedding and each expert’s description. Highest similarity wins.
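The whole router fits in a few lines. This is a minimal sketch, assuming description embeddings are precomputed and stored in a name-to-vector registry:

```python
import numpy as np

def route(subgoal_emb, registry):
    """Return the expert whose description embedding is most
    cosine-similar to the subgoal embedding.

    registry: dict mapping expert name -> description embedding.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(registry, key=lambda name: cos(subgoal_emb, registry[name]))
```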

This works when the subgoals are distinct. “Dismiss the popup” obviously routes to the popup expert. But imagine fifteen experts covering overlapping domains. “Click the submit button” could be three different experts depending on what is actually on screen. The text alone does not tell you. You need to look at the screenshot.

That is the experiment we are about to run. A learned router that takes both the screenshot features and the subgoal text, passes them through a small adapter, and produces logits over the expert registry. The adapter is tiny, two linear layers with a GELU, but it learns to pick the right expert from visual context that text-only routing cannot see.
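Concretely, the adapter could look like this. A sketch under my own assumptions about feature dimensions; the source only specifies two linear layers with a GELU producing logits over the expert registry:

```python
import torch
import torch.nn as nn

class LearnedRouter(nn.Module):
    """Tiny adapter: (screenshot features ++ subgoal features) -> expert logits."""
    def __init__(self, feat_dim=1024, hidden=256, n_experts=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),  # fuse visual and text context
            nn.GELU(),
            nn.Linear(hidden, n_experts),     # one logit per registered expert
        )

    def forward(self, screen_feat, subgoal_feat):
        return self.net(torch.cat([screen_feat, subgoal_feat], dim=-1))
```

Because the screenshot features come from the already-loaded backbone, routing adds one small forward pass, not a second perception stack.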

The training data comes from the system itself. Every time an expert handles a task successfully, you log the screenshot, the subgoal, and which expert was active. The router learns from its own experience. Not just learning new motor skills but learning when to use them.
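The logging side is deliberately boring. A minimal sketch, with hypothetical field names, of appending one routing example per successful task to a JSONL file that later becomes the router's training set:

```python
import json
import time

def log_routing_example(path, screenshot_path, subgoal, expert_name, success):
    """Append one routing example; only successes become training data."""
    if not success:
        return
    record = {
        "ts": time.time(),
        "screenshot": screenshot_path,   # path to the saved screenshot
        "subgoal": subgoal,              # router input (with the screenshot)
        "expert": expert_name,           # the label the router learns to predict
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```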

Where Things Stand

The popup expert is trained and producing results. SFT on 2,799 demonstrations across 699 unique button positions, then REINFORCE to sharpen the heatmap. After 1,115 RL episodes the expert dismisses 7 out of 180 popups across three benchmark versions. Not production-ready but real signal that the architecture learns.

The modal expert has data ready but has not started training. The learned router exists as code but needs at least two working experts before routing between them means anything. One expert working, one queued, and a router that is still theoretical. The question is whether the learned router actually outperforms cosine similarity in practice, and whether the second expert generalizes as well as the first.

Why This Matters for Self-Improvement

The self-improvement loop from the last post had a scaling problem in the CNN world. Every new skill was expensive to train, store, and run. Expert heads change that math. Training 2 million parameters instead of 34 million. Storing 8MB instead of 200MB. The fixed cost is the backbone, and once you are paying that each additional expert is nearly free.

If the learned router works, the system does not just build skills, it learns when to deploy them. The agent gets faster and cheaper the more it works, not just because it has more skills, but because it gets better at picking the right one for the moment.

Maybe text-only routing is fine when you have a small number of clearly distinct skills. The hypothesis is that it becomes essential as the skill library grows and the boundaries between experts get fuzzy. That is what we are about to find out.

“The test of a first-rate intelligence is the ability to hold two opposed ideas in mind at the same time and still retain the ability to function.” —F. Scott Fitzgerald