The Training Gauntlet
Five Zeros
SFT v3 through v7 all scored 0 out of 30 on the benchmark. Five training runs, different data mixes, different curriculum schedules, different hyperparameters. The failure mode was always the same. The model would try to solve the challenge while popups were still covering the screen, or type a code before dismissing the cookie banner. It learned the vocabulary of actions but not the ordering.
It took a while to understand why. Every training example was an ActionChunk, a JSON blob wrapping a confidence score, an array of 5 to 15 actions, and a postcondition prediction. The model had to simultaneously learn what action to take, when to take it, how to group actions into coherent chunks, how to set confidence, and how to predict what the screen should look like after. The format was doing too much at once and the sequencing signal was getting lost in the noise.
The fix was to split training into phases. Phase 1 teaches one thing: given this screenshot, what is the single best next action. One screenshot, one bare action dict, no wrappers. The model learns that overlays get dismissed first, then you solve the challenge, then you type the code. Phase 2 comes later and teaches how to group those actions into chunks. But it can only do that effectively if the model already knows the right ordering.
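To make the contrast concrete, here is a sketch of the two target formats. The field names are my guesses at the shapes described above, not the actual schema:

```python
# Hypothetical shapes illustrating the format change; field names are guesses.

# Original target: one ActionChunk wrapping many decisions at once.
action_chunk = {
    "confidence": 0.8,
    "actions": [
        {"type": "click", "x": 412, "y": 88},   # dismiss cookie banner
        {"type": "click", "x": 640, "y": 360},  # solve challenge
        {"type": "type", "text": "X7Q2"},       # enter code
    ],
    "postcondition": "challenge panel closed, code field filled",
}

# Phase 1 target: one screenshot, one bare action dict, nothing else.
phase1_example = {
    "image": "step_0042.png",
    "target": {"type": "click", "x": 412, "y": 88},  # overlays first
}
```

With the wrapper gone, the only thing the loss can reward is picking the right next action, which is exactly the sequencing signal that was getting drowned out.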
The dataset: 27K single-action examples from 540 real-site recordings, each trajectory split at natural boundaries: dismiss steps, a reason action at the boundary between dismissing overlays and entering the code, and code-entry steps. Training took about 8 hours on the A100.
Progress and Problems
Phase 1 produced a model that was genuinely promising. Valid JSON output over 95% of the time. All five action types working. Visual grounding at 1 to 7px accuracy on clear targets. Priority ordering working: overlays handled before challenges.
But two problems showed up immediately. When the model missed a click, maybe the grounding was off by 20px and it hit dead space instead of the button, it would repeat the exact same click at the exact same coordinates forever. It had no concept of action failure because the training data only contained perfect trajectories where every action succeeds. It had literally never seen what happens when you click and nothing changes.
The second problem was related. Sometimes it targeted the wrong element entirely, a fake close button instead of the real green Dismiss, and just kept clicking it because nothing told it the action was not having the expected effect.
I fixed this with a targeted data collection using Claude as an oracle with SSIM based effect detection. When no visual change is detected after an action, the oracle demonstrates recovery. Shifting coordinates, trying a different element, scrolling to find alternatives. About 500 curated examples, expensive per sample, but exactly the data the model was missing. Phase 1.5 continued the same LoRA adapter with this recovery data mixed in and the model started pivoting after 2 to 3 failed attempts instead of looping.
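The effect-detection piece can be sketched with a simplified single-window SSIM over grayscale screenshots. The real pipeline presumably uses a proper sliding-window SSIM implementation and tuned thresholds; the 0.98 cutoff here is an assumption for illustration:

```python
import numpy as np

def ssim_global(a: np.ndarray, b: np.ndarray) -> float:
    """Single-window SSIM for two grayscale float images in [0, 1].
    A simplification of the usual sliding-window SSIM."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizers for L = 1
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(
        ((2 * mu_a * mu_b + c1) * (2 * cov + c2))
        / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    )

def action_had_effect(before: np.ndarray, after: np.ndarray,
                      threshold: float = 0.98) -> bool:
    # High similarity means the screen did not change: no effect.
    # Threshold is an illustrative assumption, not the project's value.
    return ssim_global(before, after) < threshold
```

When `action_had_effect` returns False after a click, that is the trigger for the oracle to demonstrate a recovery instead of letting the trajectory continue as if the click worked.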
Online RL Needs More Than One GPU
The natural next step was reinforcement learning. GRPO on the A100. It managed 3 gradient updates in 2.5 hours. Each trajectory takes about 8 minutes because you are running 100 actions at 5 seconds each through model inference and browser execution. Collecting one batch with a group size of 4 takes 32 minutes. The actual gradient update takes 2 seconds.
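The arithmetic behind those numbers, as a sanity check:

```python
steps_per_traj = 100
secs_per_step = 5      # model inference + browser execution per action
group_size = 4

traj_minutes = steps_per_traj * secs_per_step / 60  # roughly 8 minutes each
batch_minutes = group_size * traj_minutes           # roughly half an hour
update_secs = 2                                     # the gradient step itself
```

Almost all the wall-clock time is trajectory collection; the gradient update is a rounding error.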
The loss went from 1.66 to 23.7 in 3 updates. With only 4 trajectories per group, the ranking within each group is pure noise. The model got contradictory gradient signals and diverged. Eval went from 2 out of 11 steps down to 0 out of 5. Online RL at this scale needs group sizes of 16 to 64 and hundreds of trajectories per update. On one GPU that would take days per update.
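The noise problem is easy to see in GRPO's group-relative advantage, which normalizes each trajectory's return against its group. This is a minimal sketch of that normalization, not the trainer's actual code:

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantages: normalize returns within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# With only 4 trajectories and sparse, near-identical returns, whichever
# rollout stumbles into a tiny bonus absorbs all the positive advantage,
# so the "ranking" is mostly luck.
noisy = group_advantages([0.0, 0.0, 0.0, 0.02])
```

With a group of 4, one lucky +0.02 gets an advantage near +1.7 while the other three are pushed down, and the next batch may push in the opposite direction. Larger groups average this luck out, which is why 16 to 64 is the practical floor.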
I tried rejection sampling fine tuning next, which is offline RL. Run the model many times with temperature sampling, keep the successful trajectories, SFT on those, repeat. It is more practical on limited compute because the expensive trajectory collection only needs forward passes, not gradients.
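One round of that loop can be sketched as follows. The `rollout_fn` and `finetune_fn` callables are placeholders for the actual sampling and SFT machinery:

```python
def rejection_sampling_round(policy, rollout_fn, finetune_fn,
                             n_rollouts=64, reward_threshold=0.9):
    """One round of rejection-sampling fine-tuning: sample trajectories with
    the current policy (forward passes only), keep those above the reward
    threshold, then SFT on the survivors (the only gradient work)."""
    kept = []
    for _ in range(n_rollouts):
        traj, total_reward = rollout_fn(policy)
        if total_reward >= reward_threshold:
            kept.append(traj)
    if kept:
        policy = finetune_fn(policy, kept)
    return policy, len(kept)
```

The appeal on one GPU is visible in the structure: the expensive inner loop is pure inference and can even run at low precision or on a separate machine, while the gradient step touches only the small filtered set.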
Switching Foundations
Around this time I found EvoCUA-8B from Meituan built on Qwen3-VL, a newer architecture than UI-TARS’s Qwen2.5-VL with a native action format closer to what I needed. I built an adapter with subtask tracking, episodic memory across actions, and a Claude Haiku planner for subgoal guidance.
I tried SFT on EvoCUA at multiple LoRA configurations. Every single one degraded the base model’s visual grounding. Even conservative rank 8 with a tiny learning rate on attention layers only made it worse at clicking accurately. The root cause is a format mismatch between training and inference. My adapter injects signals like no effect warnings and subgoals that the model never saw during pretraining and SFT on that distribution confuses it rather than teaching it.
This was a significant lesson. These foundation models have carefully learned spatial understanding that is surprisingly fragile. Fine tuning on a mismatched distribution does not add new capability, it erodes existing capability. The grounding knowledge lives distributed across the weights and even small perturbations in the wrong direction can break it.
The Reward Signal Problem
I built a Gymnasium environment running against the real benchmark site through Playwright. The model only ever sees screenshots, but DOM queries provide reward signals during training. Phase completion bonuses fire when all popups are dismissed, when the modal is closed, when a radio is selected, when the challenge code is revealed.
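The reward side of that split can be sketched as a pure function over DOM facts. The keys and bonus values here are illustrative, not the environment's actual numbers; in the real setup each fact would come from a Playwright DOM query:

```python
def dom_reward(dom: dict, claimed: set) -> float:
    """Phase-completion bonuses computed from DOM state the model never sees.
    `dom` maps phase names to booleans pulled via DOM queries; `claimed`
    tracks bonuses already paid so each fires exactly once per episode.
    Keys and values are illustrative assumptions."""
    bonuses = {
        "popups_dismissed": 0.2,
        "modal_closed": 0.3,
        "radio_selected": 0.2,
        "code_revealed": 0.3,
    }
    reward = 0.0
    for phase, bonus in bonuses.items():
        if dom.get(phase) and phase not in claimed:
            reward += bonus
            claimed.add(phase)
    return reward
```

Keeping the observation (pixels only) and the reward (DOM only) on opposite sides of this line is what lets the trained policy run on sites where no DOM access exists.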
Then for weeks the GRPO trainer showed zero radio selections and zero modal closes across every single run. I spent a long time assuming the model was the problem, trying different hyperparameters, different curricula, different adapter configurations. The actual problem was a CSS selector.
The benchmark uses Radix UI which renders radio buttons as <button role="radio"> with data-state="checked" instead of standard <input type="radio"> with the :checked pseudo class. My DOM query was looking for input[type="radio"] which returns zero results on the real site. Weeks of debugging for a one line fix. One selector change from input[type="radio"] to [role=radio] and the entire modal training pipeline started working.
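The failure is easy to reproduce without a browser. Using a simplified fragment of Radix-style markup (my reconstruction from the description above, not the benchmark's actual HTML), a stdlib parser shows why the old selector matched nothing:

```python
from html.parser import HTMLParser

# Simplified Radix-style output: radios are buttons, not inputs.
RADIX_MODAL = """
<div role="dialog">
  <button role="radio" data-state="unchecked">Option A</button>
  <button role="radio" data-state="checked">Option B</button>
</div>
"""

class RadioCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.input_radios = 0  # what input[type="radio"] would match
        self.role_radios = 0   # what [role="radio"] matches
        self.checked = 0       # data-state="checked" stands in for :checked

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "radio":
            self.input_radios += 1
        if attrs.get("role") == "radio":
            self.role_radios += 1
            if attrs.get("data-state") == "checked":
                self.checked += 1

counter = RadioCounter()
counter.feed(RADIX_MODAL)
```

`input[type="radio"]` finds zero elements in this markup, so every reward check downstream silently reported failure; `[role="radio"]` finds both.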
Decomposing the Problem
Even with the selector fixed, training on the full 30 step benchmark was not producing learning signal on the hard subtasks. The model learned to dismiss popups since that produces an immediate dense reward. But the scrollable modal, finding the right radio button among 11 options inside a fixed position container that requires significant scrolling, clicking a tiny Radix button, then clicking Submit, that sequence never produced a positive reward during training. The model never stumbled into completing it by accident so it never got signal to learn from.
This is the sparse reward problem in RL and it is the same thing the robotics community has been dealing with forever. If your reward only fires when the entire task is complete, and the entire task has 15 steps, the probability of randomly completing all 15 steps to get your first positive signal is essentially zero.
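The "essentially zero" claim is worth making concrete. Assuming each step independently succeeds under a random policy with probability p (a generous stand-in number, not a measured one):

```python
# If each of 15 steps succeeds with probability p under random exploration,
# the chance of seeing the first end-of-task reward in one episode is p**15.
p = 0.3  # generous for "click the right pixel on a cluttered page"
first_signal_prob = p ** 15  # on the order of one in a hundred million
```

At 8 minutes per trajectory, waiting for that first positive signal by chance is not a training run, it is a geological era.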
The solution in robotics is curriculum learning. Learn to grasp the block before you learn to stack it on another block. Isolate the hard subtask and train on it in a focused environment with dense rewards before combining everything.
So I built isolated environments for each bottleneck subtask. The modal environment loads the real benchmark site, finds a step with a modal, dismisses all overlays via DOM so the model starts with a clean view, and then the model takes actions through pure pixel coordinates. Dense rewards at every step. Any radio selected gives +0.02 for exploration. The correct radio gives +0.1. Correct radio selected plus Submit clicked plus modal gone gives +1.0. A human solves this in 3 to 5 actions. Scroll down a few times, click the right radio, click Submit.
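The per-step shaping described above can be written as a small pure function. The state keys are illustrative; the bonus values mirror the numbers in the text:

```python
def modal_reward(state: dict) -> float:
    """Dense shaping for the isolated modal environment. `state` is a DOM
    snapshot; key names are illustrative assumptions."""
    if (state.get("correct_radio") and state.get("submitted")
            and not state.get("modal_open", True)):
        return 1.0   # correct radio + Submit clicked + modal gone
    if state.get("correct_radio"):
        return 0.1   # the right radio is selected
    if state.get("any_radio"):
        return 0.02  # exploration bonus: any radio selected
    return 0.0
```

The ordering matters: the exploration bonus only applies until something better is true, so the gradient always points from "clicked anything" toward "clicked the right thing" toward "finished the subtask".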
I verified the full sequence end to end. Found a modal, scrolled 3 times at 300px each, clicked the correct radio for +0.1, clicked Submit for +1.0. 5 total actions, episode complete. The environment works.
Modal GRPO is training now. After that I will build a code entry environment with the same pattern, then combine the subtask capabilities and run on the full benchmark. Each isolated environment gives the model a tractable learning problem with dense enough rewards that GRPO can find signal on a single GPU. Once each subtask is learned independently, the hypothesis is that combining them back together should produce an agent that can handle the full sequence. It is the same bet robotics makes with curriculum learning, and it tends to work because the subtask skills are largely independent: you do not need to relearn how to scroll a modal just because you also need to type a code.
The project is at an interesting point. The architecture is solid. The infrastructure works. The thesis that you should build these systems like robots rather than chatbots keeps being validated at every turn. The remaining question is whether curriculum RL on one GPU can produce a model that actually completes the full benchmark. I think it can, but I also thought the multi model pipeline would work, so I am staying humble about predictions.
“It does not matter how slowly you go as long as you do not stop.” —Confucius