Pixels In, Actions Out
Generating Your Own Training Data
After the multi-model pipeline I needed a single model that takes a screenshot and outputs an action as JSON in one forward pass. I picked Qwen3-VL-2B-Instruct as the base since it is small enough to iterate on fast and multimodal out of the box.
The question was where to get training data. Human demonstrations are expensive, slow to collect, and noisy because people click slightly off target. So I went procedural. This ended up being the most educational part of the entire project because it forced me to think about what a model actually needs to learn.
I built a component library of 30+ UI elements. Buttons, modals, popups, cookie banners, radio buttons, text inputs, scroll containers, decoy buttons, fake close buttons. Each component knows how to render itself as HTML and how to be solved. A page assembler composes random selections of these into complete web pages with different themes and layouts. Then a headless Playwright solver opens each generated page, takes a screenshot before each action, resolves the CSS selector to get the exact pixel coordinates from the DOM, executes the action, and records the screenshot-action pair.
The ground truth comes from the DOM, not from vision. That is important. The coordinates are perfect because they come from the browser engine itself, not from a model trying to estimate where something is. 21,000 training pairs across 21 WebDataset shards with zero human annotation.
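The solver loop can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `plan` argument (a list of selector-action pairs produced by the page assembler) and the single `click` execution path are assumptions, but the Playwright calls are real, and `bounding_box()` is where the DOM-derived ground truth comes from.

```python
def center_of(box):
    """Center of a Playwright bounding_box() dict, in pixels.
    This is the 'perfect coordinate' taken from the browser engine."""
    return (round(box["x"] + box["width"] / 2),
            round(box["y"] + box["height"] / 2))

def solve_page(url, plan, out_dir):
    """Hypothetical solver loop. plan: list of (css_selector, action_type)
    pairs. Returns the recorded screenshot-action training pairs."""
    from playwright.sync_api import sync_playwright  # deferred so the pure helper imports cleanly
    pairs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        for step, (selector, action_type) in enumerate(plan):
            shot = f"{out_dir}/step_{step:03d}.png"
            page.screenshot(path=shot)                   # observe before acting
            box = page.locator(selector).bounding_box()  # ground truth from the DOM
            x, y = center_of(box)
            pairs.append({"image": shot,
                          "action": {"type": action_type, "x": x, "y": y}})
            page.locator(selector).click()               # execute, then next observation
        browser.close()
    return pairs
```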
The curriculum has four difficulty tiers which turned out to matter a lot. Tier 1 is single click, just find the button. Tier 2 adds overlays so now you have to dismiss a popup before you can click. Tier 3 introduces deception with fake buttons and misleading labels. Tier 4 is the full composite with stacked popups, scrollable forms, and code entry. Starting the model on easy examples and gradually increasing difficulty is something I kept coming back to throughout this project.
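One way to picture the curriculum is as a sampling config that the page assembler draws from. The component names mirror the library described above, but the specific pools and counts here are illustrative assumptions, not the project's actual values.

```python
import random

# Hypothetical tier definitions: higher tiers add overlays and deception.
TIERS = {
    1: {"components": ["button"], "overlays": 0},                        # just find the button
    2: {"components": ["button", "popup", "cookie_banner"], "overlays": 1},  # dismiss first
    3: {"components": ["button", "decoy_button", "fake_close", "popup"], "overlays": 1},
    4: {"components": ["button", "popup", "scroll_form", "code_entry",
                       "decoy_button"], "overlays": 2},                  # full composite
}

def sample_page_spec(tier, rng=random):
    """Draw a random page specification for one difficulty tier."""
    cfg = TIERS[tier]
    return {
        "tier": tier,
        "overlays": cfg["overlays"],
        "components": rng.sample(cfg["components"], k=min(3, len(cfg["components"]))),
    }
```

Training then just means walking the tiers in order, so the model sees tier-1 pages before it ever meets a decoy button.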
The Landscape of Existing Models
While my 2B model trained on a Lambda A100 I went to see what already existed. I downloaded ShowUI-2B and UI-TARS-1.5-7B and ran them against the benchmark.
ShowUI is what is called a grounding model. It is trained to locate elements on screen. It found the START button and the modal instantly, no problem. Then it clicked Submit and Continue 140 times in a row without ever scrolling down or selecting the radio button first. 150 actions, 514 seconds, 0 steps completed. It has spatial understanding with zero task reasoning. It knows where everything is but has no concept of what needs to happen in what order.
UI-TARS from ByteDance is different. It is an agent model trained on millions of real GUI interaction trajectories. On Apple Silicon the generation quality collapsed after a few tokens due to MPS backend issues with Qwen2.5-VL’s attention implementation. But looking at the Thought outputs before the generation degraded, it clearly understood the task. It described the right strategy, dismiss the popup then look for the form. It just could not execute cleanly.
Neither completed a single step but the failure modes were instructive. ShowUI can ground but cannot reason. UI-TARS can reason but its reasoning was pretrained, not something I added. Its vision encoder has seen millions of real screenshots and internalized how GUIs work. That understanding lives in the frozen weights of the vision backbone which is exactly the part I would not be training.
The Robotics Connection
This is when something clicked for me. I was looking at two models where one has good perception and the other has good reasoning, and I was about to go back to training my 2B from scratch, a model whose base has never seen a real website before.
Then I thought about how they do it in robotics. ALOHA from Stanford and Google takes a pretrained model, LoRA fine tunes it on teleoperation data, and deploys it on a physical robot. The pretrained model provides the foundation of understanding and the fine tuning adds the specific task capability. They do not train vision from scratch. They leverage what the foundation model already knows.
I can do the same thing. Take UI-TARS as the foundation because it already understands GUIs. Fine tune it with LoRA to add what it is missing. Action chunking so it predicts 20 actions per screenshot instead of 1. My compact JSON format instead of its verbose output. Then RL in simulation to let it improve through experience.
The key insight about action chunking came from looking at how robot policies work. ALOHA predicts 90 joint positions at 50Hz from a single forward pass: 1.8 seconds of smooth motion from one inference. Instead of predicting one action at a time and waiting to observe the result, you predict a whole sequence. Dismiss the popup then scroll then click the radio button. The model learns to plan ahead rather than being myopic.
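Concretely, an action-chunked training target might look like the following. The field names and fixed-length padding scheme are assumptions for illustration, not the project's exact schema; the point is that one screenshot maps to one compact JSON array covering several steps.

```python
import json

CHUNK = 20  # actions predicted per screenshot, per the chunking idea above

# One screenshot's worth of planned actions (illustrative values).
chunk = [
    {"type": "click", "x": 612, "y": 188},  # dismiss the popup
    {"type": "scroll", "dy": 400},          # reveal the form below the fold
    {"type": "click", "x": 301, "y": 742},  # select the radio button
]

# Pad with no-ops to a fixed length so every target has the same shape.
padded = (chunk + [{"type": "wait"}] * CHUNK)[:CHUNK]
target = json.dumps(padded, separators=(",", ":"))  # compact JSON, no whitespace
```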
The action space is five types with coordinates normalized 0 to 1000 so it is resolution independent. Click, type, scroll, press, wait. The training infrastructure handles Qwen2-VL, Qwen2.5-VL, and Qwen3-VL transparently through architecture auto detection. The vision encoder stays frozen since it already understands screenshots and only the language model layers get LoRA adapters. UI-TARS 7B with LoRA comes out to 8.5B total parameters with 190M trainable.
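The 0-1000 normalization is what makes the coordinates resolution independent: the model always emits grid coordinates, and a thin layer maps them to pixels for whatever viewport the browser happens to be running at. A minimal sketch of that mapping:

```python
def to_norm(px, py, width, height):
    """Pixel coordinates -> 0-1000 grid, independent of screen size."""
    return round(px / width * 1000), round(py / height * 1000)

def to_pixels(nx, ny, width, height):
    """0-1000 grid -> pixel coordinates for the current viewport."""
    return round(nx / 1000 * width), round(ny / 1000 * height)
```

The same normalized click lands in the same relative spot whether the screenshot was captured at 1280x800 or 2560x1600.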
The biggest difference between my setup and a robotics setup is that I have no sim to real gap. The environment during training is identical to deployment. The browser does not change between training and inference the way a simulated robot arm differs from a physical one. That felt like a significant advantage.