
Continuous Perception

Knowing Your Enemy

While the models trained on cloud GPUs, I started questioning whether my training data actually matched what the model would face. So I downloaded the benchmark site’s 308KB JavaScript bundle and decompiled it: 17,824 lines, prettified. What I found reshaped the entire approach.

The real task structure is fundamentally different from what I had been generating. Each step has a two-part structure: you complete a challenge to reveal a 6-character code, then type that code into a yellow input box to advance. My training data only covered single-action interactions. Nothing about reading a revealed code from the screen and typing it somewhere else.

There are 26 distinct challenge types. Steps 1 through 5 alone include visible codes, DOM-hidden codes, click-to-reveal, scroll-to-reveal, and delayed reveal. Later steps get into canvas puzzles, audio challenges, base64 decoding, and rotating elements. The codes are generated with crypto.getRandomValues per session, so they are different every time, which means you cannot memorize them. You have to actually read them from the screen.
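The site mints these codes with crypto.getRandomValues; the Python equivalent of that per-session generation uses the `secrets` CSPRNG. A minimal sketch (the exact character set is my assumption, not something confirmed from the bundle):

```python
import secrets
import string

# Assumed charset; the real site may use a different alphabet.
ALPHABET = string.ascii_uppercase + string.digits

def session_code(length: int = 6) -> str:
    """Mint a fresh 6-character code from a CSPRNG. A new code every
    session means memorizing codes across runs is useless."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Because each draw is cryptographically random, two sessions essentially never share a code, which is what forces the agent to actually read the screen.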

The part that really messed with my assumptions: ALL Next and Continue buttons are decoys. Every single one shows “Wrong Button.” The only way to advance is entering the correct code. And 40% of popup close buttons are fake: the gray X buttons show a toast message, while the real dismiss is always a green button. The benchmark is actively trying to trick you, and it does a good job.

Rather than trying to perfectly replicate all 26 challenge types in my procedural generator, I decided to build a benchmark-matched generator for the patterns I could replicate and to run RL directly on the real site for the rest. The codes are random per session, so training on the real site is not memorization. The model just learns generalizable skills in the environment it will actually operate in.

How Robots See

This is the part of the project where I went deep into how robotic systems handle perception, and it fundamentally changed how I think about the problem.

Every existing GUI agent works the same way: take a screenshot, think about it, take an action, wait for the page to settle, take another screenshot. Stop and go. It is the obvious approach because that is how chatbot-style AI works: you get an input, you produce an output, you wait for the next input.

Robots do not work this way. A robot arm does not take a photo of the scene, plan its trajectory, execute the whole thing blind, then take another photo. It perceives continuously. ALOHA from Stanford predicts 90 joint positions at 50Hz from a single forward pass. That is 1.8 seconds of smooth motion from one inference. The perception and the action happen concurrently, always.

The more I read about this, the more obvious it became that the screenshot-bot paradigm is leaving massive performance on the table. An API-based agent gets maybe 8 to 20 actions in 60 seconds. A local model doing one action per inference gets around 300. But if you predict 20 actions per frame and run perception, policy, and execution as concurrent threads, you get roughly 1000 actions in 60 seconds. That is not a small improvement. It is a fundamentally different operating regime.
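The back-of-envelope arithmetic behind those numbers, with the per-step latencies as my illustrative assumptions (only the resulting action counts come from the text):

```python
EPISODE_S = 60.0

# Stop-and-go API agent: every action pays a full network + inference round trip.
api_round_trip_s = 4.0                               # assumed latency
api_actions = EPISODE_S / api_round_trip_s           # 15, inside the 8-20 range

# Local stop-and-go model: one action per forward pass.
local_pass_s = 0.2                                   # assumed per-pass latency
local_actions = EPISODE_S / local_pass_s             # 300

# Concurrent regime: each forward pass yields 20 actions, 5 are executed
# before a replan, and planning overlaps execution instead of serializing.
cycle_s, executed_per_cycle = 0.3, 5                 # assumed cycle time
concurrent_actions = EPISODE_S / cycle_s * executed_per_cycle  # 1000
```

The point of the exercise is that the ~3x jump from local stop-and-go to concurrent comes entirely from amortizing inference over action chunks, not from a faster model.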

Three Threads

I designed the continuous architecture as three threads that are always running. A perception thread continuously captures frames into a ring buffer at 10 to 30 fps; the latest frame is always available, with no blocking. A policy thread grabs the latest frame, runs one forward pass, outputs 20 actions, hands the first 5 to execution, and waits for a signal to replan. An execution thread streams those actions to the screen, and after the first 5 complete it signals the policy to replan from a fresh frame. This is temporal ensembling, the same technique from robotics where you predict a long action sequence but only execute the near-term actions before replanning.
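A minimal sketch of that loop, with policy and execution merged into one function for brevity (the real system runs them as separate threads, and all names here are illustrative stand-ins):

```python
import threading
import time
from collections import deque

class FrameBuffer:
    """Latest-frame ring buffer: the capture thread never blocks,
    and the policy always reads the newest frame."""
    def __init__(self, size: int = 8):
        self._frames = deque(maxlen=size)
        self._lock = threading.Lock()

    def push(self, frame):
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        with self._lock:
            return self._frames[-1] if self._frames else None

def run_episode(n_cycles: int = 3, chunk: int = 20, execute: int = 5):
    """Simulate a few perception/policy/execution replan cycles."""
    frames = FrameBuffer()
    stop = threading.Event()
    executed = []

    def perception():
        # Capture thread: stands in for 10-30 fps screen grabs.
        i = 0
        while not stop.is_set():
            frames.push(f"frame-{i}")
            i += 1
            time.sleep(0.002)

    t = threading.Thread(target=perception, daemon=True)
    t.start()
    while frames.latest() is None:       # wait for the first frame
        time.sleep(0.001)

    for _ in range(n_cycles):
        frame = frames.latest()          # freshest frame, no blocking
        # One "forward pass" predicts a chunk of actions from this frame.
        actions = [f"{frame}:act{k}" for k in range(chunk)]
        for a in actions[:execute]:      # temporal ensembling: execute only
            executed.append(a)           # the near-term actions...
            time.sleep(0.001)
        # ...then loop back and replan from a fresh frame.

    stop.set()
    t.join()
    return executed
```

Each cycle plans 20 actions but commits to only 5, so the policy constantly re-grounds itself in the latest frame without ever stalling the capture loop.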

SSIM gating turned out to be important. After each action, you compare the screen before and after. A big visual change means something significant happened, so you throw away the rest of your planned actions and replan from scratch. No change at all probably means the action missed, so you should not count it. Normal changes mean things are progressing, and you keep going.
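One way to implement the gate, using a simplified global SSIM over grayscale frames (production code would likely use a windowed SSIM such as scikit-image's `structural_similarity`; the thresholds here are illustrative assumptions, not tuned values):

```python
import numpy as np

def ssim_global(a: np.ndarray, b: np.ndarray, L: float = 255.0) -> float:
    """Simplified single-window SSIM between two grayscale frames."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    a, b = a.astype(np.float64), b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))

def gate(before: np.ndarray, after: np.ndarray,
         big_change: float = 0.6, no_change: float = 0.995) -> str:
    """Decide what to do after an action, based on visual change."""
    s = ssim_global(before, after)
    if s < big_change:
        return "replan"    # big change: discard remaining planned actions
    if s > no_change:
        return "retry"     # no visible change: the action probably missed
    return "continue"      # normal progress: keep executing the chunk
```

SSIM near 1.0 means the frames are effectively identical (the action missed); near 0 means the screen changed drastically (replan); everything in between is normal progress.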

The Full System

The VLA came together as 36 files and about 13K lines. The model takes in visual tokens, episodic memory from a SQLite store with sentence-transformer embeddings, recent action history, and a task description; it outputs JSON action sequences. There is a reason tool that delegates hard problems like base64 decoding or math to Claude or GPT-4, with usage tracked so I can eventually distill that capability back into the model itself. An oracle exists purely for generating ground-truth training trajectories using DOM access and never runs at inference.

The continuous agent orchestrates the three threads through asyncio. I use JPEG capture instead of PNG, which is 3 to 5x faster. Reason actions are special in the execution thread: they terminate the current batch, delegate to the external model, inject the result back into the policy context, and trigger an immediate replan. This means the model can seamlessly hand off a hard subproblem and resume acting when the answer comes back.
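The reason-action path can be sketched like this. Everything here is illustrative: the function names, the action schema, and the local base64 stand-in for the Claude/GPT-4 call are my assumptions, not the actual interfaces:

```python
import base64

def execute_chunk(actions, context, delegate, performed, replans):
    """Stream a chunk of actions. A 'reason' action terminates the batch,
    delegates to an external model, injects the answer back into the
    policy context, and signals an immediate replan."""
    for act in actions:
        if act["type"] == "reason":
            answer = delegate(act["prompt"])                     # hand off the hard subproblem
            context.append({"role": "tool", "content": answer})  # answer into policy context
            replans.append("immediate")                          # trigger replan now
            return                                               # rest of the batch is dropped
        performed.append(act)                                    # normal click/type action

def local_delegate(prompt: str) -> str:
    # Stand-in for the external-model call: just decodes base64 locally.
    return base64.b64decode(prompt).decode()
```

Calling it with a chunk that contains a reason step executes only the actions before it, then defers: the policy's next forward pass sees the delegated answer in its context.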

I keep coming back to the same thesis. The architecture that works for getting a humanoid robot to operate factory equipment designed for human bodies is the same architecture that works for getting a visual agent to operate software designed for human eyes and hands. Continuous perception, action chunking, temporal ensembling. The screen is just the environment and the model sees and acts in it the same way a robot sees and acts in a factory.