The biggest mistake we made was treating computer use agents like RPA (Robotic Process Automation). We spent two months discovering that these are fundamentally different automation paradigms, and learning that the hard way taught us more about AI engineering than most teams learn in a year. We built our agent architecture on what seemed like a logical assumption: AI planning followed by step-by-step execution. That assumption cost us two months and a complete rewrite of Donely. Here’s what we learned, and why it matters for anyone building computer use agents.
The Architecture We Built First: Planning Then Execution
Our initial approach seemed elegant and logical. We believed we understood the problem clearly enough to separate it into phases: planning and execution. The workflow was straightforward:
- Planning Phase: Ask the AI to analyze the task and generate a structured plan with step-by-step instructions
- Execution Phase: Execute each step of the plan sequentially, one by one
This sounds clean. It mirrors how we think about software engineering – specification first, then implementation. And for many automation scenarios, it’s exactly right. We were confident in the architecture. We shipped it. And then the real world happened.
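The two-phase workflow can be sketched in a few lines. This is a toy illustration, not our actual implementation: `generate_plan` and `execute_step` are hypothetical stand-ins for the LLM planner and the executor, and the canned plan and environment exist only to show the control flow.

```python
# Minimal sketch of the plan-then-execute architecture. The function
# names and the canned plan are illustrative, not a real API.

def generate_plan(task: str) -> list[str]:
    # In a real system this would be an LLM call; here, a fixed plan.
    return [
        "open the CRM application",
        "click the 'Export' button in the toolbar",
        "paste rows into the spreadsheet",
    ]

def execute_step(step: str, environment: dict) -> bool:
    # Succeeds only if the step's assumption matches the environment.
    return step in environment.get("available_actions", [])

def run(task: str, environment: dict) -> str:
    plan = generate_plan(task)           # Phase 1: plan everything upfront
    for i, step in enumerate(plan, 1):   # Phase 2: execute sequentially
        if not execute_step(step, environment):
            return f"failed at step {i}: {step!r}"  # no way to adapt
    return "done"

# A user whose toolbar lacks the 'Export' button breaks the plan:
env = {"available_actions": ["open the CRM application"]}
print(run("export CRM data to a spreadsheet", env))
```

The failure mode is baked into the shape of the code: by the time execution discovers that an assumption is wrong, the planner is long gone.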
The fundamental problem emerged immediately: the plan we generated bore almost no resemblance to what the agent could actually execute. When we asked the AI to plan something like “automate a data entry workflow across multiple applications,” it would generate steps that assumed a specific set of tools, specific application layouts, specific UI patterns. But the user might have different software versions, different toolbar arrangements, different browsers. The agent would reach execution step 3 and discover that step 1’s assumptions were invalid.
This wasn’t a failure of planning – it was a failure of our fundamental model. We were trying to apply predictable process automation (RPA) thinking to an inherently unpredictable problem space (computer use). These are not the same thing at all.
RPA vs. Computer Use: Understanding the Fundamental Difference
The distinction between Robotic Process Automation and computer use agents is not simply a matter of degree – it’s a fundamental architectural difference. Understanding this distinction is crucial for anyone building automation systems.
Robotic Process Automation (RPA) is designed for predictable, rule-based tasks where the environment is controlled and repeatable. RPA excels at:
- Structured, high-volume tasks where the process flow is determined in advance
- Stable environments where UI elements, locations, and formats don’t change
- Binary decision logic (IF field contains X, THEN do Y)
- Precise UI automation based on coordinates and predefined selectors
- Complete specification of every action before execution begins
RPA is brittle by design – that brittleness is a feature when you’re automating predictable workflows. It’s fast, accurate, and requires no real intelligence. You literally script every mouse click and keystroke. It works beautifully when the environment is stable, and it breaks immediately when anything changes.
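To make the brittleness concrete, here is a caricature of the RPA style: every action scripted in advance against fixed coordinates. The `screen` dict is a toy stand-in for a real UI, and the names are ours, not any RPA vendor’s API.

```python
# Every click is scripted against fixed coordinates, decided before
# execution begins. Illustrative only; not a real RPA tool's API.

SCRIPT = [
    ("click", (412, 87)),   # the 'New Record' button... yesterday
    ("type", "ACME Corp"),
    ("click", (640, 512)),  # the 'Save' button
]

def replay(script, screen):
    """Replay the script; fail hard the moment the UI deviates."""
    for action, arg in script:
        if action == "click" and arg not in screen["buttons"]:
            raise RuntimeError(f"no clickable element at {arg}")
        # 'type' just sends keystrokes blindly; RPA does not look first.
    return "workflow complete"

stable_screen = {"buttons": {(412, 87), (640, 512)}}
print(replay(SCRIPT, stable_screen))  # works while nothing changes

moved_button = {"buttons": {(412, 87), (640, 540)}}  # Save moved 28px
# replay(SCRIPT, moved_button) -> RuntimeError: no clickable element
```

A 28-pixel UI change kills the workflow. In a stable environment that strictness is exactly what you want; in a user’s uncontrolled desktop, it is fatal.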
Computer Use Agents, by contrast, are designed for adaptive, dynamic environments where the task parameters might shift during execution. They require:
- Visual understanding of what’s on screen, not just DOM queries
- Contextual reasoning about what actions are appropriate given the current state
- Flexibility to adapt when assumptions change or UI elements are in unexpected locations
- Planning with built-in uncertainty about what will happen next
- Supervisory frameworks that provide tools rather than fixed step sequences
The key insight: RPA automates execution; computer use agents automate reasoning about what execution should happen next. These are inverted problems. RPA says “here’s exactly what to do.” Computer use agents need to figure out what to do by looking at what’s actually there.
The Breakthrough: Supervisor Pattern With Tool-Based Architecture
Two months into our development cycle, we hit the wall. The planning-then-execution model simply wasn’t working. The turning point came when we realized something profound: there’s no value in having a fixed structure.
We fundamentally redesigned our approach around what we call the “supervisor pattern” with a tool-based architecture. Instead of generating a predetermined plan that the agent then executes, we took this approach:
- Supervisor Agent: We deployed a supervisor that interfaces with the task environment
- Tool Discovery: We gave the supervisor a defined set of tools it could use (browser automation, visual recognition, text extraction, form filling)
- Dynamic Execution: The supervisor itself decides which tools to invoke and in what sequence, based on real-time observations of the environment
This sounds simple, but the implications are profound. The supervisor doesn’t follow a pre-written playbook. It actively makes decisions at each step based on what it observes. If something unexpected happens – if a button moved, if an error message appeared, if the page layout changed – the supervisor adapts its approach. It’s not executing a plan; it’s problem-solving in real time.
What emerged from this architecture was something we didn’t initially expect: genuine creativity in task execution. Because the supervisor has multiple tools and the flexibility to combine them in novel ways, it discovers approaches we never would have anticipated. It’s not following predetermined choreography; it’s improvising in real time, constrained only by the available tools and its understanding of the goal.
This is the core insight: the better architecture is to give the agent a set of tools and let it be creative about how to use them, rather than trying to predict in advance every possible step it might need to take.
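One way to express “give the agent tools, not steps” in code is a small tool registry. The decorator, tool names, and signatures below are our own illustration, not the actual Donely API:

```python
# Sketch of a tool registry: the supervisor is handed capabilities,
# not a step sequence. Names and signatures are illustrative.

from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Register a function as a tool the supervisor may invoke."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("click")
def click(element: str) -> str:
    return f"clicked {element}"

@tool("extract_text")
def extract_text(region: str) -> str:
    return f"text from {region}"

@tool("fill_form")
def fill_form(field: str, value: str) -> str:
    return f"set {field}={value}"

# The supervisor chooses from TOOLS at runtime; which tool runs next
# depends on what it observes, not on a pre-written plan.
print(sorted(TOOLS))  # ['click', 'extract_text', 'fill_form']
```

The design choice that matters: the set of capabilities is fixed and auditable, while the sequence in which they are combined is left entirely to the agent.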
The Memory Layer: Scratchpad and Adaptive Planning
We realized something else during the redesign: agents need a different kind of planning than we initially built. It’s not a to-do list they generate once and then execute. It’s more like a working memory – what research teams call a “scratchpad.”
The scratchpad serves multiple purposes:
- Short-term context about what just happened (akin to human working memory)
- Hypothesis tracking about what the next action should be
- Observation logging to track what the agent has learned about the environment
- State updates as the agent moves through the task
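A scratchpad can be as simple as an append-only log that is rendered back into the model’s context each step. The structure and field names below are our own illustration of the idea, not a prescribed format:

```python
# A minimal scratchpad: observations, hypotheses, and actions logged
# as the agent works, then replayed into the next prompt.

from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    entries: list[dict] = field(default_factory=list)

    def note(self, kind: str, content: str) -> None:
        self.entries.append({"kind": kind, "content": content})

    def recent(self, n: int = 5) -> str:
        """Render the last n entries for inclusion in the next prompt."""
        return "\n".join(
            f"[{e['kind']}] {e['content']}" for e in self.entries[-n:]
        )

pad = Scratchpad()
pad.note("observation", "login page loaded; no 2FA prompt")
pad.note("hypothesis", "the submit button is below the password field")
pad.note("action", "clicked submit; dashboard appeared")
print(pad.recent())
```

Truncating to the most recent entries keeps the context window bounded while preserving the short-term reasoning trace the agent actually needs.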
Rather than a predetermined plan, the agent maintains what we might think of as continuous micro-planning – it evaluates its position, considers available tools, and decides the next action. The key difference: this planning happens in-loop with execution, not before it.
When we incorporated this scratchpad memory system, task completion rates improved dramatically. The agent wasn’t trying to rigidly follow a pre-committed path; it was continuously adapting based on new information. This mirrors how humans solve novel problems: we don’t generate a complete plan upfront, we work through it step-by-step, adjusting as we learn.
This insight connects to a powerful research pattern: the RAISE framework (Reasoning and Acting through Scratchpad and Examples) and similar architectures show that agents with memory layers that maintain context and reasoning traces significantly outperform agents without them. The scratchpad is not peripheral to the agent – it’s foundational.
Why We Had to Rewrite Everything
Halfway through, we realized we had built the system wrong. Fundamentally wrong. Not just poorly, but based on a misunderstanding of what computer use agents actually are. There was no way to salvage the architecture we had built – it was rotten at the foundation.
We rewrote the system. Completely.
This wasn’t unique to us. Anthropic – one of the world’s leading AI research organizations – has publicly discussed rewriting their systems multiple times during Claude’s development. When you discover that your core architectural assumption is flawed, sometimes the only option is to rebuild.
The rewrite taught us a broader lesson about software engineering. When dealing with unpredictable systems, where the operating environment is fundamentally uncertain, incremental refactoring often doesn’t work. You can’t patch a building with rotten foundations. Sometimes you have to rebuild.
But here’s what’s fascinating: the rewrite itself wasn’t a setback in the way traditional software engineering would suggest. It was actually the right move. Why? Because we learned something critical in those two months that couldn’t have been learned any other way. We discovered what works and, more importantly, what doesn’t work.
AI Engineering Is a Testing Job
This brings us to perhaps the most important insight from our journey: AI engineering differs from traditional software engineering in one crucial way. It is primarily a testing discipline.
Here’s what we mean: Traditional software engineering often follows a predictable pattern. You understand the problem, design the architecture, implement it, test it, and ship it. The intellectual work happens in the design phase.
With AI systems, it’s inverted. The intellectual work happens in the testing and experimentation phase.
We think of our process now like this: AI engineering is like a playground where you test, test, test different approaches. You try a supervisor pattern. You test it. You try a different memory system. You test it. You try different tool sets. You test it. Most of these attempts don’t work. Some work partially. A few work well. Once you get a breakthrough – once you discover an approach that genuinely works – you deploy engineers to turn that breakthrough into a production system.
The playground phase is not a preliminary step or a warm-up investigation. It is the core intellectual work. The testing phase is where understanding emerges.
This is why two months of exploration and rebuilding wasn’t a waste. We were doing AI engineering correctly – we were running experiments until we found something that worked. Only then did we move to the production implementation phase.
Think about how this maps to what companies like Anthropic have experienced. They haven’t done repeated rewrites because they’re bad at software engineering. They’ve done rewrites because they kept discovering better ways to structure agents, better memory systems, better reasoning patterns. Each rewrite represented a breakthrough.
The Architecture That Works: Supervisor + Tools + Adaptation
Here’s what our final architecture looks like:
The supervisor agent acts as the central orchestrator. It understands the task, observes the environment through screenshots and DOM analysis, and maintains a scratchpad of what it’s learned so far. It has access to a specific set of tools: click actions, text entry, visual recognition, page navigation, form filling, screenshot analysis.
Each step, the supervisor:
1. Observes the current state
2. Consults its scratchpad memory
3. Decides which tool to invoke
4. Executes the tool
5. Observes the result
6. Updates its scratchpad
7. Returns to step 1
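That loop can be sketched in code. Everything here is a hedged illustration: `llm_decide` is a stub standing in for the model call that picks a tool, and the toy environment transitions exist only so the loop is runnable end to end.

```python
# The supervisor loop above, as a runnable sketch. Names and the
# toy environment are illustrative, not a real agent framework.

def observe(env: dict) -> str:
    return env["screen"]

def llm_decide(observation: str, scratchpad: list[str]) -> tuple[str, str]:
    # Hypothetical decision stub; a real supervisor asks the model here,
    # passing the observation and the scratchpad as context.
    if "login" in observation:
        return ("fill_form", "credentials")
    return ("click", "dashboard link")

def run_tool(name: str, arg: str, env: dict) -> str:
    # Toy environment transition; real tools drive a browser or OS.
    env["screen"] = "dashboard" if name == "fill_form" else "report"
    return f"{name}({arg}) ok"

def supervise(env: dict, goal: str, max_steps: int = 10) -> list[str]:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        obs = observe(env)                       # observe current state
        if obs == goal:
            break
        tool, arg = llm_decide(obs, scratchpad)  # consult memory, pick tool
        result = run_tool(tool, arg, env)        # execute the tool
        scratchpad.append(f"{obs} -> {result}")  # log result, update memory
    return scratchpad                            # loop until goal or budget

log = supervise({"screen": "login page"}, goal="report")
print(log)
```

Note the `max_steps` budget: because the agent decides each action at runtime, the loop needs an explicit bound so an unexpected environment can’t spin it forever.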
The genius of this approach is its adaptability. If a UI element isn’t where it’s expected, the supervisor uses visual recognition to find it. If an error occurs, it tries an alternative approach. If the task requirements shift during execution, it adjusts.
This is not RPA. This is autonomous reasoning applied to computer automation.
The Lesson: Structure Is the Enemy of Creativity, Constraints Enable It
Here’s the counterintuitive insight we arrived at: there’s no value in imposing structure on an agent’s task decomposition.
When we tried to force the agent to follow a predetermined plan, we constrained its ability to adapt. But when we removed that rigid structure and instead provided:
- Clear tools it could use
- A working memory (scratchpad) to track reasoning
- Permission to adapt dynamically
…the agent became more structured, not less. It naturally developed internal patterns for approaching problems. It learned which tool combinations worked well. It built up a library of effective strategies.
The structure emerged from the freedom to experiment, not from imposing structure from above.
This maps to research on creative cognition. Humans don’t create well when given a fixed sequence of steps. They create well when given:
- A clear goal
- A set of available tools and resources
- Permission to combine them in novel ways
- A feedback loop to evaluate what works
The same applies to AI agents.
What This Means for Computer Use Agent Development
If you’re building computer use agents, here are the practical implications:
Don’t treat computer use like RPA. They’re fundamentally different. RPA is about predictable automation; computer use is about adaptive reasoning applied to digital environments.
Invest in memory systems early. Scratchpad or similar working memory isn’t optional – it’s foundational to effective multi-step reasoning.
Prefer tools over predetermined sequences. Give your agents a set of tools and let them reason about how to combine them, rather than predetermining exact execution sequences.
Build for adaptation. Your agent should monitor whether its assumptions about the environment hold true and be ready to try alternative approaches when they don’t.
Expect to iterate significantly on architecture. Two months of rebuilding is not a failure; it’s proper AI engineering (and we enjoyed the build anyway). The testing and experimentation phase is where understanding happens. Only move to production once you’ve converged on an approach that works.
Treat development as experimentation. Set up your team to test different architectural approaches rapidly. Treat every architecture as a hypothesis. The faster you can run experiments, the faster you’ll converge on something that works.
This is how AI engineering works. It’s different from traditional software engineering. Embrace that difference. Your breakthrough will come from the testing playground, not from getting the architecture right on the first try.

