Web Automation Agents: Browser Control with Claude and Computer Use
Your company needs data from a supplier portal that has no API. It’s a web app with dynamic content, multi-step forms, and a layout that changes every few months. Your Selenium script breaks every time they update the CSS. You maintain it like legacy code — fragile, expensive, and nobody wants to touch it.
A vision-based agent doesn’t care about CSS selectors. It sees the page, finds the “Search” button visually, clicks it, reads the results table, and extracts the data. When the layout changes, the agent adapts because it’s reading the page like a human — not parsing the DOM.
This article teaches you how to build web automation agents using Claude’s computer use capabilities. You’ll learn the core mechanics of visual automation, reliable navigation patterns, data extraction techniques, form-filling workflows, and error recovery strategies. By the end, you’ll have the patterns needed to build agents that interact with any web interface — from simple page reads to complex multi-step workflows.
Visual vs. Traditional Automation
Before building a vision-based agent, you need to understand when it makes sense — and when it doesn’t. Traditional automation tools like Selenium and Playwright aren’t going away. They’re fast, deterministic, and well-understood. The question is where each approach excels and where it breaks down.
Traditional Automation (Selenium/Playwright)
Traditional browser automation works by interacting with the DOM. You locate elements using CSS selectors or XPath expressions, then perform actions on them programmatically:
```python
# Traditional approach — fast but brittle
search_input = driver.find_element(By.CSS_SELECTOR, "#search-box-v3 > input.query")
search_input.send_keys("industrial bearings")
driver.find_element(By.CSS_SELECTOR, "button.search-submit-2024").click()
```

This is fast. It's deterministic. And it breaks the moment someone renames `search-box-v3` to `search-box-v4` or restructures the form layout. You end up maintaining a mapping of selectors that mirrors the site's internal structure — a structure you don't control.
Traditional automation also can’t handle visual-only content. If the data you need is rendered in a <canvas> element, embedded in an image, or displayed as a PDF within the browser, DOM selectors can’t reach it.
Vision-Based Automation (Computer Use)
Vision-based automation works the way a human does. The agent receives a screenshot, visually identifies the elements it needs, and issues mouse/keyboard actions at specific coordinates:
```python
# Vision-based approach — resilient but slower
# Agent sees the page, finds the search box visually, and types into it
# No selectors needed — it adapts to layout changes automatically
```

The tradeoff: it's slower (each action requires an API call with an image), more expensive (tokens for screenshots), and non-deterministic (the agent might interpret a screenshot differently between runs). But it's resilient to layout changes, works on any visual interface, and can understand page content semantically.
The Hybrid Approach
The most practical strategy combines both:
- Use traditional automation for stable, well-structured pages where you control the interface or it rarely changes.
- Use vision-based automation for dynamic pages, unfamiliar interfaces, visual content, or one-off tasks not worth maintaining scripts for.
- Use vision as a fallback — try a selector first; if it fails, fall back to visual identification.
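The fallback pattern can be sketched as a small wrapper. Everything here is illustrative — `selector_action` and `vision_action` are hypothetical stand-ins for your Selenium/Playwright call and your computer-use routine:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_vision_fallback(
    selector_action: Callable[[], T],
    vision_action: Callable[[], T],
) -> T:
    """Try the fast, deterministic selector path first; if it raises
    (element not found, layout changed), fall back to the vision agent."""
    try:
        return selector_action()
    except Exception as exc:
        print(f"Selector path failed ({exc!r}); falling back to vision.")
        return vision_action()

# Stand-in callables: the selector path fails because the class name
# changed, so the vision path handles the click instead.
def click_via_selector():
    raise RuntimeError("no element matches button.search-submit-2024")

def click_via_vision():
    return "clicked"

result = with_vision_fallback(click_via_selector, click_via_vision)
print(result)  # → clicked
```

The key design point: the selector path costs milliseconds and no tokens, so you only pay the vision price on the runs where the page actually changed.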
When to Choose Vision
Vision-based automation is the right choice when:
- No API is available and the web interface is your only option
- The page layout changes frequently and selector maintenance is too expensive
- Content is visual — charts, images, canvas elements, embedded PDFs
- Multi-step workflows require context — understanding what’s on screen to decide the next action
- One-off automations that don’t justify the engineering cost of a maintained Selenium script
- Unfamiliar interfaces where you can describe the task in natural language rather than mapping selectors
Claude Computer Use Basics
Claude’s computer use capability lets the model interact with a computer screen the same way a human would — by looking at screenshots and issuing mouse and keyboard actions. Understanding the mechanics is essential before building reliable agents.
How It Works
The computer use loop is straightforward:
1. Capture a screenshot of the current screen (or browser window)
2. Send the screenshot to Claude along with a task description
3. Receive a tool call — Claude tells you what action to perform (click, type, scroll)
4. Execute the action in the browser
5. Capture a new screenshot
6. Repeat until the task is complete or the agent signals it's done
Claude never directly controls the browser. Your code acts as the intermediary — receiving instructions from the model and executing them in the real environment.
Tool Definitions
Computer use relies on a specific tool definition — computer_20241022 — that describes the available actions:
```python
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
    "display_number": 1,
}
```

The available actions include:
- `screenshot` — Capture the current screen state
- `mouse_move` — Move the cursor to specific coordinates
- `left_click` / `right_click` / `double_click` — Click at the current cursor position
- `left_click_drag` — Click and drag to a target position
- `type` — Type a string of text
- `key` — Press a specific key or key combination (e.g., `Return`, `ctrl+a`)
- `scroll` — Scroll up or down at the current cursor position
The Action Loop
Here’s a complete implementation of the core computer use loop:
```python
import anthropic
import base64
import subprocess
import time

client = anthropic.Anthropic()

def capture_screenshot() -> str:
    """Capture screen and return base64-encoded PNG."""
    # Using scrot for X11; adapt for your environment
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def execute_action(action: dict):
    """Execute a computer use action using xdotool."""
    action_type = action.get("action")

    if action_type == "screenshot":
        return  # Screenshot is returned in the tool result below

    elif action_type == "mouse_move":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])

    elif action_type == "left_click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        subprocess.run(["xdotool", "click", "1"])

    elif action_type == "type":
        text = action["text"]
        subprocess.run(["xdotool", "type", "--clearmodifiers", text])

    elif action_type == "key":
        key = action["key"]
        subprocess.run(["xdotool", "key", key])

    elif action_type == "scroll":
        x, y = action["coordinate"]
        direction = action["direction"]
        amount = action["amount"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        button = "5" if direction == "down" else "4"
        for _ in range(amount):
            subprocess.run(["xdotool", "click", button])

    time.sleep(0.5)  # Brief pause after each action

def run_computer_use_agent(task: str, max_steps: int = 50):
    """Run a computer use agent loop."""
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        # Call Claude with the computer use tool. The beta header is
        # required for the computer_20241022 tool definition.
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            messages=messages,
            extra_headers={"anthropic-beta": "computer-use-2024-10-22"},
        )

        messages.append({"role": "assistant", "content": response.content})

        # Check if the agent is done
        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task completed."

        # Execute tool calls, then return the new screen state inside
        # each tool_result so Claude sees the effect of its action.
        # Keeping the screenshot in the tool_result (rather than a
        # separate user message) preserves user/assistant alternation.
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"Step {step}: {block.input.get('action')} "
                      f"{block.input.get('coordinate', '')}")
                execute_action(block.input)
                screenshot_b64 = capture_screenshot()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    }],
                })

        messages.append({"role": "user", "content": tool_results})

    return "Max steps reached."
```

Resolution and Coordinate System
Claude maps visual content to x,y pixel coordinates based on the screenshot dimensions you specify in the tool definition. A few critical details:
- Match your display resolution to the `display_width_px` and `display_height_px` in the tool definition. Mismatches cause clicks to land in the wrong places.
- Lower resolutions are better. A 1280×800 screenshot gives Claude enough detail to read text and identify UI elements while keeping token costs manageable. Don't send 4K screenshots.
- Coordinates are absolute — (0, 0) is the top-left corner of the screen.
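If the real display runs at a different resolution than the tool definition declares, coordinates Claude returns must be scaled before being passed to `xdotool`. A minimal sketch, assuming the 1280×800 tool definition above and a hypothetical 2560×1600 physical screen:

```python
def scale_coordinate(
    coord: tuple[int, int],
    tool_size: tuple[int, int] = (1280, 800),      # declared in the tool definition
    display_size: tuple[int, int] = (2560, 1600),  # actual screen (example value)
) -> tuple[int, int]:
    """Map a coordinate from the declared screenshot space to real pixels."""
    x, y = coord
    return (
        round(x * display_size[0] / tool_size[0]),
        round(y * display_size[1] / tool_size[1]),
    )

# A click Claude places at the center of its 1280×800 view
# lands at the center of the 2560×1600 screen.
print(scale_coordinate((640, 400)))  # → (1280, 800)
```

The same scaling applies in reverse if you downscale screenshots before sending them, which is usually the cheaper option.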
Reliable Navigation Patterns
Real-world websites are messy. Pages load asynchronously, popups appear unpredictably, and dynamic content changes the layout mid-interaction. A reliable automation agent needs patterns for handling all of this.
Wait for Load
The most common mistake in web automation — traditional or visual — is acting before the page is ready. Don’t use fixed time.sleep() calls. Instead, verify the page state visually:
```python
def wait_for_page_load(
    client: anthropic.Anthropic,
    expected_content: str,
    max_retries: int = 5,
    delay: float = 2.0,
) -> bool:
    """Wait for a page to load by checking for expected visual content."""
    for attempt in range(max_retries):
        screenshot_b64 = capture_screenshot()

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Is this page fully loaded and showing: "
                            f"'{expected_content}'? "
                            f"Reply with only 'yes' or 'no'."
                        ),
                    },
                ],
            }],
        )

        answer = response.content[0].text.strip().lower()
        if "yes" in answer:
            return True

        print(f"Page not ready (attempt {attempt + 1}/{max_retries}). Waiting...")
        time.sleep(delay)

    return False
```

Popup Handling
Cookie banners, notification dialogs, and chat widgets are the bane of web automation. A vision-based agent handles them naturally:
```python
def dismiss_popups(client: anthropic.Anthropic) -> bool:
    """Check for and dismiss any popup overlays."""
    screenshot_b64 = capture_screenshot()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Is there a popup, cookie banner, or overlay "
                        "blocking the page content? "
                        "Reply with only 'yes' or 'no'."
                    ),
                },
            ],
        }],
    )

    answer = response.content[0].text.strip().lower()
    if "yes" not in answer:
        return False  # Nothing to dismiss

    # A popup is present — let the agent loop close it visually
    run_computer_use_agent(
        "Close the popup, cookie banner, or overlay currently "
        "blocking the page content, then stop.",
        max_steps=5,
    )
    return True
```