Web Automation Agents: Browser Control with Claude and Computer Use
Your company needs data from a supplier portal that has no API. It’s a web app with dynamic content, multi-step forms, and a layout that changes every few months. Your Selenium script breaks every time they update the CSS. You maintain it like legacy code — fragile, expensive, and nobody wants to touch it.
A vision-based agent doesn’t care about CSS selectors. It sees the page, finds the “Search” button visually, clicks it, reads the results table, and extracts the data. When the layout changes, the agent adapts because it’s reading the page like a human — not parsing the DOM.
This article teaches you how to build web automation agents using Claude’s computer use capabilities. You’ll learn the core mechanics of visual automation, reliable navigation patterns, data extraction techniques, form-filling workflows, and error recovery strategies. By the end, you’ll have the patterns needed to build agents that interact with any web interface — from simple page reads to complex multi-step workflows.
Visual vs. Traditional Automation
Before building a vision-based agent, you need to understand when it makes sense — and when it doesn’t. Traditional automation tools like Selenium and Playwright aren’t going away. They’re fast, deterministic, and well-understood. The question is where each approach excels and where it breaks down.
Traditional Automation (Selenium/Playwright)
Traditional browser automation works by interacting with the DOM. You locate elements using CSS selectors or XPath expressions, then perform actions on them programmatically:
```python
# Traditional approach — fast but brittle
search_input = driver.find_element(By.CSS_SELECTOR, "#search-box-v3 > input.query")
search_input.send_keys("industrial bearings")
driver.find_element(By.CSS_SELECTOR, "button.search-submit-2024").click()
```

This is fast. It's deterministic. And it breaks the moment someone renames `search-box-v3` to `search-box-v4` or restructures the form layout. You end up maintaining a mapping of selectors that mirrors the site's internal structure — a structure you don't control.
Traditional automation also can’t handle visual-only content. If the data you need is rendered in a <canvas> element, embedded in an image, or displayed as a PDF within the browser, DOM selectors can’t reach it.
Vision-Based Automation (Computer Use)
Vision-based automation works the way a human does. The agent receives a screenshot, visually identifies the elements it needs, and issues mouse/keyboard actions at specific coordinates:
```python
# Vision-based approach — resilient but slower
# Agent sees the page, finds the search box visually, and types into it
# No selectors needed — it adapts to layout changes automatically
```

The tradeoff: it's slower (each action requires an API call with an image), more expensive (tokens for screenshots), and non-deterministic (the agent might interpret a screenshot differently between runs). But it's resilient to layout changes, works on any visual interface, and can understand page content semantically.
The Hybrid Approach
The most practical strategy combines both:
- Use traditional automation for stable, well-structured pages where you control the interface or it rarely changes.
- Use vision-based automation for dynamic pages, unfamiliar interfaces, visual content, or one-off tasks not worth maintaining scripts for.
- Use vision as a fallback — try a selector first; if it fails, fall back to visual identification.
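The fallback pattern can be sketched as a small wrapper. Everything here is illustrative — `selector_action` and `vision_action` are hypothetical stand-ins for your Selenium/Playwright call and your computer-use routine:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_vision_fallback(
    selector_action: Callable[[], T],
    vision_action: Callable[[], T],
) -> T:
    """Try the fast, deterministic selector path first; if it raises
    (element not found, layout changed), fall back to the vision agent."""
    try:
        return selector_action()
    except Exception as exc:
        print(f"Selector path failed ({exc!r}); falling back to vision.")
        return vision_action()

# Stand-in callables: the selector path fails because the class name
# changed, so the vision path handles the click instead.
def click_via_selector():
    raise RuntimeError("no element matches button.search-submit-2024")

def click_via_vision():
    return "clicked"

result = with_vision_fallback(click_via_selector, click_via_vision)
print(result)  # → clicked
```

The key design point: the selector path costs milliseconds and no tokens, so you only pay the vision price on the runs where the page actually changed.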
When to Choose Vision
Vision-based automation is the right choice when:
- No API is available and the web interface is your only option
- The page layout changes frequently and selector maintenance is too expensive
- Content is visual — charts, images, canvas elements, embedded PDFs
- Multi-step workflows require context — understanding what’s on screen to decide the next action
- One-off automations that don’t justify the engineering cost of a maintained Selenium script
- Unfamiliar interfaces where you can describe the task in natural language rather than mapping selectors
Claude Computer Use Basics
Claude’s computer use capability lets the model interact with a computer screen the same way a human would — by looking at screenshots and issuing mouse and keyboard actions. Understanding the mechanics is essential before building reliable agents.
How It Works
The computer use loop is straightforward:
1. Capture a screenshot of the current screen (or browser window)
2. Send the screenshot to Claude along with a task description
3. Receive a tool call — Claude tells you what action to perform (click, type, scroll)
4. Execute the action in the browser
5. Capture a new screenshot
6. Repeat until the task is complete or the agent signals it's done
Claude never directly controls the browser. Your code acts as the intermediary — receiving instructions from the model and executing them in the real environment.
Tool Definitions
Computer use relies on a specific tool definition — computer_20241022 — that describes the available actions:
```python
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
    "display_number": 1,
}
```

The available actions include:
- `screenshot` — Capture the current screen state
- `mouse_move` — Move the cursor to specific coordinates
- `left_click` / `right_click` / `double_click` — Click at the current cursor position
- `left_click_drag` — Click and drag to a target position
- `type` — Type a string of text
- `key` — Press a specific key or key combination (e.g., `Return`, `ctrl+a`)
- `scroll` — Scroll up or down at the current cursor position
The Action Loop
Here’s a complete implementation of the core computer use loop:
```python
import anthropic
import base64
import subprocess
import time

client = anthropic.Anthropic()

def capture_screenshot() -> str:
    """Capture screen and return base64-encoded PNG."""
    # Using scrot for X11; adapt for your environment
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def execute_action(action: dict):
    """Execute a computer use action using xdotool."""
    action_type = action.get("action")

    if action_type == "screenshot":
        return  # Screenshot is returned in the tool result below

    elif action_type == "mouse_move":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])

    elif action_type == "left_click":
        x, y = action["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        subprocess.run(["xdotool", "click", "1"])

    elif action_type == "type":
        text = action["text"]
        subprocess.run(["xdotool", "type", "--clearmodifiers", text])

    elif action_type == "key":
        key = action["key"]
        subprocess.run(["xdotool", "key", key])

    elif action_type == "scroll":
        x, y = action["coordinate"]
        direction = action["direction"]
        amount = action["amount"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        button = "5" if direction == "down" else "4"
        for _ in range(amount):
            subprocess.run(["xdotool", "click", button])

    time.sleep(0.5)  # Brief pause after each action

def run_computer_use_agent(task: str, max_steps: int = 50):
    """Run a computer use agent loop."""
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        # Call Claude with the computer use tool. The beta header is
        # required for the computer_20241022 tool definition.
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            messages=messages,
            extra_headers={"anthropic-beta": "computer-use-2024-10-22"},
        )

        messages.append({"role": "assistant", "content": response.content})

        # Check if the agent is done
        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task completed."

        # Execute tool calls, then return the new screen state inside
        # each tool_result so Claude sees the effect of its action.
        # Keeping the screenshot in the tool_result (rather than a
        # separate user message) preserves user/assistant alternation.
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"Step {step}: {block.input.get('action')} "
                      f"{block.input.get('coordinate', '')}")
                execute_action(block.input)
                screenshot_b64 = capture_screenshot()
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    }],
                })

        messages.append({"role": "user", "content": tool_results})

    return "Max steps reached."
```

Resolution and Coordinate System
Claude maps visual content to x,y pixel coordinates based on the screenshot dimensions you specify in the tool definition. A few critical details:
- Match your display resolution to the `display_width_px` and `display_height_px` in the tool definition. Mismatches cause clicks to land in the wrong places.
- Lower resolutions are better. A 1280×800 screenshot gives Claude enough detail to read text and identify UI elements while keeping token costs manageable. Don't send 4K screenshots.
- Coordinates are absolute — (0, 0) is the top-left corner of the screen.
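If the real display runs at a different resolution than the tool definition declares, coordinates Claude returns must be scaled before being passed to `xdotool`. A minimal sketch, assuming the 1280×800 tool definition above and a hypothetical 2560×1600 physical screen:

```python
def scale_coordinate(
    coord: tuple[int, int],
    tool_size: tuple[int, int] = (1280, 800),      # declared in the tool definition
    display_size: tuple[int, int] = (2560, 1600),  # actual screen (example value)
) -> tuple[int, int]:
    """Map a coordinate from the declared screenshot space to real pixels."""
    x, y = coord
    return (
        round(x * display_size[0] / tool_size[0]),
        round(y * display_size[1] / tool_size[1]),
    )

# A click Claude places at the center of its 1280×800 view
# lands at the center of the 2560×1600 screen.
print(scale_coordinate((640, 400)))  # → (1280, 800)
```

The same scaling applies in reverse if you downscale screenshots before sending them, which is usually the cheaper option.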
Reliable Navigation Patterns
Real-world websites are messy. Pages load asynchronously, popups appear unpredictably, and dynamic content changes the layout mid-interaction. A reliable automation agent needs patterns for handling all of this.
Wait for Load
The most common mistake in web automation — traditional or visual — is acting before the page is ready. Don’t use fixed time.sleep() calls. Instead, verify the page state visually:
```python
def wait_for_page_load(
    client: anthropic.Anthropic,
    expected_content: str,
    max_retries: int = 5,
    delay: float = 2.0,
) -> bool:
    """Wait for a page to load by checking for expected visual content."""
    for attempt in range(max_retries):
        screenshot_b64 = capture_screenshot()

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Is this page fully loaded and showing: "
                            f"'{expected_content}'? "
                            f"Reply with only 'yes' or 'no'."
                        ),
                    },
                ],
            }],
        )

        answer = response.content[0].text.strip().lower()
        if "yes" in answer:
            return True

        print(f"Page not ready (attempt {attempt + 1}/{max_retries}). Waiting...")
        time.sleep(delay)

    return False
```

Popup Handling
Cookie banners, notification dialogs, and chat widgets are the bane of web automation. A vision-based agent handles them naturally:
```python
def dismiss_popups(client: anthropic.Anthropic) -> bool:
    """Check for and dismiss any popup overlays."""
    screenshot_b64 = capture_screenshot()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Is there a popup, cookie banner, or overlay "
                        "blocking the page content? "
                        "Reply with only 'yes' or 'no'."
                    ),
                },
            ],
        }],
    )

    answer = response.content[0].text.strip().lower()
    if "yes" not in answer:
        return False  # Nothing to dismiss

    # A popup is present — let the agent loop close it visually
    run_computer_use_agent(
        "Close the popup, cookie banner, or overlay currently "
        "blocking the page content, then stop.",
        max_steps=5,
    )
    return True
```