March 7, 2026

Web Automation Agents: Controlling the Browser with Claude and Computer Use

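Your company needs to pull data from a vendor portal that has no API. It is a web application with dynamic content and multi-step forms, and its layout changes every few months. Your Selenium scripts break every time the vendor updates its CSS. You maintain them the way you maintain legacy code: fragile, expensive, and something nobody wants to touch.

A vision-based agent does not care about CSS selectors. It sees the page, finds the "Search" button visually, clicks it, reads the results table, and extracts the data. When the layout changes, the agent can adapt, because it reads the page the way a person does instead of parsing the DOM.

This article shows how to build web automation agents with Claude's computer use capability. It covers the core mechanics of vision-based automation, reliable navigation patterns, data extraction techniques, form-filling workflows, and error recovery strategies. By the end, you will have the patterns you need to build agents that can interact with any web interface, from simple page reads to complex multi-step workflows.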

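Vision-Based vs. Traditional Automation

Before building a vision-based agent, you need to know when it makes sense and when it does not. Traditional automation tools such as Selenium and Playwright are not going away. They are fast, deterministic, and well understood. The real question is what each approach is good at and where it breaks down.

Traditional Automation (Selenium/Playwright)

Traditional browser automation works by interacting with the DOM. You locate elements with CSS selectors or XPath expressions, then act on them programmatically: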

# Traditional approach — fast but brittle
search_input = driver.find_element(By.CSS_SELECTOR, "#search-box-v3 > input.query")
search_input.send_keys("industrial bearings")
driver.find_element(By.CSS_SELECTOR, "button.search-submit-2024").click()

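This approach is fast and deterministic. But the moment someone renames search-box-v3 to search-box-v4 or restructures the form layout, it breaks. You end up maintaining a selector map that mirrors the site's internal structure, a structure you do not control.

Traditional automation also cannot handle purely visual content. If the data you need is rendered in a <canvas> element, embedded in an image, or displayed as a PDF inside the browser, DOM selectors cannot reach it.

Vision-Based Automation (Computer Use)

Vision-based automation works the way a person does. The agent receives a screenshot, identifies the element it needs visually, and issues mouse and keyboard actions at specific coordinates: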

# Vision-based approach — resilient but slower
# Agent sees the page, finds the search box visually, and types into it
# No selectors needed — it adapts to layout changes automatically

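The trade-offs: it is slower (every action requires an API call that includes an image), more expensive (screenshots consume tokens), and non-deterministic (the agent may interpret the same screenshot differently from run to run). In return, it is resilient to layout changes, works with any visual interface, and understands page content at a semantic level.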
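
To put a rough number on the token cost, Anthropic's vision documentation gives (width × height) / 750 as an approximation of the input tokens one image consumes. The sketch below applies that approximation to a whole run. The function names are hypothetical, and the run estimate assumes the full screenshot history is resent on every call with no prompt caching and no pruning, so treat it as an upper-bound illustration, not a billing calculation.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Approximate input tokens for one image: (width * height) / 750."""
    return round(width_px * height_px / 750)


def estimate_run_tokens(width_px: int, height_px: int, steps: int) -> int:
    """Approximate screenshot tokens for a run that adds one screenshot per step.

    Assumption: the whole history is resent on every call, so the call at
    step k carries k screenshots, for 1 + 2 + ... + steps images in total.
    """
    images_sent = steps * (steps + 1) // 2
    return images_sent * estimate_image_tokens(width_px, height_px)


if __name__ == "__main__":
    print(estimate_image_tokens(1280, 800))
    print(estimate_run_tokens(1280, 800, steps=10))
```

Because the image count grows quadratically with the number of steps under this assumption, long runs are where screenshot cost dominates.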

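The Hybrid Approach

The most practical strategy combines the two:

Use traditional automation for stable, well-structured pages: interfaces you control or that rarely change.
Use vision-based automation for dynamic pages, unfamiliar interfaces, visual content, or one-off tasks that do not justify maintaining a script.
Use vision as a fallback: try the selector first, and if it fails, fall back to visual recognition.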
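
The fallback strategy in the last item can be sketched as a small wrapper. This is a minimal illustration, not part of any library: with_vision_fallback, selector_step, and vision_step are hypothetical names, and both steps are passed in as plain callables so the wrapper stays independent of Selenium or of the agent loop.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_vision_fallback(
    selector_step: Callable[[], T],
    vision_step: Callable[[], T],
    selector_errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[T, str]:
    """Run the fast selector-based step; fall back to the vision step on failure.

    Returns (result, path), where path is "selector" or "vision" so callers
    can log how often the fallback is actually needed.
    """
    try:
        return selector_step(), "selector"
    except selector_errors as exc:
        print(f"Selector step failed ({exc!r}); falling back to vision.")
        return vision_step(), "vision"
```

With Selenium you would likely narrow selector_errors to (NoSuchElementException,) so that unrelated bugs still surface. Logging the returned path shows which pages have drifted far enough that their selectors need attention.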

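When to Choose Vision

Vision-based automation is the right choice when:

No API is available and the web interface is your only option
The page layout changes frequently and selector maintenance costs too much
The content is visual: charts, images, canvas elements, embedded PDFs
A multi-step workflow needs context, meaning the agent must understand what is on screen to decide the next action
The automation is a one-off that does not justify the engineering cost of maintaining Selenium scripts
The interface is unfamiliar and you would rather describe the task in natural language than map out selectors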

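Claude Computer Use Basics

Claude's computer use capability lets the model interact with a computer screen the way a person does: by looking at screenshots and issuing mouse and keyboard actions. Understanding how it works is essential before you build a reliable agent.

How It Works

The computer use loop is straightforward:

Capture a screenshot of the current screen (or browser window)
Send the screenshot to Claude along with the task description
Receive a tool call in which Claude tells you what action to perform (click, type, scroll)
Execute the action in the browser
Capture a new screenshot
Repeat until the task is complete or the agent signals that it is done

Claude never controls the browser directly. Your code acts as the intermediary: it receives instructions from the model and executes them in the real environment.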

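Tool Definition

Computer use relies on a versioned tool definition that describes the available actions. Claude Sonnet 4, the model used in this article, pairs with computer_20250124. Computer use is a beta feature, so requests must also carry the matching beta flag (computer-use-2025-01-24):

computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
    "display_number": 1,
}

The available actions include:

screenshot: capture the current screen state
mouse_move: move the cursor to specific coordinates
left_click / right_click / double_click: click at the given coordinates
left_click_drag: click and drag to a target position
type: type a string of text
key: press a key or key combination (for example Return or ctrl+a)
scroll: scroll in a given direction at the given coordinates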

The Action Loop

Here is an implementation of the core computer use loop:

import anthropic
import base64
import subprocess
import time

client = anthropic.Anthropic()

def capture_screenshot() -> str:
    """Capture screen and return base64-encoded PNG."""
    # Using scrot for X11; adapt for your environment
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

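def execute_action(action: dict):
    """Execute a computer use action using xdotool."""
    action_type = action.get("action")
    # X11 button numbers: 1 = left, 3 = right; 4-7 are the scroll directions
    click_buttons = {"left_click": "1", "right_click": "3", "double_click": "1"}
    scroll_buttons = {"up": "4", "down": "5", "left": "6", "right": "7"}

    def move_to(coordinate):
        x, y = coordinate
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)

    if action_type == "screenshot":
        return  # The main loop captures the screenshot itself

    elif action_type == "mouse_move":
        move_to(action["coordinate"])

    elif action_type in click_buttons:
        # Without a coordinate, click at the current cursor position
        if action.get("coordinate"):
            move_to(action["coordinate"])
        repeat = "2" if action_type == "double_click" else "1"
        subprocess.run(
            ["xdotool", "click", "--repeat", repeat, click_buttons[action_type]],
            check=True,
        )

    elif action_type == "type":
        subprocess.run(
            ["xdotool", "type", "--clearmodifiers", action["text"]], check=True
        )

    elif action_type == "key":
        # The key name or combination (e.g. "Return", "ctrl+a") arrives in "text"
        subprocess.run(["xdotool", "key", action["text"]], check=True)

    elif action_type == "scroll":
        if action.get("coordinate"):
            move_to(action["coordinate"])
        button = scroll_buttons[action["scroll_direction"]]
        subprocess.run(
            ["xdotool", "click", "--repeat", str(action["scroll_amount"]), button],
            check=True,
        )

    else:
        print(f"Unsupported action skipped: {action_type}")

    time.sleep(0.5)  # Brief pause so the UI can react before the next screenshot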

# Each computer tool version is enabled by its own beta flag
BETA_FLAGS = {
    "computer_20241022": "computer-use-2024-10-22",
    "computer_20250124": "computer-use-2025-01-24",
}

def screenshot_block() -> dict:
    """Capture the screen and wrap it as an image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": capture_screenshot(),
        },
    }

def run_computer_use_agent(task: str, max_steps: int = 50):
    """Run a computer use agent loop."""
    # Send the task together with a screenshot of the starting state
    messages = [{
        "role": "user",
        "content": [{"type": "text", "text": task}, screenshot_block()],
    }]

    for step in range(max_steps):
        # Computer use is a beta feature, so the call goes through client.beta
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=[computer_tool],
            betas=[BETA_FLAGS[computer_tool["type"]]],
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        # With no tool calls in the response, the agent has finished or stopped
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:
            for block in response.content:
                if block.type == "text":
                    return block.text
            return "Task completed."

        # Execute each tool call and return a fresh screenshot as its result
        tool_results = []
        for block in tool_uses:
            print(f"Step {step}: {block.input.get('action')} "
                  f"{block.input.get('coordinate', '')}")
            execute_action(block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [screenshot_block()],
            })

        messages.append({"role": "user", "content": tool_results})

    return "Max steps reached."
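
Because the loop appends a screenshot on every step, the message history grows quickly. One common mitigation, sketched here under assumptions, is to keep only the most recent screenshots and replace older ones with a short text placeholder before each API call. prune_old_screenshots is a hypothetical helper, not part of the Anthropic SDK. It only touches user messages whose content is a list of plain dicts, and it handles image blocks placed either directly in the message or inside a tool_result.

```python
def prune_old_screenshots(messages: list, keep_last: int = 3) -> int:
    """Replace all but the newest `keep_last` screenshots with a text placeholder.

    Mutates `messages` in place and returns the number of images replaced.
    """
    placeholder = {"type": "text", "text": "[older screenshot omitted]"}
    seen = 0
    replaced = 0

    def prune_blocks(blocks: list) -> None:
        nonlocal seen, replaced
        # Walk newest-first so the most recent screenshots are the ones kept
        for i in range(len(blocks) - 1, -1, -1):
            block = blocks[i]
            if not isinstance(block, dict):
                continue  # SDK objects are left alone
            if block.get("type") == "image":
                seen += 1
                if seen > keep_last:
                    blocks[i] = dict(placeholder)
                    replaced += 1
            elif block.get("type") == "tool_result" and isinstance(
                block.get("content"), list
            ):
                prune_blocks(block["content"])

    for message in reversed(messages):
        if message.get("role") == "user" and isinstance(message.get("content"), list):
            prune_blocks(message["content"])
    return replaced
```

Calling prune_old_screenshots(messages) just before the API call inside the loop would cap the number of images per request at keep_last. The trade-off is that the agent loses visual memory of earlier pages, so a larger keep_last may suit workflows that refer back to earlier screens.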

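Resolution and the Coordinate System

Claude maps visual content to x, y pixel coordinates based on the screenshot dimensions you specify in the tool definition. A few key details:

Make sure your display resolution matches display_width_px and display_height_px in the tool definition. A mismatch causes clicks to land in the wrong place.
Lower resolutions work better. A 1280×800 screenshot gives Claude enough detail to read text and identify UI elements while keeping token costs manageable. Do not send 4K screenshots.
Coordinates are absolute: (0, 0) is the top-left corner of the screen.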
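
If the physical display cannot run at the declared size (for example, a 2560×1600 screen whose screenshots you downscale to 1280×800 before sending), the coordinates Claude returns have to be scaled back up before you click. The following is a minimal sketch of that mapping. scale_to_screen is a hypothetical helper, and the default sizes are illustrative assumptions, not values the API requires.

```python
from typing import Tuple


def scale_to_screen(
    x: int,
    y: int,
    declared: Tuple[int, int] = (1280, 800),
    actual: Tuple[int, int] = (2560, 1600),
) -> Tuple[int, int]:
    """Map a coordinate in the declared (screenshot) space to the real screen."""
    declared_w, declared_h = declared
    actual_w, actual_h = actual
    if not (0 <= x < declared_w and 0 <= y < declared_h):
        raise ValueError(
            f"({x}, {y}) is outside the declared {declared_w}x{declared_h} display"
        )
    return round(x * actual_w / declared_w), round(y * actual_h / declared_h)
```

The bounds check is deliberate: a coordinate outside the declared display usually means the tool definition and the screenshot size have drifted apart, which is easier to debug as an exception than as a click in the wrong place.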

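Reliable Navigation Patterns

Real-world websites are messy. Pages load asynchronously, popups appear unpredictably, and dynamic content shifts the layout mid-interaction. A reliable automation agent needs patterns for handling all of this.

Waiting for Pages to Load

The most common mistake in web automation, traditional or vision-based, is acting before the page is ready. Do not rely on fixed time.sleep() calls. Instead, verify the page state visually: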

def wait_for_page_load(
    client: anthropic.Anthropic,
    expected_content: str,
    max_retries: int = 5,
    delay: float = 2.0,
) -> bool:
    """Wait for a page to load by checking for expected visual content."""
    for attempt in range(max_retries):
        screenshot_b64 = capture_screenshot()

        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Is this page fully loaded and showing: "
                            f"'{expected_content}'? "
                            f"Reply with only 'yes' or 'no'."
                        ),
                    },
                ],
            }],
        )

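        # Require the reply to start with "yes" so an answer that merely
        # mentions the word later (after a "no") cannot count as ready
        answer = response.content[0].text.strip().lower()
        if answer.startswith("yes"):
            return True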

        print(f"Page not ready (attempt {attempt + 1}/{max_retries}). Waiting...")
        time.sleep(delay)

    return False
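
Each check above costs one API call. As a complement, a cheap local pre-check can wait until the screen stops changing before the model is asked anything. The sketch below is an assumption-laden illustration and not part of the original workflow: wait_until_stable is a hypothetical helper, it takes the capture function as a parameter, and it treats the screen as stable when consecutive captures are byte-identical. Pages with blinking cursors or animations will never pass this test, so the timeout matters.

```python
import hashlib
import time
from typing import Callable, Optional


def wait_until_stable(
    capture: Callable[[], bytes],
    stable_checks: int = 2,
    interval: float = 0.5,
    timeout: float = 15.0,
) -> bool:
    """Return True once a capture has matched its predecessor
    `stable_checks` times in a row; return False on timeout."""
    deadline = time.monotonic() + timeout
    previous: Optional[str] = None
    streak = 0
    while time.monotonic() < deadline:
        digest = hashlib.sha256(capture()).hexdigest()
        streak = streak + 1 if digest == previous else 0
        if streak >= stable_checks:
            return True
        previous = digest
        time.sleep(interval)
    return False
```

With the helpers from this article, a call might look like wait_until_stable(lambda: capture_screenshot().encode()), followed by wait_for_page_load to confirm that the stable screen shows the expected content.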

Handling Popups

Cookie banners, notification dialogs, and chat widgets are the bane of web automation. A vision-based agent can handle them naturally:

def dismiss_popups(client: anthropic.Anthropic) -> bool:
    """Check for and dismiss any popup overlays."""
    screenshot_b64 = capture_screenshot()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=
