
Claude Computer Use: Build AI Agents That Control Your Screen

Master Claude's Computer Use API to build AI agents that see screens, click elements, and automate tasks. Complete Python tutorial with working code examples.

Moshiour Rahman

What is Claude Computer Use?

Claude Computer Use is Anthropic’s groundbreaking feature that enables AI to interact with computers like a human does—viewing the screen, moving the mouse, clicking elements, and typing text. Released in October 2024, this capability transforms Claude from a chat assistant into an autonomous screen agent.

| Capability | Description |
|---|---|
| Screen Vision | Claude sees and understands desktop screenshots |
| Mouse Control | Move cursor, click, drag, and scroll |
| Keyboard Input | Type text and press key combinations |
| Task Automation | Complete multi-step workflows autonomously |

Why Computer Use Matters

Traditional automation tools like Selenium or Puppeteer require explicit selectors and break when UIs change. Claude Computer Use works differently—it understands what it sees, making it:

  • Resilient: Adapts to UI changes automatically
  • Intelligent: Makes decisions based on visual context
  • Universal: Works with any application, not just browsers
  • Natural: Accepts plain English instructions

Use Cases for Computer Use Agents

| Use Case | Example |
|---|---|
| Data Entry | Fill forms across multiple applications |
| Testing | Visual regression testing without selectors |
| Monitoring | Check dashboards and report anomalies |
| Research | Navigate websites and extract information |
| Onboarding | Automate repetitive setup tasks |

Prerequisites

Before building your first computer use agent, ensure you have:

# Python 3.10+
python --version

# Anthropic API key
export ANTHROPIC_API_KEY="your-api-key"

# Install dependencies
pip install anthropic pillow

# macOS input driver used by the examples below
brew install cliclick

API Access

Computer Use requires a model that supports the computer tool (originally claude-3-5-sonnet-20241022 with the computer_20241022 tool type; newer models such as claude-sonnet-4-20250514 also support it) together with the matching beta header:

import anthropic

client = anthropic.Anthropic()

# Beta tools for computer use
COMPUTER_USE_TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    }
]

Understanding the Computer Use API

The Agentic Loop

Claude Computer Use operates in a continuous loop:

  1. Observe: Take a screenshot of the current screen
  2. Reason: Analyze the screenshot and decide next action
  3. Act: Execute mouse/keyboard commands
  4. Repeat: Continue until task is complete

def run_agent_loop(task: str, max_iterations: int = 20):
    """Execute an agentic loop for a given task."""
    messages = [{"role": "user", "content": task}]

    for iteration in range(max_iterations):
        # Beta features use the client.beta namespace in the Python SDK
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=COMPUTER_USE_TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"]
        )

        # The task is complete when Claude stops requesting tools
        if response.stop_reason == "end_turn":
            return extract_text_response(response)  # helper: pull text blocks

        # Execute requested actions and feed results back
        if response.stop_reason == "tool_use":
            tool_results = process_tool_calls(response)  # helper: run each tool call
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    return "Max iterations reached"

Tool Types

Claude can request three types of computer interactions:

| Tool | Purpose | Parameters |
|---|---|---|
| computer | Screen control | action, coordinate, text |
| text_editor | File editing | command, path, content |
| bash | Shell commands | command |
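All three tools can be enabled in one request. Here is a sketch of a combined tool list — the type strings follow the 2024-10-22 beta naming and should be checked against the current API docs, since later API revisions use updated versions:

```python
# Combined tool list for the computer-use-2024-10-22 beta.
# "str_replace_editor" and "bash" are the fixed names for the
# text editor and shell tools in that beta.
ALL_TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1,
    },
    {"type": "text_editor_20241022", "name": "str_replace_editor"},
    {"type": "bash_20241022", "name": "bash"},
]

print([tool["name"] for tool in ALL_TOOLS])
```

Pass `ALL_TOOLS` as the `tools` argument when you want Claude to edit files or run shell commands in addition to controlling the screen.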

Action Types

The computer tool supports these actions:

COMPUTER_ACTIONS = {
    "screenshot": "Capture the current screen",
    "mouse_move": "Move cursor to coordinates",
    "left_click": "Click left mouse button",
    "right_click": "Click right mouse button", 
    "double_click": "Double-click left button",
    "left_click_drag": "Click, drag, and release",
    "type": "Type text string",
    "key": "Press key combination",
    "scroll": "Scroll up or down",
    "cursor_position": "Get current cursor position"
}
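These actions arrive as `tool_use` payloads that your loop must dispatch on. A minimal dispatch sketch — the payload shown is hypothetical, but matches the parameter shapes listed above:

```python
# A representative computer-tool request as the agent loop sees it.
tool_use_input = {"action": "left_click", "coordinate": [640, 360]}

def describe(params: dict) -> str:
    """Render a computer-tool request as human-readable text."""
    action = params["action"]
    if "coordinate" in params:
        x, y = params["coordinate"]
        return f"{action} at ({x}, {y})"
    if "text" in params:
        return f"{action}: {params['text']!r}"
    return action  # e.g. "screenshot" takes no parameters

print(describe(tool_use_input))                    # left_click at (640, 360)
print(describe({"action": "type", "text": "hi"}))  # type: 'hi'
```

A real agent replaces the string formatting with calls into a screen controller, as shown in the next section.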

Building a Basic Screen Agent

Let’s build a complete screen agent step by step.

Project Structure

claude-computer-use-tutorial/
├── src/
│   ├── __init__.py
│   ├── basic_agent.py
│   ├── screen_controller.py
│   ├── form_automation.py
│   └── utils.py
├── .env.example
├── pyproject.toml
└── README.md

Screen Controller

First, create a wrapper for screen interactions:

# src/screen_controller.py
import base64
import subprocess
import tempfile
from io import BytesIO
from PIL import Image

class ScreenController:
    """Handle screen capture and input simulation."""
    
    def __init__(self, width: int = 1920, height: int = 1080):
        self.width = width
        self.height = height
    
    def take_screenshot(self) -> str:
        """Capture screen and return base64-encoded image."""
        # macOS: screencapture writes to a file, not stdout
        with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
            subprocess.run(
                ["screencapture", "-x", "-C", "-t", "png", tmp.name],
                check=True
            )
            image = Image.open(tmp.name)
            image.load()  # read pixel data before the temp file is deleted
        
        # Resize if needed
        if image.size != (self.width, self.height):
            image = image.resize(
                (self.width, self.height),
                Image.LANCZOS
            )
        
        # Convert to base64
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        return base64.standard_b64encode(buffer.getvalue()).decode()
    
    def mouse_move(self, x: int, y: int):
        """Move mouse to coordinates using cliclick (brew install cliclick)."""
        subprocess.run(["cliclick", f"m:{x},{y}"])
    
    def click(self, x: int, y: int, button: str = "left"):
        """Click at coordinates."""
        # Using cliclick for macOS (brew install cliclick)
        cmd = "c" if button == "left" else "rc"
        subprocess.run(["cliclick", f"{cmd}:{x},{y}"])
    
    def double_click(self, x: int, y: int):
        """Double-click at coordinates."""
        subprocess.run(["cliclick", f"dc:{x},{y}"])
    
    def type_text(self, text: str):
        """Type text string."""
        subprocess.run(["cliclick", f"t:{text}"])
    
    def press_key(self, key: str):
        """Press a key combination."""
        # Map X11-style key names to cliclick key names
        key_map = {
            "Return": "kp:return",
            "Tab": "kp:tab",
            "Escape": "kp:esc",
            "BackSpace": "kp:delete",
            "space": "kp:space"
        }
        cmd = key_map.get(key, f"kp:{key.lower()}")
        subprocess.run(["cliclick", cmd])
    
    def scroll(self, x: int, y: int, direction: str, amount: int = 3):
        """Scroll at position.

        Scroll commands are not available in every cliclick version,
        so Page Up/Down key presses are used as a portable fallback.
        The page keys go to the focused window, so click first if the
        scroll target does not have focus.
        """
        subprocess.run(["cliclick", f"m:{x},{y}"])
        key = "kp:page-up" if direction == "up" else "kp:page-down"
        for _ in range(amount):
            subprocess.run(["cliclick", key])

Basic Agent Implementation

Now implement the main agent:

# src/basic_agent.py
import anthropic
from screen_controller import ScreenController

class ComputerUseAgent:
    """AI agent that can see and control the computer."""
    
    def __init__(self, display_width: int = 1920, display_height: int = 1080):
        self.client = anthropic.Anthropic()
        self.screen = ScreenController(display_width, display_height)
        self.display_width = display_width
        self.display_height = display_height
        
        self.tools = [
            {
                "type": "computer_20241022",
                "name": "computer",
                "display_width_px": display_width,
                "display_height_px": display_height,
                "display_number": 1
            }
        ]
    
    def process_tool_call(self, tool_use) -> dict:
        """Execute a tool call and return the result."""
        name = tool_use.name
        params = tool_use.input
        
        if name != "computer":
            return {"error": f"Unknown tool: {name}"}
        
        action = params.get("action")
        
        if action == "screenshot":
            screenshot = self.screen.take_screenshot()
            return {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot
                        }
                    }
                ]
            }
        
        elif action == "mouse_move":
            x, y = params["coordinate"]
            self.screen.mouse_move(x, y)
            
        elif action == "left_click":
            # Check for None explicitly: a legitimate coordinate of 0 is falsy
            coord = params.get("coordinate")
            if coord is not None:
                self.screen.click(coord[0], coord[1], "left")
                
        elif action == "right_click":
            coord = params.get("coordinate")
            if coord is not None:
                self.screen.click(coord[0], coord[1], "right")
                
        elif action == "double_click":
            coord = params.get("coordinate")
            if coord is not None:
                self.screen.double_click(coord[0], coord[1])
                
        elif action == "type":
            self.screen.type_text(params["text"])
            
        elif action == "key":
            self.screen.press_key(params["text"])
            
        elif action == "scroll":
            x, y = params["coordinate"]
            direction = params.get("direction", "down")
            self.screen.scroll(x, y, direction)
        
        else:
            return {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": f"Unknown action: {action}",
                "is_error": True
            }
        
        # Return confirmation with fresh screenshot
        screenshot = self.screen.take_screenshot()
        return {
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": [
                {
                    "type": "text",
                    "text": f"Action '{action}' completed successfully."
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot
                    }
                }
            ]
        }
    
    def run(self, task: str, max_iterations: int = 25) -> str:
        """Run the agent to complete a task."""
        print(f"Starting task: {task}")
        
        # Get initial screenshot
        initial_screenshot = self.screen.take_screenshot()
        
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": task
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": initial_screenshot
                        }
                    }
                ]
            }
        ]
        
        system_prompt = """You are a computer use agent. You can see the screen and control the mouse and keyboard to complete tasks.

Guidelines:
1. Always take a screenshot first to see the current state
2. Think step by step about what you need to do
3. Click precisely on UI elements you can see
4. Wait for pages/applications to load after actions
5. Report when the task is complete

Be careful and methodical. If something doesn't work, try an alternative approach."""
        
        for iteration in range(max_iterations):
            print(f"Iteration {iteration + 1}/{max_iterations}")
            
            response = self.client.beta.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                system=system_prompt,
                tools=self.tools,
                messages=messages,
                betas=["computer-use-2024-10-22"]
            )
            
            # Check for completion
            if response.stop_reason == "end_turn":
                for block in response.content:
                    if hasattr(block, "text"):
                        return block.text
                return "Task completed"
            
            # Process tool uses
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  Action: {block.input.get('action', 'unknown')}")
                    result = self.process_tool_call(block)
                    tool_results.append(result)
            
            if tool_results:
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})
        
        return "Max iterations reached without completing task"


# Usage example
if __name__ == "__main__":
    agent = ComputerUseAgent()
    result = agent.run("Open Safari and search for 'Python tutorials'")
    print(f"Result: {result}")

Real-World Example: Form Automation

Here’s a practical example that fills out a web form:

# src/form_automation.py
from basic_agent import ComputerUseAgent

def automate_contact_form():
    """Fill out a contact form automatically."""
    agent = ComputerUseAgent()
    
    task = """
    Complete the following steps:
    1. I have a contact form open in the browser
    2. Fill in the Name field with "John Doe"
    3. Fill in the Email field with "john@example.com"
    4. Fill in the Message field with "Hello, I'm interested in your services."
    5. Click the Submit button
    6. Confirm the form was submitted successfully
    """
    
    result = agent.run(task)
    print(f"Form automation result: {result}")
    return result

def automate_login_flow():
    """Automate a login process."""
    agent = ComputerUseAgent()
    
    task = """
    Complete the login process:
    1. Find the username/email input field and enter "testuser@example.com"
    2. Find the password field and enter "SecurePassword123"
    3. Click the Login or Sign In button
    4. Wait for the page to load and confirm successful login
    5. Report what you see after logging in
    """
    
    result = agent.run(task)
    return result

def extract_dashboard_data():
    """Read data from a dashboard."""
    agent = ComputerUseAgent()
    
    task = """
    Analyze the dashboard on screen:
    1. Take a screenshot of the current dashboard
    2. Identify all visible metrics, charts, and data points
    3. Summarize the key information displayed
    4. Note any alerts or warnings shown
    5. Report your findings in a structured format
    """
    
    result = agent.run(task)
    return result


if __name__ == "__main__":
    # Run form automation
    automate_contact_form()

Utility Functions

# src/utils.py
import time
from typing import Callable, Any

def retry_on_failure(
    func: Callable,
    max_retries: int = 3,
    delay: float = 1.0
) -> Any:
    """Retry a function on failure with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time}s...")
            time.sleep(wait_time)

def validate_coordinates(x: int, y: int, width: int, height: int) -> bool:
    """Validate that coordinates are within screen bounds."""
    return 0 <= x < width and 0 <= y < height

def scale_coordinates(
    x: int, y: int,
    from_width: int, from_height: int,
    to_width: int, to_height: int
) -> tuple[int, int]:
    """Scale coordinates from one resolution to another."""
    scaled_x = int(x * to_width / from_width)
    scaled_y = int(y * to_height / from_height)
    return scaled_x, scaled_y
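A quick sanity check of the scaling arithmetic — the helper is repeated here so the snippet runs standalone:

```python
def scale_coordinates(x, y, from_width, from_height, to_width, to_height):
    """Same arithmetic as utils.scale_coordinates, repeated for a standalone check."""
    return int(x * to_width / from_width), int(y * to_height / from_height)

# The center of a 1280x720 screenshot maps to the center of a
# native 2560x1440 display.
print(scale_coordinates(640, 360, 1280, 720, 2560, 1440))  # (1280, 720)
```

This is the typical direction of use: Claude reasons over a downscaled screenshot, and clicks are scaled back up to native resolution before dispatch.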

def parse_tool_response(response) -> dict:
    """Parse Claude's response to extract tool calls."""
    result = {
        "text": "",
        "tool_calls": [],
        "stop_reason": response.stop_reason
    }
    
    for block in response.content:
        if hasattr(block, "text"):
            result["text"] += block.text
        elif block.type == "tool_use":
            result["tool_calls"].append({
                "id": block.id,
                "name": block.name,
                "input": block.input
            })
    
    return result

Error Handling

Robust error handling is crucial for computer use agents:

from anthropic import (
    APIError,
    RateLimitError,
    APIConnectionError
)
import time

class RobustAgent(ComputerUseAgent):
    """Agent with enhanced error handling."""
    
    def run_with_recovery(self, task: str, max_iterations: int = 25) -> str:
        """Run agent with automatic error recovery."""
        retries = 0
        max_retries = 3
        
        while retries < max_retries:
            try:
                return self.run(task, max_iterations)
            
            except RateLimitError as e:
                wait_time = 60 * (retries + 1)
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
                retries += 1
            
            except APIConnectionError:
                print("Connection error. Retrying in 5s...")
                time.sleep(5)
                retries += 1
            
            except APIError as e:
                print(f"API error: {e}")
                if "screenshot" in str(e).lower():
                    print("Screenshot issue - retrying...")
                    retries += 1
                else:
                    raise
            
            except Exception as e:
                print(f"Unexpected error: {e}")
                raise
        
        return "Failed after maximum retries"

Security Considerations

⚠️ Warning: Computer Use gives an AI control over your computer. Always follow these security practices.

Security Best Practices

| Practice | Implementation |
|---|---|
| Sandboxing | Run in a VM or container |
| Limited Scope | Restrict to specific applications |
| Human Oversight | Review actions before execution |
| No Sensitive Data | Never expose passwords or API keys |
| Network Isolation | Limit network access |
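The "Limited Scope" practice can be enforced in code before any action is dispatched. A hypothetical guard — the allowlist and region here are illustrative, not fixed API values:

```python
# Reject actions that are not on an allowlist, or clicks that fall
# outside a permitted screen region, before dispatching them.
ALLOWED_ACTIONS = {"screenshot", "mouse_move", "left_click", "type", "key", "scroll"}
ALLOWED_REGION = (0, 0, 1280, 720)  # x_min, y_min, x_max, y_max

def is_permitted(action: str, params: dict) -> bool:
    """Return True only if the requested action passes the scope checks."""
    if action not in ALLOWED_ACTIONS:
        return False
    coord = params.get("coordinate")
    if coord is not None:
        x, y = coord
        x0, y0, x1, y1 = ALLOWED_REGION
        return x0 <= x < x1 and y0 <= y < y1
    return True

print(is_permitted("left_click", {"coordinate": [100, 100]}))  # True
print(is_permitted("bash", {"command": "rm -rf /"}))           # False
```

Call a guard like this inside `process_tool_call` and return an `is_error` tool result when it fails, so Claude can report the restriction instead of silently stalling.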

Running in Docker

For production use, run Claude Computer Use in a sandboxed Docker container:

FROM python:3.11-slim

# Install display server and tools
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy agent code
COPY src/ /app/src/

WORKDIR /app

# Start virtual display
CMD ["sh", "-c", "Xvfb :99 -screen 0 1920x1080x24 & export DISPLAY=:99 && python src/basic_agent.py"]

Confirmation Prompts

Add human confirmation for sensitive actions:

def confirm_action(action: str, params: dict) -> bool:
    """Ask user to confirm before executing sensitive actions."""
    sensitive_patterns = [
        "password", "delete", "remove", "sudo",
        "admin", "payment", "transfer"
    ]
    
    action_str = f"{action}: {params}"
    is_sensitive = any(p in action_str.lower() for p in sensitive_patterns)
    
    if is_sensitive:
        response = input(f"Confirm action? {action_str} [y/N]: ")
        return response.lower() == "y"
    
    return True

Performance Optimization

Reduce Screenshot Size

Large screenshots increase latency and costs. Resize appropriately:

def optimize_screenshot(image_data: bytes, max_width: int = 1280) -> bytes:
    """Resize screenshot to reduce API costs."""
    from PIL import Image
    from io import BytesIO
    
    image = Image.open(BytesIO(image_data))
    
    if image.width > max_width:
        ratio = max_width / image.width
        new_height = int(image.height * ratio)
        image = image.resize((max_width, new_height), Image.LANCZOS)
    
    buffer = BytesIO()
    image.save(buffer, format="PNG", optimize=True)
    return buffer.getvalue()

Batch Operations

When possible, combine related actions:

def batch_form_fill(agent, form_data: dict) -> str:
    """Fill multiple form fields efficiently."""
    instructions = "Fill out the form with the following data:\n"
    
    for field, value in form_data.items():
        instructions += f"- {field}: {value}\n"
    
    instructions += "\nAfter filling all fields, click Submit."
    
    return agent.run(instructions)

Cost Tracking

Monitor your API usage:

class CostTracker:
    """Track API costs for computer use sessions."""
    
    # Pricing per million tokens
    PRICING = {
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0}
    }
    
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.requests = 0
    
    def track(self, response):
        """Record usage from a response."""
        usage = response.usage
        self.total_input_tokens += usage.input_tokens
        self.total_output_tokens += usage.output_tokens
        self.requests += 1
    
    def get_cost(self, model: str = "claude-sonnet-4-20250514") -> float:
        """Calculate total estimated cost."""
        prices = self.PRICING[model]
        input_cost = (self.total_input_tokens / 1_000_000) * prices["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * prices["output"]
        return input_cost + output_cost
    
    def report(self):
        """Print usage report."""
        print(f"Requests: {self.requests}")
        print(f"Input tokens: {self.total_input_tokens:,}")
        print(f"Output tokens: {self.total_output_tokens:,}")
        print(f"Estimated cost: ${self.get_cost():.4f}")
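The pricing math is easy to verify by hand. A standalone check using the same $3 per million input tokens and $15 per million output tokens assumed in `PRICING` above:

```python
# One short session: 500k input tokens (mostly screenshots) and
# 100k output tokens of reasoning and tool calls.
input_tokens, output_tokens = 500_000, 100_000
cost = (input_tokens / 1_000_000) * 3.0 + (output_tokens / 1_000_000) * 15.0
print(f"${cost:.2f}")  # $3.00
```

Screenshots dominate input-token usage, which is why the screenshot-resizing optimization above has a direct effect on cost.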

Debugging Tips

Enable Verbose Logging

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logger = logging.getLogger(__name__)

class DebugAgent(ComputerUseAgent):
    """Agent with detailed logging."""
    
    def process_tool_call(self, tool_use) -> dict:
        logger.debug(f"Tool call: {tool_use.name}")
        logger.debug(f"Parameters: {tool_use.input}")
        
        result = super().process_tool_call(tool_use)
        
        logger.debug(f"Result type: {type(result)}")
        return result

Save Screenshots for Review

import os
from datetime import datetime

def save_debug_screenshot(image_data: str, prefix: str = "debug"):
    """Save screenshot for debugging purposes."""
    os.makedirs("debug_screenshots", exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"debug_screenshots/{prefix}_{timestamp}.png"
    
    import base64
    with open(filename, "wb") as f:
        f.write(base64.b64decode(image_data))
    
    print(f"Saved debug screenshot: {filename}")
    return filename

Summary

| Concept | Key Points |
|---|---|
| Computer Use | Claude can see screens and control input |
| Agent Loop | Screenshot → Reason → Act → Repeat |
| Tools | computer, text_editor, bash |
| Actions | click, type, scroll, screenshot |
| Security | Sandbox, confirm, limit scope |
| Optimization | Resize images, batch operations |

Claude Computer Use opens new possibilities for AI-powered automation. By understanding the API, implementing proper error handling, and following security best practices, you can build powerful agents that interact with any application.

The complete code examples for this tutorial are available in our GitHub repository.

Moshiour Rahman

Software Architect & AI Engineer

Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.
