Claude Computer Use: Build AI Agents That Control Your Screen
Master Claude's Computer Use API to build AI agents that see screens, click elements, and automate tasks. Complete Python tutorial with working code examples.
Moshiour Rahman
What is Claude Computer Use?
Claude Computer Use is Anthropic’s groundbreaking feature that enables AI to interact with computers like a human does—viewing the screen, moving the mouse, clicking elements, and typing text. Released in October 2024, this capability transforms Claude from a chat assistant into an autonomous screen agent.
| Capability | Description |
|---|---|
| Screen Vision | Claude sees and understands desktop screenshots |
| Mouse Control | Move cursor, click, drag, and scroll |
| Keyboard Input | Type text and press key combinations |
| Task Automation | Complete multi-step workflows autonomously |
Why Computer Use Matters
Traditional automation tools like Selenium or Puppeteer require explicit selectors and break when UIs change. Claude Computer Use works differently—it understands what it sees, making it:
- Resilient: Adapts to UI changes automatically
- Intelligent: Makes decisions based on visual context
- Universal: Works with any application, not just browsers
- Natural: Accepts plain English instructions
Use Cases for Computer Use Agents
| Use Case | Example |
|---|---|
| Data Entry | Fill forms across multiple applications |
| Testing | Visual regression testing without selectors |
| Monitoring | Check dashboards and report anomalies |
| Research | Navigate websites and extract information |
| Onboarding | Automate repetitive setup tasks |
Prerequisites
Before building your first computer use agent, ensure you have:
# Python 3.10+
python --version
# Anthropic API key
export ANTHROPIC_API_KEY="your-api-key"
# Install dependencies
pip install anthropic pillow
API Access
Computer Use is a beta feature: the model, the beta header, and the tool version must match. The examples in this tutorial use the `computer_20241022` tool with the `computer-use-2024-10-22` beta flag; newer models expect updated tool versions (for example `computer_20250124`), so check Anthropic's documentation for the combination your model supports:
import anthropic

client = anthropic.Anthropic()

# Beta tools for computer use
COMPUTER_USE_TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    }
]
Understanding the Computer Use API
The Agentic Loop
Claude Computer Use operates in a continuous loop:
- Observe: Take a screenshot of the current screen
- Reason: Analyze the screenshot and decide next action
- Act: Execute mouse/keyboard commands
- Repeat: Continue until task is complete
def run_agent_loop(task: str, max_iterations: int = 20):
    """Execute an agentic loop for a given task."""
    messages = [{"role": "user", "content": task}]
    for iteration in range(max_iterations):
        # The SDK's beta namespace accepts the computer-use beta flag
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=COMPUTER_USE_TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"]
        )
        # "end_turn" means Claude considers the task complete
        if response.stop_reason == "end_turn":
            return extract_text_response(response)
        # Otherwise execute the requested actions and feed the results back
        if response.stop_reason == "tool_use":
            tool_results = process_tool_calls(response)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
    return "Max iterations reached"
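The loop above references two helpers. `process_tool_calls` is covered by the full agent implementation later in this tutorial; `extract_text_response` can be a small utility that concatenates the text blocks of the final response (a sketch, assuming content blocks expose a `text` attribute as in the Anthropic SDK):

```python
def extract_text_response(response) -> str:
    """Concatenate all text blocks from a Messages API response."""
    return "".join(
        block.text for block in response.content if hasattr(block, "text")
    )
```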
Tool Types
Claude can request three types of computer interactions:
| Tool | Purpose | Parameters |
|---|---|---|
| `computer` | Screen control | action, coordinate, text |
| `text_editor` | File editing | command, path, content |
| `bash` | Shell commands | command |
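All three tools can be offered in a single request. A sketch of a combined tool list, using the type strings from the 2024-10-22 beta (tool versions change between betas, so verify against the current docs before relying on these):

```python
# Tool definitions for the computer-use-2024-10-22 beta
ALL_TOOLS = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
        "display_number": 1
    },
    # In this beta the text editor tool is named "str_replace_editor"
    {"type": "text_editor_20241022", "name": "str_replace_editor"},
    {"type": "bash_20241022", "name": "bash"}
]
```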
Action Types
The computer tool supports these actions:
COMPUTER_ACTIONS = {
    "screenshot": "Capture the current screen",
    "mouse_move": "Move cursor to coordinates",
    "left_click": "Click left mouse button",
    "right_click": "Click right mouse button",
    "double_click": "Double-click left button",
    "left_click_drag": "Click, drag, and release",
    "type": "Type text string",
    "key": "Press key combination",
    "scroll": "Scroll up or down",
    "cursor_position": "Get current cursor position"
}
Building a Basic Screen Agent
Let’s build a complete screen agent step by step.
Project Structure
claude-computer-use-tutorial/
├── src/
│ ├── __init__.py
│ ├── basic_agent.py
│ ├── screen_controller.py
│ ├── form_automation.py
│ └── utils.py
├── .env.example
├── pyproject.toml
└── README.md
Screen Controller
First, create a wrapper for screen interactions:
# src/screen_controller.py
import base64
import subprocess
import tempfile
from io import BytesIO
from PIL import Image


class ScreenController:
    """Handle screen capture and input simulation (macOS; requires cliclick)."""

    def __init__(self, width: int = 1920, height: int = 1080):
        self.width = width
        self.height = height

    def take_screenshot(self) -> str:
        """Capture the screen and return a base64-encoded PNG."""
        # screencapture writes to a file, so capture into a temp path
        with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
            subprocess.run(
                ["screencapture", "-x", "-C", "-t", "png", tmp.name],
                check=True
            )
            image = Image.open(tmp.name)
            image.load()
        # Resize so Claude's coordinates match the advertised resolution
        if image.size != (self.width, self.height):
            image = image.resize((self.width, self.height), Image.LANCZOS)
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        return base64.standard_b64encode(buffer.getvalue()).decode()

    def mouse_move(self, x: int, y: int):
        """Move mouse to coordinates using cliclick (brew install cliclick)."""
        subprocess.run(["cliclick", f"m:{x},{y}"])

    def click(self, x: int, y: int, button: str = "left"):
        """Click at coordinates."""
        cmd = "c" if button == "left" else "rc"
        subprocess.run(["cliclick", f"{cmd}:{x},{y}"])

    def double_click(self, x: int, y: int):
        """Double-click at coordinates."""
        subprocess.run(["cliclick", f"dc:{x},{y}"])

    def type_text(self, text: str):
        """Type a text string."""
        subprocess.run(["cliclick", f"t:{text}"])

    def press_key(self, key: str):
        """Press a key combination."""
        # Map Claude's key names to cliclick key-press commands
        key_map = {
            "Return": "kp:return",
            "Tab": "kp:tab",
            "Escape": "kp:escape",
            "BackSpace": "kp:delete",
            "space": "kp:space"
        }
        cmd = key_map.get(key, f"kp:{key.lower()}")
        subprocess.run(["cliclick", cmd])

    def scroll(self, x: int, y: int, direction: str, amount: int = 3):
        """Scroll at a position."""
        scroll_cmd = "su" if direction == "up" else "sd"
        subprocess.run(["cliclick", f"m:{x},{y}", f"{scroll_cmd}:{amount}"])
Basic Agent Implementation
Now implement the main agent:
# src/basic_agent.py
import anthropic

from screen_controller import ScreenController


class ComputerUseAgent:
    """AI agent that can see and control the computer."""

    def __init__(self, display_width: int = 1920, display_height: int = 1080):
        self.client = anthropic.Anthropic()
        self.screen = ScreenController(display_width, display_height)
        self.display_width = display_width
        self.display_height = display_height
        self.tools = [
            {
                "type": "computer_20241022",
                "name": "computer",
                "display_width_px": display_width,
                "display_height_px": display_height,
                "display_number": 1
            }
        ]

    def process_tool_call(self, tool_use) -> dict:
        """Execute a tool call and return the result."""
        name = tool_use.name
        params = tool_use.input
        if name != "computer":
            return {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": f"Unknown tool: {name}",
                "is_error": True
            }
        action = params.get("action")
        if action == "screenshot":
            screenshot = self.screen.take_screenshot()
            return {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot
                        }
                    }
                ]
            }
        elif action == "mouse_move":
            x, y = params["coordinate"]
            self.screen.mouse_move(x, y)
        elif action == "left_click":
            # Compare against None: (0, 0) is a valid coordinate
            coordinate = params.get("coordinate")
            if coordinate is not None:
                x, y = coordinate
                self.screen.click(x, y, "left")
        elif action == "right_click":
            coordinate = params.get("coordinate")
            if coordinate is not None:
                x, y = coordinate
                self.screen.click(x, y, "right")
        elif action == "double_click":
            coordinate = params.get("coordinate")
            if coordinate is not None:
                x, y = coordinate
                self.screen.double_click(x, y)
        elif action == "type":
            self.screen.type_text(params["text"])
        elif action == "key":
            self.screen.press_key(params["text"])
        elif action == "scroll":
            x, y = params["coordinate"]
            direction = params.get("direction", "down")
            self.screen.scroll(x, y, direction)
        else:
            return {
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": f"Unknown action: {action}",
                "is_error": True
            }
        # Return confirmation with a fresh screenshot
        screenshot = self.screen.take_screenshot()
        return {
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": [
                {
                    "type": "text",
                    "text": f"Action '{action}' completed successfully."
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot
                    }
                }
            ]
        }

    def run(self, task: str, max_iterations: int = 25) -> str:
        """Run the agent to complete a task."""
        print(f"Starting task: {task}")
        # Seed the conversation with the task and an initial screenshot
        initial_screenshot = self.screen.take_screenshot()
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": task},
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": initial_screenshot
                        }
                    }
                ]
            }
        ]
        system_prompt = """You are a computer use agent. You can see the screen and control the mouse and keyboard to complete tasks.

Guidelines:
1. Always take a screenshot first to see the current state
2. Think step by step about what you need to do
3. Click precisely on UI elements you can see
4. Wait for pages/applications to load after actions
5. Report when the task is complete

Be careful and methodical. If something doesn't work, try an alternative approach."""

        for iteration in range(max_iterations):
            print(f"Iteration {iteration + 1}/{max_iterations}")
            response = self.client.beta.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                system=system_prompt,
                tools=self.tools,
                messages=messages,
                betas=["computer-use-2024-10-22"]
            )
            # "end_turn" means Claude considers the task finished
            if response.stop_reason == "end_turn":
                for block in response.content:
                    if hasattr(block, "text"):
                        return block.text
                return "Task completed"
            # Execute every requested tool call and collect the results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  Action: {block.input.get('action', 'unknown')}")
                    result = self.process_tool_call(block)
                    tool_results.append(result)
            if tool_results:
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})
        return "Max iterations reached without completing task"


# Usage example
if __name__ == "__main__":
    agent = ComputerUseAgent()
    result = agent.run("Open Safari and search for 'Python tutorials'")
    print(f"Result: {result}")
Real-World Example: Form Automation
Here’s a practical example that fills out a web form:
# src/form_automation.py
from basic_agent import ComputerUseAgent


def automate_contact_form():
    """Fill out a contact form automatically."""
    agent = ComputerUseAgent()
    task = """
    Complete the following steps:
    1. I have a contact form open in the browser
    2. Fill in the Name field with "John Doe"
    3. Fill in the Email field with "john@example.com"
    4. Fill in the Message field with "Hello, I'm interested in your services."
    5. Click the Submit button
    6. Confirm the form was submitted successfully
    """
    result = agent.run(task)
    print(f"Form automation result: {result}")
    return result


def automate_login_flow():
    """Automate a login process."""
    agent = ComputerUseAgent()
    task = """
    Complete the login process:
    1. Find the username/email input field and enter "testuser@example.com"
    2. Find the password field and enter "SecurePassword123"
    3. Click the Login or Sign In button
    4. Wait for the page to load and confirm successful login
    5. Report what you see after logging in
    """
    result = agent.run(task)
    return result


def extract_dashboard_data():
    """Read data from a dashboard."""
    agent = ComputerUseAgent()
    task = """
    Analyze the dashboard on screen:
    1. Take a screenshot of the current dashboard
    2. Identify all visible metrics, charts, and data points
    3. Summarize the key information displayed
    4. Note any alerts or warnings shown
    5. Report your findings in a structured format
    """
    result = agent.run(task)
    return result


if __name__ == "__main__":
    # Run form automation
    automate_contact_form()
Utility Functions
# src/utils.py
import time
from typing import Any, Callable


def retry_on_failure(
    func: Callable,
    max_retries: int = 3,
    delay: float = 1.0
) -> Any:
    """Retry a function on failure with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time}s...")
            time.sleep(wait_time)


def validate_coordinates(x: int, y: int, width: int, height: int) -> bool:
    """Validate that coordinates are within screen bounds."""
    return 0 <= x < width and 0 <= y < height


def scale_coordinates(
    x: int, y: int,
    from_width: int, from_height: int,
    to_width: int, to_height: int
) -> tuple[int, int]:
    """Scale coordinates from one resolution to another."""
    scaled_x = int(x * to_width / from_width)
    scaled_y = int(y * to_height / from_height)
    return scaled_x, scaled_y


def parse_tool_response(response) -> dict:
    """Parse Claude's response to extract tool calls."""
    result = {
        "text": "",
        "tool_calls": [],
        "stop_reason": response.stop_reason
    }
    for block in response.content:
        if hasattr(block, "text"):
            result["text"] += block.text
        elif block.type == "tool_use":
            result["tool_calls"].append({
                "id": block.id,
                "name": block.name,
                "input": block.input
            })
    return result
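Coordinate scaling matters most on Retina displays, where the captured bitmap is larger than the logical resolution Claude is told about: clicks the model places in screenshot space must be mapped back to physical pixels. A quick sanity check (the function is repeated here so the snippet runs standalone):

```python
def scale_coordinates(x, y, from_width, from_height, to_width, to_height):
    """Scale coordinates from one resolution to another (as in utils.py)."""
    return int(x * to_width / from_width), int(y * to_height / from_height)

# A click at (720, 450) in a 1440x900 screenshot maps to (1440, 900)
# on a 2880x1800 Retina framebuffer
print(scale_coordinates(720, 450, 1440, 900, 2880, 1800))  # (1440, 900)
```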
Error Handling
Robust error handling is crucial for computer use agents:
import time

from anthropic import (
    APIError,
    RateLimitError,
    APIConnectionError
)

from basic_agent import ComputerUseAgent


class RobustAgent(ComputerUseAgent):
    """Agent with enhanced error handling."""

    def run_with_recovery(self, task: str, max_iterations: int = 25) -> str:
        """Run agent with automatic error recovery."""
        retries = 0
        max_retries = 3
        while retries < max_retries:
            try:
                return self.run(task, max_iterations)
            except RateLimitError:
                # Back off progressively on rate limits
                wait_time = 60 * (retries + 1)
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
                retries += 1
            except APIConnectionError:
                print("Connection error. Retrying in 5s...")
                time.sleep(5)
                retries += 1
            except APIError as e:
                print(f"API error: {e}")
                if "screenshot" in str(e).lower():
                    print("Screenshot issue - retrying...")
                    retries += 1
                else:
                    raise
            except Exception as e:
                print(f"Unexpected error: {e}")
                raise
        return "Failed after maximum retries"
Security Considerations
⚠️ Warning: Computer Use gives an AI control over your computer. Always follow these security practices.
Security Best Practices
| Practice | Implementation |
|---|---|
| Sandboxing | Run in a VM or container |
| Limited Scope | Restrict to specific applications |
| Human Oversight | Review actions before execution |
| No Sensitive Data | Never expose passwords or API keys |
| Network Isolation | Limit network access |
Running in Docker
For production use, run Claude Computer Use in a sandboxed Docker container:
FROM python:3.11-slim
# Install display server and tools
RUN apt-get update && apt-get install -y \
xvfb \
x11vnc \
fluxbox \
xterm \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy agent code
COPY src/ /app/src/
WORKDIR /app
# Start virtual display
CMD ["sh", "-c", "Xvfb :99 -screen 0 1920x1080x24 & export DISPLAY=:99 && python src/basic_agent.py"]
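Assuming the Dockerfile above sits at the project root, building and running the sandbox might look like the following. The image name and VNC port mapping are illustrative; x11vnc would need to be started inside the container before the port is useful:

```shell
# Build the sandbox image
docker build -t computer-use-agent .

# Run with the API key passed in; map VNC if you want to watch the agent
docker run --rm -it \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -p 5900:5900 \
  computer-use-agent
```

Keeping the container off sensitive networks and free of mounted host volumes limits the blast radius if the agent misbehaves.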
Confirmation Prompts
Add human confirmation for sensitive actions:
def confirm_action(action: str, params: dict) -> bool:
    """Ask user to confirm before executing sensitive actions."""
    sensitive_patterns = [
        "password", "delete", "remove", "sudo",
        "admin", "payment", "transfer"
    ]
    action_str = f"{action}: {params}"
    is_sensitive = any(p in action_str.lower() for p in sensitive_patterns)
    if is_sensitive:
        response = input(f"Confirm action? {action_str} [y/N]: ")
        return response.lower() == "y"
    return True
Performance Optimization
Reduce Screenshot Size
Large screenshots increase latency and costs. Resize appropriately:
def optimize_screenshot(image_data: bytes, max_width: int = 1280) -> bytes:
    """Resize screenshot to reduce API costs."""
    from io import BytesIO

    from PIL import Image

    image = Image.open(BytesIO(image_data))
    if image.width > max_width:
        ratio = max_width / image.width
        new_height = int(image.height * ratio)
        image = image.resize((max_width, new_height), Image.LANCZOS)
    buffer = BytesIO()
    image.save(buffer, format="PNG", optimize=True)
    return buffer.getvalue()
Batch Operations
When possible, combine related actions:
def batch_form_fill(agent, form_data: dict) -> str:
    """Fill multiple form fields efficiently."""
    instructions = "Fill out the form with the following data:\n"
    for field, value in form_data.items():
        instructions += f"- {field}: {value}\n"
    instructions += "\nAfter filling all fields, click Submit."
    return agent.run(instructions)
Cost Tracking
Monitor your API usage:
class CostTracker:
    """Track API costs for computer use sessions."""

    # Pricing in USD per million tokens
    PRICING = {
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0}
    }

    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.requests = 0

    def track(self, response):
        """Record usage from a response."""
        usage = response.usage
        self.total_input_tokens += usage.input_tokens
        self.total_output_tokens += usage.output_tokens
        self.requests += 1

    def get_cost(self, model: str = "claude-sonnet-4-20250514") -> float:
        """Calculate total estimated cost."""
        prices = self.PRICING[model]
        input_cost = (self.total_input_tokens / 1_000_000) * prices["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * prices["output"]
        return input_cost + output_cost

    def report(self):
        """Print usage report."""
        print(f"Requests: {self.requests}")
        print(f"Input tokens: {self.total_input_tokens:,}")
        print(f"Output tokens: {self.total_output_tokens:,}")
        print(f"Estimated cost: ${self.get_cost():.4f}")
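To see the arithmetic in action, here is the tracker exercised with a stand-in response object (a `SimpleNamespace` mimicking the SDK's `usage` attribute; the class is repeated in compressed form and the token counts are illustrative, so the snippet runs standalone):

```python
from types import SimpleNamespace

# Compressed version of the CostTracker above
class CostTracker:
    PRICING = {"claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0}}

    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.requests = 0

    def track(self, response):
        self.total_input_tokens += response.usage.input_tokens
        self.total_output_tokens += response.usage.output_tokens
        self.requests += 1

    def get_cost(self, model="claude-sonnet-4-20250514"):
        p = self.PRICING[model]
        return (self.total_input_tokens / 1e6) * p["input"] + \
               (self.total_output_tokens / 1e6) * p["output"]

# Illustrative token counts for a single screenshot-heavy turn
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=100_000, output_tokens=2_000)
)
tracker = CostTracker()
tracker.track(fake_response)
print(f"${tracker.get_cost():.2f}")  # $0.33
```

Screenshots dominate input-token usage, which is why the resizing optimization above pays off quickly over a long session.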
Debugging Tips
Enable Verbose Logging
import logging

from basic_agent import ComputerUseAgent

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


class DebugAgent(ComputerUseAgent):
    """Agent with detailed logging."""

    def process_tool_call(self, tool_use) -> dict:
        logger.debug(f"Tool call: {tool_use.name}")
        logger.debug(f"Parameters: {tool_use.input}")
        result = super().process_tool_call(tool_use)
        logger.debug(f"Result type: {type(result)}")
        return result
Save Screenshots for Review
import base64
import os
from datetime import datetime


def save_debug_screenshot(image_data: str, prefix: str = "debug"):
    """Save a base64-encoded screenshot for debugging purposes."""
    os.makedirs("debug_screenshots", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"debug_screenshots/{prefix}_{timestamp}.png"
    with open(filename, "wb") as f:
        f.write(base64.b64decode(image_data))
    print(f"Saved debug screenshot: {filename}")
    return filename
Summary
| Concept | Key Points |
|---|---|
| Computer Use | Claude can see screens and control input |
| Agent Loop | Screenshot → Reason → Act → Repeat |
| Tools | computer, text_editor, bash |
| Actions | click, type, scroll, screenshot |
| Security | Sandbox, confirm, limit scope |
| Optimization | Resize images, batch operations |
Claude Computer Use opens new possibilities for AI-powered automation. By understanding the API, implementing proper error handling, and following security best practices, you can build powerful agents that interact with any application.
The complete code examples for this tutorial are available in our GitHub repository.
Moshiour Rahman
Software Architect & AI Engineer
Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.