© 2026 Silvia Seceleanu

Products · Anthropic · Oct 2024

20. Claude Computer Use (Beta)

First model to operate a real desktop by interpreting screenshots and issuing mouse/keyboard commands.

Product Announcement
Summary

Anthropic released a computer use capability that lets Claude interact with desktop environments: clicking, typing, navigating UIs, and taking screenshots. Claude was the first major AI model to ship native computer interaction, launched as a public beta in Claude 3.5 Sonnet.

Key Concepts

Screenshot-Based Desktop Control

A visual grounding approach where the model perceives the desktop environment exclusively through screenshots and controls it through discrete actions (click coordinates, keyboard input, scrolling). Rather than accessing application APIs directly, Computer Use operates on rendered visual output, just as a human would. This approach is slower and more error-prone than API integration but has a critical advantage: it works with any application, including legacy systems and web apps. Visual grounding also creates a natural human supervision point—every action is visible before execution.
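The perceive–act cycle described above can be sketched as a simple loop. This is a minimal illustration, not Anthropic's actual API: `StubDesktop`, `scripted_model`, and the action names are hypothetical stand-ins for the real environment and model.

```python
from dataclasses import dataclass, field

@dataclass
class StubDesktop:
    """Hypothetical desktop stand-in: serves opaque 'screenshots', records actions."""
    log: list = field(default_factory=list)

    def screenshot(self) -> str:
        return f"screenshot_{len(self.log)}"  # placeholder for rendered pixels

    def execute(self, action: dict) -> None:
        self.log.append(action)

def scripted_model(screenshot: str, step: int) -> dict:
    # Stand-in for the model: maps each observation to one discrete action.
    script = [
        {"type": "left_click", "x": 120, "y": 340},
        {"type": "type", "text": "hello"},
        {"type": "done"},
    ]
    return script[step]

def run_agent(desktop: StubDesktop, max_steps: int = 10) -> list:
    """Observe a screenshot, choose one action, execute it, repeat."""
    for step in range(max_steps):
        shot = desktop.screenshot()          # observation: rendered pixels only
        action = scripted_model(shot, step)  # decision: one discrete action
        if action["type"] == "done":
            break
        desktop.execute(action)              # effect on the environment
    return desktop.log
```

The key property the sketch captures is that the model never touches an application API: its only inputs are screenshots and its only outputs are discrete actions.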

Mouse/Keyboard Command Interface

The explicit action vocabulary for desktop control: mouse coordinates for clicks, drag operations, text input, key presses, and scrolling. This vocabulary is intentionally simple and observable, mirroring what a human at a keyboard can do. The model emits an action with its target coordinates, the action is executed, and an updated screenshot is returned as feedback. This loop creates clear causality that humans can follow, understand, and interrupt. Complex multi-step workflows emerge from chaining these atomic actions.
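One way to picture such a vocabulary is as a closed set of validated action shapes. The action names and required fields below are illustrative, not the exact schema Anthropic ships:

```python
# Illustrative action vocabulary: a closed set of action names, each with
# the exact fields it requires. Anything outside the set is rejected.
VOCABULARY = {
    "left_click":   {"x", "y"},
    "double_click": {"x", "y"},
    "drag":         {"from_x", "from_y", "to_x", "to_y"},
    "type":         {"text"},
    "key":          {"key"},
    "scroll":       {"x", "y", "amount"},
    "screenshot":   set(),  # pure observation: no parameters, no side effects
}

def validate_action(action: dict) -> dict:
    """Accept an action only if its name and fields match the vocabulary."""
    name = action.get("type")
    if name not in VOCABULARY:
        raise ValueError(f"unknown action: {name!r}")
    required = VOCABULARY[name]
    provided = set(action) - {"type"}
    if provided != required:
        raise ValueError(
            f"{name} needs fields {sorted(required)}, got {sorted(provided)}"
        )
    return action
```

Keeping the set closed is what makes the interface auditable: every action a human might need to intercept is one of a handful of known, inspectable shapes.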

Agent Safety Boundary (action vs observation)

The critical distinction between actions the model takes (which should be limited, observable, and interceptable) and observations it makes (which can be comprehensive). Computer Use implements this by restricting the model to a defined action vocabulary while allowing unlimited screenshot observation. The model can observe high-resolution details of the desktop but can only click at 50-pixel granularity (intentionally imprecise). This asymmetry—rich observation, constrained action—maintains human oversight even as the model gains desktop access.
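The observation-rich, action-coarse asymmetry can be illustrated by quantizing click targets to a grid while leaving screenshots untouched. The 50-pixel grid follows the figure in the text above; the helper names are hypothetical:

```python
GRID = 50  # clicks snap to a coarse grid; observation stays full-resolution

def snap_to_grid(x: int, y: int, grid: int = GRID) -> tuple[int, int]:
    """Quantize a requested click point to the nearest grid intersection."""
    return (round(x / grid) * grid, round(y / grid) * grid)

def constrained_click(x: int, y: int) -> dict:
    # The model may request any coordinate it can see in the screenshot,
    # but the executed action is intentionally coarser than its perception.
    gx, gy = snap_to_grid(x, y)
    return {"type": "left_click", "x": gx, "y": gy}
```

The asymmetry lives entirely on the output side: nothing limits what the model reads from a screenshot, only how finely it can act on what it sees.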

From Chat to Action

The transition from traditional LLM interaction (text in, text out) to embodied AI (perception and action in real environments). Computer Use represents the maturation of this transition: the model is no longer confined to advice or suggestions but can directly accomplish tasks. This shift demands new safety considerations, because the model's errors now have real consequences (deleting files, sending emails, posting publicly) rather than merely being unhelpful suggestions. The move from chat to action is thus both an expansion of capability and an increase in safety complexity.