Testing Anthropic Computer Use API: UI Actions, Screenshot Assertions, and Agent Loops

Testing Anthropic Computer Use API: UI Actions, Screenshot Assertions, and Agent Loops

Anthropic's Computer Use API lets Claude control a computer — clicking, typing, scrolling, and taking screenshots — to complete tasks autonomously. Testing Computer Use integrations is uniquely challenging because you're testing an agentic system that interacts with a real or simulated UI. This guide covers testing strategies for Computer Use agent loops, tool call validation, and screenshot-based assertions.

Computer Use Architecture Overview

A Computer Use agent loop works like this:

  1. Send a task prompt to Claude with computer tools enabled
  2. Claude returns a tool_use block (e.g., screenshot, click, type)
  3. Your code executes the tool action on the real or virtual desktop
  4. You send the tool result (screenshot, success status) back to Claude
  5. Repeat until Claude responds with no tool calls (task complete)

Testing this system means testing each layer: the tool parsing, the tool execution, the loop control logic, and the agent's decision quality.

Mocking the Anthropic SDK

import Anthropic from '@anthropic-ai/sdk';

jest.mock('@anthropic-ai/sdk');

const mockCreate = jest.fn();

(Anthropic as jest.Mock).mockImplementation(() => ({
  beta: {
    messages: {
      create: mockCreate,
    },
  },
}));

// Helper to create mock Computer Use responses
function createComputerUseResponse(toolUses: any[]) {
  return {
    id: 'msg_test',
    type: 'message',
    role: 'assistant',
    content: toolUses.map(tool => ({
      type: 'tool_use',
      id: `toolu_${Math.random().toString(36).slice(2)}`,
      name: tool.name,
      input: tool.input,
    })),
    model: 'claude-opus-4-6',
    stop_reason: toolUses.length > 0 ? 'tool_use' : 'end_turn',
    usage: { input_tokens: 500, output_tokens: 100 },
  };
}

function createFinalResponse(text: string) {
  return {
    id: 'msg_final',
    type: 'message',
    role: 'assistant',
    content: [{ type: 'text', text }],
    model: 'claude-opus-4-6',
    stop_reason: 'end_turn',
    usage: { input_tokens: 600, output_tokens: 50 },
  };
}

Testing Tool Call Parsing

Verify that your agent correctly parses Claude's tool call responses:

import { ComputerUseAgent } from '../agents/computer-use-agent';

describe('Tool call parsing', () => {
  test('parses screenshot tool call correctly', () => {
    const response = createComputerUseResponse([{
      name: 'computer',
      input: { action: 'screenshot' },
    }]);

    const agent = new ComputerUseAgent(new Anthropic('test-key'));
    const toolCalls = agent.parseToolCalls(response);

    expect(toolCalls).toHaveLength(1);
    expect(toolCalls[0].name).toBe('computer');
    expect(toolCalls[0].input.action).toBe('screenshot');
  });

  test('parses click action with coordinates', () => {
    const response = createComputerUseResponse([{
      name: 'computer',
      input: {
        action: 'left_click',
        coordinate: [640, 480],
      },
    }]);

    const agent = new ComputerUseAgent(new Anthropic('test-key'));
    const toolCalls = agent.parseToolCalls(response);

    expect(toolCalls[0].input.action).toBe('left_click');
    expect(toolCalls[0].input.coordinate).toEqual([640, 480]);
  });

  test('parses type action with text', () => {
    const response = createComputerUseResponse([{
      name: 'computer',
      input: {
        action: 'type',
        text: 'Hello, World!',
      },
    }]);

    const agent = new ComputerUseAgent(new Anthropic('test-key'));
    const toolCalls = agent.parseToolCalls(response);

    expect(toolCalls[0].input.action).toBe('type');
    expect(toolCalls[0].input.text).toBe('Hello, World!');
  });

  test('identifies end_turn as task completion', () => {
    const finalResponse = createFinalResponse('Task completed successfully.');

    const agent = new ComputerUseAgent(new Anthropic('test-key'));
    
    expect(agent.isTaskComplete(finalResponse)).toBe(true);
    expect(agent.isTaskComplete(createComputerUseResponse([{ name: 'computer', input: { action: 'screenshot' } }]))).toBe(false);
  });
});

Testing the Agent Loop

The agent loop is the critical orchestration layer:

describe('Computer Use agent loop', () => {
  let mockScreenshotFn: jest.Mock;
  let mockClickFn: jest.Mock;
  let mockTypeFn: jest.Mock;

  beforeEach(() => {
    mockScreenshotFn = jest.fn().mockResolvedValue(Buffer.from('fake-screenshot-data'));
    mockClickFn = jest.fn().mockResolvedValue({ success: true });
    mockTypeFn = jest.fn().mockResolvedValue({ success: true });
  });

  test('completes task after screenshot and click sequence', async () => {
    // Simulate: screenshot → click → screenshot → task complete
    mockCreate
      .mockResolvedValueOnce(createComputerUseResponse([{
        name: 'computer',
        input: { action: 'screenshot' },
      }]))
      .mockResolvedValueOnce(createComputerUseResponse([{
        name: 'computer',
        input: { action: 'left_click', coordinate: [100, 200] },
      }]))
      .mockResolvedValueOnce(createFinalResponse('I clicked the button successfully.'));

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: mockScreenshotFn,
      click: mockClickFn,
      type: mockTypeFn,
    });

    const result = await agent.runTask('Click the Submit button');

    expect(mockCreate).toHaveBeenCalledTimes(3);
    expect(mockScreenshotFn).toHaveBeenCalledTimes(1);
    expect(mockClickFn).toHaveBeenCalledWith(100, 200, 'left');
    expect(result.success).toBe(true);
    expect(result.finalMessage).toBe('I clicked the button successfully.');
  });

  test('handles max iterations limit to prevent infinite loops', async () => {
    // Always return screenshot — never complete the task
    mockCreate.mockResolvedValue(createComputerUseResponse([{
      name: 'computer',
      input: { action: 'screenshot' },
    }]));
    mockScreenshotFn.mockResolvedValue(Buffer.from('same-screenshot'));

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: mockScreenshotFn,
      click: mockClickFn,
      type: mockTypeFn,
    }, { maxIterations: 5 });

    const result = await agent.runTask('Do something impossible');

    expect(result.success).toBe(false);
    expect(result.error).toContain('max iterations');
    expect(mockCreate).toHaveBeenCalledTimes(5);
  });

  test('stops loop when consecutive screenshots are identical', async () => {
    const identicalScreenshot = Buffer.from('identical-data');
    mockScreenshotFn.mockResolvedValue(identicalScreenshot);

    mockCreate.mockResolvedValue(createComputerUseResponse([{
      name: 'computer',
      input: { action: 'screenshot' },
    }]));

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: mockScreenshotFn,
      click: mockClickFn,
      type: mockTypeFn,
    }, { stuckDetectionThreshold: 3 });

    const result = await agent.runTask('Navigate somewhere');

    expect(result.success).toBe(false);
    expect(result.error).toContain('stuck');
  });

  test('retries on tool execution failure', async () => {
    mockCreate
      .mockResolvedValueOnce(createComputerUseResponse([{
        name: 'computer',
        input: { action: 'left_click', coordinate: [100, 200] },
      }]))
      .mockResolvedValueOnce(createFinalResponse('Done.'));

    // First click fails, second succeeds
    mockClickFn
      .mockRejectedValueOnce(new Error('Click failed: element not found'))
      .mockResolvedValueOnce({ success: true });

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: mockScreenshotFn,
      click: mockClickFn,
      type: mockTypeFn,
    }, { retryOnError: true });

    const result = await agent.runTask('Click something');

    // Should have retried the click
    expect(mockClickFn).toHaveBeenCalledTimes(2);
    expect(result.success).toBe(true);
  });
});

Testing Screenshot Assertions

Computer Use's screenshot-based interactions need validation that the agent is looking at what you expect:

import { compareImages, PixelDiff } from '../utils/image-comparison';

describe('Screenshot assertion helpers', () => {
  test('detects when target element appears in screenshot', async () => {
    const testScreenshot = Buffer.from('fake-screenshot');
    
    // Mock image comparison
    const mockFindElement = jest.fn().mockResolvedValue({
      found: true,
      boundingBox: { x: 100, y: 200, width: 150, height: 50 },
      confidence: 0.95,
    });

    const result = await mockFindElement(testScreenshot, 'submit-button-template.png');
    
    expect(result.found).toBe(true);
    expect(result.confidence).toBeGreaterThan(0.9);
    expect(result.boundingBox.x).toBe(100);
  });

  test('validates that Claude clicks within expected region', async () => {
    const clickCoordinate = [125, 225]; // Should be inside bounding box [100, 200, 150, 50]
    const expectedRegion = { x: 100, y: 200, width: 150, height: 50 };

    function isWithinRegion(coord: number[], region: typeof expectedRegion): boolean {
      return coord[0] >= region.x && 
             coord[0] <= region.x + region.width &&
             coord[1] >= region.y && 
             coord[1] <= region.y + region.height;
    }

    expect(isWithinRegion(clickCoordinate, expectedRegion)).toBe(true);
    expect(isWithinRegion([500, 500], expectedRegion)).toBe(false);
  });
});

Testing Tool Result Construction

When you return tool results to Claude, they must be in the correct format:

describe('Tool result construction', () => {
  test('builds screenshot tool result with correct base64', async () => {
    const screenshotBuffer = Buffer.from('fake-png-data');
    const toolCallId = 'toolu_test123';

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: jest.fn().mockResolvedValue(screenshotBuffer),
      click: jest.fn(),
      type: jest.fn(),
    });

    const toolResult = await agent.executeToolAndBuildResult({
      id: toolCallId,
      name: 'computer',
      input: { action: 'screenshot' },
    });

    expect(toolResult).toEqual({
      type: 'tool_result',
      tool_use_id: toolCallId,
      content: [{
        type: 'image',
        source: {
          type: 'base64',
          media_type: 'image/png',
          data: screenshotBuffer.toString('base64'),
        },
      }],
    });
  });

  test('builds action result for successful click', async () => {
    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: jest.fn(),
      click: jest.fn().mockResolvedValue({ success: true }),
      type: jest.fn(),
    });

    const toolResult = await agent.executeToolAndBuildResult({
      id: 'toolu_click',
      name: 'computer',
      input: { action: 'left_click', coordinate: [100, 200] },
    });

    expect(toolResult).toEqual({
      type: 'tool_result',
      tool_use_id: 'toolu_click',
      content: [{ type: 'text', text: 'Action completed successfully.' }],
    });
  });

  test('builds error result for failed action', async () => {
    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: jest.fn(),
      click: jest.fn().mockRejectedValue(new Error('Element not found')),
      type: jest.fn(),
    });

    const toolResult = await agent.executeToolAndBuildResult({
      id: 'toolu_fail',
      name: 'computer',
      input: { action: 'left_click', coordinate: [999, 999] },
    });

    expect(toolResult.content[0].text).toContain('Error: Element not found');
  });
});

Agent Loop Reliability Testing

Test the overall reliability of your Computer Use integration:

describe('Agent reliability', () => {
  test('handles API rate limit errors with exponential backoff', async () => {
    const rateLimitError = new Anthropic.RateLimitError(
      429,
      { error: { type: 'rate_limit_error', message: 'Rate limit exceeded' } },
      'Rate limit exceeded',
      new Headers()
    );

    mockCreate
      .mockRejectedValueOnce(rateLimitError)
      .mockRejectedValueOnce(rateLimitError)
      .mockResolvedValueOnce(createFinalResponse('Done after retry.'));

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: jest.fn().mockResolvedValue(Buffer.from('ss')),
      click: jest.fn(),
      type: jest.fn(),
    }, { retryOnRateLimit: true, maxRetries: 3 });

    const result = await agent.runTask('Test task');

    expect(result.success).toBe(true);
    expect(mockCreate).toHaveBeenCalledTimes(3);
  });

  test('logs all tool calls for audit trail', async () => {
    const auditLog: any[] = [];
    
    mockCreate
      .mockResolvedValueOnce(createComputerUseResponse([{
        name: 'computer',
        input: { action: 'screenshot' },
      }]))
      .mockResolvedValueOnce(createFinalResponse('Done.'));

    const agent = new ComputerUseAgent(new Anthropic('test-key'), {
      screenshot: jest.fn().mockResolvedValue(Buffer.from('ss')),
      click: jest.fn(),
      type: jest.fn(),
    }, {
      onToolCall: (call) => auditLog.push({ ...call, timestamp: Date.now() }),
    });

    await agent.runTask('Log all actions');

    expect(auditLog).toHaveLength(1);
    expect(auditLog[0].input.action).toBe('screenshot');
    expect(auditLog[0].timestamp).toBeDefined();
  });
});

E2E Testing with HelpMeTest

Computer Use agents need monitoring in production to catch failures in real UI interactions. HelpMeTest can test your agent wrapper:

Navigate to https://your-app.com/agent-demo
Type "Search for 'best testing tools' and click the first result" in the task input
Click Run Agent
Verify agent starts executing within 5 seconds
Verify agent completes the task within 60 seconds
Verify a success message appears
Verify the agent log shows at least 3 tool calls

This verifies that your Computer Use agent integration, UI, and backend are all working together correctly in production.

Summary

Testing Computer Use API integrations requires:

  1. Tool call parsing tests — verify screenshot, click, type, and scroll action parsing
  2. Agent loop tests — completion detection, max iteration limits, stuck detection
  3. Tool execution tests — success/failure handling, retry logic
  4. Tool result construction tests — correct base64 encoding, format validation
  5. Reliability tests — rate limit backoff, audit logging, error propagation
  6. E2E monitoring — HelpMeTest for production agent loop health

The key insight: test the loop control logic and tool parsing with mocks, then use integration tests against a real sandboxed desktop to verify the actual click/type behavior.

Read more