Stagehand by Browserbase: AI-Powered Browser Automation for Testing

HelpMeTest

20 May 2026 — 5 min read

Browser automation has always been brittle. You write a selector like #submit-btn, the design team replaces it with .cta-button, and every test that relies on that selector fails. Stagehand, an open-source framework built by Browserbase, takes a different approach: instead of hard-coded selectors, it uses large language models to interpret and interact with web pages from a plain-language description of the intent.

This post walks through what Stagehand is, how it works, where it shines, and where it struggles — with practical code examples you can run today.

What Is Stagehand?

Stagehand is an open-source TypeScript framework that wraps Playwright with an AI layer. Rather than requiring developers to write precise CSS or XPath selectors, Stagehand lets you describe what you want to do in plain English, and the underlying LLM figures out how to execute it.

The project is maintained by Browserbase, a cloud browser infrastructure company. Browserbase provides headless Chromium instances in the cloud, and Stagehand is their answer to the "how do you actually automate complex web tasks reliably?" question.

At its core, Stagehand provides three main primitives:

act() — perform an action on the page described in natural language
extract() — pull structured data from the current page
observe() — return a list of possible actions available on the page

These three methods, combined with standard Playwright capabilities, cover the vast majority of browser automation use cases.
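Of the three primitives, observe() is the easiest to overlook in a testing context. A minimal sketch of one way to use its output, assuming each candidate is an object with a description and a selector: the ObservedAction shape and the findAction helper are hypothetical, and the real return type depends on your Stagehand version.

```typescript
// Hypothetical shape for one observe() candidate; check your Stagehand
// version for the actual return type.
interface ObservedAction {
  description: string;
  selector: string;
}

// Returns the first candidate whose description mentions every keyword
// (case-insensitive), or undefined when no candidate matches.
function findAction(
  candidates: ObservedAction[],
  ...keywords: string[]
): ObservedAction | undefined {
  const wanted = keywords.map((k) => k.toLowerCase());
  return candidates.find((candidate) => {
    const text = candidate.description.toLowerCase();
    return wanted.every((k) => text.includes(k));
  });
}

// Stand-in data; in a real test this would come from `await page.observe()`.
const sampleCandidates: ObservedAction[] = [
  { description: "Search products input", selector: "#search" },
  { description: "Button to proceed to checkout", selector: "#go-checkout" },
];
```

A test could then assert that findAction(candidates, "checkout") is defined before attempting the step, which turns "the page no longer offers this action" into an explicit failure rather than a confusing act() error.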

Key Features

Natural Language Actions

The act() method is the heart of Stagehand. You describe what you want to do, and the LLM translates that into actual browser interactions:

await page.act("Click the login button");
await page.act("Fill in the email field with test@example.com");
await page.act("Select 'Monthly' from the billing cycle dropdown");

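Under the hood, Stagehand builds a representation of the current page, sends it to an LLM (such as GPT-4o or Claude) together with your instruction, and gets back the element and action to perform. Depending on the Stagehand version and settings, that representation is a processed DOM or accessibility tree, optionally supplemented by a screenshot for vision-capable models.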

Structured Data Extraction

The extract() method lets you pull structured data from pages without writing complex scraping logic:

const { products } = await page.extract({
  instruction: "Extract all product names and prices from this page",
  schema: z.object({
    products: z.array(z.object({
      name: z.string(),
      price: z.number(),
    }))
  })
});

The schema is defined using Zod, giving you type-safe extracted data. This is particularly useful for testing data integrity — verifying that your application displays the correct information.
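When a value such as an order total is extracted as a string, it usually needs normalizing before it can be compared in an assertion. The sketch below assumes US-style formatting such as "$1,234.50"; parsePriceToCents is a hypothetical helper, not part of Stagehand or Zod.

```typescript
// Converts a displayed price like "$1,234.50" into integer cents so that
// comparisons avoid floating-point rounding. Assumes "." is the decimal
// separator and "," is a thousands separator.
function parsePriceToCents(raw: string): number {
  // Drop currency symbols, thousands separators, letters, and whitespace.
  const cleaned = raw.replace(/[^0-9.\-]/g, "");
  if (!/^-?\d+(\.\d{1,2})?$/.test(cleaned)) {
    throw new Error(`Unrecognized price format: "${raw}"`);
  }
  const negative = cleaned[0] === "-";
  const [whole, fraction = ""] = cleaned.replace("-", "").split(".");
  // Pad "9" to "90" so that "19.9" means 19 dollars and 90 cents.
  const cents = Number(whole) * 100 + Number((fraction + "00").slice(0, 2));
  return negative ? -cents : cents;
}
```

Comparing cents as integers keeps the check exact, and the thrown error makes an unexpected format (for example a price range or the word "Free") fail loudly instead of silently parsing to a wrong number.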

Full Testing Example

Here's a complete test scenario using Stagehand with a testing mindset:

import assert from "node:assert/strict";
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

// Model options differ between Stagehand releases; modelName and
// modelClientOptions match the 1.x-style API used in this post.
const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 1,
  modelName: "gpt-4o",
  modelClientOptions: { apiKey: process.env.OPENAI_API_KEY },
});

async function testCheckoutFlow() {
  await stagehand.init();
  const page = stagehand.page;

  try {
    // Navigate to the app
    await page.goto("https://your-app.com/shop");

    // Add item to cart using natural language
    await page.act("Click 'Add to Cart' on the first product");

    // Verify cart updated (assert throws on failure, unlike console.assert)
    const cartCount = await page.extract({
      instruction: "What is the number shown in the cart badge?",
      schema: z.object({ count: z.number() })
    });
    assert.equal(cartCount.count, 1, "Cart should have 1 item");

    // Proceed through checkout
    await page.act("Click the cart icon");
    await page.act("Click 'Proceed to Checkout'");
    await page.act("Fill in shipping address with: 123 Main St, New York, NY 10001");
    await page.act("Select standard shipping");

    // Verify order summary
    const summary = await page.extract({
      instruction: "Extract the order total and item count from the order summary",
      schema: z.object({
        total: z.string(),
        itemCount: z.number()
      })
    });

    assert.equal(summary.itemCount, 1, "Order should contain 1 item");
    console.log("Order total:", summary.total);
  } finally {
    // Always release the browser, even when a step or assertion fails
    await stagehand.close();
  }
}

testCheckoutFlow().catch((error) => {
  console.error(error);
  process.exitCode = 1; // make the failure visible to CI
});

Integration with Browserbase Cloud

While Stagehand works locally, it's designed to pair with Browserbase's cloud infrastructure:

const stagehand = new Stagehand({
  env: "BROWSERBASE",
  apiKey: process.env.BROWSERBASE_API_KEY,
  projectId: process.env.BROWSERBASE_PROJECT_ID,
  verbose: 1,
});

This gives you access to residential proxies, browser fingerprinting, and persistent sessions — useful for testing applications that have bot detection.

Pros and Cons

Strengths

Resilience to UI changes. Because Stagehand understands the semantic meaning of your instructions ("click the submit button"), it's far less likely to break when class names or element IDs change. The LLM can identify the button by its text or visual appearance rather than its technical attributes.

Rapid test authoring. Writing await page.act("Click 'Add to Cart'") is faster than inspecting the DOM to find the right selector. This significantly lowers the barrier to test creation, especially for team members who aren't deep Playwright experts.

Excellent for complex interactions. Tasks that would require dozens of lines of Playwright code — like navigating a multi-step wizard or interacting with a third-party widget — can often be expressed in a few natural language instructions.

Visual understanding. Stagehand can use vision-capable LLMs, meaning it can interpret charts, icons, and visually rendered content that standard DOM-based automation struggles with.

Weaknesses

Latency. Every act() call involves an LLM API call, which adds roughly 1-5 seconds of overhead depending on the model and the size of the page. For large test suites, this can make runs far slower than selector-based Playwright tests, which make no model calls.

Cost. GPT-4o or Claude Sonnet API calls add up. A test suite with 100 action steps could cost dollars per run, depending on the model and how much page context each call sends, which is significant when the suite runs on every commit.
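A back-of-the-envelope estimate makes the latency and cost trade-off concrete. The sketch below assumes every act() or extract() step makes exactly one LLM call and that steps run sequentially; the per-call latency and price are placeholders to replace with your own measurements, not published figures.

```typescript
interface SuiteEstimateInput {
  llmSteps: number;        // act()/extract() calls per run
  secondsPerCall: number;  // assumed average LLM round trip
  dollarsPerCall: number;  // assumed average API cost per call
  runsPerDay: number;      // how often the suite runs
}

interface SuiteEstimate {
  addedMinutesPerRun: number;
  dollarsPerRun: number;
  dollarsPerMonth: number; // based on a 30-day month
}

// Simple linear model: overhead scales with the number of LLM-backed steps.
function estimateSuiteOverhead(input: SuiteEstimateInput): SuiteEstimate {
  const addedMinutesPerRun = (input.llmSteps * input.secondsPerCall) / 60;
  const dollarsPerRun = input.llmSteps * input.dollarsPerCall;
  const dollarsPerMonth = dollarsPerRun * input.runsPerDay * 30;
  return { addedMinutesPerRun, dollarsPerRun, dollarsPerMonth };
}

// Illustrative numbers only: 100 steps, 3 s and $0.01 per call, 20 runs a day.
const example = estimateSuiteOverhead({
  llmSteps: 100,
  secondsPerCall: 3,
  dollarsPerCall: 0.01,
  runsPerDay: 20,
});
console.log(example);
```

With those illustrative inputs the model gives about 5 extra minutes and $1 per run, or roughly $600 a month. Caching, parallel workers, or cheaper models would all change the result, so treat this as a way to frame the question rather than a prediction.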

Non-determinism. LLMs don't always make the same decision twice. A step that passes today might fail tomorrow if the model interprets the page differently. This requires thoughtful test design and sometimes retry logic.
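The retry logic mentioned above can be sketched as a small generic wrapper. withRetry is a hypothetical helper, not a Stagehand API, and the defaults of three attempts and a 500 ms pause are arbitrary choices.

```typescript
// Runs an async step up to maxAttempts times, pausing between attempts.
// Resolves with the first successful result; rethrows the last error if
// every attempt fails.
async function withRetry<T>(
  step: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await step();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Give the page a moment to settle before trying again.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Usage might look like await withRetry(() => page.act("Click the cart icon")). A retry only triggers when the step throws, so pair the action with a check inside the same callback if a wrong click would otherwise pass silently. Be cautious with steps that are not idempotent: retrying "Add to Cart" after a partial success could add the item twice.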

Debugging difficulty. When a natural language instruction fails, it can be harder to understand why than when a precise selector fails. Was it the instruction? The page state? The LLM's interpretation?
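One way to narrow down those questions is to record every instruction with its outcome and timing. This is a sketch under the assumption that you wrap each step yourself; tracedStep and StepRecord are hypothetical names, and Stagehand's own verbose logging may already cover part of this.

```typescript
interface StepRecord {
  instruction: string;
  ok: boolean;
  durationMs: number;
  error?: string;
}

// In-memory trace of every step in the current run.
const stepLog: StepRecord[] = [];

// Runs one natural-language step, records how it went, and rethrows any
// failure so the test still fails at the point of the problem.
async function tracedStep<T>(
  instruction: string,
  run: (instruction: string) => Promise<T>,
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await run(instruction);
    stepLog.push({ instruction, ok: true, durationMs: Date.now() - startedAt });
    return result;
  } catch (error) {
    stepLog.push({
      instruction,
      ok: false,
      durationMs: Date.now() - startedAt,
      error: error instanceof Error ? error.message : String(error),
    });
    throw error;
  }
}
```

A call such as await tracedStep("Click the cart icon", (text) => page.act(text)) leaves a trail showing which instruction failed and how long each one took. Dumping stepLog alongside a screenshot when a test fails gives you the instruction and the page state side by side.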

Stagehand vs. Traditional Playwright

Traditional Playwright gives you complete control and predictability at the cost of brittleness and maintenance overhead. A typical Playwright test might look like:

await page.click('[data-testid="add-to-cart-button"]');
await expect(page.locator('.cart-badge')).toHaveText('1');

This works perfectly — until data-testid="add-to-cart-button" gets renamed to data-testid="cart-add-btn" and suddenly your CI is red.

Stagehand trades that precision for resilience. The equivalent becomes:

await page.act("Click the Add to Cart button");
const badge = await page.extract({
  instruction: "What number is shown in the cart badge?",
  schema: z.object({ count: z.number() })
});
expect(badge.count).toBe(1);

The second version survives cosmetic UI refactors. But it's slower, more expensive, and occasionally unpredictable.

The right answer for most teams is a hybrid: use Stagehand for high-level, semantically complex flows, and fall back to standard Playwright for performance-critical or highly deterministic assertions.
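One possible shape for such a hybrid, sketched under stated assumptions: try the fast, deterministic selector first and only pay for an LLM call when it fails. HybridPage is a minimal stand-in for the parts of the page object used here, and clickWithFallback is a hypothetical helper, not something Stagehand ships.

```typescript
// Minimal stand-in for the page methods this sketch relies on.
interface HybridPage {
  click(selector: string, options?: { timeout?: number }): Promise<void>;
  act(instruction: string): Promise<unknown>;
}

// Tries the selector with a short timeout; on failure, falls back to the
// natural-language instruction. Returns which path was taken so the test
// can log or count fallbacks.
async function clickWithFallback(
  page: HybridPage,
  selector: string,
  instruction: string,
  timeoutMs = 2000,
): Promise<"selector" | "ai"> {
  try {
    await page.click(selector, { timeout: timeoutMs });
    return "selector";
  } catch (selectorError) {
    // The selector may be stale after a UI refactor; let the LLM find it.
    await page.act(instruction);
    return "ai";
  }
}
```

Logging every "ai" result is worthwhile: a fallback that fires on each run is a sign the selector needs updating, and leaving it unfixed quietly reintroduces the latency and cost the hybrid was meant to avoid.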

Where HelpMeTest Fits In

If Stagehand's value proposition — writing tests in plain language — resonates with you, but you want something that's managed, monitored, and doesn't require you to wire up LLM API keys, HelpMeTest takes this idea further.

HelpMeTest uses Robot Framework with Playwright under the hood and lets you define tests in natural language that run on a schedule. Instead of managing your own Browserbase account, OpenAI keys, and test infrastructure, you describe what to test and HelpMeTest handles the execution. The Pro plan at $100/month includes AI-powered test generation, meaning you can paste your app's URL and let the system propose tests automatically.

Where Stagehand is a developer framework you integrate into your codebase, HelpMeTest is a testing service you point at your live application.

Conclusion

Stagehand represents a genuine shift in how browser automation can work. By replacing fragile selectors with natural language instructions backed by vision-capable LLMs, it makes tests more resilient to UI changes and faster to write. The trade-offs — latency, cost, non-determinism — are real but manageable with good test design.

For teams already invested in Playwright who want to add AI resilience to specific complex flows, Stagehand is worth serious evaluation. Install it, run the quickstart, and see how many of your flaky tests become stable.

npm install @browserbasehq/stagehand

If you'd rather not manage the infrastructure yourself, check out HelpMeTest for a fully managed AI testing platform that handles the complexity for you.