How to Test Mastra Agents and Workflows

Your Mastra agent works perfectly in development. You've wired up the tools, the workflow steps chain together cleanly, and the LLM picks the right tool every time you test it manually. Then you ship it. A user hits an edge case your prompt didn't cover, the agent calls the wrong tool, and the workflow exits silently with no error. You find out from a support ticket.

This is what happens when you build fast with Mastra and don't build a test suite alongside it. Mastra makes TypeScript agent development genuinely fast, but that speed grows your blast radius along with your velocity.

Here's how to test Mastra agents properly, layer by layer.

What Makes Mastra Testing Different

Mastra is a TypeScript-first framework. That's a meaningful distinction from Python-first agent frameworks like LangChain or AutoGen. You get type safety, native async/await, and a test ecosystem (vitest, jest) that most TypeScript developers already know.

But Mastra introduces moving parts that standard unit tests don't cover well:

  • Tool calls — the agent must select the right tool for a given input
  • Memory — does the agent recall context from earlier in a thread?
  • Workflow steps — branching logic, parallel steps, error recovery
  • LLM calls — non-deterministic by default, expensive, and slow to run in CI

You need to test each layer independently, then test them together. Here's the stack.

Layer 1: Unit Testing Mastra Tools

Mastra tools are the easiest part to test. A tool is a function with a schema: input in, output out, with no LLM involved. Stub the network call and you can test it like any other function.

// tools/fetchWeather.ts
import { createTool } from '@mastra/core';
import { z } from 'zod';

export const fetchWeatherTool = createTool({
  id: 'fetch-weather',
  description: 'Get current weather for a city',
  inputSchema: z.object({ city: z.string() }),
  outputSchema: z.object({ temp: z.number(), condition: z.string() }),
  execute: async ({ context }) => {
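    // Encode the city so names with spaces or special characters produce a valid URL.
    const res = await fetch(`https://api.weather.example.com?city=${encodeURIComponent(context.city)}`);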
    const data = await res.json();
    return { temp: data.temperature, condition: data.sky };
  },
});

// tools/fetchWeather.test.ts
import { describe, it, expect, vi } from 'vitest';
import { fetchWeatherTool } from './fetchWeather';

vi.stubGlobal('fetch', vi.fn());

describe('fetchWeatherTool', () => {
  it('returns temperature and condition for a valid city', async () => {
    (fetch as any).mockResolvedValue({
      json: async () => ({ temperature: 22, sky: 'sunny' }),
    });

    const result = await fetchWeatherTool.execute({
      context: { city: 'Berlin' },
      runId: 'test-run',
    } as any);

    expect(result.temp).toBe(22);
    expect(result.condition).toBe('sunny');
  });

  it('propagates fetch errors without swallowing them', async () => {
    (fetch as any).mockRejectedValue(new Error('Network timeout'));

    await expect(
      fetchWeatherTool.execute({ context: { city: 'Berlin' }, runId: 'test-run' } as any)
    ).rejects.toThrow('Network timeout');
  });
});

Tool tests run fast, need no API keys, and catch mapping and error-handling bugs before they reach the agent layer.
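
One way to make a tool's logic easier to cover is to pull the response mapping out of `execute` into a pure function. The sketch below assumes the payload fields used in the example tool (`temperature`, `sky`); `toWeather` and `WeatherReport` are hypothetical names for illustration, not part of Mastra.

```typescript
// Hypothetical helper: maps the raw weather API payload to the tool's output shape.
// The field names (temperature, sky) mirror the example tool.
type WeatherReport = { temp: number; condition: string };

function toWeather(data: unknown): WeatherReport {
  if (typeof data !== 'object' || data === null) {
    throw new Error('Weather payload is not an object');
  }
  const { temperature, sky } = data as { temperature?: unknown; sky?: unknown };
  if (typeof temperature !== 'number' || Number.isNaN(temperature)) {
    throw new Error('Weather payload is missing a numeric temperature');
  }
  if (typeof sky !== 'string' || sky.length === 0) {
    throw new Error('Weather payload is missing a sky condition');
  }
  return { temp: temperature, condition: sky };
}
```

With a helper like this, `execute` would reduce to fetching and calling `toWeather(await res.json())`, and a malformed payload fails with a readable message instead of returning `undefined` fields.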

Layer 2: Testing Mastra Agents

Agent tests are where things get interesting. The agent uses an LLM to decide which tool to call — you don't want to hit a real LLM in CI. Mock it.

// agents/weatherAgent.ts
import { Agent } from '@mastra/core';
import { openai } from '@ai-sdk/openai';
import { fetchWeatherTool } from '../tools/fetchWeather';

export const weatherAgent = new Agent({
  name: 'Weather Agent',
  instructions: 'You help users get weather information. Always use the fetch-weather tool.',
  model: openai('gpt-4o'),
  tools: { fetchWeatherTool },
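});

// agents/weatherAgent.test.ts
import { describe, it, expect, vi } from 'vitest';
import { weatherAgent } from './weatherAgent';

// Keep the tool's network call out of the test: the tool runs once the mocked model requests it.
vi.stubGlobal('fetch', vi.fn().mockResolvedValue({
  json: async () => ({ temperature: 22, sky: 'sunny' }),
}));

// Stand-in model. Assumption: the installed AI SDK expects the v1 language-model shape
// (specificationVersion 'v1', tool-call args as a JSON string); adjust for other versions.
vi.mock('@ai-sdk/openai', () => ({
  openai: () => ({
    specificationVersion: 'v1',
    provider: 'mock',
    modelId: 'mock-model',
    defaultObjectGenerationMode: undefined,
    doGenerate: vi.fn().mockResolvedValue({
      toolCalls: [{
        toolCallType: 'function',
        toolCallId: 'call-1',
        toolName: 'fetchWeatherTool',
        args: JSON.stringify({ city: 'Berlin' }),
      }],
      text: '',
      finishReason: 'tool-calls',
      usage: { promptTokens: 10, completionTokens: 5 },
      rawCall: { rawPrompt: null, rawSettings: {} },
    }),
  }),
}));

describe('weatherAgent', () => {
  it('selects fetchWeatherTool when asked about weather', async () => {
    const response = await weatherAgent.generate(
      'What is the weather in Berlin?',
      // One step only: this mock would otherwise request the tool again on every turn.
      { maxSteps: 1, onStepFinish: vi.fn() }
    );

    const toolCall = response.steps?.[0]?.toolCalls?.[0];
    expect(toolCall?.toolName).toBe('fetchWeatherTool');
    expect(toolCall?.args?.city).toBe('Berlin');
  });
});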

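The key assertion is tool selection: did the call reach the right tool with the right arguments? That's the contract you're enforcing. With the model mocked, the test is deterministic and free to run, but it verifies the wiring between agent and tool, not the model's judgment. Whether the real model picks that tool for a given prompt is something only a run against the real model (Layer 5) can show.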
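
Tool-selection assertions tend to repeat across agent tests, so a small helper can keep them readable. The sketch below assumes only the response fields read in the agent test (`steps`, `toolCalls`, `toolName`, `args`); `AgentStep`, `allToolCalls`, and `requireToolCall` are hypothetical names, not Mastra APIs.

```typescript
// Minimal shapes for the parts of a response these helpers read.
type ToolCall = { toolName: string; args?: Record<string, unknown> };
type AgentStep = { toolCalls?: ToolCall[] };

// Collect every tool call across all steps, in order.
function allToolCalls(steps: AgentStep[] | undefined): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const step of steps ?? []) {
    for (const call of step.toolCalls ?? []) calls.push(call);
  }
  return calls;
}

// Return the first call to the named tool, or throw with a message
// that lists which tools were called instead.
function requireToolCall(steps: AgentStep[] | undefined, toolName: string): ToolCall {
  const calls = allToolCalls(steps);
  const match = calls.filter((call) => call.toolName === toolName)[0];
  if (!match) {
    const seen = calls.map((call) => call.toolName).join(', ') || 'none';
    throw new Error(`Expected a call to "${toolName}", saw: ${seen}`);
  }
  return match;
}
```

In a test, `requireToolCall(response.steps, 'fetchWeatherTool')` would replace the chain of optional lookups, and a failure names the tools that were called rather than reporting `undefined`.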

Layer 3: Testing Mastra Workflows

Workflows are where most production bugs hide. A workflow with five steps and two branches has paths you won't manually test every time you make a change.

// workflows/reportWorkflow.ts
import { Workflow, Step } from '@mastra/core';

const gatherData = new Step({
  id: 'gather-data',
  execute: async ({ context }) => {
    return { rows: context.triggerData.rows };
  },
});

const analyzeData = new Step({
  id: 'analyze-data',
  execute: async ({ context }) => {
    const { rows } = context.machineContext.stepResults['gather-data'].output;
    if (rows.length === 0) throw new Error('No data to analyze');
    return { summary: `Analyzed ${rows.length} rows` };
  },
});

export const reportWorkflow = new Workflow({ name: 'report-workflow' })
  .step(gatherData)
  .then(analyzeData)
  .commit();

// workflows/reportWorkflow.test.ts
import { describe, it, expect } from 'vitest';
import { reportWorkflow } from './reportWorkflow';

describe('reportWorkflow', () => {
  it('analyzes rows and returns a summary', async () => {
    const run = reportWorkflow.createRun();
    const result = await run.start({ triggerData: { rows: ['a', 'b', 'c'] } });

    expect(result.results['analyze-data'].status).toBe('success');
    expect(result.results['analyze-data'].output.summary).toBe('Analyzed 3 rows');
  });

  it('fails at analyze-data step when rows are empty', async () => {
    const run = reportWorkflow.createRun();
    const result = await run.start({ triggerData: { rows: [] } });

    expect(result.results['analyze-data'].status).toBe('failed');
    expect(result.results['analyze-data'].error).toMatch(/No data to analyze/);
  });
});

Test each step's output directly. Test the error paths explicitly — don't assume they work. A workflow that silently exits on empty input is a bug waiting to become a production incident.
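
A silent exit usually shows up as a step that never produced a result. A minimal sketch of a check for that follows, assuming the `results` shape read in the workflow tests (a `status` per step id); `StepResult` and `unfinishedSteps` are hypothetical names.

```typescript
// Shape of a per-step result, as read in the workflow tests.
type StepResult = { status: string; output?: unknown; error?: string };

// Return the ids of expected steps that did not finish with status 'success',
// including steps that never produced a result at all.
function unfinishedSteps(
  results: Record<string, StepResult | undefined>,
  expectedStepIds: string[],
): string[] {
  return expectedStepIds.filter((id) => results[id]?.status !== 'success');
}
```

In a happy-path test, asserting that `unfinishedSteps(result.results, ['gather-data', 'analyze-data'])` is empty fails when a step is skipped as well as when it errors.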

Layer 4: Testing Memory Integration

Mastra's memory lets agents recall earlier turns in a thread. If you rely on it, you need to verify it actually works — not just that the API call doesn't throw.

// memory/agentMemory.test.ts
import { describe, it, expect } from 'vitest';
import { Memory } from '@mastra/memory';
import { LibSQLStore } from '@mastra/memory/storage';
import { weatherAgent } from '../agents/weatherAgent';

describe('weatherAgent memory', () => {
  it('recalls a city mentioned in a previous turn', async () => {
    const memory = new Memory({ storage: new LibSQLStore({ url: ':memory:' }) });
    const threadId = 'test-thread-1';
    const resourceId = 'user-test-1';

    // First turn: mention the city
    await weatherAgent.generate('I live in Berlin.', {
      memory,
      threadId,
      resourceId,
    });

    // Second turn: ask without repeating the city
    const response = await weatherAgent.generate('What is the weather like here?', {
      memory,
      threadId,
      resourceId,
    });

    // The agent should have used Berlin from memory
    const toolCall = response.steps?.flatMap(s => s.toolCalls ?? [])
      .find(tc => tc.toolName === 'fetchWeatherTool');

    expect(toolCall?.args?.city).toBe('Berlin');
  });
});

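This test fails fast if your memory configuration is wrong: wrong storage, a missing thread ID, or context window trimming that drops the earlier message. The assertion depends on the model reading the earlier turn, so it is only meaningful against a real model or a mock that inspects the prompt it receives. A mock that always returns Berlin passes regardless of memory.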
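
The trimming failure mode is easy to picture with a toy model of a context window. The sketch below is an illustration only and not Mastra's implementation; `Message`, `trimToWindow`, and `windowMentions` are hypothetical names.

```typescript
// Hypothetical message shape and trimming rule, for illustration only.
type Message = { role: 'user' | 'assistant'; content: string };

// Keep only the most recent `maxMessages` messages of a thread.
function trimToWindow(messages: Message[], maxMessages: number): Message[] {
  if (maxMessages <= 0) return []; // slice(-0) would return the whole array
  return messages.slice(-maxMessages);
}

// True if any message still visible to the model mentions the given text.
function windowMentions(messages: Message[], text: string): boolean {
  const needle = text.toLowerCase();
  return messages.some((m) => m.content.toLowerCase().indexOf(needle) !== -1);
}
```

With a three-message thread that opens with "I live in Berlin.", a window of three still contains the city while a window of one does not, which is the situation the memory test is meant to catch.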

Layer 5: End-to-End Testing Mastra-Powered Apps

The layers above test your agent logic in isolation. But your users interact with an interface — a chat UI, a dashboard, an API endpoint that wraps the agent. That surface needs its own tests.

End-to-end tests for Mastra-powered apps should:

  • Hit the actual running application (not mock the agent)
  • Simulate real user input through the UI or API
  • Assert the visible outcome: the response content, the state change, the data written

// e2e/weatherApp.test.ts (using a real API endpoint)
import { describe, it, expect } from 'vitest';

const BASE_URL = process.env.APP_URL ?? 'http://localhost:3000';

describe('Weather app API', () => {
  it('returns a weather summary for a valid city query', async () => {
    const res = await fetch(`${BASE_URL}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: 'What is the weather in Berlin?' }),
    });

    expect(res.status).toBe(200);
    const body = await res.json();
    expect(body.reply).toMatch(/berlin/i);
    expect(body.reply).toMatch(/\d+/); // contains a temperature
  });

  it('handles an unrecognized city gracefully', async () => {
    const res = await fetch(`${BASE_URL}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: 'Weather in Xqzrtplm?' }),
    });

    expect(res.status).toBe(200);
    const body = await res.json();
    expect(body.reply).toBeTruthy(); // agent responds, doesn't crash
  });
});

E2E tests catch integration failures that unit tests miss: misconfigured API routes, CORS issues, environment variable gaps, and prompt regressions that only appear with real LLM output.
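
Because real model output varies between runs, E2E assertions tend to hold up better when they check properties of the reply rather than exact strings. Here is a minimal sketch of such a check; `looksLikeWeatherReply` is a hypothetical helper, and the accepted temperature formats are an assumption about what the app returns.

```typescript
// Hypothetical check for a weather reply produced by a real model.
// It asserts properties of the text rather than an exact string.
function looksLikeWeatherReply(reply: string, city: string): boolean {
  const mentionsCity = reply.toLowerCase().indexOf(city.toLowerCase()) !== -1;
  // A number followed by a degree sign, the word "degree(s)", or a C/F unit,
  // e.g. "22°C", "72 F", "22 degrees".
  const temperature = /-?\d+(\.\d+)?\s*(°\s*[CF]?|degrees?|[CF]\b)/i;
  return mentionsCity && temperature.test(reply);
}
```

This is stricter than matching any digit, so a reply such as "Berlin has 3 million residents." no longer passes as a weather answer.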

How HelpMeTest Helps

The five layers above cover your agent internals. HelpMeTest covers what surrounds them — the web interface your users actually touch.

If your Mastra app has a browser UI, you can write plain-English tests against it without touching test code:

Go to the chat page
Type "What is the weather in Berlin?"
Wait for the response to appear
Verify the response contains a temperature

HelpMeTest runs these on a schedule and alerts you when they break. You can also use helpmetest health weather-chat 5m to set up a continuous uptime check that pings your app every five minutes and reports degradation before users notice.

Multi-viewport visual testing catches layout regressions in your chat UI across desktop and mobile. And because HelpMeTest persists browser state, you can authenticate once and reuse that session across all your tests — no re-login noise.

The free tier covers 10 tests. That's enough to cover your critical paths before you scale.

What to Actually Ship

Here's the minimum test coverage for a production Mastra app:

  1. Unit tests for every tool — happy path and error path
  2. Agent tests that assert tool selection for your key intents
  3. Workflow tests that walk every branch, including failure branches
  4. Memory tests if your agents maintain thread context
  5. At least one E2E test per critical API endpoint or UI flow

Start with the tools. They're fast to write and fast to run. Add agent and workflow tests as you add behavior. Add E2E monitoring once you're in production.

Mastra gives you the velocity to build fast. Tests give you the confidence to ship.


Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your Mastra app in under ten minutes at helpmetest.com.
