AI Testing

How to Test Apps Built with Gemini CLI

HelpMeTest

13 May 2026 — 5 min read

Google released Gemini CLI in 2025, and developers are already using it to build and scaffold applications at speed. Like Cursor and Claude Code before it, Gemini CLI shortens the path from idea to running code.

The problem is the same one that emerged with all AI-assisted development: the code runs. That's not the same as the code working correctly.

Gemini CLI generates plausible code. It handles the obvious paths. But it doesn't know your business logic edge cases, your data validation requirements, or the failure modes that matter in your production environment. The faster you ship with AI-generated code, the more important behavioral testing becomes.

What "Working" Means for Gemini CLI Apps

When you use Gemini CLI to scaffold or build an application, you get code that:

Compiles without errors
Runs without immediately crashing
Handles the happy path you described in the prompt

What you don't automatically get:

Correct behavior on edge cases you didn't describe
Proper error handling for unexpected inputs
Security validation at trust boundaries
Consistent behavior across the range of real user inputs
Regression safety — if you iterate with Gemini CLI, old behavior might break

This isn't a criticism of Gemini CLI. It's the nature of any code generation. The testing gap is your responsibility.

Layer 1: Behavioral Tests, Not Just Unit Tests

Gemini CLI apps need behavioral tests — tests that verify the application does what users need, not just that functions return values.

For a web app, start with end-to-end flows:

// Using Playwright for behavioral testing
import { test, expect } from '@playwright/test';

test('user can complete checkout flow', async ({ page }) => {
  await page.goto('/shop');
  
  // Add item to cart
  await page.click('[data-testid="add-to-cart-button"]');
  
  // Verify cart updated
  await expect(page.locator('[data-testid="cart-count"]')).toHaveText('1');
  
  // Proceed to checkout
  await page.click('[data-testid="checkout-button"]');
  
  // Fill payment form
  await page.fill('[name="card-number"]', '4242424242424242');
  await page.fill('[name="expiry"]', '12/28');
  await page.fill('[name="cvv"]', '123');
  
  await page.click('[data-testid="submit-payment"]');
  
  // Verify success
  await expect(page.locator('h1')).toContainText('Order confirmed');
  await expect(page.locator('[data-testid="order-id"]')).toBeVisible();
});

Write these before you start iterating with Gemini CLI. They become your regression safety net as you make changes.

Layer 2: Testing the Edge Cases Gemini CLI Missed

AI code generators are optimized for the common case. Test the edges:

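test('handles empty cart checkout attempt', async ({ page }) => {
  await page.goto('/checkout');
  // Cart is empty: the app should redirect or show a message, not crash

  // Accept either behavior; toPass retries until one of them holds
  await expect(async () => {
    const redirected = /\/cart|\/shop/.test(page.url());
    const messageShown = await page
      .locator('[data-testid="empty-cart-message"]')
      .isVisible();
    expect(redirected || messageShown).toBe(true);
  }).toPass({ timeout: 5000 });
});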

test('validates required fields before payment submission', async ({ page }) => {
  // Add an item first so checkout does not start with an empty cart
  await addItemToCart(page);
  await page.goto('/checkout');
  
  // Submit without filling card details
  await page.click('[data-testid="submit-payment"]');
  
  // Should show validation errors, not silently fail
  await expect(page.locator('[data-testid="card-number-error"]')).toBeVisible();
  await expect(page).not.toHaveURL(/\/confirmation/);
});

test('handles network error during checkout gracefully', async ({ page }) => {
  await page.route('**/api/payments', route => route.abort('failed'));
  
  // Add an item first so checkout does not start with an empty cart
  await addItemToCart(page);
  await page.goto('/checkout');
  await fillValidPaymentDetails(page);
  await page.click('[data-testid="submit-payment"]');
  
  // Should show error, not leave user on broken page
  await expect(page.locator('[role="alert"]')).toBeVisible();
  await expect(page.locator('[data-testid="submit-payment"]')).toBeEnabled();
});

Gemini CLI generates code that works for the described use case. You test the cases that weren't described.
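
The edge-case tests above call addItemToCart and fillValidPaymentDetails without defining them. Here is a minimal sketch of what they might look like, assuming the selectors and test card values from the Layer 1 checkout example. Both helper names come from the tests; their bodies and the PageLike type are assumptions for illustration. This addItemToCart navigates to /shop itself, so any page.goto('/checkout') has to come after it.

```typescript
// Minimal structural type covering only the Playwright Page methods used here.
// A real Playwright `page` satisfies it, and so does a simple fake in unit tests.
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
}

// Hypothetical helper: opens the shop page and adds the first product to the cart.
async function addItemToCart(page: PageLike): Promise<void> {
  await page.goto('/shop');
  await page.click('[data-testid="add-to-cart-button"]');
}

// Hypothetical helper: fills the payment form with the test card values
// used in the checkout flow example.
async function fillValidPaymentDetails(page: PageLike): Promise<void> {
  await page.fill('[name="card-number"]', '4242424242424242');
  await page.fill('[name="expiry"]', '12/28');
  await page.fill('[name="cvv"]', '123');
}
```

Keeping helpers like these in one shared file means a selector change after a Gemini CLI iteration is fixed in one place rather than in every test.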

Layer 3: Input Validation and Security Boundaries

Gemini CLI-generated apps often have basic validation but miss security edge cases. Test explicitly:

test('rejects XSS in user input fields', async ({ page }) => {
  // Record any alert dialog: it would mean the payload executed
  let dialogOpened = false;
  page.on('dialog', async dialog => {
    dialogOpened = true;
    await dialog.dismiss();
  });

  await page.goto('/profile');
  
  const xssPayload = '<script>alert("xss")</script>';
  await page.fill('[name="username"]', xssPayload);
  await page.click('[data-testid="save-profile"]');
  
  // Escaped text is acceptable; an injected <script> element is not
  const display = page.locator('[data-testid="username-display"]');
  await expect(display.locator('script')).toHaveCount(0);
  expect(dialogOpened).toBe(false);
});

test('rejects SQL injection in search', async ({ page }) => {
  await page.goto('/search');
  
  await page.fill('[name="q"]', "'; DROP TABLE users; --");
  await page.press('[name="q"]', 'Enter');
  
  // Should return no results or an error, not crash or expose DB errors
  const errorText = await page.locator('body').textContent();
  expect(errorText).not.toContain('SQL');
  expect(errorText).not.toContain('syntax error');
});
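
For context on what "sanitize" can mean on the application side, here is a minimal sketch of output escaping. The escapeHtml function is a hypothetical illustration, not something Gemini CLI is known to generate. In a real app, prefer your framework's built-in escaping or a vetted library over a hand-rolled function.

```typescript
// Characters that are significant in HTML, mapped to their entity form.
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: replaces HTML-significant characters so user input
// renders as inert text instead of markup.
function escapeHtml(input: string): string {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

const payload = '<script>alert("xss")</script>';
console.log(escapeHtml(payload));
// prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

An app that escapes on output still displays the payload as visible text, which is why a security test should look for injected elements or executed scripts rather than for the literal characters.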

Layer 4: Regression Testing After Gemini CLI Iterations

The real risk with AI-assisted development is regression: you ask Gemini CLI to add a feature, it does — and breaks something that was already working.

Your test suite is what catches this. Run it after every significant Gemini CLI session:

# Run your full behavioral test suite after each Gemini CLI iteration
npx playwright test

# Or run just the core flows first for quick feedback
npx playwright test --grep "@smoke"

Tag your most critical tests with @smoke so you can run the fast regression check first:

test('@smoke user can log in', async ({ page }) => {
  await page.goto('/login');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'password123');
  await page.click('[type="submit"]');
  await expect(page).toHaveURL('/dashboard');
});

After any Gemini CLI change: run smoke tests first, full suite second.
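
One way to encode that ordering in configuration is with Playwright projects, sketched below. The project names smoke and full are assumptions chosen for illustration. The sketch relies on Playwright's per-project grep, grepInvert, and dependencies options, so check that your Playwright version supports project dependencies before adopting it.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    // Runs only tests whose title contains @smoke
    { name: 'smoke', grep: /@smoke/ },
    // Runs all remaining tests, and only after the smoke project has passed
    { name: 'full', grepInvert: /@smoke/, dependencies: ['smoke'] },
  ],
});
```

With this in place, npx playwright test --project=smoke gives the quick check, while a plain npx playwright test runs the smoke tests first and skips the rest if they fail.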

Layer 5: API and Integration Testing

If Gemini CLI generated your API layer, test the contracts:

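import request from 'supertest';
import app from './app';

// Assumption: a token for a seeded test user. TEST_AUTH_TOKEN is a
// hypothetical variable name; obtain the token however your auth setup allows.
const validToken = process.env.TEST_AUTH_TOKEN;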

describe('POST /api/orders', () => {
  it('creates order for valid authenticated request', async () => {
    const response = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${validToken}`)
      .send({ productId: 'prod_123', quantity: 1 });
    
    expect(response.status).toBe(201);
    expect(response.body).toHaveProperty('orderId');
    expect(response.body.status).toBe('pending');
  });

  it('rejects unauthenticated order creation', async () => {
    const response = await request(app)
      .post('/api/orders')
      .send({ productId: 'prod_123', quantity: 1 });
    
    expect(response.status).toBe(401);
  });

  it('rejects order with invalid quantity', async () => {
    const response = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${validToken}`)
      .send({ productId: 'prod_123', quantity: -5 });
    
    expect(response.status).toBe(400);
  });
});
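
The third test expects a 400 for a negative quantity, which implies the endpoint validates the request body before creating anything. Below is a minimal sketch of such a check. The validateOrderInput function and its error messages are hypothetical, with field names taken from the request bodies above. It is worth comparing against whatever validation Gemini CLI generated for you.

```typescript
interface OrderInput {
  productId: string;
  quantity: number;
}

type ValidationResult =
  | { ok: true; value: OrderInput }
  | { ok: false; errors: string[] };

// Hypothetical validator: accepts an untrusted request body and either returns
// a typed order or a list of reasons the request should get a 400.
function validateOrderInput(body: unknown): ValidationResult {
  const errors: string[] = [];
  const data = (typeof body === 'object' && body !== null ? body : {}) as Record<string, unknown>;

  const productId = data.productId;
  if (typeof productId !== 'string' || productId.trim() === '') {
    errors.push('productId must be a non-empty string');
  }

  const quantity = data.quantity;
  if (typeof quantity !== 'number' || !Number.isInteger(quantity) || quantity < 1) {
    errors.push('quantity must be a positive integer');
  }

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return { ok: true, value: { productId: productId as string, quantity: quantity as number } };
}
```

The same function suggests further contract tests: zero, fractional, and string quantities, a missing productId, and a body that is not an object at all.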

What Local Tests Miss

Your local test suite runs against your development environment. Production fails differently:

Real user inputs — users phrase things in ways you didn't anticipate. Gemini CLI's generated validation handles what you described, not what users actually do.
Load behavior — works fine for one user in tests, breaks under concurrent load in production.
Third-party API changes — the integrations Gemini CLI wired up change their contracts. Your local mocks don't catch this.
Data at scale — tests run against small test datasets. Production has real data volumes that surface N+1 queries, pagination bugs, and performance issues.
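
The load behavior point can be illustrated with a toy model. The sketch below is not a load test and does not come from any real app. Inventory and reserveNaive are hypothetical names for a check-then-write pattern that passes a single-user test yet oversells when requests overlap.

```typescript
// Hypothetical in-memory inventory with a gap between reading and writing stock.
class Inventory {
  private stock: number;

  constructor(initialStock: number) {
    this.stock = initialStock;
  }

  // Reads the stock, yields once (a stand-in for database latency),
  // then writes a value based on the possibly stale read.
  async reserveNaive(): Promise<boolean> {
    const current = this.stock;
    await Promise.resolve();
    if (current <= 0) return false;
    this.stock = current - 1;
    return true;
  }

  get remaining(): number {
    return this.stock;
  }
}

// Fires all reservation attempts at once and counts how many succeeded.
async function countSuccessfulReservations(inventory: Inventory, attempts: number): Promise<number> {
  const results = await Promise.all(
    Array.from({ length: attempts }, () => inventory.reserveNaive()),
  );
  return results.filter(Boolean).length;
}

async function demo(): Promise<void> {
  // One user at a time: the second attempt is correctly refused.
  const sequential = new Inventory(1);
  console.log(await sequential.reserveNaive(), await sequential.reserveNaive());

  // Five overlapping users and one item in stock: every attempt succeeds.
  const concurrent = new Inventory(1);
  console.log(await countSuccessfulReservations(concurrent, 5));
}

demo();
```

A test that only ever makes one request at a time never sees the second case, which is why concurrency needs its own checks against a production-like environment.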

Monitoring Gemini CLI Apps in Production

Once your app is live, you need ongoing behavioral monitoring.

HelpMeTest lets you write natural language tests against your deployed app and run them on a schedule:

Test: checkout flow works end-to-end
Go to https://yourapp.com
Click "Add to Cart" on the first product
Click "Checkout"
Fill in test payment details
Click "Submit"
Then: page shows order confirmation
And: order ID is visible
And: no JavaScript errors appear

Tests run continuously. If something Gemini CLI generated breaks after a dependency update, a data change, or a production environment difference, you find out before your users do.

Free tier: 10 tests, unlimited health checks. Try HelpMeTest →

Gemini CLI App Testing Checklist

For every app built or significantly modified with Gemini CLI:

Behavioral tests for all core user flows before shipping
Edge case tests — empty states, invalid inputs, boundary values
Error handling tests — network failures, API errors, validation failures
Security boundary tests — XSS, injection, unauthorized access
Regression test suite that runs after every AI iteration session
API contract tests for all generated endpoints
Production behavioral monitoring for post-deploy breakage

Gemini CLI builds the scaffolding. Your tests verify it holds weight.
