Vertex AI Model Testing: Evaluation Pipelines and Model Monitoring

HelpMeTest

21 May 2026 — 4 min read

Vertex AI provides a dedicated Evaluation Service for scoring generative model outputs, so you do not need a third-party judge model. Combined with Model Monitoring for skew and drift detection, it lets you build a quality assurance loop that runs on every deployment.

Vertex AI Evaluation Service Overview

The Evaluation Service scores model responses using:

Pointwise metrics: score a single response (fluency, safety, groundedness, instruction following)
Pairwise metrics: compare two responses and pick the better one
Computation-based metrics: BLEU, ROUGE, exact match (no LLM judge required)

You submit a dataset, specify metrics, and get back per-sample and aggregate scores. For the pointwise and pairwise metrics the judge is Gemini, so no external API key is required if you're already on GCP.
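To make the computation-based category concrete, here is a minimal sketch of two such metrics in plain Python. It is an illustration only, not the service's implementation: the function names are hypothetical, tokenization is a simple lowercase whitespace split, and the ROUGE-L variant is the basic longest-common-subsequence F1 rather than the summary-level `rouge_l_sum`.

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 when the two strings match after trimming whitespace."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    resp_tokens = response.lower().split()
    ref_tokens = reference.lower().split()
    if not resp_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(resp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(resp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Because these scores are pure string comparisons, they are cheap and deterministic, which is why they pair well with the slower judge-based metrics.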

Setup

pip install "google-cloud-aiplatform[evaluation]" pandas

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel
import pandas as pd

vertexai.init(project="my-gcp-project", location="us-central1")

Running a Basic Evaluation

# Define your evaluation dataset
eval_data = pd.DataFrame({
    "prompt": [
        "Summarize the following: The Eiffel Tower was built in 1889 for the World's Fair.",
        "What is the primary purpose of a load balancer?",
        "Explain the difference between TCP and UDP.",
    ],
    "reference": [
        "The Eiffel Tower was constructed in 1889 for the World's Fair.",
        "A load balancer distributes incoming network traffic across multiple servers.",
        "TCP provides reliable, ordered delivery; UDP is faster but does not guarantee delivery.",
    ],
})

# Configure the model under test
model = GenerativeModel("gemini-1.5-flash-001")

# Create and run the evaluation task
eval_task = EvalTask(
    dataset=eval_data,
    metrics=[
        "rouge_l_sum",
        "bleu",
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    ],
    experiment="my-model-eval",
)

result = eval_task.evaluate(model=model)

print(result.summary_metrics)
# Illustrative output; actual scores vary by model and run:
# {'rouge_l_sum/mean': 0.82, 'bleu/mean': 0.61, 'fluency/mean': 4.3, 'groundedness/mean': 0.91}
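The aggregate numbers hide which samples scored badly. The evaluation result also exposes per-sample scores as a DataFrame (`result.metrics_table`), and a small helper can pull out the rows worth reading. The sketch below works on plain dictionaries, such as those from `result.metrics_table.to_dict("records")`. The helper name, the sample rows, and the `fluency/score` column name are assumptions for illustration, so check the column names in your own table.

```python
import math
from typing import Any


def rows_below(rows: list[dict[str, Any]], column: str, threshold: float) -> list[dict[str, Any]]:
    """Return rows whose score is missing, NaN, or below the threshold."""
    flagged = []
    for row in rows:
        score = row.get(column)
        missing = score is None or (isinstance(score, float) and math.isnan(score))
        if missing or score < threshold:
            flagged.append(row)
    return flagged


# Hypothetical stand-in for result.metrics_table.to_dict("records")
sample_rows = [
    {"prompt": "p1", "fluency/score": 5.0},
    {"prompt": "p2", "fluency/score": 2.0},
    {"prompt": "p3", "fluency/score": None},
]

flagged = rows_below(sample_rows, "fluency/score", threshold=3.0)
print([row["prompt"] for row in flagged])  # ['p2', 'p3']
```

Treating a missing score as a failure is a deliberate choice here: a row the judge could not score deserves a look as much as a row it scored poorly.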

Custom Pointwise Metrics

Define domain-specific evaluation criteria using natural language:

from vertexai.evaluation import PointwiseMetric

# Custom metric: does the response avoid jargon?
plain_language_metric = PointwiseMetric(
    metric="plain_language",
    metric_prompt_template="""
    Evaluate whether the response uses plain, accessible language appropriate for a non-technical audience.
    
    Response: {response}
    
    Rate on a scale of 1-5:
    1 = Heavy technical jargon, incomprehensible to non-experts
    3 = Mixed — some jargon but mostly accessible  
    5 = Clear, plain language with no unexplained technical terms
    
    Return only the numeric score.
    """,
)

# Custom metric: does the response stay on topic?
on_topic_metric = PointwiseMetric(
    metric="on_topic",
    metric_prompt_template="""
    Given the prompt and response, evaluate whether the response directly addresses what was asked.
    
    Prompt: {prompt}
    Response: {response}
    
    Rate on a scale of 1-5:
    1 = Completely off-topic
    3 = Partially addresses the question
    5 = Directly and completely addresses the question
    
    Return only the numeric score.
    """,
)

eval_task = EvalTask(
    dataset=eval_data,
    metrics=[plain_language_metric, on_topic_metric],
)

result = eval_task.evaluate(model=model)
print(result.summary_metrics)

Pairwise Model Comparison

Compare two model versions to decide which performs better before promoting:

from vertexai.evaluation import PairwiseMetric, EvalTask

# baseline: current production model
# candidate: new model you want to promote
baseline_model = GenerativeModel("gemini-1.0-pro-001")
candidate_model = GenerativeModel("gemini-1.5-flash-001")

# Define pairwise comparison. The baseline model is attached to the metric;
# {response} is the candidate's output, {baseline_model_response} the baseline's.
pairwise_quality = PairwiseMetric(
    metric="pairwise_quality",
    metric_prompt_template="""
    Compare these two responses to the prompt and determine which is better.
    
    Prompt: {prompt}
    Response A: {baseline_model_response}
    Response B: {response}
    
    Which response is more accurate, helpful, and concise?
    Return "A" if Response A is better, "B" if Response B is better, or "SAME" if equal.
    """,
    baseline_model=baseline_model,
)

eval_data_comparison = pd.DataFrame({
    "prompt": [
        "What is gradient descent?",
        "How does HTTPS work?",
        "Explain database indexing.",
    ],
})

eval_task = EvalTask(
    dataset=eval_data_comparison,
    metrics=[pairwise_quality],
)

# evaluate() generates the candidate responses; the baseline responses
# come from the baseline_model configured on the metric above.
result = eval_task.evaluate(model=candidate_model)

win_rate = result.summary_metrics.get("pairwise_quality/candidate_model_win_rate", 0)
print(f"Candidate win rate: {win_rate:.1%}")

if win_rate < 0.5:
    raise ValueError(f"Candidate model underperforms baseline (win rate: {win_rate:.1%})")
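A win rate computed over three prompts is very noisy, and ties complicate the picture. The sketch below shows one way to aggregate raw judge verdicts ("A" for baseline, "B" for candidate, "SAME" for a tie, matching the prompt template above) and to refuse promotion when the sample is too small. The function names and the default `min_samples=30` are assumptions for illustration, not values the service prescribes.

```python
from collections import Counter


def win_rates(verdicts: list[str]) -> dict[str, float]:
    """Aggregate 'A' / 'B' / 'SAME' verdicts into baseline, candidate, and tie rates."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(v.strip().upper() for v in verdicts)
    unknown = set(counts) - {"A", "B", "SAME"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = len(verdicts)
    return {
        "baseline": counts["A"] / total,
        "candidate": counts["B"] / total,
        "tie": counts["SAME"] / total,
    }


def should_promote(verdicts: list[str], min_samples: int = 30, min_win_rate: float = 0.5) -> bool:
    """Promote only with enough samples and a candidate that beats the baseline."""
    if len(verdicts) < min_samples:
        return False
    rates = win_rates(verdicts)
    return rates["candidate"] >= min_win_rate and rates["candidate"] > rates["baseline"]


print(win_rates(["A", "B", "B", "SAME"]))
# {'baseline': 0.25, 'candidate': 0.5, 'tie': 0.25}
```

Counting ties in the denominator keeps the gate conservative: a candidate that merely matches the baseline on most prompts does not clear a 50% win rate.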

Evaluation Pipelines with Kubeflow

For repeatable, versioned evaluation runs, wrap evaluations in a Vertex AI Pipeline:

from kfp import dsl
from kfp import compiler  # the kfp.v2 namespace is deprecated in KFP SDK 2.x
from google.cloud.aiplatform import PipelineJob

@dsl.component(
    base_image="python:3.11",
    # gcsfs lets pandas read gs:// URIs directly
    packages_to_install=["google-cloud-aiplatform[evaluation]>=1.38", "pandas", "gcsfs"],
)
def evaluate_model(
    project: str,
    location: str,
    model_id: str,
    dataset_gcs_uri: str,
    metrics_output: dsl.Output[dsl.Metrics],
) -> float:
    import vertexai
    import pandas as pd
    from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

    vertexai.init(project=project, location=location)

    df = pd.read_json(dataset_gcs_uri, lines=True)

    from vertexai.generative_models import GenerativeModel
    model = GenerativeModel(model_id)

    eval_task = EvalTask(
        dataset=df,
        metrics=[
            "rouge_l_sum",
            MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        ],
    )

    result = eval_task.evaluate(model=model)
    score = result.summary_metrics.get("groundedness/mean", 0.0)

    metrics_output.log_metric("groundedness", score)
    metrics_output.log_metric("rouge_l_sum", result.summary_metrics.get("rouge_l_sum/mean", 0.0))

    return score


@dsl.component(base_image="python:3.11")
def check_quality_gate(groundedness_score: float, threshold: float = 0.85) -> bool:
    if groundedness_score < threshold:
        raise ValueError(
            f"Quality gate failed: groundedness {groundedness_score:.3f} < {threshold}"
        )
    print(f"Quality gate passed: {groundedness_score:.3f} >= {threshold}")
    return True


@dsl.pipeline(name="model-evaluation-pipeline")
def eval_pipeline(
    project: str,
    location: str,
    model_id: str,
    dataset_gcs_uri: str,
):
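    eval_step = evaluate_model(
        project=project,
        location=location,
        model_id=model_id,
        dataset_gcs_uri=dataset_gcs_uri,
    )

    # evaluate_model has two outputs (the metrics artifact and the float return
    # value), so the return value is referenced by name rather than via .output.
    check_quality_gate(
        groundedness_score=eval_step.outputs["Output"],
        threshold=0.85,
    )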


compiler.Compiler().compile(eval_pipeline, "eval_pipeline.json")

Submit the compiled pipeline:

job = PipelineJob(
    display_name="model-eval-run",
    template_path="eval_pipeline.json",
    parameter_values={
        "project": "my-gcp-project",
        "location": "us-central1",
        "model_id": "gemini-1.5-flash-001",
        "dataset_gcs_uri": "gs://my-bucket/eval-datasets/qa-v3.jsonl",
    },
    # Where the pipeline job itself runs
    project="my-gcp-project",
    location="us-central1",
)
job.run(sync=True)
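The pipeline component reads its dataset with `pd.read_json(..., lines=True)`, so the file at `dataset_gcs_uri` needs to be JSON Lines: one JSON object per line. Below is a minimal sketch of writing such a file locally before uploading it. The helper name is hypothetical, and the required `prompt` and `reference` keys are an assumption based on the dataset used in the basic evaluation earlier.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"prompt", "reference"}  # assumed columns for this example


def write_jsonl(records: list[dict], path: str) -> int:
    """Validate records, then write them as one JSON object per line."""
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


if __name__ == "__main__":
    count = write_jsonl(
        [{"prompt": "What is the primary purpose of a load balancer?",
          "reference": "A load balancer distributes incoming network traffic across multiple servers."}],
        "qa-latest.jsonl",
    )
    print(f"wrote {count} records")
```

Validating every record before opening the file means a bad record never leaves a half-written dataset behind.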

Model Monitoring for Skew and Drift

Once your model is deployed behind a Vertex AI Endpoint, enable monitoring to catch distribution shifts:

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# After deploying your endpoint, add a monitoring job.
# endpoint_resource_name must hold the resource name of that endpoint.
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"output_tokens": 0.3},
)
objective_config = model_monitoring.ObjectiveConfig(
    drift_detection_config=drift_config,
)

monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="gemini-production-monitor",
    endpoint=endpoint_resource_name,
    # Log and analyze 10% of prediction requests
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.1),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # Hourly
    objective_configs=objective_config,
    stats_anomalies_base_directory="gs://my-bucket/monitoring/",
)

Set up an alert for when drift exceeds the threshold. A sudden shift in output length or token distribution can indicate a prompt injection attack or a change in user behavior that your model isn't handling well.
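To build intuition for what a drift threshold such as 0.3 means, the sketch below compares two histograms of output lengths using Jensen-Shannon divergence, a common distance between distributions. It is an illustration under stated assumptions, not the monitoring service's own computation: the function names, the bucket counts, and the choice of base-2 logarithm (which bounds the result between 0 and 1) are all hypothetical.

```python
import math


def _normalize(counts: list[float]) -> list[float]:
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one observation")
    return [c / total for c in counts]


def _kl(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(baseline_counts: list[float], current_counts: list[float]) -> float:
    """Jensen-Shannon divergence between two histograms with identical buckets."""
    if len(baseline_counts) != len(current_counts):
        raise ValueError("histograms must use the same buckets")
    p = _normalize(baseline_counts)
    q = _normalize(current_counts)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, mixture) + 0.5 * _kl(q, mixture)


# Hypothetical output-length buckets: short, medium, long, very long
baseline = [50, 30, 15, 5]
current = [10, 20, 30, 40]
drift = js_divergence(baseline, current)
print(f"drift={drift:.3f}, alert={drift > 0.3}")
```

Identical histograms give a divergence of 0 and completely disjoint ones give 1, so a threshold is a point on that scale at which you decide the shift is worth a human look.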

CI Integration

# .github/workflows/vertex-eval.yml
name: Vertex AI Model Evaluation

on:
  push:
    paths:
      - "prompts/**"
      - "src/ai/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v4

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - run: pip install google-cloud-aiplatform kfp pandas

      - name: Upload eval dataset
        run: |
          gsutil cp eval-datasets/qa-latest.jsonl \
            gs://${{ vars.EVAL_BUCKET }}/eval-datasets/qa-latest.jsonl

      - name: Run evaluation pipeline
        run: python scripts/run_vertex_eval.py
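The workflow calls `scripts/run_vertex_eval.py`, which is not shown here. One piece such a script would likely need is a way to assemble the pipeline's `parameter_values` from the CI environment. The sketch below is a guess at that piece only: the environment variable names are hypothetical, and the parameter names are taken from the pipeline definition above.

```python
import os
from typing import Mapping

# Pipeline parameter name -> hypothetical environment variable that supplies it
PARAM_ENV_VARS = {
    "project": "GCP_PROJECT",
    "location": "GCP_LOCATION",
    "model_id": "EVAL_MODEL_ID",
    "dataset_gcs_uri": "EVAL_DATASET_URI",
}


def build_parameter_values(env: Mapping[str, str]) -> dict[str, str]:
    """Map environment variables to pipeline parameters, failing fast on gaps."""
    missing = [var for var in PARAM_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {param: env[var] for param, var in PARAM_ENV_VARS.items()}


if __name__ == "__main__":
    try:
        print(build_parameter_values(os.environ))
    except KeyError as error:
        print(f"configuration error: {error}")
```

The returned dictionary can be passed as `parameter_values` when constructing the `PipelineJob`. Failing before submission keeps a misconfigured workflow from launching a pipeline run that would only fail later.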

The Vertex AI Evaluation Service is well suited to teams already on GCP: it needs no additional infrastructure, billing stays within your existing project, and results integrate directly with Vertex AI Experiments for tracking scores over time.
