Vertex AI Model Testing: Evaluation Pipelines and Model Monitoring
Vertex AI provides a dedicated Evaluation Service for scoring generative model outputs — no third-party judge needed. Combined with Model Monitoring for skew and drift detection, you can build a complete quality assurance loop that runs on every deployment.
Vertex AI Evaluation Service Overview
The Evaluation Service scores model responses using:
- Pointwise metrics — Score a single response (fluency, safety, groundedness, instruction-following)
- Pairwise metrics — Compare two responses and pick the better one
- Computation-based metrics — BLEU, ROUGE, exact match (no LLM judge required)
You submit a dataset, specify metrics, and get back per-sample and aggregate scores. The judge is Gemini — no external API key required if you're already on GCP.
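To make the computation-based category concrete, here is a minimal pure-Python sketch of exact match and a sentence-level ROUGE-L F-score. It illustrates what this kind of metric measures and is not the service's implementation: the whitespace tokenizer, the lowercasing, and the function names `exact_match` and `rouge_l` are assumptions, so the numbers will not match what Vertex AI reports.

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 if the strings are identical after trimming whitespace, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def _lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(response: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over lowercase, whitespace-split tokens."""
    resp, ref = response.lower().split(), reference.lower().split()
    if not resp or not ref:
        return 0.0
    lcs = _lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(resp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both functions are deterministic and need no model call, which is why this category is cheap enough to run on every commit. The trade-off is that they reward surface overlap only: a correct paraphrase can score poorly.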
Setup

Install the SDK with its evaluation extra (the vertexai.evaluation module depends on it), then initialize the client in Python:

pip install "google-cloud-aiplatform[evaluation]" pandas

import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel
import pandas as pd

vertexai.init(project="my-gcp-project", location="us-central1")

Running a Basic Evaluation
# Define your evaluation dataset
eval_data = pd.DataFrame({
    "prompt": [
        "Summarize the following: The Eiffel Tower was built in 1889 for the World's Fair.",
        "What is the primary purpose of a load balancer?",
        "Explain the difference between TCP and UDP.",
    ],
    "reference": [
        "The Eiffel Tower was constructed in 1889 for the World's Fair.",
        "A load balancer distributes incoming network traffic across multiple servers.",
        "TCP provides reliable, ordered delivery; UDP is faster but does not guarantee delivery.",
    ],
})

# Configure the model under test
model = GenerativeModel("gemini-1.5-flash-001")

# Create and run the evaluation task
eval_task = EvalTask(
    dataset=eval_data,
    metrics=[
        "rouge_l_sum",
        "bleu",
        MetricPromptTemplateExamples.Pointwise.FLUENCY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    ],
    experiment="my-model-eval",
)
result = eval_task.evaluate(model=model)
print(result.summary_metrics)
# {'rouge_l_sum/mean': 0.82, 'bleu/mean': 0.61, 'fluency/mean': 4.3, 'groundedness/mean': 0.91}

Custom Pointwise Metrics
Define domain-specific evaluation criteria using natural language:

from vertexai.evaluation import PointwiseMetric

# Custom metric: does the response avoid jargon?
plain_language_metric = PointwiseMetric(
    metric="plain_language",
    metric_prompt_template="""
Evaluate whether the response uses plain, accessible language appropriate for a non-technical audience.

Response: {response}

Rate on a scale of 1-5:
1 = Heavy technical jargon, incomprehensible to non-experts
3 = Mixed: some jargon but mostly accessible
5 = Clear, plain language with no unexplained technical terms

Return only the numeric score.
""",
)

# Custom metric: does the response stay on topic?
on_topic_metric = PointwiseMetric(
    metric="on_topic",
    metric_prompt_template="""
Given the prompt and response, evaluate whether the response directly addresses what was asked.

Prompt: {prompt}
Response: {response}

Rate on a scale of 1-5:
1 = Completely off-topic
3 = Partially addresses the question
5 = Directly and completely addresses the question

Return only the numeric score.
""",
)

eval_task = EvalTask(
    dataset=eval_data,
    metrics=[plain_language_metric, on_topic_metric],
)
result = eval_task.evaluate(model=model)
print(result.summary_metrics)

Pairwise Model Comparison
Compare two model versions to decide which performs better before promoting:

from vertexai.evaluation import PairwiseMetric, EvalTask

# baseline: current production model
# candidate: new model you want to promote
baseline_model = GenerativeModel("gemini-1.0-pro-001")
candidate_model = GenerativeModel("gemini-1.5-flash-001")

# Define pairwise comparison. The baseline model is attached to the metric itself;
# the candidate's output fills the {response} placeholder.
pairwise_quality = PairwiseMetric(
    metric="pairwise_quality",
    metric_prompt_template="""
Compare these two responses to the prompt and determine which is better.

Prompt: {prompt}
Response A: {baseline_model_response}
Response B: {response}

Which response is more accurate, helpful, and concise?
Return "A" if Response A is better, "B" if Response B is better, or "SAME" if equal.
""",
    baseline_model=baseline_model,
)

eval_data_comparison = pd.DataFrame({
    "prompt": [
        "What is gradient descent?",
        "How does HTTPS work?",
        "Explain database indexing.",
    ],
})

eval_task = EvalTask(
    dataset=eval_data_comparison,
    metrics=[pairwise_quality],
)

# The candidate is the model under test; the baseline comes from the metric.
result = eval_task.evaluate(model=candidate_model)

win_rate = result.summary_metrics.get("pairwise_quality/candidate_model_win_rate", 0)
print(f"Candidate win rate: {win_rate:.1%}")
if win_rate < 0.5:
    raise ValueError(f"Candidate model underperforms baseline (win rate: {win_rate:.1%})")

Evaluation Pipelines with Kubeflow
For repeatable, versioned evaluation runs, wrap evaluations in a Vertex AI Pipeline:

from kfp import compiler, dsl
from google.cloud.aiplatform import PipelineJob


@dsl.component(
    base_image="python:3.11",
    # The [evaluation] extra provides vertexai.evaluation; gcsfs lets pandas read gs:// URIs.
    packages_to_install=["google-cloud-aiplatform[evaluation]", "pandas", "gcsfs"],
)
def evaluate_model(
    project: str,
    location: str,
    model_id: str,
    dataset_gcs_uri: str,
    metrics_output: dsl.Output[dsl.Metrics],
) -> float:
    # Imports live inside the function because the component runs in its own container.
    import pandas as pd
    import vertexai
    from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
    from vertexai.generative_models import GenerativeModel

    vertexai.init(project=project, location=location)

    # Load the JSONL evaluation dataset from Cloud Storage
    df = pd.read_json(dataset_gcs_uri, lines=True)
    model = GenerativeModel(model_id)

    eval_task = EvalTask(
        dataset=df,
        metrics=[
            "rouge_l_sum",
            MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        ],
    )
    result = eval_task.evaluate(model=model)

    # Log both scores to the pipeline UI and return groundedness for the gate
    score = result.summary_metrics.get("groundedness/mean", 0.0)
    metrics_output.log_metric("groundedness", score)
    metrics_output.log_metric("rouge_l_sum", result.summary_metrics.get("rouge_l_sum/mean", 0.0))
    return score


@dsl.component(base_image="python:3.11")
def check_quality_gate(groundedness_score: float, threshold: float = 0.85) -> bool:
    # Raising fails the pipeline step, which blocks any downstream promotion
    if groundedness_score < threshold:
        raise ValueError(
            f"Quality gate failed: groundedness {groundedness_score:.3f} < {threshold}"
        )
    print(f"Quality gate passed: {groundedness_score:.3f} >= {threshold}")
    return True


@dsl.pipeline(name="model-evaluation-pipeline")
def eval_pipeline(
    project: str,
    location: str,
    model_id: str,
    dataset_gcs_uri: str,
):
    eval_task = evaluate_model(
        project=project,
        location=location,
        model_id=model_id,
        dataset_gcs_uri=dataset_gcs_uri,
    )
    # The component has two outputs (the metrics artifact and the return value),
    # so the return value must be referenced by name rather than via .output.
    check_quality_gate(
        groundedness_score=eval_task.outputs["Output"],
        threshold=0.85,
    )


compiler.Compiler().compile(eval_pipeline, "eval_pipeline.json")

Submit the compiled pipeline:
job = PipelineJob(
    display_name="model-eval-run",
    template_path="eval_pipeline.json",
    parameter_values={
        "project": "my-gcp-project",
        "location": "us-central1",
        "model_id": "gemini-1.5-flash-001",
        "dataset_gcs_uri": "gs://my-bucket/eval-datasets/qa-v3.jsonl",
    },
)
job.run(sync=True)

Model Monitoring for Skew and Drift
Once your model is deployed behind a Vertex AI Endpoint, enable monitoring to catch distribution shifts:

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# After deploying your endpoint, add a monitoring job.
# endpoint_resource_name is the full resource name of your deployed endpoint.
# Keys in drift_thresholds must match feature names the endpoint actually logs;
# "output_tokens" is a placeholder for such a feature.
objective_config = model_monitoring.ObjectiveConfig(
    drift_detection_config=model_monitoring.DriftDetectionConfig(
        drift_thresholds={"output_tokens": 0.3},
    ),
)

monitoring_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="gemini-production-monitor",
    endpoint=endpoint_resource_name,
    # Sample 10% of prediction requests for analysis
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.1),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # Hourly
    # Write detected anomalies to Cloud Logging
    alert_config=model_monitoring.EmailAlertConfig(enable_logging=True),
    objective_configs=objective_config,
    stats_anomalies_base_directory="gs://my-bucket/monitoring/",
)

Set up an alert when drift exceeds the threshold. A sudden shift in output length or token distribution can indicate a prompt injection attack or a change in user behavior that your model isn't handling well.
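For intuition about what a threshold such as 0.3 means: Vertex AI's monitoring documentation describes the drift score for numerical features as a Jensen-Shannon divergence between two distributions. The sketch below computes that divergence (base 2, so it ranges from 0 to 1) between a baseline and a current histogram of output lengths. The bucket edges, the sample values, and the helper names are made up for illustration; the service computes its own statistics server-side and may bucket data differently.

```python
import math


def histogram(values, edges):
    """Normalized histogram; bucket i holds values with edges[i] <= v < edges[i + 1]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bucket edges")
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is a mixture, so b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical output lengths (in tokens) from two time windows
edges = [0, 50, 100, 200, 400]
baseline_lengths = [40, 60, 70, 80, 90, 120, 150, 45]
current_lengths = [210, 250, 300, 390, 180, 220, 350, 90]

drift = js_divergence(histogram(baseline_lengths, edges), histogram(current_lengths, edges))
drift_detected = drift > 0.3  # same threshold as in the monitoring config
```

In this made-up data most responses move from the 50-100 bucket to the 200-400 bucket, which pushes the score well past 0.3. A threshold that low flags moderate shifts, so expect to tune it against a few weeks of normal traffic.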
CI Integration

# .github/workflows/vertex-eval.yml
name: Vertex AI Model Evaluation
on:
  push:
    paths:
      - "prompts/**"
      - "src/ai/**"
jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install google-cloud-aiplatform kfp pandas
      - name: Upload eval dataset
        run: |
          gsutil cp eval-datasets/qa-latest.jsonl \
            gs://${{ vars.EVAL_BUCKET }}/eval-datasets/qa-latest.jsonl
      - name: Run evaluation pipeline
        run: python scripts/run_vertex_eval.py

The Vertex AI Evaluation Service is well-suited for teams already on GCP: there is no additional infrastructure, billing stays within your existing project, and results integrate directly with Vertex Experiments for tracking scores over time.
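The workflow calls scripts/run_vertex_eval.py, which this article does not show. One way such a script might gate the build, assuming it has already obtained a summary-metrics dictionary like the ones printed earlier, is sketched below. The threshold values and the helpers `find_failures` and `main` are hypothetical; only the metric key names come from the examples above.

```python
import sys

# Hypothetical minimum scores; tune these against your own baseline runs.
THRESHOLDS = {"groundedness/mean": 0.85, "rouge_l_sum/mean": 0.5}


def find_failures(summary_metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = summary_metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from results")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum}")
    return failures


def main(summary_metrics: dict) -> int:
    """Report failures on stderr and return an exit code (0 = pass, 1 = fail)."""
    failures = find_failures(summary_metrics, THRESHOLDS)
    for message in failures:
        print(f"Quality gate failed: {message}", file=sys.stderr)
    return 1 if failures else 0


# In the real script, after running the evaluation:
#     sys.exit(main(result.summary_metrics))
```

Treating a missing metric as a failure is a deliberate choice here: a renamed or dropped metric should break the build rather than silently pass the gate.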