Testing LLM Requests with Helicone: Logging, Rate Limits, and Prompt Versioning

Testing LLM Requests with Helicone: Logging, Rate Limits, and Prompt Versioning

Every LLM API call your application makes is a potential point of failure, a cost you're incurring, and a data point you could be learning from. Most teams log almost none of this. Helicone is a proxy layer that sits between your application and LLM providers — capturing everything, enforcing limits, and giving you the tools to test how your prompts behave under different conditions.

This guide covers the practical side: how to instrument your application with Helicone, how to use it to test request behavior under rate limits, and how to version and test prompts systematically.

What Helicone Does

Helicone is an LLM observability proxy. You route your OpenAI (or Anthropic, Together AI, etc.) requests through Helicone's URL, and it logs every request and response without changing your application code. Beyond logging, it provides:

  • Request logging — full prompts, completions, latency, tokens, cost
  • Rate limiting — set per-user or per-tenant request limits
  • Caching — return cached responses for identical prompts (useful in testing)
  • Prompt management — version and test prompts from a central registry
  • Custom metadata — attach user IDs, session IDs, feature flags to every request
  • Alerts — notify when cost or error rate thresholds are exceeded

The proxy approach means you get all of this with minimal code changes.

Setup: One-Line Integration

For OpenAI:

from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key"
    }
)

# Your existing code is unchanged
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain rate limiting"}]
)

For Anthropic:

import anthropic

client = anthropic.Anthropic(
    api_key="your-anthropic-key",
    base_url="https://anthropic.helicone.ai/",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key"
    }
)

That's the full integration for basic logging. Every request now appears in your Helicone dashboard with latency, token counts, cost, and the full prompt/completion.

Attaching Metadata for Better Observability

Raw request logs are useful. Request logs tagged with your application's context are much more useful. Helicone lets you attach custom properties to every request via headers:

from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key"
    }
)

def generate_response(user_id: str, feature: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={
            "Helicone-User-Id": user_id,
            "Helicone-Property-Feature": feature,
            "Helicone-Property-Environment": "production",
            "Helicone-Property-Version": "v2.3"
        }
    )
    return response.choices[0].message.content

Now in Helicone you can filter requests by user, by feature, by version. When a user reports a problem, you search for their user ID and see every LLM request they triggered, with full prompts and responses.

For testing, the version property is particularly useful — you can tag requests made during a test run and filter them out of your production metrics.

Testing Rate Limit Behavior

Rate limits in LLM applications are a source of subtle bugs. What happens when a user hits their request limit? Does your application fail gracefully? Does it surface a clear error message, or does it crash in a way that confuses users?

Helicone lets you set rate limits per user and test the behavior programmatically.

Configuring Rate Limits

Set a rate limit via Helicone headers:

def make_limited_request(user_id: str, prompt: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={
            "Helicone-Auth": "Bearer your-helicone-key",
            "Helicone-User-Id": user_id,
            "Helicone-RateLimit-Policy": "10;w=60;u=requests;s=user"
            # 10 requests per 60 seconds per user
        }
    )
    return response

Testing Rate Limit Handling

import pytest
import time
from openai import RateLimitError

def test_rate_limit_graceful_handling():
    """Verify the application handles rate limits without crashing."""
    test_user = "test-user-rate-limit-001"
    
    # Set a very tight rate limit for this test user
    responses = []
    errors = []
    
    for i in range(15):  # Exceed the 10 req/min limit
        try:
            result = make_limited_request(
                user_id=test_user,
                prompt=f"Request number {i}: what is 2+2?"
            )
            responses.append(result)
        except Exception as e:
            errors.append({
                "request": i,
                "error_type": type(e).__name__,
                "message": str(e)
            })
    
    # We should have gotten some successful responses
    assert len(responses) >= 5, "Should have succeeded on early requests"
    
    # And some rate limit errors on later ones
    assert len(errors) > 0, "Should have hit rate limit"
    
    # But errors should be rate limit errors, not other exceptions
    rate_limit_errors = [e for e in errors if "429" in e["message"] or 
                         "rate" in e["message"].lower()]
    assert len(rate_limit_errors) == len(errors), \
        "All errors should be rate limit errors, got unexpected exceptions"

def test_rate_limit_recovery():
    """Verify the application recovers after the rate limit window expires."""
    test_user = "test-user-recovery-001"
    
    # Exhaust the rate limit
    for i in range(12):
        try:
            make_limited_request(user_id=test_user, prompt="test")
        except:
            pass
    
    # Wait for the window to reset (60 seconds in production, shorter in test env)
    time.sleep(61)
    
    # Should succeed again after window reset
    try:
        result = make_limited_request(user_id=test_user, prompt="What is 2+2?")
        assert result is not None
    except Exception as e:
        pytest.fail(f"Should have recovered from rate limit, got: {e}")

Caching in Testing: Fast and Deterministic Runs

One of Helicone's most useful testing features is caching. In a test environment, you often want consistent, deterministic responses without incurring API costs on every run. Helicone's cache returns the same response for identical prompts.

def make_cached_request(prompt: str, use_cache: bool = True) -> str:
    cache_header = "true" if use_cache else "false"
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        extra_headers={
            "Helicone-Auth": "Bearer your-helicone-key",
            "Helicone-Cache-Enabled": cache_header,
            "Helicone-Cache-Max-Age": "3600"  # Cache for 1 hour
        }
    )
    return response.choices[0].message.content

In your test suite:

def test_classification_accuracy():
    """Test classification accuracy using cached responses for consistency."""
    test_cases = [
        ("Invoice #4521 for $1,200 due June 30", "invoice"),
        ("Please review this legal document", "legal"),
        ("Q3 revenue increased 12% YoY", "financial-report"),
    ]
    
    correct = 0
    for input_text, expected_label in test_cases:
        # Cache ensures same input → same output across test runs
        result = make_cached_request(
            f"Classify this document into one of: invoice, legal, financial-report, other.\n\n{input_text}"
        )
        
        if expected_label in result.lower():
            correct += 1
    
    accuracy = correct / len(test_cases)
    assert accuracy >= 0.9, f"Classification accuracy {accuracy:.1%} below 90% threshold"

The cache makes test runs idempotent — you're not paying for API calls on reruns, and you're not getting random failures from model non-determinism. For integration tests that need to run fast in CI, this is significant.

Prompt Versioning and Testing

Helicone's prompt management feature lets you store prompts centrally, version them, and use specific versions in different environments. This is more important than it sounds.

Without centralized prompt management, prompts live in code strings scattered across your codebase. When something breaks, you don't know which version of which prompt was running. When you want to A/B test prompt variations, you need to deploy code.

With Helicone prompts:

from helicone.prompts import helicone_prompt

@helicone_prompt("document-classifier", version=3)
def classify_document(document_text: str) -> str:
    # The actual prompt is stored in Helicone, fetched at runtime
    # version=3 pins to a specific version
    pass

Or fetch the prompt explicitly:

import requests

def get_versioned_prompt(prompt_name: str, version: int) -> str:
    response = requests.get(
        f"https://api.helicone.ai/v1/prompt/{prompt_name}/version/{version}",
        headers={"Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}"}
    )
    return response.json()["prompt"]

def classify_document(text: str, prompt_version: int = None) -> str:
    prompt_template = get_versioned_prompt("document-classifier", prompt_version or "latest")
    
    full_prompt = prompt_template.replace("{{document}}", text)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": full_prompt}],
        extra_headers={
            "Helicone-Auth": "Bearer your-helicone-key",
            "Helicone-Property-PromptVersion": str(prompt_version)
        }
    )
    return response.choices[0].message.content

Testing Prompt Versions Against Each Other

def test_prompt_version_regression():
    """Ensure new prompt version doesn't regress on established test cases."""
    test_cases = [
        {"input": "Invoice #4521", "expected": "invoice"},
        {"input": "Legal contract attached", "expected": "legal"},
    ]
    
    results = {}
    
    for version in [2, 3]:  # Compare v2 vs v3
        scores = []
        for case in test_cases:
            result = classify_document(case["input"], prompt_version=version)
            scores.append(1 if case["expected"] in result.lower() else 0)
        
        results[f"v{version}"] = sum(scores) / len(scores)
    
    print(f"Prompt v2 accuracy: {results['v2']:.1%}")
    print(f"Prompt v3 accuracy: {results['v3']:.1%}")
    
    # New version should not regress from previous
    assert results["v3"] >= results["v2"], \
        f"Prompt v3 ({results['v3']:.1%}) regressed from v2 ({results['v2']:.1%})"

Cost and Error Rate Monitoring

Helicone tracks cost and error rates automatically. You can query these via the API in your tests to catch cost regressions:

def get_feature_cost(feature_name: str, hours: int = 24) -> float:
    """Get the total LLM cost for a feature over the last N hours."""
    response = requests.get(
        "https://api.helicone.ai/v1/request",
        params={
            "filter": f'properties.feature = "{feature_name}"',
            "created_at_gte": (datetime.now() - timedelta(hours=hours)).isoformat()
        },
        headers={"Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}"}
    )
    
    requests_data = response.json()["data"]
    total_cost = sum(r.get("cost_usd", 0) for r in requests_data)
    return total_cost

def test_cost_bounds():
    """After a test run, verify costs stayed within bounds."""
    # Run the feature
    for _ in range(100):
        classify_document("Sample document text")
    
    # Check what it cost
    cost = get_feature_cost("document-classifier", hours=1)
    
    # 100 requests should cost under $0.50 for this model/prompt length
    assert cost < 0.50, \
        f"100 classification requests cost ${cost:.4f}, exceeding $0.50 budget"

Combining Helicone with End-to-End Tests

Helicone gives you visibility into the LLM request layer. For complete coverage of your application, pair it with end-to-end tests that exercise the full user flow.

HelpMeTest provides Robot Framework + Playwright automation that tests your application the way users interact with it. Where Helicone tells you "this prompt returned a 429 error," a HelpMeTest scenario verifies that "the user saw a helpful error message and wasn't left staring at a spinner."

The combination is particularly useful for rate limit testing: Helicone confirms the limit was hit at the infrastructure level, while the E2E test confirms the application handled it gracefully at the user interface level.

Useful Helicone Headers Reference

Header Purpose Example Value
Helicone-Auth API authentication Bearer hc-...
Helicone-User-Id Tag requests by user user-12345
Helicone-Property-* Custom metadata Feature: classifier
Helicone-RateLimit-Policy Enforce rate limits 10;w=60;u=requests;s=user
Helicone-Cache-Enabled Enable response cache true
Helicone-Cache-Max-Age Cache TTL in seconds 3600
Helicone-Retry-Enabled Auto-retry on failure true
Helicone-Retry-Max Max retry attempts 3

Summary

Helicone occupies a specific and valuable niche in the LLM testing stack: it gives you complete visibility into every API request your application makes, with zero changes to your LLM-calling code. Rate limit testing, prompt versioning, cost monitoring, and caching for deterministic test runs are all built into the proxy.

For teams building production LLM applications, the question isn't whether to add observability — it's where in the stack to put it. Helicone handles the request layer. Pair it with a dataset evaluation platform for model-level testing, and with end-to-end tests for application-level verification, and you have coverage from the prompt to the user interface.

Read more