Section 5.3: Evaluation Metrics

You cannot improve what you cannot measure. Guardrail evaluation metrics tell you how well your defenses are working — not in vague terms like “pretty good” or “seems fine,” but in precise, quantifiable numbers that you can track over time, compare across configurations, and use to justify engineering investment.

The challenge with guardrail metrics is that there are many things to measure, and they often pull in opposite directions. A guardrail that catches every possible attack will also block legitimate users. A guardrail that never bothers legitimate users will miss attacks. The art is in finding the right balance for your risk profile — and metrics are how you navigate that tradeoff.

The Confusion Matrix for Guardrails

Every guardrail decision falls into one of four categories. Understanding these categories is the foundation for every metric that follows.

[Figure: Confusion matrix showing true positives, false positives, true negatives, and false negatives]

In guardrail terms:

| Outcome | Guardrail Says | Reality | What Happened |
|---|---|---|---|
| True Positive (TP) | Block | Actually harmful | Correct block: guardrail caught a real threat |
| False Positive (FP) | Block | Actually safe | Incorrect block: guardrail blocked a legitimate user |
| True Negative (TN) | Allow | Actually safe | Correct allow: guardrail let a safe request through |
| False Negative (FN) | Allow | Actually harmful | Missed threat: harmful content got through |

These four outcomes are the raw material for every classification metric. The relative cost of each outcome depends entirely on your use case — but in most guardrail contexts, false negatives are more dangerous than false positives, because a missed attack can cause real harm while a blocked legitimate request only causes inconvenience.

Precision

Precision answers the question: Of all the inputs the guardrail blocked, how many were actually harmful?

Precision = TP / (TP + FP)

High precision means the guardrail rarely blocks legitimate content. Low precision means the guardrail is trigger-happy — it blocks a lot of content that should have been allowed.

Concrete example: Your injection detector blocked 100 inputs today. If 92 of those were actual injection attempts (TP = 92) and 8 were legitimate questions (FP = 8), your precision is 92 / (92 + 8) = 0.92 (92%).

This means when your guardrail says “this is an attack,” it is right 92% of the time. The other 8% are frustrated users whose legitimate questions were incorrectly blocked.

Recall

Recall answers the question: Of all the inputs that were actually harmful, how many did the guardrail catch?

Recall = TP / (TP + FN)

High recall means the guardrail catches most threats. Low recall means threats are slipping through.

Concrete example: There were actually 118 injection attempts today. Your detector caught 92 of them (TP = 92) and missed 26 (FN = 26). Your recall is 92 / (92 + 26) = 0.78 (78%).

This means your guardrail catches 78% of real attacks. The other 22% get through undetected — and any one of them could lead to a harmful output, data leak, or system compromise.

F1 Score

F1 score is the harmonic mean of precision and recall — a single number that balances both concerns:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 is useful as a summary metric but can obscure important tradeoffs. Two guardrails can have the same F1 score but very different precision-recall profiles. Always look at precision and recall individually before relying on F1.
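As a quick check of the point about identical F1 scores, the sketch below computes F1 for the injection detector from the examples above (precision 0.92, recall 0.78) and for a hypothetical detector with the two values swapped. The function name `f1` is illustrative and not part of any library.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Detector from the running example: precise, but misses more attacks.
precise_detector = f1(precision=0.92, recall=0.78)

# Hypothetical detector with the profile reversed: catches more, blocks more.
aggressive_detector = f1(precision=0.78, recall=0.92)

print(f"{precise_detector:.3f}")
print(f"{aggressive_detector:.3f}")
```

Both detectors score roughly 0.844, yet one lets 22% of attacks through while the other blocks far more legitimate traffic. F1 alone cannot tell them apart.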

Computing These Metrics in Practice

from dataclasses import dataclass


@dataclass
class GuardrailMetrics:
    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        denominator = self.true_positives + self.false_positives
        return self.true_positives / denominator if denominator > 0 else 0.0

    @property
    def recall(self) -> float:
        denominator = self.true_positives + self.false_negatives
        return self.true_positives / denominator if denominator > 0 else 0.0

    @property
    def f1_score(self) -> float:
        p, r = self.precision, self.recall
        return 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0

    @property
    def false_positive_rate(self) -> float:
        denominator = self.false_positives + self.true_negatives
        return self.false_positives / denominator if denominator > 0 else 0.0

    @property
    def false_negative_rate(self) -> float:
        denominator = self.false_negatives + self.true_positives
        return self.false_negatives / denominator if denominator > 0 else 0.0


def evaluate_guardrail(
    guardrail_fn,
    labeled_dataset: list[dict],
) -> GuardrailMetrics:
    """Evaluate a guardrail function against a labeled dataset.

    Each item in labeled_dataset should have:
      - "input": the text to evaluate
      - "is_harmful": bool ground truth label
    """
    tp = fp = tn = fn = 0

    for item in labeled_dataset:
        result = guardrail_fn(item["input"])
        # Assumes the result exposes a `decision` enum whose value is "block" when the input is rejected.
        blocked = result.decision.value == "block"
        harmful = item["is_harmful"]

        if blocked and harmful:
            tp += 1
        elif blocked and not harmful:
            fp += 1
        elif not blocked and not harmful:
            tn += 1
        else:
            fn += 1

    return GuardrailMetrics(
        true_positives=tp,
        false_positives=fp,
        true_negatives=tn,
        false_negatives=fn,
    )

Using the evaluation:

dataset = load_labeled_test_set()
metrics = evaluate_guardrail(pipeline.evaluate, dataset)

print(f"Precision:  {metrics.precision:.3f}")
print(f"Recall:     {metrics.recall:.3f}")
print(f"F1 Score:   {metrics.f1_score:.3f}")
print(f"FP Rate:    {metrics.false_positive_rate:.3f}")
print(f"FN Rate:    {metrics.false_negative_rate:.3f}")

Worked Example: Interpreting Your Metrics

Let’s walk through a realistic scenario. Your injection detector reports:

  • Precision = 0.92 — when it blocks, it’s right 92% of the time
  • Recall = 0.78 — it catches 78% of actual injection attempts

What does this mean in practice?

If you process 10,000 requests per day and 2% are injection attempts (200 attacks):

| Metric | Calculation | Result | Meaning |
|---|---|---|---|
| True Positives | 200 × 0.78 | 156 | Attacks correctly blocked |
| False Negatives | 200 × 0.22 | 44 | Attacks that got through |
| False Positives | (156 / 0.92) × 0.08 | ~14 | Legitimate users incorrectly blocked |
| True Negatives | 9,800 − 14 | 9,786 | Legitimate users correctly served |

So every day: 44 attacks slip through your guardrails, and 14 legitimate users get blocked. Is this acceptable?

It depends on your context:

  • If this is a medical advice chatbot, 44 undetected attacks per day is a serious safety concern. You need higher recall, even if it means more false positives.
  • If this is an internal developer tool with trusted users, 14 blocked legitimate requests per day creates friction that may lead developers to circumvent the guardrail entirely. You might accept lower recall for higher precision.

Why this matters for guardrails: Raw accuracy (“our guardrail is 97% accurate!”) is meaningless without context. With 2% attack prevalence, a guardrail that blocks nothing is 98% accurate. Precision, recall, and their tradeoff tell the real story — and different risk profiles demand different positions on that tradeoff curve.
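The arithmetic in the worked example above, including the accuracy comparison, can be reproduced with a short sketch. The class and function names are hypothetical; the inputs are the figures from the scenario. False positives are derived from precision, since Precision = TP / (TP + FP) implies FP = TP × (1 − Precision) / Precision. The sketch assumes precision is greater than zero.

```python
from dataclasses import dataclass


@dataclass
class DailyOutcomes:
    true_positives: float
    false_negatives: float
    false_positives: float
    true_negatives: float

    @property
    def accuracy(self) -> float:
        total = (self.true_positives + self.false_negatives
                 + self.false_positives + self.true_negatives)
        return (self.true_positives + self.true_negatives) / total


def project_daily_outcomes(
    daily_requests: int,
    attack_prevalence: float,
    precision: float,
    recall: float,
) -> DailyOutcomes:
    """Project expected daily counts from precision, recall, and prevalence."""
    attacks = daily_requests * attack_prevalence
    legitimate = daily_requests - attacks
    tp = attacks * recall
    fn = attacks - tp
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    fp = tp * (1 - precision) / precision
    tn = legitimate - fp
    return DailyOutcomes(tp, fn, fp, tn)


outcomes = project_daily_outcomes(10_000, 0.02, precision=0.92, recall=0.78)
print(round(outcomes.true_positives), round(outcomes.false_negatives))
print(round(outcomes.false_positives), round(outcomes.true_negatives))

# A guardrail that blocks nothing: every attack becomes a false negative.
block_nothing = DailyOutcomes(0, 200, 0, 9_800)
print(f"{block_nothing.accuracy:.2%}")
```

Changing `attack_prevalence` or `daily_requests` shows how the same precision and recall translate into very different daily counts of missed attacks and blocked users.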

The Precision-Recall Tradeoff

Every classifier has a threshold — a score above which it flags content as harmful. Moving this threshold changes the precision-recall balance:

Threshold ──► Lower (more aggressive)
  • More inputs flagged as harmful
  • Recall increases (catch more real attacks)
  • Precision decreases (more false positives)
  • User experience degrades

Threshold ──► Higher (more permissive)
  • Fewer inputs flagged as harmful
  • Precision increases (fewer false positives)
  • Recall decreases (miss more real attacks)
  • Safety risk increases

Finding the right threshold is not a technical decision — it is a risk management decision that should involve product, security, and business stakeholders.

| Scenario | Threshold Strategy | Target Metrics |
|---|---|---|
| Medical chatbot | Aggressive (low threshold) | Recall > 0.95, accept precision ~0.80 |
| Financial advisor | Aggressive (low threshold) | Recall > 0.93, accept precision ~0.85 |
| Customer support | Balanced | Precision ~0.90, recall ~0.88 |
| Creative writing tool | Permissive (high threshold) | Precision > 0.95, accept recall ~0.75 |
| Internal dev tool | Permissive (high threshold) | Precision > 0.97, accept recall ~0.70 |
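To see the threshold behavior described above on concrete numbers, the sketch below sweeps a threshold over a small set of made-up classifier scores. The scores, labels, and the `precision_recall_at` helper are illustrative assumptions, not output from a real detector. An input is flagged when its score is at or above the threshold.

```python
def precision_recall_at(
    threshold: float,
    scores: list[float],
    labels: list[bool],
) -> tuple[float, float]:
    """Return (precision, recall) when inputs scoring >= threshold are blocked."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Made-up harmfulness scores with ground-truth labels (True = harmful).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [True, True, False, True, True, False, True, False, False, False]

for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.85 moves precision from about 0.71 to 1.00 while recall falls from 1.00 to 0.40. On real data precision does not always rise smoothly with the threshold, so it is worth plotting the full curve on your own labeled set before picking an operating point.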

False Positive Rate: The User Friction Metric

The false positive rate (FPR) directly measures user friction:

FPR = FP / (FP + TN)

FPR tells you what percentage of legitimate requests get incorrectly blocked. Even a small FPR creates significant friction at scale:

| FPR | At 100K legitimate requests/day | Impact |
|---|---|---|
| 0.1% | 100 users blocked | Manageable: users retry and succeed |
| 0.5% | 500 users blocked | Noticeable: support tickets increase |
| 1.0% | 1,000 users blocked | Significant: user trust erodes |
| 2.0% | 2,000 users blocked | Severe: users seek alternatives |
| 5.0% | 5,000 users blocked | Critical: guardrail credibility destroyed |

When FPR exceeds ~2%, something predictable happens: engineering teams start pushing to disable the guardrail because the user complaints outweigh the perceived security benefit. This is the paradox of overly aggressive guardrails — they get removed, leaving zero protection.

False Negative Rate: The Safety Gap Metric

The false negative rate (FNR) measures the safety gap:

FNR = FN / (FN + TP)

FNR tells you what percentage of actual attacks get through undetected. This is the critical risk metric — every false negative is a potential harm event.

Unlike false positives, false negatives are often invisible. A blocked user complains. An undetected attack may never be noticed until it causes harm. This asymmetry means FNR requires active measurement through labeled evaluation sets, red teaming, and anomaly detection.

Latency Percentiles

Guardrail latency directly impacts user experience. Report latency as percentiles, not averages — averages hide the long tail that real users experience.

| Percentile | What It Measures | Why It Matters |
|---|---|---|
| p50 (median) | Typical user experience | Baseline performance |
| p95 | Worst case for most users | Real user experience under load |
| p99 | Worst case for high-traffic periods | Tail latency that triggers timeouts |
| p99.9 | Extreme outliers | Identifies systematic issues (model cold starts, network retries) |

Target ranges for guardrail processing latency:

| Guardrail Type | p50 Target | p95 Target | p99 Target |
|---|---|---|---|
| Rule-based | < 1ms | < 5ms | < 10ms |
| ML classifier | < 30ms | < 80ms | < 150ms |
| Embedding similarity | < 20ms | < 50ms | < 100ms |
| LLM-as-judge | < 500ms | < 1500ms | < 3000ms |
| Full pipeline | < 100ms | < 300ms | < 500ms |
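A minimal sketch of percentile reporting, using the nearest-rank method on a list of latency samples. The sample values are invented for illustration and `percentile` is a hypothetical helper; production systems typically rely on their metrics backend for this calculation.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [20.0] * 98 + [900.0, 1200.0]

print(f"mean: {statistics.mean(latencies_ms):.1f} ms")
print(f"p50:  {percentile(latencies_ms, 50):.1f} ms")
print(f"p95:  {percentile(latencies_ms, 95):.1f} ms")
print(f"p99:  {percentile(latencies_ms, 99):.1f} ms")
```

With this data the mean is about 40 ms, which looks healthy, while p99 is 900 ms. The average gives no hint that one request in a hundred is slow enough to risk a timeout.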

Cost Per Evaluation

Guardrail cost is rarely measured but always relevant. As request volume scales, guardrail cost can become a significant line item.

def calculate_guardrail_cost(
    daily_requests: int,
    pipeline_cost_per_eval: float,
    monthly_fixed_costs: float,
) -> dict:
    """Calculate monthly guardrail operating costs."""
    monthly_requests = daily_requests * 30
    variable_cost = monthly_requests * pipeline_cost_per_eval
    total_cost = variable_cost + monthly_fixed_costs
    cost_per_thousand = (total_cost / monthly_requests) * 1000

    return {
        "monthly_requests": monthly_requests,
        "variable_cost": variable_cost,
        "fixed_costs": monthly_fixed_costs,
        "total_monthly": total_cost,
        "cost_per_1k_requests": cost_per_thousand,
    }

| Pipeline Configuration | Cost per Eval | At 1M requests/day | Monthly Cost |
|---|---|---|---|
| Rules only | ~$0.000001 | $1/day | ~$30 |
| Rules + ML classifier | ~$0.001 | $1,000/day | ~$30,000 |
| Rules + ML + embeddings | ~$0.002 | $2,000/day | ~$60,000 |
| Full pipeline (w/ LLM judge) | ~$0.01–0.05 | $10K–50K/day | ~$300K–1.5M |

The key cost optimization: use the layered pipeline pattern from Section 4.1 to ensure the expensive LLM-as-judge stage only runs on the small fraction of inputs that pass cheaper stages.
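A rough sketch of why layering reduces cost: if each stage only sees a fraction of the traffic, the blended cost per evaluation is the sum of each stage's cost weighted by that fraction. The per-stage costs below are approximations read off the table above, the LLM judge cost is one assumed point within its stated range, and the 5% escalation fraction is purely an assumption for illustration.

```python
def blended_cost_per_eval(stages: list[tuple[str, float, float]]) -> float:
    """Sum of cost_per_eval * fraction_of_traffic over all stages.

    Each stage is (name, cost_per_eval_usd, fraction_of_traffic_reaching_it).
    """
    return sum(cost * fraction for _name, cost, fraction in stages)


JUDGE_COST = 0.02  # assumed value within the table's $0.01-0.05 range

# Every stage runs on every request.
run_everything = [
    ("rules", 0.000001, 1.0),
    ("ml_classifier", 0.001, 1.0),
    ("embeddings", 0.001, 1.0),
    ("llm_judge", JUDGE_COST, 1.0),
]

# Same stages, but only an assumed 5% of traffic is escalated to the judge.
layered = [
    ("rules", 0.000001, 1.0),
    ("ml_classifier", 0.001, 1.0),
    ("embeddings", 0.001, 1.0),
    ("llm_judge", JUDGE_COST, 0.05),
]

monthly_requests = 1_000_000 * 30
for label, stages in (("run everything", run_everything), ("layered", layered)):
    per_eval = blended_cost_per_eval(stages)
    print(f"{label}: ${per_eval:.6f}/eval, ${per_eval * monthly_requests:,.0f}/month")
```

The escalation fraction is the number to measure in your own pipeline, since the saving scales almost directly with it.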

Coverage Metrics

Coverage measures what percentage of inputs and outputs actually pass through your guardrail system:

| Coverage Type | What It Measures | Target |
|---|---|---|
| Input coverage | % of user inputs checked before reaching the model | 100% for production systems |
| Output coverage | % of model outputs checked before reaching the user | 100% for user-facing systems |
| Endpoint coverage | % of AI endpoints protected by guardrails | 100%; unprotected endpoints are unguarded attack surface |
| Policy coverage | % of defined policies that have active guardrail enforcement | Track and increase over time |
| Attack category coverage | % of known attack categories with active detection | Inventory from Section 5.1 |

Coverage less than 100% on input or output means some traffic bypasses your guardrails entirely — through unprotected endpoints, code paths that skip the guardrail middleware, or batch processing jobs that don’t invoke the pipeline.
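One way to find such gaps is to count, from request logs, how many requests per endpoint carry evidence of a guardrail check. The log schema below (`endpoint`, `input_checked`, `output_checked`) is hypothetical; adapt it to whatever your middleware actually records.

```python
from collections import defaultdict


def coverage_by_endpoint(request_logs: list[dict], field: str) -> dict[str, float]:
    """Fraction of requests per endpoint where the given check flag is True."""
    checked: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for entry in request_logs:
        total[entry["endpoint"]] += 1
        if entry.get(field, False):
            checked[entry["endpoint"]] += 1
    return {endpoint: checked[endpoint] / total[endpoint] for endpoint in total}


# Hypothetical log entries; the batch endpoint sometimes skips the guardrail.
logs = [
    {"endpoint": "/chat", "input_checked": True, "output_checked": True},
    {"endpoint": "/chat", "input_checked": True, "output_checked": True},
    {"endpoint": "/batch", "input_checked": False, "output_checked": False},
    {"endpoint": "/batch", "input_checked": True, "output_checked": False},
]

input_coverage = coverage_by_endpoint(logs, "input_checked")
gaps = [endpoint for endpoint, cov in input_coverage.items() if cov < 1.0]
print(input_coverage)
print(gaps)
```

Breaking coverage down by endpoint, rather than reporting a single global percentage, points directly at the code path that bypasses the pipeline.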

User Satisfaction and Indirect Metrics

Not every guardrail metric comes from the guardrail system itself. Indirect metrics from user behavior reveal how guardrails affect the product experience:

| Indirect Metric | What It Indicates | How to Collect |
|---|---|---|
| User complaint rate | False positives experienced by users | Support tickets mentioning “blocked” or “can’t ask” |
| Retry rate | Users rewording blocked queries | Track sequential requests from same user within short window |
| Session abandonment | Users giving up after guardrail friction | Track sessions ending immediately after a block |
| Guardrail appeal rate | Users explicitly disputing blocks | “This shouldn’t have been blocked” feedback mechanism |
| Engagement after block | Whether blocked users continue using the system | Track user activity in the session after a block event |

These indirect metrics often tell a more honest story than technical metrics alone. A guardrail with perfect precision and recall numbers but a 15% session abandonment rate after blocks has a user experience problem that the technical metrics don’t capture.
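As an illustration of collecting one of these signals, the sketch below estimates session abandonment after a block from an ordered event stream. The event format, a chronologically ordered list of `(session_id, event_type)` pairs with `event_type` either `"allow"` or `"block"`, is an assumption, as is the definition used here: a session counts as abandoned if its final event is a block.

```python
def abandonment_after_block(events: list[tuple[str, str]]) -> float:
    """Fraction of sessions containing a block whose final event is a block."""
    last_event: dict[str, str] = {}
    had_block: set[str] = set()
    for session_id, event_type in events:
        last_event[session_id] = event_type
        if event_type == "block":
            had_block.add(session_id)
    if not had_block:
        return 0.0
    abandoned = sum(1 for s in had_block if last_event[s] == "block")
    return abandoned / len(had_block)


# Hypothetical events, already sorted by time.
events = [
    ("a", "allow"), ("a", "block"),   # session a ends on a block
    ("b", "block"), ("b", "allow"),   # session b continues after the block
    ("c", "allow"), ("c", "allow"),   # session c is never blocked
    ("d", "block"),                   # session d ends on a block
]

rate = abandonment_after_block(events)
print(f"{rate:.0%} of blocked sessions ended immediately after a block")
```

Sessions that were never blocked are excluded from the denominator, so the figure describes only users who actually encountered the guardrail.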

The Complete Guardrail Metrics Dashboard

Bring all metrics together in a single view:

| Metric | What It Measures | Target Range | What Bad Values Mean |
|---|---|---|---|
| Precision | Correctness of blocks | 0.90–0.98 | Below 0.85: too many false positives, user friction |
| Recall | Completeness of threat detection | 0.80–0.95+ | Below 0.75: significant safety gaps |
| F1 Score | Balance of precision and recall | 0.85–0.95 | Below 0.80: one or both metrics are weak |
| False Positive Rate | User friction from incorrect blocks | < 1% | Above 2%: guardrail credibility at risk |
| False Negative Rate | Safety gap from missed threats | < 10% | Above 20%: guardrail provides minimal protection |
| p50 Latency | Typical guardrail processing time | < 100ms | Above 200ms: noticeable user delay |
| p99 Latency | Tail latency under load | < 500ms | Above 1000ms: timeout risk, degraded experience |
| Cost per 1K evals | Economic efficiency | Varies | Rising faster than traffic: optimization needed |
| Input coverage | Percentage of inputs checked | 100% | Below 100%: unprotected traffic paths exist |
| Block rate | Overall % of requests blocked | 1–5% typical | Sudden change: either attack spike or guardrail issue |
| User complaint rate | Experience impact | < 0.1% of blocks | Rising: false positive increase not caught by metrics |
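The "bad values" column lends itself to simple automated checks. The sketch below encodes several of the thresholds from the table as rules. The metric keys and the `check_dashboard` function are illustrative names, and the limits should be replaced with whatever your own risk profile dictates.

```python
# (metric key, direction, limit, message); limits copied from the table above.
ALERT_RULES = [
    ("precision", "below", 0.85, "too many false positives, user friction"),
    ("recall", "below", 0.75, "significant safety gaps"),
    ("f1_score", "below", 0.80, "one or both metrics are weak"),
    ("false_positive_rate", "above", 0.02, "guardrail credibility at risk"),
    ("false_negative_rate", "above", 0.20, "guardrail provides minimal protection"),
    ("p99_latency_ms", "above", 1000, "timeout risk, degraded experience"),
    ("input_coverage", "below", 1.0, "unprotected traffic paths exist"),
]


def check_dashboard(metrics: dict[str, float]) -> list[str]:
    """Return one alert string per metric that breaches its limit."""
    alerts = []
    for key, direction, limit, message in ALERT_RULES:
        if key not in metrics:
            continue  # metric not reported in this snapshot
        value = metrics[key]
        breached = value < limit if direction == "below" else value > limit
        if breached:
            alerts.append(f"{key}={value}: {message}")
    return alerts


# Hypothetical snapshot of current values.
snapshot = {
    "precision": 0.83,
    "recall": 0.78,
    "false_positive_rate": 0.004,
    "p99_latency_ms": 420,
    "input_coverage": 0.97,
}

for alert in check_dashboard(snapshot):
    print(alert)
```

Trend-based rows such as block rate and cost per 1K evals need a comparison against history rather than a fixed limit, so they are left out of this sketch.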