Guardrail Taxonomy — cAIge Training

Section 3.1: Guardrail Taxonomy

Before you can design a guardrail system, you need a shared vocabulary for the kinds of guardrails that exist, where they sit in the pipeline, and what each one costs you. This section provides that vocabulary.

A common mistake in guardrail design is treating “guardrails” as a single concept — something you bolt on at the end. In practice, guardrails are a diverse family of controls that differ in placement, mechanism, cost, and failure characteristics. A guardrail that inspects the user’s prompt before it reaches the model is fundamentally different from one that scans the model’s output before it reaches the user, and both are different from a system-level circuit breaker that kills the entire request if latency exceeds a threshold.

Understanding this taxonomy is the foundation for everything else in this module.

The Six Guardrail Categories

We organize guardrails into six categories based on where they operate and what they protect:

Input Guardrails — inspect and validate what goes into the model
Output Guardrails — inspect and validate what comes out of the model
System-Level Guardrails — control the environment the model operates in
Retrieval Guardrails — protect the RAG pipeline from poisoned or inappropriate context
Agentic Guardrails — constrain what actions an AI agent can take in the world
Human-in-the-Loop Guardrails — route decisions to humans when automated controls are insufficient

These categories are not mutually exclusive. A production system typically uses guardrails from several categories simultaneously, layered in a defense-in-depth strategy.

Defense in depth guardrail layers

Input Guardrails

Input guardrails sit between the user (or upstream system) and the model. Their job is to ensure that the model only receives inputs that are safe, well-formed, and within the system’s intended scope.

Typical input guardrails include:

Prompt injection detection — identifying attempts to override system instructions via user input
Content policy enforcement — blocking requests that violate usage policies (hate speech, illegal content requests)
Schema validation — ensuring structured inputs conform to expected formats
Topic classification — rejecting or routing requests outside the system’s intended domain
Rate limiting — preventing abuse through volume-based controls
Identity and access control — ensuring the requester is authorized for the requested operation

Input guardrails are attractive because they can prevent bad requests from consuming model inference resources. A blocked prompt costs you a classifier call, not a full LLM generation. The trade-off is that input guardrails operate on the user’s intent (which is ambiguous) rather than the model’s actual output (which is concrete). This means they tend to have higher false positive rates — blocking legitimate requests that pattern-match against threat signatures.

Why this matters for guardrails: Input guardrails are your first line of defense, but they cannot be your only line. A prompt injection that evades input detection will produce harmful output unless output guardrails catch it. Defense in depth means assuming every layer can be bypassed.

Output Guardrails

Output guardrails sit between the model’s response and the user (or downstream system). They inspect what the model actually generated and decide whether it is safe to deliver.

Typical output guardrails include:

Toxicity and bias classification — scoring output for harmful content
PII detection and redaction — finding and removing personally identifiable information
Groundedness checking — verifying that claims are supported by provided context
Structured output validation — ensuring JSON, SQL, or other structured outputs conform to schemas
Citation enforcement — requiring that factual claims reference source documents
Confidence scoring — routing low-confidence responses to human review or fallback paths

Output guardrails are powerful because they operate on concrete text — the actual thing the user will see. This makes them more precise than input guardrails for many threat categories. The trade-off is latency: you have already paid for model inference before the output guardrail runs. If the guardrail blocks the response, that inference cost is wasted.

Why this matters for guardrails: Output guardrails are the last automated checkpoint before content reaches users. They catch threats that input guardrails miss (including novel attacks the input classifier has never seen) and threats that originate from the model itself (hallucination, bias in training data) rather than from the user’s input.

System-Level Guardrails

System-level guardrails do not inspect individual inputs or outputs. Instead, they control the operating environment — the system prompt, the model configuration, resource limits, and architectural patterns that shape how the model behaves.

Typical system-level guardrails include:

System prompt engineering — embedding safety instructions, persona constraints, and refusal behaviors
Conversation memory management — limiting context window size, summarizing history, detecting topic drift
Circuit breakers and fallbacks — killing requests that exceed latency or cost thresholds
Model routing — sending requests to different models based on risk classification
Multi-model verification — using a second model to check the first model’s output
Canary and shadow deployment — testing guardrail changes on a fraction of traffic before full rollout

System-level guardrails are often invisible to the end user but have enormous impact on safety. A well-crafted system prompt can prevent entire categories of harmful output without any runtime classification cost. The trade-off is that system prompts are “soft” controls — they influence the model through learned preferences, not hard enforcement. A determined adversary can often override system prompt instructions through prompt injection.

Why this matters for guardrails: System-level guardrails set the baseline behavior. They reduce the load on input and output guardrails by making harmful outputs less likely in the first place. But “less likely” is not “impossible,” which is why they must be combined with hard enforcement at other layers.

Retrieval Guardrails

Retrieval guardrails are specific to RAG (Retrieval-Augmented Generation) pipelines. They protect the pathway between the knowledge store and the model, ensuring that retrieved context is safe, relevant, authorized, and current.

Typical retrieval guardrails include:

Source access control — ensuring the user is authorized to see the retrieved documents
Relevance filtering — removing retrieved chunks that are semantically distant from the query
Indirect injection detection — scanning retrieved documents for embedded prompt injection attacks
Source attribution enforcement — requiring the model to cite which documents support its claims
Contradiction detection — identifying when retrieved sources disagree with each other
Staleness checking — flagging documents that may be outdated

Retrieval guardrails address a threat surface that does not exist in simple chat applications. When you give the model access to a knowledge base, you are introducing a new input channel — one that may be controlled by a different set of actors than the user. A document in your knowledge base could contain an indirect prompt injection planted by a malicious author.

Why this matters for guardrails: RAG pipelines expand the attack surface beyond user input. Retrieved documents become an additional untrusted input channel, and retrieval guardrails must treat them with the same suspicion as user prompts.

Agentic Guardrails

Agentic guardrails constrain AI systems that can take actions in the world — calling APIs, executing code, modifying databases, sending messages, or interacting with external services.

Typical agentic guardrails include:

Tool use policies — defining which tools the agent can call and under what conditions
Action confirmation — requiring human approval for high-risk actions
Scope limiting — restricting the agent to specific domains, data sets, or environments
Sandboxing — isolating agent execution so failures do not affect production systems
Budget caps — limiting the cost or number of actions per request
Reasoning trace auditing — logging the agent’s chain-of-thought for post-hoc review

Agentic guardrails are critical because the consequences of failure are no longer limited to text. A chat model that generates harmful text is bad; an agent that executes a harmful database query is catastrophic. The principle of least privilege — granting only the minimum permissions needed for the task — is the foundation of agentic guardrail design.

Why this matters for guardrails: Agentic systems amplify the consequences of guardrail failure from “bad text” to “bad actions.” Every tool call, API request, and state mutation must be governed by explicit policies that are enforced architecturally, not just requested in the system prompt.

Human-in-the-Loop Guardrails

Human-in-the-loop (HITL) guardrails route decisions to human reviewers when automated systems cannot make a confident determination. They are the ultimate fallback — accepting latency and cost in exchange for judgment that current AI systems cannot reliably provide.

Human-in-the-loop escalation tiers

Typical HITL guardrails include:

Confidence-based escalation — routing responses below a confidence threshold to human review
Risk-based escalation — requiring human approval for high-stakes domains (medical, legal, financial)
Appeal workflows — allowing users to contest automated decisions
Sampling-based review — randomly reviewing a percentage of responses for quality assurance
Annotation loops — feeding human judgments back into guardrail training data

The trade-off with HITL guardrails is obvious: they are slow and expensive. A human reviewer adds minutes or hours of latency and requires staffing, training, and tooling. The art is in designing escalation criteria that route the right requests to humans — not too many (creating bottlenecks) and not too few (missing critical issues).

Why this matters for guardrails: No automated guardrail system is perfect. HITL guardrails acknowledge this reality and provide a principled fallback for cases where the cost of an automated mistake exceeds the cost of human review. Designing effective escalation criteria is an architectural skill.

Comprehensive Guardrail Comparison

The following table summarizes the six guardrail categories across key dimensions:

Type	Placement	What It Guards Against	Latency Impact	False Positive Risk	Example
Input	Before model inference	Prompt injection, policy violations, malformed input	Low — blocks before expensive inference	Higher — judging intent from input alone	Classifier rejects “ignore your instructions” patterns
Output	After model inference	Toxic content, PII leakage, hallucination, schema violations	Medium — adds post-inference check	Lower — inspecting concrete output text	PII regex + NER model redacts SSNs from response
System-Level	Model environment	Broad behavioral drift, resource exhaustion, unsafe defaults	Negligible to medium	Low (but soft enforcement)	System prompt instructs refusal for medical diagnoses
Retrieval	Between knowledge store and model	Poisoned context, unauthorized data access, stale information	Low to medium per chunk	Medium — relevance thresholds are tunable	Access control filter removes documents user cannot see
Agentic	Between model decision and action execution	Unauthorized actions, scope creep, resource abuse	Variable — depends on confirmation workflow	Low — policies are explicit	Tool policy blocks DELETE operations on production databases
Human-in-the-Loop	Any escalation point	Anything automated systems miss	High — minutes to hours	Very low — human judgment	Legal team reviews contract-related responses

Combining Guardrails: Defense in Depth

No single guardrail category is sufficient on its own. Production systems layer guardrails from multiple categories to create defense in depth — the principle that security should not depend on any single control.

A well-designed defense-in-depth strategy follows these principles:

1. Order by cost. Run cheap checks before expensive ones. A regex-based injection check costs microseconds; an LLM-as-judge call costs hundreds of milliseconds and inference dollars. If the regex catches 60% of injection attempts, you save 60% of the LLM-as-judge cost.

2. Assume each layer can be bypassed. Design each guardrail as if the previous layer does not exist. If your output guardrail only works when the input guardrail has already filtered known threats, you have a single point of failure disguised as defense in depth.

3. Match guardrails to threat models. Not every application faces every threat. A customer service bot needs strong input guardrails against topic drift but may not need agentic guardrails. A code execution agent needs strong agentic guardrails but may face less prompt injection risk. Start with your threat model and select guardrails that address your specific risks.

4. Monitor guardrail interactions. Guardrails can interfere with each other. An input guardrail that rewrites the prompt may confuse an output guardrail that expects certain formatting. Test the full pipeline, not individual guardrails in isolation.

5. Plan for graceful degradation. When a guardrail fails or is unavailable (service outage, latency spike), the system should fail safely — typically by refusing to respond rather than responding without protection.

┌─────────────────────────────────────────────────────────┐
│                   Defense in Depth                       │
│                                                         │
│  User Request                                           │
│      │                                                  │
│      ▼                                                  │
│  ┌─────────────────┐                                    │
│  │  Input Guardrail │ ──block──▶ Rejection Response     │
│  │  (Layer 1)       │                                   │
│  └────────┬────────┘                                    │
│           │ pass                                        │
│           ▼                                             │
│  ┌─────────────────┐                                    │
│  │  System-Level    │ (system prompt, routing, limits)  │
│  │  (Layer 2)       │                                   │
│  └────────┬────────┘                                    │
│           │                                             │
│           ▼                                             │
│  ┌─────────────────┐                                    │
│  │  Model Inference │                                   │
│  └────────┬────────┘                                    │
│           │                                             │
│           ▼                                             │
│  ┌─────────────────┐                                    │
│  │  Output Guardrail│ ──block──▶ Fallback / Refusal     │
│  │  (Layer 3)       │                                   │
│  └────────┬────────┘                                    │
│           │ pass                                        │
│           ▼                                             │
│  ┌─────────────────┐                                    │
│  │  HITL Escalation │ ──uncertain──▶ Human Review Queue │
│  │  (Layer 4)       │                                   │
│  └────────┬────────┘                                    │
│           │ confident                                   │
│           ▼                                             │
│      Response Delivered                                 │
│                                                         │
└─────────────────────────────────────────────────────────┘

The Cost-Coverage Trade-off

Every guardrail introduces cost — in latency, compute, false positives, and engineering complexity. The guardrail architect’s job is to maximize threat coverage while minimizing these costs, recognizing that different applications will make different trade-offs.

A low-risk internal tool might use only input validation and a system prompt. A customer-facing medical assistant might need every category of guardrail including mandatory HITL for clinical recommendations. The taxonomy gives you the menu; your threat model tells you what to order.

A useful heuristic: for each guardrail you add, ask three questions:

What specific threat does this address? If you cannot name a concrete attack or failure mode, the guardrail may not be justified.
What happens if this guardrail fails? If the answer is “nothing, because another layer catches it,” you may have redundancy (which is fine for defense in depth) or waste (which is not).
What legitimate use cases does this block? Every guardrail has a false positive rate. If a guardrail blocks 2% of legitimate requests, that may be acceptable for a high-risk application and unacceptable for a high-volume consumer product.

The rest of this module works through each guardrail category in detail, covering the specific techniques, patterns, and trade-offs you need to design effective guardrail architectures.

← PreviousLearning Objectives Next →Input Guardrail Design