02-auto-processing-fail-open

1. Failure Handling Models in NADs

When AAA responses are delayed or absent, NADs typically fall into one of the following behaviors:

  • Fail-close

    • Block new sessions

    • Maximum security, minimum availability

  • Fail-open

    • Allow access using a predefined “critical” profile

  • Next-method

    • Fallback to:

      • MAB

      • Guest VLAN

      • Critical VLAN

      • Local policy

These are control-plane decisions, not authentication results.

If they are not intentionally engineered, they emerge as implicit behavior.
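
As one concrete illustration, the sketch below shows how "next-method" fallback might be expressed on a Cisco IOS access port using the classic authentication-manager syntax. The interface, VLAN ID, and method order are placeholders, not a recommendation; fail-close is simply the absence of any fallback action, and fail-open is covered in the next section.

    ! Hypothetical access port: 802.1X first, MAB as next-method fallback,
    ! guest VLAN 900 for hosts that never send EAPOL. All values are examples.
    interface GigabitEthernet1/0/10
     switchport mode access
     authentication port-control auto
     authentication order dot1x mab
     authentication priority dot1x mab
     authentication event fail action next-method
     authentication event no-response action authorize vlan 900
     mab
     dot1x pae authenticator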


2. Fail-Open via Critical Authentication (IAB)

Cisco switch configuration guides explicitly describe support for:

  • Critical Authentication

  • Inaccessible Authentication Bypass (IAB)

  • Commands such as:

    • dot1x critical eapol

    • Critical VLAN / Critical ACL assignment on server-dead events
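
A minimal sketch of how these pieces might be combined on IOS (classic syntax). VLAN numbers and interfaces are placeholders, and exact command availability varies by platform and release:

    ! Global: send EAPOL-Success to supplicants placed into critical auth
    dot1x critical eapol
    !
    ! Interface: if all RADIUS servers are marked dead, authorize the port
    ! into a restricted critical VLAN, keep voice alive, and reinitialize
    ! the session once a server recovers.
    interface GigabitEthernet1/0/10
     authentication event server dead action authorize vlan 910
     authentication event server dead action authorize voice
     authentication event server alive action reinitialize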

2.1 The Real Problem: Fail-Open ≠ “Full Access”

In 802.1X environments, fail-open usually means:

  • Session enters critical state

  • NAD applies a specific VLAN or ACL

  • Only minimal continuity is allowed:

    • Voice

    • Infrastructure services

    • Limited routing

If fail-open is treated as full access, the design creates its largest security gap exactly when visibility and control are lost.

Key risk: Fail-open without restriction turns control-plane failure into data-plane compromise.
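
One way to keep the critical state minimal is to pair the critical VLAN with a filter that permits only the continuity services listed above. The sketch below is a hypothetical ACL, with illustrative addresses and services, that could be enforced at the critical VLAN's Layer 3 boundary; where the restriction lives matters less than the fact that it exists and is tested.

    ! Hypothetical restriction for traffic sourced from the critical VLAN:
    ! DHCP, DNS, and one infrastructure subnet over HTTPS; everything else
    ! is dropped and logged.
    ip access-list extended ACL-CRITICAL-RESTRICT
     permit udp any eq bootpc any eq bootps
     permit udp any any eq domain
     permit tcp any 10.10.0.0 0.0.255.255 eq 443
     deny   ip any any log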


3. AAA Recovery Delay: The Most Underestimated Control

Cisco documentation describes AAA Recovery Delay, allowing NADs to:

  • Throttle authentication attempts

  • Stagger reauthentication

  • Give ISE time to stabilize after recovery
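
On IOS this control maps roughly to the command sketched below; the 1000 ms value is purely illustrative and must be sized against real PSN capacity:

    ! Wait 1000 ms after a RADIUS server recovers before reinitializing
    ! critical-auth ports, so a recovering PSN is not hit by every port at once.
    authentication critical recovery delay 1000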

3.1 Failure Mode Without Recovery Delay

Observed Outcome

  • ISE comes back online

  • Thousands of ports reauthenticate simultaneously

  • PSNs saturate

  • Latency increases

  • NADs time out again

  • Sessions fall back to critical auth

This creates a logical flap loop, even though all components are “up”.


4. Latency Misinterpreted as Server Failure

A recurring real-world case:

  • ISE responds, but slowly (>10s)

  • NAD RADIUS timeout expires

  • NAD marks PSN as dead

  • Traffic is redirected to another PSN
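
The boundary between "slow" and "dead" is largely a configuration decision. A hedged sketch using classic IOS RADIUS syntax, with purely illustrative values that must be aligned with the latency budget in section 10:

    ! Per-server timers: tolerate up to 10 s per attempt and two retransmits
    ! before a request is treated as failed.
    radius server ISE-PSN-1
     address ipv4 10.1.1.11 auth-port 1812 acct-port 1813
     key ExampleSharedSecret
     timeout 10
     retransmit 2
    !
    ! Global dead-marking: require both a 30 s quiet period and 3 failed
    ! tries before declaring a server dead, then hold it out for 15 minutes.
    radius-server dead-criteria time 30 tries 3
    radius-server deadtime 15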


4.1 Why This Is Dangerous

  • Sessions become split across PSNs

  • CoA is sent to the wrong node

  • Posture and authorization diverge

  • The system appears “redundant” but becomes inconsistent

Latency is treated as unavailability, even when the server is technically alive.


5. When the NAD Becomes the Policy Engine

In degraded mode, NADs effectively decide:

  • Who gets access

  • What level of access is granted

  • How long degraded access persists

  • When to retry AAA

If this behavior is not documented and tested, security posture becomes undefined.


6. Architectural Mitigations: Make Degraded Mode Predictable

6.1 Define Behavior Per Network Type

Different access domains require different failure semantics:

User Access

  • Restricted critical VLAN

  • Limited routing

Voice

  • Allow voice continuity

  • Minimal dependency on AAA

OT / IoT

  • Controlled MAB fallback

  • Static, predictable behavior

Admin Access (TACACS)

  • Fail-close is usually acceptable

One-size-fits-all fail-open is a design flaw.
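
A compressed sketch of how these intents might differ across port and access types on IOS; all names, VLANs, and method lists are placeholders:

    ! User access port: restricted critical VLAN for data, voice continuity.
    interface GigabitEthernet1/0/10
     authentication event server dead action authorize vlan 910
     authentication event server dead action authorize voice
     authentication event server alive action reinitialize
    !
    ! OT/IoT port: MAB-only, static and predictable critical behavior.
    interface GigabitEthernet1/0/20
     authentication order mab
     authentication priority mab
     mab
     authentication event server dead action authorize vlan 920
    !
    ! Admin access (TACACS+): a method list without a broad local fallback
    ! approximates fail-close; console access is the deliberate exception.
    aaa authentication login VTY_AUTH group tacacs+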


6.2 Treat Degraded Mode as a First-Class Requirement

Explicitly test:

  • Local PSN failure

  • Artificial latency to:

    • AD

    • DNS

    • Logging

  • Mass reauthentication events

  • Recovery with and without recovery delay

Validate that:

  • Access is minimal but sufficient

  • Recovery is stable

  • No authentication storm occurs

If degraded mode is not tested, it is not designed.
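
During these tests, the NAD itself can confirm which state it actually applied. On IOS, commands along the following lines (output and exact syntax vary by release) are a reasonable starting point:

    ! Which method and authorization state did the port really end up in?
    show authentication sessions interface GigabitEthernet1/0/10 details
    ! Are any RADIUS servers currently marked dead, and with what counters?
    show aaa servers
    show radius statistics
    ! Inject a synthetic authentication against the RADIUS group.
    test aaa group radius testuser testpassword new-code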


6.3 Size the System with Real Assumptions

Cisco performance guidance explicitly notes:

  • Scale numbers assume low inter-node latency

  • AD / LDAP latency directly impacts TPS and timeout behavior

  • Small latency increases can collapse throughput

If your environment violates these assumptions:

  • Timeout values must change

  • Failover behavior must be reconsidered

  • Published scale numbers no longer apply


7. Design Rules (Auto-Processing and Failure Handling)

  • NAD behavior under AAA failure must be explicitly defined

  • Fail-open must always mean restricted access

  • Recovery delay is mandatory in large environments

  • Latency must not be confused with node failure

  • Degraded-mode behavior must be predictable and documented


8. Failure Modes: Session Ownership — “Who Owns the Decision?”

At this point in the document, a pattern should be clear:

Most AAA and posture failures are not caused by wrong decisions, but by unclear ownership of the decision.

Latency, failover, load balancing, and retries all converge on a single question:

Which component owns the session at any given moment?


8.1 Session Ownership Model (End-to-End)

8.1.1 Ownership Domains

A single access session spans multiple ownership layers:

  • NAD

    • Port / SSID / tunnel state

    • Enforcement point

  • ISE PSN

    • Authentication state

    • Posture state

    • Authorization logic

  • External systems

    • AD / LDAP / MFA / DNS (dependencies, never owners)

Ownership must be singular and stable for the lifetime of the session.


8.1.2 Healthy Ownership Model

Characteristics

One PSN owns:

  • Authentication

  • Posture

  • Reauthentication

  • CoA

Design properties:

  • NAD always talks to the same PSN for the session

  • External systems never own session state


8.1.3 Broken Ownership Model (Common Failure)

What Breaks

  • CoA is sent by a non-owning PSN

  • NAD ignores or rejects the update

  • Authorization never transitions

  • Recovery requires reauthentication or port bounce

Ownership drift is invisible in basic logs and catastrophic under latency.
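
On the NAD side, CoA acceptance is itself explicit configuration: the switch only honors dynamic-authorization requests from RADIUS clients it has been told about. A minimal IOS sketch with placeholder addresses and key; note that listing every PSN here does not solve ownership by itself, it only defines who is allowed to try:

    ! Accept RADIUS CoA (RFC 5176) only from these ISE PSNs.
    aaa server radius dynamic-author
     client 10.1.1.11 server-key ExampleSharedSecret
     client 10.1.1.12 server-key ExampleSharedSecret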


9. Unified Design Rules vs Anti-Patterns

This section consolidates all previous failure modes into clear architectural guidance.


9.1 Core Design Rules

Latency-aware by design

  • RTT NAD ↔ PSN < 20–30 ms

  • RTT PSN ↔ identity stores ideally < 5 ms

Session ownership is immutable

  • One PSN per session

  • Load balancer stickiness is mandatory

CoA is control-plane traffic

  • Engineered

  • Monitored

  • Validated

Degraded mode is intentional

  • Fail-open is always restricted

  • Fail-close is explicit

  • Recovery is rate-limited

  • AAA recovery delay is mandatory at scale


9.2 Recurrent Anti-Patterns

Anti-Pattern → Result

  • Centralized PSN for global NADs → Latency-driven instability

  • Load balancer without stickiness → Silent posture failure

  • Fail-open = full access → Security collapse

  • Aggressive NAD timeouts → Retry storms

  • Latency treated as node failure → Session split-brain

  • Untested degraded mode → Unpredictable behavior

If more than one applies, the issue is architectural, not tuning-related.


10. Latency Budget Model (Auth + Posture + Enforcement)

Latency must be treated as a budget, not as an afterthought.


10.1 End-to-End Latency Budget

The total must fit within the NAD timeout:

  NAD timeout > (RTT NAD ↔ PSN)
              + (RTT PSN ↔ Identity Stores)
              + (ISE internal processing + queues)
              + (CoA delivery and enforcement)
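
A purely illustrative calculation with hypothetical numbers: 60 ms RTT between NAD and PSN, two identity-store round trips at 40 ms each, 300 ms of ISE processing, and 100 ms for CoA delivery give roughly 60 + 80 + 300 + 100 = 540 ms, which fits comfortably inside a 5-second RADIUS timeout. Replace the 60 ms with an 800 ms congested WAN path and the 300 ms with several seconds of queueing on a saturated PSN, and the same exchange crosses the timeout, triggering exactly the retry and failover behavior described in sections 3 and 4.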

If this inequality is violated:

  • Retries begin

  • Failover triggers

  • Ownership breaks

  • Posture desynchronizes


10.2 Why Published Scale Numbers Often Fail in Production

Cisco scale figures typically assume:

  • Near-zero packet loss

  • Single-digit millisecond latency

  • Minimal jitter

  • Local identity stores

Real environments include:

  • WAN / SD-WAN

  • Cloud-hosted PSNs

  • DNS and logging delays

  • Intermittent identity store latency

Result:

Scale numbers must be derated, or instability is guaranteed.


11. Reference Architecture: Latency-Aware NAC

11.1 Regionalized Control Plane

Properties

  • Low RTT within each region

  • Multiple PSNs per failure domain

  • No inter-regional authentication or posture

  • Clear and enforceable ownership boundaries


12. Executive Summary (One-Page Mental Model)

  • AAA and posture are control-plane systems

  • Latency breaks state, not just performance

  • Ownership drift causes silent failure

  • Fail-open / fail-close are policy decisions

  • Recovery without throttling causes instability

  • If posture matters, latency must be engineered


Key Takeaway

NAC failures under latency are rarely caused by bugs. They are caused by implicit assumptions:

  • That latency is negligible

  • That failover is free

  • That load balancing is harmless

  • That degraded mode “just works”

In reality:

Authentication succeeds or fails at the speed of the control plane. Posture succeeds or fails at the speed of state alignment.

If these are not explicitly designed, the system will make decisions for you, usually the wrong ones.

