02-auto-processing-fail-open

1. Failure Handling Models in NADs

When AAA responses are delayed or absent, NADs typically fall into one of the following behaviors:

  • Fail-close

    • Block new sessions

    • Maximum security, minimum availability

  • Fail-open

    • Allow access using a predefined “critical” profile

  • Next-method

    • Fallback to:

      • MAB

      • Guest VLAN

      • Critical VLAN

      • Local policy

These are control-plane decisions, not authentication results.

If they are not intentionally engineered, they emerge as implicit behavior.
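
As one concrete illustration, the sketch below shows how "next-method" fallback might be expressed on a Cisco IOS access port using the classic authentication-manager syntax. The interface, VLAN ID, and method order are placeholders, not a recommendation; fail-close is simply the absence of any fallback action, and fail-open is covered in the next section.

    ! Hypothetical access port: 802.1X first, MAB as next-method fallback,
    ! guest VLAN 900 for hosts that never send EAPOL. All values are examples.
    interface GigabitEthernet1/0/10
     switchport mode access
     authentication port-control auto
     authentication order dot1x mab
     authentication priority dot1x mab
     authentication event fail action next-method
     authentication event no-response action authorize vlan 900
     mab
     dot1x pae authenticator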


2. Fail-Open via Critical Authentication (IAB)

Cisco switch configuration guides explicitly describe support for:

  • Critical Authentication

  • Inaccessible Authentication Bypass (IAB)

  • Commands such as:

    • dot1x critical eapol

    • Critical VLAN / Critical ACL assignment on server-dead events
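
A minimal sketch of how these pieces might be combined on IOS (classic syntax). VLAN numbers and interfaces are placeholders, and exact command availability varies by platform and release:

    ! Global: send EAPOL-Success to supplicants placed into critical auth
    dot1x critical eapol
    !
    ! Interface: if all RADIUS servers are marked dead, authorize the port
    ! into a restricted critical VLAN, keep voice alive, and reinitialize
    ! the session once a server recovers.
    interface GigabitEthernet1/0/10
     authentication event server dead action authorize vlan 910
     authentication event server dead action authorize voice
     authentication event server alive action reinitialize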

2.1 The Real Problem: Fail-Open ≠ “Full Access”

In 802.1X environments, fail-open usually means:

  • Session enters critical state

  • NAD applies a specific VLAN or ACL

  • Only minimal continuity is allowed:

    • Voice

    • Infrastructure services

    • Limited routing

If fail-open is treated as full access, the design creates its largest security gap exactly when visibility and control are lost.

Key risk: Fail-open without restriction turns control-plane failure into data-plane compromise.
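
One way to keep the critical state minimal is to pair the critical VLAN with a filter that permits only the continuity services listed above. The sketch below is a hypothetical ACL, with illustrative addresses and services, that could be enforced at the critical VLAN's Layer 3 boundary; where the restriction lives matters less than the fact that it exists and is tested.

    ! Hypothetical restriction for traffic sourced from the critical VLAN:
    ! DHCP, DNS, and one infrastructure subnet over HTTPS; everything else
    ! is dropped and logged.
    ip access-list extended ACL-CRITICAL-RESTRICT
     permit udp any eq bootpc any eq bootps
     permit udp any any eq domain
     permit tcp any 10.10.0.0 0.0.255.255 eq 443
     deny   ip any any log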


3. AAA Recovery Delay: The Most Underestimated Control

Cisco documentation describes AAA Recovery Delay, allowing NADs to:

  • Throttle authentication attempts

  • Stagger reauthentication

  • Give ISE time to stabilize after recovery
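
On IOS this control maps roughly to the command sketched below; the 1000 ms value is purely illustrative and must be sized against real PSN capacity:

    ! Wait 1000 ms after a RADIUS server recovers before reinitializing
    ! critical-auth ports, so a recovering PSN is not hit by every port at once.
    authentication critical recovery delay 1000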

3.1 Failure Mode Without Recovery Delay

Observed Outcome

  • ISE comes back online

  • Thousands of ports reauthenticate simultaneously

  • PSNs saturate

  • Latency increases

  • NADs time out again

  • Sessions fall back to critical auth

This creates a logical flap loop, even though all components are “up”.


4. Latency Misinterpreted as Server Failure

A recurring real-world case:

  • ISE responds, but slowly (>10s)

  • NAD RADIUS timeout expires

  • NAD marks PSN as dead

  • Traffic is redirected to another PSN
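
The boundary between "slow" and "dead" is largely a configuration decision. A hedged sketch using classic IOS RADIUS syntax, with purely illustrative values that must be aligned with the latency budget in section 10:

    ! Per-server timers: tolerate up to 10 s per attempt and two retransmits
    ! before a request is treated as failed.
    radius server ISE-PSN-1
     address ipv4 10.1.1.11 auth-port 1812 acct-port 1813
     key ExampleSharedSecret
     timeout 10
     retransmit 2
    !
    ! Global dead-marking: require both a 30 s quiet period and 3 failed
    ! tries before declaring a server dead, then hold it out for 15 minutes.
    radius-server dead-criteria time 30 tries 3
    radius-server deadtime 15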


4.1 Why This Is Dangerous

  • Sessions become split across PSNs

  • CoA is sent to the wrong node

  • Posture and authorization diverge

  • The system appears “redundant” but becomes inconsistent

Latency is treated as unavailability, even when the server is technically alive.


5. When the NAD Becomes the Policy Engine

In degraded mode, NADs effectively decide:

  • Who gets access

  • What level of access is granted

  • How long degraded access persists

  • When to retry AAA

If this behavior is not documented and tested, security posture becomes undefined.


6. Architectural Mitigations: Make Degraded Mode Predictable

6.1 Define Behavior Per Network Type

Different access domains require different failure semantics:

User Access

  • Restricted critical VLAN

  • Limited routing

Voice

  • Allow voice continuity

  • Minimal dependency on AAA

OT / IoT

  • Controlled MAB fallback

  • Static, predictable behavior

Admin Access (TACACS)

  • Fail-close is usually acceptable

One-size-fits-all fail-open is a design flaw.
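
A compressed sketch of how these intents might differ across port and access types on IOS; all names, VLANs, and method lists are placeholders:

    ! User access port: restricted critical VLAN for data, voice continuity.
    interface GigabitEthernet1/0/10
     authentication event server dead action authorize vlan 910
     authentication event server dead action authorize voice
     authentication event server alive action reinitialize
    !
    ! OT/IoT port: MAB-only, static and predictable critical behavior.
    interface GigabitEthernet1/0/20
     authentication order mab
     authentication priority mab
     mab
     authentication event server dead action authorize vlan 920
    !
    ! Admin access (TACACS+): a method list without a broad local fallback
    ! approximates fail-close; console access is the deliberate exception.
    aaa authentication login VTY_AUTH group tacacs+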


6.2 Treat Degraded Mode as a First-Class Requirement

Explicitly test:

  • Local PSN failure

  • Artificial latency to:

    • AD

    • DNS

    • Logging

  • Mass reauthentication events

  • Recovery with and without recovery delay

Validate that:

  • Access is minimal but sufficient

  • Recovery is stable

  • No authentication storm occurs

If degraded mode is not tested, it is not designed.
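
During these tests, the NAD itself can confirm which state it actually applied. On IOS, commands along the following lines (output and exact syntax vary by release) are a reasonable starting point:

    ! Which method and authorization state did the port really end up in?
    show authentication sessions interface GigabitEthernet1/0/10 details
    ! Are any RADIUS servers currently marked dead, and with what counters?
    show aaa servers
    show radius statistics
    ! Inject a synthetic authentication against the RADIUS group.
    test aaa group radius testuser testpassword new-code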


6.3 Size the System with Real Assumptions

Cisco performance guidance explicitly notes:

  • Scale numbers assume low inter-node latency

  • AD / LDAP latency directly impacts TPS and timeout behavior

  • Small latency increases can collapse throughput

If your environment violates these assumptions:

  • Timeout values must change

  • Failover behavior must be reconsidered

  • Published scale numbers no longer apply


7. Design Rules (Auto-Processing and Failure Handling)

  • NAD behavior under AAA failure must be explicitly defined

  • Fail-open must always mean restricted access

  • Recovery delay is mandatory in large environments

  • Latency must not be confused with node failure

  • Degraded-mode behavior must be predictable and documented


8. Failure Modes: Session Ownership — “Who Owns the Decision?”

At this point in the document, a pattern should be clear:

Most AAA and posture failures are not caused by wrong decisions, but by unclear ownership of the decision.

Latency, failover, load balancing, and retries all converge on a single question:

Which component owns the session at any given moment?


8.1 Session Ownership Model (End-to-End)

8.1.1 Ownership Domains

A single access session spans multiple ownership layers:

  • NAD

    • Port / SSID / tunnel state

    • Enforcement point

  • ISE PSN

    • Authentication state

    • Posture state

    • Authorization logic

  • External systems

    • AD / LDAP / MFA / DNS (dependencies, never owners)

Ownership must be singular and stable for the lifetime of the session.


8.1.2 Healthy Ownership Model

Characteristics

One PSN owns:

  • Authentication

  • Posture

  • Reauthentication

  • CoA

Design properties:

  • NAD always talks to the same PSN for the session

  • External systems never own session state


8.1.3 Broken Ownership Model (Common Failure)

What Breaks

  • CoA is sent by a non-owning PSN

  • NAD ignores or rejects the update

  • Authorization never transitions

  • Recovery requires reauthentication or port bounce

Ownership drift is invisible in basic logs and catastrophic under latency.
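
On the NAD side, CoA acceptance is itself explicit configuration: the switch only honors dynamic-authorization requests from RADIUS clients it has been told about. A minimal IOS sketch with placeholder addresses and key; note that listing every PSN here does not solve ownership by itself, it only defines who is allowed to try:

    ! Accept RADIUS CoA (RFC 5176) only from these ISE PSNs.
    aaa server radius dynamic-author
     client 10.1.1.11 server-key ExampleSharedSecret
     client 10.1.1.12 server-key ExampleSharedSecret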


9. Unified Design Rules vs Anti-Patterns

This section consolidates all previous failure modes into clear architectural guidance.


9.1 Core Design Rules

Latency-aware by design

  • RTT NAD ↔ PSN < 20–30 ms

  • RTT PSN ↔ identity stores ideally < 5 ms

Session ownership is immutable

  • One PSN per session

  • Load balancer stickiness is mandatory

CoA is control-plane traffic

  • Engineered

  • Monitored

  • Validated

Degraded mode is intentional

  • Fail-open is always restricted

  • Fail-close is explicit

  • Recovery is rate-limited

  • AAA recovery delay is mandatory at scale


9.2 Recurrent Anti-Patterns

Anti-Pattern → Result

  • Centralized PSN for global NADs → Latency-driven instability

  • Load balancer without stickiness → Silent posture failure

  • Fail-open = full access → Security collapse

  • Aggressive NAD timeouts → Retry storms

  • Latency treated as node failure → Session split-brain

  • Untested degraded mode → Unpredictable behavior

If more than one applies, the issue is architectural, not tuning-related.


10. Latency Budget Model (Auth + Posture + Enforcement)

Latency must be treated as a budget, not as an afterthought.


10.1 End-to-End Latency Budget

The total must fit within the NAD timeout:

  NAD timeout > (RTT NAD ↔ PSN)
              + (RTT PSN ↔ Identity Stores)
              + (ISE internal processing + queues)
              + (CoA delivery and enforcement)
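
A purely illustrative calculation with hypothetical numbers: 60 ms RTT between NAD and PSN, two identity-store round trips at 40 ms each, 300 ms of ISE processing, and 100 ms for CoA delivery give roughly 60 + 80 + 300 + 100 = 540 ms, which fits comfortably inside a 5-second RADIUS timeout. Replace the 60 ms with an 800 ms congested WAN path and the 300 ms with several seconds of queueing on a saturated PSN, and the same exchange crosses the timeout, triggering exactly the retry and failover behavior described in sections 3 and 4.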

If this inequality is violated:

  • Retries begin

  • Failover triggers

  • Ownership breaks

  • Posture desynchronizes


10.2 Why Published Scale Numbers Often Fail in Production

Cisco scale figures typically assume:

  • Near-zero packet loss

  • Single-digit millisecond latency

  • Minimal jitter

  • Local identity stores

Real environments include:

  • WAN / SD-WAN

  • Cloud-hosted PSNs

  • DNS and logging delays

  • Intermittent identity store latency

Result:

Scale numbers must be derated, or instability is guaranteed.


11. Reference Architecture: Latency-Aware NAC

11.1 Regionalized Control Plane

Properties

  • Low RTT within each region

  • Multiple PSNs per failure domain

  • No inter-regional authentication or posture

  • Clear and enforceable ownership boundaries


12. Executive Summary (One-Page Mental Model)

  • AAA and posture are control-plane systems

  • Latency breaks state, not just performance

  • Ownership drift causes silent failure

  • Fail-open / fail-close are policy decisions

  • Recovery without throttling causes instability

  • If posture matters, latency must be engineered


Key Takeaway

NAC failures under latency are rarely caused by bugs. They are caused by implicit assumptions:

  • That latency is negligible

  • That failover is free

  • That load balancing is harmless

  • That degraded mode “just works”

In reality:

Authentication succeeds or fails at the speed of the control plane. Posture succeeds or fails at the speed of state alignment.

If these are not explicitly designed, the system will make decisions for you, usually the wrong ones.

