1. Failure Handling Models in NADs
When AAA responses are delayed or absent, NADs typically exhibit one of the following behaviors:
Fail-close: maximum security, minimum availability
Fail-open: allow access using a predefined “critical” profile
These are control-plane decisions made by the NAD, not authentication results returned by the AAA server.
If they are not engineered deliberately, they still occur, as implicit behavior that nobody chose.
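The distinction between an authentication result and a control-plane decision can be sketched in a few lines of Python. This is a toy model, not NAD code: the names `FailureMode`, `PortDecision`, `decide_on_aaa_outcome`, and the profile name `CRITICAL-RESTRICTED` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    FAIL_CLOSE = "fail-close"
    FAIL_OPEN = "fail-open"


@dataclass(frozen=True)
class PortDecision:
    access_granted: bool
    profile: Optional[str]  # profile applied to the port, if any
    source: str             # "aaa" (authentication result) or "nad-local"


def decide_on_aaa_outcome(aaa_reply: Optional[str],
                          mode: FailureMode,
                          critical_profile: str = "CRITICAL-RESTRICTED") -> PortDecision:
    """Return the port decision for one authentication attempt.

    aaa_reply is "accept", "reject", or None when no reply arrived in time.
    """
    if aaa_reply == "accept":
        return PortDecision(True, "aaa-assigned", "aaa")
    if aaa_reply == "reject":
        return PortDecision(False, None, "aaa")
    if aaa_reply is not None:
        raise ValueError(f"unexpected AAA reply: {aaa_reply!r}")
    # No reply: the NAD, not the AAA server, makes the decision.
    if mode is FailureMode.FAIL_OPEN:
        return PortDecision(True, critical_profile, "nad-local")
    return PortDecision(False, None, "nad-local")
```

Only the last branch is a failure-handling decision. In this sketch it is the one place where the configured mode, rather than the AAA server, determines what the port does.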
2. Fail-Open via Critical Authentication (IAB)
Cisco switch configuration guides explicitly describe support for:
Inaccessible Authentication Bypass (IAB)
Commands such as:
Critical VLAN / Critical ACL
2.1 The Real Problem: Fail-Open ≠ “Full Access”
In 802.1X environments, fail-open usually means:
Session enters critical state
NAD applies a specific VLAN or ACL
Only minimal continuity is allowed:
If fail-open is treated as full access, the design creates its largest security gap exactly when visibility and control are lost.
Key risk:
Fail-open without restriction turns control-plane failure into data-plane compromise.
3. AAA Recovery Delay: The Most Underestimated Control
Cisco documentation describes AAA Recovery Delay, allowing NADs to:
Throttle authentication attempts
Give ISE time to stabilize after recovery
3.1 Failure Mode Without Recovery Delay
Observed Outcome
Thousands of ports reauthenticate simultaneously
Sessions fall back to critical auth
This creates a logical flap loop, even though all components are “up”.
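The effect of throttling can be illustrated with a small deterministic simulation. It is a simplified sketch, assuming a single AAA server that serves a fixed number of requests per second in FIFO order and keeps spending capacity on requests the NAD has already given up on. The function name, parameters, and all numbers are hypothetical and are not taken from Cisco documentation.

```python
from collections import deque
from typing import Optional


def simulate_recovery(ports: int, server_tps: int, timeout_s: int,
                      release_per_s: Optional[int] = None) -> dict:
    """One-second-step model of reauthentication after AAA recovery.

    release_per_s=None: every port reauthenticates in the first second.
    release_per_s=N:    the NAD releases at most N ports per second.
    """
    if server_tps <= 0 or (release_per_s is not None and release_per_s <= 0):
        raise ValueError("rates must be positive")
    pending = ports      # ports still waiting to reauthenticate
    queue = deque()      # arrival tick of each request queued at the server
    authenticated = 0
    fell_back = 0
    tick = 0
    while pending > 0 or queue:
        batch = pending if release_per_s is None else min(pending, release_per_s)
        queue.extend([tick] * batch)
        pending -= batch
        # The server serves up to server_tps requests per tick, oldest first.
        for _ in range(min(server_tps, len(queue))):
            arrived = queue.popleft()
            if tick - arrived < timeout_s:
                authenticated += 1
            else:
                fell_back += 1  # NAD timed out; session returns to critical auth
        tick += 1
    return {"authenticated": authenticated,
            "fell_back_to_critical": fell_back,
            "seconds": tick}


unthrottled = simulate_recovery(ports=5000, server_tps=200, timeout_s=5)
throttled = simulate_recovery(ports=5000, server_tps=200, timeout_s=5,
                              release_per_s=200)
```

With these assumed numbers, the unthrottled run authenticates 1000 ports and sends 4000 back to critical auth, while the throttled run authenticates all 5000 in the same 25 seconds. The server is "up" in both cases; only the arrival pattern differs.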
4. Latency Misinterpreted as Server Failure
A recurring real-world case:
ISE responds, but slowly (> 10 s)
NAD RADIUS timeout expires
Traffic is redirected to another PSN
4.1 Why This Is Dangerous
Sessions become split across PSNs
CoA is sent to the wrong node
Posture and authorization diverge
The system appears “redundant” but becomes inconsistent
Latency is treated as unavailability, even when the server is technically alive.
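A minimal sketch of why the NAD cannot tell the two cases apart: from its side, a reply that arrives after the timeout and a reply that never arrives produce the same observation. The function names and the 10-second timeout are illustrative assumptions, not values from any platform.

```python
from typing import Optional


def nad_observation(response_time_s: Optional[float], nad_timeout_s: float) -> str:
    """What the NAD can observe for one request.

    response_time_s is None when the server never answers.
    """
    if response_time_s is not None and response_time_s <= nad_timeout_s:
        return "reply"
    return "timeout"


def server_state(response_time_s: Optional[float]) -> str:
    """Ground truth the NAD cannot see: did the server answer at all?"""
    return "dead" if response_time_s is None else "alive"


NAD_TIMEOUT_S = 10.0  # hypothetical value

for label, response_time in [("healthy", 0.2), ("slow", 12.0), ("down", None)]:
    print(label, server_state(response_time),
          nad_observation(response_time, NAD_TIMEOUT_S))
```

The "slow" and "down" rows differ in server state but not in what the NAD observes, which is why failover logic driven only by timeouts moves sessions away from a server that is still processing them.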
5. When the NAD Becomes the Policy Engine
In degraded mode, NADs effectively decide:
What level of access is granted
How long degraded access persists
If this behavior is not documented and tested, security posture becomes undefined.
6. Architectural Mitigations: Make Degraded Mode Predictable
6.1 Define Behavior Per Network Type
Different access domains require different failure semantics:
User Access
Voice
Minimal dependency on AAA
OT / IoT
Static, predictable behavior
Admin Access (TACACS)
Fail-close is usually acceptable
One-size-fits-all fail-open is a design flaw.
6.2 Treat Degraded Mode as a First-Class Requirement
Explicitly test:
Mass reauthentication events
Recovery with and without recovery delay
Validate that:
Access is minimal but sufficient
No authentication storm occurs
If degraded mode is not tested, it is not designed.
6.3 Size the System with Real Assumptions
Cisco performance guidance explicitly notes:
Scale numbers assume low inter-node latency
AD / LDAP latency directly impacts TPS and timeout behavior
Small latency increases can collapse throughput
If your environment violates these assumptions:
Timeout values must change
Failover behavior must be reconsidered
Published scale numbers no longer apply
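One way to reason about why small latency increases collapse throughput is a back-of-the-envelope model based on Little's law: if a node has a fixed number of concurrent authentication handlers and each one is blocked for the duration of an authentication, the throughput ceiling is handlers divided by time per authentication. This is a simplification for intuition, not Cisco's sizing method; the handler count, processing time, and round-trip count below are hypothetical.

```python
def max_tps(workers: int, local_ms: float, store_rtt_ms: float,
            store_round_trips: int = 1) -> float:
    """Upper bound on authentications per second for a fixed worker pool.

    Each authentication holds one worker for local processing plus
    store_round_trips round trips to the identity store.
    """
    per_auth_s = (local_ms + store_rtt_ms * store_round_trips) / 1000.0
    if per_auth_s <= 0:
        raise ValueError("time per authentication must be positive")
    return workers / per_auth_s


# Same node, same load profile; only identity store RTT changes.
near = max_tps(workers=100, local_ms=5, store_rtt_ms=2, store_round_trips=3)
far = max_tps(workers=100, local_ms=5, store_rtt_ms=20, store_round_trips=3)
```

Under these assumptions the ceiling drops from roughly 9,000 to roughly 1,500 authentications per second when the identity store RTT goes from 2 ms to 20 ms, before any queueing or timeout effects are considered.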
7. Design Rules (Auto-Processing and Failure Handling)
NAD behavior under AAA failure must be explicitly defined
Fail-open must always mean restricted access
Recovery delay is mandatory in large environments
Latency must not be confused with node failure
Degraded-mode behavior must be predictable and documented
8. Failure Modes: Session Ownership — “Who Owns the Decision?”
At this point in the document, a pattern should be clear:
Most AAA and posture failures are not caused by wrong decisions,
but by unclear ownership of the decision.
Latency, failover, load balancing, and retries all converge on a single question:
Which component owns the session at any given moment?
8.1 Session Ownership Model (End-to-End)
8.1.1 Ownership Domains
A single access session spans multiple ownership layers:
NAD
Port / SSID / tunnel state
External systems
AD / LDAP / MFA / DNS
(dependencies, never owners)
Ownership must be singular and stable for the lifetime of the session.
8.1.2 Healthy Ownership Model
Characteristics
One PSN owns:
Design properties:
NAD always talks to the same PSN for the session
External systems never own session state
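The property "same PSN for the session" can be sketched as a deterministic mapping from a session key to a PSN. The example below assumes the key is the endpoint MAC address as carried in Calling-Station-ID and uses plain hash-modulo selection. Both choices are assumptions for illustration; `pick_psn` and the PSN names are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_psn(calling_station_id: str, psns: Sequence[str]) -> str:
    """Map an endpoint to one PSN, independent of MAC address formatting."""
    if not psns:
        raise ValueError("PSN pool is empty")
    normalized = (calling_station_id
                  .replace("-", "").replace(":", "").replace(".", "")
                  .lower())
    digest = hashlib.sha256(normalized.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(psns)
    return psns[index]


pool = ["psn-a", "psn-b", "psn-c"]
owner = pick_psn("00-11-22-AA-BB-CC", pool)
```

Every packet for that endpoint, whether authentication, accounting, or reauthentication, resolves to the same owner as long as the pool is unchanged. Hash-modulo remaps many endpoints when a PSN is added or removed, which is one reason a persistence table or consistent hashing is usually preferable in practice.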
8.1.3 Broken Ownership Model (Common Failure)
CoA is sent by a non-owning PSN
NAD ignores or rejects the update
Authorization never transitions
Recovery requires reauthentication or port bounce
Ownership drift is invisible in basic logs and catastrophic under latency.
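Because drift does not show up in a single log line, it has to be found by correlating records. A minimal sketch, assuming log records have already been reduced to (session key, PSN name) pairs; the record format, the function name, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_ownership_drift(events: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return the sessions that were handled by more than one PSN.

    events: (session_key, psn_name) pairs, one per processed request.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for session_key, psn in events:
        seen[session_key].add(psn)
    return {key: psns for key, psns in seen.items() if len(psns) > 1}


sample = [
    ("sess-1", "psn-a"), ("sess-1", "psn-a"),  # stable ownership
    ("sess-2", "psn-a"), ("sess-2", "psn-b"),  # drifted after a timeout
]
drifted = find_ownership_drift(sample)
```

Here `drifted` contains only `sess-2`, mapped to both PSNs. Any non-empty result means at least one session no longer has a single owner.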
9. Unified Design Rules vs Anti-Patterns
This section consolidates all previous failure modes into clear architectural guidance.
9.1 Core Design Rules
Latency-aware by design
RTT PSN ↔ identity stores ideally < 5 ms
Session ownership is immutable
Load balancer stickiness is mandatory
CoA is control-plane traffic
Degraded mode is intentional
Fail-open is always restricted
AAA recovery delay is mandatory at scale
9.2 Recurrent Anti-Patterns
Centralized PSN for global NADs
Latency-driven instability
Load balancer without stickiness
Latency treated as node failure
If more than one applies, the issue is architectural, not tuning-related.
10. Latency Budget Model (Auth + Posture + Enforcement)
Latency must be treated as a budget, not as an afterthought.
10.1 End-to-End Latency Budget
The total must fit within the NAD timeout:
NAD timeout > (RTT NAD ↔ PSN)
            + (RTT PSN ↔ Identity Stores)
            + (ISE internal processing + queues)
            + (CoA delivery and enforcement)
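The inequality above can be checked mechanically for a given site. The sketch below uses the four terms from the budget; the class name, field names, and all sample values are hypothetical and should be replaced with measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    rtt_nad_psn_ms: float        # RTT NAD <-> PSN
    rtt_psn_identity_ms: float   # RTT PSN <-> identity stores
    ise_processing_ms: float     # ISE internal processing + queues
    coa_enforcement_ms: float    # CoA delivery and enforcement

    def total_ms(self) -> float:
        return (self.rtt_nad_psn_ms + self.rtt_psn_identity_ms
                + self.ise_processing_ms + self.coa_enforcement_ms)

    def headroom_ms(self, nad_timeout_ms: float) -> float:
        """Positive when the budget fits inside the NAD timeout."""
        return nad_timeout_ms - self.total_ms()

    def fits(self, nad_timeout_ms: float) -> bool:
        return self.headroom_ms(nad_timeout_ms) > 0


regional = LatencyBudget(rtt_nad_psn_ms=10, rtt_psn_identity_ms=5,
                         ise_processing_ms=200, coa_enforcement_ms=50)
centralized = LatencyBudget(rtt_nad_psn_ms=180, rtt_psn_identity_ms=150,
                            ise_processing_ms=4500, coa_enforcement_ms=400)
```

With an assumed 5000 ms NAD timeout, `regional` leaves 4735 ms of headroom while `centralized` exceeds the budget by 230 ms. Tracking headroom rather than a pass/fail result makes it visible how close a site is to the point where latency starts to look like failure.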
If this inequality is violated:
10.2 Why Published Scale Numbers Often Fail in Production
Cisco scale figures typically assume:
Single-digit millisecond latency
Real environments include:
Intermittent identity store latency
Result:
Scale numbers must be derated, or instability is guaranteed.
11. Reference Architecture: Latency-Aware NAC
11.1 Regionalized Control Plane
Low RTT within each region
Multiple PSNs per failure domain
No inter-regional authentication or posture
Clear and enforceable ownership boundaries
12. Executive Summary (One-Page Mental Model)
AAA and posture are control-plane systems
Latency breaks state, not just performance
Ownership drift causes silent failure
Fail-open / fail-close are policy decisions
Recovery without throttling causes instability
If posture matters, latency must be engineered
NAC failures under latency are rarely caused by bugs.
They are caused by implicit assumptions:
That latency is negligible
That load balancing is harmless
That degraded mode “just works”
In reality:
Authentication succeeds or fails at the speed of the control plane.
Posture succeeds or fails at the speed of state alignment.
If these are not explicitly designed,
the system will make decisions for you, usually the wrong ones.