Failure Modes: Authentication Queues (ISE / AAA Under Latency)

In NAC systems (e.g. Cisco ISE), latency is rarely a single number. What breaks user experience and/or security is the combined effect of:

RTT NAD ↔ PSN (RADIUS / EAP)
RTT PSN ↔ Identity Stores (AD / LDAP / IdP / MFA / OCSP / CRL / MDM)
CPU time and internal contention (thread queues, locks, GC, disk)
Retries and retransmission storms (NADs, supplicants, AAA clients)

When this combined latency exceeds NAD/supplicant timers, the architecture enters very specific failure modes.

1. What Actually Queues Inside ISE (and Why)

Cisco Live BRKSEC-3412 presents a practical view: latency appears in Live Logs as Step Latency (e.g. Evaluating Policy Group), allowing clear separation between:

External latency (identity stores, network, MFA, OCSP, DNS)
Internal latency (processing and queue contention)

1.1 Thread Pools / Queues That Become Bottlenecks (TAC View)

The RADIUS / policy pipeline uses multiple thread pools, for example:

Main (RADIUS I/O)
Policy (policy evaluation / PDP)
EapTls
ADIDStore
RestFetcher

The classic failure pattern under latency:

Main receives Access-Request and creates a session
Policy triggers policy evaluation (calling PIPs / identity stores)
One step becomes slow (AD, REST, DNS, disk, cache, GC)
The request remains “alive” until NAD timers expire
The NAD retransmits → retry storm
Load multiplies without a hard failure signal
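
The multiplication in the last two steps can be put into numbers. This is a minimal sketch under simplifying assumptions, not a description of any specific NAD: the NAD retransmits at a fixed interval equal to its timeout, the response latency is measured from the first request, and retransmits are not answered any faster than the original. The function name and parameters are illustrative.

```python
import math


def requests_per_auth(response_latency_s: float,
                      nad_timeout_s: float,
                      max_retransmits: int) -> int:
    """Access-Requests a NAD sends for one authentication attempt.

    The first request goes out at t=0. While no response has arrived,
    the NAD retransmits at t=timeout, 2*timeout, ... up to max_retransmits.
    """
    if response_latency_s <= nad_timeout_s:
        return 1  # answered before the first timer expired
    # Timers that expire strictly before the response arrives.
    expired_timers = math.ceil(response_latency_s / nad_timeout_s) - 1
    return 1 + min(expired_timers, max_retransmits)


def offered_load_rps(auth_rate_per_s: float,
                     response_latency_s: float,
                     nad_timeout_s: float,
                     max_retransmits: int) -> float:
    """Request rate the PSN actually sees for a given rate of new auths."""
    return auth_rate_per_s * requests_per_auth(
        response_latency_s, nad_timeout_s, max_retransmits)
```

With a 1 s timeout and 3 retransmits, a step that takes about 4 s turns 200 new authentications per second into 800 requests per second, while every individual request still looks like a normal Access-Request.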

2. Observable Production Symptoms (Real Cases)

2.1 High Authentication Latency Without AD in the Path

A real case shows:

Alarm: High Authentication Latency
Log entry: Evaluating Policy Group (Step latency=3961ms)
No AD involved at that stage
Multiple internal PIP lookups (Called-Station-ID, Location, etc.)

This is a clear signature of internal contention or indirect dependencies (DB, cache, DNS, internal lookups).

What breaks:

Authentication does not fail consistently
Throughput collapses due to growing queues
During peaks, NADs treat ISE as dead

2.2 ISE “Under Capacity” but >10s Auth Latency Due to DNS / Logging

Case (ISE 3.1 P1):

Authentication latency > 10 seconds during peak
PSN not at session limit
WLCs mark ISE as dead (RADIUS timeout)
Failover to secondary PSN

Root cause:

Remote Logging Target configured via FQDN
DNS resolution impacting internal queues
Switching the logging target from FQDN to an IPv4 address → issue resolved immediately

Architectural lesson: latency can originate from non-obvious components (DNS / logging) and surface as AAA pipeline queueing.
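
One way to check whether name resolution is a plausible contributor is to time it from a host that uses the same DNS resolvers as the PSN. The sketch below is only an external probe and says nothing about how ISE resolves names internally; the function name and the injectable `resolver` parameter are illustrative choices, not part of any ISE tooling.

```python
import socket
import time


def resolution_time_ms(hostname: str, resolver=socket.getaddrinfo) -> float:
    """Wall-clock time for one name resolution attempt, in milliseconds.

    Failures are timed as well: a lookup that hangs until it times out
    is exactly the case worth seeing.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except OSError:
        # socket.gaierror is a subclass of OSError; the duration still counts.
        pass
    return (time.perf_counter() - start) * 1000.0
```

Sampling this repeatedly during a peak and comparing against the Step latency values seen in Live Logs gives a first indication of whether DNS deserves a closer look.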

2.3 External Latency (AD / DC) → Extreme Step Latency

Another report shows:

Step latency > 60,000 ms
RPC communication errors with Domain Controllers
Failover threshold exceeded

Typical pattern of:

Unstable external dependency
DC switching / RPC failures
AAA SLA collapse
Policy fallback or rejection behavior

3. How This Becomes an “Auth Storm”

3.1 Amplifiers (Exponential Effects)

Aggressive RADIUS timeout on NADs (e.g. 1s)
Fast retransmissions
Boot storms / shift changes / power recovery
CoA or mass reauthentication
One slow PSN pushes its load onto the remaining PSNs → the slowdown spreads across the cluster

A recurring point in the community:

RADIUS timeout must cover RTT NAD↔ISE + RTT ISE↔Identity Store + policy evaluation + extra checks. Aggressive timeouts cause failures even without explicit ISE errors.
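
That rule can be written down as a simple budget. The following is a minimal sketch: the parameter names and the `safety_factor` headroom are assumptions for illustration, and the inputs should be measured worst-case values rather than averages.

```python
def required_timeout_ms(rtt_nad_psn_ms: float,
                        rtt_psn_store_ms: float,
                        policy_eval_ms: float,
                        extra_checks_ms: float = 0.0,
                        safety_factor: float = 1.5) -> float:
    """Smallest NAD RADIUS timeout that covers the full auth path.

    Sums the components named in the rule above and multiplies by a
    headroom factor (an assumption, tune to your environment).
    """
    budget_ms = (rtt_nad_psn_ms + rtt_psn_store_ms
                 + policy_eval_ms + extra_checks_ms)
    return budget_ms * safety_factor


def timeout_is_safe(configured_timeout_ms: float, **path_latency) -> bool:
    """True if the configured NAD timeout covers the computed budget."""
    return configured_timeout_ms >= required_timeout_ms(**path_latency)
```

For example, a path with 20 ms NAD↔PSN RTT, 40 ms PSN↔store RTT, 300 ms policy evaluation and 140 ms of extra checks needs 750 ms at the default headroom, so a 1 s timeout passes but leaves little margin for a slow peak.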

4. Practical Diagnosis (TAC-Style)

4.1 Step Latency as the Key Divider

Slow PIP / identity store step → external latency
Slow policy evaluation step, no external calls → internal contention / queueing

BRKSEC-3412 demonstrates this method explicitly.
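
The same split can be applied mechanically to exported step lines. This is a rough sketch, assuming lines follow the shape of the example quoted earlier, "Evaluating Policy Group (Step latency=3961ms)". The keyword set and the 1000 ms threshold are hypothetical and would need adjusting to the step names that actually appear in your logs.

```python
import re
from typing import Optional

# Matches "<step name> (Step latency=<n>ms)" as in the example above.
STEP_LATENCY_RE = re.compile(
    r"^(?P<step>.+?)\s*\(Step latency=(?P<ms>\d+)\s*ms\)")

# Hypothetical tokens suggesting the step calls an external system.
EXTERNAL_TOKENS = {"ad", "directory", "ldap", "ocsp", "crl", "rest", "mfa"}


def classify_step(line: str, threshold_ms: int = 1000) -> Optional[str]:
    """Return 'ok', 'external' or 'internal'; None if the line does not match."""
    match = STEP_LATENCY_RE.match(line.strip())
    if match is None:
        return None
    if int(match.group("ms")) < threshold_ms:
        return "ok"
    # Compare whole words so that e.g. "restoring" does not match "rest".
    tokens = set(re.findall(r"[a-z0-9]+", match.group("step").lower()))
    return "external" if tokens & EXTERNAL_TOKENS else "internal"
```

Applied to the example line with 3961 ms, this returns "internal", which matches the reading given in section 2.1.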

4.2 Correlate With “What Changed” Outside AAA

Classic examples:

DNS changes
Syslog targets
Certificate validation
CRL / OCSP
REST integrations

Not everything that breaks authentication lives in AD or RADIUS.

5. Architectural Mitigations (Beyond Tuning)

5.1 Principle 1 — Reduce Synchronous Dependencies

Avoid policies that query multiple external systems on every auth
Treat enrichment as post-auth whenever possible

5.2 Principle 2 — Size for Real Latency, Not Lab Numbers

Cisco scale tests assume:

Inter-node latency < 5 ms

If your environment exceeds this:

Published scale numbers do not directly apply

Official performance guidance warns:

ISE ↔ AD / LDAP latency should be ~5 ms or less
Small increases can drastically reduce TPS
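
Why a few extra milliseconds cost so much throughput follows from a simple occupancy model. The sketch below assumes a fixed pool of worker threads that each block synchronously on the identity store for the whole lookup; the thread count and per-auth CPU time are hypothetical numbers, not ISE internals.

```python
def max_tps(worker_threads: int,
            cpu_ms_per_auth: float,
            store_rtt_ms: float,
            store_round_trips: int = 1) -> float:
    """Upper bound on authentications per second for a blocking worker pool.

    Each authentication holds one thread for its CPU time plus every
    round trip to the identity store, so throughput is
    threads / time-per-auth.
    """
    service_ms = cpu_ms_per_auth + store_round_trips * store_rtt_ms
    return worker_threads * 1000.0 / service_ms
```

With 100 threads and 5 ms of CPU per authentication, a 5 ms store RTT gives a ceiling of 10,000 TPS, while a 45 ms RTT drops the ceiling to 2,000 TPS without any increase in CPU load.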

5.3 Principle 3 — Storm Control (Logical, Not Network)

Set NAD timeouts to match real worst-case latency
Use backoff / deadtime carefully
Avoid synchronized mass reauthentication
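
One common way to avoid synchronized reauthentication is to add random jitter to the reauthentication timer, so that sessions that started together do not expire together. The following is a minimal sketch of the idea only; how a per-session timer is delivered to the NAD is out of scope here, and the function name and the 10% default are assumptions.

```python
import random
from typing import Optional


def jittered_reauth_timer(base_s: int,
                          jitter_fraction: float = 0.1,
                          rng: Optional[random.Random] = None) -> int:
    """Reauthentication timer spread uniformly around base_s.

    With base_s=3600 and jitter_fraction=0.1, results fall between
    3240 and 3960 seconds.
    """
    rng = rng or random.Random()
    spread = base_s * jitter_fraction
    return round(base_s + rng.uniform(-spread, spread))
```

After a boot storm, sessions that all authenticated within the same minute would then reauthenticate across a 12-minute window instead of all at once.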

5.4 Principle 4 — Session Persistence (Load Balancers)

If a RADIUS load balancer is used:

Session stickiness is mandatory
Auth / reauth / accounting must hit the same PSN
Required for: caching, EAP Session Resume, Fast Reconnect

Without it → intermittent failures and “ghost issues”.
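
The persistence requirement amounts to a stable mapping from endpoint to PSN. The sketch below illustrates it by hashing the Calling-Station-ID. It is an illustration of the principle, not how any particular load balancer implements persistence: real products typically keep a persistence table, and plain modulo hashing reshuffles endpoints whenever the PSN pool changes. The function names are hypothetical.

```python
import hashlib
from typing import Sequence


def normalize_mac(calling_station_id: str) -> str:
    """Reduce AA-BB-CC-DD-EE-FF, aa:bb:cc:dd:ee:ff or aabb.ccdd.eeff to hex digits."""
    return "".join(ch for ch in calling_station_id.lower()
                   if ch in "0123456789abcdef")


def pick_psn(calling_station_id: str, psns: Sequence[str]) -> str:
    """Deterministically map an endpoint to one PSN from the pool."""
    if not psns:
        raise ValueError("PSN pool is empty")
    digest = hashlib.sha256(
        normalize_mac(calling_station_id).encode("ascii")).digest()
    return psns[int.from_bytes(digest[:8], "big") % len(psns)]
```

Normalizing the MAC format first matters because authentication and accounting packets for the same endpoint may carry the Calling-Station-ID in different notations, which would otherwise send them to different PSNs.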

Failure Mode Checklist

Is there any external dependency in the critical auth path (AD / MFA / OCSP / REST)?
Is DNS or logging via FQDN blocking processing?
Does NAD timeout cover RTT + real policy evaluation time?
Is there retry storm potential (boot storms / WLC behavior)?
Does the load balancer enforce RADIUS session persistence?
