DNS, Site Classification, and Why Hybrid Architectures Break Authentication
1. The Real Problem (Beyond “ISE is Slow”)
LDAP and Kerberos rarely fail because of bandwidth exhaustion.
They fail because of latency, incorrect DNS resolution, and wrong server selection.
In hybrid environments, authentication often degrades even when:
All Domain Controllers are online
Network links are “healthy”
No service is technically down
The root cause is usually identity traffic crossing regions unnecessarily, driven by DNS behavior.
Kerberos fails architecturally before it fails technically.
2. Why Kerberos Is Highly Latency-Sensitive
Kerberos (RFC 4120) depends on:
Multiple request/response exchanges
Strict time validity windows
Correct KDC (Domain Controller) selection
DNS-based service discovery (SRV records)
Unlike stateless protocols, Kerberos is stateful and time-bound.
As latency increases:
Ticket requests take longer
Retries become more frequent
The effective margin inside the clock-skew tolerance shrinks
Failures become intermittent and non-deterministic
A Kerberos environment may appear stable in low-latency labs but collapse under WAN or inter-region RTT.
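To make the effect of RTT concrete, here is a minimal sketch of the arithmetic. The number of sequential round trips and the per-exchange server time are illustrative assumptions, not measured values; a real authentication includes at least a Kerberos AS exchange and a TGS exchange, plus DNS and LDAP lookups.

```python
def sequential_latency_ms(rtt_ms: float, round_trips: int, server_ms: float = 2.0) -> float:
    """Lower bound on elapsed time for exchanges that must run one after another."""
    return round_trips * (rtt_ms + server_ms)


# Assumed for illustration: 6 sequential round trips
# (DNS SRV lookup, AS exchange, TGS exchange, LDAP bind, two LDAP searches).
ROUND_TRIPS = 6

local = sequential_latency_ms(rtt_ms=2, round_trips=ROUND_TRIPS)
remote = sequential_latency_ms(rtt_ms=200, round_trips=ROUND_TRIPS)
print(f"local: {local:.0f} ms, cross-region: {remote:.0f} ms")
```

Because the exchanges are sequential, the RTT is paid once per exchange, so the total grows linearly with distance even when every server responds instantly. Retries are not modeled here and would only add to the cross-region figure.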
3. DNS Is the Control Plane of Kerberos
Kerberos does not “find” Domain Controllers dynamically.
It relies entirely on DNS SRV records, including:
_ldap._tcp.domain
_kerberos._tcp.domain
If DNS returns a valid but distant Domain Controller, Kerberos will still use it.
DNS correctness does not imply DNS optimality.
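The point that a valid answer is not necessarily a nearby one can be seen in how SRV answers are consumed. The sketch below is a simplified version of the selection described in RFC 2782 (lowest priority first, then a weighted random choice). The record values and host names are hypothetical, and real clients, including ISE, may apply additional logic that is not modeled here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def pick_srv(records, rng=random):
    """Pick one target: lowest priority wins, ties are broken by weight.

    Distance and RTT are never inputs to this decision.
    """
    lowest = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == lowest]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical answer set: every record is valid, only one is close to a US client.
answers = [
    SrvRecord(0, 100, 88, "dc-sao-01.example.com"),
    SrvRecord(0, 100, 88, "dc-sao-02.example.com"),
    SrvRecord(0, 100, 88, "dc-us-01.example.com"),
]
print(pick_srv(answers).target)
```

With equal priorities and weights, a client in the US has roughly a two-in-three chance of landing on a Brazilian DC in this example. Nothing in the answer tells it which target is local.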
4. Baseline Scenario – Simple, Healthy Environment
4.1 Architecture
DNS resolves only local DCs
Kerberos tickets are issued locally
LDAP group lookups are fast
RADIUS timers are respected
Authentication and posture are stable
4.3 Why This Works
No cross-region identity traffic
DNS answers are implicitly correct
Kerberos assumptions hold true
This environment often masks poor architectural decisions.
5. Hybrid Scenario – Where Things Break
5.1 Architecture
Primary on-prem site in Brazil
Domain Controllers distributed across regions
Cisco ISE deployed in cloud (US or Europe)
DNS resolvers located on-prem
6. Bad Process (Very Common in Real Deployments)
6.1 Step-by-step failure chain
ISE (US cloud) queries DNS in Brazil
DNS returns SRV records for Brazilian DCs
ISE selects a Brazilian DC (valid but distant)
Kerberos exchanges cross continents
RTT increases (150–300+ ms)
Kerberos ticket exchanges slow down or are retried
LDAP group queries are delayed
RADIUS responses are delayed
NAD retries authentication
Queues grow, sessions churn
Operators conclude: “ISE is slow”
Everything is functionally correct, yet architecturally broken.
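The "queues grow" step in the chain above follows from Little's law: the average number of authentications in progress equals the arrival rate multiplied by the time each one takes. The sketch below uses assumed rates and latencies purely for illustration.

```python
def in_flight_sessions(auth_per_second: float, latency_seconds: float) -> float:
    """Little's law: average requests in progress = arrival rate x time in system."""
    return auth_per_second * latency_seconds


RATE = 200  # assumed authentications per second

local = in_flight_sessions(RATE, 0.05)  # assumed 50 ms per authentication with a local DC
remote = in_flight_sessions(RATE, 1.5)  # assumed 1.5 s per authentication across regions
print(f"in flight, local DC: {local:.0f}; cross-region DC: {remote:.0f}")
```

The load offered by the network is unchanged in both cases. Only the per-authentication latency differs, yet the number of concurrent sessions the node must hold grows in proportion, before any NAD retry adds further requests.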
7. Why Kerberos Breaks First
Kerberos assumes:
Low RTT between client and KDC
Predictable response times
In cross-region designs:
Ticket exchanges exceed timing expectations
Authentication becomes unstable under load
Kerberos does not degrade gracefully with distance.
8. The Wrong Fix: Increasing Timeouts
8.1 Common reaction
Increase Kerberos retry thresholds
Increases session duration
Expands security exposure windows
Timeout tuning is not an architectural fix.
9. The Right Fix: DNS and Site Classification
9.1 Core principle
Identity resolution must be local to the consumer.
This requires DNS, AD Sites & Services, and resolver placement to work together.
10. Bad DNS Design Pattern
10.1 Characteristics
Cross-region Kerberos traffic
High RTT baked into auth flow
Authentication stability tied to WAN quality
11. Good DNS Design – Site-Aware Resolution
11.1 Key design principles
DNS resolution must be local
Cloud ISE → cloud-local DNS
On-prem ISE → on-prem DNS
AD Sites & Services must reflect reality
11.2 Required elements
Cloud subnets mapped to logical sites
Site link costs aligned with latency
Site-specific SRV records:
_ldap._tcp.<site>._sites.domain
_kerberos._tcp.<site>._sites.domain
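The elements above can be sketched as a lookup: map the consumer's IP address to a site, then build the site-specific SRV names in the form shown. The subnet table, the site names other than US-CLOUD, and example.com are hypothetical. Choosing the most specific matching subnet is intended to mirror how AD subnet-to-site matching behaves, but treat that as an assumption to verify in your environment.

```python
import ipaddress

# Hypothetical subnet-to-site table; prefixes and most site names are invented.
SUBNET_TO_SITE = {
    "10.10.0.0/16": "BR-ONPREM",
    "10.20.0.0/16": "US-CLOUD",
    "10.20.5.0/24": "US-CLOUD-DMZ",
}


def site_for(ip: str, table=SUBNET_TO_SITE):
    """Return the site of the most specific subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [
        (ipaddress.ip_network(cidr), site)
        for cidr, site in table.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # unmapped subnet: the client has no site to ask for
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def site_srv_names(site: str, domain: str):
    """Build the site-specific SRV names in the form used in this section."""
    return [
        f"_ldap._tcp.{site}._sites.{domain}",
        f"_kerberos._tcp.{site}._sites.{domain}",
    ]


site = site_for("10.20.1.15")
print(site, site_srv_names(site, "example.com"))
```

The None branch is the case to watch for: an address outside every mapped subnet has no site, so it cannot ask for site-specific records at all.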
12. Practical Classification Example
12.1 Regional model
12.2 Correct DNS resolution flow
ISE in US-CLOUD resolves:
_ldap._tcp.US-CLOUD._sites.domain
_kerberos._tcp.US-CLOUD._sites.domain
DNS returns only US-based DCs.
RADIUS fits within timer budget
Authentication and posture stabilize
13. DNS Resolver Placement (Critical Detail)
13.1 Bad practice
Adds latency before authentication even starts.
13.2 Good practice
DNS answers are already locality-aware. DNS query time is part of authentication latency.
14. Visual Comparison (Mermaid)
14.1 Bad flow – Cross-region identity
14.2 Good flow – Localized identity
15. Impact on Posture
Posture depends on:
Timely authorization updates
Predictable session lifecycle
When Kerberos or LDAP are slow:
Posture results arrive late
Sessions may reauthenticate
Cached posture states persist longer
Compliance assurance weakens
Identity latency becomes security debt.
16. Design Checklist (Actionable)
Map all subnets correctly in AD Sites & Services
Create logical sites for cloud regions
Ensure ISE uses local DNS resolvers
Validate site-specific SRV resolution
Measure ISE ↔ DC RTT (not just ISE ↔ NAD)
Avoid global DNS answers for identity services
Never rely on timeout increases as a fix
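For the RTT item in the checklist, one rough way to measure is to time a TCP handshake to the DC's Kerberos port (88) or LDAP port (389), which takes approximately one round trip. This is a minimal sketch and not an ISE feature; it assumes you can run Python on a host in the same subnet as the ISE node, and the function names are illustrative.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time one TCP handshake to host:port in milliseconds (roughly one RTT)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed immediately; only the handshake is timed
    return (time.perf_counter() - start) * 1000.0


def median_connect_ms(host: str, port: int, samples: int = 5) -> float:
    """Median of several handshakes, to smooth out one-off spikes."""
    return statistics.median(tcp_connect_ms(host, port) for _ in range(samples))
```

Calling median_connect_ms with each DC's FQDN and port 88 gives a comparable number per DC. The figure includes DNS resolution of the host name on each sample, so pass an IP address if you want the handshake time alone. Unreachable hosts raise OSError, which the caller should handle.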
Kerberos failures in hybrid environments are almost always DNS and site-design failures.
Cisco ISE exposes these issues because it sits at the intersection of:
Fix identity locality, and latency stops being a mystery.