DNS-Based Service Discovery: Kubernetes, Consul, and etcd
Learn how DNS-based service discovery works in microservices platforms like Kubernetes, Consul, and etcd, including DNS naming conventions and SRV records.
Service discovery sits at the heart of any distributed system. Before a client can communicate with a service, it needs to find where that service lives on the network. DNS, the same protocol that translates domain names to IP addresses, has been stretched and adapted to solve this problem in modern microservices platforms.
This post covers how DNS-based service discovery works, the trade-offs involved, and how platforms like Kubernetes, Consul, and etcd each approach it.
How DNS Has Been Adapted for Service Discovery
Traditional DNS was designed for relatively static infrastructure. A server might change IP addresses once every few months, so TTLs (Time To Live) of hours or even days made sense. Microservices change constantly—pods get created and destroyed, containers scale up and down, services move between nodes.
DNS-based service discovery adapts the protocol in several ways:
Short TTLs: Service records expire quickly, often within 30 seconds or less. Clients pick up changes rapidly, at the cost of more frequent queries against the DNS infrastructure.
Dynamic Registration: Services register themselves (or are registered by an agent) as they come online. When a service instance fails or is replaced, its DNS record is removed automatically.
SRV Records: Standard DNS A records map a name to an IP address. But services run on different ports. SRV records store both the target host and the port number, allowing complete endpoint information in DNS.
Multi-value Responses: A single DNS query can return multiple IP addresses. Load balancing becomes a matter of rotating through these values.
graph TD
Client[Client Application] -->|Queries| DNS[DNS Server]
DNS -->|Returns A/SRV records| Client
subgraph "Service Instances"
S1[Service-A:8080]
S2[Service-A:8080]
S3[Service-B:3000]
end
Registry[Service Registry] -->|Pushes updates| DNS
S1 -->|Registers| Registry
S2 -->|Registers| Registry
S3 -->|Registers| Registry
S1 -.->|Health check fails| Registry
Registry -.->|Removes record| DNS
The diagram shows the basic pattern. Services register with a central registry. The registry pushes updates to DNS. Clients query DNS to discover endpoints. When health checks fail, records disappear.
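The register, fail, and resolve cycle in the diagram can be sketched with a toy in-memory registry. This is only an illustration of the pattern, not how any particular platform implements it; the `Registry` type, its methods, and the addresses are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Registry is a toy in-memory stand-in for a service registry.
// It maps a service name to the set of healthy "ip:port" endpoints.
type Registry struct {
	mu        sync.RWMutex
	endpoints map[string]map[string]struct{}
}

func NewRegistry() *Registry {
	return &Registry{endpoints: make(map[string]map[string]struct{})}
}

// Register adds an endpoint, as an instance would on startup.
func (r *Registry) Register(service, endpoint string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.endpoints[service] == nil {
		r.endpoints[service] = make(map[string]struct{})
	}
	r.endpoints[service][endpoint] = struct{}{}
}

// Deregister removes an endpoint, as a failed health check would.
func (r *Registry) Deregister(service, endpoint string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.endpoints[service], endpoint)
}

// Resolve returns the current endpoints in sorted order, playing the
// role of the multi-value DNS answer a client would receive.
func (r *Registry) Resolve(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.endpoints[service]))
	for ep := range r.endpoints[service] {
		out = append(out, ep)
	}
	sort.Strings(out)
	return out
}

func main() {
	reg := NewRegistry()
	reg.Register("service-a", "10.0.1.10:8080")
	reg.Register("service-a", "10.0.1.11:8080")
	fmt.Println(reg.Resolve("service-a"))

	reg.Deregister("service-a", "10.0.1.10:8080") // health check failed
	fmt.Println(reg.Resolve("service-a"))
}
```

A real registry would also attach a TTL or lease to each registration so that instances that crash without deregistering eventually disappear.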
Kubernetes DNS
Kubernetes operates its own internal DNS service for pod and service discovery. Understanding how this works helps you design better service communication patterns.
kube-dns and CoreDNS
Early Kubernetes versions shipped with kube-dns, which bundled SkyDNS. Modern clusters run CoreDNS instead, a modular DNS server written in Go that reached general availability in Kubernetes 1.11 and has been the default cluster DNS since 1.13.
CoreDNS runs as a deployment in kube-system, usually with a couple of replicas for HA. It watches the Kubernetes API for service and endpoint changes, rebuilding its zone data on every meaningful change.
The CoreDNS configuration lives in a ConfigMap fittingly named coredns. The default setup looks like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        # "forward" replaces the "proxy" plugin, which was removed in CoreDNS 1.5.0
        forward . /etc/resolv.conf
        cache 30
    }
Service DNS Naming Conventions
Kubernetes services get DNS names that follow a predictable pattern:
<service-name>.<namespace>.svc.<cluster-domain>
So a service named “api-gateway” in the “production” namespace becomes:
api-gateway.production.svc.cluster.local
Same namespace? You can usually just use the service name. Different namespace? You need at least <service-name>.<namespace>; the fully qualified name always works.
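As a small sketch of the naming rule, the helpers below build the fully qualified name and the shortest name that should resolve from a given namespace. The function names are hypothetical, and `ShortName` assumes the pod's resolv.conf carries the usual Kubernetes search domains.

```go
package main

import "fmt"

// ServiceFQDN builds the cluster-internal DNS name for a Service.
// clusterDomain is typically "cluster.local" but can be customized.
func ServiceFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

// ShortName returns the shortest name expected to resolve from a pod in
// callerNamespace, relying on the search domains in the pod's resolv.conf.
func ShortName(service, namespace, callerNamespace string) string {
	if namespace == callerNamespace {
		return service
	}
	return service + "." + namespace
}

func main() {
	fmt.Println(ServiceFQDN("api-gateway", "production", "cluster.local"))
	fmt.Println(ShortName("api-gateway", "production", "production"))
	fmt.Println(ShortName("api-gateway", "production", "staging"))
}
```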
Headless services behave differently. When you set clusterIP: None, CoreDNS skips the VIP entirely and returns the IPs of backing pods directly:
apiVersion: v1
kind: Service
metadata:
  name: stateful-service
spec:
  clusterIP: None # This makes it headless
  selector:
    app: stateful-app
  ports:
    - port: 8080
      targetPort: http
With a headless service, DNS returns individual pod IPs. Your application handles load balancing—which is exactly what you want for stateful services where clients need to reach specific pods directly.
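A minimal sketch of what "your application handles load balancing" can mean: resolve the headless service once, then rotate through the returned pod IPs. The `RoundRobin` type is hypothetical, the service name assumes the manifest above lives in the `default` namespace, and the lookup only succeeds inside a cluster.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// RoundRobin cycles through a fixed list of addresses.
type RoundRobin struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

func NewRoundRobin(addrs []string) *RoundRobin {
	return &RoundRobin{addrs: addrs}
}

// Next returns the next address, or "" when the list is empty.
func (r *RoundRobin) Next() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.addrs) == 0 {
		return ""
	}
	addr := r.addrs[r.next%len(r.addrs)]
	r.next++
	return addr
}

func main() {
	// Assumed name for the headless service; resolves only inside a cluster.
	addrs, err := net.LookupHost("stateful-service.default.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	rr := NewRoundRobin(addrs)
	for i := 0; i < 3; i++ {
		fmt.Println(net.JoinHostPort(rr.Next(), "8080"))
	}
}
```

The address list here is fixed at construction time; a longer-lived client would need to re-resolve periodically so that replaced pods drop out of the rotation.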
Consul DNS Interface
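HashiCorp Consul takes a more traditional approach to service discovery. It runs a distributed cluster with an agent on every node. Services register with their local agent, which syncs registrations to the Consul servers; the servers replicate the catalog with Raft, while a gossip protocol handles cluster membership and failure detection.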
The Consul DNS interface exposes everything through standard DNS queries. No API endpoint to query—just familiar DNS tools:
# Query for web service instances
dig @127.0.0.1 -p 8600 web.service.consul SRV
# Get just the IP addresses
dig @127.0.0.1 -p 8600 web.service.consul
Consul uses the .consul domain by default. Queries for web.service.consul return A records with the IP addresses of all healthy service instances.
DNS SRV Records for Port Discovery
SRV records become essential when services run on non-standard ports. Imagine a service catalog where different teams run their own instances on arbitrary ports. Clients do not hardcode port numbers; they discover them through DNS.
A Consul SRV response looks something like this:
;; ANSWER SECTION:
api.service.consul.     0  IN  SRV  1 1 8080 node1.node.dc1.consul.
api.service.consul.     0  IN  SRV  1 1 8081 node2.node.dc1.consul.
;; ADDITIONAL SECTION:
node1.node.dc1.consul.  0  IN  A    10.0.1.10
node2.node.dc1.consul.  0  IN  A    10.0.1.11
The SRV records tell you that two instances exist, on ports 8080 and 8081 respectively. Each target is a node name (dc1 is Consul's default datacenter name), and the additional section supplies the IP address for each node.
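From application code, the same lookup can be done with Go's standard resolver. This is a sketch under assumptions: a Consul agent answering DNS on 127.0.0.1:8600 and a registered service named `api`. The helper names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strconv"
	"strings"
	"time"
)

// formatEndpoint turns an SRV target and port into "host:port",
// dropping the trailing dot of the fully qualified target.
func formatEndpoint(target string, port uint16) string {
	host := strings.TrimSuffix(target, ".")
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}

// srvEndpoints converts SRV records into "host:port" strings.
func srvEndpoints(records []*net.SRV) []string {
	out := make([]string, 0, len(records))
	for _, r := range records {
		out = append(out, formatEndpoint(r.Target, r.Port))
	}
	return out
}

// consulResolver sends every DNS query to addr instead of the
// servers listed in the system resolver configuration.
func consulResolver(addr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, network, addr)
		},
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Empty service and proto make LookupSRV query the name as given.
	// The address assumes a local agent on Consul's default DNS port.
	resolver := consulResolver("127.0.0.1:8600")
	_, records, err := resolver.LookupSRV(ctx, "", "", "api.service.consul")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(srvEndpoints(records))
}
```

The targets returned are node hostnames, so the client still needs the same resolver to turn them into IP addresses before dialing.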
Prepared Queries
Consul supports prepared queries—saved query templates on the server side. These enable advanced patterns like geo-based routing:
{
  "Name": "geo-routing",
  "Service": {
    "Service": "api-fleet",
    "ServiceMeta": {
      "version": "v2"
    }
  },
  "DNS": {
    "TTL": "10s"
  }
}
Clients then query geo-routing.query.consul and Consul returns instances based on the query definition.
etcd for Service Registration
etcd is the persistent store behind Kubernetes and many other distributed systems. It is not a DNS server, but it often sits underneath service registries that expose DNS interfaces.
Services store endpoint information in etcd’s hierarchical key-value space:
/services/api/10.0.1.10:8080
/services/api/10.0.1.11:8080
A separate component—etcd-watcher, a custom controller, whatever—watches these paths and updates DNS records when values change. Storage stays separate from DNS serving, which keeps things clean.
The advantage of etcd is its consistency story. As a Raft-based consensus system, it gives you strong consistency guarantees for registration data, and during a network partition only the side holding a majority of members keeps accepting writes.
Watch operations in etcd notify listeners of changes immediately:
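// cli is a *clientv3.Client (go.etcd.io/etcd/client/v3); WithPrefix is a package-level option.
watchCh := cli.Watch(ctx, "/services/", clientv3.WithPrefix())
for resp := range watchCh {
	for _, ev := range resp.Events {
		switch ev.Type {
		case clientv3.EventTypePut: // instance registered or updated
			log.Printf("register %s", ev.Kv.Key)
		case clientv3.EventTypeDelete: // instance removed or its lease expired
			log.Printf("deregister %s", ev.Kv.Key)
		}
	}
}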
This reactive model works well for keeping DNS records current.
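Whatever component consumes those watch events has to turn keys back into endpoints. The sketch below parses the key layout shown earlier; it assumes every key has exactly the form /services/<name>/<host>:<port>, and the `Endpoint` type and `parseServiceKey` function are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Endpoint is one registered service instance.
type Endpoint struct {
	Service string
	Host    string
	Port    string
}

// parseServiceKey splits a key such as "/services/api/10.0.1.10:8080"
// into its service name, host, and port.
func parseServiceKey(key string) (Endpoint, error) {
	parts := strings.Split(strings.TrimPrefix(key, "/services/"), "/")
	if !strings.HasPrefix(key, "/services/") || len(parts) != 2 {
		return Endpoint{}, fmt.Errorf("unexpected key layout: %q", key)
	}
	host, port, err := net.SplitHostPort(parts[1])
	if err != nil {
		return Endpoint{}, fmt.Errorf("bad endpoint in %q: %w", key, err)
	}
	return Endpoint{Service: parts[0], Host: host, Port: port}, nil
}

func main() {
	keys := []string{
		"/services/api/10.0.1.10:8080",
		"/services/api/10.0.1.11:8080",
	}
	for _, key := range keys {
		ep, err := parseServiceKey(key)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %s:%s\n", ep.Service, ep.Host, ep.Port)
	}
}
```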
DNS Caching Challenges and TTL Considerations
Caching happens at multiple layers in the DNS resolution path. Each layer brings its own headaches for service discovery.
Application-Level Caching: Applications cache DNS lookups to avoid repeated queries. If your cached entry lives for 5 minutes but the service moved 1 minute ago, you are sending traffic to a dead address.
Operating System Caching: Most operating systems cache DNS responses based on the TTL in the record. Kubernetes DNS records typically have TTLs around 30 seconds, which most systems respect.
Load Balancer / Proxy Caching: If your service sits behind a proxy or load balancer, that component may cache DNS independently. Your 30-second TTL means nothing if the proxy cached the entry for 5 minutes.
The result: a window where traffic flows to addresses that no longer exist. Mitigation strategies include:
- Readiness Probes: Kubernetes uses readiness probes to remove unhealthy pods from service endpoints immediately, regardless of DNS caching.
- Connection Draining: Allow existing connections to complete while routing new traffic only to healthy instances.
- Client-Side Re-resolution: Some clients re-resolve DNS periodically or on connection errors, rather than relying solely on cached entries.
- Very Short TTLs: Some deployments use TTLs under 10 seconds, accepting the increased query load in exchange for faster convergence.
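Client-side re-resolution can be sketched as a small cache that expires entries after a fixed TTL and can be invalidated early when a connection fails. The types and names below are hypothetical, the cache is not safe for concurrent use, and the resolver is injected so the behavior can be exercised without a network.

```go
package main

import (
	"fmt"
	"time"
)

// ResolveFunc looks up the addresses for a name, e.g. net.LookupHost.
type ResolveFunc func(name string) ([]string, error)

type entry struct {
	addrs   []string
	expires time.Time
}

// CachingResolver caches answers for at most ttl and can be told to
// drop an entry early when a connection to a cached address fails.
type CachingResolver struct {
	resolve ResolveFunc
	ttl     time.Duration
	now     func() time.Time // injectable clock
	entries map[string]entry
}

func NewCachingResolver(resolve ResolveFunc, ttl time.Duration) *CachingResolver {
	return &CachingResolver{
		resolve: resolve,
		ttl:     ttl,
		now:     time.Now,
		entries: map[string]entry{},
	}
}

// Lookup returns cached addresses while they are fresh and re-resolves otherwise.
func (c *CachingResolver) Lookup(name string) ([]string, error) {
	if e, ok := c.entries[name]; ok && c.now().Before(e.expires) {
		return e.addrs, nil
	}
	addrs, err := c.resolve(name)
	if err != nil {
		return nil, err
	}
	c.entries[name] = entry{addrs: addrs, expires: c.now().Add(c.ttl)}
	return addrs, nil
}

// Invalidate forces the next Lookup to re-resolve; call it on connection errors.
func (c *CachingResolver) Invalidate(name string) {
	delete(c.entries, name)
}

func main() {
	calls := 0
	fake := func(name string) ([]string, error) {
		calls++
		return []string{fmt.Sprintf("10.0.0.%d", calls)}, nil
	}
	r := NewCachingResolver(fake, 10*time.Second)
	first, _ := r.Lookup("api")
	second, _ := r.Lookup("api") // served from cache
	r.Invalidate("api")          // e.g. after a failed dial
	third, _ := r.Lookup("api")
	fmt.Println(first, second, third, "resolver calls:", calls)
}
```

Invalidating on error bounds the staleness window by the time it takes to notice one failed connection rather than by the cache TTL.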
Headless Services in Kubernetes
Headless services change DNS semantics in ways that matter. When you set clusterIP: None, CoreDNS returns pod IPs directly instead of a service VIP.
This shows up in a few scenarios:
StatefulSets: Database clusters like MongoDB or Cassandra need pod-to-pod communication where clients connect to specific instances. Headless services let DNS resolve to individual pod IPs.
Custom Load Balancing: Some applications implement their own load balancing. They need to see all available pods and make their own routing decisions.
Service Mesh: With a service mesh like Istio, sidecar proxies often handle load balancing. They may need direct pod IPs for proper traffic management.
The trade-off is that your application takes on complexity the service proxy would normally handle. Without a VIP, failed pods mean failed connections unless your client implements retry logic.
Link Local DNS vs Global DNS
Service discovery DNS typically stays local to your infrastructure. These addresses do not resolve on the public internet and do not need delegation to global DNS servers.
Link-local DNS (also called private DNS) operates within a bounded environment:
- Kubernetes cluster DNS lives in cluster.local (or a custom domain)
- Consul datacenter DNS lives in datacenter.consul
- VPC private DNS in AWS uses .compute.internal
These namespaces do not conflict with public DNS. You can have api.service.consul internally while someone else has api.com on the public internet.
When you need external access, you expose services through an ingress or gateway that bridges internal and external DNS. The external name points to a load balancer or reverse proxy that forwards traffic into your internal network.
Global DNS matters when you need geographic distribution of service discovery. A service registered in one datacenter should be discoverable from another. This requires federation or replication between sites. Consul's multi-datacenter support, for instance, joins datacenters over a WAN gossip pool and forwards a query such as web.service.dc2.consul to the servers in dc2, so each datacenter stays the source of truth for its own services.
Connecting the Patterns
DNS-based service discovery works well in many scenarios, but it has limits. It falls short when you need:
- Strong consistency: DNS caching means some clients may see stale data. For leader election or configuration changes, you need a more consistent store.
- Rich metadata: DNS records carry limited information. When you need health check details, latency metrics, or custom attributes, a service registry with an API works better.
- Fine-grained routing: DNS operates at host/port level. When you need header-based routing, traffic splitting, or canary deployments, your service mesh or API gateway provides more control.
Most production systems use DNS for basic discovery, the service mesh for traffic management, and a service registry API for operational tooling.
Topic Deep Dive: CoreDNS Configuration for Service Discovery
CoreDNS is the default DNS server for Kubernetes clusters. Understanding its configuration helps you debug discovery issues and implement custom behaviors.
CoreDNS Architecture
CoreDNS works as a plugin-based DNS server. Each plugin handles a specific function—kubernetes plugin for K8s discovery, forward plugin for upstream DNS, cache plugin for caching responses.
# Default Corefile for Kubernetes
. {
    errors                      # Log errors to stdout
    health                      # Health endpoint on :8080
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure           # Answer pod A-record queries without verifying the pod exists
        fallthrough in-addr.arpa ip6.arpa   # Pass unmatched reverse lookups to the next plugin
    }
    prometheus :9153            # Metrics endpoint
    forward . /etc/resolv.conf  # Forward external queries (replaces the removed proxy plugin)
    cache 30                    # Cache responses for up to 30 seconds
}
Custom DNS Records with CoreDNS
For custom service discovery needs, you can add additional DNS records:
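# Forward one zone to a specific upstream resolver
api.example.com {
    forward . 8.8.8.8
}
# Or serve a custom discovery zone from a zone file
service-discovery.local {
    file /var/lib/coredns/custom.db {
        reload 10s    # Check every 10s and reload the zone when its SOA serial changes
    }
}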
DNS TTL Considerations
CoreDNS cache TTL affects how quickly clients see updates:
| TTL Setting | Use Case | Trade-off |
|---|---|---|
| 30 seconds (default) | Most Kubernetes workloads | Balance between freshness and query load |
| 10 seconds or less | Services that scale frequently | Higher DNS query load |
| 5 minutes or more | Stable services, reduced query load | Slower convergence on changes |
For headless services with frequent pod changes, lower TTLs ensure faster convergence. For stable services, higher TTLs reduce infrastructure load.
Real-world Failure Scenarios
| Scenario | What Happens | Root Cause | Mitigation |
|---|---|---|---|
| DNS cache staleness | Traffic routes to terminated pod | Cached DNS entry not yet expired | Use readiness probes to remove from endpoints; set short TTLs |
| CoreDNS OOM kill | New pods cannot discover existing services | CoreDNS memory limits too low | Increase CoreDNS resource limits |
| High ndots value (Kubernetes default: 5) | Extra DNS queries for every external name | Names with fewer dots than ndots are tried against each search domain first | Lower ndots through dnsConfig in the pod spec, or use FQDNs with a trailing dot |
| Headless service with no endpoints | DNS returns an empty answer | Pods not yet scheduled or ready, or selector matches nothing | Check pod readiness; verify the selector |
| Cross-namespace lookup | Query fails | Namespace missing or wrong in the DNS name | Use the fully qualified name: service.namespace.svc.cluster.local |
| CoreDNS crash loop | All service discovery fails cluster-wide | ConfigMap error or plugin failure | Validate Corefile syntax; check CoreDNS logs |
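The ndots row is easier to see with the query expansion written out. The sketch below lists the names a stub resolver would try, in order, following resolv.conf semantics; the function name is hypothetical and the search list is the one a pod in the `default` namespace would typically have.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames lists the names a stub resolver tries, in order, for a
// given query name, search list, and ndots value (resolv.conf semantics).
func candidateNames(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // fully qualified: the search list is skipped
	}
	var searched []string
	for _, domain := range search {
		searched = append(searched, name+"."+domain+".")
	}
	asIs := name + "."
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as given first, then the search list.
		return append([]string{asIs}, searched...)
	}
	// Too few dots: every search domain is tried before the name as given.
	return append(searched, asIs)
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, n := range candidateNames("api.example.com", search, 5) {
		fmt.Println(n)
	}
}
```

With ndots set to 5, the external name api.example.com is only queried as given after three cluster-suffixed attempts have failed, which is where the extra load comes from.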
Trade-off Comparison: DNS-Based Discovery Platforms
| Feature | Kubernetes CoreDNS | Consul DNS | etcd + DNS Bridge | AWS Route 53 |
|---|---|---|---|---|
| Integration | Native K8s | Multi-platform | K8s-native | Cloud-native |
| SRV Record Support | Yes | Yes | Via controller | Yes |
| Health-aware routing | Via readiness probes | Built-in | Custom implementation | Via health checks |
| Multi-datacenter | No (single cluster) | Yes, via gossip | Via federation | Yes, via latency routing |
| TTL flexibility | Yes | Yes | Configurable | Per-record |
| Headless services | Native | Via agent | Custom | Via alias records |
| Operational complexity | Low | Medium | Medium | Low |
Quick Recap Checklist
- DNS-based service discovery adapts traditional DNS with short TTLs for rapid service instance changes
- Kubernetes CoreDNS integrates with the Kubernetes API to automatically create DNS records for services and headless services
- Consul provides a DNS interface with SRV records for port discovery and multi-datacenter support
- DNS caching at multiple layers creates staleness windows; use readiness probes and short TTLs to mitigate
- Headless services return individual pod IPs for custom client-side load balancing
- For advanced routing (canary deployments, traffic splitting), combine DNS discovery with service mesh or API gateway
Interview Questions
How does CoreDNS differ from a traditional DNS server such as BIND?
Traditional DNS servers like BIND serve static zone data that changes slowly. CoreDNS is designed for dynamic environments where records change constantly as pods scale up and down. CoreDNS watches the Kubernetes API and updates DNS records automatically as services and endpoints change.
CoreDNS uses a plugin architecture where each plugin handles specific DNS functionality. The kubernetes plugin translates K8s services and endpoints into DNS records. When you create or delete a service, CoreDNS updates its zone data without manual intervention.
What are SRV records, and why do they matter for service discovery?
SRV records map a service name to a hostname and port. Unlike A records that only map names to IP addresses, SRV records include the port number, allowing complete endpoint discovery from DNS alone.
The format: _service._proto.name. TTL IN SRV priority weight port target
SRV records let clients discover both the target host and the port number of a service without hardcoding port numbers or making separate discovery calls. The target's IP address comes from its A record, which servers often include in the additional section.
What is the difference between a ClusterIP service and a headless service?
A ClusterIP service creates a virtual IP (VIP) that load balances to backing pods. DNS for a ClusterIP service returns the VIP address.
A headless service (clusterIP: None) skips the VIP. DNS returns the individual pod IP addresses directly, allowing clients to do their own load balancing or connect to specific pods directly.
Headless services suit stateful applications where clients need connections to specific instances, such as MongoDB replicas or Cassandra nodes, and applications that do their own load balancing.
How does DNS caching affect service discovery, and how do you mitigate it?
DNS resolution caches at multiple layers: application-level, OS-level, and potentially load balancer or proxy caches. Each layer has its own TTL, creating staleness possibilities.
If Kubernetes DNS returns a record with 30-second TTL but the OS caches it for longer, or your application caches for 5 minutes, updates take time to propagate.
Mitigation: use short TTLs, configure OS-level cache appropriately, implement connection-level health checking that verifies actual connectivity.
How does Consul's DNS interface work, and how does it compare with its HTTP API?
Consul exposes service discovery through standard DNS queries on port 8600. You query for service names with the Consul domain suffix, and Consul returns A records for service IPs and SRV records for target hosts plus ports.
Advantages: DNS is universally supported, no client library needed, works at the network level.
Limitations: DNS has limited record types for rich metadata. HTTP APIs can return health status and custom attributes.
What is the ndots setting, and how does it affect DNS performance in Kubernetes?
The ndots setting is a threshold: a name with fewer dots than ndots is tried against each search domain before it is queried as given. Kubernetes sets ndots to 5 by default, so a lookup such as "api.example.com" tries every cluster search domain before the name itself.
This means a single lookup can generate multiple DNS queries, adding latency and load to CoreDNS.
Fix: set ndots to a lower value or use fully qualified domain names (FQDNs) in application code.
How would you handle service discovery across Kubernetes and VM-based infrastructure?
You need an external service registry like Consul that bridges both environments. Consul agents run on Kubernetes and on VM infrastructure, and they join the same cluster across the boundary.
Services inside K8s register with their local Consul agent. Services outside register with their local agent. All agents query the same service catalog, which the Consul servers maintain.
What role does etcd play in DNS-based service discovery?
etcd is a distributed key-value store used by Kubernetes to store cluster state. It backs DNS-based discovery indirectly: controllers watch the Kubernetes API, whose state lives in etcd, and update DNS records.
External-dns is one such controller. It watches Service and Ingress resources through the API server and updates DNS records in providers like Route 53.
How does Kubernetes DNS naming work across namespaces?
Kubernetes DNS uses a hierarchical structure: api-service.production.svc.cluster.local
Within the same namespace, you can use just the service name. From a different namespace, you need the name qualified with the namespace, for example api-service.production.
What are the limitations of DNS-based service discovery?
DNS records have limited metadata: only IPs, ports, and basic priority/weight. You cannot filter DNS queries based on health status, version, or custom attributes.
DNS updates have eventual consistency. For immediate consistency requirements, DNS is not suitable.
DNS does not support header-based routing, traffic splitting, or canary deployments. These require service meshes or API gateways.
What happens to DNS when a pod terminates?
When a pod terminates, the control plane removes it from the Service's endpoints. CoreDNS watches these changes and, for headless services, stops returning the pod IP in DNS answers. For ClusterIP services, DNS keeps returning the same VIP and kube-proxy stops routing to the pod.
The timing depends on both the endpoint propagation and the DNS TTL. With a 30-second TTL, a resolver cache may keep serving the stale IP for up to 30 seconds after removal.
Readiness probes help mitigate this: if a pod fails its readiness check, it is removed from endpoints immediately, before DNS TTL expiration.
How does Consul's gossip protocol work?
Consul uses the Serf library for its gossip protocol, which disseminates information across the cluster through a peer-to-peer model.
Each Consul agent maintains a member list and periodically exchanges messages with random nodes. Membership changes and node failures propagate across the cluster through this indirect diffusion.
The gossip protocol provides fault-tolerant membership and failure detection without a central coordinator. If one node fails, others continue sharing information. The service catalog itself is replicated among the Consul servers using Raft.
What is the difference between A records and SRV records?
A records map a hostname to an IPv4 address only. They do not carry port information.
SRV records provide more complete service endpoint information: they specify the target hostname, port number, priority, and weight for a service.
In Kubernetes, the A record for a ClusterIP service returns the cluster IP (VIP). SRV records exist for named ports: for a ClusterIP service they return the port and the service's own DNS name, and for a headless service they return one answer per backing pod, with the port and the pod's hostname.
What is ExternalDNS, and what does it do?
External-dns is a Kubernetes controller that synchronizes exposed Services and Ingresses with external DNS providers like Route 53, Cloudflare, or Google Cloud DNS.
It watches Kubernetes resources and creates DNS records in the cloud provider when services are created or modified.
This bridges internal Kubernetes DNS with external DNS, enabling external clients to discover services running inside the cluster.
How do you debug DNS issues in a Kubernetes cluster?
First, verify CoreDNS is running: kubectl get pods -n kube-system -l k8s-app=kube-dns
Test DNS resolution from within a pod: kubectl exec -it test-pod -- nslookup kubernetes.default
Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns
Verify the coredns ConfigMap is valid and check /etc/resolv.conf in pods for correct search domain configuration.
What are the security considerations for DNS-based service discovery?
DNS cache poisoning attacks can redirect service traffic to malicious endpoints. Use DNSSEC to validate DNS responses.
In multi-tenant clusters, cluster DNS answers queries from any pod, so service names and IPs in other namespaces are discoverable. Network policies, not DNS, are what restrict access.
Headless services expose individual pod IPs, which may be undesirable in security-sensitive environments. Use network policies to restrict pod-to-pod communication.
How does Consul behave during a network partition?
Consul uses the Raft consensus protocol for consistent state replication. During a network partition, servers on the minority side cannot elect a leader and stop processing writes.
Servers in the majority partition keep serving reads and writes. When Consul DNS is configured to allow stale reads (the allow_stale setting), agents that can only reach minority-side servers may still answer queries, but with potentially stale data.
Consul's built-in health checking continues evaluating node and service health, removing failed instances from DNS responses.
How do Endpoints relate to DNS records in Kubernetes?
The Endpoints object in Kubernetes (superseded by EndpointSlices in newer clusters) contains a list of IP addresses and ports for all pods that back a Service.
CoreDNS watches this API and creates DNS records corresponding to the current pod IPs in the endpoint list.
When a pod is added or removed (through readiness probes, termination, or scaling), the endpoints object updates, triggering CoreDNS to update DNS records accordingly.
How do service meshes interact with DNS-based discovery?
Service meshes like Istio implement their own service discovery on top of DNS. The application still resolves a service name, but the sidecar proxy intercepts the outbound traffic (and, optionally, the DNS queries themselves) and handles load balancing according to mesh policies.
Istio can route traffic based on headers, version labels, or circuit breakers, capabilities that DNS-based discovery does not provide.
The application sees a stable local endpoint through the sidecar, while the sidecar handles actual endpoint selection and health checking.
What are the trade-offs of very short TTLs?
Very short TTLs (under 10 seconds) mean clients re-resolve DNS more frequently, increasing query load on the DNS infrastructure.
This can strain CoreDNS or Consul agents, especially at scale with thousands of services and frequent churn.
The trade-off is faster convergence when services change. For stable services, longer TTLs (30-60 seconds) reduce infrastructure load with minimal impact on discovery accuracy.
Conclusion
DNS-based service discovery provides a simple, universal mechanism for service location in microservices architectures. By adapting traditional DNS with short TTLs and dynamic registration, services can discover each other without hardcoded addresses. Kubernetes CoreDNS and Consul are the dominant solutions, each with distinct approaches suited to different deployment models. While DNS discovery handles basic location well, advanced traffic management requires additional layers like service mesh or API gateways.