Network Security: VPC, Firewall Rules, and Service Mesh mTLS
Design network security for cloud-native applications using VPCs, network policies, and mutual TLS for service-to-service encryption.
Introduction
Network security is the discipline of controlling which systems can communicate with which other systems, and under what conditions. It spans from cloud VPC architecture down to individual Kubernetes pods. When done well, it prevents lateral movement during incidents, reduces blast radius from compromised workloads, and keeps sensitive data from leaking to unintended recipients. When done poorly, a single misconfigured security group rule or an overpermissive NACL can cascade into a full production outage or a data breach.
The failures network security prevents are rarely theoretical. A missing security group rule blocks legitimate traffic and causes cascading timeouts. An overly permissive ingress rule exposes internal services to the public internet. A Kubernetes cluster without NetworkPolicy means a single compromised pod can reach every other workload in the cluster. Certificates that are not renewed automatically cause HTTPS outages that affect users directly. Service mesh misconfigurations add latency that surfaces only under production load.
This guide covers the full stack of cloud-native network security: VPC design and CIDR allocation, stateful security groups versus stateless NACLs, Kubernetes NetworkPolicy enforcement with Calico and Cilium, mutual TLS via service mesh (Linkerd and Istio), automated certificate management with cert-manager, and zero-trust architecture principles. By the end, you will be able to design a network security posture that limits blast radius, enforces least-privilege connectivity, and survives common failure scenarios without manual intervention.
When to Use
Istio vs. Linkerd vs. No Service Mesh
Use a service mesh when you have multiple services that need mutual TLS without modifying application code. Linkerd is the better choice when you want simple, low-overhead mTLS with minimal configuration. Istio is better when you need fine-grained traffic control, advanced observability, or multi-cluster federation.
Do not use a service mesh if your services communicate over a dedicated network with no untrusted traffic, or if your team cannot afford the operational overhead. A service mesh adds complexity at every level: debugging, routing, and authentication all become mesh concerns.
cert-manager vs. Cloud-Native Certificate Management
Use cert-manager when you run Kubernetes and want a unified way to manage certificates across cloud and on-premises environments, or when you use Let’s Encrypt or other ACME providers.
Use cloud-native certificate management (AWS ACM, Azure Key Vault, GCP Certificate Manager) when you operate entirely within one cloud and your certificates are primarily for cloud-managed ingresses like ALB or Cloud CDN.
NACLs vs. Security Groups Alone
NACLs add value when you need subnet-wide rules that apply to all resources in a subnet, or when you want explicit deny rules at the network layer (for example, blocking known malicious IP ranges before traffic reaches any security group).
Most workloads do fine with security groups alone. NACLs are worth the additional configuration complexity when you have a specific compliance requirement for network-layer filtering or when multiple security groups need a common deny rule.
VPC Design and CIDR Allocation
A VPC (Virtual Private Cloud) is your network boundary in the cloud. Design it carefully because changing it later is painful.
Allocate CIDR blocks that give you room to grow. If you start with a /24, you will outgrow it. Use a /16 for production environments and segment with subnets.
# AWS VPC example
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}
# Availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}
# Private subnets across availability zones (10.0.0.0/24 to 10.0.2.0/24)
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
# Public subnets across availability zones (10.0.10.0/24 to 10.0.12.0/24)
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = false # instances do not receive public IPs automatically
}
Segment your VPC into at least three subnet types:
- Public subnets: Load balancers, NAT gateways. These have direct internet access.
- Private subnets: Application workloads. No direct internet access.
- Data subnets: Databases, caches. Most restricted, no internet access at all.
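The subnet arithmetic behind the `cidrsubnet` calls above can be checked with a few lines of Python. This is a minimal sketch, not Terraform's implementation: the `cidrsubnet` and `usable_hosts` functions are hypothetical helpers built on the standard `ipaddress` module, and the constant of five reserved addresses per subnet is an assumption based on AWS's documented behaviour.

```python
import ipaddress

# Assumption: AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Return subnet number `netnum` after extending `prefix` by `newbits` bits."""
    network = ipaddress.ip_network(prefix)
    new_prefix = network.prefixlen + newbits
    if new_prefix > network.max_prefixlen:
        raise ValueError("not enough address bits left")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    size = 2 ** (network.max_prefixlen - new_prefix)
    base = int(network.network_address) + netnum * size
    return f"{ipaddress.ip_address(base)}/{new_prefix}"


def usable_hosts(cidr: str) -> int:
    """Addresses left for workloads once the reserved ones are subtracted."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    for index in range(3):
        print("private", cidrsubnet("10.0.0.0/16", 8, index))
        print("public ", cidrsubnet("10.0.0.0/16", 8, index + 10))
    print("usable hosts in a /24:", usable_hosts("10.0.0.0/24"))
```

A /16 split into /24 subnets leaves 256 slots, so three private and three public subnets use only a small part of the range and leave room for data subnets and future growth.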
Security Groups and NACLs
Security groups are stateful firewalls attached to instances or ENIs (Elastic Network Interfaces). They are the primary tool for controlling traffic to your workloads.
# Security group for an application tier
resource "aws_security_group" "app" {
  name        = "app-tier"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.load_balancer.id]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Network ACLs (NACLs) are stateless and operate at the subnet level. Use them as a second layer of defense, for example, blocking all traffic to your data subnets except from specific app subnets.
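# NACL for data subnet: only allow PostgreSQL traffic from the app tier
resource "aws_network_acl" "data" {
  vpc_id = aws_vpc.main.id
  ingress {
    rule_no    = 100
    protocol   = "tcp"
    from_port  = 5432
    to_port    = 5432
    cidr_block = "10.0.1.0/24" # App tier subnet
    action     = "allow"
  }
  ingress {
    rule_no    = 200
    protocol   = "-1" # all protocols
    from_port  = 0
    to_port    = 0
    cidr_block = "0.0.0.0/0"
    action     = "deny"
  }
  # NACLs are stateless: replies to the app tier need their own egress rule
  egress {
    rule_no    = 100
    protocol   = "tcp"
    from_port  = 1024
    to_port    = 65535 # ephemeral ports used by clients
    cidr_block = "10.0.1.0/24"
    action     = "allow"
  }
}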
The key difference: security groups are stateful (return traffic is automatically allowed), NACLs are stateless (you must explicitly allow return traffic).
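The stateful versus stateless distinction is easier to see in a toy model. The sketch below is illustrative only and is not how AWS implements either control: `NaclRule`, `nacl_permits`, and `SecurityGroup` are hypothetical names, and the model tracks only ports, ignoring addresses and protocols.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NaclRule:
    number: int
    action: str  # "allow" or "deny"
    low_port: int
    high_port: int


def nacl_permits(rules, port):
    """Stateless check: the lowest-numbered matching rule wins, else implicit deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.low_port <= port <= rule.high_port:
            return rule.action == "allow"
    return False


@dataclass
class SecurityGroup:
    inbound_ports: set
    tracked: set = field(default_factory=set)

    def accept_inbound(self, client_port, server_port):
        """Stateful check: remember every connection that was let in."""
        if server_port not in self.inbound_ports:
            return False
        self.tracked.add((client_port, server_port))
        return True

    def permit_reply(self, client_port, server_port):
        """Replies to a tracked connection pass without any outbound rule."""
        return (client_port, server_port) in self.tracked


if __name__ == "__main__":
    nacl_in = [NaclRule(100, "allow", 5432, 5432)]
    nacl_out = []  # no egress rule yet
    print("NACL request allowed:", nacl_permits(nacl_in, 5432))
    print("NACL reply allowed:  ", nacl_permits(nacl_out, 50000))

    group = SecurityGroup(inbound_ports={5432})
    print("SG request allowed:  ", group.accept_inbound(50000, 5432))
    print("SG reply allowed:    ", group.permit_reply(50000, 5432))
```

In the model, the NACL drops the database's reply until an egress rule covering the client's ephemeral port range is added, while the security group lets the reply through because it remembers the connection.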
Kubernetes NetworkPolicy Enforcement
In Kubernetes, pods can communicate freely by default. A compromised pod can reach any other pod in the cluster. NetworkPolicy is the Kubernetes-native way to restrict this.
# Default deny all ingress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow only from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
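Not all Kubernetes networking plugins enforce NetworkPolicy. Calico, Cilium, and Weave Net do. On EKS, the Amazon VPC CNI does not enforce policies by default: either enable its network policy support (available in recent versions of the plugin) or add Calico or another policy-capable plugin for enforcement.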
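How the two policies above combine is worth spelling out: a pod becomes isolated as soon as any policy selects it, and traffic is then allowed only if at least one selecting policy permits it. The following is a simplified sketch of that evaluation, assuming a single namespace, label-equality selectors only, and TCP ports. `POLICIES`, `selects`, and `ingress_allowed` are hypothetical names, not part of any Kubernetes client library.

```python
# Simplified in-memory versions of the two policies shown above.
POLICIES = [
    {"name": "default-deny-ingress", "pod_selector": {}, "ingress": []},
    {
        "name": "allow-frontend-to-backend",
        "pod_selector": {"app": "backend"},
        "ingress": [{"from": {"app": "frontend"}, "ports": {8080}}],
    },
]


def selects(selector: dict, labels: dict) -> bool:
    """An empty selector matches every pod; otherwise all pairs must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, source_labels, target_labels, port) -> bool:
    selecting = [p for p in policies if selects(p["pod_selector"], target_labels)]
    if not selecting:
        return True  # no policy selects the pod, so it is not isolated
    for policy in selecting:
        for rule in policy["ingress"]:
            if selects(rule["from"], source_labels) and port in rule["ports"]:
                return True  # allow rules are additive across policies
    return False


if __name__ == "__main__":
    frontend, backend, other = {"app": "frontend"}, {"app": "backend"}, {"app": "batch"}
    print(ingress_allowed(POLICIES, frontend, backend, 8080))
    print(ingress_allowed(POLICIES, other, backend, 8080))
    print(ingress_allowed([], other, backend, 8080))
```

The last call shows the default behaviour the section warns about: with no policies at all, every pod can reach the backend.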
Service Mesh mTLS
When you want encryption and authentication between every service, a service mesh with mutual TLS is the answer. Instead of managing certificates in your application code, the mesh handles it.
flowchart LR
    A[Service A] -->|plaintext inside pod| B[Linkerd Proxy]
    B -->|mTLS| C[Linkerd Proxy]
    C -->|plaintext inside pod| D[Service B]
    E[Service C] -->|plaintext inside pod| F[Linkerd Proxy]
    F -->|mTLS| B
    G[Control Plane<br/>certificates, policies] -.->|issues and rotates certificates| B
    G -.-> C
    G -.-> F
Istio and Linkerd are the two main options. Linkerd is simpler and lower-overhead. Istio is more feature-rich but more complex.
With Linkerd, mTLS is enabled automatically for traffic between meshed pods. You do not switch it on separately; you add workloads to the mesh, for example by annotating a namespace for proxy injection:
# Inject the Linkerd proxy into every pod created in this namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled
Your services do not change. Linkerd intercepts traffic at the proxy level, verifies certificates, and encrypts communication automatically.
Istio gives you more control but requires more configuration:
# PeerAuthentication policy for strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
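To make concrete what the mesh proxies take off the application's hands, here is a minimal sketch of the TLS settings that distinguish mutual TLS from ordinary TLS, using Python's standard `ssl` module. It is not how Linkerd or Istio are implemented; the function names and the optional file paths are hypothetical, and mesh proxies verify workload identities rather than hostnames.

```python
import ssl


def mtls_server_context(ca_file=None, cert_file=None, key_file=None):
    """Server side of mTLS: present a certificate AND require one from the client."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.verify_mode = ssl.CERT_REQUIRED  # plain TLS servers default to CERT_NONE
    if ca_file:
        context.load_verify_locations(cafile=ca_file)  # trust anchor for client certs
    if cert_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def mtls_client_context(ca_file=None, cert_file=None, key_file=None):
    """Client side of mTLS: verify the server and present a client certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # verifies the server by default
    if ca_file:
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


if __name__ == "__main__":
    plain = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    mutual = mtls_server_context()
    print("plain TLS server asks for client cert:", plain.verify_mode == ssl.CERT_REQUIRED)
    print("mTLS server asks for client cert:     ", mutual.verify_mode == ssl.CERT_REQUIRED)
```

Without a mesh, every service would need code like this plus certificate distribution and rotation, which is the burden the sidecar proxies remove.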
Certificate Management with cert-manager
Whether you are using service mesh mTLS or just securing ingress, certificates need to be managed automatically. cert-manager automates certificate issuance and renewal.
# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
Once you have a ClusterIssuer, you can request certificates for any service:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com
cert-manager handles renewal automatically. Certificates from Let’s Encrypt are valid for 90 days; by default cert-manager renews them 30 days before expiry, once two-thirds of the lifetime has elapsed.
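The renewal timing can be worked through with a short calculation. This is a sketch under the assumption that the renewal window defaults to one third of the certificate's lifetime unless a shorter or longer window is configured; `renewal_time` is a hypothetical helper, not a cert-manager API.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment renewal is due for a certificate valid over the given window."""
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3  # assumed default: last third of the lifetime
    return not_after - renew_before


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)
    due = renewal_time(issued, expires)
    print("renewal due on:", due.date(), "which is day", (due - issued).days)
```

For a 90-day certificate the renewal falls on day 60, which leaves 30 days to notice and fix a failing renewal before the certificate actually expires.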
Zero-Trust Network Architecture
Zero-trust means never assuming that a request is safe just because it comes from inside your network. Every request should be authenticated and authorized, regardless of source.
The practical implications:
- Identity-based access: Services should have identities (certificates, service accounts) that are verified, not just IP-based access.
- Microsegmentation: Each service should only be able to reach the services it needs, nothing more.
- Short-lived credentials: Service accounts should use short-lived tokens, not long-lived secrets.
The Secrets Management guide covers how to implement identity-based secrets distribution.
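The three implications above can be combined into one authorization check. The sketch below is deliberately simplified and hypothetical: `ALLOWED_CALLS`, `Credential`, and `authorize` are illustrative names, and a real system would verify a signed certificate or token instead of trusting a plain object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Microsegmentation: each caller identity may reach only the listed targets.
ALLOWED_CALLS = {("frontend", "backend"), ("backend", "database")}


@dataclass(frozen=True)
class Credential:
    identity: str         # verified workload identity, not an IP address
    expires_at: datetime  # short-lived: must be refreshed regularly


def authorize(credential: Credential, target: str, now: datetime) -> bool:
    """Allow a call only for an unexpired identity with an explicit grant."""
    if now >= credential.expires_at:
        return False
    return (credential.identity, target) in ALLOWED_CALLS


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    fresh = Credential("frontend", now + timedelta(minutes=15))
    stale = Credential("frontend", now - timedelta(minutes=1))
    print(authorize(fresh, "backend", now))
    print(authorize(fresh, "database", now))
    print(authorize(stale, "backend", now))
```

Note that the source address never appears in the decision: a request from inside the network gets no extra trust.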
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Security group rule misconfiguration causing intermittent connectivity | Services cannot reach each other, timeouts, partial failures | Give every security group rule a description, document expected port ranges, test connectivity after any security group change |
| cert-manager failing to renew certificate causing production outage | HTTPS becomes unavailable, all TLS connections fail | Monitor the Ready condition on cert-manager Certificate resources, set up alerts 30 days before expiry, keep a backup certificate |
| Linkerd mTLS causing latency spikes | Service-to-service latency increases, timeouts | Profile your services with and without mTLS, use Linkerd’s tap command to identify slow connections, check proxy resource limits |
| NetworkPolicy misconfigured blocking all traffic to namespace | All pods in namespace become unreachable, total outage | Apply NetworkPolicy to one pod first, test connectivity before broad rollout, always have a recovery path |
| NACL overly restrictive blocking legitimate traffic | Database or API unreachable from app tier, cascading failures | Test NACL changes on a non-production subnet first, use descriptive rule numbers for easy identification |
Network Security Observability
Certificate expiration causes complete outages. Set alerts at 60, 30, 7, and 1 day before expiry. If cert-manager reports a certificate as not-ready, investigate immediately — you may have a DNS validation failure or network connectivity issue.
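The tiered alerting described above can be sketched as a small function that maps a certificate's `notAfter` timestamp to the tightest threshold it has crossed. This is an illustrative sketch, assuming timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form; `crossed_threshold` is a hypothetical helper and the return convention (0 for expired, None for healthy) is an assumption made for the example.

```python
from datetime import datetime, timezone

ALERT_THRESHOLDS_DAYS = (60, 30, 7, 1)


def crossed_threshold(not_after: str, now: datetime, thresholds=ALERT_THRESHOLDS_DAYS):
    """Return the tightest threshold crossed, 0 if expired, or None if healthy."""
    expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    days_left = (expiry - now).total_seconds() / 86400
    if days_left <= 0:
        return 0
    crossed = [limit for limit in thresholds if days_left <= limit]
    return min(crossed) if crossed else None


if __name__ == "__main__":
    now = datetime(2025, 3, 25, tzinfo=timezone.utc)
    print(crossed_threshold("2025-03-31T00:00:00Z", now))
```

An alerting job could feed this function the `notAfter` values from the Certificate resources and escalate as the returned threshold shrinks.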
Security group change frequency matters. Teams that modify security groups multiple times per day either have automation problems or unclear ownership. Frequent changes also make auditing harder.
For Linkerd and Istio, watch proxy CPU and memory on each pod. The sidecar adds overhead that catches teams off guard when they have tight resource limits and start getting evicted under load.
Key commands:
# Check cert-manager certificate status
kubectl get certificates -A -o wide
# Monitor certificate expiration
kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'
# List security group rules in AWS
aws ec2 describe-security-groups --region us-east-1 --query 'SecurityGroups[*].{Name:GroupName,Rules:IpPermissions}'
# Check which connections Linkerd secures with mTLS (requires the viz extension)
linkerd viz edges deployment -n production
# Verify NetworkPolicy is applied
kubectl get networkpolicies -A -o wide
Common Anti-Patterns
Relying on the internal network being safe. Internal networks get breached. Compromised containers can reach any other container in the same VPC. Treat internal traffic as untrusted and use mTLS or at minimum application-layer authentication.
Using self-signed certificates in production. Self-signed certs work for development, but they do not chain to a trusted CA, never appear in Certificate Transparency logs, and leave no issuance audit trail, which makes incident response harder. Use Let’s Encrypt via cert-manager or your cloud’s managed certificate service.
Not rotating certificates. Certificates expire, get compromised, or need to be replaced after incidents. Automate renewal with cert-manager and test the renewal process before expiry.
Over-trusting the Kubernetes network. Pods can reach any other pod by default. A single compromised workload becomes a pivot point to attack every other workload. Default-deny NetworkPolicy costs little and limits lateral movement.
Attaching overly permissive security group rules as a shortcut. Opening a port to 0.0.0.0/0 or allowing all traffic from 10.0.0.0/8 “because it’s the internal network” defeats the purpose of security groups. Restrict source CIDRs to the minimum required.
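Overly broad source ranges are easy to catch mechanically. The sketch below flags any source CIDR wider than a chosen prefix length; the `/16` cut-off and the names `MIN_PREFIX_LENGTH` and `overly_permissive` are assumptions for illustration, and the right threshold depends on how your VPC is laid out.

```python
import ipaddress

MIN_PREFIX_LENGTH = 16  # assumed policy: nothing broader than the VPC's own /16


def overly_permissive(cidr: str, min_prefix: int = MIN_PREFIX_LENGTH) -> bool:
    """Flag source ranges that cover more addresses than the policy allows."""
    return ipaddress.ip_network(cidr).prefixlen < min_prefix


if __name__ == "__main__":
    for source in ("0.0.0.0/0", "10.0.0.0/8", "10.0.0.0/16", "10.0.1.0/24"):
        verdict = "too broad" if overly_permissive(source) else "ok"
        print(f"{source:<14} {verdict}")
```

A check like this fits naturally into a pull request pipeline for the Terraform code, before a rule ever reaches production.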
Trade-off Summary
| Control Layer | Scope | Operational Complexity | Latency Impact |
|---|---|---|---|
| Security groups | Instance-level | Low | Minimal |
| NACLs | Subnet-level | Medium | Minimal |
| VPC endpoints | Service-level | Medium | Reduces (removes IGW) |
| PrivateLink / VPC peering | Cross-account | High | Minimal |
| VPN / Direct Connect | On-prem hybrid | High | Adds encryption overhead |
| Service mesh (mTLS) | Pod-to-pod | High | 1-3% overhead |
| NetworkPolicy (K8s) | Pod-level | Medium | Minimal |
Interview Questions
Expected answer points:
- Security groups are stateful — return traffic is automatically allowed without explicit rules
- NACLs are stateless — you must explicitly allow both inbound and outbound traffic
- Security groups operate at the instance/ENI level; NACLs operate at the subnet level
- Use both when you need subnet-wide deny rules (NACLs) plus instance-level source restrictions (security groups)
- Common pattern: NACLs for known malicious IP blocking, security groups for workload access control
Expected answer points:
- Start with /16 to avoid address exhaustion — /24 runs out fast in multi-tier architectures
- Reserve space for future subnets, availability zones, and peering connections
- Segment into at least three subnet types: public (load balancers, NAT), private (app workloads), data (databases, caches)
- Align with availability zones for high availability
- Avoid overlapping CIDRs if VPC peering or VPN is needed
Expected answer points:
- By default, all pods in a Kubernetes cluster can reach all other pods — no isolation enforced
- A compromised pod can pivot laterally to any other workload in the cluster
- NetworkPolicy is a Kubernetes resource that defines ingress/egress rules per namespace or pod selector
- Requires a CNI plugin that supports policy enforcement (Calico, Cilium, Weave Net — not Amazon VPC CNI by default)
- Best practice: apply a default-deny NetworkPolicy to all production namespaces first, then add allow rules
Expected answer points:
- mTLS is a protocol where both the client and the server authenticate each other with X.509 certificates
- Unlike regular TLS where only the server presents a certificate, mTLS verifies the client identity as well
- Prevents unauthorized services from making or receiving connections within the mesh
- In a service mesh, certificates are managed by the control plane (Linkerd or Istio), not the application code
- Provides encryption (confidentiality) plus authentication (identity verification) for all service-to-service traffic
Expected answer points:
- Linkerd: choose when you want simple mTLS with minimal operational overhead, lower latency overhead (~1ms), and straightforward configuration
- Istio: choose when you need fine-grained traffic control (mirroring, retries, fault injection), multi-cluster federation, or advanced observability
- Istio has a steeper learning curve and higher resource consumption
- Both Linkerd and Istio are CNCF graduated projects; Istio's larger feature set makes it more complex to operate
- If you only need mTLS and basic observability, Linkerd wins on simplicity
Expected answer points:
- cert-manager creates a ClusterIssuer or Issuer resource that references Let's Encrypt as the ACME provider
- When a Certificate resource is created, cert-manager initiates an ACME challenge (HTTP01 or DNS01) to prove domain ownership
- Once the challenge is passed, Let's Encrypt issues a certificate stored as a Kubernetes Secret
- Let's Encrypt certificates are valid for 90 days; by default cert-manager renews them 30 days before expiry
- Renewal is automatic — cert-manager monitors expiry and re-initiates the ACME flow when renewal is due
Expected answer points:
- Zero-trust means no request is trusted simply because it originates inside the network perimeter
- Every request must be authenticated and authorized regardless of source IP or network location
- Implications: identity-based access (certificates, service accounts) instead of IP-based trust, microsegmentation, short-lived credentials
- Internal networks are treated as untrusted — the same rigor applied to public-facing services is applied to east-west traffic
- Service mesh mTLS, Kubernetes NetworkPolicy, and secrets management are all building blocks of zero-trust
Expected answer points:
- Self-signed certificates do not chain to a trusted CA and never appear in Certificate Transparency logs, so browsers and security tools cannot validate them
- Incident response is harder because there is no issuance audit trail
- They introduce operational risk: no automated renewal, easy to forget about expiry
- Make it harder to detect man-in-the-middle attacks in production
- Use Let's Encrypt (free, automated) or cloud-managed certificates (ACM, Azure Key Vault, GCP Certificate Manager) instead
Expected answer points:
- Certificate expiration alerts at 60, 30, 7, and 1 day before expiry — certificate outages are total outages
- Security group rule change detection — unexpected changes can indicate misconfiguration or compromise
- Service mesh proxy CPU/memory usage — sidecar overhead catches teams off guard under load
- Monitor the Ready condition on cert-manager Certificate resources: failures usually indicate a failing DNS validation or ACME challenge
- Linkerd/Istio metrics: request success rates, mTLS handshake latencies, policy violations
Expected answer points:
- By default, Amazon VPC CNI does not enforce Kubernetes NetworkPolicy; its core job is assigning VPC IP addresses from ENIs to pods
- You must either enable network policy support in a recent version of the VPC CNI or install a policy-capable plugin separately, such as Calico or Cilium
- Teams often assume NetworkPolicy "just works" in EKS and then are confused when default-deny rules have no effect
- Calico can be installed as a DaemonSet with a ConfigMap; Cilium uses eBPF for enforcement
- After installation, verify with `kubectl get networkpolicies -A` and test connectivity
Expected answer points:
- NAT gateway: allows private subnets to make outbound internet calls (e.g., for updates), but traffic traverses public infrastructure and is encrypted only if the application itself uses TLS
- VPC endpoints: create private entry points to AWS services (S3, DynamoDB, Secrets Manager) — traffic never leaves the AWS network
- VPC endpoints reduce NAT gateway costs for high-volume service calls and improve security posture
- PrivateLink / VPC endpoint services are the usual way to expose a service privately to other accounts without peering the VPCs
- Trade-off: NAT gateway is simpler for internet-bound traffic; VPC endpoints are more secure for AWS service traffic
Expected answer points:
- CIDR-based rules (10.0.0.0/8) allow any resource within that range — overpermissive, no identity guarantee
- Security group referencing another security group means only resources attached to that specific security group can connect
- Security group references scale automatically: new instances attached to the source SG are allowed by the existing rule without any rule changes
- Overpermissive CIDR rules are a common anti-pattern — they defeat the purpose of security group least-privilege access
Expected answer points:
- HTTP01: cert-manager creates a temporary HTTP resource at `http://domain.com/.well-known/acme-challenge/` — works for publicly reachable domains
- DNS01: cert-manager creates a DNS TXT record `_acme-challenge.domain.com` — verified by the ACME provider querying DNS
- Use DNS01 when the service cannot accept HTTP traffic from the internet (for example, internal services) or when HTTP01 is otherwise not possible
- DNS01 requires the domain's authoritative DNS zone to be resolvable by the ACME provider, and a cert-manager DNS01 solver with credentials for the DNS provider's API
- DNS01 supports wildcard certificates; HTTP01 does not
Expected answer points:
- Challenge validation failure: the ACME provider cannot reach the HTTP01 challenge endpoint or resolve the DNS01 TXT record, often because of private DNS or propagation delays
- Missing or invalid ACME account key: if the account key Secret referenced by the issuer is deleted or the account is deactivated, cert-manager cannot request certificates
- Rate limiting: Let's Encrypt enforces rate limits (for example, a weekly cap on duplicate certificates for the same set of hostnames), and hitting them blocks renewal
- Detection: monitor `kubectl get certificates -A` for Ready=False; check cert-manager logs for Challenge/Certificate events
- Set up alerts on the Certificate Ready condition and on certificate not-ready events
Expected answer points:
- Each service has a Linkerd proxy sidecar (linkerd2-proxy, a purpose-built Rust micro-proxy, not Envoy) that intercepts outbound and inbound traffic
- On outbound: proxy presents the workload's certificate to the destination's proxy
- On inbound: proxy verifies the client's certificate before forwarding to the application pod
- Certificates are issued and rotated by the Linkerd control plane — application code sees encrypted traffic but handles no certificates
- Control plane uses a trust anchor (root certificate) to issue workload certificates — rotation happens automatically
- Service accounts are mapped to workload identities in the mesh
Expected answer points:
- NACLs require explicit inbound AND outbound rule definitions (stateless) — twice the rule management surface
- They operate at the subnet level, so they affect all resources in that subnet simultaneously
- Complexity is justified in regulated environments (PCI-DSS, HIPAA) where subnet-wide deny rules are required
- Also justified when you need to block known malicious IP ranges before traffic reaches any security group
- For most workloads, well-structured security groups alone are sufficient
Expected answer points:
- Public subnets in each AZ: load balancers, NAT gateways — receives traffic from the internet
- Private subnets in each AZ: web tier then app tier — app tier receives only from web tier security group
- Data subnets in each AZ: database tier — accepts connections only from app tier security group, no internet access
- Database tier security group allows port 5432 (Postgres) only from the app tier security group, not from 0.0.0.0/0
- NACLs add subnet-level deny rules for known malicious IPs on data subnets
- Multi-AZ requires careful CIDR allocation per subnet per AZ — plan this in the initial VPC CIDR design
Expected answer points:
- Default (no NetworkPolicy): the compromised pod can reach every other pod in the cluster — it can exfiltrate data from databases, retrieve secrets from other pods, pivot to external services
- Default-deny NetworkPolicy: the compromised pod can only reach pods it was explicitly allowed to reach — lateral movement is blocked
- Applying default-deny is low operational overhead — one manifest per namespace — but it requires knowing your application's communication patterns up front
- Even with default-deny, a compromised pod can still damage the pod it is allowed to communicate with — defense in depth is necessary
Expected answer points:
- VPC peering: creates a direct network connection between two VPCs — traffic uses AWS backbone but appears as from the peer CIDR
- PrivateLink: exposes a service as a private endpoint within your VPC — the service owner manages it, you do not need a peering connection
- Choose VPC peering for simple, non-isolated cross-VPC communication within your organization
- Choose PrivateLink for cross-account service exposure where the service consumer should not have full VPC-level access
- PrivateLink is more secure for consuming third-party SaaS or AWS services because it does not expose your VPC CIDRs
Expected answer points:
- Security group naming conventions and tagging: tag by environment (prod/staging), team, owner for traceability
- Security group hierarchy: define a "base" security group with common rules (monitoring, logging) and have workload SGs reference it
- Infrastructure as Code (Terraform, Pulumi): treat security group rules as code — review changes via PR, automate application
- Avoid rule explosion: use security group references over CIDR rules where possible for automatic scaling
- Automated drift detection: compare actual SG rules against IaC definitions and alert on drift
- Cross-account security group sharing via AWS Resource Access Manager (RAM) for shared services
Further Reading
- AWS VPC Documentation — Official AWS guide on VPC design, subnets, and connectivity options
- Kubernetes NetworkPolicy documentation — Official K8s docs on writing NetworkPolicy manifests
- Linkerd Documentation — Official Linkerd docs for mTLS, policy, and observability configuration
- cert-manager Documentation — Official cert-manager docs for ClusterIssuer, Certificate resources, and ACME solvers
- OWASP Network Security Cheat Sheet — Practical network security controls and anti-patterns from OWASP
- NIST Zero Trust Architecture (SP 800-207) — Formal definition of zero-trust principles for enterprise networks
Conclusion
Key Takeaways
- VPC design sets the foundation: use /16 with subnet segmentation, not /24
- Security groups are stateful and instance-level; NACLs are stateless and subnet-level
- Kubernetes NetworkPolicy is not enforced by default on EKS with the Amazon VPC CNI: enable its network policy support or add Calico or Cilium
- Service mesh mTLS offloads certificate management from application code
- cert-manager automates certificate issuance and renewal across all Kubernetes workloads
- Zero-trust means authenticating every request regardless of network origin
Network Security Checklist
# 1. VPC uses /16 with public/private/data subnet segmentation
# 2. Security groups restrict source to specific CIDRs or security groups
# 3. NACLs add subnet-level deny rules for known malicious IPs
# 4. Calico or Cilium installed for NetworkPolicy enforcement on EKS
# 5. Default-deny NetworkPolicy applied to all production namespaces
# 6. cert-manager ClusterIssuer created for Let's Encrypt
# 7. Certificates auto-renewed 30 days before expiry
# 8. Service mesh mTLS enabled for all production namespaces
# 9. Certificate expiration alerts configured at 60, 30, 7, 1 days