Network Security: VPC, Firewall Rules, and Service Mesh mTLS

Design network security for cloud-native applications using VPCs, network policies, and mutual TLS for service-to-service encryption.

published: March 25, 2026 reading time: 26 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Network security controls which systems can communicate under what conditions, spanning cloud VPC architecture down to individual Kubernetes pods. The guide covers VPC design with proper CIDR allocation, stateful security groups versus stateless NACLs, and Kubernetes NetworkPolicy enforcement to prevent lateral movement when pods are compromised. Proper implementation prevents data leaks, reduces blast radius during incidents, and avoids manual misconfiguration through automation.

Introduction

Network security is the discipline of controlling which systems can communicate with which other systems, and under what conditions. It spans from cloud VPC architecture down to individual Kubernetes pods. When done well, it prevents lateral movement during incidents, reduces blast radius from compromised workloads, and keeps sensitive data from leaking to unintended recipients. When done poorly, a single misconfigured security group rule or an overpermissive NACL can cascade into a full production outage or a data breach.

The failures network security prevents are rarely theoretical. A missing security group rule blocks legitimate traffic and causes cascading timeouts. An overly permissive ingress rule exposes internal services to the public internet. A Kubernetes cluster without NetworkPolicy means a single compromised pod can reach every other workload in the cluster. Certificates that are not renewed automatically cause HTTPS outages that affect users directly. Service mesh misconfigurations add latency that surfaces only under production load.

This guide covers the full stack of cloud-native network security: VPC design and CIDR allocation, stateful security groups versus stateless NACLs, Kubernetes NetworkPolicy enforcement with Calico and Cilium, mutual TLS via service mesh (Linkerd and Istio), automated certificate management with cert-manager, and zero-trust architecture principles. By the end, you will be able to design a network security posture that limits blast radius, enforces least-privilege connectivity, and survives common failure scenarios without manual intervention.

When to Use

Istio vs. Linkerd vs. No Service Mesh

Use a service mesh when you have multiple services that need mutual TLS without modifying application code. Linkerd is the better choice when you want simple, low-overhead mTLS with minimal configuration. Istio is better when you need fine-grained traffic control, advanced observability, or multi-cluster federation.

Do not use a service mesh if your services communicate over a dedicated network with no untrusted traffic, or if your team cannot afford the operational overhead. A service mesh adds complexity at every level: debugging, routing, and authentication all become mesh concerns.

The choice between Linkerd and Istio usually comes down to operational tolerance and feature requirements. Linkerd does not use Envoy: it injects its own purpose-built Rust proxy (linkerd2-proxy) as a sidecar into each meshed pod, which keeps per-pod memory overhead small and predictable. Linkerd typically adds only a small amount of p99 latency per hop, but the exact figure depends on payload size and load, so measure it in your own environment. Istio’s sidecar model (Envoy injected as a sidecar per pod) adds more flexibility but also more memory pressure, especially at scale. If you are running fewer than 50 services and do not need traffic shaping, Linkerd wins on simplicity. If you need multi-cluster federation, Wasm-based Envoy extensions, or fine-grained load shedding policies, Istio’s API surface justifies the complexity.

Here is the failure mode that trips teams up: Linkerd’s workload certificate rotation is automatic and transparent, but integrations that depend on Istio-specific certificate handling do not carry over to a Linkerd environment. On the Istio side, a PeerAuthentication policy set to STRICT mode rejects any plaintext connection, which catches teams off guard during initial rollout when some clients are not yet in the mesh. Start in PERMISSIVE mode, verify that traffic between services is actually using mTLS, then switch to STRICT and test connectivity again after the change.

cert-manager vs. Cloud-Native Certificate Management

Use cert-manager when you run Kubernetes and want a unified way to manage certificates across cloud and on-premises environments, or when you use Let’s Encrypt or other ACME providers.

Use cloud-native certificate management (AWS ACM, Azure Key Vault, GCP Certificate Manager) when you operate entirely within one cloud and your certificates are primarily for cloud-managed ingresses like ALB or Cloud CDN.

cert-manager is the right choice when your infrastructure spans multiple cloud providers or includes on-premises Kubernetes clusters. It gives you a single control plane for certificate lifecycle management regardless of where your clusters run. The ACME protocol support means you can use Let’s Encrypt at no cost, which is practical for internal services that need publicly trusted certificates. cert-manager also integrates directly with Kubernetes Ingress resources, so certificate provisioning can be automated as part of your existing deployment workflow.

Cloud-native certificate management makes more sense when you are fully committed to one cloud provider and your certificate needs are concentrated at the load balancer layer. AWS ACM handles certificate rotation for ALB and CloudFront, but it does not help you manage certificates inside your cluster for pod-to-pod mTLS. On Azure, the Key Vault provider for the Secrets Store CSI Driver can bridge this gap by mounting Key Vault certificates into pods, but the setup is more involved than cert-manager. If you are running EKS with an Application Load Balancer and do not need mTLS between services, ACM alone is sufficient and avoids the cert-manager operational surface.

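The constraint that trips people up: cert-manager’s HTTP01 challenge requires the hostname being validated to be reachable from the internet over HTTP on port 80. For services that are not internet-reachable, use DNS01 validation instead, which requires credentials for a DNS provider that supports API-based TXT record management. DNS01 still needs the name to live in a publicly resolvable zone, because the ACME server looks up the TXT record over public DNS. Names that exist only in a private zone cannot be validated by a public CA such as Let’s Encrypt and need a private CA issuer instead.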
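
The decision between HTTP01, DNS01, and a private CA can be written down as a small function. This is a minimal sketch of the reasoning only: `pick_validation` and its three inputs are hypothetical names for illustration, not part of cert-manager.

```python
def pick_validation(dns_name: str, internet_reachable: bool, public_zone: bool) -> str:
    """Suggest a validation approach for one certificate (illustrative helper).

    dns_name           -- the name on the certificate, e.g. "myapp.example.com"
    internet_reachable -- can the ACME server reach the service over HTTP on port 80?
    public_zone        -- is the name hosted in a publicly resolvable DNS zone?
    """
    if not public_zone:
        # A public ACME CA cannot see records in a private zone at all.
        return "private-ca"
    if dns_name.startswith("*."):
        # Wildcard names can only be validated through DNS01.
        return "dns01"
    return "http01" if internet_reachable else "dns01"


if __name__ == "__main__":
    print(pick_validation("myapp.example.com", True, True))
    print(pick_validation("internal.example.com", False, True))
    print(pick_validation("db.corp.internal", False, False))
```

Writing the rule out this way makes the private-zone case explicit, which is the one teams most often discover only after a challenge has been failing for a while.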

NACLs vs. Security Groups Alone

NACLs add value when you need subnet-wide rules that apply to all resources in a subnet, or when you want explicit deny rules at the network layer (for example, blocking known malicious IP ranges before traffic reaches any security group).

Most workloads do fine with security groups alone. NACLs are worth the additional configuration complexity when you have a specific compliance requirement for network-layer filtering or when multiple security groups need a common deny rule.

NACLs become worth the complexity when two things are true. First, you need to block traffic before it reaches any instance in a subnet. Security groups only filter traffic arriving at an ENI, so a NACL with a deny rule on the subnet edge catches traffic before it consumes ENI-level processing. Second, you need a rule that applies uniformly to all current and future resources in a subnet. Security groups require per-instance attachment, which means new resources do not inherit rules automatically. NACLs cover everything in the subnet CIDR without attachment.

In practice, NACLs are most useful on data subnets where you want a persistent deny boundary that survives security group drift. A common pattern: NACL ingress rule 100 allows app-tier CIDR on the database port, NACL ingress rule 200 denies everything else. This deny rule catches cases where someone accidentally widens the security group on a database instance. Without the NACL layer, that instance becomes reachable from every network that can route to the subnet the moment its security group is misconfigured.

The operational cost is real. NACLs are stateless, so you must explicitly allow both directions of any conversation. If your app tier at 10.0.1.0/24 talks to your database at 10.0.2.0/24 on port 5432, the data subnet’s NACL needs an inbound rule allowing 10.0.1.0/24 to port 5432 and an outbound rule allowing responses back to 10.0.1.0/24 on the client’s ephemeral ports (typically 1024-65535). Get this wrong and you will spend hours chasing connections that silently time out. Rules are evaluated in ascending number order and the first match wins, so number them with gaps (100, 200, 300) to leave room for later inserts, and document the expected traffic flows before applying them.
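
The stateless behavior is easier to see in a toy model. The sketch below is not an AWS API; `NaclRule` and `nacl_permits` are hypothetical names that mimic first-match-wins evaluation with an implicit deny, and the client address and port are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int     # lower numbers are evaluated first
    cidr: str       # peer address range the rule applies to
    port_from: int
    port_to: int
    allow: bool


def nacl_permits(rules, peer_ip, port):
    """First matching rule in ascending order wins; no match means deny."""
    peer = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = peer in ipaddress.ip_network(rule.cidr)
        if in_cidr and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


# Data subnet NACL, seen from the database side.
inbound = [
    NaclRule(100, "10.0.1.0/24", 5432, 5432, allow=True),
    NaclRule(200, "0.0.0.0/0", 0, 65535, allow=False),
]
outbound_missing = []  # nobody wrote an egress rule
outbound_fixed = [NaclRule(100, "10.0.1.0/24", 1024, 65535, allow=True)]

client_ip, client_port = "10.0.1.15", 49152  # hypothetical app-tier client

request_ok = nacl_permits(inbound, client_ip, 5432)
reply_without_rule = nacl_permits(outbound_missing, client_ip, client_port)
reply_with_rule = nacl_permits(outbound_fixed, client_ip, client_port)

if __name__ == "__main__":
    print("request allowed:", request_ok)
    print("reply allowed without egress rule:", reply_without_rule)
    print("reply allowed with egress rule:", reply_with_rule)
```

The request gets in, but without an egress rule the reply never leaves the subnet, which is exactly the half-open symptom that makes NACL mistakes hard to diagnose.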

VPC Design and CIDR Allocation

A VPC (Virtual Private Cloud) is your network boundary in the cloud. Design it carefully because changing it later is painful.

Allocate CIDR blocks that give you room to grow. If you start with a /24, you will outgrow it. Use a /16 for production environments and segment with subnets.

# AWS VPC example
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# Availability zones in the current region, referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

# Private subnets across availability zones: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

# Public subnets: 10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24
resource "aws_subnet" "public" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  # Instances launched here do not get a public IP automatically
  map_public_ip_on_launch = false
}

Segment your VPC into at least three subnet types:

Public subnets: Load balancers, NAT gateways. These have direct internet access.
Private subnets: Application workloads. No direct internet access.
Data subnets: Databases, caches. Most restricted, no internet access at all.
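
Before committing to a layout, it helps to compute the subnet ranges and confirm they do not overlap. The sketch below reimplements the arithmetic of Terraform's `cidrsubnet(prefix, newbits, netnum)` with Python's standard `ipaddress` module. The Python function is an illustration written for this guide, not a Terraform API, and the offset of 10 for public subnets mirrors the example above.

```python
import ipaddress
from itertools import combinations


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Return the netnum-th subnet obtained by extending prefix by newbits."""
    network = ipaddress.ip_network(prefix)
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    new_prefix = network.prefixlen + newbits
    size = 2 ** (network.max_prefixlen - new_prefix)  # addresses per subnet
    base = int(network.network_address) + netnum * size
    return str(type(network)((base, new_prefix)))


def any_overlap(cidrs) -> bool:
    """True if any two of the given CIDR blocks share addresses."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return any(a.overlaps(b) for a, b in combinations(nets, 2))


vpc = "10.0.0.0/16"
private = [cidrsubnet(vpc, 8, i) for i in range(3)]
public = [cidrsubnet(vpc, 8, i + 10) for i in range(3)]

if __name__ == "__main__":
    print("private:", private)
    print("public: ", public)
    print("overlap:", any_overlap(private + public))
```

Running the same check against the CIDRs of any VPC you plan to peer with catches overlapping ranges while they are still cheap to change.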

Security Groups and NACLs

Security groups are stateful firewalls attached to instances or ENIs (Elastic Network Interfaces). They are the primary tool for controlling traffic to your workloads.

# Security group for an application tier
resource "aws_security_group" "app" {
  name        = "app-tier"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.load_balancer.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Network ACLs (NACLs) are stateless and operate at the subnet level. Use them as a second layer of defense, for example, blocking all traffic to your data subnets except from specific app subnets.

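# NACL for data subnet - only allow traffic from app tier
resource "aws_network_acl" "data" {
  vpc_id = aws_vpc.main.id
  # Associate this NACL with your data subnets via subnet_ids (omitted here)

  ingress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "10.0.1.0/24" # App tier subnet
    from_port  = 5432
    to_port    = 5432
  }

  ingress {
    rule_no    = 200
    protocol   = "-1"
    action     = "deny"
    cidr_block = "0.0.0.0/0"
    from_port  = 0
    to_port    = 0
  }

  # NACLs are stateless: replies to the app tier need their own egress rule
  egress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "10.0.1.0/24"
    from_port  = 1024
    to_port    = 65535
  }
}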

The key difference: security groups are stateful (return traffic is automatically allowed), NACLs are stateless (you must explicitly allow return traffic).

Kubernetes NetworkPolicy Enforcement

In Kubernetes, pods can communicate freely by default. A compromised pod can reach any other pod in the cluster. NetworkPolicy is the Kubernetes-native way to restrict this.

# Default deny all ingress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress

---
# Allow only from frontend to backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

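Not all Kubernetes networking plugins enforce NetworkPolicy, and on a cluster without an enforcing plugin the policies are accepted by the API server but silently ignored. Calico, Cilium, and Weave Net enforce them. On EKS, the Amazon VPC CNI does not enforce policies out of the box: enable its network policy support (available in recent versions of the add-on) or run Calico or Cilium for policy enforcement.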
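
The way the two policies above combine can be surprising, so here is a deliberately simplified model of the ingress semantics. It assumes a single namespace, `matchLabels` selectors only, and TCP ports only. The function names and the `batch` pod are hypothetical, and a real CNI plugin handles many more cases.

```python
def selects(selector: dict, labels: dict) -> bool:
    """An empty selector matches every pod, like `podSelector: {}`."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, src_labels, dst_labels, port) -> bool:
    """Allow rules are additive; a pod selected by no policy accepts everything."""
    applicable = [p for p in policies if selects(p["podSelector"], dst_labels)]
    if not applicable:
        return True  # Kubernetes default: pod is not isolated
    for policy in applicable:
        for rule in policy.get("ingress", []):
            source_ok = any(selects(s, src_labels) for s in rule["from"])
            if source_ok and port in rule["ports"]:
                return True
    return False  # isolated, and no rule matched


default_deny = {"podSelector": {}, "ingress": []}
allow_frontend = {
    "podSelector": {"app": "backend"},
    "ingress": [{"from": [{"app": "frontend"}], "ports": [8080]}],
}
policies = [default_deny, allow_frontend]

frontend = {"app": "frontend"}
backend = {"app": "backend"}
batch = {"app": "batch"}  # hypothetical third workload

if __name__ == "__main__":
    print(ingress_allowed(policies, frontend, backend, 8080))
    print(ingress_allowed(policies, batch, backend, 8080))
    print(ingress_allowed([], batch, backend, 8080))
```

The last call is the important one: with no policies at all, the same connection that default-deny blocks is allowed, which is also what you observe when the CNI does not enforce policies.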

Service Mesh mTLS

When you want encryption and authentication between every service, a service mesh with mutual TLS is the answer. Instead of managing certificates in your application code, the mesh handles it.

flowchart LR
    A[Service A] -->|plaintext, in-pod| B[Linkerd Proxy]
    B -->|mTLS| C[Linkerd Proxy]
    C -->|plaintext, in-pod| D[Service B]
    E[Service C] -->|plaintext, in-pod| F[Linkerd Proxy]
    F -->|mTLS| B
    B --> G[Control Plane<br/>certificates, policies]
    D --> H[Certificate<br/>rotation]
Istio and Linkerd are the two main options. Linkerd is simpler and lower-overhead. Istio is more feature-rich but more complex.

With Linkerd, there is no mTLS switch to flip: traffic between meshed pods is mutually authenticated and encrypted by default. What you configure is which workloads join the mesh, typically by annotating a namespace for proxy injection:

# Opt every pod in this namespace into the Linkerd mesh (mTLS is on by default)
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    linkerd.io/inject: enabled

Your services do not change. Linkerd intercepts traffic at the proxy level, verifies certificates, and encrypts communication automatically.

Istio gives you more control but requires more configuration:

# PeerAuthentication policy for strict mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
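
For contrast, this is roughly what every service would have to do itself without a mesh. The sketch uses Python's standard `ssl` module to build a server-side context that demands a client certificate. The function name is hypothetical, and the file paths are optional only so that the context can be constructed without real certificate files on disk.

```python
import ssl


def mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a TLS server context that requires and verifies a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This line is the "mutual" part: the handshake fails without a valid client cert.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The server's own identity (certificate chain and private key).
        ctx.load_cert_chain(cert_file, key_file)
    if ca_file:
        # The CA bundle used to verify client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


if __name__ == "__main__":
    context = mtls_server_context()
    print(context.verify_mode, context.minimum_version)
```

Multiply this by every service, add certificate distribution and rotation, and the appeal of pushing it into a sidecar proxy becomes clear.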

Certificate Management with cert-manager

Whether you are using service mesh mTLS or just securing ingress, certificates need to be managed automatically. cert-manager automates certificate issuance and renewal.

# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

Once you have a ClusterIssuer, you can request certificates for any service:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-tls
  namespace: production
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - myapp.example.com

cert-manager handles renewal automatically. Certificates from Let’s Encrypt expire after 90 days; by default cert-manager renews when one third of the lifetime remains, which is about 30 days before expiry.
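
The renewal timing can be worked out from the certificate's validity window. A minimal sketch, assuming the default behavior of renewing when one third of the lifetime remains unless `renewBefore` is set. The function name and dates are illustrative and not taken from a real cluster.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment renewal is due for a certificate's validity window."""
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3  # assumed default: one third of the lifetime
    return not_after - renew_before


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)  # a 90-day certificate

default_renewal = renewal_time(issued, expires)
early_renewal = renewal_time(issued, expires, renew_before=timedelta(days=45))

if __name__ == "__main__":
    print("expires:        ", expires.date())
    print("default renewal:", default_renewal.date())
    print("early renewal:  ", early_renewal.date())
```

Knowing the expected renewal date tells you when a certificate that is still unrenewed has become a problem rather than business as usual.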

Zero-Trust Network Architecture

Zero-trust means never assuming that a request is safe just because it comes from inside your network. Every request should be authenticated and authorized, regardless of source.

The practical implications:

Identity-based access: Services should have identities (certificates, service accounts) that are verified, not just IP-based access.
Microsegmentation: Each service should only be able to reach the services it needs, nothing more.
Short-lived credentials: Service accounts should use short-lived tokens, not long-lived secrets.
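
The three implications above can be combined into one authorization check. This is a toy sketch, not a real policy engine: the service names, the `ALLOWED_CALLERS` table, and the `authorize` function are all hypothetical, and the caller identity is assumed to have been verified already (for example, from an mTLS certificate).

```python
from datetime import datetime, timedelta, timezone

# Microsegmentation: each target lists the only identities allowed to call it.
ALLOWED_CALLERS = {
    "backend": {"frontend"},
    "database": {"backend"},
}


def authorize(caller: str, target: str, credential_expires: datetime, now: datetime) -> bool:
    """Decide on verified identity and credential freshness, never on source IP."""
    if now >= credential_expires:
        return False  # short-lived credentials: an expired token is rejected
    return caller in ALLOWED_CALLERS.get(target, set())


now = datetime(2026, 3, 25, 12, 0, tzinfo=timezone.utc)
fresh = now + timedelta(minutes=10)
stale = now - timedelta(minutes=1)

if __name__ == "__main__":
    print(authorize("frontend", "backend", fresh, now))
    print(authorize("frontend", "database", fresh, now))
    print(authorize("backend", "database", stale, now))
```

Note that network location never appears in the decision: a caller inside the VPC with the wrong identity or an expired credential is refused like any other.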

The Secrets Management guide covers how to implement identity-based secrets distribution.

Production Failure Scenarios

Failure	Impact	Mitigation
Security group misconfiguration causing intermittent connectivity	Services cannot reach each other, timeouts, partial failures	Give every rule a description, document expected port ranges, manage rules as code, test after any security group change
cert-manager failing to renew certificate causing production outage	HTTPS becomes unavailable, all TLS connections fail	Monitor the Ready condition on Certificate resources, set up alerts 30 days before expiry, keep a backup certificate
Linkerd mTLS causing latency spikes	Service-to-service latency increases, timeouts	Profile your services with and without mTLS, use linkerd viz tap to identify slow connections, check proxy resource limits
NetworkPolicy misconfigured blocking all traffic to namespace	All pods in namespace become unreachable, total outage	Apply NetworkPolicy to one pod first, test connectivity before broad rollout, always have a recovery path
NACL overly restrictive blocking legitimate traffic	Database or API unreachable from app tier, cascading failures	Test NACL changes on a non-production subnet first, use descriptive rule numbers for easy identification

Network Security Observability

Certificate expiration causes complete outages. Set alerts at 60, 30, 7, and 1 day before expiry. If cert-manager reports a certificate as not-ready, investigate immediately — you may have a DNS validation failure or network connectivity issue.
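
The tiered thresholds can be expressed as a small function over the `notAfter` timestamp that cert-manager records in a Certificate's status. A minimal sketch under stated assumptions: `expiry_alert`, the tier labels, and the sample timestamps are hypothetical, and wiring the result into an alerting system is left out.

```python
from datetime import datetime, timezone

THRESHOLDS_DAYS = (1, 7, 30, 60)  # most urgent tier first


def parse_not_after(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2026-04-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def expiry_alert(not_after: datetime, now: datetime):
    """Return the most urgent alert tier that applies, or None if none does."""
    remaining_days = (not_after - now).total_seconds() / 86400
    if remaining_days <= 0:
        return "expired"
    for threshold in THRESHOLDS_DAYS:
        if remaining_days <= threshold:
            return f"{threshold}d"
    return None


not_after = parse_not_after("2026-04-01T00:00:00Z")

if __name__ == "__main__":
    for day in ("2026-01-01", "2026-02-15", "2026-03-27", "2026-04-02"):
        now = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        print(day, expiry_alert(not_after, now))
```

One consequence worth noting: a 90-day certificate that renews with about 30 days left spends part of every cycle inside the 60-day tier, so that tier is better treated as informational for short-lived certificates.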

Security group change frequency matters. Teams that modify security groups multiple times per day either have automation problems or unclear ownership. Frequent changes also make auditing harder.

For Linkerd and Istio, watch proxy CPU and memory on each pod. The sidecar adds overhead that catches teams off guard when resource limits are tight and proxies start getting throttled or OOM-killed under load.

Key commands:

# Check cert-manager certificate status
kubectl get certificates -A -o wide

# Monitor certificate expiration
kubectl get certificates -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.notAfter}{"\n"}{end}'

# List security group rules in AWS
aws ec2 describe-security-groups --region us-east-1 --query 'SecurityGroups[*].{Name:GroupName,Rules:IpPermissions}'

# Check which connections are secured by Linkerd mTLS (requires the viz extension)
linkerd viz edges deployment -n production

# List NetworkPolicy objects (this does not prove the CNI enforces them)
kubectl get networkpolicies -A -o wide

Common Anti-Patterns

Relying on the internal network being safe. Internal networks get breached. Unless security groups and NetworkPolicy say otherwise, a compromised container can reach every other workload that is routable from it. Treat internal traffic as untrusted and use mTLS or at minimum application-layer authentication.

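Using ad hoc self-signed certificates in production. Self-signed certs work for development, but clients cannot validate them against a trusted root, so teams end up disabling verification or distributing trust by hand. They also leave no issuance audit trail, which makes incident response harder. Use Let’s Encrypt via cert-manager, your cloud’s managed certificate service, or a properly managed private CA.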

Not rotating certificates. Certificates expire, get compromised, or need to be replaced after incidents. Automate renewal with cert-manager and test the renewal process before expiry.

Over-trusting the Kubernetes network. Pods can reach any other pod by default. A single compromised workload becomes a pivot point to attack every other workload. Default-deny NetworkPolicy costs little and limits lateral movement.

Attaching overly permissive security group rules as a shortcut. Opening a port to 0.0.0.0/0 or allowing all traffic from 10.0.0.0/8 “because it’s the internal network” defeats the purpose of security groups. Restrict sources to the minimum required CIDRs or, better, to specific security groups.
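
Overly broad sources are easy to find mechanically. The sketch below scans data shaped like the `SecurityGroups` list returned by `aws ec2 describe-security-groups` and flags any ingress CIDR larger than a /16. The size threshold, function name, and sample groups are assumptions chosen for illustration.

```python
import ipaddress


def broad_ingress(groups, max_addresses=65536):
    """Return (group, cidr, from_port, to_port) for ingress sources wider than allowed."""
    findings = []
    for group in groups:
        for permission in group.get("IpPermissions", []):
            for ip_range in permission.get("IpRanges", []):
                network = ipaddress.ip_network(ip_range["CidrIp"])
                if network.num_addresses > max_addresses:
                    findings.append((
                        group["GroupName"],
                        ip_range["CidrIp"],
                        permission.get("FromPort"),
                        permission.get("ToPort"),
                    ))
    return findings


# Hypothetical sample in the shape of the describe-security-groups output.
sample_groups = [
    {"GroupName": "app-tier", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "10.0.10.0/24"}]},
    ]},
    {"GroupName": "legacy-db", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}, {"CidrIp": "0.0.0.0/0"}]},
    ]},
]

findings = broad_ingress(sample_groups)

if __name__ == "__main__":
    for name, cidr, from_port, to_port in findings:
        print(f"{name}: {cidr} on ports {from_port}-{to_port}")
```

Rules that reference a source security group do not appear under `IpRanges`, so they pass this check untouched, which matches the recommendation to prefer them.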

Trade-off Summary

Control Layer	Scope	Operational Complexity	Latency Impact
Security groups	Instance-level	Low	Minimal
NACLs	Subnet-level	Medium	Minimal
VPC endpoints	Service-level	Medium	Can reduce (traffic skips the NAT gateway path)
PrivateLink / VPC peering	Cross-account	High	Minimal
VPN / Direct Connect	On-prem hybrid	High	Adds encryption overhead
Service mesh (mTLS)	Pod-to-pod	High	Small per-hop proxy overhead; measure under load
NetworkPolicy (K8s)	Pod-level	Medium	Minimal

Interview Questions

1. What is the difference between a security group and a NACL in AWS? When would you use both?

Expected answer points:

Security groups are stateful — return traffic is automatically allowed without explicit rules
NACLs are stateless — you must explicitly allow both inbound and outbound traffic
Security groups operate at the instance/ENI level; NACLs operate at the subnet level
Use both when you need subnet-wide deny rules (NACLs) plus instance-level source restrictions (security groups)
Common pattern: NACLs for known malicious IP blocking, security groups for workload access control

2. How do you design a VPC CIDR allocation plan for a production environment? What considerations affect your choice?

Expected answer points:

Start with /16 to avoid address exhaustion — /24 runs out fast in multi-tier architectures
Reserve space for future subnets, availability zones, and peering connections
Segment into at least three subnet types: public (load balancers, NAT), private (app workloads), data (databases, caches)
Align with availability zones for high availability
Avoid overlapping CIDRs if VPC peering or VPN is needed

3. Why is the default Kubernetes pod-to-pod network inherently insecure, and how does NetworkPolicy address it?

Expected answer points:

By default, all pods in a Kubernetes cluster can reach all other pods — no isolation enforced
A compromised pod can pivot laterally to any other workload in the cluster
NetworkPolicy is a Kubernetes resource that defines ingress/egress rules per namespace or pod selector
Requires a CNI plugin that supports policy enforcement (Calico, Cilium, Weave Net — not Amazon VPC CNI by default)
Best practice: apply a default-deny NetworkPolicy to all production namespaces first, then add allow rules

4. What is mutual TLS (mTLS) and why is it important in a service mesh architecture?

Expected answer points:

mTLS is a protocol where both the client and the server authenticate each other with X.509 certificates
Unlike regular TLS where only the server presents a certificate, mTLS verifies the client identity as well
Prevents unauthorized services from making or receiving connections within the mesh
In a service mesh, certificates are managed by the control plane (Linkerd or Istio), not the application code
Provides encryption (confidentiality) plus authentication (identity verification) for all service-to-service traffic

5. Linkerd vs. Istio — when would you choose one over the other?

Expected answer points:

Linkerd: choose when you want simple mTLS with minimal operational overhead, low per-hop latency overhead, and straightforward configuration
Istio: choose when you need fine-grained traffic control (mirroring, retries, fault injection), multi-cluster federation, or advanced observability
Istio has a steeper learning curve and higher resource consumption
Both are CNCF graduated projects, so maturity is not the differentiator; complexity and feature set are
If you only need mTLS and basic observability, Linkerd wins on simplicity

6. How does cert-manager work with Let's Encrypt? What happens during certificate renewal?

Expected answer points:

cert-manager creates a ClusterIssuer or Issuer resource that references Let's Encrypt as the ACME provider
When a Certificate resource is created, cert-manager initiates an ACME challenge (HTTP01 or DNS01) to prove domain ownership
Once the challenge is passed, Let's Encrypt issues a certificate stored as a Kubernetes Secret
Let's Encrypt certificates expire after 90 days; by default cert-manager renews when one third of the lifetime remains, about 30 days before expiry
Renewal is automatic — cert-manager monitors expiry and re-initiates the ACME flow when renewal is due

7. What is a zero-trust network architecture and how does it change how you design network security?

Expected answer points:

Zero-trust means no request is trusted simply because it originates inside the network perimeter
Every request must be authenticated and authorized regardless of source IP or network location
Implications: identity-based access (certificates, service accounts) instead of IP-based trust, microsegmentation, short-lived credentials
Internal networks are treated as untrusted — the same rigor applied to public-facing services is applied to east-west traffic
Service mesh mTLS, Kubernetes NetworkPolicy, and secrets management are all building blocks of zero-trust

8. What are the risks of using self-signed certificates in a production environment?

Expected answer points:

Self-signed certificates cannot be validated against a trusted root, so clients must either disable verification or have trust distributed by hand
Incident response is harder because there is no issuance audit trail
They introduce operational risk: no automated renewal, easy to forget about expiry
Clients that skip verification in order to accept them cannot detect man-in-the-middle attacks
Use Let's Encrypt (free, automated) or cloud-managed certificates (ACM, Azure Key Vault, GCP Certificate Manager) instead

9. How do you monitor network security health in a Kubernetes cluster? What metrics and alerts are critical?

Expected answer points:

Certificate expiration alerts at 60, 30, 7, and 1 day before expiry — certificate outages are total outages
Security group rule change detection — unexpected changes can indicate misconfiguration or compromise
Service mesh proxy CPU/memory usage — sidecar overhead catches teams off guard under load
Monitor cert-manager's CertificateReady condition — failures indicate DNS validation or ACME challenges failing
Linkerd/Istio metrics: request success rates, mTLS handshake latencies, policy violations

10. What is the CNI plugin requirement for NetworkPolicy enforcement on Amazon EKS? Why does this trip people up?

Expected answer points:

Amazon VPC CNI does not enforce Kubernetes NetworkPolicy out of the box; its primary job is assigning VPC IP addresses from ENIs to pods
Either enable the VPC CNI's network policy support (available in recent versions of the add-on) or install a policy engine such as Calico or Cilium
Teams often assume NetworkPolicy "just works" in EKS and then are confused when default-deny rules have no effect
Calico runs a per-node agent as a DaemonSet; Cilium uses eBPF for enforcement
After enabling enforcement, list policies with `kubectl get networkpolicies -A` and test that a connection the policy should block actually fails

11. Explain the trade-offs between VPC endpoints (private endpoints) and using a NAT gateway for outbound traffic.

Expected answer points:

NAT gateway: allows private subnets to make outbound internet calls (e.g., for updates), but traffic goes through the internet — encrypted but traverses public infrastructure
VPC endpoints: create private entry points to AWS services (S3, DynamoDB, Secrets Manager) — traffic never leaves the AWS network
VPC endpoints reduce NAT gateway costs for high-volume service calls and improve security posture
PrivateLink / VPC endpoint services are required for cross-account access to services
Trade-off: NAT gateway is simpler for internet-bound traffic; VPC endpoints are more secure for AWS service traffic

12. How does a security group rule allowing traffic from 10.0.0.0/8 differ from allowing traffic from a specific security group, and why does it matter?

Expected answer points:

CIDR-based rules (10.0.0.0/8) allow any resource within that range — overpermissive, no identity guarantee
Security group referencing another security group means only resources attached to that specific security group can connect
Security group references scale automatically: new instances attached to the source SG are allowed by the existing rule without any rule changes
Overpermissive CIDR rules are a common anti-pattern — they defeat the purpose of security group least-privilege access

13. What is the difference between ACME HTTP01 and DNS01 challenges in cert-manager? When would you use DNS01?

Expected answer points:

HTTP01: cert-manager creates a temporary HTTP resource at `http://domain.com/.well-known/acme-challenge/` — works for publicly reachable domains
DNS01: cert-manager creates a DNS TXT record `_acme-challenge.domain.com` — verified by the ACME provider querying DNS
Use DNS01 when the service is not reachable from the internet, when HTTP01 is not possible, or when you need a wildcard certificate
DNS01 requires the TXT record to be published in a publicly resolvable zone and a solver with API credentials for that DNS provider; names that exist only in a private zone cannot be validated by a public ACME CA
DNS01 supports wildcard certificates; HTTP01 does not

14. What are the main failure scenarios for cert-manager, and how do you detect them before they cause production outages?

Expected answer points:

Challenge failure: ACME provider cannot reach the HTTP01 endpoint or resolve the DNS01 TXT record, often caused by private DNS, blocked port 80, or propagation delays
ACME account problems: if the Secret holding the account private key is deleted or replaced, cert-manager has to register again, which can stall issuance
Rate limiting: Let's Encrypt enforces rate limits (for example, a weekly cap on certificates per registered domain and a much lower cap on duplicate certificates for the same set of names); hitting them blocks new issuance
Detection: monitor `kubectl get certificates -A` for Ready=False; check cert-manager logs for Challenge/Certificate events
Set up alerts on CertificateReady condition and certificate not-ready events

15. Describe how Linkerd's mTLS works end-to-end, from connection initiation to certificate rotation.

Expected answer points:

Each meshed pod has a Linkerd proxy sidecar (linkerd2-proxy, written in Rust, not Envoy) that intercepts outbound and inbound traffic
On outbound: proxy presents the workload's certificate to the destination's proxy
On inbound: proxy verifies the client's certificate before forwarding to the application container
Workload certificates are issued and rotated by the Linkerd control plane; application code sends and receives plaintext and handles no certificates
Control plane uses an issuer certificate chained to a trust anchor (root certificate) to sign short-lived workload certificates; workload rotation is automatic, while the trust anchor and issuer must be rotated by the operator or by tooling such as cert-manager
Service accounts are mapped to workload identities in the mesh

16. What are the operational complexities involved in using NACLs alongside security groups? When is the complexity justified?

Expected answer points:

NACLs require explicit inbound AND outbound rule definitions (stateless) — twice the rule management surface
They operate at the subnet level, so they affect all resources in that subnet simultaneously
Complexity is justified in regulated environments (PCI-DSS, HIPAA) where subnet-wide deny rules are required
Also justified when you need to block known malicious IP ranges before traffic reaches any security group
For most workloads, well-structured security groups alone are sufficient

17. How would you design network security for a three-tier application (web, app, database) deployed in a multi-AZ VPC?

Expected answer points:

Public subnets in each AZ: load balancers, NAT gateways — receives traffic from the internet
Private subnets in each AZ: web tier then app tier — app tier receives only from web tier security group
Data subnets in each AZ: database tier — accepts connections only from app tier security group, no internet access
Database security group allows port 5432 (Postgres) only from the app tier security group, not from 0.0.0.0/0
NACLs add subnet-level deny rules for known malicious IPs on data subnets
Multi-AZ requires careful CIDR allocation per subnet per AZ — plan this in the initial VPC CIDR design

18. What is the blast radius of a compromised pod in a default Kubernetes cluster vs. one with default-deny NetworkPolicy?

Expected answer points:

Default (no NetworkPolicy): the compromised pod can reach every other pod in the cluster — it can exfiltrate data from databases, retrieve secrets from other pods, pivot to external services
Default-deny NetworkPolicy: the compromised pod can only reach pods it was explicitly allowed to reach — lateral movement is blocked
Applying default-deny is low operational overhead — one manifest per namespace — but it requires knowing your application's communication patterns up front
Even with default-deny, a compromised pod can still damage the pod it is allowed to communicate with — defense in depth is necessary
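
Blast radius can be made concrete as graph reachability. The sketch below compares a flat cluster with one where only two connections are allowed. The pod names and the allowed edges are hypothetical, and the model ignores ports and namespaces.

```python
from collections import deque


def reachable(start, edges):
    """Every pod an attacker can reach from start, pivoting through allowed hops."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


pods = ["frontend", "backend", "db", "billing", "metrics"]

# No NetworkPolicy: every pod can open a connection to every other pod.
flat = {pod: [other for other in pods if other != pod] for pod in pods}

# Default-deny plus two allow rules: frontend -> backend, backend -> db.
segmented = {"frontend": ["backend"], "backend": ["db"]}

flat_radius = reachable("frontend", flat)
direct_radius = set(segmented.get("frontend", ()))
pivot_radius = reachable("frontend", segmented)

if __name__ == "__main__":
    print("flat cluster:      ", sorted(flat_radius))
    print("segmented, direct: ", sorted(direct_radius))
    print("segmented, pivoted:", sorted(pivot_radius))
```

The gap between the direct and pivoted sets is the defense-in-depth argument in miniature: policy shrinks the first hop, but anything an allowed peer can reach is still in play if that peer is compromised too.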

19. How does VPC peering differ from AWS PrivateLink, and when would you choose one over the other?

Expected answer points:

VPC peering: creates a direct network connection between two VPCs — traffic uses AWS backbone but appears as from the peer CIDR
PrivateLink: exposes a service as a private endpoint within your VPC — the service owner manages it, you do not need a peering connection
Choose VPC peering for simple, non-isolated cross-VPC communication within your organization
Choose PrivateLink for cross-account service exposure where the service consumer should not have full VPC-level access
PrivateLink is more secure for consuming third-party SaaS or AWS services because it does not expose your VPC CIDRs

20. What strategies would you use to reduce the operational overhead of managing security groups at scale across multiple environments and AWS accounts?

Expected answer points:

Security group naming conventions and tagging: tag by environment (prod/staging), team, owner for traceability
Security group hierarchy: define a "base" security group with common rules (monitoring, logging) and have workload SGs reference it
Infrastructure as Code (Terraform, Pulumi): treat security group rules as code — review changes via PR, automate application
Avoid rule explosion: use security group references over CIDR rules where possible for automatic scaling
Automated drift detection: compare actual SG rules against IaC definitions and alert on drift
Cross-account security group sharing via AWS Resource Access Manager (RAM) for shared services

Conclusion

Key Takeaways

VPC design sets the foundation: use /16 with subnet segmentation, not /24
Security groups are stateful and instance-level; NACLs are stateless and subnet-level
Kubernetes NetworkPolicy is not enforced out of the box on EKS: enable the VPC CNI's network policy support or add Calico or Cilium
Service mesh mTLS offloads certificate management from application code
cert-manager automates certificate issuance and renewal across all Kubernetes workloads
Zero-trust means authenticating every request regardless of network origin

Network Security Checklist

# 1. VPC uses /16 with public/private/data subnet segmentation
# 2. Security groups restrict source to specific CIDRs or security groups
# 3. NACLs add subnet-level deny rules for known malicious IPs
# 4. NetworkPolicy enforcement available on EKS (VPC CNI network policy support, Calico, or Cilium)
# 5. Default-deny NetworkPolicy applied to all production namespaces
# 6. cert-manager ClusterIssuer created for Let's Encrypt
# 7. Certificates auto-renewed 30 days before expiry
# 8. Service mesh mTLS enabled for all production namespaces
# 9. Certificate expiration alerts configured at 60, 30, 7, 1 days
