Table of Contents
- Misconfiguring Resource Limits and Requests
- Neglecting Liveness and Readiness Probes
- Using the “latest” Image Tag
- Insecure Secrets Management
- Ignoring Security Best Practices
- Overcomplicating with Unnecessary Abstractions
- Not Backing Up etcd
- Improper Namespace Usage
- Neglecting Network Policies
- Failing to Monitor Cluster and Application Health
- Conclusion
- References
1. Misconfiguring Resource Limits and Requests
What’s the Mistake?
One of the most frequent mistakes is either omitting resource requests/limits entirely or setting them incorrectly (e.g., setting limits too low, requests too high, or mismatching CPU/memory ratios).
Why It’s a Problem
- No requests/limits: Kubernetes has no way to prioritize workloads. Your app might get starved of resources (CPU throttling, OOM kills) if other pods consume too much, or it might hog resources, leaving none for others.
- Limits too low: Pods may crash unexpectedly due to resource exhaustion (e.g., `OOMKilled` errors).
- Requests too high: Wasted cluster capacity. Nodes may be underutilized because Kubernetes reserves requested resources, even if the pod doesn’t use them.
How to Avoid It
Understand Requests vs. Limits
- Requests: The minimum resources a pod needs. Kubernetes uses this to schedule pods on nodes with enough free resources.
- Limits: The maximum resources a pod can use. Exceeding CPU limits leads to throttling; exceeding memory limits leads to termination (`OOMKilled`).
Best Practices:
- Set requests based on actual usage: Use tools like `kubectl top pod <pod-name>` or monitoring tools (Prometheus) to measure typical resource consumption.
- Set limits slightly higher than peak usage: Leave buffer for traffic spikes (e.g., 20-30% above peak).
- Avoid overcommitting memory: Unlike CPU, memory is not compressible; set limits carefully to prevent OOM kills.
- Use namespaced resource quotas: Enforce limits at the namespace level to prevent teams from hoarding resources (e.g., `ResourceQuota` objects).
Example Configuration:
apiVersion: v1
kind: Pod
metadata:
name: resource-demo
spec:
containers:
- name: app
image: my-app:v1
resources:
requests: # Minimum required
cpu: 100m # 0.1 CPU cores
memory: 128Mi # 128 megabytes
limits: # Maximum allowed
cpu: 500m # 0.5 CPU cores
memory: 256Mi # 256 megabytes
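Rather than repeating these values in every manifest, a `LimitRange` can supply namespace-wide defaults for containers that omit them. A minimal sketch (the namespace and object names are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:      # Applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:             # Applied when a container omits limits
        cpu: 500m
        memory: 256Mi
```

This acts as a safety net, not a substitute for measuring each app's real usage.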
2. Neglecting Liveness and Readiness Probes
What’s the Mistake?
Skipping liveness and readiness probes, or configuring them incorrectly (e.g., using the same endpoint for both, setting overly aggressive timeouts).
Why It’s a Problem
- No liveness probe: Kubernetes can’t detect if a pod is “stuck” (e.g., deadlocks, unresponsive). The pod will keep running even if it’s non-functional.
- No readiness probe: Kubernetes will send traffic to a pod before it’s fully initialized (e.g., still loading data, connecting to a database), causing failed requests.
- Misconfigured probes: Overly strict probes may restart healthy pods (liveness) or delay traffic to ready pods (readiness).
How to Avoid It
Understand Probe Types
- Liveness Probe: Checks if the pod is “alive.” If it fails, Kubernetes restarts the pod.
- Readiness Probe: Checks if the pod is “ready” to receive traffic. If it fails, Kubernetes removes the pod from the service endpoint.
Best Practices:
- Use distinct endpoints for liveness and readiness (e.g., `/health/live` vs. `/health/ready`).
- Set appropriate delays: Use `initialDelaySeconds` to let the app initialize (e.g., 30s for a Java app).
- Choose the right probe type:
  - `httpGet`: For HTTP apps (e.g., check a `/health` endpoint).
  - `tcpSocket`: For TCP services (e.g., check if a port is open).
  - `exec`: Run a command (e.g., `cat /tmp/healthy`).
Example Configuration:
apiVersion: v1
kind: Pod
metadata:
name: probe-demo
spec:
containers:
- name: app
image: my-app:v1
ports:
- containerPort: 8080
livenessProbe: # Restart if unresponsive
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30 # Wait for app to start
periodSeconds: 10 # Check every 10s
failureThreshold: 3 # Restart after 3 failures
readinessProbe: # Stop traffic if not ready
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5 # Quick check for readiness
periodSeconds: 5
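For non-HTTP services, the same pattern applies with the other probe handlers. A sketch (the port and file path are illustrative):

```yaml
livenessProbe:
  tcpSocket:            # Liveness for a TCP service: is the port accepting connections?
    port: 5432
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  exec:                 # Readiness via a command: ready if it exits 0
    command: ["cat", "/tmp/healthy"]
  initialDelaySeconds: 5
  periodSeconds: 5
```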
3. Using the “latest” Image Tag
What’s the Mistake?
Relying on the latest tag for container images (e.g., image: my-app:latest).
Why It’s a Problem
- Non-deterministic deployments: The `latest` tag is a moving target. It may point to a new version between deployments, leading to unexpected changes (e.g., breaking updates, untested code).
- Rollback headaches: If `latest` updates automatically, rolling back requires manually specifying an older tag (which you may not track).
- Debugging complexity: You can’t trace which version of the image is running in production.
How to Avoid It
- Use specific, immutable tags: Tag images with version numbers (e.g., `v1.2.3`), commit hashes (e.g., `sha-abc123`), or semantic versions (e.g., `1.2.x` for patch updates).
- Pin to digests: Use image digests (e.g., `my-app@sha256:abcd1234...`) for absolute immutability; a digest reference never moves, even if the tag is updated.
- Automate tagging: Use CI/CD pipelines (GitHub Actions, GitLab CI) to tag images with unique identifiers (e.g., Git commit hashes) on each build.
Example:
# Bad: Uses mutable "latest" tag
image: my-app:latest
# Good: Uses specific version tag
image: my-app:v1.2.3
# Best: Uses immutable digest
image: my-app@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
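Automated tagging can be wired into CI. A minimal GitHub Actions sketch (the registry and image names are assumptions; registry login is omitted):

```yaml
# .github/workflows/build.yml (sketch)
name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push with a commit-SHA tag
        run: |
          IMAGE="registry.example.com/my-app:sha-${GITHUB_SHA::7}"
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
```

Every build then produces a unique, traceable tag, and rollbacks are just a redeploy of an older SHA.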
4. Insecure Secrets Management
What’s the Mistake?
Treating Kubernetes Secrets as a silver bullet for security: storing secrets in plaintext, committing them to version control, or assuming Secrets are encrypted by default.
Why It’s a Problem
- Kubernetes Secrets are not encrypted by default: They are base64-encoded (not encrypted) and stored in etcd. Anyone with access to etcd or `kubectl get secrets` can decode them.
- Secrets in environment variables: Exposing secrets via `env` makes them visible in tools like `kubectl describe pod` or in pod logs (if accidentally logged).
- No rotation: Failing to rotate secrets (e.g., database passwords) increases the risk of long-term exposure if leaked.
How to Avoid It
Secure Secrets at Rest
- Enable etcd encryption: Use Kubernetes’ encryption-at-rest feature (an `EncryptionConfiguration` passed to the API server) to encrypt Secrets in etcd, ideally backed by an external KMS provider (e.g., AWS KMS, HashiCorp Vault).
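Encryption at rest is configured via a file referenced by the API server’s `--encryption-provider-config` flag. A minimal sketch using the built-in `aescbc` provider (a KMS provider follows the same shape; the key placeholder must be replaced):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: ["secrets"]
    providers:
      - aescbc:            # Encrypt new/updated Secrets with this key
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}       # Fallback: still read pre-existing unencrypted Secrets
```

Note that existing Secrets are only re-encrypted when written again (e.g., via `kubectl get secrets -A -o json | kubectl replace -f -`).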
Avoid Environment Variables
- Mount secrets as files instead of environment variables. This limits exposure:

volumes:
- name: db-secrets
  secret:
    secretName: db-creds
containers:
- name: app
  volumeMounts:
  - name: db-secrets
    mountPath: /secrets
    readOnly: true # Prevent modification
Use External Secret Managers
For production, avoid relying solely on Kubernetes Secrets. Use tools like:
- HashiCorp Vault: Manages secrets with dynamic rotation, access control, and auditing.
- AWS Secrets Manager / GCP Secret Manager: Cloud-native secret storage with auto-rotation.
- Sealed Secrets: Encrypt secrets so they can be safely committed to Git (decrypted only in the cluster).
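With Sealed Secrets, for example, the `kubeseal` CLI encrypts a Secret against the in-cluster controller’s public key, so only the controller can decrypt it (file names here are illustrative):

```shell
# Encrypt a local Secret manifest; the plaintext version never enters Git
kubeseal --format yaml < db-creds-secret.yaml > db-creds-sealed.yaml

# Commit db-creds-sealed.yaml, then apply it like any other manifest
kubectl apply -f db-creds-sealed.yaml
```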
Restrict Access with RBAC
Use Role-Based Access Control (RBAC) to limit who can view secrets:
# Example: Allow "dev-team" to view secrets in "dev" namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: dev
name: secret-reader
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: dev-secret-reader
namespace: dev
subjects:
- kind: Group
name: dev-team
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: secret-reader
apiGroup: rbac.authorization.k8s.io
5. Ignoring Security Best Practices
What’s the Mistake?
Overlooking foundational security practices: running containers as root, skipping security contexts, missing network policies, or using unpatched images.
Why It’s a Problem
- Root privileges: A compromised container with root access can escape to the host node.
- Unrestricted network access: Without network policies, any pod can communicate with any other pod, increasing the attack surface.
- Outdated images: Vulnerabilities in base images (e.g., `alpine:3.7` with known CVEs) are a common entry point for attacks.
How to Avoid It
Run Containers as Non-Root
Use securityContext to enforce non-root users and restrict privileges:
securityContext:
runAsUser: 1000 # Non-root UID
runAsGroup: 3000 # Non-root GID
runAsNonRoot: true # Block root execution
readOnlyRootFilesystem: true # Make root filesystem read-only
allowPrivilegeEscalation: false # Prevent privilege escalation
Enforce Pod Security Standards
Replace deprecated PodSecurityPolicy with Pod Security Standards (PSS):
- Privileged: Unrestricted (avoid for production).
- Baseline: Prevents known privilege escalations (e.g., no privileged containers, no host namespaces).
- Restricted: Strictest—enforces non-root, read-only filesystems, etc.
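Pod Security Standards are enforced by the built-in Pod Security Admission controller, driven by namespace labels:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted  # Reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted     # Also warn clients on apply
```

Using `warn` (or `audit`) alongside `enforce` helps teams see violations before they become hard failures.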
Limit Network Traffic with Network Policies
Default to deny-all traffic and explicitly allow only required communication:
# Deny all ingress/egress in the "default" namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: default
spec:
podSelector: {} # Applies to all pods in the namespace
policyTypes:
- Ingress
- Egress
Then allow specific traffic (e.g., frontend → backend):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: default
spec:
podSelector:
matchLabels:
app: backend # Target backend pods
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend # Allow traffic from frontend pods
ports:
- protocol: TCP
port: 8080 # Only allow port 8080
Scan Images for Vulnerabilities
Integrate image scanning into CI/CD pipelines (e.g., Trivy, Clair, or AWS ECR image scanning) to block images with critical CVEs.
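For example, a CI step can fail the build when Trivy finds critical CVEs (the step layout is an assumption; adapt to your pipeline):

```yaml
# CI step (sketch): a non-zero exit code fails the pipeline on critical findings
- name: Scan image with Trivy
  run: |
    trivy image --severity CRITICAL --exit-code 1 my-app:v1.2.3
```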
6. Overcomplicating with Unnecessary Abstractions
What’s the Mistake?
Jumping to complex tools (Helm charts, Operators, Kustomize) before mastering core Kubernetes resources (Deployments, Services, ConfigMaps).
Why It’s a Problem
- Increased cognitive load: Teams waste time learning abstractions instead of focusing on application logic.
- Hidden complexity: Abstractions can obscure underlying Kubernetes objects, making debugging harder (e.g., “Why is my Helm release failing?”).
- Overhead: Maintaining custom Operators or Helm charts adds operational burden for simple apps (e.g., a static website doesn’t need an Operator).
How to Avoid It
Start with Core Resources
Use native Kubernetes resources for simple workloads:
- Deployments: For stateless apps (web servers, APIs).
- StatefulSets: For stateful apps (databases, message queues).
- ConfigMaps/Secrets: For configuration (avoid custom config operators unless needed).
Adopt Abstractions Only When Justified
Use tools like Helm or Kustomize when they solve a specific problem:
- Helm: For packaging apps with dependencies (e.g., a LAMP stack with multiple services).
- Kustomize: For managing environment-specific configs (dev/staging/prod) without duplicating YAML.
- Operators: For complex stateful apps (e.g., PostgreSQL with backups, scaling, and upgrades).
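As an illustration of the Kustomize use case, a shared base plus a thin per-environment overlay avoids duplicated YAML (file paths and patch names are illustrative):

```yaml
# base/kustomization.yaml
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/kustomization.yaml
resources:
  - ../../base
patches:
  - path: replica-count.yaml   # e.g., raise replicas for prod only
```

`kubectl apply -k overlays/prod` then renders the base with only the prod-specific deltas applied.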
7. Not Backing Up etcd
What’s the Mistake?
etcd is Kubernetes’ database, storing all cluster state (pods, secrets, deployments). Failing to back up etcd leaves you vulnerable to data loss from cluster corruption, accidental deletions, or node failures.
Why It’s a Problem
- No recovery path: Without backups, a catastrophic etcd failure (e.g., disk corruption, ransomware) means rebuilding the cluster from scratch.
- Data loss: Accidental deletion of critical resources (e.g., `kubectl delete namespace prod`) can’t be undone without a backup.
How to Avoid It
Automate etcd Backups
Use etcdctl (the etcd CLI) to create snapshots on clusters where you run the control plane yourself. On managed clusters (EKS, GKE, AKS), the provider operates and backs up etcd; there, focus on resource-level backups with tools such as Velero or Backup for GKE.
Example with etcdctl:
# Take a snapshot (replace endpoints and certs with your cluster details)
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /backups/etcd-snapshot-$(date +%Y%m%d).db
Test Restores Regularly
Backups are useless if they can’t be restored. Periodically test restoring a snapshot to a staging cluster to validate data integrity.
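A restore drill mirrors the snapshot command above (paths are illustrative; newer etcd versions move this subcommand to `etcdutl`, and on kubeadm clusters you would then point the etcd static pod at the restored data directory):

```shell
# Restore the snapshot into a fresh data directory
etcdctl snapshot restore /backups/etcd-snapshot-20240101.db \
  --data-dir=/var/lib/etcd-restored
```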
Store Backups Securely
- Off-cluster storage: Store backups in a separate location (e.g., S3, GCS) to avoid losing them if the cluster fails.
- Encrypt backups: Encrypt snapshots at rest (e.g., AWS S3 SSE, GCS customer-managed keys) to protect secrets stored in etcd.
8. Improper Namespace Usage
What’s the Mistake?
Dumping all workloads into the default namespace, leading to clutter, poor isolation, and difficulty managing resource quotas or RBAC.
Why It’s a Problem
- No isolation: A misconfiguration in one app (e.g., a Deployment with `replicas: 1000`) can starve resources for others in the same namespace.
- Hard to debug: `kubectl get pods` returns hundreds of pods, making it hard to find what’s relevant.
- No granular access control: RBAC and resource quotas are namespace-scoped; without namespaces, you can’t restrict teams to specific resources.
How to Avoid It
Organize Namespaces by Environment or Team
Use namespaces to isolate workloads:
- `dev`, `staging`, `prod`: Separate environments to prevent cross-environment interference.
- `team-alpha`, `team-bravo`: Isolate teams to enforce resource quotas and RBAC.
Enforce Resource Quotas
Prevent resource hoarding with ResourceQuota per namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-alpha-quota
namespace: team-alpha
spec:
hard:
pods: "10" # Max 10 pods
requests.cpu: "4" # Total CPU requests: 4 cores
requests.memory: "8Gi" # Total memory requests: 8Gi
limits.cpu: "8" # Total CPU limits: 8 cores
limits.memory: "16Gi" # Total memory limits: 16Gi
9. Neglecting Network Policies
What’s the Mistake?
Assuming Kubernetes’ default allow-all networking is secure. With no NetworkPolicy in place, every pod can reach every other pod, leaving workloads exposed to unnecessary traffic.
Why It’s a Problem
- Lateral movement: An attacker who compromises one pod can freely access other pods (e.g., databases, internal APIs) in the cluster.
- No defense in depth: Without network policies, your only security boundary is the cluster perimeter (e.g., ingress controllers).
How to Avoid It
Default to Deny
Start with a deny-all policy in every namespace to block all traffic by default:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
namespace: default
spec:
podSelector: {} # Applies to all pods in the namespace
policyTypes:
- Ingress # Block incoming traffic
- Egress # Block outgoing traffic
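Note that a default-deny egress policy also blocks DNS lookups, which breaks most workloads. A companion policy typically re-allows DNS to the cluster resolver (the labels below match a standard kube-dns/CoreDNS deployment, but verify them in your cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: default
spec:
  podSelector: {}              # All pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # Standard kube-dns/CoreDNS label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```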
Allow Only Required Traffic
Explicitly allow traffic for specific use cases (e.g., frontend → backend, backend → database):
# Allow frontend pods to access backend pods on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: default
spec:
podSelector:
matchLabels:
app: backend # Target backend pods
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend # Source: frontend pods
ports:
- protocol: TCP
port: 8080 # Only allow port 8080
10. Failing to Monitor Cluster and Application Health
What’s the Mistake?
Operating Kubernetes “blind”—not tracking metrics, logs, or events—leaving you unable to detect issues before they cause outages.
Why It’s a Problem
- Undetected failures: A pod may be crashing repeatedly, but without alerts, you won’t know until users report downtime.
- Resource leaks: Unbounded memory growth in an app may go unnoticed until nodes run out of memory.
- Debugging nightmares: Without logs or metrics, troubleshooting issues (e.g., “Why is the app slow?”) becomes guesswork.
How to Avoid It
Monitor Cluster Metrics with Prometheus + Grafana
- Prometheus: Collects metrics (CPU, memory, pod status, API server latency) from Kubernetes components and apps.
- Grafana: Visualizes metrics with dashboards (e.g., node resource usage, pod restart counts).
Key metrics to track:
- Node: CPU/memory usage, disk I/O, network throughput.
- Pod: Restart count, CPU/memory requests vs. limits, latency.
- Control plane: API server response time, etcd health, scheduler latency.
Centralize Logs
Use tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Loki to aggregate and query pod logs:
- Ship logs to a central system (e.g., Fluentd as a daemonset).
- Set up log retention policies to avoid storage bloat.
Alert on Critical Issues
Use Prometheus Alertmanager or cloud-native tools (e.g., AWS CloudWatch Alarms) to trigger alerts for:
- High CPU/memory usage (e.g., >90% of limits).
- Pod crashes/restarts (e.g., >5 restarts in 5 minutes).
- Node failures (e.g., a node being unavailable for 5 minutes).
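With kube-state-metrics feeding Prometheus, the restart alert above can be expressed as a rule like this (the threshold and labels are illustrative):

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodRestartingFrequently
        # More than 5 container restarts in the last 5 minutes
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```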
Conclusion
Kubernetes is a powerful tool, but its complexity makes it easy to fall into common traps. By avoiding these 10 mistakes—from misconfiguring resources to neglecting backups—you’ll build clusters that are more stable, secure, and maintainable.
Remember: Kubernetes is a journey. Start small, iterate, and always prioritize fundamentals like resource management, security, and monitoring. With these practices in place, you’ll unlock Kubernetes’ full potential while minimizing operational headaches.