coderain guide

10 Common Kubernetes Mistakes and How to Avoid Them

Kubernetes (K8s) has become the de facto standard for container orchestration, empowering teams to deploy, scale, and manage containerized applications with unprecedented flexibility. However, its power comes with complexity: Kubernetes has a steep learning curve, and even experienced users often stumble into common pitfalls. These mistakes can lead to unstable deployments, security vulnerabilities, performance bottlenecks, or operational headaches. In this blog, we’ll explore **10 of the most common Kubernetes mistakes** and provide actionable solutions to avoid them. Whether you’re a beginner setting up your first cluster or a seasoned engineer optimizing production workloads, this guide will help you build more resilient, secure, and efficient Kubernetes environments.

Table of Contents

  1. Misconfiguring Resource Limits and Requests
  2. Neglecting Liveness and Readiness Probes
  3. Using the “latest” Image Tag
  4. Insecure Secrets Management
  5. Ignoring Security Best Practices
  6. Overcomplicating with Unnecessary Abstractions
  7. Not Backing Up etcd
  8. Improper Namespace Usage
  9. Neglecting Network Policies
  10. Failing to Monitor Cluster and Application Health
  11. Conclusion

1. Misconfiguring Resource Limits and Requests

What’s the Mistake?

One of the most frequent mistakes is either omitting resource requests/limits entirely or setting them incorrectly (e.g., setting limits too low, requests too high, or mismatching CPU/memory ratios).

Why It’s a Problem

  • No requests/limits: Kubernetes has no way to prioritize workloads. Your app might get starved of resources (CPU throttling, OOM kills) if other pods consume too much, or it might hog resources, leaving none for others.
  • Limits too low: Pods may crash unexpectedly due to resource exhaustion (e.g., OOMKilled errors).
  • Requests too high: Wasted cluster capacity—nodes may be underutilized because Kubernetes reserves requested resources, even if the pod doesn’t use them.

How to Avoid It

Understand Requests vs. Limits

  • Requests: The minimum resources a pod needs. Kubernetes uses this to schedule pods on nodes with enough free resources.
  • Limits: The maximum resources a pod can use. Exceeding CPU limits leads to throttling; exceeding memory limits leads to termination (OOMKilled).

Best Practices:

  • Set requests based on actual usage: Use tools like kubectl top pod <pod-name> or monitoring tools (Prometheus) to measure typical resource consumption.
  • Set limits slightly higher than peak usage: Leave buffer for traffic spikes (e.g., 20-30% above peak).
  • Avoid overcommitting memory: Unlike CPU, memory is not compressible—set limits carefully to prevent OOM kills.
  • Use namespaced resource quotas: Enforce limits at the namespace level to prevent teams from hoarding resources (e.g., ResourceQuota objects).

Example Configuration:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: my-app:v1
    resources:
      requests:  # Minimum required
        cpu: 100m  # 0.1 CPU cores
        memory: 128Mi  # 128 MiB (mebibytes)
      limits:  # Maximum allowed
        cpu: 500m  # 0.5 CPU cores
        memory: 256Mi  # 256 MiB (mebibytes)
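To ground request values in measured usage rather than guesswork, one option (assuming the Vertical Pod Autoscaler addon is installed and a Deployment named my-app exists) is to run VPA in recommendation-only mode:

```yaml
# Sketch: VPA in recommendation-only mode (assumes the VPA addon is installed
# and a Deployment named "my-app" exists)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Only record recommendations; never evict pods
```

Inspect the suggestions with kubectl describe vpa my-app-vpa and fold them back into your manifests.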

2. Neglecting Liveness and Readiness Probes

What’s the Mistake?

Skipping liveness and readiness probes, or configuring them incorrectly (e.g., using the same endpoint for both, setting overly aggressive timeouts).

Why It’s a Problem

  • No liveness probe: Kubernetes can’t detect if a pod is “stuck” (e.g., deadlocks, unresponsive). The pod will keep running even if it’s non-functional.
  • No readiness probe: Kubernetes will send traffic to a pod before it’s fully initialized (e.g., still loading data, connecting to a database), causing failed requests.
  • Misconfigured probes: Overly strict probes may restart healthy pods (liveness) or delay traffic to ready pods (readiness).

How to Avoid It

Understand Probe Types

  • Liveness Probe: Checks if the pod is “alive.” If it fails, Kubernetes restarts the pod.
  • Readiness Probe: Checks if the pod is “ready” to receive traffic. If it fails, Kubernetes removes the pod from the service endpoint.

Best Practices:

  • Use distinct endpoints for liveness and readiness (e.g., /health/live vs. /health/ready).
  • Set appropriate delays: Use initialDelaySeconds to let the app initialize (e.g., 30s for a Java app).
  • Choose the right probe type:
    • httpGet: For HTTP apps (e.g., check a /health endpoint).
    • tcpSocket: For TCP services (e.g., check if a port is open).
    • exec: Run a command (e.g., cat /tmp/healthy).

Example Configuration:

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: my-app:v1
    ports:
    - containerPort: 8080
    livenessProbe:  # Restart if unresponsive
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30  # Wait for app to start
      periodSeconds: 10  # Check every 10s
      failureThreshold: 3  # Restart after 3 failures
    readinessProbe:  # Stop traffic if not ready
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5  # Quick check for readiness
      periodSeconds: 5
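For apps with long or variable startup times, a startupProbe (available since Kubernetes 1.16) is often a better fit than a large initialDelaySeconds: liveness and readiness checks are suspended until it succeeds. A sketch reusing the /health/live endpoint above:

```yaml
    startupProbe:  # Liveness/readiness checks wait until this succeeds
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 30  # Up to 30 * 10s = 300s for the app to start
```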

3. Using the “latest” Image Tag

What’s the Mistake?

Relying on the latest tag for container images (e.g., image: my-app:latest).

Why It’s a Problem

  • Non-deterministic deployments: The latest tag is a moving target. It may point to a new version between deployments, leading to unexpected changes (e.g., breaking updates, untested code).
  • Rollback headaches: If latest updates automatically, rolling back requires manually specifying an older tag (which you may not track).
  • Debugging complexity: You can’t trace which version of the image is running in production.

How to Avoid It

  • Use specific, immutable tags: Tag images with version numbers (e.g., v1.2.3), commit hashes (e.g., sha-abc123), or semantic versions (e.g., 1.2.x for patch updates).
  • Pin to digests: Use image digests (e.g., my-app@sha256:abcd1234...) for absolute immutability—digests never change, even if the tag is updated.
  • Automate tagging: Use CI/CD pipelines (GitHub Actions, GitLab CI) to tag images with unique identifiers (e.g., Git commit hashes) on each build.

Example:

# Bad: Uses mutable "latest" tag
image: my-app:latest

# Good: Uses specific version tag
image: my-app:v1.2.3

# Best: Uses immutable digest
image: my-app@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
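Automated tagging from CI might look like this sketch of a GitHub Actions step (the registry name is illustrative; GITHUB_SHA is provided by Actions):

```yaml
# Illustrative GitHub Actions step: tag the image with the Git commit SHA
- name: Build and push image
  run: |
    TAG="registry.example.com/my-app:${GITHUB_SHA::7}"
    docker build -t "$TAG" .
    docker push "$TAG"
```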

4. Insecure Secrets Management

What’s the Mistake?

Treating Kubernetes Secrets as a silver bullet for security: storing secrets in plaintext, committing them to version control, or assuming Secrets are encrypted by default.

Why It’s a Problem

  • Kubernetes Secrets are not encrypted by default: They are base64-encoded (not encrypted) and stored in etcd. Anyone with access to etcd or kubectl get secrets can decode them.
  • Secrets in environment variables: Exposing secrets via env makes them visible in tools like kubectl describe pod or in pod logs (if accidentally logged).
  • No rotation: Failing to rotate secrets (e.g., database passwords) increases the risk of long-term exposure if leaked.

How to Avoid It

Secure Secrets at Rest

  • Enable etcd encryption: Configure Kubernetes encryption at rest (an EncryptionConfiguration passed to the API server), ideally backed by an external KMS provider (e.g., AWS KMS).
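A minimal sketch of such an EncryptionConfiguration, passed to the API server via --encryption-provider-config (the key shown is a placeholder):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>  # Placeholder; generate your own
      - identity: {}  # Fallback: allows reading still-unencrypted data during migration
```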

Avoid Environment Variables

  • Mount secrets as files instead of environment variables. This limits exposure:
    volumes:
    - name: db-secrets
      secret:
        secretName: db-creds
    containers:
    - name: app
      volumeMounts:
      - name: db-secrets
        mountPath: /secrets
        readOnly: true  # Prevent modification

Use External Secret Managers

For production, avoid relying solely on Kubernetes Secrets. Use tools like:

  • HashiCorp Vault: Manages secrets with dynamic rotation, access control, and auditing.
  • AWS Secrets Manager / GCP Secret Manager: Cloud-native secret storage with auto-rotation.
  • Sealed Secrets: Encrypt secrets so they can be safely committed to Git (decrypted only in the cluster).

Restrict Access with RBAC

Use Role-Based Access Control (RBAC) to limit who can view secrets:

# Example: Allow "dev-team" to view secrets in "dev" namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: dev
  name: secret-reader
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-secret-reader
  namespace: dev
subjects:
- kind: Group
  name: dev-team
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io

5. Ignoring Security Best Practices

What’s the Mistake?

Overlooking foundational security practices: running containers as root, skipping security contexts, missing network policies, or using unpatched images.

Why It’s a Problem

  • Root privileges: A compromised container with root access can escape to the host node.
  • Unrestricted network access: Without network policies, any pod can communicate with any other pod, increasing the attack surface.
  • Outdated images: Vulnerabilities in base images (e.g., alpine:3.7 with known CVEs) are a common entry point for attacks.

How to Avoid It

Run Containers as Non-Root

Use securityContext to enforce non-root users and restrict privileges:

securityContext:
  runAsUser: 1000          # Non-root UID
  runAsGroup: 3000         # Non-root GID
  runAsNonRoot: true       # Block root execution
  readOnlyRootFilesystem: true  # Make root filesystem read-only
  allowPrivilegeEscalation: false  # Prevent privilege escalation

Enforce Pod Security Standards

Replace the deprecated PodSecurityPolicy (removed in Kubernetes 1.25) with Pod Security Standards (PSS):

  • Privileged: Unrestricted (avoid for production).
  • Baseline: Blocks known privilege escalations (e.g., privileged containers, host namespaces, hostPath volumes), but still permits root users.
  • Restricted: Strictest level: enforces non-root users, no privilege escalation, dropped capabilities, and a seccomp profile.
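PSS levels are enforced per namespace by the built-in Pod Security Admission controller (stable since Kubernetes 1.25) via namespace labels, for example:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted  # Reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted     # Warn on apply
    pod-security.kubernetes.io/audit: restricted    # Record violations in audit logs
```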

Limit Network Traffic with Network Policies

Default to deny-all traffic and explicitly allow only required communication. Mistake #9 below covers this pattern in detail, with a default-deny policy and a targeted frontend-to-backend allow rule.

Scan Images for Vulnerabilities

Integrate image scanning into CI/CD pipelines (e.g., Trivy, Clair, or AWS ECR image scanning) to block images with critical CVEs.

6. Overcomplicating with Unnecessary Abstractions

What’s the Mistake?

Jumping to complex tools (Helm charts, Operators, Kustomize) before mastering core Kubernetes resources (Deployments, Services, ConfigMaps).

Why It’s a Problem

  • Increased cognitive load: Teams waste time learning abstractions instead of focusing on application logic.
  • Hidden complexity: Abstractions can obscure underlying Kubernetes objects, making debugging harder (e.g., “Why is my Helm release failing?”).
  • Overhead: Maintaining custom Operators or Helm charts adds operational burden for simple apps (e.g., a static website doesn’t need an Operator).

How to Avoid It

Start with Core Resources

Use native Kubernetes resources for simple workloads:

  • Deployments: For stateless apps (web servers, APIs).
  • StatefulSets: For stateful apps (databases, message queues).
  • ConfigMaps/Secrets: For configuration (avoid custom config operators unless needed).

Adopt Abstractions Only When Justified

Use tools like Helm or Kustomize when they solve a specific problem:

  • Helm: For packaging apps with dependencies (e.g., a LAMP stack with multiple services).
  • Kustomize: For managing environment-specific configs (dev/staging/prod) without duplicating YAML.
  • Operators: For complex stateful apps (e.g., PostgreSQL with backups, scaling, and upgrades).
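As an illustration of when Kustomize earns its keep, here is a hypothetical prod overlay (file layout and patch name are assumptions) that reuses a shared base instead of duplicating YAML:

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # Shared Deployment/Service manifests
namePrefix: prod-           # Distinguish prod objects
patches:
  - path: replica-patch.yaml  # e.g., raise replica count for production
```

Rendering it with kubectl kustomize overlays/prod shows exactly which plain Kubernetes objects you get, which keeps the abstraction debuggable.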

7. Not Backing Up etcd

What’s the Mistake?

etcd is Kubernetes’ database, storing all cluster state (pods, secrets, deployments). Failing to back up etcd leaves you vulnerable to data loss from cluster corruption, accidental deletions, or node failures.

Why It’s a Problem

  • No recovery path: Without backups, a catastrophic etcd failure (e.g., disk corruption, ransomware) means rebuilding the cluster from scratch.
  • Data loss: Accidental deletion of critical resources (e.g., kubectl delete namespace prod) can’t be undone without a backup.

How to Avoid It

Automate etcd Backups

Use etcdctl (the etcd CLI) to create snapshots on self-managed clusters. On managed clusters (EKS, GKE, AKS) you don’t have direct etcd access; the provider backs up the control plane, so rely on provider backup features or a tool like Velero to back up cluster resources and volumes.

Example with etcdctl:

# Take a snapshot (replace endpoints and certs with your cluster details)
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backups/etcd-snapshot-$(date +%Y%m%d).db

Test Restores Regularly

Backups are useless if they can’t be restored. Periodically test restoring a snapshot to a staging cluster to validate data integrity.

Store Backups Securely

  • Off-cluster storage: Store backups in a separate location (e.g., S3, GCS) to avoid losing them if the cluster fails.
  • Encrypt backups: Encrypt snapshots at rest (e.g., AWS S3 SSE, GCS customer-managed keys) to protect secrets stored in etcd.

8. Improper Namespace Usage

What’s the Mistake?

Dumping all workloads into the default namespace, leading to clutter, poor isolation, and difficulty managing resource quotas or RBAC.

Why It’s a Problem

  • No isolation: A misconfiguration in one app (e.g., a Deployment with replicas: 1000) can starve resources for others in the same namespace.
  • Hard to debug: kubectl get pods returns hundreds of pods, making it hard to find what’s relevant.
  • No granular access control: RBAC and resource quotas are namespace-scoped—without namespaces, you can’t restrict teams to specific resources.

How to Avoid It

Organize Namespaces by Environment or Team

Use namespaces to isolate workloads:

  • dev, staging, prod: Separate environments to prevent cross-environment interference.
  • team-alpha, team-bravo: Isolate teams to enforce resource quotas and RBAC.

Enforce Resource Quotas

Prevent resource hoarding with ResourceQuota per namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    pods: "10"  # Max 10 pods
    requests.cpu: "4"  # Total CPU requests: 4 cores
    requests.memory: "8Gi"  # Total memory requests: 8Gi
    limits.cpu: "8"  # Total CPU limits: 8 cores
    limits.memory: "16Gi"  # Total memory limits: 16Gi

9. Neglecting Network Policies

What’s the Mistake?

Assuming the default behavior is secure: Kubernetes applies no NetworkPolicy by default, so all pod-to-pod traffic is allowed and pods are exposed to unnecessary traffic.

Why It’s a Problem

  • Lateral movement: An attacker who compromises one pod can freely access other pods (e.g., databases, internal APIs) in the cluster.
  • No defense in depth: Without network policies, your only security boundary is the cluster perimeter (e.g., ingress controllers).

How to Avoid It

Default to Deny

Start with a deny-all policy in every namespace to block all traffic by default:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector: {}  # Applies to all pods in the namespace
  policyTypes:
  - Ingress  # Block incoming traffic
  - Egress   # Block outgoing traffic

Allow Only Required Traffic

Explicitly allow traffic for specific use cases (e.g., frontend → backend, backend → database):

# Allow frontend pods to access backend pods on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend  # Target backend pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend  # Source: frontend pods
    ports:
    - protocol: TCP
      port: 8080  # Only allow port 8080

10. Failing to Monitor Cluster and Application Health

What’s the Mistake?

Operating Kubernetes “blind”—not tracking metrics, logs, or events—leaving you unable to detect issues before they cause outages.

Why It’s a Problem

  • Undetected failures: A pod may be crashing repeatedly, but without alerts, you won’t know until users report downtime.
  • Resource leaks: Unbounded memory growth in an app may go unnoticed until nodes run out of memory.
  • Debugging nightmares: Without logs or metrics, troubleshooting issues (e.g., “Why is the app slow?”) becomes guesswork.

How to Avoid It

Monitor Cluster Metrics with Prometheus + Grafana

  • Prometheus: Collects metrics (CPU, memory, pod status, API server latency) from Kubernetes components and apps.
  • Grafana: Visualizes metrics with dashboards (e.g., node resource usage, pod restart counts).

Key metrics to track:

  • Node: CPU/memory usage, disk I/O, network throughput.
  • Pod: Restart count, CPU/memory requests vs. limits, latency.
  • Control plane: API server response time, etcd health, scheduler latency.

Centralize Logs

Use tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Loki to aggregate and query pod logs:

  • Ship logs to a central system (e.g., Fluentd or Fluent Bit running as a DaemonSet).
  • Set up log retention policies to avoid storage bloat.

Alert on Critical Issues

Use Prometheus Alertmanager or cloud-native tools (e.g., AWS CloudWatch Alarms) to trigger alerts for:

  • High CPU/memory usage (e.g., >90% of limits).
  • Pod crashes/restarts (e.g., >5 restarts in 5 minutes).
  • Node failures (e.g., a node being unavailable for 5 minutes).
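The pod-restart alert above might be expressed as a Prometheus alerting rule like this sketch (assumes kube-state-metrics is being scraped; the group and alert names are illustrative):

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 5 minutes"
```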

Conclusion

Kubernetes is a powerful tool, but its complexity makes it easy to fall into common traps. By avoiding these 10 mistakes—from misconfiguring resources to neglecting backups—you’ll build clusters that are more stable, secure, and maintainable.

Remember: Kubernetes is a journey. Start small, iterate, and always prioritize fundamentals like resource management, security, and monitoring. With these practices in place, you’ll unlock Kubernetes’ full potential while minimizing operational headaches.
