Kubernetes & DevOps: Production-Grade Infrastructure

From Docker containers to Kubernetes clusters on GKE and EKS. Helm charts, Istio service mesh, CI/CD pipelines, SRE practices, and cost optimization -- all battle-tested on a production platform of 26 microservices.

By Jose Nobile | Updated 2026-04-23 | 13 min read

Docker: Container Foundations

Containerization is the foundation of modern infrastructure. Docker packages your application, its dependencies, and runtime environment into a portable, reproducible unit. Getting containers right at the foundation level prevents cascading problems in Kubernetes, CI/CD, and production.

Key practices for production Docker images:

  • Multi-stage builds -- Separate build and runtime stages. Build with full SDK, run with minimal runtime. Reduces image size by 60-80%.
  • Non-root users -- Never run containers as root. Create a dedicated user in the Dockerfile for security compliance.
  • Layer caching -- Order Dockerfile instructions from least to most frequently changed. Copy package.json before source code to cache dependency installation.
  • Health checks -- Define HEALTHCHECK instructions for proper container orchestration and load balancer integration.
  • Image scanning -- Integrate Trivy or Snyk in your CI pipeline to catch vulnerabilities before deployment.
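
The practices above combine naturally in a single Dockerfile. A minimal sketch for a Node.js service -- the base images, port, and `/healthz` endpoint are illustrative assumptions:

```dockerfile
# --- Build stage: full SDK, cacheable dependency layer ---
FROM node:20 AS build
WORKDIR /app
# Copy manifests first so the npm install layer stays cached
# until package.json actually changes
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# --- Runtime stage: minimal image, non-root user ---
FROM node:20-slim
WORKDIR /app
RUN groupadd -r app && useradd -r -g app app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
USER app
# Orchestrators and load balancers use this to gate traffic
HEALTHCHECK --interval=30s --timeout=3s \
  CMD node -e "fetch('http://localhost:3000/healthz').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

Because only the final stage ships, the SDK, dev dependencies, and source tree never reach production.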

Kubernetes on GKE & EKS

Managed Kubernetes services (GKE on Google Cloud, EKS on AWS) eliminate the complexity of cluster management while providing enterprise-grade reliability. The choice between GKE and EKS depends on your existing cloud investment and specific requirements.

GKE

Google Kubernetes Engine

Fastest cluster provisioning, Autopilot mode for hands-off management, built-in GKE Ingress with Google Cloud Load Balancer, tight integration with Cloud Build and Artifact Registry. Best cost-performance ratio for most workloads.

EKS

Amazon Elastic Kubernetes Service

Deep AWS integration (IAM roles for service accounts, EBS/EFS storage, ALB Ingress), Fargate for serverless pods, extensive marketplace of add-ons. Preferred when your data and services are already on AWS.

BEST PRACTICE

Cluster Design

Separate namespaces per environment (dev, staging, prod). Use node pools for workload isolation (CPU-intensive vs. memory-intensive). Enable cluster autoscaler with appropriate min/max bounds.
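
A sketch of the namespace-plus-node-pool pattern. The pool label follows GKE's `cloud.google.com/gke-nodepool` convention; names and resource figures are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod                # one namespace per environment: dev, staging, prod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-worker
  namespace: prod
spec:
  replicas: 2
  selector:
    matchLabels: { app: report-worker }
  template:
    metadata:
      labels: { app: report-worker }
    spec:
      # Pin memory-intensive workloads to their own node pool,
      # away from latency-sensitive API services
      nodeSelector:
        cloud.google.com/gke-nodepool: highmem-pool
      containers:
        - name: worker
          image: registry.example.com/report-worker:1.4.0
          resources:
            requests: { cpu: 500m, memory: 2Gi }
            limits: { memory: 4Gi }
```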

Helm Chart Management

Helm packages Kubernetes manifests into reusable, versioned charts. For a platform with 26 microservices, Helm is essential -- it standardizes deployment configurations, enables environment-specific overrides via values files, and provides atomic rollback capabilities.

A well-structured Helm chart strategy for microservice platforms:

  • Base chart -- A generic chart that handles 90% of your services (deployment, service, ingress, HPA, PDB). Services use values.yaml overrides for customization.
  • Environment values -- Separate values files per environment: values-dev.yaml, values-staging.yaml, values-prod.yaml. Keep secrets in a separate encrypted store.
  • Chart versioning -- Semantic versioning for charts. Breaking changes bump the major version. CI validates chart templates before merge.
  • Helmfile -- Declaratively manage multiple releases across environments. One command to sync all services in an environment.
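
The base-chart-plus-overrides strategy can be wired together with a Helmfile along these lines (chart path, service names, and file layout are illustrative assumptions):

```yaml
# helmfile.yaml -- declarative releases across environments
environments:
  staging:
    values: [envs/staging/globals.yaml]
  prod:
    values: [envs/prod/globals.yaml]
---
releases:
  # Every service reuses the same base chart; only values differ
  - name: billing-api
    namespace: '{{ .Environment.Name }}'
    chart: ./charts/base-service   # generic deployment/service/ingress/HPA/PDB
    version: 2.3.1
    values:
      - services/billing-api/values.yaml
      - services/billing-api/values-{{ .Environment.Name }}.yaml
  - name: catalog-api
    namespace: '{{ .Environment.Name }}'
    chart: ./charts/base-service
    version: 2.3.1
    values:
      - services/catalog-api/values.yaml
      - services/catalog-api/values-{{ .Environment.Name }}.yaml
```

A single `helmfile -e prod sync` then reconciles every release in that environment.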

Istio Service Mesh

Istio adds a transparent infrastructure layer that handles service-to-service communication, security, and observability without modifying application code. For microservice platforms, Istio provides capabilities that would otherwise require custom implementation in every service.

Traffic Management

Canary deployments with percentage-based traffic splitting, A/B testing via header-based routing, circuit breaking for failing services, automatic retries with configurable backoff, and request timeouts.
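
A canary split with retries and circuit breaking can be sketched with a VirtualService and DestinationRule pair; the service name, subsets, and thresholds here are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts: [checkout]
  http:
    - route:
        - destination: { host: checkout, subset: stable }
          weight: 90
        - destination: { host: checkout, subset: canary }
          weight: 10          # shift gradually: 10 -> 50 -> 100
      retries:
        attempts: 3
        perTryTimeout: 2s
      timeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
    - name: stable
      labels: { version: v1 }
    - name: canary
      labels: { version: v2 }
  trafficPolicy:
    outlierDetection:          # circuit breaking: eject misbehaving pods
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```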

Mutual TLS

Automatic mTLS between all services without code changes. Every service-to-service call is encrypted and authenticated. Certificate rotation happens transparently.
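
Enforcing mesh-wide mTLS takes a single resource in the root namespace -- a minimal sketch:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace => applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plaintext service-to-service traffic
```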

Observability

Distributed tracing (Jaeger/Zipkin), service-level metrics (latency, error rates, throughput), service dependency visualization (Kiali), and access logging for every request.

CI/CD Pipelines

Continuous Integration and Continuous Deployment pipelines automate the path from code commit to production deployment. A well-designed pipeline ensures every change is tested, validated, and deployed consistently -- eliminating human error from the release process.

GitOps and Platform Engineering (2026): GitOps adoption has crossed a critical threshold, with over 64% of enterprises reporting it as their primary delivery mechanism. Argo CD 3.3 (February 2026) closes several long-standing operational gaps including OIDC background token refresh, configurable Kubernetes API timeouts, and improved ApplicationSet UI. Flux continues to differentiate with its composable GitOps Toolkit that monitors image registries and auto-updates manifests. The broader trend is platform engineering standardization -- internal developer platforms (IDPs) now embed Argo CD or Flux as day-one tools, providing golden paths that enforce consistency and traceability while abstracting Kubernetes complexity from application developers.

Kubernetes 1.36 "Haru" (released April 22, 2026) ships with 70 tracked enhancements, including DRA reaching GA for GPU scheduling, HPAScaleToZero enabled by default, User Namespaces GA for rootless containers, OCI VolumeSource GA, and MutatingAdmissionPolicy graduating to stable. The Ingress NGINX Controller was officially retired on March 24, 2026 -- all teams should migrate to Gateway API. On the tooling side, Helm 4.1.4 (the current stable line) delivers Server-Side Apply by default and kstatus-based readiness checks, making helm install --wait reliable in CI/CD pipelines for the first time. Helm 4.2.0 is due May 13, 2026.

CI/CD

GitLab CI/CD

Jose's primary CI/CD platform. Integrated with GitLab repositories, Docker registry, and Kubernetes clusters. Features: multi-project pipelines, parent-child pipelines, dynamic environments, and auto DevOps.

GitHub Actions

Used for open-source and smaller projects. Matrix builds for cross-platform testing, reusable workflows for shared CI logic, GitHub Container Registry for images, and OIDC for secure cloud deployments.
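
A minimal matrix-build workflow illustrating the pattern (job layout and version matrix are assumptions):

```yaml
name: ci
on: [push, pull_request]
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        node: [18, 20, 22]
    runs-on: ${{ matrix.os }}   # one job per OS/Node combination, run in parallel
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test
```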

Pipeline Architecture

Standard stages: lint, test, build, scan, deploy-staging, integration-test, deploy-prod. Parallel execution where possible. Each stage has clear success/failure criteria and rollback triggers.
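
The stage layout above maps to a `.gitlab-ci.yml` skeleton like this; job scripts, chart paths, and image tags are illustrative:

```yaml
stages: [lint, test, build, scan, deploy-staging, integration-test, deploy-prod]

lint:
  stage: lint
  image: node:20
  script: [npm ci, npm run lint]

test:
  stage: test
  image: node:20
  script: [npm ci, npm test]

build:
  stage: build
  image: docker:27
  services: [docker:27-dind]
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

scan:
  stage: scan
  image: aquasec/trivy:latest
  script:
    # Fail the pipeline on high/critical CVEs before anything deploys
    - trivy image --exit-code 1 --severity HIGH,CRITICAL $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy-staging:
  stage: deploy-staging
  environment: staging
  script:
    - helm upgrade --install shop ./charts/shop -n staging -f values-staging.yaml --wait

deploy-prod:
  stage: deploy-prod
  environment: production
  script:
    - helm upgrade --install shop ./charts/shop -n prod -f values-prod.yaml --wait
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual            # one-click promotion after QA approval
```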

The CI/CD pipeline handles 26 microservices and 41 branded Android app variants. A single push to the main branch triggers parallel builds across all services, runs integration tests, and deploys to staging. Production deployment is a one-click promotion after QA approval.

SRE Practices

Site Reliability Engineering bridges development and operations, using software engineering practices to solve infrastructure problems. The core principles that make the difference in production:

SLOs & Error Budgets

Define Service Level Objectives for latency (p99 < 200ms), availability (99.9%), and error rate (< 0.1%). Error budgets determine when to freeze features and focus on reliability.
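
One way to turn the 99.9% availability SLO into an actionable alert is a Prometheus burn-rate rule. The metric name and labels follow common conventions and are assumptions, not the platform's actual metrics:

```yaml
groups:
  - name: slo-availability
    rules:
      - alert: ErrorBudgetFastBurn
        # 99.9% availability => 0.1% error budget.
        # Burning at 14.4x exhausts a 30-day budget in ~2 days,
        # which justifies paging a human.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Fast error-budget burn against the 99.9% availability SLO"
```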

Incident Response

Documented runbooks for common failures, on-call rotation with escalation paths, post-incident reviews (blameless), and automated remediation for known failure patterns.

Capacity Planning

Load testing before major releases, autoscaling policies based on real traffic patterns, resource quotas per namespace, and regular right-sizing reviews for cost optimization.

Change Management

Progressive rollouts (canary then full), feature flags for decoupling deploy from release, automated rollback on error rate increase, and deployment windows for high-risk changes.

Observability Stack

Observability means understanding what's happening inside your system by examining its outputs. The three pillars -- metrics, logs, and traces -- work together to provide complete visibility into system behavior.

  • Metrics -- Prometheus for collection, Grafana for visualization. Key metrics: request rate, error rate, latency (RED method), and resource utilization (CPU, memory, disk).
  • Logs -- Structured JSON logging, centralized with ELK stack or Loki. Correlation IDs across services for request tracing through log analysis.
  • Traces -- Distributed tracing with Jaeger or Tempo. Visualize request flow across microservices, identify bottlenecks, and measure service dependencies.
  • Alerting -- Alert on SLO violations, not on individual metric thresholds. PagerDuty or Opsgenie integration for on-call notifications with severity-based routing.
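
Tying the three pillars together hinges on shared identifiers in every log line. A sketch of one structured log event (field names are illustrative): `correlation_id` follows the request across services, while `trace_id` links the log to its distributed trace in Jaeger or Tempo.

```json
{
  "timestamp": "2026-04-23T10:15:02.412Z",
  "level": "error",
  "service": "billing-api",
  "message": "payment capture failed",
  "correlation_id": "req-7f3a9c",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "latency_ms": 1840
}
```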

Cost Optimization

SAVINGS

Right-Sizing

Analyze actual resource usage vs. requested limits. Most services over-provision by 40-60%. Use VPA (Vertical Pod Autoscaler) recommendations to right-size requests and limits.
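
A VPA in recommendation-only mode is a low-risk way to start; the target name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: billing-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-api
  updatePolicy:
    updateMode: "Off"   # recommend only -- don't evict pods automatically
```

`kubectl describe vpa billing-api` then shows suggested requests, which can be fed back into the Helm values files rather than applied live.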

Spot / Preemptible Nodes

Run stateless workloads on spot instances (60-90% savings). Configure pod disruption budgets and pod anti-affinity to handle node evictions gracefully.
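
The eviction-safety side of spot usage can be sketched as a PodDisruptionBudget plus a toleration and anti-affinity in the pod template. The GKE spot taint is shown; app names are illustrative, and the second document is a pod-spec fragment, not a complete manifest:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalog-api
spec:
  minAvailable: 2            # keep serving through spot-node evictions
  selector:
    matchLabels: { app: catalog-api }
---
# Pod template fragment: tolerate the spot taint and spread replicas
# so a single node reclaim can't take out every copy
spec:
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels: { app: catalog-api }
```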

Cluster Autoscaler

Scale nodes automatically based on pod scheduling needs. Configure scale-down delays to avoid thrashing. Use node pool priorities to prefer cheaper instance types.

Resource Quotas

Set namespace-level quotas to prevent runaway resource consumption. Implement LimitRanges for default container limits. Review and adjust quotas monthly.
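
Both mechanisms together, with illustrative numbers -- the quota caps the namespace's total footprint while the LimitRange fills in defaults for containers that declare nothing:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: staging
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }  # used when requests are omitted
      default: { cpu: 500m, memory: 512Mi }         # used when limits are omitted
```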

Real Example: Production Infrastructure

The platform runs 26 microservices on Google Kubernetes Engine, serving multiple countries with localized billing (Stripe + MercadoPago). The infrastructure supports 41 branded Android app variants built and deployed through a unified CI/CD pipeline.

26 Microservices on GKE

Node.js/TypeScript services running on GKE with Helm-managed deployments. Autoscaling based on CPU and custom metrics. Istio for service mesh, Prometheus + Grafana for observability.

41 Android App CI/CD

A single GitLab CI pipeline builds 41 branded variants of the Android app. Each variant has its own branding assets, API endpoints, and Play Store listing. Single push deploys all variants, saving ~2 weeks per release cycle.

Multi-Country Deployment

Region-specific configurations for tax calculations, payment providers, and compliance requirements. Infrastructure as code ensures consistent environments across regions with minimal config drift.
