AWS: Cloud Infrastructure for Production Workloads

A deep technical guide to building, securing, and optimizing production infrastructure on Amazon Web Services. Covers compute (EC2, ECS, EKS, Lambda), Aurora RDS, S3 + CloudFront, SES, VPC networking, IAM security, CloudWatch monitoring, Route 53 DNS, Secrets Manager, EventBridge, cost optimization, and Infrastructure as Code.

1. Compute: EC2, ECS, EKS, Lambda

EC2 (Elastic Compute Cloud)

EC2 provides full control over virtual machines. You select instance types, configure networking, and manage the OS. Best for workloads that need persistent state, GPU access, or specific kernel configurations.

  • Instance families: General purpose (t3, m6i), compute-optimized (c6i), memory-optimized (r6i), GPU (p4d, g5)
  • Pricing models: On-Demand, Reserved Instances (1yr/3yr commits for 40-60% savings), Spot Instances (up to 90% savings for fault-tolerant workloads)
  • Auto Scaling Groups with launch templates, target tracking policies, and predictive scaling
  • Placement groups for low-latency inter-instance communication (cluster, spread, partition)
  • EBS volume types: gp3 (general), io2 (high IOPS), st1 (throughput), sc1 (cold storage)

ECS (Elastic Container Service)

ECS is AWS's native container orchestration platform. It runs Docker containers without the operational complexity of Kubernetes. It has two launch types: EC2 (you manage the instances) and Fargate (serverless).

  • Task definitions: CPU/memory limits, container images, environment variables, secrets from SSM Parameter Store or Secrets Manager
  • Services: Desired count, deployment strategies (rolling update, blue/green via CodeDeploy), circuit breaker for failed deployments
  • Fargate pricing: Pay per vCPU-second and GB-second. No EC2 instance management. Best for variable workloads
  • Service Connect and Cloud Map for service discovery between microservices
  • ALB integration with path-based and host-based routing, health checks, and sticky sessions
# ECS task definition (key fields; the account ID and execution role name are placeholders)
{
  "family": "api-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "api",
    "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
    "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
    "secrets": [
      {"name": "DB_PASSWORD", "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/prod/db-password"}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/api-service",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "api"
      }
    }
  }]
}
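
Fargate only accepts certain CPU and memory pairings, so a task definition like the one above is worth checking before it reaches a deployment. The sketch below is a minimal validator. It assumes the commonly documented Fargate size table up to 4 vCPU; the table and the `isValidFargateSize` helper are illustrative names, not part of any AWS SDK, so confirm the values against the current Fargate documentation.

```typescript
// Builds an inclusive list of memory sizes in MiB.
function memoryRange(start: number, end: number, step: number): number[] {
  const sizes: number[] = [];
  for (let value = start; value <= end; value += step) {
    sizes.push(value);
  }
  return sizes;
}

// Assumed Fargate table: CPU units -> allowed memory values (MiB).
// Larger sizes (8 and 16 vCPU) are left out of this sketch.
const FARGATE_MEMORY_MIB: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: memoryRange(1024, 4096, 1024),
  1024: memoryRange(2048, 8192, 1024),
  2048: memoryRange(4096, 16384, 1024),
  4096: memoryRange(8192, 30720, 1024),
};

// Task definitions carry cpu and memory as strings, so parse before comparing.
function isValidFargateSize(cpu: string, memory: string): boolean {
  const allowed = FARGATE_MEMORY_MIB[Number(cpu)];
  return allowed !== undefined && allowed.indexOf(Number(memory)) !== -1;
}

console.log(isValidFargateSize("512", "1024")); // true (the pairing used above)
console.log(isValidFargateSize("512", "512"));  // false
```

A check like this fits naturally in CI, before the task definition is registered.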

EKS (Elastic Kubernetes Service)

EKS is AWS's managed Kubernetes service. AWS handles the control plane (API server, etcd, scheduler) while you manage worker nodes or use Fargate for serverless pods. Best for teams already invested in the Kubernetes ecosystem or running multi-cloud workloads.

  • Managed node groups: AWS provisions and manages EC2 instances as worker nodes. Automatic AMI updates and draining
  • Fargate profiles: Run pods without managing nodes. Define which pods run on Fargate by namespace/label selectors
  • Add-ons: CoreDNS, kube-proxy, VPC CNI, EBS CSI driver. Managed by AWS with automatic version updates
  • IAM Roles for Service Accounts (IRSA): Map Kubernetes service accounts to IAM roles. Fine-grained pod-level permissions
  • Cluster Autoscaler / Karpenter: Scale nodes based on pending pod resource requests. Karpenter is faster and more flexible
EKS costs $0.10/hour for the control plane plus worker node costs. For teams not already using Kubernetes, ECS is simpler and more cost-effective. Choose EKS when you need Kubernetes-specific features like Helm charts, custom operators, or multi-cloud portability.

Lambda (Serverless Functions)

Lambda runs code without provisioning servers. You pay only for execution time (billed per 1ms). Ideal for event-driven architectures, API endpoints with variable traffic, and scheduled tasks.

  • Triggers: API Gateway, S3 events, SQS, SNS, DynamoDB Streams, EventBridge (formerly CloudWatch Events)
  • Cold start mitigation: Provisioned Concurrency (keeps instances warm), SnapStart (Java 11+, Python 3.12+, .NET 8+ -- GA), smaller package sizes
  • Memory configuration: 128MB to 10,240MB. CPU scales proportionally with memory allocation
  • Layers: Share common code/dependencies across functions. Up to 5 layers per function
  • Limits: 15-minute max execution, 10GB max memory, 250MB deployment package (unzipped), 1000 concurrent executions (default)
Cold starts add 100ms-2s latency on first invocation. SnapStart (now GA for Python 3.12+ and .NET 8+, not just Java) reduces cold starts by up to 90% -- from 2s to under 200ms -- with minimal code changes. For latency-sensitive APIs, use SnapStart or Provisioned Concurrency; consider ECS Fargate for sustained traffic.
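
To decide between Lambda and a container for a given traffic level, it helps to put a number on the per-millisecond billing model. The sketch below estimates a monthly bill from invocation count, average duration, and memory. The two price constants are assumptions (us-east-1 x86 list prices as commonly published), and the free tier is ignored; check the current Lambda pricing page before relying on the result.

```typescript
// Assumed list prices; not taken from the text above.
const PRICE_PER_GB_SECOND = 0.0000166667;
const PRICE_PER_MILLION_REQUESTS = 0.2;

// Estimates monthly Lambda cost in USD, ignoring the free tier.
function lambdaMonthlyCost(
  invocations: number,
  avgDurationMs: number,
  memoryMb: number
): number {
  if (memoryMb < 128 || memoryMb > 10240) {
    throw new RangeError("memory must be between 128 and 10240 MB");
  }
  // Compute is billed in GB-seconds: duration (s) x memory (GB) x invocations.
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  const computeCost = gbSeconds * PRICE_PER_GB_SECOND;
  const requestCost = (invocations / 1e6) * PRICE_PER_MILLION_REQUESTS;
  return computeCost + requestCost;
}

// 10 million invocations per month, 120 ms average, 512 MB: roughly $12.
console.log(lambdaMonthlyCost(10e6, 120, 512).toFixed(2));
```

Because CPU scales with memory, raising memory can shorten duration enough to leave the bill flat or lower, so it is worth running the estimate at a few memory settings.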

2. RDS Aurora: MySQL-Compatible High Availability

Aurora is a MySQL/PostgreSQL-compatible relational database built for the cloud. It separates compute from storage, replicates data 6 ways across 3 AZs, and delivers up to 5x the throughput of standard MySQL on the same hardware.

  • Storage: Auto-scales from 10GB to 128TB. No need to pre-provision. Data replicated 6 times across 3 AZs
  • Read replicas: Up to 15 replicas, with replication lag typically well under 100 ms. Automatic failover usually completes in under 30 seconds
  • Aurora Serverless v2: Scales from 0.5 to 128 ACUs. Scales in increments of 0.5 ACU. Ideal for variable workloads
  • Backtrack: Rewind the database to any point in the last 72 hours without restoring from backup
  • Global Database: Cross-region replication with <1 second lag. RPO of 1 second, RTO of <1 minute
  • Performance Insights: Identify top SQL queries, wait events, and bottlenecks. Free for 7 days of retention
# Aurora cluster endpoint configuration (TypeORM)
{
  type: 'mysql',
  replication: {
    master: {
      host: 'mydb-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com',
      port: 3306,
      username: 'admin',
      password: process.env.DB_PASSWORD,
      database: 'myapp_prod'
    },
    slaves: [{
      host: 'mydb-cluster.cluster-ro-xxxxx.us-east-1.rds.amazonaws.com',
      port: 3306,
      username: 'admin',
      password: process.env.DB_PASSWORD,
      database: 'myapp_prod'
    }]
  },
  extra: {
    connectionLimit: 20,
    connectTimeout: 10000,
    waitForConnections: true
  }
}
Aurora's reader endpoint load-balances new connections across all read replicas (at connection time, not per query). Use the cluster endpoint for writes and the reader endpoint for reads in your application's connection configuration.
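
If you are not using an ORM that splits reads and writes for you, the routing rule is simple enough to sketch by hand. The example below is a deliberately naive classifier: plain SELECT statements outside a transaction go to the reader endpoint, everything else goes to the writer. The `pickEndpoint` helper is hypothetical, and the hostnames are the placeholders from the configuration above.

```typescript
const WRITER_ENDPOINT = "mydb-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com";
const READER_ENDPOINT = "mydb-cluster.cluster-ro-xxxxx.us-east-1.rds.amazonaws.com";

// Routes a statement to the reader only when it is safe to serve from a replica.
function pickEndpoint(sql: string, inTransaction: boolean): string {
  const isSelect = /^\s*select\b/i.test(sql);
  // SELECT ... FOR UPDATE takes locks, so it must run on the writer.
  const takesLocks = /\bfor\s+update\b/i.test(sql);
  if (isSelect && !takesLocks && !inTransaction) {
    return READER_ENDPOINT;
  }
  return WRITER_ENDPOINT;
}

console.log(pickEndpoint("SELECT * FROM bookings WHERE id = 1", false)); // reader endpoint
console.log(pickEndpoint("UPDATE bookings SET status = 'paid'", false)); // writer endpoint
```

Replicas lag slightly behind the writer, so reads that must see a write made moments earlier should also go to the cluster endpoint.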

3. S3 and CloudFront: Storage + CDN

S3 (Simple Storage Service)

S3 provides virtually unlimited object storage with 99.999999999% (11 nines) durability.

  • Storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant/Flexible/Deep Archive
  • Lifecycle policies: Automatically transition objects between classes based on age
  • Versioning: Keep every version of every object. Protect against accidental deletes
  • Server-side encryption: SSE-S3 (default), SSE-KMS (auditable), SSE-C (customer keys)
  • Pre-signed URLs: Grant temporary access (upload/download) without exposing credentials
  • S3 Event Notifications: Trigger Lambda, SQS, or SNS on object creation/deletion

CloudFront (CDN)

CloudFront distributes content from 450+ edge locations worldwide with single-digit millisecond latency.

  • Origin Access Control (OAC): Secure S3 access so objects are only accessible via CloudFront
  • Cache behaviors: Different TTLs, headers, and compression settings per path pattern
  • Lambda@Edge and CloudFront Functions: Run code at edge locations for URL rewrites, A/B testing, auth
  • Real-time logs: Stream access logs to Kinesis for real-time analytics
  • Price classes: Restrict edge locations to reduce cost (PriceClass_100 serves only from the lowest-priced regions, mainly North America and Europe)

Production Pattern: S3 + CloudFront + OAC

# CloudFormation: S3 bucket with CloudFront distribution
Resources:
  AssetsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: app-static-assets
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToIA
            Status: Enabled
            Transitions:
              - StorageClass: STANDARD_IA
                TransitionInDays: 90

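  OAC:
    Type: AWS::CloudFront::OriginAccessControl
    Properties:
      OriginAccessControlConfig:
        Name: app-static-assets-oac
        OriginAccessControlOriginType: s3
        SigningBehavior: always
        SigningProtocol: sigv4

  # Also required (not shown): a bucket policy that grants s3:GetObject to
  # cloudfront.amazonaws.com, conditioned on this distribution's ARN.
  CDN:
    Type: AWS::CloudFront::Distribution
    Properties:
      DistributionConfig:
        Enabled: true
        Origins:
          - Id: S3Origin
            DomainName: !GetAtt AssetsBucket.RegionalDomainName
            OriginAccessControlId: !GetAtt OAC.Id
            S3OriginConfig:
              OriginAccessIdentity: ''  # left empty when OAC is used
        DefaultCacheBehavior:
          TargetOriginId: S3Origin
          ViewerProtocolPolicy: redirect-to-https
          CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # CachingOptimized
          Compress: true
        PriceClass: PriceClass_100
        ViewerCertificate:
          AcmCertificateArn: !Ref SSLCert  # ACM certificate in us-east-1, defined elsewhere
          SslSupportMethod: sni-only
          MinimumProtocolVersion: TLSv1.2_2021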

4. SES: Email Service at Scale

Amazon SES (Simple Email Service) handles transactional and marketing email at $0.10 per 1,000 emails. It provides high deliverability when configured correctly with authentication records and reputation management.

  • Authentication: SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail), DMARC alignment
  • Sending modes: SMTP interface (port 587), AWS SDK (SendEmail/SendRawEmail API), SendBulkTemplatedEmail for batch
  • Configuration sets: Track delivery, bounces, complaints, opens, and clicks. Route events to SNS, Kinesis, or CloudWatch
  • Suppression list: Automatically stops sending to addresses that bounced or complained. Reduces bounce rate
  • Dedicated IPs: Isolate your sending reputation from other SES users. Worth considering for high-volume senders
  • Templates: Store email templates in SES. Use Handlebars-style placeholders for personalization
// Node.js: Send transactional email via SES SDK v3
import { SESv2Client, SendEmailCommand } from '@aws-sdk/client-sesv2';

const ses = new SESv2Client({ region: 'us-east-1' });

await ses.send(new SendEmailCommand({
  FromEmailAddress: 'noreply@example.app',
  Destination: { ToAddresses: [userEmail] },
  Content: {
    Template: {
      TemplateName: 'BookingConfirmation',
      TemplateData: JSON.stringify({
        userName: 'Jose',
        className: 'CrossFit 7AM',
        date: '2026-03-17',
        location: 'Santiago Centro'
      })
    }
  },
  ConfigurationSetName: 'app-transactional'
}));
New SES accounts start in sandbox mode (can only send to verified addresses). Request production access early -- approval can take 24-48 hours. Always set up bounce/complaint handling before going live.
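
Bounce and complaint handling usually means subscribing to SES feedback over SNS and suppressing the affected addresses. The sketch below shows the decision logic only. It assumes the JSON shape SES uses for SNS feedback notifications (`notificationType`, `bounce.bouncedRecipients`, `complaint.complainedRecipients`); the `SesFeedback` type and `addressesToSuppress` helper are illustrative names, and persisting the result is left to your own store.

```typescript
interface SesRecipient {
  emailAddress: string;
}

// Subset of the SES feedback notification fields this sketch relies on.
interface SesFeedback {
  notificationType: "Bounce" | "Complaint" | "Delivery";
  bounce?: { bounceType: string; bouncedRecipients: SesRecipient[] };
  complaint?: { complainedRecipients: SesRecipient[] };
}

// Returns the addresses that should never be mailed again.
function addressesToSuppress(message: SesFeedback): string[] {
  if (message.notificationType === "Bounce" && message.bounce) {
    // Only hard (Permanent) bounces are suppressed; transient ones may succeed later.
    if (message.bounce.bounceType !== "Permanent") {
      return [];
    }
    return message.bounce.bouncedRecipients.map((r) => r.emailAddress.toLowerCase());
  }
  if (message.notificationType === "Complaint" && message.complaint) {
    return message.complaint.complainedRecipients.map((r) => r.emailAddress.toLowerCase());
  }
  return [];
}

// The SNS envelope carries the SES notification as a JSON string in its Message field.
function parseSnsMessage(snsMessageBody: string): SesFeedback {
  return JSON.parse(snsMessageBody) as SesFeedback;
}
```

In a Lambda subscribed to the SNS topic, you would call `parseSnsMessage` on each record's message and write the returned addresses to whatever table your sending code checks before calling SES.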

5. VPC Networking: Subnets, Security Groups, NACLs

VPC Architecture

A VPC (Virtual Private Cloud) is your isolated network within AWS. Every production workload runs inside a VPC. Proper network design is the foundation of AWS security -- it determines what can communicate with what.

  • CIDR planning: Use /16 for production VPCs (65,536 IPs). Avoid overlapping CIDRs if you need VPC peering or Transit Gateway
  • Public subnets: Contain ALBs and NAT Gateways. Route table points 0.0.0.0/0 to Internet Gateway
  • Private subnets: Contain EC2 instances, ECS tasks, RDS databases, ElastiCache. Route table points 0.0.0.0/0 to NAT Gateway for outbound internet
  • Multi-AZ: Deploy subnets across at least 2 Availability Zones for high availability. 3 AZs for production workloads
  • VPC Endpoints: Gateway endpoints (S3, DynamoDB) are free. Interface endpoints (most other services) cost ~$0.01/hour + data charges. Eliminate NAT Gateway data processing fees
  • VPC Flow Logs: Capture IP traffic metadata for all network interfaces. Send to CloudWatch Logs or S3 for security analysis
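
The CIDR arithmetic behind this planning is easy to verify in code. The sketch below counts addresses in a prefix and carves subnets out of a VPC range the way Terraform's `cidrsubnet(prefix, newbits, netnum)` does for IPv4. The helper names are made up for illustration, and the base address is assumed to already be the network address of the prefix.

```typescript
// Converts dotted-quad IPv4 to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  const p = ip.split(".").map(Number);
  return ((p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3]) >>> 0;
}

// Converts an unsigned 32-bit integer back to dotted-quad notation.
function intToIp(n: number): string {
  return [n >>> 24, (n >>> 16) & 255, (n >>> 8) & 255, n & 255].join(".");
}

// Number of addresses in a prefix, e.g. /16 -> 65536.
function addressCount(prefixLength: number): number {
  return 2 ** (32 - prefixLength);
}

// IPv4-only analogue of Terraform's cidrsubnet().
function cidrSubnet(prefix: string, newBits: number, netNum: number): string {
  const pieces = prefix.split("/");
  const newLength = Number(pieces[1]) + newBits;
  if (newLength > 32) {
    throw new RangeError("resulting prefix is longer than /32");
  }
  if (netNum < 0 || netNum >= 2 ** newBits) {
    throw new RangeError("netNum does not fit in newBits");
  }
  // Each subnet spans addressCount(newLength) addresses; step netNum subnets forward.
  const start = ipToInt(pieces[0]) + netNum * addressCount(newLength);
  return intToIp(start) + "/" + newLength;
}

console.log(addressCount(16));                  // 65536
console.log(cidrSubnet("10.0.0.0/16", 8, 0));   // 10.0.0.0/24
console.log(cidrSubnet("10.0.0.0/16", 8, 10));  // 10.0.10.0/24
```

Keep in mind that AWS reserves five addresses in every subnet, so a /24 yields 251 usable IPs rather than 256.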

Security Groups

  • Stateful: Return traffic is automatically allowed, so responses never need a matching rule. Outbound is open by default, so most of the work is defining inbound rules
  • Reference other security groups as sources instead of CIDRs for internal traffic
  • ALB SG: Allow 443 from 0.0.0.0/0. App SG: Allow 3000 only from ALB SG. DB SG: Allow 3306 only from App SG
  • Default deny: Security groups deny all inbound by default. Explicitly allow only what is needed

Network ACLs

  • Stateless: You must define both inbound and outbound rules, including ephemeral port ranges
  • Processed in order by rule number. First matching rule wins. Explicit deny is possible
  • Use as defense-in-depth: Block known malicious CIDRs at the subnet level before traffic reaches instances
  • Default NACL allows all traffic. Custom NACLs deny all by default
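
The ordered, first-match-wins evaluation is the part of NACLs that most often surprises people, so here is a small simulation of it. This is a simplified sketch: it handles inbound IPv4 TCP rules only, and the `NaclRule` shape and `evaluateNacl` function are illustrative, not an AWS API.

```typescript
interface NaclRule {
  ruleNumber: number;
  cidr: string; // IPv4 CIDR, e.g. "198.51.100.0/24"
  fromPort: number;
  toPort: number;
  action: "allow" | "deny";
}

function ipv4ToInt(ip: string): number {
  const p = ip.split(".").map(Number);
  return ((p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3]) >>> 0;
}

// True when the address falls inside the CIDR block.
function cidrContains(cidr: string, ip: string): boolean {
  const pieces = cidr.split("/");
  const prefixLength = Number(pieces[1]);
  if (prefixLength === 0) {
    return true; // 0.0.0.0/0 matches everything
  }
  const mask = (0xffffffff << (32 - prefixLength)) >>> 0;
  return ((ipv4ToInt(pieces[0]) & mask) >>> 0) === ((ipv4ToInt(ip) & mask) >>> 0);
}

// Walks rules in ascending rule-number order; the first match decides.
function evaluateNacl(rules: NaclRule[], sourceIp: string, port: number): "allow" | "deny" {
  const ordered = rules.slice().sort((a, b) => a.ruleNumber - b.ruleNumber);
  for (const rule of ordered) {
    const portMatches = port >= rule.fromPort && port <= rule.toPort;
    if (portMatches && cidrContains(rule.cidr, sourceIp)) {
      return rule.action;
    }
  }
  return "deny"; // nothing matched: the implicit final deny applies
}

// Hypothetical rule set: block one range, then allow HTTPS from anywhere.
const exampleRules: NaclRule[] = [
  { ruleNumber: 200, cidr: "0.0.0.0/0", fromPort: 443, toPort: 443, action: "allow" },
  { ruleNumber: 100, cidr: "198.51.100.0/24", fromPort: 0, toPort: 65535, action: "deny" },
];

console.log(evaluateNacl(exampleRules, "198.51.100.7", 443)); // deny
console.log(evaluateNacl(exampleRules, "203.0.113.5", 443));  // allow
```

Rule 100 is evaluated before rule 200 regardless of the order in the array, which is why the blocked range never reaches the broader allow.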

Production VPC Layout

# Terraform: Multi-AZ VPC with public and private subnets
data "aws_availability_zones" "az" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "myapp-prod" }
}

resource "aws_subnet" "public" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index)       # 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
  availability_zone = data.aws_availability_zones.az.names[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "public-${count.index}" }
}

resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet("10.0.0.0/16", 8, count.index + 10)  # 10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24
  availability_zone = data.aws_availability_zones.az.names[count.index]
  tags = { Name = "private-${count.index}" }
}

# Security group chain: ALB -> App -> DB
resource "aws_security_group" "alb" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "app" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}

resource "aws_security_group" "db" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }
}

6. IAM: Roles, Policies, and MFA

Principle of Least Privilege

IAM (Identity and Access Management) controls who can do what in your AWS account. Every API call is evaluated against IAM policies. Misconfigured policies are among the most common causes of AWS security breaches.

  • Never use root account for daily operations. Enable MFA on root and all IAM users
  • Use IAM roles (not long-lived access keys) for EC2 instances, ECS tasks, and Lambda functions
  • Policy types: Identity-based (attached to users/roles), Resource-based (attached to S3/SQS/etc.), Permission boundaries
  • Use AWS Organizations with SCPs (Service Control Policies) to restrict what member accounts can do
  • IAM Access Analyzer: Identifies resources shared with external accounts. Run continuously
  • CloudTrail: Log every API call across all AWS services. Enable in all regions. Send to centralized S3 bucket
// Least-privilege policy for an ECS task that reads from S3 and writes to SQS
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::app-static-assets/*"
    },
    {
      "Effect": "Allow",
      "Action": ["sqs:SendMessage"],
      "Resource": "arn:aws:sqs:us-east-1:123456789:notification-queue"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789:parameter/prod/*"
    }
  ]
}
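
To make the evaluation order concrete, the sketch below implements a heavily simplified version of it: an explicit Deny always wins, a matching Allow grants access, and anything else is implicitly denied. It is not how IAM is implemented. It ignores conditions, principals, SCPs, permission boundaries, and resource-based policies, and the `IamStatement` type and `isAllowed` function are illustrative only.

```typescript
interface IamStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string;
}

// Matches IAM-style wildcards: * for any run of characters, ? for exactly one.
function wildcardMatch(pattern: string, value: string): boolean {
  const escaped = pattern
    .replace(/[.+^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*")
    .replace(/\?/g, ".");
  return new RegExp("^" + escaped + "$").test(value);
}

function isAllowed(statements: IamStatement[], action: string, resource: string): boolean {
  let allowed = false;
  for (const statement of statements) {
    // Action names are case-insensitive; resource ARNs are compared as written.
    const actionHit = statement.Action.some((a) =>
      wildcardMatch(a.toLowerCase(), action.toLowerCase())
    );
    if (!actionHit || !wildcardMatch(statement.Resource, resource)) {
      continue;
    }
    if (statement.Effect === "Deny") {
      return false; // explicit deny overrides any allow
    }
    allowed = true;
  }
  return allowed; // no matching Allow means implicit deny
}

// The S3 statement from the policy above, plus a hypothetical Deny for illustration.
const taskPolicy: IamStatement[] = [
  { Effect: "Allow", Action: ["s3:GetObject"], Resource: "arn:aws:s3:::app-static-assets/*" },
  { Effect: "Deny", Action: ["s3:*"], Resource: "arn:aws:s3:::app-static-assets/private/*" },
];

console.log(isAllowed(taskPolicy, "s3:GetObject", "arn:aws:s3:::app-static-assets/logo.png"));      // true
console.log(isAllowed(taskPolicy, "s3:PutObject", "arn:aws:s3:::app-static-assets/logo.png"));      // false
console.log(isAllowed(taskPolicy, "s3:GetObject", "arn:aws:s3:::app-static-assets/private/k.pem")); // false
```

For real policies, the IAM policy simulator gives authoritative answers; a toy like this is only useful for reasoning about the precedence rules.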

Additional Security Services

  • AWS WAF: Protect ALBs and CloudFront from SQL injection, XSS, and rate-limit abusive IPs
  • Secrets Manager: Rotate database passwords automatically. Never store secrets in plaintext environment variables or code
  • GuardDuty: ML-based threat detection. Monitors CloudTrail, VPC Flow Logs, and DNS logs
  • AWS Config: Track resource configuration history and evaluate compliance rules continuously
  • Security Hub: Aggregates findings from GuardDuty, Inspector, Macie, and third-party tools into a single dashboard

7. Cost Optimization: Reserved, Spot, Savings Plans

AWS costs can spiral quickly without active management. Applied together, these strategies can cut spending by 30-60% on production workloads.

  • Right-sizing: Use AWS Compute Optimizer to analyze CPU/memory utilization. Downsize over-provisioned instances. Most teams over-provision by 40-60%
  • Reserved Instances / Savings Plans: Commit to 1-year or 3-year usage for 30-60% discount. Compute Savings Plans are the most flexible (apply across EC2, Fargate, Lambda)
  • Spot Instances: Use for stateless workloads, batch processing, CI/CD runners. Combine with On-Demand via mixed instance policies in ASGs
  • S3 Lifecycle Policies: Move infrequently accessed data to Standard-IA after 30 days, Glacier after 90 days. Archive old logs to Deep Archive
  • NAT Gateway costs: NAT Gateways charge $0.045/GB of data processed. Use VPC endpoints for S3, DynamoDB, and other AWS services to avoid NAT charges
  • Turn off dev/staging: Schedule non-production environments to shut down outside business hours. Use Instance Scheduler or Lambda-based automation
  • Data transfer: Keep traffic within the same AZ when possible. Use CloudFront to reduce origin data transfer costs. Avoid cross-region replication unless required for DR
  • Cost monitoring: Set up AWS Budgets with alerts at 50%, 80%, and 100% thresholds. Use Cost Explorer's daily granularity to catch anomalies early
The single highest-impact action for most teams: buy Compute Savings Plans for your baseline steady-state usage, and run everything above baseline on Spot or On-Demand.
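
The NAT Gateway versus VPC endpoint trade-off from the list above comes down to a break-even calculation. The sketch below uses the $0.045/GB NAT processing fee and the roughly $0.01/hour interface endpoint fee quoted earlier. The $0.01/GB endpoint data charge and the 730-hour month are assumptions, and the NAT Gateway's own hourly fee is excluded because you typically keep the NAT for other traffic anyway.

```typescript
const NAT_PER_GB = 0.045;        // NAT data processing fee, from the text above
const ENDPOINT_PER_HOUR = 0.01;  // interface endpoint, per AZ, from the text above
const ENDPOINT_PER_GB = 0.01;    // assumed endpoint data processing charge
const HOURS_PER_MONTH = 730;     // assumed average month

// Monthly NAT data processing cost for traffic to one AWS service.
function natMonthlyCost(gbPerMonth: number): number {
  return gbPerMonth * NAT_PER_GB;
}

// Monthly cost of one interface endpoint deployed in azCount AZs.
function endpointMonthlyCost(gbPerMonth: number, azCount: number): number {
  return azCount * ENDPOINT_PER_HOUR * HOURS_PER_MONTH + gbPerMonth * ENDPOINT_PER_GB;
}

// Monthly traffic (GB) above which the endpoint is cheaper than NAT.
function breakEvenGb(azCount: number): number {
  return (azCount * ENDPOINT_PER_HOUR * HOURS_PER_MONTH) / (NAT_PER_GB - ENDPOINT_PER_GB);
}

// With 3 AZs the endpoint pays for itself at roughly 626 GB per month.
console.log(Math.round(breakEvenGb(3)));
```

Gateway endpoints for S3 and DynamoDB have no hourly or per-GB charge, so for those two services the endpoint wins at any volume.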

8. Infrastructure as Code: CloudFormation & Terraform

CloudFormation

  • AWS-native IaC. YAML/JSON templates. Tight integration with every AWS service
  • Stacks and nested stacks for modular infrastructure
  • Change sets: Preview changes before applying. Drift detection to find manual changes
  • Stack policies: Prevent stack updates from modifying or replacing critical resources (RDS, S3). Pair with DeletionPolicy: Retain to guard against deletion
  • Rollback on failure: Automatically reverts to previous state if deployment fails

Terraform

  • Multi-cloud IaC by HashiCorp. HCL language. Provider ecosystem for AWS, GCP, Azure, Cloudflare, etc.
  • State management: Remote state in S3 + DynamoDB locking. State locking prevents concurrent modifications
  • Modules: Reusable infrastructure components. Terraform Registry has thousands of community modules
  • Plan/Apply workflow: Always review terraform plan before applying. CI/CD integration with plan as PR comment
  • Import: Bring existing resources under Terraform management without recreation

Terraform Example: Aurora + VPC

resource "aws_rds_cluster" "aurora" {
  cluster_identifier     = "myapp-prod"
  engine                 = "aurora-mysql"
  engine_version         = "8.0.mysql_aurora.3.04.0"
  database_name          = "myapp"
  master_username        = "admin"
  master_password        = var.db_password
  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.aurora.id]
  backup_retention_period = 14
  preferred_backup_window = "03:00-04:00"
  deletion_protection     = true
  storage_encrypted       = true
  kms_key_id             = aws_kms_key.rds.arn

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 16
  }
}

resource "aws_rds_cluster_instance" "writer" {
  identifier         = "myapp-prod-writer"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version
}

resource "aws_rds_cluster_instance" "reader" {
  identifier         = "myapp-prod-reader"
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.aurora.engine
  engine_version     = aws_rds_cluster.aurora.engine_version
}

9. CloudWatch Monitoring and Observability

CloudWatch is the central monitoring and observability service for all AWS resources. It collects metrics, logs, and traces. Without proper CloudWatch configuration, production incidents go undetected until users report them.

  • Metrics: Every AWS service emits default metrics (CPU, network, errors). Custom metrics for application-level data (queue depth, active users, response times)
  • Alarms: Trigger notifications or Auto Scaling actions when metrics cross thresholds. Use composite alarms to combine multiple conditions
  • Logs: Centralize logs from ECS tasks, Lambda functions, EC2 instances, and API Gateway. Use Logs Insights to query with SQL-like syntax
  • Dashboards: Build real-time operational dashboards. Cross-account and cross-region dashboards for multi-account setups
  • Container Insights: Automatic metrics for ECS and EKS clusters: CPU, memory, network, and disk per task/pod
  • Anomaly Detection: ML-based anomaly detection on metrics. Automatically adjusts to seasonal patterns. Reduces alert noise
  • Metric filters: Extract numeric values from log data and create CloudWatch metrics. Example: count 5xx responses per minute from application logs in CloudWatch Logs
# CloudWatch alarm for high API error rate
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "app-api-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 50
  alarm_description   = "API 5xx errors exceeded 50 per 5-minute period for 2 consecutive periods"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    TargetGroup  = aws_lb_target_group.api.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
}
Set up CloudWatch alarms on day one, not after the first incident. Key alarms: CPU > 80%, memory > 85%, 5xx error rate > 1%, RDS connections > 80% of max, queue depth growing for more than 10 minutes.
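
The interaction between `period`, `threshold`, and `evaluation_periods` in the alarm above is easier to see in a small simulation. The sketch below models a `GreaterThanThreshold` alarm that fires only when every one of the last N datapoints breaches. It is a simplification: the real service also supports "M out of N" datapoints and configurable missing-data handling, and `evaluateAlarm` is an illustrative name.

```typescript
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// datapoints: one aggregated value per period, oldest first.
function evaluateAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number
): AlarmState {
  if (datapoints.length < evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = datapoints.slice(-evaluationPeriods);
  // GreaterThanThreshold is strict: a value equal to the threshold does not breach.
  const allBreaching = recent.every((value) => value > threshold);
  return allBreaching ? "ALARM" : "OK";
}

// 5xx counts per 5-minute period, with threshold 50 and 2 evaluation periods.
console.log(evaluateAlarm([10, 60, 70], 50, 2)); // ALARM
console.log(evaluateAlarm([60, 70, 20], 50, 2)); // OK
```

With two evaluation periods of 300 seconds, a single bad period never pages anyone; the error rate has to stay elevated for about ten minutes.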

10. Route 53: DNS and Traffic Management

Route 53 is AWS's DNS service, backed by a 100% availability SLA. It handles domain registration, DNS resolution, and health-check-based routing. Supports public and private hosted zones.

  • Routing policies: Simple, weighted (A/B testing), latency-based (route to closest region), failover (active-passive DR), geolocation, multi-value answer
  • Alias records: AWS-specific record type that maps directly to ALBs, CloudFront, S3 website endpoints, and other AWS resources. No charge for alias queries to AWS resources
  • Health checks: Monitor endpoint availability from multiple global locations. Failover to standby when primary is unhealthy. Integrates with CloudWatch alarms
  • Private hosted zones: DNS resolution within VPCs. Internal service names (api.internal.example.com) that resolve to private IPs
  • DNSSEC: Sign hosted zones to protect against DNS spoofing. You supply a customer managed KMS key for the key-signing key (KSK); Route 53 manages the zone-signing key (ZSK)
# Route 53: Latency-based routing with health checks
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }

  set_identifier = "us-east-1"
  latency_routing_policy {
    region = "us-east-1"
  }
}
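
The record above is one of several latency records, one per region, and Route 53 picks among them per query. The sketch below models that selection: the lowest-latency healthy region wins. The `RegionRecord` shape and `pickRegion` function are illustrative, the latency figures are made up, and the fail-open branch (treating every record as healthy when none is) reflects my understanding of documented Route 53 behavior, so treat it as an assumption.

```typescript
interface RegionRecord {
  region: string;
  latencyMs: number; // measured latency from the resolver's vantage point
  healthy: boolean;  // result of evaluate_target_health or a health check
}

function pickRegion(records: RegionRecord[]): string | null {
  if (records.length === 0) {
    return null;
  }
  const healthy = records.filter((r) => r.healthy);
  // Fail open: if nothing is healthy, consider every record rather than answer nothing.
  const candidates = healthy.length > 0 ? healthy : records;
  const best = candidates.reduce((a, b) => (b.latencyMs < a.latencyMs ? b : a));
  return best.region;
}

// Hypothetical measurements for a user in Santiago.
console.log(pickRegion([
  { region: "us-east-1", latencyMs: 120, healthy: true },
  { region: "sa-east-1", latencyMs: 35, healthy: true },
])); // sa-east-1
```

Setting `evaluate_target_health = true`, as in the record above, is what feeds the `healthy` flag in this model.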

11. Secrets Manager

Secrets Manager stores and rotates database credentials, API keys, and tokens. It eliminates hardcoded secrets and provides automatic rotation with zero-downtime credential updates.

  • Automatic rotation: Lambda-based rotation for RDS, Redshift, and DocumentDB credentials. Custom rotation for any secret type. Configurable rotation intervals (30, 60, 90 days)
  • Cross-account access: Share secrets across AWS accounts using resource-based policies. Useful for shared services in multi-account architectures
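  • ECS/Lambda integration: Reference secrets directly in ECS task definitions, which inject them into the container at startup. In Lambda, fetch them at runtime with the SDK or the AWS Parameters and Secrets Lambda Extension instead of copying values into environment variables. Secrets are encrypted at rest with KMS and decrypted only on retrieval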
  • Versioning: Each secret maintains current and previous versions. Applications can reference specific versions or always fetch the latest
  • Pricing: $0.40/secret/month + $0.05/10,000 API calls. Far cheaper than a credential leak incident
// Fetch secret from Secrets Manager in Node.js
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

const client = new SecretsManagerClient({ region: 'us-east-1' });

async function getDbCredentials() {
  const response = await client.send(
    new GetSecretValueCommand({ SecretId: 'myapp/prod/aurora-credentials' })
  );
  return JSON.parse(response.SecretString);
  // { username: "admin", password: "rotated-password-xyz", host: "...", port: 3306 }
}
Never store secrets in plaintext environment variables, unencrypted SSM parameters (String type rather than SecureString), or source code. Use Secrets Manager for all credentials. Enable automatic rotation and audit access via CloudTrail.
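
Since every `GetSecretValue` call is billed and adds latency, it is common to cache the result in memory for a few minutes. The sketch below is a generic time-based cache, assuming nothing about the SDK: `cachedSecret` is a hypothetical helper that wraps any fetch function, and the injectable clock exists only to make it testable.

```typescript
// Wraps a fetch function so its result is reused until ttlMs has elapsed.
function cachedSecret<T>(
  fetchSecret: () => T,
  ttlMs: number,
  now: () => number = Date.now
): () => T {
  let entry: { value: T; expiresAt: number } | null = null;
  return () => {
    const currentTime = now();
    if (entry === null || currentTime >= entry.expiresAt) {
      // First call, or the cached value has expired: fetch again.
      entry = { value: fetchSecret(), expiresAt: currentTime + ttlMs };
    }
    return entry.value;
  };
}

// Usage with the function above (caches the returned Promise for five minutes):
//   const getCachedDbCredentials = cachedSecret(getDbCredentials, 5 * 60 * 1000);
//   const creds = await getCachedDbCredentials();
```

Keep the TTL well below your rotation interval so a rotated password is picked up quickly. Note that when the wrapped function returns a Promise, a rejected Promise is cached too until the TTL expires, so production code may want to clear the entry on failure.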

12. EventBridge: Event-Driven Architecture

EventBridge is a serverless event bus that connects applications using events. It decouples producers from consumers, enabling scalable event-driven architectures. It replaces CloudWatch Events with richer filtering and more targets.

  • Event buses: Default bus receives AWS service events. Custom buses for application events. Partner buses for SaaS integrations (Datadog, PagerDuty, Stripe)
  • Rules and patterns: Filter events using content-based matching. Match on event source, detail-type, and any field in the event payload using prefix, suffix, numeric, and exists patterns
  • Targets: Route matched events to Lambda, SQS, SNS, Step Functions, API Gateway, Kinesis, ECS tasks, and 20+ other targets
  • Scheduler: EventBridge Scheduler for cron and rate-based schedules. Replaces CloudWatch Events for scheduled tasks. One-time schedules for deferred actions
  • Archive and replay: Archive events for replay during debugging or recovery. Filter by date range and event pattern
  • Schema registry: Automatically discovers event schemas from your bus. Generates code bindings for TypeScript, Python, Java
// EventBridge: Send custom application event
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eb = new EventBridgeClient({ region: 'us-east-1' });

await eb.send(new PutEventsCommand({
  Entries: [{
    Source: 'myapp.bookings',
    DetailType: 'BookingConfirmed',
    Detail: JSON.stringify({
      bookingId: 'bk-12345',
      userId: 'usr-67890',
      className: 'CrossFit 7AM',
      locationId: 'loc-santiago-centro',
      timestamp: new Date().toISOString()
    }),
    EventBusName: 'app-events'
  }]
}));
// Rule targets: Lambda (send confirmation email via SES),
// SQS (update analytics), Step Functions (trigger post-booking workflow)
EventBridge is the backbone of event-driven architectures on AWS. Use it instead of direct Lambda-to-Lambda calls or hard-wired SQS queues. It gives you filtering, retries, dead-letter queues, and replay out of the box.
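
Content-based matching is easier to reason about with a local model of it. The sketch below implements a small subset of EventBridge pattern semantics: exact values, `prefix`, `exists`, and `numeric`, with nested objects matched recursively. It is an approximation for testing patterns offline, not the service's matcher. It skips suffix, anything-but, and array-valued event fields, and the `seats` field in the sample event is hypothetical.

```typescript
type Matcher =
  | string
  | number
  | { prefix: string }
  | { exists: boolean }
  | { numeric: (string | number)[] };

interface EventPattern {
  [field: string]: Matcher[] | EventPattern;
}

type EventData = { [field: string]: unknown };

function compare(op: string, value: number, operand: number): boolean {
  switch (op) {
    case "<": return value < operand;
    case "<=": return value <= operand;
    case "=": return value === operand;
    case ">": return value > operand;
    case ">=": return value >= operand;
    default: return false;
  }
}

function matchesValue(matcher: Matcher, value: unknown): boolean {
  if (typeof matcher === "string" || typeof matcher === "number") {
    return matcher === value;
  }
  if ("prefix" in matcher) {
    return typeof value === "string" && value.indexOf(matcher.prefix) === 0;
  }
  if ("exists" in matcher) {
    return (value !== undefined) === matcher.exists;
  }
  if (typeof value !== "number") {
    return false;
  }
  // numeric is a flat list of operator/operand pairs, all of which must hold.
  for (let i = 0; i + 1 < matcher.numeric.length; i += 2) {
    if (!compare(String(matcher.numeric[i]), value, Number(matcher.numeric[i + 1]))) {
      return false;
    }
  }
  return true;
}

// Every field in the pattern must match; within a field, any listed matcher may match.
function matchesPattern(pattern: EventPattern, event: EventData): boolean {
  return Object.keys(pattern).every((field) => {
    const expected = pattern[field];
    const actual = event[field];
    if (Array.isArray(expected)) {
      return expected.some((m) => matchesValue(m, actual));
    }
    return typeof actual === "object" && actual !== null &&
      matchesPattern(expected, actual as EventData);
  });
}

const bookingEvent: EventData = {
  source: "myapp.bookings",
  "detail-type": "BookingConfirmed",
  detail: { bookingId: "bk-12345", locationId: "loc-santiago-centro", seats: 2 },
};

const bookingRule: EventPattern = {
  source: ["myapp.bookings"],
  detail: { locationId: [{ prefix: "loc-" }], seats: [{ numeric: [">", 0, "<=", 5] }] },
};

console.log(matchesPattern(bookingRule, bookingEvent)); // true
```

Note that delivered events use lowercase `source`, `detail-type`, and `detail`, even though `PutEvents` takes `Source`, `DetailType`, and `Detail`.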

13. Amazon Bedrock: Managed AI/ML Platform

Amazon Bedrock is AWS's fully managed service for building generative AI applications. It provides access to foundation models from Anthropic (Claude), Amazon (Titan), Meta (Llama), Mistral, Cohere, and others through a unified API. Bedrock handles infrastructure, scaling, and security so you focus on application logic rather than model hosting.

  • Claude models on Bedrock: Claude Opus 4.7 (available April 17, 2026 -- the most intelligent Opus model), Claude Opus 4.6 (February 2026), Claude Sonnet 4.6 (February 2026), Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.1 are all available. Bedrock's next-generation inference engine with dynamic scheduling and scaling logic improves availability for steady-state workloads. Supports both 200K and 1M context windows for processing extensive documents and codebases. The Bedrock Marketplace now hosts nearly 100 serverless foundation models from 10+ providers
  • Bedrock Agents and AgentCore Runtime: Build autonomous AI agents that can reason, plan, and execute multi-step tasks. Define action groups specifying APIs the agent can call, connect knowledge bases for domain-specific RAG, and orchestrate complex workflows without writing agent orchestration code. The AgentCore Runtime adds a first-class A2A Protocol contract so agents built in Strands, LangGraph, OpenAI Agents SDK, or Google ADK can interoperate with agents in other clouds out of the box
  • Knowledge Bases (RAG): Fully managed Retrieval Augmented Generation. Ingest documents from S3, chunk and embed them automatically, store in a managed vector database, and query with automatic context injection. Eliminates custom RAG pipeline development
  • Guardrails: Configurable safety policies for AI applications. Content and word filters, prompt attack detection, denied topic classification, PII redaction, and hallucination detection with Automated Reasoning checks. Blocks up to 88% of harmful content with 99% accuracy on correct response identification
  • Security and privacy: Data never leaves your AWS account and is never used to train models. VPC isolation, IAM role-based access, encryption in transit and at rest. All API calls logged in CloudTrail for compliance auditing
  • Claude Code integration: Set CLAUDE_CODE_USE_BEDROCK=1 to route all Claude Code traffic through Bedrock. Traffic stays within your VPC, costs appear on your AWS bill, and IAM policies control who can use AI services
// Invoke Claude on Bedrock (AWS SDK v3)
import { BedrockRuntimeClient, InvokeModelCommand }
  from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

const response = await client.send(new InvokeModelCommand({
  modelId: "anthropic.claude-sonnet-4-6-20260217-v1:0",
  contentType: "application/json",
  body: JSON.stringify({
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content: "Analyze this architecture for security risks."
    }]
  })
}));

const result = JSON.parse(
  new TextDecoder().decode(response.body)
);
For AI consulting engagements, Bedrock is the recommended starting point for AWS-centric organizations. It provides enterprise-grade security, predictable pricing, and seamless integration with existing AWS infrastructure -- no need to manage GPU instances or model deployments.
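
The parsed `result` from the invocation above still needs unpacking before it is useful. The sketch below pulls the text out of the response and checks whether generation stopped at the token limit. It assumes the Anthropic Messages response shape (a `content` array of typed blocks, `stop_reason`, and `usage`); the interface and helper names are illustrative rather than SDK types.

```typescript
interface ClaudeContentBlock {
  type: string;
  text?: string;
}

// Subset of the Messages API response that this sketch relies on.
interface ClaudeMessageResponse {
  content: ClaudeContentBlock[];
  stop_reason: string | null;
  usage: { input_tokens: number; output_tokens: number };
}

// Concatenates all text blocks, skipping any non-text blocks such as tool calls.
function extractText(response: ClaudeMessageResponse): string {
  return response.content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join("");
}

// True when the model hit max_tokens and the answer is likely cut off.
function wasTruncated(response: ClaudeMessageResponse): boolean {
  return response.stop_reason === "max_tokens";
}

// Usage with the parsed result from the call above:
//   const text = extractText(result as ClaudeMessageResponse);
//   if (wasTruncated(result)) { /* raise max_tokens or ask the model to continue */ }
```

Logging `usage.input_tokens` and `usage.output_tokens` per request is a cheap way to attribute Bedrock spend to individual features.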

14. Latest AWS Updates (April 2026)

AWS DevOps Agent (GA): Investigates incidents, reduces time to resolution, and prevents issues. Preview customers report up to 75% lower MTTR and 3-5x faster resolution. Integrates with CloudWatch, X-Ray, and EventBridge for automated root-cause analysis.

AWS Security Agent (GA): Continuous, context-aware penetration testing integrated into the development lifecycle. Teams report 50%+ faster testing and ~30% lower costs with significantly fewer false positives compared to traditional scanning tools.

Database Savings Plans: Now supports Amazon OpenSearch Service and Neptune Analytics -- save up to 35% on eligible serverless and provisioned instance usage with a one-year commitment.

Elastic Beanstalk AI Analysis: When environment health is degraded, Beanstalk can collect events, instance health, and logs and send them to Amazon Bedrock for analysis, providing step-by-step troubleshooting recommendations.

VPC Encryption Controls: Transitioned from free preview to paid feature starting March 1, 2026.

Lambda SnapStart for Python and .NET (GA): SnapStart now supports Python 3.12+ and .NET 8+ runtimes (in addition to Java 11+), reducing cold starts by up to 90% -- from 2 seconds to sub-200ms. Available in 23+ AWS Regions. Particularly impactful for Python functions loading heavy ML libraries (LangChain, NumPy, Pandas) or web frameworks (Flask, Django). Use runtime hooks to run code before snapshot capture and after resume for proper initialization handling.

Bedrock Model Garden Expansion: Amazon Bedrock now hosts nearly 100 serverless foundation models from 10+ providers. Claude Opus 4.7 launched April 17, 2026, powered by Bedrock's next-generation inference engine with dynamic scheduling. Bedrock added 18 fully managed open-weight models (Mistral Large 3, Google Gemma, MiniMax, Moonshot, NVIDIA, Qwen) and now supports reinforcement fine-tuning with OpenAI-compatible APIs for open-weight models.

Lambda Managed Instances (Preview): Run Lambda functions on customer-selected EC2 instance types including GPUs (p4d with NVIDIA A100, g5 with A10G). Enables compute-intensive workloads like ML inference and HPC directly within the Lambda programming model, bridging the gap between serverless simplicity and GPU access.

15. Real-World Experience

I architected and managed production AWS infrastructure supporting 26 microservices across multiple Latin American countries. Key components:

  • Aurora RDS MySQL: Primary database for all services. Writer + reader endpoint separation, automated backups with 14-day retention, Performance Insights for query optimization
  • S3 + CloudFront: Static asset delivery (images, documents, exports) via CloudFront CDN with OAC. Pre-signed URLs for secure file uploads from mobile apps
  • SES: Transactional email (booking confirmations, password resets, invoices) with DKIM/SPF authentication and dedicated IP for deliverability
  • Cost management: Implemented Compute Savings Plans, S3 lifecycle policies, and VPC endpoints that reduced monthly AWS spend by ~35%
  • Security: Private subnets for all databases, IAM roles for ECS tasks, Secrets Manager for credential rotation, CloudTrail for audit logging