Cloud Computing (7): Operations and DevOps Practices
Chen Kai

Picture this: you've deployed your application to the cloud. It's running smoothly, users are happy, and everything looks great. Then, at 3 AM on a Sunday, your monitoring alerts start firing. Response times have spiked, error rates are climbing, and your on-call engineer is scrambling to figure out what's wrong. Is it a database connection pool exhaustion? A memory leak? A sudden traffic spike? Or something else entirely?

This scenario plays out daily in cloud operations. Building and deploying applications is only half the battle — keeping them running reliably, efficiently, and cost-effectively is where operations and DevOps practices become critical. The difference between a well-operated cloud system and a poorly managed one isn't just uptime; it's the ability to detect issues before they impact users, respond to incidents quickly, optimize costs continuously, and scale seamlessly.

In this comprehensive guide, we'll explore the full spectrum of cloud operations and DevOps practices: infrastructure as code, monitoring and observability, logging and analysis, application performance management, automation pipelines, auto-scaling strategies, cost optimization techniques, troubleshooting methodologies, Site Reliability Engineering principles, and modern GitOps workflows. We'll dive deep into tools like Terraform, Ansible, Prometheus, Grafana, ELK stack, and examine real-world case studies that demonstrate these practices in action.

The Evolution of Cloud Operations

From Traditional IT to DevOps

Traditional IT operations followed a siloed model: developers wrote code, handed it to operations teams who deployed and maintained it, and the two groups often worked at cross-purposes. Developers wanted rapid releases; operations wanted stability. This tension created bottlenecks, slow deployments, and finger-pointing when things went wrong.

DevOps emerged as a cultural and technical movement to bridge this divide. The term, coined around 2009, combines "development" and "operations" to emphasize collaboration, shared responsibility, and automation. DevOps isn't just about tools — it's about creating a culture where building, deploying, and operating software becomes a unified, continuous process.

Key Principles:

  • Automation: Eliminate manual, error-prone processes through scripting and tooling
  • Continuous Integration/Continuous Deployment (CI/CD): Integrate code changes frequently and deploy automatically
  • Infrastructure as Code: Manage infrastructure through version-controlled code
  • Monitoring and Logging: Comprehensive observability into system behavior
  • Collaboration: Break down silos between development and operations teams

Cloud Operations Challenges

Operating cloud infrastructure introduces unique challenges compared to traditional on-premises environments:

Scale and Complexity: Cloud systems can span multiple regions, availability zones, and services. A single application might depend on dozens of microservices, databases, caches, message queues, and external APIs. Understanding dependencies and failure modes becomes exponentially more complex.

Dynamic Nature: Resources are ephemeral — instances come and go, auto-scaling groups expand and contract, containers are created and destroyed. Traditional static monitoring approaches don't work well in this environment.

Multi-Tenancy: Cloud providers operate massive shared infrastructure. Understanding how your application's performance might be affected by "noisy neighbors" or provider-side issues requires sophisticated monitoring.

Cost Management: Cloud costs can spiral out of control without careful management. Idle resources, over-provisioning, inefficient instance types, and data transfer costs can quickly exceed budgets.

Security and Compliance: Cloud environments require continuous security monitoring, compliance validation, and access management. Misconfigurations can expose sensitive data or create security vulnerabilities.

Vendor Lock-in: While cloud providers offer powerful managed services, relying too heavily on proprietary services can make migration difficult. Balancing convenience with portability is a constant operational consideration.

Infrastructure as Code (IaC)

Infrastructure as Code is the practice of managing and provisioning infrastructure through machine-readable definition files rather than manual configuration. This approach brings version control, testing, code review, and automation to infrastructure management.

Why Infrastructure as Code?

Consistency: Manual configuration leads to drift — servers configured differently, missing security patches, inconsistent networking rules. IaC ensures every environment is identical.

Speed: Provisioning infrastructure manually takes hours or days. With IaC, you can spin up entire environments in minutes.

Risk Reduction: Manual changes are error-prone. IaC allows you to test infrastructure changes before applying them, review changes through pull requests, and roll back if needed.

Documentation: IaC code serves as living documentation of your infrastructure. New team members can understand the system by reading the code.

Cost Control: IaC makes it easy to destroy unused resources, right-size instances, and replicate cost-optimized configurations.

Terraform: Declarative Infrastructure Provisioning

Terraform by HashiCorp is the most popular IaC tool, using a declarative configuration language (HCL - HashiCorp Configuration Language) to define desired infrastructure state.

Core Concepts:

  1. Providers: Plugins that interact with cloud APIs (AWS, Azure, GCP, etc.)
  2. Resources: Infrastructure components (EC2 instances, S3 buckets, VPCs)
  3. State: A file tracking the mapping between configuration and real infrastructure
  4. Modules: Reusable, parameterized configurations

Example: AWS EC2 Instance with Terraform

# variables.tf
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "ami_id" {
  description = "AMI ID for the instance (leave empty to use the latest Amazon Linux 2 AMI)"
  type        = string
  default     = ""
}

variable "environment" {
  description = "Environment name"
  type        = string
}

# main.tf
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Data source to get the latest Amazon Linux 2 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Security group for the instance
resource "aws_security_group" "web_server" {
  name        = "${var.environment}-web-server-sg"
  description = "Security group for web server"

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.environment}-web-server-sg"
    Environment = var.environment
  }
}

# EC2 instance
resource "aws_instance" "web_server" {
  ami           = var.ami_id != "" ? var.ami_id : data.aws_ami.amazon_linux.id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.web_server.id]

  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y httpd
    systemctl start httpd
    systemctl enable httpd
    echo "<h1>Hello from ${var.environment}</h1>" > /var/www/html/index.html
  EOF

  tags = {
    Name        = "${var.environment}-web-server"
    Environment = var.environment
    ManagedBy   = "Terraform"
  }
}

# Output the instance public IP
output "instance_public_ip" {
  description = "Public IP address of the instance"
  value       = aws_instance.web_server.public_ip
}

Terraform Workflow:

  1. Write Configuration: Define resources in .tf files
  2. Initialize: terraform init downloads providers and modules
  3. Plan: terraform plan shows what changes will be made
  4. Apply: terraform apply creates or modifies infrastructure
  5. Destroy: terraform destroy removes infrastructure

State Management: Terraform state files track resource mappings. For team collaboration, store state remotely using backends like S3, Azure Storage, or Terraform Cloud.
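
A minimal sketch of such a remote backend, assuming an S3 bucket and DynamoDB lock table created separately beforehand (the bucket and table names below are placeholders):

```hcl
# backend.tf - remote state with locking (names are placeholders)
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # pre-existing S3 bucket for state
    key            = "prod/terraform.tfstate"   # path to the state file in the bucket
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"          # pre-existing table; enables state locking
    encrypt        = true                       # encrypt state at rest
  }
}
```

After adding the backend block, `terraform init` migrates local state to the remote backend.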

Best Practices:

  • Use modules for reusable components
  • Separate environments (dev, staging, prod) into different workspaces or directories
  • Enable state locking to prevent concurrent modifications
  • Use variables and outputs for flexibility
  • Implement policy as code with tools like OPA (Open Policy Agent)

Ansible: Configuration Management and Automation

While Terraform focuses on provisioning infrastructure, Ansible excels at configuration management and application deployment. Ansible uses YAML playbooks to define automation tasks.

Key Concepts:

  • Playbooks: YAML files describing automation tasks
  • Tasks: Individual units of work (install package, start service, copy file)
  • Modules: Reusable units of code (apt, service, copy, template)
  • Inventory: List of hosts to manage
  • Roles: Reusable collections of tasks, variables, and templates

Example: Ansible Playbook for Web Server Setup

# playbook.yml
---
- name: Configure web server
  hosts: webservers
  become: yes
  vars:
    http_port: 80
    https_port: 443
    app_user: www-data

  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name:
          - nginx
          - certbot
          - python3-certbot-nginx
        state: present

    - name: Create web directory
      file:
        path: /var/www/app
        state: directory
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
        mode: '0755'

    - name: Configure Nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/sites-available/app
      notify: restart nginx

    - name: Enable site
      file:
        src: /etc/nginx/sites-available/app
        dest: /etc/nginx/sites-enabled/app
        state: link
      notify: restart nginx

    - name: Start and enable Nginx
      systemd:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: restart nginx
      systemd:
        name: nginx
        state: restarted

Ansible Inventory Example:

# inventory.ini
[webservers]
web1.example.com ansible_host=10.0.1.10
web2.example.com ansible_host=10.0.1.11

[databases]
db1.example.com ansible_host=10.0.2.10

[webservers:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/id_rsa

Ansible vs Terraform:

  • Terraform: Best for provisioning cloud resources (VPCs, instances, load balancers)
  • Ansible: Best for configuring existing systems (installing software, managing services, deploying applications)
  • Together: Use Terraform to create infrastructure, then Ansible to configure it

AWS CloudFormation: Native AWS IaC

CloudFormation is AWS's native IaC service, using JSON or YAML templates to define AWS resources.

Example: CloudFormation Template

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web server stack with auto-scaling'

Parameters:
  InstanceType:
    Type: String
    Default: t3.medium
    AllowedValues:
      - t3.small
      - t3.medium
      - t3.large
    Description: EC2 instance type

  Environment:
    Type: String
    Default: dev
    AllowedValues:
      - dev
      - staging
      - prod

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${Environment}-vpc'

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub '${Environment}-igw'

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC

  DefaultPublicRoute:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTable

  WebServerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub '${Environment}-web-sg'
      GroupDescription: Security group for web servers
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0

  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub '${Environment}-web-lt'
      LaunchTemplateData:
        ImageId: ami-0c55b159cbfafe1f0
        InstanceType: !Ref InstanceType
        SecurityGroupIds:
          - !Ref WebServerSecurityGroup
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            yum update -y
            yum install -y httpd
            systemctl start httpd
            systemctl enable httpd

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      Tags:
        - Key: Name
          Value: !Sub '${Environment}-web-asg'
          PropagateAtLaunch: true

Outputs:
  VPCId:
    Description: VPC ID
    Value: !Ref VPC
    Export:
      Name: !Sub '${AWS::StackName}-VPCId'

  PublicSubnetId:
    Description: Public Subnet ID
    Value: !Ref PublicSubnet
    Export:
      Name: !Sub '${AWS::StackName}-PublicSubnetId'

CloudFormation Features:

  • Stack Management: Create, update, and delete entire stacks
  • Change Sets: Preview changes before applying
  • Drift Detection: Identify manual changes to resources
  • Nested Stacks: Organize complex templates
  • Stack Policies: Control which resources can be modified

Monitoring and Observability

Monitoring tells you what's happening; observability helps you understand why. Modern cloud applications require comprehensive observability across metrics, logs, and traces.

The Three Pillars of Observability

Metrics: Numerical measurements over time (CPU usage, request rate, error count). Metrics are efficient to store and query but lose detail.

Logs: Event records with timestamps and context. Logs provide detailed information but can be expensive to store and search.

Traces: Request flows across distributed systems. Traces show how requests propagate through microservices, helping identify bottlenecks.
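
The distinction between the pillars can be made concrete with a small sketch: the log events below are invented for illustration, and the aggregates computed from them are exactly the kind of numbers a metrics system would store as time series.

```python
from collections import Counter

# Logs: detailed, per-request event records (rich but expensive to retain)
log_events = [
    {"path": "/api/users", "status": 200, "duration_ms": 45},
    {"path": "/api/users", "status": 500, "duration_ms": 310},
    {"path": "/api/orders", "status": 200, "duration_ms": 120},
    {"path": "/api/orders", "status": 200, "duration_ms": 80},
]

# Metrics: cheap numeric aggregates derived from those events
request_count = len(log_events)
error_count = sum(1 for e in log_events if e["status"] >= 500)
error_rate = error_count / request_count
requests_by_path = Counter(e["path"] for e in log_events)

print(f"requests={request_count} errors={error_count} error_rate={error_rate:.2%}")
# prints: requests=4 errors=1 error_rate=25.00%
```

Real systems do this aggregation continuously (in the application, an agent, or the metrics backend) rather than over a static list, but the detail-for-efficiency trade is the same.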

Prometheus: Metrics Collection and Alerting

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It uses a pull-based model where Prometheus scrapes metrics from instrumented applications.

Core Concepts:

  1. Metrics: Time-series data points
  2. Labels: Key-value pairs that identify metrics
  3. Scraping: Prometheus pulls metrics from targets
  4. PromQL: Query language for metrics
  5. Alertmanager: Handles alerts and routing

Prometheus Configuration Example:

Problem Background: In production environments, monitoring requires collecting metrics from diverse sources including infrastructure (servers, containers), applications, and Kubernetes resources. Prometheus's pull-based architecture requires careful configuration to discover and scrape targets efficiently while maintaining scalability and reliability.

Solution Approach:

  • Global settings: Define default scrape intervals and external labels for all targets
  • Service discovery: Use Kubernetes SD for dynamic target discovery
  • Relabeling: Transform and filter discovered targets before scraping
  • Alert integration: Configure Alertmanager for alert routing and notification

Design Considerations:

  • Scrape intervals: Balance data freshness with resource consumption (15s default, adjust per job)
  • Label cardinality: External labels identify metric origin in multi-cluster setups
  • Target filtering: Use relabeling to selectively scrape annotated pods
  • High availability: Deploy multiple Prometheus instances with consistent configuration

# prometheus.yml
# Prometheus main configuration file
# Purpose: Configure metric collection, alerting, and service discovery

# Global configuration applies to all scrape jobs
global:
  # Scrape interval: How often Prometheus scrapes metrics from targets
  # 15s provides good balance between data freshness and load
  # Can be overridden per job for critical or less important metrics
  scrape_interval: 15s

  # Evaluation interval: How often Prometheus evaluates alert rules
  # Should match or exceed scrape_interval to ensure sufficient data
  evaluation_interval: 15s

  # External labels: Added to all time series and alerts
  # Critical for: Multi-cluster federation, alert routing, data identification
  external_labels:
    cluster: 'production'   # Identifies which cluster metrics come from
    environment: 'prod'     # Distinguishes prod/staging/dev
    region: 'us-east-1'     # Optional: Geographic location

# Alert rule files: Define conditions that trigger alerts
# Separate files enable modular alert management
rule_files:
  - "alerts.yml"
  # Can include multiple files for organization:
  # - "infrastructure_alerts.yml"
  # - "application_alerts.yml"
  # - "business_alerts.yml"

# Alerting configuration: Where to send triggered alerts
alerting:
  alertmanagers:
    # Static configuration for Alertmanager endpoints
    # For HA: Configure multiple Alertmanager instances
    - static_configs:
        - targets:
            - alertmanager:9093
            # High availability setup:
            # - alertmanager-1:9093
            # - alertmanager-2:9093

# Scrape configurations: Define what and how to scrape
scrape_configs:
  # Job 1: Monitor Prometheus itself
  # Purpose: Ensure Prometheus is healthy and performing well
  - job_name: 'prometheus'
    # Static targets: Manually specified endpoints
    # Use for: Fixed infrastructure, Prometheus components
    static_configs:
      - targets: ['localhost:9090']
        # Optional labels to identify this Prometheus instance
        labels:
          instance_role: 'primary'

  # Job 2: Node Exporter for system metrics
  # Purpose: Collect OS-level metrics (CPU, memory, disk, network)
  # Requires: node_exporter running on each host
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        # Add custom labels for better metric organization
        labels:
          node_type: 'compute'

    # Relabeling: Transform labels before storing metrics
    relabel_configs:
      # Extract hostname from address (remove port)
      # Useful for: Clean instance labels in dashboards
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'   # Match hostname:port
        replacement: '${1}'    # Keep only hostname

  # Job 3: Application metrics
  # Purpose: Collect application-specific metrics
  # Requires: Application exposes /metrics endpoint
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']

    # Custom metrics path (default is /metrics)
    metrics_path: '/metrics'

    # Override global scrape_interval for this job
    # 10s for critical application metrics
    scrape_interval: 10s

    # Optional: Add scrape timeout (default 10s)
    # scrape_timeout: 5s

  # Job 4: Kubernetes Pod service discovery
  # Purpose: Automatically discover and scrape Kubernetes pods
  # Advantage: No manual configuration, adapts to pod changes
  - job_name: 'kubernetes-pods'
    # Kubernetes service discovery configuration
    kubernetes_sd_configs:
      - role: pod   # Discover pods
        # Optional: Limit to specific namespaces
        # namespaces:
        #   names: ['production', 'staging']

    # Relabeling: Filter and transform discovered targets
    # Critical for: Selective scraping, label enrichment
    relabel_configs:
      # Rule 1: Only scrape pods with annotation prometheus.io/scrape=true
      # Purpose: Opt-in monitoring, avoid scraping all pods
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep   # Keep only matching targets
        regex: true    # Match annotation value "true"
        # Note: Pods must have this annotation to be scraped

      # Rule 2: Use custom metrics path from annotation
      # Default: /metrics
      # Override with: prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)   # Match any non-empty value
        # Example: prometheus.io/path: "/custom/metrics"

      # Rule 3: Use custom port from annotation
      # Default: Use pod's exposed port
      # Combines the pod IP with the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: '${1}:${2}'
        # Example: prometheus.io/port: "8080"

      # Optional: Add namespace as label for multi-tenant setups
      # - source_labels: [__meta_kubernetes_namespace]
      #   target_label: kubernetes_namespace

      # Optional: Add pod name for troubleshooting
      # - source_labels: [__meta_kubernetes_pod_name]
      #   target_label: kubernetes_pod

Key Points:

  • Service discovery: Kubernetes SD automatically detects pods; no manual target management is needed
  • Relabeling power: Transform discovered targets before scraping, enabling filtering, label enrichment, and custom addressing
  • External labels: Critical for multi-cluster setups; they identify metric origin in federated Prometheus or long-term storage
  • Scrape intervals: Different jobs can use different intervals based on metric importance and cardinality

Design Trade-offs:

  • Scrape frequency vs load: Lower intervals (5s) provide near-real-time visibility but increase Prometheus CPU/memory usage and network traffic
  • Service discovery vs static configs: SD adapts to changes automatically but adds complexity; static configs are simpler but require manual updates
  • Label cardinality: More labels enable better filtering but increase storage and query costs; avoid high-cardinality labels like request IDs

Common Questions:

  • Q: How do I monitor only specific pods? A: Use pod annotations (prometheus.io/scrape: "true") and relabeling to filter.
  • Q: What's the recommended scrape interval? A: 15s for standard metrics, 10s for critical applications, 30-60s for less important infrastructure.
  • Q: How do I handle high-cardinality metrics? A: Use recording rules to pre-aggregate, drop unnecessary labels, or use metric_relabel_configs to filter before storage.

Production Practices:

  • Deploy Prometheus in HA mode with identical configurations for redundancy
  • Use ConfigMaps in Kubernetes to manage prometheus.yml, enabling GitOps workflows
  • Monitor Prometheus itself: set up alerts for scrape failures, targets down, and high cardinality
  • Use federation for multi-cluster monitoring: a central Prometheus scrapes from per-cluster Prometheus instances
  • Implement metric retention policies based on storage capacity and query patterns
  • Use recording rules to pre-calculate expensive queries used in dashboards
  • Set appropriate resource limits to prevent Prometheus from consuming excessive resources
  • Regularly review and optimize scrape configs to remove unused jobs and reduce cardinality

PromQL Query Examples:

# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Request rate per second
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Available memory percentage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
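
The same queries can also be run programmatically against Prometheus's HTTP API (an instant query is a GET to /api/v1/query with a query parameter). A stdlib-only Python sketch, assuming a Prometheus server reachable at whatever base URL you pass in:

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for Prometheus's HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def run_query(base_url: str, promql: str) -> list:
    """Execute a PromQL instant query and return the result vector."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        payload = json.load(resp)
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Example (requires a reachable Prometheus server):
# for series in run_query("http://localhost:9090", 'rate(http_requests_total[5m])'):
#     print(series["metric"], series["value"])
```

This is how dashboards, scripts, and capacity reports typically consume Prometheus data without going through the web UI.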

Alert Rules Example:

# alerts.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low"
          description: "Disk space is below 15% on {{ $labels.instance }}"

  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is above 1 second for the last 10 minutes"

Grafana: Visualization and Dashboards

Grafana is an open-source analytics and visualization platform that works with Prometheus and other data sources. It provides rich dashboards for visualizing metrics.

Dashboard JSON Example:

{
  "dashboard": {
    "title": "Application Monitoring Dashboard",
    "tags": ["application", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m]))",
            "legendFormat": "Requests/sec"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "id": 3,
        "title": "Response Time (P95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95 Latency"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "id": 4,
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU %"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ],
    "refresh": "10s",
    "time": {
      "from": "now-6h",
      "to": "now"
    }
  }
}

Grafana Features:

  • Multiple Data Sources: Prometheus, InfluxDB, Elasticsearch, CloudWatch, etc.
  • Rich Visualizations: Graphs, heatmaps, tables, stat panels, logs
  • Alerting: Create alerts based on dashboard queries
  • Variables: Dynamic dashboard variables for filtering
  • Annotations: Mark events on graphs
  • Sharing: Export dashboards as JSON or share via URL
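
Dashboards can also be managed programmatically: Grafana's HTTP API accepts a dashboard definition wrapped in a small envelope via POST to /api/dashboards/db. A stdlib-only sketch, assuming a running Grafana instance and a service-account token (the URL and token below are placeholders):

```python
import json
import urllib.request

def dashboard_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard definition in the envelope Grafana's API expects.
    Setting "id" to None tells Grafana to create (or match by uid) rather
    than update a specific numeric id."""
    return {"dashboard": {**dashboard, "id": None}, "overwrite": overwrite}

def upload_dashboard(base_url: str, api_token: str, dashboard: dict) -> dict:
    """POST a dashboard to Grafana's /api/dashboards/db endpoint."""
    req = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(dashboard_payload(dashboard)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # Grafana returns the saved dashboard's uid and url

# Example (requires a running Grafana and a valid token):
# upload_dashboard("http://localhost:3000", "<token>",
#                  {"title": "App Monitoring", "panels": []})
```

Automating uploads like this keeps dashboards in version control alongside the alert rules and scrape configs they visualize.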

Application Performance Monitoring (APM)

APM tools provide deep insights into application performance, tracking request flows, database queries, external API calls, and identifying performance bottlenecks.

Key APM Capabilities:

  • Distributed Tracing: Track requests across microservices
  • Code Profiling: Identify slow functions and database queries
  • Error Tracking: Capture and analyze application errors
  • Real User Monitoring (RUM): Track actual user experience
  • Synthetic Monitoring: Proactive testing from various locations

Popular APM Tools:

  • Datadog APM: Full-stack observability with distributed tracing
  • New Relic: Application performance monitoring with AI-powered insights
  • Dynatrace: AI-powered observability platform
  • Jaeger: Open-source distributed tracing system
  • Zipkin: Distributed tracing system
  • OpenTelemetry: Vendor-neutral observability framework

OpenTelemetry Example:

# Python application instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)

# Add span processor
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Instrument libraries
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()

# Example application code
from flask import Flask
import requests

app = Flask(__name__)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)

        # Database query (automatically traced; `db` and `User` come from the app's models)
        user = db.session.query(User).filter_by(id=user_id).first()

        # External API call (automatically traced)
        response = requests.get(f"https://api.example.com/profile/{user_id}")

        return {"user": user, "profile": response.json()}

Logging and Analysis

Logs provide detailed records of system events, errors, and user activities. Effective log management requires collection, storage, indexing, and analysis capabilities.

ELK Stack: Elasticsearch, Logstash, and Kibana

The ELK stack is a popular open-source solution for log management:

  • Elasticsearch: Distributed search and analytics engine
  • Logstash: Log processing pipeline
  • Kibana: Visualization and exploration interface

Logstash Configuration Example:

# logstash.conf
input {
  # File input for application logs
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => "json"
  }

  # Beats input for system logs
  beats {
    port => 5044
  }

  # Syslog input
  syslog {
    port => 514
  }
}

filter {
  # Parse JSON logs
  if [type] == "app" {
    json {
      source => "message"
    }

    # Extract timestamp
    date {
      match => ["timestamp", "ISO8601"]
    }

    # Add fields
    mutate {
      add_field => { "environment" => "production" }
    }

    # Parse error messages
    if [level] == "ERROR" {
      grok {
        match => { "message" => "%{GREEDYDATA:error_message}" }
      }
    }
  }

  # Parse system logs
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGLINE}" }
    }
  }

  # Remove sensitive data
  mutate {
    remove_field => ["password", "api_key", "token"]
  }
}

output {
  # Send to Elasticsearch
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    template_name => "logs"
    template => "/etc/logstash/templates/logs-template.json"
    template_overwrite => true
  }

  # Debug output (remove in production)
  stdout {
    codec => rubydebug
  }
}

Filebeat Configuration (lightweight log shipper):

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      log_type: application
      environment: production
    fields_under_root: false
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx_access
    json.keys_under_root: true
    json.add_error_key: true

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_docker_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

# Optional: Send directly to Elasticsearch
# output.elasticsearch:
#   hosts: ["elasticsearch:9200"]
#   index: "filebeat-%{+yyyy.MM.dd}"

Kibana Query Examples:

# Find all errors in the last hour
level: ERROR AND @timestamp >= now()-1h

# Find slow requests (>1 second)
duration_ms > 1000 AND @timestamp >= now()-24h

# Group errors by service
level: ERROR | stats count() by service

# Find authentication failures
message: "authentication failed" OR message: "login failed"

# Filter by user ID
user_id: "12345" AND @timestamp >= now()-7d

Centralized Logging Best Practices

Structured Logging: Use JSON format for logs to enable easier parsing and querying.

import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
        }

        # Add extra fields
        if hasattr(record, "user_id"):
            log_entry["user_id"] = record.user_id
        if hasattr(record, "request_id"):
            log_entry["request_id"] = record.request_id
        if hasattr(record, "duration_ms"):
            log_entry["duration_ms"] = record.duration_ms

        # Add exception info
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)

        return json.dumps(log_entry)

# Usage
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"user_id": "12345", "ip": "192.168.1.1"})

Log Retention Policies: Define retention periods based on compliance requirements and cost considerations. Hot storage for recent logs, warm storage for older logs, cold storage for archival.
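With the ELK stack, these tiers can be expressed as an Elasticsearch Index Lifecycle Management (ILM) policy. The phase timings below are illustrative assumptions, not recommendations:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Once registered (e.g. via `PUT _ilm/policy/logs`) and attached to an index template, indices move through the tiers automatically as they age.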

Log Sampling: For high-volume applications, sample logs to reduce storage costs while maintaining visibility into errors and important events.
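As an illustrative sketch (not from the original post), a `logging.Filter` can keep every WARNING-and-above record while passing only a fraction of routine INFO records:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING+ records, but only a sample of lower-level ones."""

    def __init__(self, sample_rate=0.1, seed=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)

    def filter(self, record):
        # Errors and warnings always pass through
        if record.levelno >= logging.WARNING:
            return True
        # INFO/DEBUG records pass with probability sample_rate
        return self.rng.random() < self.sample_rate

logger = logging.getLogger("sampled")
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1, seed=42))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Roughly 10% of these INFO lines are emitted; every ERROR is kept.
for i in range(100):
    logger.info("routine event %d", i)
logger.error("this error is never dropped")
```

Attaching the filter to the handler (rather than the logger) keeps the sampling decision at the point of emission, so other handlers can still see every record.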

Security: Sanitize logs to remove sensitive information (passwords, credit card numbers, PII). Use log encryption in transit and at rest.

Automation and CI/CD

Automation is the backbone of modern cloud operations, enabling rapid, reliable deployments and reducing manual errors.

CI/CD Pipeline Components

Continuous Integration (CI): Automatically build and test code changes when developers commit to version control.

Continuous Deployment (CD): Automatically deploy code changes to production after passing tests.

Pipeline Stages:

  1. Source: Code repository (Git)
  2. Build: Compile code, run unit tests
  3. Test: Integration tests, security scans
  4. Deploy: Deploy to staging/production
  5. Verify: Smoke tests, monitoring checks
  6. Rollback: Automatic rollback on failure

GitHub Actions Example

# .github/workflows/deploy.yml
name: Deploy to AWS

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: my-app
  ECS_SERVICE: my-app-service
  ECS_CLUSTER: production

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: |
          pytest --cov=app --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy results
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

  build-and-deploy:
    needs: [test, security-scan]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build, tag, and push image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: task-definition.json
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true

      - name: Run smoke tests
        run: |
          sleep 30
          curl -f https://api.example.com/health || exit 1

      - name: Notify on failure
        if: failure()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: 'Deployment failed!'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}

GitLab CI/CD Example

# .gitlab-ci.yml
stages:
  - build
  - test
  - security
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"

build:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker build -t $CI_REGISTRY_IMAGE:latest .
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - docker push $CI_REGISTRY_IMAGE:latest
  only:
    - main
    - develop

test:
  stage: test
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - pip install pytest pytest-cov
    - pytest --cov=app --cov-report=html --cov-report=term
  coverage: '/TOTAL.*\s+(\d+%)$/'
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml
    paths:
      - htmlcov/

security:
  stage: security
  image: aquasec/trivy:latest
  script:
    - trivy image --exit-code 0 --severity HIGH,CRITICAL $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy-staging:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n staging
    - kubectl rollout status deployment/app -n staging
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - develop

deploy-production:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n production
    - kubectl rollout status deployment/app -n production
  environment:
    name: production
    url: https://api.example.com
  when: manual
  only:
    - main

Auto-Scaling Strategies

Auto-scaling automatically adjusts compute resources based on demand, ensuring optimal performance while minimizing costs.

Horizontal vs Vertical Scaling

Horizontal Scaling (Scale Out/In): Add or remove instances. Better for cloud environments, provides high availability.

Vertical Scaling (Scale Up/Down): Increase or decrease instance size. Simpler but has limits and requires downtime.

AWS Auto Scaling Configuration

# Auto Scaling Group with Launch Template
Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: app-launch-template
      LaunchTemplateData:
        ImageId: ami-0c55b159cbfafe1f0
        InstanceType: t3.medium
        SecurityGroupIds:
          - !Ref SecurityGroup
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            yum update -y
            yum install -y docker
            systemctl start docker
            systemctl enable docker
            docker run -d -p 80:8080 ${ECR_REPO}:latest

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref Subnet1
        - !Ref Subnet2
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref TargetGroup
      Tags:
        - Key: Name
          Value: app-instance
          PropagateAtLaunch: true

  # Scale-up policy (CPU > 70%)
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      Cooldown: 300
      ScalingAdjustment: 1

  # Scale-down policy (CPU < 30%)
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AdjustmentType: ChangeInCapacity
      AutoScalingGroupName: !Ref AutoScalingGroup
      Cooldown: 300
      ScalingAdjustment: -1

  CPUAlarmHigh:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighCPU
      AlarmDescription: Alarm when CPU exceeds 70%
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 70
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref AutoScalingGroup
      AlarmActions:
        - !Ref ScaleUpPolicy

  CPUAlarmLow:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: LowCPU
      AlarmDescription: Alarm when CPU drops below 30%
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 30
      ComparisonOperator: LessThanThreshold
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref AutoScalingGroup
      AlarmActions:
        - !Ref ScaleDownPolicy

Kubernetes Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
      selectPolicy: Max

Auto-Scaling Best Practices

Predictive Scaling: Use machine learning to predict traffic patterns and scale proactively.

Multiple Metrics: Don't rely solely on CPU. Consider memory, request rate, queue depth, and custom metrics.

Cooldown Periods: Prevent rapid scaling oscillations with appropriate cooldown periods.

Gradual Scaling: Scale up quickly but scale down gradually to handle traffic spikes.

Health Checks: Ensure new instances are healthy before routing traffic to them.

Cost Optimization: Use spot instances for non-critical workloads, reserved instances for baseline capacity.
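The cooldown and gradual-scale-down advice above can be captured in a toy decision helper. This is an illustrative sketch under assumed thresholds, not a production autoscaler:

```python
import time

class ScalingDecider:
    """Toy scaler: scales up fast, scales down slowly, honors a cooldown."""

    def __init__(self, min_size=2, max_size=10, cooldown_s=300):
        self.min_size = min_size
        self.max_size = max_size
        self.cooldown_s = cooldown_s
        self.last_action_at = float("-inf")

    def decide(self, current, cpu_pct, now=None):
        """Return the new desired capacity given average CPU utilization."""
        now = time.monotonic() if now is None else now
        if now - self.last_action_at < self.cooldown_s:
            return current  # still cooling down; avoid oscillation
        desired = current
        if cpu_pct > 70:
            desired = min(current * 2, self.max_size)  # scale up aggressively
        elif cpu_pct < 30:
            desired = max(current - 1, self.min_size)  # scale down one at a time
        if desired != current:
            self.last_action_at = now
        return desired
```

The asymmetry is deliberate: doubling on the way up absorbs spikes, while removing one instance at a time on the way down avoids dropping capacity just before the next spike.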

Cost Optimization

Cloud costs can quickly spiral out of control without proper management. Effective cost optimization requires continuous monitoring, right-sizing, and strategic resource usage.

Cost Optimization Strategies

Right-Sizing: Match instance types to actual workload requirements. Use cloud provider tools to analyze utilization and recommend sizes.

Reserved Instances: Commit to 1-3 year terms for predictable workloads to save 30-70% compared to on-demand pricing.

Spot Instances: Use spot instances for fault-tolerant, flexible workloads. Can save up to 90% compared to on-demand.

Auto-Shutdown: Automatically stop non-production resources during off-hours.

Storage Optimization: Use appropriate storage classes (hot, warm, cold, archive) based on access patterns.

Data Transfer Optimization: Minimize data transfer costs by using CDNs, compressing data, and optimizing API calls.

Tagging and Cost Allocation: Tag resources to track costs by project, team, or environment.
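The auto-shutdown strategy above can be sketched as a pure selection function that picks instances to stop based on an environment tag and the hour of day. The tag names, hours, and instance shape are hypothetical; in practice the returned IDs would feed an API call such as EC2 stop-instances:

```python
def instances_to_stop(instances, hour_utc, off_hours=range(20, 24),
                      envs=("dev", "staging")):
    """Return IDs of non-production instances that should be stopped off-hours."""
    if hour_utc not in off_hours:
        return []
    to_stop = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        # Only stop running instances explicitly tagged as non-production
        if tags.get("environment") in envs and inst["State"] == "running":
            to_stop.append(inst["InstanceId"])
    return to_stop

fleet = [
    {"InstanceId": "i-dev1", "State": "running",
     "Tags": [{"Key": "environment", "Value": "dev"}]},
    {"InstanceId": "i-prod1", "State": "running",
     "Tags": [{"Key": "environment", "Value": "production"}]},
]
print(instances_to_stop(fleet, hour_utc=22))  # ['i-dev1']
print(instances_to_stop(fleet, hour_utc=9))   # []
```

Keeping the selection logic as a pure function makes the shutdown policy easy to unit-test before wiring it to a scheduler such as a nightly Lambda or cron job.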

AWS Cost Optimization Script

import boto3
from datetime import datetime, timedelta

def analyze_idle_instances():
    """Identify EC2 instances with low CPU utilization"""
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    # Get all running instances
    response = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    idle_instances = []
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=7)

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']

            # Get CPU utilization for the last 7 days
            metrics = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=start_time,
                EndTime=end_time,
                Period=3600,
                Statistics=['Average']
            )

            if metrics['Datapoints']:
                avg_cpu = sum(d['Average'] for d in metrics['Datapoints']) / len(metrics['Datapoints'])

                if avg_cpu < 10:  # Less than 10% average CPU
                    idle_instances.append({
                        'instance_id': instance_id,
                        'instance_type': instance['InstanceType'],
                        'avg_cpu': avg_cpu,
                        'tags': instance.get('Tags', [])
                    })

    return idle_instances

def generate_cost_report():
    """Generate cost optimization report"""
    cost_explorer = boto3.client('ce')

    end_date = datetime.utcnow().strftime('%Y-%m-%d')
    start_date = (datetime.utcnow() - timedelta(days=30)).strftime('%Y-%m-%d')

    # Get costs by service
    response = cost_explorer.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='MONTHLY',
        Metrics=['BlendedCost', 'UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )

    print("Cost by Service (Last 30 days):")
    for result in response['ResultsByTime']:
        for group in result['Groups']:
            service = group['Keys'][0]
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            print(f"  {service}: ${cost:.2f}")

    # Get costs by instance type
    response = cost_explorer.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='MONTHLY',
        Metrics=['BlendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}]
    )

    print("\nCost by Instance Type:")
    for result in response['ResultsByTime']:
        for group in result['Groups']:
            instance_type = group['Keys'][0] if group['Keys'][0] else 'N/A'
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            print(f"  {instance_type}: ${cost:.2f}")

if __name__ == '__main__':
    print("Analyzing idle instances...")
    idle = analyze_idle_instances()
    print(f"\nFound {len(idle)} potentially idle instances:")
    for inst in idle:
        print(f"  {inst['instance_id']}: {inst['instance_type']} (Avg CPU: {inst['avg_cpu']:.2f}%)")

    print("\n" + "=" * 50)
    generate_cost_report()

Troubleshooting Methodologies

Effective troubleshooting requires systematic approaches to identify root causes quickly.

The Troubleshooting Process

  1. Observe: Gather symptoms, check monitoring dashboards, review logs
  2. Hypothesize: Form theories about what might be wrong
  3. Test: Verify hypotheses through targeted checks
  4. Fix: Implement solutions
  5. Verify: Confirm the fix resolved the issue
  6. Document: Record the incident and resolution

Common Cloud Issues and Solutions

High Latency:

  • Check database query performance
  • Review cache hit rates
  • Analyze network latency
  • Check for resource contention
  • Review application code for inefficient algorithms

High Error Rates:

  • Check application logs for error patterns
  • Review dependency health (databases, APIs)
  • Check for resource exhaustion (memory, connections)
  • Review recent deployments
  • Check for configuration errors

Resource Exhaustion:

  • Monitor memory usage and leaks
  • Check connection pool sizes
  • Review disk space
  • Analyze CPU usage patterns
  • Check for runaway processes

Network Issues:

  • Verify security group rules
  • Check route tables
  • Review DNS configuration
  • Analyze network ACLs
  • Check for DDoS attacks

Troubleshooting Tools

Cloud Provider Tools:

  • AWS CloudWatch, X-Ray, Systems Manager
  • Google Cloud Monitoring, Trace, Debugger
  • Azure Monitor, Application Insights

Open Source Tools:

  • htop, iostat, netstat for system monitoring
  • tcpdump, wireshark for network analysis
  • strace, perf for application profiling
  • kubectl, docker stats for container debugging

Site Reliability Engineering (SRE)

Site Reliability Engineering, pioneered by Google, applies software engineering principles to operations, focusing on reliability, scalability, and efficiency.

SLO, SLI, and Error Budgets

Service Level Indicator (SLI): A quantitative measure of service quality (e.g., request latency, error rate, availability).

Service Level Objective (SLO): A target value for an SLI (e.g., 99.9% availability, P95 latency < 200ms).

Service Level Agreement (SLA): A contract with customers specifying consequences if SLOs aren't met.

Error Budget: The acceptable amount of unreliability (100% - SLO). If error budget is exhausted, freeze new feature development and focus on reliability.

Example SLO Definition:

SLI: Availability
Measurement: Successful requests / Total requests
Window: Rolling 30-day window
SLO: 99.9% availability
Error Budget: 0.1% (43.2 minutes of downtime per month)

SLI: Latency
Measurement: P95 request latency
Window: Rolling 7-day window
SLO: P95 latency < 200ms
Error Budget: 5% of requests can exceed 200ms
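The arithmetic behind the availability budget is worth making explicit; a small helper (illustrative, not from the original post) converts an SLO into allowed downtime per window:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed by an availability SLO over a rolling window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window_days * 24 * 60 * budget_fraction

# 99.9% over 30 days: 0.1% of 43,200 minutes = 43.2 minutes
print(round(error_budget_minutes(99.9), 2))   # 43.2
print(round(error_budget_minutes(99.99), 2))  # 4.32
```

Each additional "nine" cuts the budget by a factor of ten, which is why tightening an SLO is far more expensive than it looks on paper.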

SRE Practices

Toil Reduction: Automate repetitive operational tasks to free engineers for high-value work.

Incident Response: Structured processes for handling incidents:

  1. Detect and alert
  2. Assess and escalate
  3. Respond and mitigate
  4. Post-mortem and learn

Post-Mortems: Blameless analysis of incidents focusing on process improvements, not individual blame.

Canary Deployments: Gradually roll out changes to a small subset of users before full deployment.

Feature Flags: Control feature rollouts and enable quick rollbacks without code deployments.
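A minimal percentage-rollout flag can be built on a stable hash of the user ID, so each user gets a consistent answer as the rollout percentage grows. This is a sketch with made-up flag names, not a real flag service:

```python
import hashlib

def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministically bucket a user into [0, 100) and compare to rollout."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent

# The same user always gets the same answer for a given percentage
print(is_enabled("new-checkout", "user-42", 0))    # False
print(is_enabled("new-checkout", "user-42", 100))  # True
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage from 10 to 20 only adds users; nobody who already had the feature loses it.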

Chaos Engineering: Deliberately inject failures to test system resilience and identify weaknesses.

GitOps

GitOps is an operational model that uses Git as the single source of truth for infrastructure and application deployments.

GitOps Principles

  1. Declarative: Everything is defined declaratively (Kubernetes manifests, Terraform configs)
  2. Version Controlled: All changes tracked in Git
  3. Automated: Changes to Git automatically trigger deployments
  4. Observable: System state is continuously compared to Git state

GitOps Workflow

Developer → Git Commit → CI Pipeline → Container Registry
                                              ↓
                                       GitOps Operator
                                              ↓
                                      Kubernetes Cluster

ArgoCD Example:

# Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  revisionHistoryLimit: 10

Flux Example:

# Flux GitRepository
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: app-repo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/k8s-manifests
  ref:
    branch: main
  secretRef:
    name: git-credentials
---
# Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: app-kustomization
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/my-app
  prune: true
  sourceRef:
    kind: GitRepository
    name: app-repo
  validation: client
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app
      namespace: production

Case Studies

Case Study 1: E-Commerce Platform Auto-Scaling

Challenge: An e-commerce platform experienced unpredictable traffic spikes during flash sales, causing site crashes and lost revenue.

Solution: Implemented comprehensive auto-scaling with predictive scaling:

  1. Infrastructure: AWS Auto Scaling Groups with Launch Templates
  2. Metrics: CPU, memory, request rate, queue depth
  3. Predictive Scaling: ML-based traffic prediction using historical data
  4. Database: Read replicas with connection pooling
  5. Caching: Multi-layer caching (CDN, Redis, application cache)

Results:

  • Handled 10x traffic spikes without manual intervention
  • Reduced costs by 40% through right-sizing and spot instances
  • Improved availability from 99.5% to 99.95%
  • Zero downtime during flash sales

Key Learnings:

  • Predictive scaling requires historical data; start with reactive scaling
  • Test auto-scaling under load before production
  • Monitor costs continuously; auto-scaling can increase costs if not configured properly

Case Study 2: Microservices Observability

Challenge: A company migrated from monolith to microservices but lost visibility into system behavior. Debugging issues took hours instead of minutes.

Solution: Implemented comprehensive observability stack:

  1. Metrics: Prometheus with custom exporters
  2. Logging: ELK stack with structured logging
  3. Tracing: OpenTelemetry with Jaeger
  4. APM: Datadog for application performance monitoring
  5. Dashboards: Grafana for visualization
  6. Alerting: PagerDuty integration

Architecture:

Applications → OpenTelemetry SDK → OTLP Collector
                        ↓
    ┌───────────────────┼───────────────────┐
    ↓                   ↓                   ↓
Prometheus        Elasticsearch          Jaeger
    ↓                   ↓                   ↓
 Grafana             Kibana             Jaeger UI

Results:

  • Mean Time To Detect (MTTD): Reduced from 30 minutes to 2 minutes
  • Mean Time To Resolve (MTTR): Reduced from 4 hours to 30 minutes
  • Improved developer productivity through better debugging tools
  • Proactive issue detection before user impact

Key Learnings:

  • Instrumentation overhead is minimal (<1% CPU)
  • Structured logging is essential for microservices
  • Distributed tracing reveals unexpected dependencies
  • Start with metrics and logs, add tracing when needed

Case Study 3: Cost Optimization for Startup

Challenge: A startup's cloud costs grew from $500/month to $15,000/month in 6 months without corresponding revenue growth.

Solution: Comprehensive cost optimization program:

  1. Cost Analysis: Identified cost drivers using AWS Cost Explorer
  2. Right-Sizing: Analyzed CloudWatch metrics to resize instances
  3. Reserved Instances: Purchased 1-year RIs for baseline capacity
  4. Spot Instances: Migrated batch jobs to spot instances
  5. Auto-Shutdown: Automated shutdown of dev/staging environments
  6. Storage Optimization: Moved old data to S3 Glacier
  7. Tagging: Implemented comprehensive tagging for cost allocation

Actions Taken:

  • Reduced instance sizes: t3.large → t3.medium (saved 50%)
  • Purchased RIs: Saved 40% on baseline capacity
  • Spot instances for batch: Saved 70% on batch processing
  • Auto-shutdown: Saved 60% on non-production environments
  • Storage optimization: Saved 80% on archival storage

Results:

  • Reduced monthly costs from $15,000 to $4,500 (70% reduction)
  • Maintained performance and availability
  • Established cost monitoring and alerting
  • Created cost allocation by team/project

Key Learnings:

  • Regular cost reviews prevent cost creep
  • Tagging is essential for cost allocation
  • Non-production environments are often over-provisioned
  • Reserved instances require commitment but provide significant savings

Q&A: Common Cloud Operations Questions

Q1: How do I choose between Terraform, Ansible, and CloudFormation?

A: Choose based on your cloud provider and use case:

  • Terraform: Multi-cloud, declarative, large ecosystem. Best for provisioning infrastructure across providers.
  • CloudFormation: AWS-native, integrates well with AWS services. Best if you're AWS-only.
  • Ansible: Configuration management and application deployment. Use alongside Terraform/CloudFormation for configuring provisioned resources.

Many teams use Terraform for provisioning and Ansible for configuration.

Q2: What's the difference between monitoring and observability?

A: Monitoring tells you what's happening (metrics, alerts). Observability helps you understand why (metrics + logs + traces + context).

Monitoring answers "Is the system working?" Observability answers "Why isn't it working?" when something goes wrong.

Q3: How do I determine appropriate SLO targets?

A: Start with business requirements:

  1. What availability do customers expect?
  2. What latency is acceptable?
  3. What error rate is tolerable?

Then work backwards:

  • Analyze historical data to understand current performance
  • Set SLOs slightly above current performance (achievable but requires improvement)
  • Consider error budgets: more aggressive SLOs = less room for new features
  • Review and adjust quarterly based on business needs

Q4: Should I use managed services or self-hosted tools?

A: Consider:

  • Managed services (CloudWatch, Datadog, New Relic): Faster setup, less maintenance, higher cost, potential vendor lock-in
  • Self-hosted (Prometheus, Grafana, ELK): More control, lower cost at scale, requires operational expertise

Start with managed services for speed, migrate to self-hosted if costs become significant or you need specific customizations.

Q5: How do I implement effective alerting?

A: Follow alerting best practices:

  • Alert on symptoms users care about, not every metric
  • Use alert fatigue prevention: Appropriate thresholds, grouping, and suppression
  • Page on-call only for actionable alerts that require immediate response
  • Use different severity levels: Critical (page), Warning (ticket), Info (dashboard)
  • Test alerts regularly to ensure they work
  • Document runbooks for common alerts

Q6: What's the best auto-scaling strategy?

A: Use multiple strategies:

  • Reactive scaling: Scale based on current metrics (CPU, memory, request rate)
  • Predictive scaling: Use ML to predict traffic and scale proactively
  • Scheduled scaling: Scale based on known patterns (business hours, events)
  • Multiple metrics: Don't rely on CPU alone; consider memory, queue depth, custom metrics

Start simple with reactive scaling, add predictive scaling as you gather data.

Q7: How do I optimize cloud costs without impacting performance?

A: Systematic approach:

  1. Measure: Use cost allocation tags and cost analysis tools
  2. Right-size: Analyze utilization and resize instances
  3. Reserved instances: For predictable workloads
  4. Spot instances: For fault-tolerant workloads
  5. Auto-shutdown: Stop non-production resources when not needed
  6. Storage optimization: Use appropriate storage classes
  7. Review regularly: Monthly cost reviews to catch cost creep

Always test cost optimizations in non-production first.

Q8: What's the difference between CI and CD?

A:

  • CI (Continuous Integration): Automatically build and test code when developers commit. Focuses on code quality.
  • CD (Continuous Deployment): Automatically deploy code to production after passing tests. Focuses on delivery speed.

Some teams use Continuous Delivery (manual approval before production deployment) instead of Continuous Deployment (automatic).

Q9: How do I implement GitOps?

A: Steps:

  1. Store everything in Git: Infrastructure code, application configs, Kubernetes manifests
  2. Use a GitOps operator: ArgoCD or Flux to sync Git state to clusters
  3. Automate: CI pipeline builds containers, GitOps operator deploys
  4. Monitor: GitOps operator continuously compares cluster state to Git
  5. Self-heal: Automatically revert manual changes to match Git

Start with a single application, expand gradually.

Q10: What should I include in a post-mortem?

A: Post-mortem structure:

  1. Timeline: Chronological events leading to the incident
  2. Impact: Users affected, duration, business impact
  3. Root cause: Technical and process causes
  4. What went well: Response effectiveness
  5. What went wrong: Gaps in monitoring, processes, tools
  6. Action items: Specific, assigned, time-bound improvements
  7. Follow-up: Review action items in the next post-mortem

Keep post-mortems blameless and focused on learning.

Cloud Operations Checklist

Use this checklist to ensure comprehensive cloud operations coverage:

  • Infrastructure
  • Monitoring and Observability
  • Alerting
  • CI/CD
  • Auto-Scaling
  • Cost Management
  • Security
  • Documentation
  • SRE Practices
  • Disaster Recovery

Conclusion

Cloud operations and DevOps practices are essential for building and maintaining reliable, scalable, and cost-effective cloud systems. The journey from traditional IT operations to modern cloud operations requires cultural shifts, new tools, and continuous learning.

Key takeaways:

  • Infrastructure as Code brings version control, testing, and automation to infrastructure management
  • Comprehensive observability (metrics, logs, traces) is essential for understanding system behavior
  • Automation reduces manual errors and enables rapid, reliable deployments
  • Auto-scaling ensures optimal performance while controlling costs
  • Cost optimization requires continuous monitoring and strategic resource usage
  • SRE practices (SLOs, error budgets, post-mortems) improve reliability systematically
  • GitOps provides a modern operational model using Git as the source of truth

The cloud operations landscape continues to evolve. New tools emerge, best practices refine, and the complexity of distributed systems increases. Staying current requires continuous learning, experimentation, and adaptation. Start with the fundamentals, implement incrementally, measure results, and iterate based on what you learn.

Remember: perfect operations don't exist. The goal is continuous improvement — detecting issues faster, resolving them quicker, and preventing them proactively. Every incident is a learning opportunity, every optimization a step toward better reliability and efficiency.

  • Post title: Cloud Computing (7): Operations and DevOps Practices
  • Post author: Chen Kai
  • Create time: 2023-03-08 00:00:00
  • Post link: https://www.chenk.top/en/cloud-computing-operations-devops/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.