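Cloud Computing (Part 7): Operations and DevOps Practices

At its core, cloud computing abstracts infrastructure away, but abstraction is not disappearance. Moving an application to the cloud does not reduce operations work; it changes its nature, from maintaining hardware in a physical data center to orchestrating, monitoring, optimizing, and automating cloud resources. DevOps blurs the boundary between development and operations and fosters a "you build it, you run it" culture.

This article covers the core cloud operations workflow, Infrastructure as Code practices, monitoring and logging, automated operations, cost optimization, and SRE best practices. Hands-on examples and an overview of the toolchain are meant to help readers build a complete picture of cloud operations.

Core Cloud Operations Workflows and Responsibilities

Traditional operations focuses on physical resources such as servers, networks, and storage, whereas cloud operations centers on resource lifecycle management and service availability. The main responsibilities of a cloud operations team are:

Infrastructure Management

Infrastructure management no longer means buying servers and installing operating systems; it means creating and managing virtual resources through the cloud console or APIs. Operators are responsible for:

Resource planning: choose the right cloud service types (compute, storage, network) for the business requirements
Resource creation: create and manage resources through the console, the CLI, or code
Configuration management: make sure resource configurations comply with security policies and best practices
Resource reclamation: clean up unused resources promptly to avoid wasted spend

Application Deployment and Release

Applications in the cloud are usually deployed with containers or a serverless architecture:

Container orchestration: manage the container lifecycle with Kubernetes, Docker Swarm, and similar tools
CI/CD pipelines: automate the build, test, and deployment process
Blue-green deployment / canary release: reduce release risk and update without downtime
Rollback: restore the previous stable version quickly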
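
To make the canary release idea concrete, here is a minimal sketch of percentage-based traffic splitting. It assumes users are identified by a string ID and that the canary share is an integer percentage; the names `canary_bucket` and `route` are illustrative and do not come from any particular load balancer or service mesh.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent% of users to the canary, the rest to stable."""
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, route(user, canary_percent=10))
```

Because the bucket is derived from a hash of the user ID, a given user keeps seeing the same version while the percentage is unchanged, and raising the percentage only ever moves users from stable to canary.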

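Monitoring and Alerting

Monitoring is the eyes of cloud operations and has to cover several dimensions:

Infrastructure monitoring: usage of CPU, memory, disk, network, and other resources
Application monitoring: request volume, response time, error rate, throughput, and other service-level metrics
Log monitoring: collection and analysis of application, system, and audit logs
Alert management: set sensible alert thresholds to avoid alert fatigue

Incident Handling and Recovery

Incidents in the cloud call for fast diagnosis and recovery:

Fault localization: use monitoring, logs, and distributed tracing to find the root cause quickly
Incident response: define a response process with clear owners and steps
Automatic recovery: rely on the self-healing features of cloud services (health checks, automatic restarts)
Postmortem: analyze the cause of the incident and improve the architecture and processes

Security and Compliance

Security in the cloud is a shared responsibility between the provider and the customer:

Identity and access management: IAM policies, RBAC, the principle of least privilege
Network security: VPCs, security groups, firewall rules
Data security: encryption in transit, encryption at rest, key management
Compliance auditing: meet industry compliance requirements and run regular security audits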

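Infrastructure as Code (IaC)

Infrastructure as Code (IaC) is a core practice of cloud operations. It describes the definition, configuration, and management of infrastructure as code, which brings version control, repeatability, and automation.

Terraform

Terraform is HashiCorp's IaC tool with support for multiple cloud platforms. It uses a declarative language to describe the desired state of the infrastructure; terraform plan previews the changes and terraform apply executes them.

Basic Terraform Syntax

# provider.tf - configure the cloud provider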
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "cn-north-1"
}

# vpc.tf - create the VPC network
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
    Environment = "prod"
  }
}

# subnet.tf - create the subnets
resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-${count.index + 1}"
  }
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-subnet-${count.index + 1}"
  }
}

# security-group.tf - security group configuration
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Security group for web servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "web-security-group"
  }
}

# ec2.tf - EC2 instance configuration
resource "aws_instance" "web" {
  count                  = 2
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.medium"
  subnet_id              = aws_subnet.public[count.index].id
  vpc_security_group_ids = [aws_security_group.web.id]

  user_data = <<-EOF
              #!/bin/bash
              apt-get update
              apt-get install -y nginx
              systemctl start nginx
              systemctl enable nginx
              EOF

  tags = {
    Name = "web-server-${count.index + 1}"
  }
}

# data.tf - data source definitions
data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# output.tf - output values
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "web_server_ips" {
  value = aws_instance.web[*].public_ip
}

Terraform Modules

Modules improve code reuse and maintainability:

# modules/ec2/main.tf
variable "ami_id" {
  description = "AMI ID used to launch the instance"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "subnet_id" {
  description = "Subnet ID"
  type        = string
}

variable "security_group_ids" {
  description = "Security group IDs"
  type        = list(string)
}

resource "aws_instance" "this" {
  ami                    = var.ami_id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = var.security_group_ids
}

output "instance_id" {
  value = aws_instance.this.id
}

# main.tf - use the module
module "web_server" {
  source = "./modules/ec2"

  ami_id             = data.aws_ami.ubuntu.id
  instance_type      = "t3.medium"
  subnet_id          = aws_subnet.public[0].id
  security_group_ids = [aws_security_group.web.id]
}

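Terraform State Management

The Terraform state file records the resources Terraform manages and their last known attributes, which may include sensitive values, so it should be stored securely in a remote backend with encryption and locking:

# backend.tf - use S3 as the backend, with DynamoDB for state locking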
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "production/terraform.tfstate"
    region         = "cn-north-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

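Ansible

Ansible is an open-source configuration management tool maintained by Red Hat. It describes configuration in YAML and runs tasks over SSH, so no agent has to be installed on the target machines.

Ansible Playbook Example

# playbook.yml - deploy an Nginx server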
---
- name: Configure web servers
  hosts: web_servers
  become: yes
  vars:
    nginx_version: "1.24.0"
    worker_processes: 4

  tasks:

    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install Nginx
      apt:
        name: nginx
        state: present

    - name: Configure Nginx
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx

    - name: Enable and start Nginx
      systemd:
        name: nginx
        enabled: yes
        state: started

  handlers:

    - name: restart nginx
      systemd:
        name: nginx
        state: restarted

# inventory.yml - host inventory
all:
  children:
    web_servers:
      hosts:
        web1:
          ansible_host: 10.0.1.10
        web2:
          ansible_host: 10.0.1.11
      vars:
        ansible_user: ubuntu
        ansible_ssh_private_key_file: ~/.ssh/id_rsa

# nginx.conf.j2 - Jinja2 template
user www-data;
worker_processes {{  worker_processes  }};
pid /run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;
    error_log /var/log/nginx/error.log;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    server {
        listen 80;
        server_name _;

        location / {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}

Ansible Roles

Roles are used to organize complex playbooks:

# roles/nginx/tasks/main.yml
- name: Install Nginx
  apt:
    name: nginx
    state: present

- name: Configure Nginx
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf

- name: Start Nginx
  systemd:
    name: nginx
    enabled: yes
    state: started

# site.yml - apply the roles
---
- hosts: web_servers
  roles:

    - nginx
    - { role: monitoring, tags: ['monitoring'] }

AWS CloudFormation

CloudFormation is the AWS-native IaC tool; templates are written in JSON or YAML:

# cloudformation-template.yml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web application infrastructure'

Parameters:
  InstanceType:
    Type: String
    Default: t3.medium
    AllowedValues:

      - t3.small
      - t3.medium
      - t3.large

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      Tags:

        - Key: Name
          Value: ProductionVPC

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']

  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0c55b159cbfafe1f0
      InstanceType: !Ref InstanceType
      SubnetId: !Ref PublicSubnet
      SecurityGroupIds:

        - !Ref WebSecurityGroup

  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for web servers
      VpcId: !Ref VPC
      SecurityGroupIngress:

        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

Outputs:
  VPCId:
    Description: VPC ID
    Value: !Ref VPC
    Export:
      Name: !Sub '${AWS::StackName}-VPCId'

  WebServerIP:
    Description: Web server public IP
    Value: !GetAtt WebServer.PublicIp

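IaC Best Practices

Version control: keep all IaC code in Git and follow a branching strategy
Code review: review infrastructure changes through pull requests
Environment isolation: use separate configuration and state for development, testing, and production
State locking: lock the state file with DynamoDB, Consul, or a similar backend
Change preview: preview every change before applying it, and apply only after confirming the result
Rollback plan: keep historical state so that changes can be rolled back quickly
Documentation: add clear comments and documentation to every module and resource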
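
The "change preview" practice can be partly automated. The sketch below summarizes a plan exported with terraform show -json and flags destructive changes before anyone runs terraform apply. It relies on the `resource_changes[].change.actions` field of Terraform's JSON plan representation; the function names and the sample plan are illustrative assumptions, not part of Terraform.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> Counter:
    """Count planned actions (create, update, delete, replace, no-op, read)."""
    plan = json.loads(plan_json)
    summary = Counter()
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change["change"]["actions"]
        # A replacement is reported as a delete/create pair in either order.
        if "delete" in actions and "create" in actions:
            summary["replace"] += 1
        else:
            for action in actions:
                summary[action] += 1
    return summary


def is_destructive(summary: Counter) -> bool:
    """True if the plan deletes or replaces at least one resource."""
    return summary["delete"] > 0 or summary["replace"] > 0


if __name__ == "__main__":
    # Hypothetical, heavily trimmed plan output used only for illustration.
    sample = json.dumps({
        "resource_changes": [
            {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
            {"address": "aws_instance.web[0]", "change": {"actions": ["delete", "create"]}},
            {"address": "aws_subnet.public[0]", "change": {"actions": ["create"]}},
        ]
    })
    result = summarize_plan(sample)
    print(dict(result), "destructive:", is_destructive(result))
```

A CI job could run such a check on every pull request and require an explicit approval whenever `is_destructive` returns True.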

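Monitoring

Monitoring is the foundation of cloud operations. It needs to be layered so that it covers infrastructure, applications, and the business.

Prometheus

Prometheus is an open-source monitoring system and a CNCF graduated project. It collects metrics with a pull model and is queried with PromQL.

Prometheus Configuration

Background: Prometheus pulls metrics, so scrape targets and alerting rules have to be configured. In Kubernetes, Pods are created and destroyed dynamically, so targets must be discovered automatically. A sound Prometheus configuration keeps metric collection complete and alerts timely.

Approach:
Global settings: use a consistent scrape interval and evaluation interval to balance data resolution against resource usage
Service discovery: let Kubernetes service discovery find Pods automatically instead of maintaining a target list by hand
Relabeling: use relabel_configs to rewrite labels and normalize metrics for querying and alerting
Alerting integration: configure the Alertmanager address so that firing alerts are sent to it

Design considerations:
Scrape interval: a short interval (15s) gives fresher data but increases Prometheus load; tune it to the importance of the metrics
External labels: attach external labels (such as cluster and environment) to all metrics to simplify multi-cluster management
Service discovery: Kubernetes service discovery finds Pods automatically, and Pod annotations control whether they are scraped
Security: access to the Kubernetes API requires a TLS CA certificate and a bearer token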

# prometheus.yml
# Main Prometheus configuration file
# Purpose: configure scrape targets, alerting rules, and Alertmanager integration

# Global settings: defaults applied to every scrape job
global:
  # Scrape interval: how often Prometheus scrapes its targets
  # 15 seconds balances data resolution against Prometheus load
  # Note: individual jobs can override this setting
  scrape_interval: 15s

  # Evaluation interval: how often Prometheus evaluates alerting rules
  # Keep it equal to or slightly larger than scrape_interval so rules have fresh data
  evaluation_interval: 15s

  # External labels: attached to all time series when they leave this Prometheus
  # Purpose: identify the data source in multi-cluster setups
  external_labels:
    cluster: 'production'      # cluster identifier
    environment: 'prod'        # environment identifier
    # More labels can be added, e.g. region or team

# Alerting rule files: define alert conditions and notification content
# Note: paths are relative to the directory of the Prometheus configuration file
rule_files:
  - "alert_rules.yml"
  # Multiple rule files can be listed, grouped by purpose
  # - "infrastructure_alerts.yml"
  # - "application_alerts.yml"

# Scrape configuration: where Prometheus scrapes metrics from
scrape_configs:

  # Job 1: Prometheus self-monitoring
  # Purpose: monitor the performance of Prometheus itself
  - job_name: 'prometheus'
    # Static configuration: targets are listed by hand
    # Suitable for services with a fixed IP or domain name
    static_configs:
      - targets: ['localhost:9090']
        # Labels can be added to tell instances apart
        # labels:
        #   instance: 'prometheus-1'

  # Job 2: Node Exporter
  # Purpose: monitor node-level infrastructure metrics (CPU, memory, disk, ...)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        # Several nodes can be monitored
        # - targets: ['node-exporter-1:9100', 'node-exporter-2:9100']

  # Job 3: automatic discovery of Kubernetes Pods
  # Purpose: discover Pods in the cluster and scrape application metrics
  # Benefit: no manual target list; adapts as Pods come and go
  - job_name: 'kubernetes-pods'
    # Kubernetes service discovery
    kubernetes_sd_configs:
      - role: pod  # discover Pod resources
        # Optional: restrict to certain namespaces
        # namespaces:
        #   names:
        #     - production
        #     - staging

    # Relabeling: modify or filter the discovered targets
    relabel_configs:
      # Rule 1: scrape only Pods annotated with prometheus.io/scrape=true
      # Purpose: opt-in monitoring instead of scraping every Pod
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep  # keep matching targets
        regex: true    # annotation value must be "true"

      # Rule 2: take the metrics path from a Pod annotation
      # Purpose: support custom metrics endpoints
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__  # built-in Prometheus label
        regex: (.+)  # any non-empty value

      # Rule 3: take the metrics port from a Pod annotation
      # Purpose: support metrics served on a port other than the discovered one
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)  # capture the IP and the annotated port
        replacement: $1:$2              # combine them as IP:port
        target_label: __address__

      # Optional: add custom labels for querying and alerting
      # - source_labels: [__meta_kubernetes_pod_name]
      #   target_label: pod_name
      # - source_labels: [__meta_kubernetes_namespace]
      #   target_label: namespace

  # Job 4: Kubernetes API server
  # Purpose: monitor the performance of the Kubernetes API server
  # Security: requires ServiceAccount permissions and TLS settings
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints  # discover Endpoints resources
    # The API server is served over HTTPS
    scheme: https
    tls_config:
      # CA certificate used to verify the API server certificate
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # Bearer token used to authenticate against the API server
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Keep only the https port of the kubernetes service in the default namespace
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

# Alerting: where to find Alertmanager
# Purpose: send firing alerts to Alertmanager for routing and notification
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'
          # Several Alertmanagers can be listed for high availability
          # - 'alertmanager-1:9093'
          # - 'alertmanager-2:9093'
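Key points:
Service discovery: Kubernetes service discovery finds Pods automatically, and the prometheus.io/scrape annotation controls whether a Pod is scraped, so new workloads need no changes to the Prometheus configuration
Relabeling: relabel_configs rewrites target labels before the scrape and can filter targets, rename labels, or add labels to keep metrics consistent
Scrape interval: shorter intervals give fresher data but add Prometheus and network load; 15s for critical metrics and 30s to 60s for the rest is a reasonable starting point
External labels: attached to all time series, which makes it easier to aggregate and query data across clusters

Design trade-offs:
Scrape frequency vs. resource usage: frequent scrapes give fresher data but cost Prometheus CPU and storage; tune the frequency to the importance of the metrics
Service discovery vs. static configuration: service discovery adapts to change but is more complex to configure; static configuration is simple but needs manual upkeep; a mix of both usually works best
Label count vs. query performance: more labels allow finer-grained queries but increase storage and query cost; add only the labels that are needed

FAQ:
Q: How do I monitor only Pods in specific namespaces? A: Add a namespaces block to kubernetes_sd_configs, or filter in relabel_configs
Q: What do the Pod annotations look like? A: prometheus.io/scrape: "true", prometheus.io/port: "8080", prometheus.io/path: "/metrics"
Q: How do I reduce Prometheus storage usage? A: Increase the scrape interval, shorten the retention period, and pre-aggregate metrics with recording rules

In production:
Manage the Prometheus configuration in a ConfigMap so that it is versioned and deployed automatically
Use different external labels per environment (production, testing) to keep data separated and easy to query
Review the scrape configuration regularly and remove jobs that are no longer used
Use the Prometheus Operator to simplify running Prometheus on Kubernetes
Set a sensible retention period to balance storage cost against the need for historical data
Use recording rules to pre-aggregate frequently used queries and speed them up
Monitor Prometheus's own metrics (such as prometheus_target_scrapes_exceeded_sample_limit_total) to catch configuration problems early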
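
Rule 3 in the relabel configuration is the least obvious one, so here is a small Python simulation of what it does. It relies on two Prometheus behaviors: the values of source_labels are joined with ";" before matching, and the regex is anchored at both ends. The function name `rewrite_address` is illustrative, and IPv6 addresses are deliberately ignored in this sketch.

```python
import re

# Same pattern as in the relabel rule; fullmatch() mimics Prometheus anchoring.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Simulate 'action: replace' with replacement '$1:$2' on __address__."""
    joined = f"{address};{port_annotation}"  # source_labels joined with ';'
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # No match: the replace action leaves the target label unchanged.
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9102"))  # annotated port replaces 8080
    print(rewrite_address("10.0.0.5", "9102"))       # port is appended
    print(rewrite_address("10.0.0.5:8080", ""))      # no annotation: unchanged
```

The third call shows why Pods without a prometheus.io/port annotation keep their discovered address: the pattern requires at least one digit after the separator, so the rule simply does not apply.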

Prometheus Alerting Rules

# alert_rules.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space is running low"
          description: "Disk space is below 15% for {{ $labels.instance }}"

  - name: application
    interval: 30s
    rules:
      # Aggregate by service on both sides; dividing the raw series would only
      # match series with identical label sets and never yield a real ratio.
      - alert: HighErrorRate
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for {{ $labels.service }}"

      # Keep the le label when aggregating buckets so histogram_quantile works.
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is above 1s for {{ $labels.service }}"
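
Every rule above uses for: 5m, which is what keeps short spikes from paging anyone. The following is a simplified model of that behavior, assuming one evaluation per call and timestamps in seconds; the `AlertState` class is illustrative and not part of Prometheus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert through the inactive -> pending -> firing states."""
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, condition: bool, for_seconds: float) -> str:
        if not condition:
            # The condition cleared, so the pending timer resets.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    state = AlertState()
    samples = [(0, True), (150, True), (300, True), (330, False), (360, True)]
    for ts, cond in samples:
        print(ts, state.evaluate(ts, cond, for_seconds=300))
```

A longer for duration reduces noise at the price of slower detection, which is one of the main levers for fighting alert fatigue.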

PromQL Query Examples

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP requests per second (QPS)
sum(rate(http_requests_total[5m])) by (service)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
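
The P95 query is an estimate, not an exact percentile: histogram_quantile interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea for classic cumulative buckets. It is a simplified reimplementation for illustration, assuming the buckets are sorted by upper bound and end with +Inf; it is not the Prometheus source code and skips several edge cases.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical bucket counts: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1s.
    sample = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, sample))
```

Because of the interpolation, the accuracy of the result depends on the bucket boundaries; placing a boundary near the latency threshold used in alerts gives a more trustworthy signal.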

Grafana

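Grafana is an open-source visualization platform that supports many data sources and offers a rich set of chart types as well as alerting.

Grafana Dashboard JSON Example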

{
  "dashboard": {
    "title": "Infrastructure Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yaxes": [
          {
            "format": "percent",
            "max": 100,
            "min": 0
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yaxes": [
          {
            "format": "percent",
            "max": 100,
            "min": 0
          }
        ]
      },
      {
        "title": "HTTP Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "yaxes": [
          {
            "format": "reqps"
          }
        ]
      }
    ]
  }
}

Cloud Provider Monitoring Services

All major cloud providers offer native monitoring services:

AWS CloudWatch

# Create a CloudWatch alarm
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-usage \
  --alarm-description "Alert when CPU exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:cn-north-1:123456789012:alerts-topic

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-name RequestCount \
  --value 100 \
  --unit Count

Alibaba Cloud CloudMonitor

# Publish a custom metric with the Python SDK
from aliyunsdkcore.client import AcsClient
from aliyunsdkcms.request.v20190101 import PutCustomMetricRequest

client = AcsClient('your-access-key-id', 'your-access-key-secret', 'cn-hangzhou')

request = PutCustomMetricRequest.PutCustomMetricRequest()
request.set_MetricList([
    {
        "MetricName": "RequestCount",
        "Value": "100",
        "Unit": "Count",
        "Dimensions": "{\"service\":\"api\"}"
    }
])

response = client.do_action_with_exception(request)

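Monitoring Best Practices

Layered monitoring: monitor the infrastructure, application, and business layers separately
Metric selection: focus on the key metrics (the four golden signals: latency, traffic, errors, saturation)
Alert consolidation: prevent alert storms with alert grouping and inhibition rules
SLO-driven: derive alert thresholds from SLOs
Observability: treat metrics, logs, and traces as one system
Cost control: set sensible metric retention periods to keep storage costs in check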
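Log Management

Logs are essential for troubleshooting and auditing. Log management in the cloud has to address collection, storage, search, and analysis.

ELK Stack

The ELK Stack (Elasticsearch, Logstash, Kibana) is the classic log management solution.

Logstash Configuration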

# logstash.conf
input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx-access"
    start_position => "beginning"
  }
  
  file {
    path => "/var/log/nginx/error.log"
    type => "nginx-error"
    start_position => "beginning"
  }
  
  beats {
    port => 5044
  }
}

filter {
  if [type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    
    geoip {
      source => "clientip"
    }
  }
  
  if [type] == "nginx-error" {
    grok {
      match => { "message" => "(?<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[%{LOGLEVEL:severity}\] %{GREEDYDATA:message}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{type}-%{+YYYY.MM.dd}"
  }
  
  stdout {
    codec => rubydebug
  }
}

Filebeat Configuration

# filebeat.yml
filebeat.inputs:

  - type: log
    enabled: true
    paths:

      - /var/log/app/*.log
    fields:
      service: api-server
      environment: production
    fields_under_root: false
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after

  - type: container
    enabled: true
    paths:

      - '/var/lib/docker/containers/*/*.log'
    processors:

      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"

processors:

  - add_host_metadata:
      when.not.contains.tags: forwarded

  - add_cloud_metadata: ~

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644

Elasticsearch Index Template

{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "message": {
          "type": "text",
          "analyzer": "standard"
        },
        "level": {
          "type": "keyword"
        },
        "service": {
          "type": "keyword"
        },
        "host": {
          "properties": {
            "name": {
              "type": "keyword"
            },
            "ip": {
              "type": "ip"
            }
          }
        }
      }
    }
  }
}

Cloud Logging Services

AWS CloudWatch Logs

# Create a log group
aws logs create-log-group --log-group-name /aws/ec2/myapp

# Send log events
aws logs put-log-events \
  --log-group-name /aws/ec2/myapp \
  --log-stream-name stream1 \
  --log-events timestamp=1234567890000,message="Log message"

# Search the logs
aws logs filter-log-events \
  --log-group-name /aws/ec2/myapp \
  --filter-pattern "ERROR" \
  --start-time 1234567890000

Alibaba Cloud Log Service (SLS)

# Send logs with the Python SDK
from aliyun.log import LogClient
from aliyun.log.putlogsrequest import PutLogsRequest
import time

client = LogClient('cn-hangzhou', 'your-access-key-id', 'your-access-key-secret')

log_group = {
    'loggroup': [
        {
            'logs': [
                {
                    'time': int(time.time()),
                    'contents': [
                        {'key': 'level', 'value': 'INFO'},
                        {'key': 'message', 'value': 'Application started'}
                    ]
                }
            ]
        }
    ]
}

request = PutLogsRequest(
    project='my-project',
    logstore='my-logstore',
    topic='',
    source='api-server',
    logitems=log_group['loggroup'][0]['logs']
)

response = client.put_logs(request)

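Log Analysis in Practice

Log Query Example

The following Elasticsearch Query DSL request counts ERROR entries from the last hour per service. It assumes the index template above, where level and service are mapped as keyword fields, so they are queried directly rather than through a .keyword sub-field:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "level": "ERROR"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": {
        "field": "service",
        "size": 10
      }
    }
  }
}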

Log Alerting Rules

# ElastAlert rule
name: High Error Rate
type: frequency
index: logs-*
num_events: 100
timeframe:
  minutes: 5
filter:

  - term:
      level: "ERROR"
alert:

  - "email"
  - "slack"
email:

  - "ops@example.com"
slack:
  slack_webhook_url: "https://hooks.slack.com/services/..."

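Log Management Best Practices

Structured logs: use JSON so that logs are easy to parse and query
Log levels: use DEBUG, INFO, WARN, and ERROR appropriately
Sensitive data: never log passwords, tokens, or other secrets
Log rotation: set a sensible retention policy to control storage cost
Centralized collection: collect the logs of all services in one place for correlation
Real-time monitoring: set up real-time alerts on error logs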
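
The first and third practices can be combined in a single log formatter. The sketch below uses only the Python standard library; the `context` attribute, the key list in `SENSITIVE_KEYS`, and the class name are assumptions made for this example rather than an established convention.

```python
import json
import logging

# Keys whose values must never reach the log output (illustrative list).
SENSITIVE_KEYS = {"password", "token", "secret"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object and mask sensitive fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Callers pass structured fields via extra={"context": {...}}.
        for key, value in getattr(record, "context", {}).items():
            payload[key] = "***" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("api-server")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("login ok", extra={"context": {"user": "alice", "token": "abc123"}})
```

Masking at the formatter level is a safety net, not a substitute for keeping secrets out of log calls in the first place.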

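APM: Application Performance Monitoring

APM (Application Performance Monitoring) focuses on performance at the application level and helps locate bottlenecks and optimization opportunities.

Core APM Metrics

Response time: the time from sending a request to receiving the response
Throughput: the number of requests handled per unit of time
Error rate: the share of failed requests among all requests
Resource usage: consumption of CPU, memory, database connections, and similar resources
Dependency calls: calls to external services, databases, caches, and so on

Distributed Tracing

Distributed tracing follows a request along its full path through a microservice architecture.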
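
What ties the spans of one request together is a trace ID that travels with every downstream call. The sketch below shows the idea using the W3C Trace Context traceparent header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The helper names are illustrative; in practice a tracing library handles this propagation for you.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: fresh trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent: str) -> str:
    """Continue a trace: keep the trace ID and flags, create a new span ID."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        # Malformed or missing header: start a new trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print("incoming:", incoming)
    print("outgoing:", outgoing)  # same trace ID, different span ID
```

Every service that forwards the header this way contributes spans to the same trace, which is what lets a backend such as Jaeger reconstruct the full request path.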

OpenTelemetry Integration

# Integrating OpenTelemetry into a Python (Flask) application
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry
# Python releases; newer setups typically export via OTLP instead.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Automatic instrumentation for Flask and Requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()


def call_payment_service(order_id):
    # Placeholder for the real payment call (illustrative stub)
    return {"order_id": order_id, "status": "paid"}


# Creating spans manually
def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Call the external payment service inside its own child span
        with tracer.start_as_current_span("payment_service") as payment_span:
            payment_span.set_attribute("payment.amount", 100.0)
            result = call_payment_service(order_id)
            payment_span.set_status(trace.Status(trace.StatusCode.OK))

        return result

Jaeger Configuration

# docker-compose.yml
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:

      - "16686:16686"  # UI
      - "6831:6831/udp"  # Agent
      - "6832:6832/udp"
      - "14268:14268"  # Collector HTTP
    environment:

      - COLLECTOR_ZIPKIN_HTTP_PORT=9411

APM Tools

New Relic

# New Relic Python Agent
import newrelic.agent

newrelic.agent.initialize('newrelic.ini')

@newrelic.agent.function_trace()
def expensive_operation():
    # Business logic goes here
    pass

# Custom metric
newrelic.agent.record_custom_metric('Custom/OrderCount', 100)

# Custom event
newrelic.agent.record_custom_event('OrderCreated', {
    'order_id': '12345',
    'amount': 99.99
})

Datadog APM

# Datadog Python APM
from ddtrace import patch_all
patch_all()

from flask import Flask
app = Flask(__name__)

@app.route('/api/orders')
def get_orders():
    # Traced automatically
    return {'orders': []}

# Custom span
from ddtrace import tracer

with tracer.trace("custom.operation") as span:
    span.set_tag("order.id", "12345")
    # Business logic goes here

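APM Best Practices

Sampling strategy: choose a sampling rate that balances coverage against overhead
Critical-path tracing: focus on the core business flows
Exception capture: capture and report exceptions automatically
Performance baseline: establish a baseline so that regressions are noticed early
Dependency analysis: identify slow dependencies and optimize the call chain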
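
A common way to implement the sampling strategy is to decide from the trace ID itself, so that every service reaches the same decision for the same trace without coordination. The sketch below is similar in spirit to ratio-based samplers in tracing SDKs, but it is a simplified illustration; `should_sample` is a hypothetical helper and the trace ID is assumed to be a 32-character hex string.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, deterministically per trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Compare the low 64 bits of the trace ID against a threshold.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    for ratio in (0.01, 0.5, 1.0):
        print(ratio, should_sample(trace_id, ratio))
```

Since trace IDs are random, the low bits are uniformly distributed and the kept share converges on the configured ratio, while a single trace is never half-recorded.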

Automated Operations in Practice

Automation is at the heart of DevOps: it reduces manual work and improves both efficiency and reliability.

Automated Deployment Script

#!/bin/bash
# deploy.sh - automated deployment script

set -e  # exit immediately on any error

ENVIRONMENT=${1:-production}
VERSION=${2:-latest}

echo "Deploying version $VERSION to $ENVIRONMENT"

# 1. Fetch and check out the requested version (must be a valid Git ref)
git fetch origin
git checkout "$VERSION"

# 2. Build the image
docker build -t "myapp:$VERSION" .
docker tag "myapp:$VERSION" "registry.example.com/myapp:$VERSION"

# 3. Push the image
docker push "registry.example.com/myapp:$VERSION"

# 4. Update the Kubernetes deployment
kubectl set image deployment/myapp \
  app="registry.example.com/myapp:$VERSION" \
  -n "$ENVIRONMENT"

# 5. Wait for the rollout to finish
kubectl rollout status deployment/myapp -n "$ENVIRONMENT" --timeout=5m

# 6. Health check: up to 10 attempts, 10 seconds apart
for i in {1..10}; do
  if curl -f "http://myapp.$ENVIRONMENT.example.com/health"; then
    echo "Health check passed"
    exit 0
  fi
  sleep 10
done

echo "Health check failed"
exit 1

Automated Backup Script

#!/bin/bash
# backup.sh - database backup script

# Stop on errors, on unset variables, and on failures inside pipelines
set -euo pipefail

BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

# MySQL backup (DB_PASSWORD must be set in the environment)
mysqldump -h db.example.com -u backup_user -p"$DB_PASSWORD" \
  --single-transaction \
  --routines \
  --triggers \
  myapp > "$BACKUP_DIR/mysql_$DATE.sql"

# Compress the backup
gzip "$BACKUP_DIR/mysql_$DATE.sql"

# Upload to S3 (object key: <YYYYmmdd>_<HHMMSS>.sql.gz)
aws s3 cp "$BACKUP_DIR/mysql_$DATE.sql.gz" \
  "s3://backup-bucket/mysql/$DATE.sql.gz"

# Remove old local backups
find "$BACKUP_DIR" -name "mysql_*.sql.gz" -mtime +"$RETENTION_DAYS" -delete

# Remove old backups from S3 (requires GNU date)
aws s3 ls s3://backup-bucket/mysql/ | \
  awk '{print $4}' | \
  while read -r file; do
    # The date is the part of the object key before the underscore
    file_date=$(echo "$file" | cut -d'_' -f1)
    if [ "$(date -d "$file_date" +%s)" -lt "$(date -d "$RETENTION_DAYS days ago" +%s)" ]; then
      aws s3 rm "s3://backup-bucket/mysql/$file"
    fi
  done

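Automated Test Script

#!/bin/bash
# test.sh - automated test script

set -e

echo "Running unit tests..."
npm test

echo "Running integration tests..."
docker-compose -f docker-compose.test.yml up -d
# Tear the test stack down even if the integration tests fail
trap 'docker-compose -f docker-compose.test.yml down' EXIT
sleep 10
npm run test:integration
docker-compose -f docker-compose.test.yml down
trap - EXIT

echo "Running E2E tests..."
npm run test:e2e

echo "All tests passed!"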

CI/CD Pipelines

Jenkins Pipeline

// Jenkinsfile
pipeline {
    agent any
    
    environment {
        DOCKER_REGISTRY = 'registry.example.com'
        KUBERNETES_NAMESPACE = 'production'
    }
    
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        
        stage('Build') {
            steps {
                sh 'docker build -t ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER} .'
            }
        }
        
        stage('Test') {
            steps {
                sh 'docker run --rm ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER} npm test'
            }
        }
        
        stage('Security Scan') {
            steps {
                sh 'trivy image ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER}'
            }
        }
        
        stage('Push') {
            steps {
                sh 'docker push ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER}'
                sh 'docker tag ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER} ${DOCKER_REGISTRY}/myapp:latest'
                sh 'docker push ${DOCKER_REGISTRY}/myapp:latest'
            }
        }
        
        stage('Deploy') {
            steps {
                sh '''
                    kubectl set image deployment/myapp \
                      app=${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER} \
                      -n ${KUBERNETES_NAMESPACE}
                    kubectl rollout status deployment/myapp -n ${KUBERNETES_NAMESPACE}
                '''
            }
        }
    }
    
    post {
        success {
            slackSend(
                channel: '#deployments',
                color: 'good',
                message: "Deployment successful: ${BUILD_NUMBER}"
            )
        }
        failure {
            slackSend(
                channel: '#deployments',
                color: 'danger',
                message: "Deployment failed: ${BUILD_NUMBER}"
            )
        }
    }
}

GitLab CI/CD

# .gitlab-ci.yml
stages:

  - build
  - test
  - security
  - deploy

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: "/certs"

build:
  stage: build
  script:

    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - docker tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA $CI_REGISTRY_IMAGE:latest
    - docker push $CI_REGISTRY_IMAGE:latest

test:
  stage: test
  script:

    - docker run --rm $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA npm test

security-scan:
  stage: security
  script:

    - docker run --rm -v /var/run/docker.sock:/var/run/docker.sock
        aquasec/trivy image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

deploy:
  stage: deploy
  script:

    - kubectl set image deployment/myapp app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n production
    - kubectl rollout status deployment/myapp -n production
  only:

    - main

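Automation Best Practices

Idempotency: make sure scripts can be run repeatedly without side effects
Error handling: provide thorough error handling and a rollback path
Logging: record the outcome of every step in detail
Notifications: tell the people involved about deployment status promptly
Version control: keep all scripts under version control
Test first: validate in a test environment before applying to production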
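Autoscaling Strategies

Autoscaling is one of the core advantages of cloud computing: resources are scaled automatically to match the load.

Kubernetes HPA (Horizontal Pod Autoscaler)

HPA scales the number of Pods horizontally based on metrics such as CPU and memory.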

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:

    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
      selectPolicy: Max
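
The replica count that HPA aims for follows the formula given in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas. The sketch below is a simplified reimplementation for intuition only. It handles a single metric, applies the default 10% tolerance, and ignores the stabilization windows and behavior policies configured above; the function name is made up for this example.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: keep the current replica count to avoid flapping.
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 Pods at 90% average CPU against a 70% target: ceil(4 * 90 / 70) = 6
scaled_up = desired_replicas(4, 90, 70, min_replicas=2, max_replicas=10)
# 4 Pods at 20% average CPU: ceil(4 * 20 / 70) = 2
scaled_down = desired_replicas(4, 20, 70, min_replicas=2, max_replicas=10)
```

When several metrics are configured, as in the manifest above, HPA evaluates each one and uses the largest resulting replica count.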

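Kubernetes VPA (Vertical Pod Autoscaler)

VPA adjusts a Pod's resource requests and limits vertically based on historical usage. It is a separate add-on that has to be installed in the cluster, and in Auto mode it should not be combined with an HPA that scales the same workload on CPU or memory, because the two controllers would react to each other's changes.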

# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

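Kubernetes CA (Cluster Autoscaler)

CA adjusts the number of nodes automatically: it adds nodes when Pods cannot be scheduled for lack of resources and removes nodes that stay underutilized.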

# cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
          env:
            - name: AWS_REGION
              value: cn-north-1
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi

AWS Auto Scaling

# Terraform configuration for an Auto Scaling Group
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"

  vpc_security_group_ids = [aws_security_group.web.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
  EOF
  )
}

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  vpc_zone_identifier = aws_subnet.public[*].id
  target_group_arns   = [aws_lb_target_group.web.arn]
  health_check_type   = "ELB"
  health_check_grace_period = 300

  min_size         = 2
  max_size         = 10
  desired_capacity = 2

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.web.name
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 70
  alarm_description   = "This metric monitors ec2 cpu utilization"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
}
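
With the settings above, the alarm enters the ALARM state when the average CPU utilization stays above 70 for 2 consecutive 300-second periods, which in turn triggers the scale-up policy. The following sketch mimics that evaluation on a list of per-period averages. It is a simplification that assumes there are no missing data points, and the function name is made up for this example.

```python
from typing import Sequence


def alarm_fires(period_averages: Sequence[float], threshold: float = 70.0,
                evaluation_periods: int = 2) -> bool:
    """Return True if the most recent `evaluation_periods` values all exceed the threshold."""
    if len(period_averages) < evaluation_periods:
        return False  # not enough data yet
    recent = period_averages[-evaluation_periods:]
    # GreaterThanThreshold is a strict comparison.
    return all(value > threshold for value in recent)


spike = alarm_fires([40.0, 85.0, 60.0])      # a single hot period is not enough
sustained = alarm_fires([40.0, 75.0, 82.0])  # two consecutive periods above 70
```

Requiring two periods trades reaction time (at least ten minutes here) for fewer scale-ups caused by short spikes.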

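Elastic scaling best practices

Cooldown and warm-up: set a sensible cooldown period so that the group does not scale back and forth too often
Predictive scaling: forecast load changes from historical data
Multiple metrics: combine CPU, memory, request rate and other metrics
Cost balance: find the balance point between performance and cost
Separate node pools: run different kinds of workloads on different node pools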

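Cost optimization

Cost optimization is an important operations responsibility: the goal is to balance performance against spend.

Reserved Instances

Reserved Instances can cut compute costs substantially in exchange for a one- or three-year commitment, which makes them a good fit for steady, predictable load.

# Purchase Reserved Instances with the AWS CLI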
aws ec2 purchase-reserved-instances-offering \
  --reserved-instances-offering-id "12345678-1234-1234-1234-123456789012" \
  --instance-count 5

# List active Reserved Instances
aws ec2 describe-reserved-instances \
  --filters "Name=state,Values=active"
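
Whether a reservation pays off depends on how many hours the instance actually runs. The following back-of-the-envelope check uses made-up prices that are not real AWS rates, and a hypothetical function name.

```python
def reserved_break_even_utilization(on_demand_hourly: float,
                                    reserved_effective_hourly: float) -> float:
    """Fraction of hours an instance must run for a reservation to beat on-demand.

    The reserved price is paid for every hour of the term whether or not the
    instance runs, while on-demand is paid only for the hours actually used.
    """
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_effective_hourly / on_demand_hourly


# Illustrative numbers only: $0.10/h on-demand vs. $0.06/h effective reserved rate.
threshold = reserved_break_even_utilization(0.10, 0.06)
# An instance that runs less than about 60% of the time would be cheaper on-demand.
```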

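Spot Instances

Spot Instances are priced well below on-demand, but AWS can reclaim them at short notice, so they suit interruptible workloads.

# Terraform configuration for a Spot Instance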
resource "aws_spot_instance_request" "worker" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.large"
  spot_price             = "0.05"
  wait_for_fulfillment   = true
  vpc_security_group_ids = [aws_security_group.worker.id]

  tags = {
    Name = "spot-worker"
  }
}

# Spot Fleet
resource "aws_spot_fleet_request" "workers" {
  iam_fleet_role      = aws_iam_role.spot_fleet.arn
  allocation_strategy = "lowestPrice"
  target_capacity     = 10
  valid_until         = "2025-12-31T23:59:59Z"

  launch_specification {
    instance_type     = "t3.medium"
    ami               = data.aws_ami.ubuntu.id
    spot_price        = "0.05"
    availability_zone = "cn-north-1a"
  }

  launch_specification {
    instance_type     = "t3.large"
    ami               = data.aws_ami.ubuntu.id
    spot_price        = "0.08"
    availability_zone = "cn-north-1b"
  }
}

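Savings Plans

Savings Plans offer a more flexible discount: you commit to an hourly spend rather than to specific instances, and the discount applies across the compute services covered by the plan.

# Create a Savings Plan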
aws savingsplans create-savings-plan \
  --savings-plan-offering-id "sp-1234567890abcdef0" \
  --commitment "1000" \
  --upfront-payment-amount "0" \
  --purchase-time "2025-01-01T00:00:00Z"

Cost optimization script

# cost_optimizer.py - cost optimization script
import boto3
from datetime import datetime, timedelta

def find_unused_volumes():
    """Find unattached EBS volumes that have no snapshot."""
    ec2 = boto3.client('ec2')
    
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )
    
    unused = []
    for volume in volumes['Volumes']:
        # Check whether any snapshot owned by this account was taken from the volume
        snapshots = ec2.describe_snapshots(
            OwnerIds=['self'],
            Filters=[{'Name': 'volume-id', 'Values': [volume['VolumeId']]}]
        )
        
        if not snapshots['Snapshots']:
            unused.append({
                'VolumeId': volume['VolumeId'],
                'Size': volume['Size'],
                'CreateTime': volume['CreateTime']
            })
    
    return unused

def find_idle_instances():
    """Find idle EC2 instances."""
    cloudwatch = boto3.client('cloudwatch')
    ec2 = boto3.client('ec2')
    
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=7)
    
    instances = ec2.describe_instances()
    idle_instances = []
    
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            if instance['State']['Name'] != 'running':
                continue
            
            # Check CPU utilization over the last 7 days
            response = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[
                    {'Name': 'InstanceId', 'Value': instance['InstanceId']}
                ],
                StartTime=start_time,
                EndTime=end_time,
                Period=3600,
                Statistics=['Average']
            )
            
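            datapoints = response['Datapoints']
            if not datapoints:
                # No metrics in the window (e.g. a newly launched instance):
                # skip it rather than report it as idle
                continue
            avg_cpu = sum(point['Average'] for point in datapoints) / len(datapoints)

            if avg_cpu < 5:  # average CPU utilization below 5%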
                idle_instances.append({
                    'InstanceId': instance['InstanceId'],
                    'InstanceType': instance['InstanceType'],
                    'AvgCPU': avg_cpu
                })
    
    return idle_instances

def optimize_rightsizing():
    """Produce instance right-sizing recommendations."""
    cloudwatch = boto3.client('cloudwatch')
    ec2 = boto3.client('ec2')
    
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=14)
    
    instances = ec2.describe_instances()
    recommendations = []
    
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            if instance['State']['Name'] != 'running':
                continue
            
            # Fetch CPU utilization (memory is not a default EC2 metric; it requires the CloudWatch agent)
            cpu_response = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[
                    {'Name': 'InstanceId', 'Value': instance['InstanceId']}
                ],
                StartTime=start_time,
                EndTime=end_time,
                Period=3600,
                Statistics=['Average', 'Maximum']
            )
            
            # Analyze utilization and produce a recommendation
            # ... optimization logic ...
    
    return recommendations

if __name__ == '__main__':
    print("Finding unused volumes...")
    unused_volumes = find_unused_volumes()
    print(f"Found {len(unused_volumes)} unused volumes")
    
    print("Finding idle instances...")
    idle_instances = find_idle_instances()
    print(f"Found {len(idle_instances)} idle instances")

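Cost optimization best practices

Resource tags: use tags to track cost and ownership
Regular reviews: review resource usage regularly and clean up idle resources
Automated optimization: use scripts to find and optimize resources automatically
Budget alerts: set budget alerts to catch problems early
Multiple accounts: isolate environments in separate accounts to simplify cost management
Spot Instances: run interruptible workloads on Spot Instances
Reserved Instance planning: analyze historical usage before buying Reserved Instances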

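Troubleshooting and recovery

Failures are unavoidable; locating them quickly and recovering from them is a core operations skill.

Troubleshooting process

Confirm the failure: verify that the problem is real
Gather information: collect logs, monitoring data and user reports
Locate the root cause: analyze the data to find what is actually wrong
Plan: decide on a recovery and repair plan
Recover: carry out the recovery steps
Verify: confirm that the problem is resolved
Post-incident review: analyze the cause and improve the system

Common failure scenarios

Service unavailable

# Check Pod status
kubectl get pods -n production
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production

# Check Service endpoints
kubectl get endpoints -n production
kubectl get svc -n production

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check resource usage
kubectl top nodes
kubectl top pods -n production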

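Performance problems

# These commands only work if the tools are present in the container image

# Inspect CPU usage
kubectl exec -it <pod-name> -n production -- top

# Inspect memory usage
kubectl exec -it <pod-name> -n production -- free -h

# Inspect network connections
kubectl exec -it <pod-name> -n production -- netstat -an

# Inspect processes
kubectl exec -it <pod-name> -n production -- ps aux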

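Database problems

-- Check current connections
SHOW PROCESSLIST;

-- Check slow queries (requires slow_query_log=ON and log_output=TABLE)
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;

-- Check lock waits (MySQL 5.7; in MySQL 8.0 these tables were replaced by
-- performance_schema.data_locks and performance_schema.data_lock_waits)
SELECT * FROM information_schema.innodb_locks;
SELECT * FROM information_schema.innodb_lock_waits;

-- Check table sizes
SELECT 
    table_schema,
    table_name,
    ROUND(((data_length + index_length) / 1024 / 1024), 2) AS size_mb
FROM information_schema.tables
ORDER BY size_mb DESC;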

Recovery script

#!/bin/bash
# disaster_recovery.sh - disaster recovery script

# Stop on errors, on unset variables and on failures inside pipelines
set -euo pipefail

BACKUP_BUCKET="s3://backup-bucket"
RESTORE_DATE=${1:-$(date +%Y%m%d)}

echo "Starting disaster recovery for date: $RESTORE_DATE"

# 1. Restore the database (DB_PASSWORD must be set in the environment)
echo "Restoring database..."
aws s3 cp "$BACKUP_BUCKET/mysql_${RESTORE_DATE}.sql.gz" /tmp/
gunzip -f "/tmp/mysql_${RESTORE_DATE}.sql.gz"
mysql -h db.example.com -u admin -p"$DB_PASSWORD" myapp < "/tmp/mysql_${RESTORE_DATE}.sql"

# 2. Restore file storage
echo "Restoring file storage..."
aws s3 sync "$BACKUP_BUCKET/files_${RESTORE_DATE}/" /var/www/files/

# 3. Restart services
echo "Restarting services..."
kubectl rollout restart deployment/myapp -n production
kubectl rollout status deployment/myapp -n production

# 4. Verify services (poll the health endpoint for up to 5 minutes)
echo "Verifying services..."
for i in {1..30}; do
    if curl -f http://myapp.example.com/health; then
        echo "Service is healthy"
        exit 0
    fi
    sleep 10
done

echo "Service verification failed"
exit 1

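Troubleshooting best practices

Build runbooks: define standard procedures for common failures
Monitoring and alerting: put thorough monitoring and alerting in place
Centralized logs: collect all logs in one place so they can be correlated
Drills and training: run failure drills regularly to build team skills
Documentation: record how each incident was handled and grow a knowledge base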

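SRE best practices

SRE (Site Reliability Engineering) is an operations discipline that originated at Google; it emphasizes using engineering methods to keep services reliable.

Error budget

The error budget is the upper bound on acceptable unreliability: 100% minus the SLO.

Error budget = 1 - SLO

For example, an SLO of 99.9% leaves an error budget of 0.1%.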
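
To make the number concrete, the budget can be converted into the downtime it allows over a given window. A small sketch (the function name is made up for this example):

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that an availability SLO permits per window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    error_budget = 1.0 - slo
    return error_budget * window_days * 24 * 60


three_nines = allowed_downtime_minutes(0.999)   # about 43.2 minutes per 30 days
four_nines = allowed_downtime_minutes(0.9999)   # about 4.3 minutes per 30 days
```

Each additional nine shrinks the budget by a factor of ten, which is why the SLO should follow what users actually need rather than what sounds impressive.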

SLO (Service Level Objective)

An SLO is a reliability target for a service, usually expressed in terms of availability or latency.

# Example SLO definition
slo:
  name: api-availability
  description: API availability SLO
  target: 99.9%
  window: 30 days
  metrics:
    - name: availability
      type: ratio
      numerator: successful_requests
      denominator: total_requests
      threshold: 0.999

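SLI (Service Level Indicator)

An SLI is a metric that measures how reliable the service is.

# Availability SLI: share of requests that did not return a 5xx status
sum(rate(http_requests_total{status!~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m]))

# Latency SLI: 95th percentile request duration
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)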
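
The latency query relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the requested rank. The sketch below is a simplified reimplementation of that idea for classic histograms, meant for intuition rather than as a reproduction of every edge case Prometheus handles; the bucket values are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests: 50 took <= 0.1 s, 90 took <= 0.2 s, all took <= 0.5 s
latency_buckets = [(0.1, 50), (0.2, 90), (0.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)  # about 0.35 s
```

Because the result is interpolated, its accuracy depends on the bucket boundaries; a latency SLO at 200 ms is best served by a bucket boundary at or near 0.2.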

SRE in practice

Monitoring SLOs

# Prometheus SLO monitoring
groups:
  - name: slo
    rules:
      - record: slo:availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Fires when the 5-minute error ratio exceeds the ratio allowed by a 99.9% SLO
      - alert: SLOBreach
        expr: (1 - slo:availability:ratio) > (1 - 0.999)
        for: 5m
        annotations:
          summary: "SLO breach detected"
          description: "Error budget consumption exceeds threshold"
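
A common refinement is to alert on the burn rate, that is, how fast the error budget is being consumed relative to the rate that would use it up exactly at the end of the SLO window. The sketch below is a minimal illustration with made-up function names; the 14.4 figure is simply the burn rate that consumes 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def budget_consumed(error_ratio: float, slo: float, hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's error budget used if `error_ratio` persists for `hours`."""
    return burn_rate(error_ratio, slo) * hours / (window_days * 24)


rate = burn_rate(error_ratio=0.0144, slo=0.999)                   # roughly 14.4x
used = budget_consumed(error_ratio=0.0144, slo=0.999, hours=1)    # roughly 0.02
```

A burn rate of 1 means the budget lasts exactly the full window; anything well above 1 over a short window is worth paging someone for.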

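Error budget policy

Budget healthy: new features can ship and some risk is acceptable
Budget tight: pause feature releases and focus on stability
Budget exhausted: freeze all changes and put every effort into fixing the problem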
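
The three states above can be encoded as a small decision function, for example as a gate in a release pipeline. The 25% cut-off for a "tight" budget is an arbitrary value chosen for this sketch, and the function name is hypothetical.

```python
def release_policy(budget_remaining: float, tight_below: float = 0.25) -> str:
    """Map the remaining error budget (1.0 = untouched, 0.0 = used up) to a policy."""
    if budget_remaining <= 0.0:
        return "freeze all changes"
    if budget_remaining < tight_below:
        return "pause feature releases"
    return "ship features"


decisions = [release_policy(r) for r in (0.8, 0.1, 0.0)]
```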

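SRE best practices

Define clear SLOs: set reasonable SLOs based on what users need
Monitor the error budget: track error budget consumption in real time
Automate operations: reduce human error through automation
Capacity planning: plan capacity ahead of time to avoid resource shortages
Change management: control changes strictly to lower the risk of failure
Post-incident review: review every incident and keep improving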

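DevOps toolchain

The DevOps toolchain covers every stage: development, build, test, deployment and monitoring.

Version control

Git: distributed version control system
GitHub/GitLab: code hosting and collaboration platforms
Git Flow: branching model

CI/CD

Jenkins: open-source CI/CD platform
GitLab CI/CD: CI/CD built into GitLab
GitHub Actions: GitHub's CI/CD service
CircleCI: cloud-native CI/CD platform
Travis CI: continuous integration service

Containers

Docker: container runtime
Kubernetes: container orchestration platform
Helm: package manager for Kubernetes
Docker Compose: multi-container application orchestration

Configuration management

Ansible: automated configuration management
Chef: infrastructure automation
Puppet: configuration management tool
SaltStack: event-driven automation

Monitoring and logging

Prometheus: monitoring system
Grafana: visualization platform
ELK Stack: log management
Jaeger: distributed tracing
New Relic: APM platform
Datadog: monitoring and analytics platform

Security

Vault: secrets management
Trivy: container security scanning
OWASP ZAP: security testing tool
SonarQube: code quality analysis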

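GitOps in practice

GitOps is an operations practice built on Git: the configuration of infrastructure and applications lives in a Git repository, and deployments are triggered by Git operations.

GitOps workflow

Developer commits code: a developer pushes code to the Git repository
CI builds the image: the CI system builds a Docker image and pushes it to the image registry
Update configuration: the image version in the Kubernetes manifests is updated
Git commit: the configuration change is committed to the Git repository
GitOps tool syncs: ArgoCD or Flux detects the change and syncs it to the cluster
Automatic deployment: the application is deployed to the Kubernetes cluster automatically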
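
The synchronization step at the heart of this workflow is a reconciliation loop: compare the desired state declared in Git with the live state in the cluster and compute the actions that close the gap. The sketch below models both states as plain dictionaries keyed by resource name. It is a conceptual illustration with made-up names, not a description of how ArgoCD or Flux are implemented internally.

```python
from typing import Dict, List, Tuple


def plan_sync(desired: Dict[str, str], live: Dict[str, str],
              prune: bool = True) -> List[Tuple[str, str]]:
    """Return (action, resource) pairs that would make `live` match `desired`.

    Values stand in for a resource's spec, for example an image tag.
    """
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift or a new commit
    if prune:
        # Resources that exist in the cluster but are no longer declared in Git.
        for name in sorted(set(live) - set(desired)):
            actions.append(("delete", name))
    return actions


git_state = {"deployment/myapp": "myapp:v2", "service/myapp": "port-80"}
cluster_state = {"deployment/myapp": "myapp:v1", "configmap/old": "x"}
plan = plan_sync(git_state, cluster_state)
```

Running the same comparison on a schedule is also what corrects manual changes made directly in the cluster, since any drift shows up as an update action.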

ArgoCD

ArgoCD is a GitOps tool hosted by the CNCF; it provides both a web UI and a CLI.

# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/myapp.git
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Flux

Flux is another popular GitOps tool.

# flux-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 5m
  path: ./k8s
  prune: true
  sourceRef:
    kind: GitRepository
    name: myapp

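GitOps best practices

Single source of truth: the Git repository is the only source of configuration
Declarative configuration: describe the desired state declaratively instead of running imperative commands
Automated sync: detect and sync configuration changes automatically
Environment isolation: use separate Git branches or directories for different environments
Audit trail: every change goes through a Git commit, which makes auditing straightforward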

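Case studies

Case 1: Building a monitoring system for a microservice architecture

An e-commerce platform runs a microservice architecture with more than 20 services, including user, order and payment services, and needs a unified monitoring system.

Challenges

Many services and a complex set of metrics
Long call chains between services, which makes failures hard to locate
Business metrics need to be monitored in real time

Solution

Prometheus + Grafana for infrastructure monitoring
- Every service exposes Prometheus metrics
- ServiceMonitor resources provide automatic discovery
- A unified set of Grafana dashboards
Jaeger for distributed tracing
- Integrate the OpenTelemetry SDK
- Trace all calls between services
- Visualize call chains
ELK Stack for log management
- Filebeat collects logs from all services
- Logstash parses and enriches the logs
- Elasticsearch stores and indexes them
- Kibana provides visualization and analysis
Alerting rules
- Alert thresholds derived from SLOs
- Alertmanager for grouping and inhibition
- Integration with Slack and PagerDuty

Results

Time to locate a failure dropped from 30 minutes to 5 minutes
Alert accuracy rose to 95%
Business metrics are visualized, which supports decision making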

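Case 2: Optimizing elastic scaling for a Kubernetes cluster

A SaaS platform running on Kubernetes faces large traffic swings and high costs.

Challenges

Traffic fluctuates widely; peaks are 10 times the average
Fixed capacity wastes money
The platform must react quickly to traffic changes

Solution

HPA for horizontal scaling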

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

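VPA for vertical scaling
- Analyze historical resource usage
- Adjust resource requests and limits automatically
- Reduce wasted resources
Cluster Autoscaler
- Add and remove nodes automatically
- Use Spot Instances to lower cost
- Deploy across multiple availability zones for higher availability
Predictive scaling
- Forecast traffic from historical data
- Scale out ahead of time so that traffic spikes do not hit an undersized cluster

Results

Cost reduced by 40%
Automatic scale-out at peak with no service degradation
Resource utilization raised to 75%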

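Case 3: Infrastructure as code across multiple environments

A financial company needs to run development, test and production environments on AWS, with consistent environments and fast deployment.

Challenges

Configuration differs considerably between the three environments
Manual deployment is error-prone
Consistency between environments is hard to guarantee

Solution

Modular Terraform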

terraform/
├── modules/
│   ├── vpc/
│   ├── ec2/
│   ├── rds/
│   └── eks/
├── environments/
│   ├── dev/
│   ├── test/
│   └── prod/
└── main.tf

Separate configuration per environment

# environments/dev/main.tf
module "infrastructure" {
  source = "../../modules"
  
  environment = "dev"
  instance_type = "t3.small"
  min_size = 1
  max_size = 3
}

# environments/prod/main.tf
module "infrastructure" {
  source = "../../modules"
  
  environment = "prod"
  instance_type = "t3.large"
  min_size = 3
  max_size = 10
}

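CI/CD integration
- A Git commit triggers terraform plan
- terraform apply runs after code review
- State files are stored in S3
- DynamoDB is used for state locking
Compliance checks
- Checkov scans the configuration
- Security policy checks are integrated
- Compliance reports are generated automatically

Results

Environment provisioning time dropped from 2 days to 2 hours
Environments are 100% consistent
Configuration changes are traceable and can be rolled back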

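❓ Q&A: Common questions about cloud operations and DevOps

Q1: How do I choose the right monitoring tool?

A: Consider the following factors:

Requirements: be clear about which kinds of metrics you need to monitor (infrastructure, application, business)
Technology stack: check how well the tool integrates with your existing stack
Cost: open-source tools are free but you operate them yourself; commercial tools are feature-rich but expensive
Community support: pick an active open-source project or a mature commercial product
Scalability: plan for future business growth

Recommended combination: Prometheus + Grafana (open-source monitoring) + ELK Stack (log management) + Jaeger (distributed tracing)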

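Q2: How do I set sensible alert thresholds?

A: Base alert thresholds on:

SLO targets: compute the error budget from the SLO and derive thresholds from it
Historical data: analyze past monitoring data to learn the normal range of fluctuation
Business impact: consider the real impact on the business and assign severity levels accordingly
Graduated alerts: use several thresholds (warning, critical, emergency) to avoid alert fatigue
Continuous tuning: keep adjusting thresholds based on what you observe

For example, with an SLO of 99.9% (an allowed error rate of 0.1%) you could set:

Warning: error rate > 0.05% (50% of the allowed error rate)
Critical: error rate > 0.08% (80% of the allowed error rate)
Emergency: error rate > 0.1% (SLO violated)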
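
The three levels in this example are simply fractions of the error rate that the SLO allows. A small helper that derives them for any SLO is sketched below; the function name is made up, and the default fractions just mirror the example.

```python
from typing import Dict, Optional


def alert_thresholds(slo: float,
                     fractions: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return error-rate thresholds (in percent) as fractions of the allowed error rate."""
    if fractions is None:
        fractions = {"warning": 0.5, "critical": 0.8, "emergency": 1.0}
    allowed_percent = (1.0 - slo) * 100
    # Round to hide floating-point noise such as 0.05000000000000004.
    return {level: round(allowed_percent * f, 6) for level, f in fractions.items()}


thresholds = alert_thresholds(0.999)
```

For a 99.9% SLO this yields 0.05, 0.08 and 0.1 percent, matching the values listed above.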

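Q3: How do I achieve zero-downtime deployments on Kubernetes?

A: The main techniques are:

Rolling update: the default Kubernetes update strategy, which replaces Pods gradually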

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # never drop below the desired number of available Pods

Readiness probe: make sure a new Pod only receives traffic once it is ready

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

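Blue-green deployment: switch between two identical environments
Canary release: shift traffic to the new version gradually
Service mesh: use a tool such as Istio for finer-grained traffic control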
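
The effect of maxSurge and maxUnavailable in the rolling update strategy is easy to check numerically: during a rollout the total number of Pods may rise to replicas + maxSurge, and the number of available Pods may not drop below replicas - maxUnavailable. The sketch below accepts both absolute and percentage values, rounding percentages up for maxSurge and down for maxUnavailable as Kubernetes does; the function names are made up for this example.

```python
import math
from typing import Tuple, Union


def _resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * float(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge: Union[int, str] = "25%",
                   max_unavailable: Union[int, str] = "25%") -> Tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


zero_downtime = rollout_bounds(4, max_surge=1, max_unavailable=0)  # (4, 5)
defaults = rollout_bounds(10)                                       # (8, 13)
```

With maxUnavailable set to 0 the minimum equals the desired replica count, so capacity never dips during the rollout, at the price of needing room for one extra Pod.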

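Q4: How do I optimize cloud costs?

A: Cost optimization strategies:

Resource tags: use tags to track cost and ownership
Regular reviews: review resource usage regularly and clean up idle resources
Reserved Instances: use them for steady load; savings are typically in the 30-70% range
Spot Instances: use them for interruptible workloads; savings are typically in the 50-90% range
Autoscaling: adjust capacity automatically to follow load
Right-sizing: choose instance types that match actual usage
Storage optimization: pick the appropriate storage class and clean up old data regularly
Network optimization: optimize data transfer and reduce cross-region traffic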

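Q5: How do I set up an effective incident response process?

A: An incident response process should include:

Severity levels: define incident levels (P0/P1/P2/P3) by scope and duration of impact
Response team: set up an on-call rotation with clear ownership
Communication channels: create an incident channel and keep everyone informed
Handling process: confirm the failure → gather information → locate the root cause → plan → recover → verify
Escalation: define escalation rules so that serious incidents are reported upward quickly
Post-incident review: review every incident and turn findings into improvements
Tooling: use tools such as PagerDuty or Opsgenie to manage on-call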

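Q6: What advantages does GitOps have over traditional deployment?

A: The advantages of GitOps:

Version control: every configuration change is in the Git history, so it can be traced and rolled back
Consistency: Git is the single source of truth, which keeps environments consistent
Automation: configuration changes are detected and synced automatically, with less manual work
Security: Git permissions control access to configuration, and every change is audited
Collaboration: configuration is reviewed through pull requests, which raises quality
Scalability: the approach extends easily to multiple environments and clusters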

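Q7: How do I choose the right CI/CD tool?

A: Factors to consider:

Integration: how well it works with your code repository, image registry and deployment platform
Features: whether you need multi-language, multi-platform or parallel builds
Ease of use: how simple the configuration is and how gentle the learning curve
Extensibility: whether it supports plugins and custom extensions
Cost: free open source versus commercial licensing
Community support: documentation, community activity and how quickly problems get resolved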

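Q8: How do I implement disaster recovery for infrastructure?

A: Disaster recovery strategy:

Backup strategy:
- Databases: regular full backups plus continuous incremental backups
- File storage: regular snapshots plus cross-region replication
- Configuration: Git version control plus configuration backups
Multi-region deployment: deploy across multiple availability zones or regions for higher availability
Automated recovery: write recovery scripts to automate the recovery process
Regular drills: rehearse disaster recovery regularly to validate the process
RTO/RPO targets: define a recovery time objective (RTO) and a recovery point objective (RPO)
Monitoring and alerting: monitor system state in real time to detect failures early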
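
RTO and RPO targets can be sanity-checked against the backup schedule and a restore time measured in a drill. The sketch below is deliberately simple, with hypothetical names and numbers: it treats one full backup interval as the worst-case data loss and the measured restore duration as the recovery time.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_minutes: float   # time between consecutive backups
    restore_duration_minutes: float  # measured in the last recovery drill


def meets_targets(plan: RecoveryPlan, rpo_minutes: float, rto_minutes: float) -> bool:
    """Check a plan against RPO (data loss) and RTO (recovery time) targets."""
    worst_case_data_loss = plan.backup_interval_minutes
    return (worst_case_data_loss <= rpo_minutes
            and plan.restore_duration_minutes <= rto_minutes)


nightly = RecoveryPlan(backup_interval_minutes=24 * 60, restore_duration_minutes=90)
ok_for_relaxed = meets_targets(nightly, rpo_minutes=24 * 60, rto_minutes=120)
ok_for_strict = meets_targets(nightly, rpo_minutes=15, rto_minutes=60)
```

A nightly backup cannot satisfy a 15-minute RPO no matter how fast the restore is, which is why tighter targets call for continuous incremental backups or replication.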

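Q9: How do I define an effective SLO?

A: Steps for defining an SLO:

Analyze user needs: understand what users expect from the service
Choose SLIs: pick indicators that reflect user experience
Set targets: base them on historical data and user needs
Compute the error budget: error budget = 1 - SLO
Monitor and alert: track the SLO in real time and set alerts
Keep improving: adjust the SLO as circumstances change

Examples:

Availability SLO: 99.9% (at most about 43 minutes of downtime per 30 days)
Latency SLO: P95 latency < 200 ms
Error rate SLO: error rate < 0.1%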

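Q10: How does operations in a cloud-native architecture differ from traditional operations?

A: The main differences:

Infrastructure:
- Traditional: physical servers, virtual machines
- Cloud native: containers, Kubernetes, Serverless
Deployment:
- Traditional: manual or script-based deployment
- Cloud native: containerized deployment, automated CI/CD
Monitoring:
- Traditional: server monitoring and application monitoring kept separate
- Cloud native: unified monitoring and observability (metrics, logs, traces)
Scaling:
- Traditional: manual scale-up, vertical scaling
- Cloud native: autoscaling, horizontal scaling
Failure handling:
- Traditional: manual investigation and recovery
- Cloud native: automated recovery, self-healing
Tooling:
- Traditional: shell scripts, configuration management tools
- Cloud native: IaC, GitOps, service mesh
Team collaboration:
- Traditional: development and operations separated
- Cloud native: DevOps and SRE culture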

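Summary

Cloud operations and DevOps are practices that keep evolving. From infrastructure as code to monitoring and alerting, from automated deployment to cost optimization, each area takes real understanding and hands-on practice. The key is to build processes and a toolchain that fit your own team, and to keep refining them.

As cloud-native technology spreads, operations work is shifting from firefighting to prevention, from manual to automated, and from reactive to proactive. Mastering these practices improves operational efficiency and, more importantly, safeguards the stability and reliability of the business.

This is the seventh article in the cloud computing series; later articles will cover cloud security, cloud architecture design and related topics.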