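PagerDuty Complete Guide: Features and Architecture of the Incident Management Platform

Chapter 1: PagerDuty Overview

1.1 What Is PagerDuty?

PagerDuty is an enterprise incident management platform that covers the full lifecycle from alert to response to resolution. It helps DevOps, SRE, and IT operations teams detect system problems quickly, notify the right responders, and automate and streamline the process through to resolution.

Key features:

Real-time alert management
Automated incident response
On-call management
Escalation policies
Incident analytics and reporting
Integration with multiple monitoring tools

1.2 Position in the Industry

PagerDuty is a widely used incident management solution in modern cloud-native environments. It integrates with monitoring tools such as Datadog, New Relic, Prometheus, and CloudWatch, and centralizes the alerts those tools fire.

Problems it addresses:

Unified management of multiple alert streams
Preventing important alerts from being missed
Reducing dependence on specific individuals in incident response
Accountable response based on SLOs
Visibility into incident history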

Chapter 2: Core Architecture

2.1 Overall System Architecture

PagerDuty's architecture can be described in terms of the following main components:

┌─────────────────────────────────────────────────────────┐
│           Monitoring and log collection tools           │
│  (Datadog, New Relic, Prometheus, CloudWatch, etc.)     │
└──────────────────┬──────────────────────────────────────┘
                   │ Events API / Webhooks
                   ▼
┌─────────────────────────────────────────────────────────┐
│    Event ingestion and aggregation (Ingest Platform)    │
│  - Alert normalization                                  │
│  - Event correlation                                    │
│  - Deduplication                                        │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────────────┐
│                Incident management layer                │
│  - Incident creation and updates                        │
│  - Status management                                    │
│  - Assignment and escalation                            │
└──────────────────┬──────────────────────────────────────┘
                   │
         ┌─────────┴──────────┐
         ▼                    ▼
┌──────────────────┐  ┌──────────────────┐
│ Notifications    │  │ On-call layer    │
│ (Notification)   │  │ (Schedule Mgmt)  │
│ SMS/Phone/Email  │  │ Rotations/Rules  │
│ Slack/PagerDuty  │  │ Escalation       │
└──────────────────┘  └──────────────────┘
         │                    │
         └─────────┬──────────┘
                   ▼
┌─────────────────────────────────────────────────────────┐
│                  User interface layer                   │
│  - Web console                                          │
│  - Mobile app                                           │
│  - REST API                                             │
└─────────────────────────────────────────────────────────┘

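2.2 Main Components

2.2.1 Events API Platform

Core of alert collection
Accepts alerts in multiple formats
Event normalization and correlation
Deduplication engine

2.2.2 Incident Management Engine

Generates incidents from alerts
Manages status transitions
SLA tracking
Preserves incident history

2.2.3 Escalation & On-Call Engine

Schedule management
Escalation rule evaluation
Notification prioritization
Assignment management

2.2.4 Notification Delivery

Multi-channel notifications (SMS, phone, email, Slack, etc.)
Acknowledgment tracking
Retry logic
Delivery status monitoring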

2.3 Data Flow

1. Event Ingest
   Monitoring tool → Events API → Normalization

2. Deduplication & Correlation
   Multiple events → Aggregation Engine → Single Incident

3. Incident Creation
   Normalized Event → Incident Object → Storage

4. Routing
   Incident → Service Configuration → Escalation Policy

5. Escalation
   Escalation Policy → On-Call Schedule → Target User

6. Notification
   Target User → Notification Channel → Delivery Confirmation

7. Resolution
   Incident Update → Status Change → Webhook → External System
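
The first three steps of the flow can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the Incident and IngestPipeline classes and their fields are hypothetical names invented for this sketch, and they do not describe PagerDuty's internal implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical in-memory stand-in for an incident record."""
    dedup_key: str
    status: str = "triggered"
    alerts: list = field(default_factory=list)


class IngestPipeline:
    """Toy model of steps 1-3: normalize, deduplicate, create or update."""

    def __init__(self):
        self.open_incidents = {}  # dedup_key -> Incident

    def normalize(self, raw: dict) -> dict:
        # Reduce a tool-specific event to one common shape.
        return {
            "dedup_key": raw["dedup_key"],
            "summary": raw.get("summary", "").strip(),
            "severity": raw.get("severity", "info").lower(),
        }

    def ingest(self, raw: dict) -> Incident:
        event = self.normalize(raw)
        incident = self.open_incidents.get(event["dedup_key"])
        if incident is None:
            # No open incident for this key: create one.
            incident = Incident(dedup_key=event["dedup_key"])
            self.open_incidents[event["dedup_key"]] = incident
        # Either way, the event is recorded on the incident.
        incident.alerts.append(event)
        return incident


pipeline = IngestPipeline()
first = pipeline.ingest({"dedup_key": "payment-api-latency",
                         "summary": " High latency ", "severity": "CRITICAL"})
second = pipeline.ingest({"dedup_key": "payment-api-latency",
                          "summary": "High latency"})
print(first is second, len(first.alerts))
```

Two events that share a dedup key end up on the same incident object, which is the behavior step 2 describes.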

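Chapter 3: Key Features in Detail

3.1 Services (Service Management)

A Service is the basic unit of management in PagerDuty. Each service represents a monitored application, infrastructure component, or business process.

Main elements of a service configuration: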

Service Configuration:
├── Basic Info
│   ├── Name: "Payment API Service"
│   ├── Description: "Core payment processing API"
│   ├── Status: Active
│   └── Type: Standard / Business Service
│
├── Alert Creation
│   ├── Alert Creation Threshold: 0 (alert immediately)
│   ├── Alert Grouping: None / By severity / By alert title / Time-based
│   └── Auto-Resolve: After 30 minutes of no alerts
│
├── Escalation Policy
│   └── Payment Team Escalation (see 3.2 for details)
│
├── Integration
│   ├── Email Integration
│   ├── Webhook Integration
│   └── Third-party Tools
│
└── SLA Configuration
    ├── Response Time SLA: 15 minutes
    ├── Resolution Time SLA: 1 hour
    └── SLA Support Hours: 24x7

Implementation example: Payment API service

{
  "service": {
    "id": "P1Z3Q4R",
    "type": "service",
    "summary": "Payment API Service",
    "self": "https://api.pagerduty.com/services/P1Z3Q4R",
    "html_url": "https://subdomain.pagerduty.com/services/P1Z3Q4R",
    "name": "Payment API Service",
    "description": "Core payment processing service",
    "created_at": "2024-01-15T10:00:00Z",
    "status": "active",
    "teams": [
      {
        "id": "PT123ABC",
        "type": "team_reference",
        "summary": "Payment Platform Team"
      }
    ],
    "escalation_policy": {
      "id": "PE12345",
      "type": "escalation_policy_reference",
      "summary": "Payment Team Escalation"
    },
    "incident_urgency_rule": {
      "type": "constant",
      "urgency": "high"
    },
    "alert_creation": "create_alerts_and_incidents",
    "alert_grouping_parameters": {
      "type": "time",
      "config": {
        "timeout": 5
      }
    },
    "auto_resolve_timeout": 1800,
    "acknowledgement_timeout": 600
  }
}

3.2 Escalation Policies

An escalation policy defines who is notified, and in what order, when an incident occurs. It consists of multiple levels, each with its own rules.

Escalation structure:

Escalation Policy: "Payment Team Escalation"
│
├─ Level 1: First Responder (notified immediately)
│  ├── Schedule: "Payment Primary On-Call"
│  ├── Notification Timeout: 5 minutes
│  └── Action if not acknowledged: Escalate to Level 2
│
├─ Level 2: Team Lead (after 5 minutes)
│  ├── Schedule: "Payment Team Lead"
│  ├── Notification Timeout: 10 minutes
│  └── Action if not acknowledged: Escalate to Level 3
│
└─ Level 3: Manager (after 15 minutes)
   ├── Schedule: "Payment Manager On-Call"
   ├── Notification Timeout: 15 minutes
   └── Action if not acknowledged: Loop back to Level 1 or Stop

Implementation example:

{
  "escalation_policy": {
    "id": "PE12345",
    "type": "escalation_policy",
    "summary": "Payment Team Escalation",
    "escalation_rules": [
      {
        "id": "PER1",
        "escalation_delay_in_minutes": 5,
        "targets": [
          {
            "id": "PS1",
            "type": "schedule_reference",
            "summary": "Payment Primary On-Call"
          }
        ]
      },
      {
        "id": "PER2",
        "escalation_delay_in_minutes": 10,
        "targets": [
          {
            "id": "PSER2",
            "type": "schedule_reference",
            "summary": "Payment Secondary On-Call"
          }
        ]
      },
      {
        "id": "PER3",
        "escalation_delay_in_minutes": 15,
        "targets": [
          {
            "id": "SZER1",
            "type": "user_reference",
            "summary": "payments-escalation@company.com"
          }
        ]
      }
    ],
    "repeat_enabled": true,
    "num_loops": 2
  }
}
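
To make the timing concrete, the sketch below computes which level is being notified a given number of minutes after an incident triggers, assuming per-level notification timeouts of 5, 10, and 15 minutes as in the escalation structure above. The function name active_level and the repeats parameter (extra passes through the policy after the first) are assumptions made for this illustration, not part of the PagerDuty API.

```python
from typing import Optional, Sequence


def active_level(elapsed_min: float,
                 timeouts: Sequence[float],
                 repeats: int = 0) -> Optional[int]:
    """Return the 1-based level being notified, or None once the policy is exhausted.

    timeouts[i] is how long level i+1 waits for an acknowledgment.
    repeats is the number of extra passes after the first one (an assumption).
    """
    cycle = sum(timeouts)
    if cycle <= 0 or elapsed_min < 0:
        return None
    if elapsed_min >= cycle * (repeats + 1):
        return None  # every pass has been used up
    offset = elapsed_min % cycle
    for level, timeout in enumerate(timeouts, start=1):
        if offset < timeout:
            return level
        offset -= timeout
    return None


timeouts = [5, 10, 15]
for minute in (0, 5, 15, 30):
    print(minute, active_level(minute, timeouts, repeats=2))
```

With these timeouts one pass takes 30 minutes, so minute 30 lands back on Level 1 when repeats is greater than zero.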

3.3 On-Call Schedules

A schedule defines when each user is on call. Schedules support multiple layers, fixed rotations, and more advanced assignment rules.

Schedule layer types:

Schedule: "Payment Primary On-Call"
│
├─ Final Schedule (the resulting on-call state)
│  └── Layer-1: Primary Rotation
│     ├── Users: [Alice, Bob, Charlie, Diana]
│     ├── Rotation: Weekly (Monday 9AM UTC)
│     └── Restrictions: 
│         - Business hours only (9AM-6PM UTC)
│         - Max 1 incident per shift
│
└─ Override Rules
   ├── Exception 1: Alice out (Jan 20-27)
   │   └── Coverage: Bob covers
   ├── Exception 2: Team all-hands (Feb 1)
   │   └── Escalate to Team Lead schedule
   └── Exception 3: Holiday (Dec 25)
       └── Escalate to Manager

Implementation example:

{
  "schedule": {
    "id": "PS1",
    "type": "schedule",
    "summary": "Payment Primary On-Call",
    "name": "Payment Primary On-Call",
    "time_zone": "UTC",
    "schedule_layers": [
      {
        "id": "PSL1",
        "name": "Primary Rotation",
        "rendered_coverage_percentage": "100.0",
        "start": "2024-01-01T00:00:00Z",
        "users": [
          {
            "id": "PU1",
            "type": "user_reference",
            "summary": "alice@company.com"
          },
          {
            "id": "PU2",
            "type": "user_reference",
            "summary": "bob@company.com"
          },
          {
            "id": "PU3",
            "type": "user_reference",
            "summary": "charlie@company.com"
          }
        ],
        "rotation_virtual_start": "2024-01-01T09:00:00Z",
        "rotation_turn_length_seconds": 604800,
        "restrictions": []
      },
      {
        "id": "PSL2",
        "name": "Business Hours Coverage",
        "rendered_coverage_percentage": "37.5",
        "start": "2024-01-01T00:00:00Z",
        "users": [
          {
            "id": "PU4",
            "type": "user_reference",
            "summary": "diana@company.com"
          }
        ],
        "rotation_virtual_start": "2024-01-01T09:00:00Z",
        "rotation_turn_length_seconds": 28800,
        "restrictions": [
          {
            "type": "daily_restriction",
            "duration_seconds": 32400,
            "start_time_of_day": "09:00:00"
          }
        ]
      }
    ]
  }
}
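
The rotation fields in the first layer are enough to work out whose turn it is at a given moment. The sketch below shows that arithmetic under simplifying assumptions: it ignores restrictions, overrides, and layer precedence, and the function name on_call_user is hypothetical.

```python
from datetime import datetime, timezone


def on_call_user(users, rotation_virtual_start, turn_length_seconds, at):
    """Return the user whose turn covers `at`, or None before the rotation starts."""
    elapsed = (at - rotation_virtual_start).total_seconds()
    if elapsed < 0:
        return None
    # Count completed turns, then wrap around the user list.
    turn = int(elapsed // turn_length_seconds)
    return users[turn % len(users)]


users = ["alice", "bob", "charlie"]
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # rotation_virtual_start
week = 604800                                            # rotation_turn_length_seconds

when = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(on_call_user(users, start, week, when))  # second week of the rotation -> bob
```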

3.4 Incident Management Lifecycle

A PagerDuty incident moves through several statuses:

Status transition diagram:

          Event Received
               │
               ▼
        ┌─────────────────────┐
        │      TRIGGERED      │◄──────────────────┐
        └──────────┬──────────┘                   │
                   │                              │
         (Acknowledged by responder)              │
                   ▼                              │
        ┌─────────────────────┐                   │
        │    ACKNOWLEDGED     │                   │
        └──────────┬──────────┘                   │
                   │                              │
       (Incident Resolved / Auto-resolve timeout) │
                   ▼                              │
        ┌─────────────────────┐                   │
        │      RESOLVED       │                   │
        └─────────────────────┘                   │
                                                  │
           (New Alert on resolved incident)───────┘
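
The diagram can be read as a small state machine. The following sketch encodes its three arrows as a lookup table; the event names (acknowledge, resolve, new_alert) and the extra triggered-to-resolved shortcut are assumptions added for illustration and are not taken from the diagram or from PagerDuty's API.

```python
# (current status, event) -> next status
TRANSITIONS = {
    ("triggered", "acknowledge"): "acknowledged",
    ("acknowledged", "resolve"): "resolved",
    ("resolved", "new_alert"): "triggered",
    # Assumption: an incident may also be resolved without being acknowledged.
    ("triggered", "resolve"): "resolved",
}


def next_status(status: str, event: str) -> str:
    """Apply one event to a status, rejecting moves the table does not allow."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {event!r} while {status!r}") from None


status = "triggered"
for event in ("acknowledge", "resolve", "new_alert"):
    status = next_status(status, event)
    print(event, "->", status)
```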

Incident details:

{
  "incident": {
    "id": "INC123456",
    "type": "incident",
    "summary": "[HIGH] Payment API Response Time > 500ms",
    "self": "https://api.pagerduty.com/incidents/INC123456",
    "html_url": "https://subdomain.pagerduty.com/incidents/INC123456",
    "incident_number": 42,
    "title": "[HIGH] Payment API Response Time > 500ms",
    "description": "Alert from Datadog: Payment API response time exceeded 500ms threshold",
    "created_at": "2024-04-07T14:30:00Z",
    "status": "acknowledged",
    "urgency": "high",
    "incident_key": "datadog:payment-api:latency-high",
    "service": {
      "id": "P1Z3Q4R",
      "type": "service_reference",
      "summary": "Payment API Service"
    },
    "assignments": [
      {
        "at": "2024-04-07T14:32:15Z",
        "assignee": {
          "id": "PU1",
          "type": "user_reference",
          "summary": "alice@company.com"
        }
      }
    ],
    "assigned_via": "escalation_policy",
    "first_trigger_log_entry": {
      "id": "ILE1",
      "type": "incident_log_entry_reference",
      "summary": "Triggered through the website"
    },
    "last_status_change_at": "2024-04-07T14:32:15Z",
    "last_status_change_by": {
      "id": "PU1",
      "type": "user_reference",
      "summary": "alice@company.com"
    },
    "escalation_policy": {
      "id": "PE12345",
      "type": "escalation_policy_reference",
      "summary": "Payment Team Escalation"
    },
    "teams": [],
    "urgency_escalation_at": null,
    "alert_counts": {
      "all_triggered": 3,
      "all_resolved": 0
    },
    "last_assigned_at": "2024-04-07T14:32:15Z"
  }
}

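3.5 Alert Grouping

This feature groups multiple alerts into a single incident, which reduces noise and makes response more efficient.

Grouping strategies:

Alert Grouping Strategies:

1. Time-based Grouping
   - Alerts for the same service within 5 minutes → one incident
   - Example: multiple alerts for the same metric

2. Severity-based Grouping
   - CRITICAL alerts → become incidents immediately
   - WARNING alerts → grouped with other WARNING alerts

3. Event-based Grouping (Alert Title/Body)
   - Alerts with the same title → one incident
   - Example: "Database connection timeout" occurring multiple times

4. Custom Field-based Grouping
   - Group by custom metadata
   - Example: classify by Environment, Component, Region

5. Intelligent Grouping (AI-based)
   - Groups automatically based on past patterns
   - Detects correlations

Implementation example: Time-based Grouping configuration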

{
  "service": {
    "id": "P1Z3Q4R",
    "alert_grouping": "time",
    "alert_grouping_parameters": {
      "type": "time",
      "config": {
        "timeout": 5
      }
    }
  }
}
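
As a rough illustration of the time-based strategy, the sketch below groups alert timestamps so that an alert joins the current group when it arrives within the window of the previous alert. It is a simplified model with a hypothetical function name, not PagerDuty's grouping algorithm; timestamps are plain epoch seconds.

```python
def group_by_time(timestamps, window_seconds=300):
    """Group alert timestamps; a gap longer than the window starts a new group."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= window_seconds:
            # Within the window of the previous alert: same incident.
            groups[-1].append(ts)
        else:
            # Gap too large (or first alert): start a new incident.
            groups.append([ts])
    return groups


alerts = [0, 100, 350, 1000, 1200]
print(group_by_time(alerts))  # two groups: [0, 100, 350] and [1000, 1200]
```

Five alerts collapse into two incidents here because only the gap between 350 and 1000 exceeds the 300-second window.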

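3.6 Incident Analytics and Reporting

PagerDuty provides detailed incident analytics:

Metrics analyzed:

MTTD (Mean Time To Detect)
├── Definition: average time from the onset of a problem to its detection
├── Target: < 1 minute
└── How to improve: tune monitoring rules, optimize alert sensitivity

MTTR (Mean Time To Resolve)
├── Definition: average time from incident trigger to resolution
├── Target: < 15 minutes (Critical), < 1 hour (Normal)
└── How to improve: runbook automation, escalation optimization

MTTA (Mean Time To Acknowledge)
├── Definition: average time from incident trigger to acknowledgment
├── Target: < 5 minutes
└── How to improve: better notification channels, optimized on-call staffing

SLA Compliance
├── Response SLA: was the response started within 15 minutes?
├── Resolution SLA: was the incident resolved within 1 hour?
└── Target: > 95% compliance

Incident Volume
├── Trend in the number of incidents
├── Analysis of peak hours
└── Reduction target: 30% YoY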
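
The MTTA and MTTR definitions above reduce to averaging time differences. The sketch below computes both from a list of incident records; the field names triggered_at, acknowledged_at, and resolved_at are assumptions chosen for this example and are not the field names of any PagerDuty report.

```python
from datetime import datetime
from statistics import mean


def seconds_between(start_iso: str, end_iso: str) -> float:
    """Elapsed seconds between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds()


def mtta_mttr(incidents):
    """Return (MTTA, MTTR) in seconds; None where no incident qualifies."""
    ack = [seconds_between(i["triggered_at"], i["acknowledged_at"])
           for i in incidents if i.get("acknowledged_at")]
    res = [seconds_between(i["triggered_at"], i["resolved_at"])
           for i in incidents if i.get("resolved_at")]
    return (mean(ack) if ack else None, mean(res) if res else None)


sample = [
    {"triggered_at": "2024-04-07T14:30:00+00:00",
     "acknowledged_at": "2024-04-07T14:32:15+00:00",
     "resolved_at": "2024-04-07T14:35:00+00:00"},
    {"triggered_at": "2024-04-07T14:30:00+00:00",
     "acknowledged_at": "2024-04-07T14:40:00+00:00",
     "resolved_at": "2024-04-07T14:50:00+00:00"},
]
mtta, mttr = mtta_mttr(sample)
print(mtta, mttr)  # 367.5 750.0
```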

Report example:

{
  "incident_stats": {
    "period": "2024-03-01 to 2024-04-01",
    "service": "Payment API Service",
    "total_incidents": 24,
    "acknowledged_incidents": 22,
    "unacknowledged_incidents": 2,
    "resolved_incidents": 24,
    "averages": {
      "mtta": "PT3M24S",
      "mttr": "PT42M15S",
      "mttd": "PT1M30S"
    },
    "sla_compliance": {
      "response_sla_percent": 95.8,
      "resolution_sla_percent": 87.5
    },
    "urgency_breakdown": {
      "high": {
        "count": 8,
        "average_mtta": "PT1M45S",
        "average_mttr": "PT15M30S"
      },
      "low": {
        "count": 16,
        "average_mtta": "PT4M20S",
        "average_mttr": "PT58M45S"
      }
    },
    "top_incidents_by_volume": [
      {
        "summary": "Payment API latency high",
        "count": 8,
        "mtta": "PT2M00S",
        "mttr": "PT20M00S"
      },
      {
        "summary": "Database connection timeout",
        "count": 6,
        "mtta": "PT4M30S",
        "mttr": "PT35M00S"
      },
      {
        "summary": "Memory usage above threshold",
        "count": 5,
        "mtta": "PT5M15S",
        "mttr": "PT65M00S"
      }
    ]
  }
}

Chapter 4: Incident Response Flow

4.1 Detailed Flow from Alert to Resolution

1. Alert Generation (monitoring tool side)
   ├─ Datadog detects: "Payment API Response Time > 500ms"
   └─ Anomaly detected at: 2024-04-07 14:30:00 UTC

2. Event Submission
   ├─ Datadog → PagerDuty Events API v2
   ├─ Incident Key: "datadog:payment-api:latency-high"
   └─ Payload:
      ├─ routing_key: "service-key"
      ├─ event_action: "trigger"
      ├─ dedup_key: "datadog:payment-api:latency-high"
      ├─ payload:
      │  ├─ summary: "[HIGH] Payment API Response Time > 500ms"
      │  ├─ severity: "critical"
      │  ├─ source: "payment-api-prod"
      │  ├─ custom_details:
      │  │  ├─ current_latency_ms: 650
      │  │  ├─ threshold_ms: 500
      │  │  ├─ affected_region: "us-east-1"
      │  │  └─ error_rate: "2.3%"
      │  └─ timestamp: "2024-04-07T14:30:00Z"
      └─ client: "Datadog"

3. PagerDuty Ingest
   ├─ Event Validation (schema validation)
   ├─ Deduplication Check
   │  ├─ Look up dedup key "datadog:payment-api:latency-high"
   │  ├─ Open incident exists → append the event to it
   │  └─ None found → New Incident
   ├─ Alert Grouping
   │  ├─ Time-based: within 5 minutes of the last alert
   │  └─ Grouped with the previous alert
   └─ Event Enrichment
      └─ Add metadata (service information, etc.)

4. Incident Creation / Update
   ├─ Create a new incident
   ├─ Status: TRIGGERED
   ├─ Incident ID: INC123456
   ├─ Alert Count: 1
   └─ Urgency: HIGH

5. Routing Decision
   ├─ Check the Service Configuration
   │  ├─ Service ID: P1Z3Q4R
   │  ├─ Escalation Policy: PE12345
   │  └─ Alert Creation: create_alerts_and_incidents
   ├─ Evaluate the Escalation Policy
   │  └─ Level 1: "Payment Primary On-Call" (immediately)
   └─ Check the Target Schedule

6. Schedule Evaluation
   ├─ On-Call Schedule: "Payment Primary On-Call"
   ├─ Current Time: 2024-04-07 14:30:00 UTC
   ├─ Active Layer Check:
   │  ├─ Primary Rotation Layer
   │  │  ├─ Alice (weekly shift, Mon 9AM to Mon 9AM UTC)
   │  │  └─ Now → Alice is on call
   │  └─ Business Hours Layer
   │      ├─ Diana (9AM-6PM UTC)
   │      └─ Now → Diana is also eligible
   └─ Final: Alice (Primary)

7. Notification Delivery
   ├─ Recipient: alice@company.com
   ├─ Preferred Notification Channels:
   │  1st: Slack (immediate)
   │  2nd: SMS (if not acknowledged in 30s)
   │  3rd: Phone (if not acknowledged in 2min)
   ├─ Notification Content:
   │  ├─ Title: "[HIGH] Payment API Response Time > 500ms"
   │  ├─ Urgency: High
   │  ├─ Time: 2024-04-07 14:30:00 UTC
   │  ├─ Service: Payment API Service
   │  └─ Details: Custom fields
   └─ Delivery Confirmation:
      └─ Slack notification sent: 14:30:05 UTC

8. User Acknowledgment
   ├─ Alice sees the alert in Slack → clicks the "Acknowledge" button
   ├─ Timestamp: 2024-04-07 14:32:15 UTC
   ├─ MTTA: 2 min 15 s
   ├─ Status transition: TRIGGERED → ACKNOWLEDGED
   └─ Escalation timer stopped

9. Investigation Phase
   ├─ Alice checks the state of the Payment API
   ├─ Checks performed:
   │  ├─ API metrics (Datadog)
   │  ├─ Database connection pool
   │  ├─ Network latency measurement
   │  └─ Log analysis (CloudWatch)
   ├─ Cause identified: slow database connections
   └─ Detail: a heavy query is holding locks

10. Remediation
    ├─ Alice decides on a fix
    ├─ Actions:
    │  ├─ Kill heavy query on DB
    │  └─ Trigger DB maintenance window
    ├─ Expected Recovery: 5 minutes
    └─ Timestamp: 2024-04-07 14:35:00 UTC

11. Resolution
    ├─ Alice selects "Resolve" in PagerDuty
    ├─ Status transition: ACKNOWLEDGED → RESOLVED
    ├─ Total Time: 5 min 00 s
    ├─ MTTR: 5 min 00 s
    └─ Resolved At: 2024-04-07 14:35:00 UTC

12. Post-Incident Review
    ├─ Auto-generated: Incident Summary
    ├─ Root Cause: Database query optimization needed
    ├─ Action Items:
    │  ├─ Create an index (Database team)
    │  └─ Strengthen query monitoring (Platform team)
    └─ Scheduled Review Meeting: 2024-04-09 10:00 UTC
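
Steps 2 and 11 both come down to sending an event with the same dedup key, first with a trigger action and later with a resolve action. The sketch below only builds the request bodies and sends nothing over the network; the helper name build_event is hypothetical, while the top-level fields (routing_key, event_action, dedup_key, payload) follow the Events API v2 payload shown in step 2.

```python
VALID_ACTIONS = {"trigger", "acknowledge", "resolve"}
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_event(routing_key, action, dedup_key,
                summary=None, source=None, severity=None):
    """Build an Events API v2 request body as a plain dict."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown event_action: {action!r}")
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events need the descriptive payload.
        if not (summary and source and severity):
            raise ValueError("trigger events need summary, source, and severity")
        if severity not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": severity,
        }
    return event


key = "datadog:payment-api:latency-high"
trigger = build_event("service-key", "trigger", key,
                      summary="[HIGH] Payment API Response Time > 500ms",
                      source="payment-api-prod", severity="critical")
resolve = build_event("service-key", "resolve", key)
print(trigger["event_action"], resolve["event_action"])
```

Because both bodies carry the same dedup_key, the resolve event refers to the alert that the trigger event opened.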

4.2 Escalation Scenario

Scenario: the on-call responder does not respond

Time    Event                               Status
────────────────────────────────────────────────────────
14:30   Incident triggered                  TRIGGERED
        Alice notified

14:35   Notification timeout (5 min)        TRIGGERED
        No response from Alice
        Escalation policy checked
        Escalated from Level 1 to Level 2

14:35   Bob notified                        TRIGGERED
        Escalation Policy Level 2
        (Payment Team Lead)

14:36   Notification seen in Slack          (Bob responding)
        Bob: "Looking at it now"

14:40   Bob acknowledges                    ACKNOWLEDGED
        MTTA: 10 min

14:45   Bob mitigates the issue             (in progress)
        Database restart performed

14:50   Bob resolves                        RESOLVED
        MTTR: 20 min
        Status → RESOLVED

Chapter 5: Integration and Customization

5.1 Major Monitoring Tool Integrations

5.1.1 Datadog Integration

{
  "datadog_integration": {
    "name": "Datadog Monitoring",
    "integration_type": "datadog",
    "vendor": "Datadog",
    "features": [
      "Alert routing",
      "Alert grouping",
      "Custom metrics"
    ],
    "setup": {
      "step_1": "Obtain the PagerDuty Integration Key",
      "step_2": "Select PagerDuty when creating the Datadog Monitor",
      "step_3": "Enter the Integration Key",
      "step_4": "Select PagerDuty as the Notification Channel",
      "step_5": "Send a test alert"
    },
    "configuration_example": {
      "datadog_monitor": {
        "name": "Payment API Response Time Alert",
        "type": "metric alert",
        "query": "avg:trace.payment_api.duration{*} > 500",
        "threshold": 500,
        "notification": {
          "primary": "{{#is_alert}}[ALERT]{{/is_alert}} @pagerduty-integration",
          "custom_fields": {
            "service": "Payment API",
            "environment": "production"
          }
        }
      }
    }
  }
}

5.1.2 Prometheus Integration

# Prometheus Alertmanager Configuration
global:
  resolve_timeout: 5m

route:
  receiver: 'pagerduty-default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true
  - match:
      severity: warning
    receiver: 'pagerduty-warning'

receivers:
- name: 'pagerduty-default'
  pagerduty_configs:
  - service_key: '11111111111111111111111111111111'
    description: '{{ .GroupLabels.alertname }}'
    details:
      firing: '{{ range .Alerts.Firing }}{{ .Labels.instance }} {{ end }}'

- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '22222222222222222222222222222222'
    description: '[CRITICAL] {{ .GroupLabels.alertname }}'
    severity: 'critical'

- name: 'pagerduty-warning'
  pagerduty_configs:
  - service_key: '33333333333333333333333333333333'
    description: '[WARNING] {{ .GroupLabels.alertname }}'
    severity: 'warning'

5.1.3 CloudWatch Integration

{
  "cloudwatch_integration": {
    "description": "AWS CloudWatch metric monitoring",
    "setup_steps": [
      {
        "step": 1,
        "action": "Create an SNS topic",
        "details": {
          "name": "pagerduty-alerts",
          "region": "us-east-1"
        }
      },
      {
        "step": 2,
        "action": "Look up the PagerDuty integration URL",
        "url_format": "https://events.pagerduty.com/integration/{INTEGRATION_KEY}/enqueue"
      },
      {
        "step": 3,
        "action": "Subscribe the PagerDuty URL to the SNS topic"
      },
      {
        "step": 4,
        "action": "Configure the CloudWatch alarm",
        "example": {
          "alarm_name": "Payment-API-HighLatency",
          "metric": "TargetResponseTime",
          "namespace": "AWS/ApplicationELB",
          "statistic": "Average",
          "period": 60,
          "threshold": 0.5,
          "comparison": "GreaterThanThreshold",
          "alarm_actions": ["arn:aws:sns:us-east-1:123456789012:pagerduty-alerts"]
        }
      }
    ]
  }
}

5.2 Slack Integration (Bidirectional)

PagerDuty integrates deeply with Slack, so PagerDuty actions can be performed from within Slack.

Slack Integration Features:

1. Incident Notification in Slack
   ├─ New incident → posted to a Slack channel
   ├─ Rich message display
   │  ├─ Incident ID
   │  ├─ Status
   │  ├─ Urgency
   │  └─ Assignee information
   └─ Action Buttons
      ├─ Acknowledge
      ├─ Resolve
      ├─ Escalate
      └─ Add Note

2. Bi-directional Sync
   ├─ Slack: Acknowledge → PagerDuty: status updated
   ├─ Slack: Resolve → PagerDuty: Resolved
   ├─ Slack: Escalate → PagerDuty: Next level assignment
   └─ Slack: Add Note → PagerDuty: Incident Timeline

3. On-Call Query
   └─ Slack Command: /pagerduty who
      └─ Shows the current on-call user

4. Incident Search
   └─ Slack Command: /pagerduty search
      └─ Searches past incidents

5. Notification Routing
   ├─ #payments-incidents → incidents for the Payment Service
   ├─ #infrastructure-alerts → infrastructure alerts
   └─ @user DM → alerts for an individual

Slack Workflow Example:

Payment API Alert Flow:
1. Datadog → PagerDuty (alert received)
2. PagerDuty → Slack #payments-incidents
   "🔴 HIGH | Payment API Response Time > 500ms
    Status: TRIGGERED
    Service: Payment API Service
    Incident: INC123456
    [Acknowledge] [Resolve] [Escalate]"

3. Slack user: clicks "Acknowledge"
4. Slack → PagerDuty API: Status update
5. PagerDuty: Status → ACKNOWLEDGED
6. Slack: message updated
   "🟡 HIGH | Payment API Response Time > 500ms
    Status: ACKNOWLEDGED by alice
    Acknowledged at: 2024-04-07 14:32:15 UTC"

7. Slack user: /pagerduty who
8. Bot reply: "Alice is on-call for Payment API (until Mon)"

9. Slack user: posts a comment
   "Identified issue: heavy DB query, killing now"
10. PagerDuty: note added (Incident Timeline)

11. After the issue is fixed, the user clicks "Resolve"
12. PagerDuty: Status → RESOLVED
13. Slack: message shown
    "✅ RESOLVED | Payment API Response Time > 500ms"

5.3 Custom Integration via the API

PagerDuty provides a REST API that makes custom integrations possible.

Sending a custom alert with Events API v2:

# Python example: Custom alert submission
from datetime import datetime, timezone

import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def send_pagerduty_alert(routing_key, summary, severity, component):
    """Send a custom alert through the PagerDuty Events API v2."""
    payload = {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Events that share a dedup_key are folded into the same open alert.
        "dedup_key": f"{component}-{severity}",
        "payload": {
            "summary": summary,
            "severity": severity,  # critical, error, warning, or info
            "source": "custom-monitoring-system",
            "custom_details": {
                "component": component,
                "service": "Payment API",
                "environment": "production",
                "metric_value": 650,
                "threshold": 500,
                "affected_region": "us-east-1"
            },
            # Use the current time rather than a hard-coded timestamp.
            "timestamp": datetime.now(timezone.utc).isoformat()
        }
    }

    # requests sets the JSON Content-Type header itself when json= is used.
    response = requests.post(EVENTS_API_URL, json=payload, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()

# Usage
result = send_pagerduty_alert(
    routing_key="your-routing-key-here",
    summary="[HIGH] Payment API Response Time > 500ms",
    severity="critical",
    component="payment-api"
)

print(f"Status: {result['status']}")
# The Events API v2 response carries a dedup_key, not an incident ID.
print(f"Dedup key: {result.get('dedup_key', 'N/A')}")

Querying incidents with the REST API:

# Incident Query API
import requests

# The REST API is served from a single host, not from the account subdomain.
API_BASE_URL = "https://api.pagerduty.com"


def get_incidents(api_token, service_ids=None, statuses=None):
    """Query incidents through the PagerDuty REST API."""
    params = {
        "limit": 100,
        # Array parameters use the name[] form.
        "include[]": ["services", "users"],
    }

    if service_ids:
        params["service_ids[]"] = service_ids
    if statuses:
        params["statuses[]"] = statuses

    headers = {
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Authorization": f"Token token={api_token}"
    }

    response = requests.get(
        f"{API_BASE_URL}/incidents", params=params, headers=headers, timeout=10
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()

# Usage
incidents = get_incidents(
    api_token="your-api-token",
    service_ids=["P1Z3Q4R"],
    statuses=["triggered", "acknowledged"]
)

for incident in incidents['incidents']:
    print(f"Incident {incident['incident_number']}: {incident['title']}")
    print(f"  Status: {incident['status']}")
    # An incident can have no assignments, so avoid indexing an empty list.
    assignees = [a['assignee']['summary'] for a in incident['assignments']]
    print(f"  Assigned to: {', '.join(assignees) or 'unassigned'}")
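
A single request returns at most one page of incidents. PagerDuty's list endpoints use limit and offset parameters and report a more flag when further pages exist, so a caller usually loops until more is false. The sketch below shows that loop with the HTTP call abstracted behind a fetch_page callable; fetch_page, iter_incidents, and the fake page data are hypothetical and exist only so the example runs without network access.

```python
def iter_incidents(fetch_page, limit=100):
    """Yield incidents across pages until the API reports no more results.

    fetch_page(offset=..., limit=...) must return a dict with an
    "incidents" list and a boolean "more" flag.
    """
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        yield from page["incidents"]
        if not page.get("more"):
            break
        offset += limit


# Fake two-page response standing in for real HTTP calls.
FAKE_PAGES = {
    0: {"incidents": ["INC1", "INC2"], "more": True},
    2: {"incidents": ["INC3"], "more": False},
}


def fake_fetch(offset, limit):
    return FAKE_PAGES[offset]


print(list(iter_incidents(fake_fetch, limit=2)))  # ['INC1', 'INC2', 'INC3']
```

In real use, fetch_page would wrap a requests.get call that passes offset and limit as query parameters.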

Chapter 6: Best Practices

6.1 Alert Design Best Practices

1. Alert Fatigue Countermeasures
   ├─ Rule: False Positive Rate < 5%
   ├─ Implementation:
   │  ├─ Set thresholds appropriately
   │  ├─ Combine static and dynamic thresholds
   │  ├─ Context-aware (business hours, normal vs. batch periods)
   │  └─ Review alert validity regularly
   └─ Measurement: check the false positive rate once a month

2. Alert Specificity
   ├─ Rule: every alert has a clear response procedure
   ├─ Implementation:
   │  ├─ Actionable alert (no guesswork needed)
   │  ├─ Include the information needed to identify the cause
   │  ├─ Put the runbook or procedure in the incident body
   │  └─ Use custom fields
   └─ Example:
      "Payment API Latency High (p99: 650ms > threshold: 500ms)
       Affected Region: us-east-1
       Error Rate: 2.3%
       Runbook: https://wiki/payment-api-latency"

3. Alert Granularity
   ├─ Rule: Service × Component × Severity matrix
   ├─ Implementation:
   │  ├─ Service: Payment API, User Service, Order Service
   │  ├─ Component: API, Database, Cache, Message Queue
   │  ├─ Severity: Critical, Major, Minor
   │  └─ Routing: Service × Severity → Escalation Policy
   └─ Result: Focused on-call staffing

4. Threshold Tuning
   ├─ Rule: SLO-based threshold setting
   ├─ Implementation:
   │  ├─ SLO: p99 latency < 200ms
   │  ├─ Alert threshold: SLO × 2.5 = 500ms
   │  ├─ Rationale: leaves enough time to respond
   │  └─ Review: monthly (check SLA compliance data)
   └─ Safety: Alert threshold > SLO × 2

6.2 On-Call Operations Best Practices

1. Schedule Design
   ├─ Primary + Secondary + Manager tier
   ├─ Shift Length:
   │  ├─ Primary: 1 week (Monday 9AM - Next Monday 9AM UTC)
   │  ├─ Secondary: 2 weeks (offset by 1 week)
   │  └─ Manager: on-demand (only on escalation)
   ├─ Timezone: UTC (coordinated)
   ├─ Blackout Rules:
   │  ├─ Register vacation periods in advance
   │  ├─ Company holidays
   │  └─ Automatically reassign to the next shift
   └─ Coverage: 100% target (gap period monitor)

2. Escalation Policy Best Practice
   ├─ Level-1: 5 min
   ├─ Level-2: 10 min
   ├─ Level-3: 15 min
   └─ Loop:
      ├─ Up to 2 loops (30 minutes per loop)
      ├─ After the loops: notify the whole team via Slack
      └─ Begin critical incident response

3. Incident Response SLO
   ├─ MTTA Target:
   │  ├─ High Urgency: < 5 minutes
   │  ├─ Medium Urgency: < 15 minutes
   │  └─ Low Urgency: < 30 minutes
   ├─ MTTR Target:
   │  ├─ High Urgency: < 15 minutes
   │  ├─ Medium Urgency: < 1 hour
   │  └─ Low Urgency: < 4 hours
   └─ Quarterly Review: SLA achievement vs. industry benchmark

4. On-Call Responder Support
   ├─ Shadowing period (new hires): 1 week × 2
   ├─ Training:
   │  ├─ Study the runbooks
   │  ├─ Test incident drills
   │  └─ Team-wide incident walkthrough
   ├─ Support:
   │  ├─ Slack #oncall-support (open for questions)
   │  ├─ Escalation → confirm with a senior responder
   │  └─ Post-incident Review (everyone attends)
   └─ Feedback: monthly 1-on-1 review
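
The MTTA and MTTR targets in item 3 can be checked mechanically once the measurements are available. The sketch below encodes those targets in minutes and compares a measured pair against them; the table name, function name, and result keys are assumptions for illustration.

```python
# Targets from item 3, in minutes, keyed by urgency.
TARGETS_MIN = {
    "high": {"mtta": 5, "mttr": 15},
    "medium": {"mtta": 15, "mttr": 60},
    "low": {"mtta": 30, "mttr": 240},
}


def check_targets(urgency: str, mtta_min: float, mttr_min: float) -> dict:
    """Report whether measured MTTA and MTTR are under the targets for an urgency."""
    target = TARGETS_MIN[urgency]
    return {
        "mtta_ok": mtta_min < target["mtta"],
        "mttr_ok": mttr_min < target["mttr"],
    }


# High-urgency averages of 1m45s and 15m30s: MTTA passes, MTTR just misses.
print(check_targets("high", 1.75, 15.5))
```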

6.3 Incident Analysis and Improvement Loop

Continuous Improvement Cycle:

Week 1-2: Collect Data
├─ Collect incident metrics
├─ Track alert volume
├─ Measure MTTA/MTTR
└─ Check SLA compliance

Week 3: Analysis
├─ Identify the top 5 recurring incidents
├─ Calculate the false positive rate
├─ Analyze escalation patterns
└─ Collect team feedback

Week 4: Actions & Review
├─ Root cause analysis (Top incidents)
├─ Decide on action items
│  ├─ Alert threshold adjustment
│  ├─ Runbook update
│  ├─ Escalation policy change
│  └─ Identify training needs
├─ Public review meeting (whole team)
│  └─ Share lessons learned, Q&A
└─ Publish the incident report


Monthly Report Example:
─────────────────────────────────────────────
Payment API Service - Incident Report (March 2024)

Total Incidents: 42
- High Urgency: 8 (19%)
- Medium Urgency: 18 (43%)
- Low Urgency: 16 (38%)

Key Metrics:
- MTTA: 3m 24s (target: < 5min) ✓
- MTTR: 42m 15s (target: < 1hr) ✓
- MTTD: 1m 30s

SLA Compliance:
- Response SLA: 95.8% (target: > 95%) ✓
- Resolution SLA: 87.5% (target: > 90%) ✗

Top Incidents:
1. Payment API latency (8 incidents, 19%)
   → Action: DB index optimization
2. Database timeout (6 incidents, 14%)
   → Action: Connection pool tuning
3. Memory spike (5 incidents, 12%)
   → Action: GC tuning review

Alert Quality:
- False Positives: 2.3% (target: < 5%) ✓
- Actionable Alerts: 94% (target: > 90%) ✓

Improvement Actions (April):
□ DB optimization (Database team - Q2)
□ Alert threshold tuning (SRE team - This week)
□ Runbook refresh (Platform team - Next week)
□ Team training on new escalation (April 15)

Next Review: May 1, 2024
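
The urgency percentages in the report are each count divided by the total and rounded to a whole number. A short sketch of that arithmetic, using the March counts from the report and a hypothetical helper name:

```python
def urgency_shares(counts: dict) -> dict:
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0 for name in counts}
    return {name: round(100 * n / total) for name, n in counts.items()}


march = {"high": 8, "medium": 18, "low": 16}  # 42 incidents in total
print(urgency_shares(march))  # {'high': 19, 'medium': 43, 'low': 38}
```

Rounded shares do not always sum to exactly 100, so a report that needs an exact total should keep the raw counts alongside the percentages.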

Chapter 7: Implementation Guide

7.1 PagerDuty Rollout Phases

Phase 1: Foundation (Week 1-2)
├─ Step 1: Account setup
│  ├─ Create the PagerDuty account
│  ├─ Configure admin users
│  └─ SSO integration (optional)
├─ Step 2: Team setup
│  ├─ Create engineering teams
│  ├─ Invite users
│  └─ Assign roles (Admin, Manager, Responder)
├─ Step 3: Create the basic services
│  ├─ Payment API Service
│  ├─ User Service
│  └─ Order Service
└─ Task: Confirm that everyone is set up

Phase 2: Integration (Week 3-4)
├─ Step 1: Monitoring tool integration
│  ├─ Configure the Datadog integration
│  ├─ Integrate Prometheus Alertmanager
│  └─ Integrate CloudWatch via SNS
├─ Step 2: Slack integration
│  ├─ Slack workspace connection
│  ├─ Channel created (#incidents)
│  └─ Test the /pagerduty command
├─ Step 3: Run test alerts
│  ├─ Each service × trigger test
│  └─ Notification channel confirm
└─ Task: Verify that the integrations work

Phase 3: Schedule & Escalation (Week 5-6)
├─ Step 1: Create schedules
│  ├─ Payment Team Primary Schedule
│  ├─ Payment Team Secondary Schedule
│  └─ Vacation/override management
├─ Step 2: Create escalation policies
│  ├─ Payment Escalation (L1→L2→L3)
│  ├─ Infrastructure Escalation
│  └─ Database Escalation
├─ Step 3: Service Linking
│  ├─ Link each Service to its Escalation Policy
│  └─ Configure alert creation rules
└─ Task: Full escalation walkthrough

Phase 4: Optimization (Week 7-8)
├─ Step 1: Alert Tuning
│  ├─ Threshold optimization
│  ├─ Alert grouping configuration
│  └─ False positive reduction
├─ Step 2: Runbook Integration
│  ├─ Configure service custom fields
│  ├─ Add runbook links
│  └─ Document response guidelines
├─ Step 3: Monitoring Setup
│  ├─ PagerDuty health dashboard
│  ├─ Metrics collection
│  └─ SLA tracking
└─ Task: Start metrics collection

Phase 5: Go Live (Week 9+)
├─ Step 1: Pilot period (2 weeks)
│  ├─ Limited service (1-2 teams)
│  ├─ Daily check-in
│  └─ Collect feedback
├─ Step 2: Full Rollout
│  ├─ All services activate
│  ├─ Full team onboarding
│  └─ Begin handling production incidents
└─ Step 3: Continuous Improvement
   ├─ Weekly review meeting
   ├─ Metrics dashboard daily check
   └─ Monthly optimization

7.2 Complete Configuration Example

Scenario: a complete PagerDuty configuration for the Payment API team

{
  "implementation": {
    "organization": "Acme Corp",
    "team": "Payment Platform",
    "services": [
      {
        "name": "Payment API",
        "description": "Core payment processing API",
        "id": "P1Z3Q4R",
        "escalation_policy": "Payment Team Escalation",
        "alert_creation": "all_alerts",
        "alert_grouping": "time",
        "alert_grouping_timeout": 300,
        "auto_resolve_timeout": 1800,
        "slas": {
          "response_sla": 900,
          "resolution_sla": 3600
        },
        "integrations": [
          {
            "type": "datadog",
            "routing_key": "datadog-payment-api"
          }
        ]
      }
    ],
    "schedules": [
      {
        "name": "Payment Primary On-Call",
        "id": "PS1",
        "timezone": "UTC",
        "layers": [
          {
            "name": "Primary Rotation",
            "users": ["alice", "bob", "charlie", "diana"],
            "rotation_length_days": 7,
            "start_day": "Monday",
            "start_time": "09:00",
            "coverage_percent": 100
          },
          {
            "name": "Extended Hours (5-9PM UTC)",
            "users": ["eve", "frank"],
            "rotation_length_days": 7,
            "restrictions": {
              "daily_start": "17:00",
              "daily_end": "21:00"
            },
            "coverage_percent": 17
          }
        ]
      },
      {
        "name": "Payment Secondary On-Call",
        "id": "PS2",
        "timezone": "UTC",
        "layers": [
          {
            "name": "Secondary Rotation",
            "users": ["grace", "henry", "iris", "jack"],
            "rotation_length_days": 14,
            "start_day": "Monday",
            "start_time": "09:00",
            "coverage_percent": 100
          }
        ]
      }
    ],
    "escalation_policies": [
      {
        "name": "Payment Team Escalation",
        "id": "PE12345",
        "levels": [
          {
            "level": 1,
            "delay_minutes": 0,
            "target": "Payment Primary On-Call",
            "type": "schedule"
          },
          {
            "level": 2,
            "delay_minutes": 5,
            "target": "Payment Secondary On-Call",
            "type": "schedule"
          },
          {
            "level": 3,
            "delay_minutes": 15,
            "target": "payments-manager@acmecorp.com",
            "type": "user"
          }
        ],
        "repeat": {
          "enabled": true,
          "times": 2
        }
      }
    ],
    "alert_routes": [
      {
        "source": "Datadog",
        "pattern": "payment-api",
        "service": "Payment API",
        "severity_mapping": {
          "critical": "high",
          "warning": "low",
          "info": "low"
        }
      },
      {
        "source": "Prometheus",
        "pattern": "payment_api_.*",
        "service": "Payment API",
        "grouping": "by_alert_name"
      }
    ],
    "notifications": {
      "channels": [
        {
          "type": "slack",
          "channel": "#payment-incidents",
          "events": ["triggered", "acknowledged", "resolved"]
        },
        {
          "type": "email",
          "recipient": "payment-team@acmecorp.com",
          "events": ["triggered", "resolved"]
        }
      ]
    }
  }
}
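The configuration above can be sanity-checked before anything is created in PagerDuty. The following is a minimal sketch, assuming the JSON is an illustrative planning format rather than a PagerDuty API payload; the helper names `daily_coverage_percent` and `unresolved_targets` are hypothetical. It derives the daily coverage of a restricted layer and confirms that every schedule-type escalation target names a schedule that exists.

```python
from datetime import datetime


def daily_coverage_percent(daily_start: str, daily_end: str) -> float:
    """Share of a 24-hour day covered by a daily restriction window."""
    start = datetime.strptime(daily_start, "%H:%M")
    end = datetime.strptime(daily_end, "%H:%M")
    minutes = (end - start).total_seconds() / 60
    if minutes <= 0:  # window wraps past midnight
        minutes += 24 * 60
    return round(100 * minutes / (24 * 60), 1)


def unresolved_targets(config: dict) -> list[str]:
    """Schedule targets in escalation levels that match no schedule name."""
    impl = config["implementation"]
    schedule_names = {s["name"] for s in impl["schedules"]}
    missing = []
    for policy in impl["escalation_policies"]:
        for level in policy["levels"]:
            if level["type"] == "schedule" and level["target"] not in schedule_names:
                missing.append(level["target"])
    return missing


# Trimmed copy of the example configuration (only the fields checked here).
config = {
    "implementation": {
        "schedules": [
            {"name": "Payment Primary On-Call"},
            {"name": "Payment Secondary On-Call"},
        ],
        "escalation_policies": [
            {
                "name": "Payment Team Escalation",
                "levels": [
                    {"level": 1, "target": "Payment Primary On-Call", "type": "schedule"},
                    {"level": 2, "target": "Payment Secondary On-Call", "type": "schedule"},
                    {"level": 3, "target": "payments-manager@acmecorp.com", "type": "user"},
                ],
            }
        ],
    }
}

print(daily_coverage_percent("17:00", "21:00"))  # 4 of 24 hours
print(unresolved_targets(config))                # empty when all targets resolve
```

A 17:00-21:00 window covers 4 of 24 hours, so a restricted layer like "Extended Hours" contributes roughly one sixth of a day on its own.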

7.3 Troubleshooting Guide

Common Issues and Solutions:

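Problem 1: Alerts do not reach PagerDuty
─────────────────────────────────────
Symptoms:
├─ The alert is confirmed as fired in the monitoring tool
└─ Nothing appears in PagerDuty

Diagnostic steps:
├─ Step 1: Check the integration key
│  └─ PagerDuty → Services → (service) → Integrations
├─ Step 2: Check that the routing key is exact
│  └─ Compare it with the monitoring tool's configuration (Events API v2 sends the integration key as routing_key)
├─ Step 3: Check the API call log
│  └─ Review the monitoring tool's delivery and configuration logs
└─ Step 4: Check network connectivity
   └─ curl -X POST https://events.pagerduty.com/v2/enqueue -d '...'

Resolution:
├─ Re-obtain and update the integration key
├─ Set the routing key exactly
├─ Check network firewall rules
└─ Contact Support (confirm the SID)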
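The connectivity check in Step 4 is easier to act on when the HTTP status is mapped to a likely cause. The sketch below assumes the documented Events API v2 behavior (202 for an accepted event, 400 for a rejected one, 429 for rate limiting) and the documented 32-character integration key; the function names are hypothetical.

```python
def looks_like_routing_key(key: str) -> bool:
    """Cheap shape check: assumes a 32-character alphanumeric integration key."""
    return len(key) == 32 and key.isalnum()


def diagnose_enqueue_status(status_code: int) -> str:
    """Map an Events API v2 HTTP status to the next troubleshooting step."""
    if status_code == 202:
        return "accepted: the event was queued, so check the service and its rules"
    if status_code == 400:
        return "rejected: check the JSON body and the routing_key"
    if status_code == 429:
        return "rate limited: slow down and retry later"
    if 500 <= status_code < 600:
        return "server error: retry with backoff"
    return "unexpected: check proxies and firewalls between the sender and PagerDuty"


print(looks_like_routing_key("YOUR_ROUTING_KEY"))  # placeholder, so this fails the check
print(diagnose_enqueue_status(202))
print(diagnose_enqueue_status(400))
```

A 202 with no incident in PagerDuty points away from the network and toward the service configuration, which narrows the remaining steps considerably.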


Problem 2: Incidents do not escalate
─────────────────────────────────────
Symptoms:
├─ The Level 1 user does not respond
└─ The incident does not escalate to Level 2

Diagnostic steps:
├─ Step 1: Check the escalation policy
│  └─ Services → Escalation Policy
├─ Step 2: Check the schedule
│  └─ On-Call Schedule → Current Coverage
├─ Step 3: Check the escalation delay settings
│  └─ delay_minutes for each level
└─ Step 4: Check user availability
   └─ User → On-Call Restrictions

Resolution:
├─ Review the escalation policy
├─ Fill gaps between schedule layers
├─ Confirm that the escalation delays are appropriate
└─ Check user schedule overrides (vacation, etc.)
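Gaps between schedule layers are a common reason an escalation level has nobody to notify. As a rough sketch, assuming each layer is reduced to a daily (start, end) window that does not wrap past midnight, the uncovered parts of a day can be listed as follows. The window values are made up for illustration.

```python
def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def fmt(minutes: int) -> str:
    """Convert minutes since midnight back to 'HH:MM'."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def coverage_gaps(windows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return uncovered (start, end) ranges within one day."""
    intervals = sorted((to_minutes(s), to_minutes(e)) for s, e in windows)
    gaps = []
    cursor = 0  # end of the coverage seen so far
    for start, end in intervals:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < 24 * 60:
        gaps.append((cursor, 24 * 60))
    return [(fmt(a), fmt(b)) for a, b in gaps]


# Hypothetical layers: a daytime layer plus an evening layer.
print(coverage_gaps([("09:00", "18:00"), ("17:00", "21:00")]))
# [('00:00', '09:00'), ('21:00', '24:00')]
```

An empty result means the layers together cover the full day; any returned range is a period during which that schedule would have no on-call user.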


Problem 3: Slack notifications are not sent
─────────────────────────────────
Symptoms:
├─ The PagerDuty incident is created correctly
└─ No Slack notification arrives

Diagnostic steps:
├─ Step 1: Check the Slack integration
│  └─ PagerDuty → Integrations → Slack
├─ Step 2: Check channel permissions
│  └─ Has the Slack bot joined the channel?
├─ Step 3: Check the service's notification routing
│  └─ Service → Notification rules
└─ Step 4: Check the Slack workspace log
   └─ App → PagerDuty activity

Resolution:
├─ Re-authorize the Slack workspace
├─ Invite the @PagerDuty bot to the channel
├─ Reconfigure the notification rules
└─ Check channel permissions


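Problem 4: Too many false positives
───────────────────────────────
Symptoms:
├─ Many alerts fire
└─ Most require no action

Diagnostic steps:
├─ Step 1: Check alert volume
│  └─ Incidents → List → Filter by service
├─ Step 2: Check incident content
│  └─ Did it actually require a response?
├─ Step 3: Check the alert rules (monitoring tool side)
│  └─ Are the thresholds reasonable?
└─ Step 4: Check PagerDuty metrics
   └─ Measure the false positive rate

Resolution:
├─ Adjust thresholds in the monitoring tool
├─ Fix the alert rule (identify the root cause)
├─ Introduce dynamic (learned) thresholds
└─ Disable the alert if it is not needed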
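The false positive rate in Step 4 can be measured from an exported incident list. The sketch below assumes each record carries a `service` name and an `actionable` flag set during review; both field names and the sample data are hypothetical, not a PagerDuty export format.

```python
from collections import Counter


def false_positive_rate(incidents: list[dict]) -> dict[str, float]:
    """Fraction of incidents per service that needed no action."""
    totals, noise = Counter(), Counter()
    for incident in incidents:
        totals[incident["service"]] += 1
        if not incident["actionable"]:
            noise[incident["service"]] += 1
    return {service: noise[service] / totals[service] for service in totals}


# Made-up review data for illustration.
incidents = [
    {"service": "Payment API", "actionable": False},
    {"service": "Payment API", "actionable": False},
    {"service": "Payment API", "actionable": True},
    {"service": "Payment API", "actionable": False},
    {"service": "User Service", "actionable": True},
    {"service": "User Service", "actionable": True},
]

for service, rate in false_positive_rate(incidents).items():
    print(f"{service}: {rate:.0%} false positives")
```

Sorting services by this rate gives a natural order for threshold tuning: the noisiest service is where an adjusted alert rule removes the most pages.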

Chapter 8: Advanced Operational Topics

8.1 Handling Complex Scenarios

Scenario 1: Multiple teams × multiple regions
─────────────────────────────────────

Example: a global payments company

Regions:
├─ APAC (Asia-Pacific)
│  ├─ On-call: Tokyo-based team
│  ├─ Business hours: 9:00-18:00 JST
│  └─ Escalation: APAC regional manager
├─ EMEA (Europe, Middle East, Africa)
│  ├─ On-call: London-based team
│  ├─ Business hours: 9:00-18:00 GMT
│  └─ Escalation: EMEA regional manager
└─ AMER (Americas)
   ├─ On-call: San Francisco-based team
   ├─ Business hours: 9:00-18:00 PST
   └─ Escalation: AMER regional manager

Implementation:
├─ Service per region
│  ├─ payment-api-apac
│  ├─ payment-api-emea
│  └─ payment-api-amer
├─ On-call schedule per region
│  ├─ APAC timezone (JST)
│  ├─ EMEA timezone (GMT)
│  └─ AMER timezone (PST)
├─ Cross-region escalation (24h coverage)
│  ├─ APAC L3 → EMEA L1
│  ├─ EMEA L3 → AMER L1
│  └─ AMER L3 → APAC L1
└─ Global war room escalation
   └─ All levels exhausted → Global on-call list
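Whether the three regional windows really add up to 24-hour coverage can be checked with a few lines. This is a minimal sketch that assumes fixed UTC offsets for JST, GMT, and PST as written above (so daylight saving time is ignored) and ignores weekends; the `REGIONS` table and function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets matching the zone names used above (no DST handling).
REGIONS = {
    "APAC": timezone(timedelta(hours=9)),   # JST
    "EMEA": timezone(timedelta(hours=0)),   # GMT
    "AMER": timezone(timedelta(hours=-8)),  # PST
}


def regions_in_business_hours(now_utc: datetime) -> list[str]:
    """Regions whose local time falls within 9:00-18:00."""
    active = []
    for region, tz in REGIONS.items():
        local = now_utc.astimezone(tz)
        if 9 <= local.hour < 18:
            active.append(region)
    return active


# Walk one UTC day and report any hour with no region in business hours.
day = datetime(2024, 4, 8, tzinfo=timezone.utc)
uncovered = [
    hour for hour in range(24)
    if not regions_in_business_hours(day + timedelta(hours=hour))
]
print("Uncovered UTC hours:", uncovered)
print("Active at 10:00 UTC:", regions_in_business_hours(day + timedelta(hours=10)))
```

With these offsets the windows hand over without a gap; once daylight saving time is taken into account the boundaries shift by an hour, which is one reason the cross-region escalation steps are worth keeping.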


Scenario 2: Multiple services with dependencies
──────────────────────────────────────

Example: Payment → Fraud Detection → User Service

Dependency:
Payment API Service
  ↓ depends on
Fraud Detection Service
  ↓ depends on
User Service

Incident Correlation:
User Service Down
  → Fraud Detection fails
    → Payment API fails
    → Three incidents are created

Best Practice:
├─ Configure correlation rules
│  └─ Link the three incidents
├─ Prioritize the root-cause incident
│  └─ Mark User Service as PRIMARY
├─ Dependent Service escalation
│  └─ When the parent incident escalates, the child incidents escalate too
└─ Single incident view
   └─ List all related incidents
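The correlation idea above can be illustrated outside PagerDuty with a small dependency map. The sketch below is an assumption-laden illustration, not PagerDuty's correlation logic: `DEPENDS_ON` mirrors the example chain, and a failing service is treated as a root cause when none of its direct dependencies is also failing.

```python
# Direct dependencies, mirroring the example chain above.
DEPENDS_ON = {
    "Payment API": ["Fraud Detection"],
    "Fraud Detection": ["User Service"],
    "User Service": [],
}


def root_causes(failing: set[str]) -> list[str]:
    """Failing services none of whose direct dependencies are failing."""
    return sorted(
        service
        for service in failing
        if not any(dep in failing for dep in DEPENDS_ON.get(service, []))
    )


def group_incidents(failing: set[str]) -> dict[str, list[str]]:
    """Split failing services into the primary incident and linked ones."""
    primary = root_causes(failing)
    return {"primary": primary, "linked": sorted(failing - set(primary))}


print(group_incidents({"Payment API", "Fraud Detection", "User Service"}))
# {'primary': ['User Service'], 'linked': ['Fraud Detection', 'Payment API']}
```

If only Payment API and Fraud Detection are failing, the same function names Fraud Detection as primary, which matches the intuition that the deepest failing dependency should be investigated first.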


Scenario 3: SLA-based dynamic routing
──────────────────────────────────────

Severity Level × SLA Target × Escalation

| Severity | SLA Response | SLA Resolution | Escalation (min)   |
|----------|--------------|----------------|--------------------|
| P1       | 5 min        | 15 min         | L1→L2→L3 (0, 3, 8) |
| P2       | 15 min       | 1 hour         | L1→L2 (0, 8)       |
| P3       | 30 min       | 4 hours        | L1→L2 (0, 15)      |
| P4       | 1 hour       | 8 hours        | L1 only            |

Implementation:
├─ Set incident urgency
│  ├─ high (P1, P2)
│  └─ low (P3, P4)
├─ Escalation policy per urgency
│  └─ Dynamic routing
├─ SLA tracking per level
│  └─ Compliance dashboard
└─ Auto-escalate if SLA at risk
   └─ 70% of target time → escalate
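The 70% rule can be written down directly from the table. This is a minimal sketch assuming the response SLA is the target being tracked; the dictionaries and function name are hypothetical and would normally live in whatever automation watches open incidents.

```python
# Response SLA targets in minutes and urgency mapping, taken from the table above.
RESPONSE_SLA_MIN = {"P1": 5, "P2": 15, "P3": 30, "P4": 60}
URGENCY = {"P1": "high", "P2": "high", "P3": "low", "P4": "low"}
RISK_FRACTION = 0.7  # escalate once 70% of the target time has elapsed


def should_auto_escalate(priority: str, minutes_unacknowledged: float) -> bool:
    """True when an unacknowledged incident has used 70% of its response SLA."""
    return minutes_unacknowledged / RESPONSE_SLA_MIN[priority] >= RISK_FRACTION


for priority, elapsed in [("P1", 3), ("P1", 4), ("P2", 10), ("P2", 11)]:
    decision = "escalate" if should_auto_escalate(priority, elapsed) else "wait"
    print(f"{priority} ({URGENCY[priority]} urgency), {elapsed} min unacknowledged: {decision}")
```

For P1 the threshold falls at 3.5 minutes, close to the 3-minute L2 delay in the table, so the escalation policy and the SLA rule reinforce each other rather than compete.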

8.2 Testing and Simulation

Incident Drill (run monthly)

Drill 1: Single Service Incident
──────────────────────────────────
Scenario: Payment API down
├─ Trigger: Manual test incident
├─ Participants: Payment team
├─ Expected flow:
│  1. Incident triggered
│  2. MTTA < 5 min (target)
│  3. Investigation start
│  4. RCA process
│  5. Resolution (simulated)
│  6. Post-incident review
└─ Measurement: MTTA, MTTR, SLA hit
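The drill measurements can be computed from three timestamps per incident. The sketch below assumes timestamps are recorded as ISO-like strings without a timezone, and the record fields (`triggered`, `acknowledged`, `resolved`) and sample values are hypothetical.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
MTTA_TARGET_MIN = 5  # target from the drill description above


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


def drill_metrics(incidents: list[dict]) -> dict:
    """Mean time to acknowledge and to resolve, in minutes."""
    mtta = mean(minutes_between(i["triggered"], i["acknowledged"]) for i in incidents)
    mttr = mean(minutes_between(i["triggered"], i["resolved"]) for i in incidents)
    return {"mtta_min": mtta, "mttr_min": mttr, "mtta_target_met": mtta < MTTA_TARGET_MIN}


# Made-up drill records for illustration.
drills = [
    {"triggered": "2024-04-07T10:00:00", "acknowledged": "2024-04-07T10:03:00",
     "resolved": "2024-04-07T10:40:00"},
    {"triggered": "2024-04-07T14:00:00", "acknowledged": "2024-04-07T14:05:00",
     "resolved": "2024-04-07T14:30:00"},
]
print(drill_metrics(drills))
```

Tracking these values drill over drill shows whether schedule or escalation changes are actually shortening response time.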


Drill 2: Escalation Chain Test
─────────────────────────────────
Scenario: Primary on-call not responding
├─ Trigger: Primary intentionally ignores
├─ Participants: Full escalation chain
├─ Expected flow:
│  1. L1 notification → no ack (5 min)
│  2. L2 notification → ack success
│  3. L2 responds
│  4. Post-drill review
└─ Measurement: Escalation timing accuracy
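Escalation timing accuracy can be scored by comparing when each level was notified against when it should have been. The following sketch assumes offsets are measured in minutes from the trigger and that a one-minute tolerance is acceptable; both the tolerance and the names are assumptions for illustration.

```python
# Expected notification offsets (minutes after trigger) for Drill 2.
EXPECTED_OFFSETS_MIN = {1: 0, 2: 5}


def timing_errors(observed_offsets_min: dict[int, float],
                  tolerance_min: float = 1.0) -> dict[int, float]:
    """Levels whose notification time is off by more than the tolerance.

    The value is the deviation in minutes, or infinity if the level
    was never notified.
    """
    errors = {}
    for level, expected in EXPECTED_OFFSETS_MIN.items():
        observed = observed_offsets_min.get(level)
        if observed is None:
            errors[level] = float("inf")
        elif abs(observed - expected) > tolerance_min:
            errors[level] = observed - expected
    return errors


print(timing_errors({1: 0.2, 2: 5.4}))  # within tolerance
print(timing_errors({1: 0.1, 2: 7.5}))  # L2 notified late
print(timing_errors({1: 0.0}))          # L2 never notified
```

An empty result means the chain behaved as configured; a late or missing level usually points back to the escalation delays or a schedule gap.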


Drill 3: Multi-service Correlation
────────────────────────────────────
Scenario: Cascading failure (User Service → Payment)
├─ Trigger: Multiple test incidents
├─ Participants: Multiple teams
├─ Expected flow:
│  1. User Service incident triggered
│  2. Payment Service incident triggered (dependent)
│  3. Correlation detection
│  4. Cross-team communication
│  5. Root cause (User Service)
│  6. Coordinated resolution
└─ Measurement: Team coordination, communication speed


Test Incident Submission (API):
────────────────────────────────

# $(date +%s) is not expanded inside single quotes, so the quoting is
# closed around it to produce a unique dedup_key for each drill.
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "YOUR_ROUTING_KEY",
    "event_action": "trigger",
    "dedup_key": "test-incident-drill-'"$(date +%s)"'",
    "payload": {
      "summary": "[TEST] Payment API Down - Incident Drill",
      "severity": "critical",
      "source": "test-system",
      "custom_details": {
        "test_id": "drill-2024-04-07-001",
        "test_type": "escalation_chain",
        "expected_responders": "alice, bob, charlie"
      }
    }
  }'

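Summary

PagerDuty is a standard incident management platform in modern SRE/DevOps environments. Correctly understanding and implementing the features and architecture described in this guide makes it possible to achieve:

Faster incident detection (lower MTTD)
Reliable notification of responders (lower MTTA)
Faster incident resolution (lower MTTR)
More efficient communication between teams
A continuous improvement cycle

A unified incident management platform such as PagerDuty is especially important in large environments that operate multiple teams, regions, and systems.

Adopting PagerDuty's features in stages as the organization grows is recommended.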
