Incident Response Playbook - MyTelevision API

This playbook provides step-by-step procedures for responding to common incidents detected by the monitoring system.

Incident Severity Levels

Level | Response Time | Examples
---|---|---
Critical | 15 minutes | API down, database down, security breach
High | 1 hour | High error rate, payment failures
Medium | 4 hours | High latency, memory warnings
Low | 24 hours | Non-critical alerts, performance degradation

Quick Reference

Alert | Playbook Section
---|---
APIDown | API Down
PostgreSQLDown | Database Down
RedisDown | Redis Down
HighErrorRate | High Error Rate
HighResponseTime | High Latency
HighMemoryUsage | Memory Issues
HighCPUUsage | CPU Issues
DiskSpaceCritical | Disk Space
PossibleBruteForce | Security Incident
PaymentFailuresHigh | Payment Issues

API Down

Alert: APIDown

Severity: Critical
Condition: API not responding for 1+ minute

Immediate Actions

  1. Verify the Alert

    # Check API health
    curl -s http://localhost:3000/api/v2/health | jq

    # Check if container is running
    docker ps | grep mytelevision-api

    # Check container logs
    docker logs mytelevision-api --tail 100
  2. Check Resource Constraints

    # Check container resource usage
    docker stats mytelevision-api --no-stream

    # Check host resources
    free -m
    df -h
  3. Restart if Necessary

    # Soft restart
    docker-compose restart api

    # Hard restart (last resort): recreate the container
    docker-compose stop api && docker-compose rm -f api && docker-compose up -d api
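
After either restart, confirm the API has actually recovered before closing the alert, using the same health endpoint as above:

# Give the container a moment to boot, then re-check health
sleep 10
curl -s http://localhost:3000/api/v2/health | jq
docker ps | grep mytelevision-api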

Investigation

  1. Check application logs for errors:

    docker logs mytelevision-api --since 10m 2>&1 | grep -i error
  2. Check if dependencies are healthy:

    # PostgreSQL
    docker exec mytelevision-postgres pg_isready

    # Redis
    docker exec mytelevision-redis redis-cli ping
  3. Review recent deployments or changes
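
For step 3, a quick way to see what is currently deployed and when it changed. This is a sketch: the git checkout path is a placeholder for wherever the deployment lives on the host.

# When was the running container (re)created, and from which image?
docker inspect mytelevision-api --format '{{.Created}} {{.Config.Image}}'

# Recent commits, if the deployment checkout is on this host (path is an assumption)
git -C /opt/mytelevision log --oneline -10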

Escalation

If not resolved within 15 minutes:

  1. Page senior engineer
  2. Notify stakeholders
  3. Consider rollback if recent deployment

Database Down

Alert: PostgreSQLDown

Severity: Critical
Condition: PostgreSQL not responding for 1+ minute

Immediate Actions

  1. Verify the Alert

    # Check PostgreSQL status
    docker exec mytelevision-postgres pg_isready -U mytelevision

    # Check container status
    docker ps | grep postgres

    # Check logs
    docker logs mytelevision-postgres --tail 100
  2. Check Connection Pool

    # Check active connections
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

    # Check for blocked queries
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity WHERE state = 'active'
    ORDER BY duration DESC LIMIT 10;"
  3. Terminate Blocking Queries (if necessary)

    # Kill long-running queries (> 5 minutes)
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
    WHERE state = 'active' AND query_start < now() - interval '5 minutes';"

Investigation

  1. Check disk space:

    docker exec mytelevision-postgres df -h /var/lib/postgresql/data
  2. Check for replication lag (if applicable)

  3. Review recent schema changes or migrations
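
Minimal checks for steps 2 and 3. The replication query only returns rows if a standby is attached to this primary, and `prisma migrate status` assumes the Prisma CLI is available in the API project (the playbook already uses Prisma migrations in Recovery below).

# Replication lag as seen from the primary (empty result = no replicas attached)
docker exec mytelevision-postgres psql -U mytelevision -c \
"SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# Compare applied migrations against the schema (run from the API project root)
npx prisma migrate status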

Recovery

If database needs restart:

# Graceful restart
docker-compose restart postgres

# Wait for recovery
sleep 30

# Verify and run migrations if needed
npm run prisma:migrate:deploy

Redis Down

Alert: RedisDown

Severity: Critical
Condition: Redis not responding for 2+ minutes

Immediate Actions

  1. Verify the Alert

    # Check Redis status
    docker exec mytelevision-redis redis-cli ping

    # Check container
    docker ps | grep redis

    # Check logs
    docker logs mytelevision-redis --tail 50
  2. Check Memory

    docker exec mytelevision-redis redis-cli info memory
  3. Restart if Necessary

    docker-compose restart redis

Impact Assessment

Redis being down affects:

  • Session storage
  • Rate limiting
  • Caching (degraded performance)

The application should fall back gracefully, but verify:

# Check API is still responding
curl -s http://localhost:3000/api/v2/health
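
Beyond the health check, it is worth confirming that the degraded paths behave as expected. A sketch, assuming the API logs Redis connection problems:

# API should keep returning 200s, possibly a bit slower without the cache
for i in 1 2 3; do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://localhost:3000/api/v2/health
done

# Look for Redis fallback warnings or errors in the application logs
docker logs mytelevision-api --since 5m 2>&1 | grep -i redis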

High Error Rate

Alert: HighErrorRate / HighCriticalErrorRate

Severity: High / Critical
Condition: Error rate > 1% (High) or > 5% (Critical) for 5+ minutes

Immediate Actions

  1. Identify Error Pattern

    # Check Grafana dashboard
    # Open: http://localhost:3001/d/api-overview

    # Or query Prometheus directly
    curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))' | jq
  2. Check Application Logs

    docker logs mytelevision-api --since 10m 2>&1 | grep -i "error\|exception\|failed"
  3. Identify Affected Endpoints

    # From Prometheus
    curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))by(path))' | jq

Investigation

  1. Check if specific endpoint is failing
  2. Review recent deployments
  3. Check external dependencies (TMDB, Firebase, etc.)
  4. Verify database connectivity
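
Quick spot checks for steps 3 and 4. The TMDB call assumes the key is exported as TMDB_API_KEY; adapt to however credentials are stored in your environment.

# TMDB reachability and key validity
curl -s -o /dev/null -w 'TMDB: %{http_code}\n' \
  "https://api.themoviedb.org/3/configuration?api_key=${TMDB_API_KEY}"

# Database and Redis connectivity
docker exec mytelevision-postgres pg_isready -U mytelevision
docker exec mytelevision-redis redis-cli ping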

Resolution

Depending on cause:

  • Bad deployment: Rollback
  • External service: Implement fallback or retry
  • Database issue: See Database Down
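
If a bad deployment is confirmed, a rollback sketch. This assumes the compose file reads the API image tag from an environment variable (API_IMAGE_TAG is hypothetical); substitute your actual deployment mechanism.

# Redeploy the last known-good image tag
export API_IMAGE_TAG=<previous-good-tag>
docker-compose pull api
docker-compose up -d api

# Confirm
curl -s http://localhost:3000/api/v2/health | jq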

High Latency

Alert: HighResponseTime

Severity: Medium
Condition: P95 latency > 500ms for 5+ minutes

Immediate Actions

  1. Identify Slow Endpoints

    # Check Grafana - look at "Response Time Percentiles" panel
    # Open: http://localhost:3001/d/api-overview
  2. Check Database Performance

    # Slow queries (requires the pg_stat_statements extension;
    # on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT query, calls, mean_time, total_time
    FROM pg_stat_statements
    ORDER BY mean_time DESC LIMIT 10;"
  3. Check Cache Hit Rate

    docker exec mytelevision-redis redis-cli info stats | grep hit
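
The hit rate is keyspace_hits / (keyspace_hits + keyspace_misses); a one-liner to compute it from the same `info stats` output:

docker exec mytelevision-redis redis-cli info stats | \
  awk -F: '/keyspace_hits/ {h=$2} /keyspace_misses/ {m=$2} END {if (h+m > 0) printf "hit rate: %.1f%%\n", 100*h/(h+m); else print "no keyspace activity yet"}'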

Investigation

  1. Profile slow endpoints
  2. Check N+1 query issues
  3. Verify indexes are being used
  4. Check Redis cache effectiveness
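
For step 3, inspect the plan of a suspect statement with EXPLAIN. The table and filter below are placeholders; paste the actual slow query identified from pg_stat_statements.

# Placeholder query: replace with the real statement under investigation
docker exec mytelevision-postgres psql -U mytelevision -c \
"EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM programs WHERE channel_id = 42;"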

Resolution

  • Add missing indexes
  • Optimize queries
  • Implement caching
  • Scale horizontally if needed

Memory Issues

Alert: HighMemoryUsage

Severity: Medium
Condition: Memory usage > 85% for 5+ minutes

Immediate Actions

  1. Identify Memory Consumer

    # Check container memory
    docker stats --no-stream

    # Check Node.js heap
    curl -s http://localhost:3000/metrics | grep nodejs_heap
  2. Check for Memory Leaks

    • Review Grafana "Memory Usage" panel trends
    • Look for continuously increasing memory
  3. Force Garbage Collection (temporary)

    # Restart API with GC exposed (if configured)
    # Add to NODE_OPTIONS: --expose-gc
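
To confirm the trend from step 2 without Grafana, sample the default Node.js heap metric a few times. This assumes the prom-client default metrics are enabled, as the /metrics grep above suggests.

# One sample per minute for five minutes; steadily increasing values point to a leak
for i in $(seq 1 5); do
  date +%T
  curl -s http://localhost:3000/metrics | grep '^nodejs_heap_size_used_bytes'
  sleep 60
done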

Investigation

  1. Check for memory leaks in code
  2. Review recent changes to services
  3. Check for large data structures in cache

Resolution

  • Fix memory leaks
  • Increase memory limits
  • Implement pagination for large queries
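
As a stop-gap while a leak is being fixed, the running container's limit can be raised in place (this reverts when the container is recreated; the values are illustrative only):

# Raise the memory limit on the live container
docker update --memory 1g --memory-swap 1g mytelevision-api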

CPU Issues

Alert: HighCPUUsage

Severity: Medium
Condition: CPU usage > 80% for 5+ minutes

Immediate Actions

  1. Identify CPU Consumer

    docker stats --no-stream

    # Check process in container
    docker exec mytelevision-api top -b -n 1
  2. Check for Intensive Operations

    • Review active requests
    • Check for runaway processes
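
For step 2, the request mix and the event loop give a quick read on where CPU is going. The Prometheus query mirrors the one in the error-rate section; the event-loop metric assumes prom-client default metrics are enabled.

# Busiest endpoints by request rate
curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum(rate(mytelevision_http_requests_total[5m]))by(path))' | jq

# Event-loop lag (sustained high values mean the process is CPU-bound)
curl -s http://localhost:3000/metrics | grep nodejs_eventloop_lag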

Investigation

  1. Profile CPU-intensive endpoints
  2. Check for infinite loops
  3. Review background job activity

Resolution

  • Optimize CPU-intensive code
  • Add rate limiting
  • Scale horizontally

Disk Space

Alert: DiskSpaceCritical

Severity: Critical
Condition: Disk usage > 95%

Immediate Actions

  1. Identify Space Usage

    df -h
    du -sh /var/lib/docker/*
  2. Clean Docker Resources

    # Remove unused images
    docker image prune -a

    # Remove unused volumes (caution: deletes any volume not referenced by at least one
    # container; double-check before pruning on a host that keeps database data in volumes)
    docker volume prune

    # Remove remaining unused data (stopped containers, networks, dangling images, build cache;
    # --volumes also prunes unused volumes)
    docker system prune --volumes
  3. Clean Application Logs

    # Truncate large log files
    truncate -s 0 /var/log/mytelevision/*.log

Investigation

  1. Check Prometheus data retention
  2. Review PostgreSQL WAL size
  3. Check for large temp files
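
Quick measurements for these checks:

# Per-volume disk usage (Prometheus and PostgreSQL data usually live here)
docker system df -v

# PostgreSQL WAL size inside the data directory
docker exec mytelevision-postgres du -sh /var/lib/postgresql/data/pg_wal

# Largest directories on the filesystem holding /var (includes Docker data and temp files)
du -xh /var 2>/dev/null | sort -rh | head -20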

Resolution

  • Reduce Prometheus retention
  • Archive old data
  • Add disk space

Security Incident

Alert: PossibleBruteForce

Severity: High
Condition: > 10 failed logins per minute

Immediate Actions

  1. Identify Attack Source

    # Top offending IPs from failed-login entries
    # (the awk field assumes the IP is the last token on the line; adjust to your log format)
    docker logs mytelevision-api --since 30m 2>&1 | grep "failed login" | awk '{print $NF}' | sort | uniq -c | sort -rn
  2. Block Suspicious IPs (if needed)

    # Add to firewall or reverse proxy
    iptables -A INPUT -s <IP> -j DROP
  3. Enable Enhanced Rate Limiting

    • Temporarily reduce rate limits
    • Enable CAPTCHA if available

Investigation

  1. Identify targeted accounts
  2. Check for successful breaches
  3. Review access patterns
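
A sketch for steps 1 and 2; the grep patterns and awk field positions are assumptions about the log format and will need adapting.

# Accounts targeted most often in the window
docker logs mytelevision-api --since 30m 2>&1 | grep -i "failed login" | \
  awk '{print $(NF-1)}' | sort | uniq -c | sort -rn | head

# Did the attacking IP ever log in successfully?
docker logs mytelevision-api --since 30m 2>&1 | grep "<IP>" | grep -i "login success"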

Post-Incident

  1. Reset compromised passwords
  2. Review and update rate limits
  3. Document incident

Payment Issues

Alert: PaymentFailuresHigh

Severity: High
Condition: Payment failure rate > 5%

Immediate Actions

  1. Check Payment Provider Status

    • Visit provider status page
    • Check provider dashboard
  2. Review Error Messages

    docker logs mytelevision-api --since 30m 2>&1 | grep -i "payment\|stripe\|failed"
  3. Verify API Keys

    • Check payment provider API key validity
    • Verify webhook endpoints
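
If the provider is Stripe (as the log filter above suggests), a minimal credential check against the Stripe API. It assumes the secret key is available as STRIPE_SECRET_KEY.

# A valid key returns an object of type "balance"; an invalid key returns an authentication error
curl -s https://api.stripe.com/v1/balance -u "${STRIPE_SECRET_KEY}:" | jq '.object'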

Investigation

  1. Identify failure patterns (card type, region, amount)
  2. Check for provider rate limits
  3. Review recent payment code changes

Resolution

  • Contact payment provider if needed
  • Implement retry logic
  • Add fallback payment methods

Post-Incident Procedures

For All Incidents

  1. Document the Incident

    • Start time and duration
    • Impact (users affected, revenue impact)
    • Root cause
    • Resolution steps
  2. Create Post-Mortem

    • Timeline of events
    • What went wrong
    • What went well
    • Action items
  3. Update Runbooks

    • Add new procedures discovered
    • Update alert thresholds if needed
    • Improve automation

Stakeholder Communication

During Incident:

  • Update status page
  • Notify customer support
  • Send internal updates every 30 minutes

After Incident:

  • Send resolution notice
  • Schedule post-mortem meeting
  • Share learnings with team

Emergency Contacts

Role | Contact | Escalation Time
---|---|---
On-Call Engineer | [email protected] | Immediate
Backend Lead | [email protected] | 15 minutes
DevOps Lead | [email protected] | 15 minutes
CTO | [email protected] | 30 minutes