Incident Response Playbook - MyTelevision API

This playbook provides step-by-step procedures for responding to common incidents detected by the monitoring system.

Incident Severity Levels

Level | Response Time | Examples
---|---|---
Critical | 15 minutes | API down, database down, security breach
High | 1 hour | High error rate, payment failures
Medium | 4 hours | High latency, memory warnings
Low | 24 hours | Non-critical alerts, performance degradation

Quick Reference

Alert | Playbook Section
---|---
APIDown | API Down
PostgreSQLDown | Database Down
RedisDown | Redis Down
HighErrorRate | High Error Rate
HighResponseTime | High Latency
HighMemoryUsage | Memory Issues
HighCPUUsage | CPU Issues
DiskSpaceCritical | Disk Space
PossibleBruteForce | Security Incident
PaymentFailuresHigh | Payment Issues

API Down

Alert: APIDown

Severity: Critical
Condition: API not responding for 1+ minute

Immediate Actions

  1. Verify the Alert

    # Check API health
    curl -s http://localhost:3000/api/v2/health | jq

    # Check if container is running
    docker ps | grep mytelevision-api

    # Check container logs
    docker logs mytelevision-api --tail 100
  2. Check Resource Constraints

    # Check container resource usage
    docker stats mytelevision-api --no-stream

    # Check host resources
    free -m
    df -h
  3. Restart if Necessary

    # Soft restart
    docker-compose restart api

    # Hard restart (last resort): recreate the container
    docker-compose stop api && docker-compose rm -f api && docker-compose up -d api
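
After either restart, confirm the API has actually recovered before closing the alert, using the same health endpoint as above:

# Give the container a moment to boot, then re-check health
sleep 10
curl -s http://localhost:3000/api/v2/health | jq
docker ps | grep mytelevision-api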

Investigation

  1. Check application logs for errors:

    docker logs mytelevision-api --since 10m 2>&1 | grep -i error
  2. Check if dependencies are healthy:

    # PostgreSQL
    docker exec mytelevision-postgres pg_isready

    # Redis
    docker exec mytelevision-redis redis-cli ping
  3. Review recent deployments or changes
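
For step 3, a quick way to see what is currently deployed and when it changed. This is a sketch: the git checkout path is a placeholder for wherever the deployment lives on the host.

# When was the running container (re)created, and from which image?
docker inspect mytelevision-api --format '{{.Created}} {{.Config.Image}}'

# Recent commits, if the deployment checkout is on this host (path is an assumption)
git -C /opt/mytelevision log --oneline -10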

Escalation

If not resolved within 15 minutes:

  1. Page senior engineer
  2. Notify stakeholders
  3. Consider rollback if recent deployment

Database Down

Alert: PostgreSQLDown

Severity: Critical
Condition: PostgreSQL not responding for 1+ minute

Immediate Actions

  1. Verify the Alert

    # Check PostgreSQL status
    docker exec mytelevision-postgres pg_isready -U mytelevision

    # Check container status
    docker ps | grep postgres

    # Check logs
    docker logs mytelevision-postgres --tail 100
  2. Check Connection Pool

    # Check active connections
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

    # Check for blocked queries
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity WHERE state = 'active'
    ORDER BY duration DESC LIMIT 10;"
  3. Terminate Blocking Queries (if necessary)

    # Kill long-running queries (> 5 minutes)
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
    WHERE state = 'active' AND query_start < now() - interval '5 minutes';"

Investigation

  1. Check disk space:

    docker exec mytelevision-postgres df -h /var/lib/postgresql/data
  2. Check for replication lag (if applicable)

  3. Review recent schema changes or migrations
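
Minimal checks for steps 2 and 3. The replication query only returns rows if a standby is attached to this primary, and `prisma migrate status` assumes the Prisma CLI is available in the API project (the playbook already uses Prisma migrations in Recovery below).

# Replication lag as seen from the primary (empty result = no replicas attached)
docker exec mytelevision-postgres psql -U mytelevision -c \
"SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# Compare applied migrations against the schema (run from the API project root)
npx prisma migrate status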

Recovery

If database needs restart:

# Graceful restart
docker-compose restart postgres

# Wait for recovery
sleep 30

# Verify and run migrations if needed
npm run prisma:migrate:deploy

Redis Down

Alert: RedisDown

Severity: Critical
Condition: Redis not responding for 2+ minutes

Immediate Actions

  1. Verify the Alert

    # Check Redis status
    docker exec mytelevision-redis redis-cli ping

    # Check container
    docker ps | grep redis

    # Check logs
    docker logs mytelevision-redis --tail 50
  2. Check Memory

    docker exec mytelevision-redis redis-cli info memory
  3. Restart if Necessary

    docker-compose restart redis

Impact Assessment

Redis being down affects:

  • Session storage
  • Rate limiting
  • Caching (degraded performance)

The application should fall back gracefully, but verify:

# Check API is still responding
curl -s http://localhost:3000/api/v2/health
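
Beyond the health check, it is worth confirming that the degraded paths behave as expected. A sketch, assuming the API logs Redis connection problems:

# API should keep returning 200s, possibly a bit slower without the cache
for i in 1 2 3; do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://localhost:3000/api/v2/health
done

# Look for Redis fallback warnings or errors in the application logs
docker logs mytelevision-api --since 5m 2>&1 | grep -i redis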

High Error Rate

Alert: HighErrorRate / HighCriticalErrorRate

Severity: High / Critical
Condition: Error rate > 1% (High) or > 5% (Critical) for 5+ minutes

Immediate Actions

  1. Identify Error Pattern

    # Check Grafana dashboard
    # Open: http://localhost:3001/d/api-overview

    # Or query Prometheus directly
    curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))' | jq
  2. Check Application Logs

    docker logs mytelevision-api --since 10m 2>&1 | grep -i "error\|exception\|failed"
  3. Identify Affected Endpoints

    # From Prometheus
    curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))by(path))' | jq

Investigation

  1. Check if specific endpoint is failing
  2. Review recent deployments
  3. Check external dependencies (TMDB, Firebase, etc.)
  4. Verify database connectivity
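
Quick spot checks for steps 3 and 4. The TMDB call assumes the key is exported as TMDB_API_KEY; adapt to however credentials are stored in your environment.

# TMDB reachability and key validity
curl -s -o /dev/null -w 'TMDB: %{http_code}\n' \
  "https://api.themoviedb.org/3/configuration?api_key=${TMDB_API_KEY}"

# Database and Redis connectivity
docker exec mytelevision-postgres pg_isready -U mytelevision
docker exec mytelevision-redis redis-cli ping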

Resolution

Depending on cause:

  • Bad deployment: Rollback
  • External service: Implement fallback or retry
  • Database issue: See Database Down
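
If a bad deployment is confirmed, a rollback sketch. This assumes the compose file reads the API image tag from an environment variable (API_IMAGE_TAG is hypothetical); substitute your actual deployment mechanism.

# Redeploy the last known-good image tag
export API_IMAGE_TAG=<previous-good-tag>
docker-compose pull api
docker-compose up -d api

# Confirm
curl -s http://localhost:3000/api/v2/health | jq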

High Latency

Alert: HighResponseTime

Severity: Medium
Condition: P95 latency > 500ms for 5+ minutes

Immediate Actions

  1. Identify Slow Endpoints

    # Check Grafana - look at "Response Time Percentiles" panel
    # Open: http://localhost:3001/d/api-overview
  2. Check Database Performance

    # Slow queries (requires the pg_stat_statements extension;
    # on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
    docker exec mytelevision-postgres psql -U mytelevision -c \
    "SELECT query, calls, mean_time, total_time
    FROM pg_stat_statements
    ORDER BY mean_time DESC LIMIT 10;"
  3. Check Cache Hit Rate

    docker exec mytelevision-redis redis-cli info stats | grep hit
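
The hit rate is keyspace_hits / (keyspace_hits + keyspace_misses); a one-liner to compute it from the same `info stats` output:

docker exec mytelevision-redis redis-cli info stats | \
  awk -F: '/keyspace_hits/ {h=$2} /keyspace_misses/ {m=$2} END {if (h+m > 0) printf "hit rate: %.1f%%\n", 100*h/(h+m); else print "no keyspace activity yet"}'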

Investigation

  1. Profile slow endpoints
  2. Check N+1 query issues
  3. Verify indexes are being used
  4. Check Redis cache effectiveness
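
For step 3, inspect the plan of a suspect statement with EXPLAIN. The table and filter below are placeholders; paste the actual slow query identified from pg_stat_statements.

# Placeholder query: replace with the real statement under investigation
docker exec mytelevision-postgres psql -U mytelevision -c \
"EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM programs WHERE channel_id = 42;"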

Resolution

  • Add missing indexes
  • Optimize queries
  • Implement caching
  • Scale horizontally if needed

Memory Issues

Alert: HighMemoryUsage

Severity: Medium
Condition: Memory usage > 85% for 5+ minutes

Immediate Actions

  1. Identify Memory Consumer

    # Check container memory
    docker stats --no-stream

    # Check Node.js heap
    curl -s http://localhost:3000/metrics | grep nodejs_heap
  2. Check for Memory Leaks

    • Review Grafana "Memory Usage" panel trends
    • Look for continuously increasing memory
  3. Force Garbage Collection (temporary)

    # Restart API with GC exposed (if configured)
    # Add to NODE_OPTIONS: --expose-gc
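
To confirm the trend from step 2 without Grafana, sample the default Node.js heap metric a few times. This assumes the prom-client default metrics are enabled, as the /metrics grep above suggests.

# One sample per minute for five minutes; steadily increasing values point to a leak
for i in $(seq 1 5); do
  date +%T
  curl -s http://localhost:3000/metrics | grep '^nodejs_heap_size_used_bytes'
  sleep 60
done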

Investigation

  1. Check for memory leaks in code
  2. Review recent changes to services
  3. Check for large data structures in cache

Resolution

  • Fix memory leaks
  • Increase memory limits
  • Implement pagination for large queries
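
As a stop-gap while a leak is being fixed, the running container's limit can be raised in place (this reverts when the container is recreated; the values are illustrative only):

# Raise the memory limit on the live container
docker update --memory 1g --memory-swap 1g mytelevision-api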

CPU Issues

Alert: HighCPUUsage

Severity: Medium
Condition: CPU usage > 80% for 5+ minutes

Immediate Actions

  1. Identify CPU Consumer

    docker stats --no-stream

    # Check process in container
    docker exec mytelevision-api top -b -n 1
  2. Check for Intensive Operations

    • Review active requests
    • Check for runaway processes
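
For step 2, the request mix and the event loop give a quick read on where CPU is going. The Prometheus query mirrors the one in the error-rate section; the event-loop metric assumes prom-client default metrics are enabled.

# Busiest endpoints by request rate
curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum(rate(mytelevision_http_requests_total[5m]))by(path))' | jq

# Event-loop lag (sustained high values mean the process is CPU-bound)
curl -s http://localhost:3000/metrics | grep nodejs_eventloop_lag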

Investigation

  1. Profile CPU-intensive endpoints
  2. Check for infinite loops
  3. Review background job activity

Resolution

  • Optimize CPU-intensive code
  • Add rate limiting
  • Scale horizontally

Disk Space

Alert: DiskSpaceCritical

Severity: Critical
Condition: Disk usage > 95%

Immediate Actions

  1. Identify Space Usage

    df -h
    du -sh /var/lib/docker/*
  2. Clean Docker Resources

    # Remove unused images
    docker image prune -a

    # Remove unused volumes (caution: deletes any volume not referenced by at least one
    # container; double-check before pruning on a host that keeps database data in volumes)
    docker volume prune

    # Remove remaining unused data (stopped containers, networks, dangling images, build cache;
    # --volumes also prunes unused volumes)
    docker system prune --volumes
  3. Clean Application Logs

    # Truncate large log files
    truncate -s 0 /var/log/mytelevision/*.log

Investigation

  1. Check Prometheus data retention
  2. Review PostgreSQL WAL size
  3. Check for large temp files
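
Quick measurements for these checks:

# Per-volume disk usage (Prometheus and PostgreSQL data usually live here)
docker system df -v

# PostgreSQL WAL size inside the data directory
docker exec mytelevision-postgres du -sh /var/lib/postgresql/data/pg_wal

# Largest directories on the filesystem holding /var (includes Docker data and temp files)
du -xh /var 2>/dev/null | sort -rh | head -20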

Resolution

  • Reduce Prometheus retention
  • Archive old data
  • Add disk space

Security Incident

Alert: PossibleBruteForce

Severity: High
Condition: > 10 failed logins per minute

Immediate Actions

  1. Identify Attack Source

    # Top offending IPs from failed-login entries
    # (the awk field assumes the IP is the last token on the line; adjust to your log format)
    docker logs mytelevision-api --since 30m 2>&1 | grep "failed login" | awk '{print $NF}' | sort | uniq -c | sort -rn
  2. Block Suspicious IPs (if needed)

    # Add to firewall or reverse proxy
    iptables -A INPUT -s <IP> -j DROP
  3. Enable Enhanced Rate Limiting

    • Temporarily reduce rate limits
    • Enable CAPTCHA if available

Investigation

  1. Identify targeted accounts
  2. Check for successful breaches
  3. Review access patterns
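
A sketch for steps 1 and 2; the grep patterns and awk field positions are assumptions about the log format and will need adapting.

# Accounts targeted most often in the window
docker logs mytelevision-api --since 30m 2>&1 | grep -i "failed login" | \
  awk '{print $(NF-1)}' | sort | uniq -c | sort -rn | head

# Did the attacking IP ever log in successfully?
docker logs mytelevision-api --since 30m 2>&1 | grep "<IP>" | grep -i "login success"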

Post-Incident

  1. Reset compromised passwords
  2. Review and update rate limits
  3. Document incident

Payment Issues

Alert: PaymentFailuresHigh

Severity: High
Condition: Payment failure rate > 5%

Immediate Actions

  1. Check Payment Provider Status

    • Visit provider status page
    • Check provider dashboard
  2. Review Error Messages

    docker logs mytelevision-api --since 30m 2>&1 | grep -i "payment\|stripe\|failed"
  3. Verify API Keys

    • Check payment provider API key validity
    • Verify webhook endpoints
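
If the provider is Stripe (as the log filter above suggests), a minimal credential check against the Stripe API. It assumes the secret key is available as STRIPE_SECRET_KEY.

# A valid key returns an object of type "balance"; an invalid key returns an authentication error
curl -s https://api.stripe.com/v1/balance -u "${STRIPE_SECRET_KEY}:" | jq '.object'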

Investigation

  1. Identify failure patterns (card type, region, amount)
  2. Check for provider rate limits
  3. Review recent payment code changes

Resolution

  • Contact payment provider if needed
  • Implement retry logic
  • Add fallback payment methods

Post-Incident Procedures

For All Incidents

  1. Document the Incident

    • Start time and duration
    • Impact (users affected, revenue impact)
    • Root cause
    • Resolution steps
  2. Create Post-Mortem

    • Timeline of events
    • What went wrong
    • What went well
    • Action items
  3. Update Runbooks

    • Add new procedures discovered
    • Update alert thresholds if needed
    • Improve automation

Stakeholder Communication

During Incident:

  • Update status page
  • Notify customer support
  • Send internal updates every 30 minutes

After Incident:

  • Send resolution notice
  • Schedule post-mortem meeting
  • Share learnings with team

Emergency Contacts

Role | Contact | Escalation Time
---|---|---
On-Call Engineer | [email protected] | Immediate
Backend Lead | [email protected] | 15 minutes
DevOps Lead | [email protected] | 15 minutes
CTO | [email protected] | 30 minutes