Incident Response Playbook - MyTelevision API
This playbook provides step-by-step procedures for responding to common incidents detected by the monitoring system.
Incident Severity Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical | 15 minutes | API down, database down, security breach |
| High | 1 hour | High error rate, payment failures |
| Medium | 4 hours | High latency, memory warnings |
| Low | 24 hours | Non-critical alerts, performance degradation |
Quick Reference
| Alert | Playbook Section |
|---|---|
| APIDown | API Down |
| PostgreSQLDown | Database Down |
| RedisDown | Redis Down |
| HighErrorRate | High Error Rate |
| HighResponseTime | High Latency |
| HighMemoryUsage | Memory Issues |
| HighCPUUsage | CPU Issues |
| DiskSpaceCritical | Disk Space |
| PossibleBruteForce | Security Incident |
| PaymentFailuresHigh | Payment Issues |
API Down
Alert: APIDown
Severity: Critical
Condition: API not responding for 1+ minute
Immediate Actions
1. **Verify the Alert**

   ```bash
   # Check API health
   curl -s http://localhost:3000/api/v2/health | jq

   # Check if container is running
   docker ps | grep mytelevision-api

   # Check container logs
   docker logs mytelevision-api --tail 100
   ```

2. **Check Resource Constraints**

   ```bash
   # Check container resource usage
   docker stats mytelevision-api --no-stream

   # Check host resources
   free -m
   df -h
   ```

3. **Restart if Necessary**

   ```bash
   # Soft restart
   docker-compose restart api

   # Hard restart (last resort): recreate the API container
   docker-compose up -d --force-recreate api
   ```
Investigation
1. Check application logs for errors:

   ```bash
   docker logs mytelevision-api --since 10m 2>&1 | grep -i error
   ```

2. Check if dependencies are healthy:

   ```bash
   # PostgreSQL
   docker exec mytelevision-postgres pg_isready

   # Redis
   docker exec mytelevision-redis redis-cli ping
   ```

3. Review recent deployments or changes (see the sketch below)
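A quick way to review recent changes is to compare when the running image and container were created, and, if the deployment checkout is available on the host, what was merged recently. A minimal sketch; the repository path is a hypothetical example:

```bash
# When was the currently running image built?
docker inspect --format '{{.Created}}' "$(docker inspect --format '{{.Image}}' mytelevision-api)"

# When was the container itself last (re)created?
docker inspect --format '{{.Created}}' mytelevision-api

# Recent commits, assuming a checkout at /opt/mytelevision (hypothetical path)
git -C /opt/mytelevision log --oneline --since="24 hours ago"
```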
Escalation
If not resolved within 15 minutes:
- Page senior engineer
- Notify stakeholders
- Consider a rollback if a recent deployment is the likely cause (see the sketch below)
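If a recent deployment is the likely cause, rolling back to the previous known-good image is usually the fastest path. A minimal sketch, assuming images are published with version tags and the compose file reads the tag from an `API_IMAGE_TAG` variable; both the variable name and the tag below are hypothetical:

```bash
# Pin the previous known-good tag (hypothetical variable and tag)
export API_IMAGE_TAG=v1.4.2

# Pull and recreate only the API service
docker-compose pull api
docker-compose up -d --force-recreate api

# Confirm the API is healthy again
curl -s http://localhost:3000/api/v2/health | jq
```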
Database Down
Alert: PostgreSQLDown
Severity: Critical
Condition: PostgreSQL not responding for 1+ minute
Immediate Actions
1. **Verify the Alert**

   ```bash
   # Check PostgreSQL status
   docker exec mytelevision-postgres pg_isready -U mytelevision

   # Check container status
   docker ps | grep postgres

   # Check logs
   docker logs mytelevision-postgres --tail 100
   ```

2. **Check Connection Pool**

   ```bash
   # Check active connections
   docker exec mytelevision-postgres psql -U mytelevision -c \
     "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

   # Check for blocked queries
   docker exec mytelevision-postgres psql -U mytelevision -c \
     "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
      FROM pg_stat_activity WHERE state = 'active'
      ORDER BY duration DESC LIMIT 10;"
   ```

3. **Terminate Blocking Queries (if necessary)**

   ```bash
   # Kill long-running queries (> 5 minutes)
   docker exec mytelevision-postgres psql -U mytelevision -c \
     "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
      WHERE state = 'active' AND query_start < now() - interval '5 minutes';"
   ```
Investigation
1. Check disk space:

   ```bash
   docker exec mytelevision-postgres df -h /var/lib/postgresql/data
   ```

2. Check for replication lag, if applicable (see the query sketch below)

3. Review recent schema changes or migrations
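Replication lag can be checked with the standard PostgreSQL monitoring views; this is only meaningful if a replica is configured. A minimal sketch (the `*_lag` columns require PostgreSQL 10+):

```bash
# On a replica: approximate lag (returns NULL on a primary)
docker exec mytelevision-postgres psql -U mytelevision -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

# On the primary: connected replicas and their reported lag
docker exec mytelevision-postgres psql -U mytelevision -c \
  "SELECT client_addr, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
```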
Recovery
If the database needs a restart:

```bash
# Graceful restart
docker-compose restart postgres

# Wait for recovery
sleep 30

# Verify and run migrations if needed
npm run prisma:migrate:deploy
```
Redis Down
Alert: RedisDown
Severity: Critical
Condition: Redis not responding for 2+ minutes
Immediate Actions
1. **Verify the Alert**

   ```bash
   # Check Redis status
   docker exec mytelevision-redis redis-cli ping

   # Check container
   docker ps | grep redis

   # Check logs
   docker logs mytelevision-redis --tail 50
   ```

2. **Check Memory**

   ```bash
   docker exec mytelevision-redis redis-cli info memory
   ```

3. **Restart if Necessary**

   ```bash
   docker-compose restart redis
   ```
Impact Assessment
Redis being down affects:
- Session storage
- Rate limiting
- Caching (degraded performance)
The application should fall back gracefully, but verify:

```bash
# Check the API is still responding
curl -s http://localhost:3000/api/v2/health
```
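Beyond a single health check, confirm that latency and status codes stay acceptable while the cache is down. A minimal sketch that probes the health endpoint a few times and prints status and timing; the loop itself is purely illustrative:

```bash
# Probe the API 10 times, printing HTTP status and total time per request
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "attempt $i: status=%{http_code} time=%{time_total}s\n" \
    http://localhost:3000/api/v2/health
  sleep 1
done
```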
High Error Rate
Alert: HighErrorRate / HighCriticalErrorRate
Severity: High/Critical
Condition: Error rate > 1% (high) or > 5% (critical) for 5+ minutes
Immediate Actions
1. **Identify Error Pattern**

   ```bash
   # Check Grafana dashboard
   # Open: http://localhost:3001/d/api-overview

   # Or query Prometheus directly
   curl -s 'http://localhost:9090/api/v1/query?query=sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))' | jq
   ```

2. **Check Application Logs**

   ```bash
   docker logs mytelevision-api --since 10m 2>&1 | grep -i "error\|exception\|failed"
   ```

3. **Identify Affected Endpoints**

   ```bash
   # From Prometheus
   curl -s 'http://localhost:9090/api/v1/query?query=topk(10,sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))by(path))' | jq
   ```
Investigation
- Check if specific endpoint is failing
- Review recent deployments
- Check external dependencies (TMDB, Firebase, etc.); a reachability sketch follows this list
- Verify database connectivity
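When an external provider is suspected, a quick reachability check narrows things down. A minimal sketch: the TMDB endpoint is public (a 401 without an API key still proves reachability), while the Firebase URL is only a placeholder for whichever Firebase service the API actually calls:

```bash
# TMDB reachability (401 without an API key is expected and fine)
curl -s -o /dev/null -w "TMDB: %{http_code}\n" https://api.themoviedb.org/3/configuration

# Firebase reachability (placeholder URL; substitute the project's real endpoint)
curl -s -o /dev/null -w "Firebase: %{http_code}\n" https://firebase.googleapis.com/

# DNS resolution from inside the API container (Node.js is available in the API image)
docker exec mytelevision-api node -e "require('dns').lookup('api.themoviedb.org', (e, a) => console.log(e || a))"
```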
Resolution
Depending on the cause:
- Bad deployment: Rollback
- External service: Implement fallback or retry
- Database issue: See Database Down
High Latency
Alert: HighResponseTime
Severity: Medium
Condition: P95 latency > 500ms for 5+ minutes
Immediate Actions
1. **Identify Slow Endpoints**

   ```bash
   # Check Grafana - look at the "Response Time Percentiles" panel
   # Open: http://localhost:3001/d/api-overview
   ```

2. **Check Database Performance**

   ```bash
   # Slow queries (requires the pg_stat_statements extension;
   # on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
   docker exec mytelevision-postgres psql -U mytelevision -c \
     "SELECT query, calls, mean_time, total_time
      FROM pg_stat_statements
      ORDER BY mean_time DESC LIMIT 10;"
   ```

3. **Check Cache Hit Rate**

   ```bash
   docker exec mytelevision-redis redis-cli info stats | grep hit
   ```
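The `keyspace_hits` and `keyspace_misses` counters from `INFO stats` can be reduced to a single hit-rate figure; a minimal sketch:

```bash
# Compute the Redis cache hit rate (%) since the last restart
docker exec mytelevision-redis redis-cli info stats | awk -F: '
  /keyspace_hits/   {h=$2}
  /keyspace_misses/ {m=$2}
  END { if (h+m > 0) printf "hit rate: %.1f%%\n", h*100/(h+m); else print "no keyspace lookups yet" }'
```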
Investigation
- Profile slow endpoints
- Check for N+1 query issues
- Verify indexes are being used (see the EXPLAIN sketch below)
- Check Redis cache effectiveness
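To verify index usage for a specific slow query, run it through `EXPLAIN (ANALYZE, BUFFERS)` and look for sequential scans on large tables. A minimal sketch; the table and filter below are hypothetical examples to be replaced with the actual slow query:

```bash
# Execution plan for a suspect query (example table and column names)
docker exec mytelevision-postgres psql -U mytelevision -c \
  "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = 'someone@example.com';"

# Tables with the most sequential scans (often a sign of a missing index)
docker exec mytelevision-postgres psql -U mytelevision -c \
  "SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;"
```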
Resolution
- Add missing indexes
- Optimize queries
- Implement caching
- Scale horizontally if needed
Memory Issues
Alert: HighMemoryUsage
Severity: Medium
Condition: Memory usage > 85% for 5+ minutes
Immediate Actions
1. **Identify Memory Consumer**

   ```bash
   # Check container memory
   docker stats --no-stream

   # Check Node.js heap
   curl -s http://localhost:3000/metrics | grep nodejs_heap
   ```

2. **Check for Memory Leaks**

   - Review Grafana "Memory Usage" panel trends
   - Look for continuously increasing memory

3. **Force Garbage Collection (temporary)**

   ```bash
   # Restart API with GC exposed (if configured)
   # Add to NODE_OPTIONS: --expose-gc
   ```
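To tell a real leak from a transient spike, sample the heap metric over several minutes and check whether it only grows. A minimal sketch using the prom-client default gauge `nodejs_heap_size_used_bytes` (assuming default metrics are enabled, as the `grep nodejs_heap` above suggests); interval and duration are arbitrary:

```bash
# Sample used heap every 30 seconds for 5 minutes
for i in $(seq 1 10); do
  heap=$(curl -s http://localhost:3000/metrics | awk '/^nodejs_heap_size_used_bytes/ {print $2; exit}')
  echo "$(date +%H:%M:%S) used_heap_bytes=$heap"
  sleep 30
done
```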
Investigation
- Check for memory leaks in code
- Review recent changes to services
- Check for large data structures in cache
Resolution
- Fix memory leaks
- Increase memory limits
- Implement pagination for large queries
CPU Issues
Alert: HighCPUUsage
Severity: Medium
Condition: CPU usage > 80% for 5+ minutes
Immediate Actions
1. **Identify CPU Consumer**

   ```bash
   docker stats --no-stream

   # Check process in container
   docker exec mytelevision-api top -b -n 1
   ```

2. **Check for Intensive Operations**

   - Review active requests
   - Check for runaway processes (see the Prometheus sketch below)
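If prom-client default metrics are enabled (an assumption, consistent with the heap metrics used above), the API process's CPU consumption is exposed as `process_cpu_seconds_total` and can be queried from Prometheus. A minimal sketch; the query returns one series per scraped job, so filter by the `job` label if needed:

```bash
# Average CPU cores consumed per process over the last 5 minutes (1.0 = one full core)
curl -s 'http://localhost:9090/api/v1/query?query=rate(process_cpu_seconds_total[5m])' | jq '.data.result'
```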
Investigation
- Profile CPU-intensive endpoints
- Check for infinite loops
- Review background job activity
Resolution
- Optimize CPU-intensive code
- Add rate limiting
- Scale horizontally
Disk Space
Alert: DiskSpaceCritical
Severity: Critical
Condition: Disk usage > 95%
Immediate Actions
1. **Identify Space Usage**

   ```bash
   df -h
   du -sh /var/lib/docker/*
   ```

2. **Clean Docker Resources**

   ```bash
   # Remove unused images
   docker image prune -a

   # Remove unused volumes
   docker volume prune

   # Remove stopped containers, unused networks, dangling images, and unused volumes
   docker system prune --volumes
   ```

3. **Clean Application Logs**

   ```bash
   # Truncate large log files
   truncate -s 0 /var/log/mytelevision/*.log
   ```
Investigation
- Check Prometheus data retention
- Review PostgreSQL WAL size (see the sketch below)
- Check for large temp files
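The usual space consumers in this stack are container JSON logs, the Prometheus TSDB, and PostgreSQL WAL. A minimal sketch for locating them; the Prometheus container name and data path are assumptions based on the naming pattern used elsewhere, and `pg_ls_waldir()` needs PostgreSQL 10+ plus sufficient privileges:

```bash
# Largest container log files
du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5

# Prometheus data directory size (container name and path are assumptions)
docker exec mytelevision-prometheus du -sh /prometheus 2>/dev/null

# Total WAL size inside PostgreSQL
docker exec mytelevision-postgres psql -U mytelevision -c \
  "SELECT count(*) AS segments, pg_size_pretty(sum(size)) AS total_size FROM pg_ls_waldir();"
```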
Resolution
- Reduce Prometheus retention
- Archive old data
- Add disk space
Security Incident
Alert: PossibleBruteForce
Severity: High
Condition: > 10 failed logins per minute
Immediate Actions
1. **Identify Attack Source**

   ```bash
   # Check failed login IPs from logs
   docker logs mytelevision-api --since 30m 2>&1 | grep "failed login" | awk '{print $NF}' | sort | uniq -c | sort -rn
   ```

2. **Block Suspicious IPs (if needed)**

   ```bash
   # Add to firewall or reverse proxy
   iptables -A INPUT -s <IP> -j DROP
   ```

   To block every offender at once, see the sketch after this list.

3. **Enable Enhanced Rate Limiting**

   - Temporarily reduce rate limits
   - Enable CAPTCHA if available
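Steps 1 and 2 can be combined to block every source IP above a failure threshold in one pass. A minimal sketch, assuming the last field of the log line is the client IP (as in step 1); the threshold of 20 failures is arbitrary:

```bash
# Block every IP with more than 20 failed logins in the last 30 minutes
docker logs mytelevision-api --since 30m 2>&1 \
  | grep "failed login" \
  | awk '{print $NF}' | sort | uniq -c \
  | awk '$1 > 20 {print $2}' \
  | while read -r ip; do
      echo "Blocking $ip"
      iptables -A INPUT -s "$ip" -j DROP
    done
```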
Investigation
- Identify targeted accounts
- Check for successful breaches
- Review access patterns
Post-Incident
- Reset compromised passwords
- Review and update rate limits
- Document incident
Payment Issues
Alert: PaymentFailuresHigh
Severity: High
Condition: Payment failure rate > 5%
Immediate Actions
1. **Check Payment Provider Status**

   - Visit provider status page
   - Check provider dashboard

2. **Review Error Messages**

   ```bash
   docker logs mytelevision-api --since 30m 2>&1 | grep -i "payment\|stripe\|failed"
   ```

3. **Verify API Keys**

   - Check payment provider API key validity (see the sketch below)
   - Verify webhook endpoints
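If the provider is Stripe (an assumption; the log filter above greps for `stripe`), the secret key can be sanity-checked with a read-only call such as retrieving the account balance. The `STRIPE_SECRET_KEY` variable name is an assumption:

```bash
# Verify the Stripe secret key with a read-only API call
# 200 = key is valid; 401 = key is invalid or revoked
curl -s -o /dev/null -w "stripe /v1/balance: %{http_code}\n" \
  https://api.stripe.com/v1/balance -u "${STRIPE_SECRET_KEY}:"
```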
Investigation
- Identify failure patterns (card type, region, amount)
- Check for provider rate limits
- Review recent payment code changes
Resolution
- Contact payment provider if needed
- Implement retry logic
- Add fallback payment methods
Post-Incident Procedures
For All Incidents
1. **Document the Incident**

   - Start time and duration
   - Impact (users affected, revenue impact)
   - Root cause
   - Resolution steps

2. **Create Post-Mortem**

   - Timeline of events
   - What went wrong
   - What went well
   - Action items

3. **Update Runbooks**

   - Add new procedures discovered
   - Update alert thresholds if needed
   - Improve automation
Stakeholder Communication
During Incident:
- Update status page
- Notify customer support
- Send internal updates every 30 minutes
After Incident:
- Send resolution notice
- Schedule post-mortem meeting
- Share learnings with team
Emergency Contacts
| Role | Contact | Escalation Time |
|---|---|---|
| On-Call Engineer | [email protected] | Immediate |
| Backend Lead | [email protected] | 15 minutes |
| DevOps Lead | [email protected] | 15 minutes |
| CTO | [email protected] | 30 minutes |
Useful Links
- Grafana: http://localhost:3001
- Prometheus: http://localhost:9090
- Alertmanager: http://localhost:9093
- API Docs: http://localhost:3000/api/docs
- Status Page: (configure your status page URL)