Monitoring Stack
A complete monitoring stack for the MyTelevision API, built on Prometheus, Grafana, and Alertmanager.
Architecture
┌─────────────┐
│   Grafana   │
│ (Port 3001) │
└──────┬──────┘
       │
┌──────▼──────┐
│ Prometheus  │
│ (Port 9090) │
└──────┬──────┘
       │
       ├─────────────────┬──────────────────┬───────────────────┐
       │                 │                  │                   │
┌──────▼──────┐  ┌───────▼───────┐  ┌───────▼───────┐  ┌────────▼───────┐
│   API App   │  │ Node Exporter │  │  PG Exporter  │  │ Redis Exporter │
│ (Port 3000) │  │  (Port 9100)  │  │  (Port 9187)  │  │  (Port 9121)   │
└─────────────┘  └───────────────┘  └───────┬───────┘  └────────┬───────┘
                                            │                   │
                                     ┌──────▼──────┐     ┌──────▼──────┐
                                     │ PostgreSQL  │     │    Redis    │
                                     │ (Port 5432) │     │ (Port 6379) │
                                     └─────────────┘     └─────────────┘
Quick Start
Start the monitoring stack (development)
# Start the base services (PostgreSQL, Redis, API)
docker compose up -d
# Start the monitoring stack
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
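Once the containers are up, a quick sanity check (the ports and the /metrics path are the ones documented below):
# check that all monitoring containers are running
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps
# check that the API exposes Prometheus metrics
curl -s http://localhost:3000/metrics | head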
Dashboard access
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3001 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| API Metrics | http://localhost:3000/metrics | - |
Components
1. Prometheus (metrics collection)
Location: monitoring/prometheus/
Configuration files:
- prometheus.yml - scrape target configuration
- rules/alerts.yml - alert rules
Scrape targets:
| Target | Port | Metrics |
|---|---|---|
| mytelevision-api | 3000 | HTTP requests, business metrics |
| node-exporter | 9100 | Host CPU, memory, disk, network |
| postgres-exporter | 9187 | PostgreSQL connections, queries, cache |
| redis-exporter | 9121 | Redis memory, commands, connections |
| cadvisor | 8080 | Container CPU, memory, network |
Configuration:
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'mytelevision-api'
    static_configs:
      - targets: ['api:3000']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
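For the rules in rules/alerts.yml to be evaluated and delivered to Alertmanager, prometheus.yml also needs rule_files and alerting sections; a minimal sketch, assuming the Alertmanager container is reachable as alertmanager:9093:
# assumed additions to monitoring/prometheus/prometheus.yml
rule_files:
  - 'rules/alerts.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']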
2. Grafana (visualization)
Location: monitoring/grafana/
Pre-configured dashboards:
| Dashboard | UID | Description |
|---|---|---|
| API Overview | api-overview | Request rate, latency, errors, memory |
| Auth & Sessions | auth-sessions | Logins, sessions, security events |
| Database | database-perf | PostgreSQL performance, queries |
| Business Metrics | business-metrics | Users, payments, engagement |
| Infrastructure | infrastructure | System resources, containers |
Datasources:
- Prometheus (default)
- PostgreSQL (direct queries)
- Redis (cache metrics)
- Alertmanager (alerts)
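Datasources like these are usually provisioned from files rather than configured by hand; a minimal sketch for the Prometheus datasource, assuming a standard provisioning layout under monitoring/grafana/ (the exact path is an assumption):
# sketch: monitoring/grafana/provisioning/datasources/prometheus.yml (path assumed)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true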
3. Alertmanager (alert routing)
Location: monitoring/alertmanager/
Alert receivers:
| Receiver | Channel | Alerts |
|---|---|---|
| critical-alerts | Email + Slack | APIDown, PostgreSQLDown, RedisDown |
| warning-alerts | Email + Slack | High latency, high error rate |
| database-team | Email + Slack | Database-specific alerts |
| security-team | Email + Slack | Auth/security alerts |
| payments-team | Email + Slack | Payment failures |
Configuration:
# monitoring/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
4. Exporters
Node Exporter (Host Metrics)
- CPU usage by mode
- Memory usage and availability
- Disk I/O and space
- Network traffic
PostgreSQL Exporter
- Active connections
- Transaction rates
- Cache hit ratio
- Table statistics
Redis Exporter
- Memory usage
- Connected clients
- Commands per second
- Key statistics
cAdvisor (Container Metrics)
- Container CPU usage
- Container memory usage
- Container network I/O
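For reference, the database and cache exporters are typically wired into docker-compose.monitoring.yml roughly as follows (the images are the commonly used ones; credentials and options are assumptions):
# sketch: exporter services in docker-compose.monitoring.yml (credentials assumed)
postgres-exporter:
  image: prometheuscommunity/postgres-exporter
  environment:
    DATA_SOURCE_NAME: 'postgresql://monitoring:change_me@postgres:5432/mytelevision?sslmode=disable'
  ports:
    - '9187:9187'
redis-exporter:
  image: oliver006/redis_exporter
  environment:
    REDIS_ADDR: 'redis://redis:6379'
  ports:
    - '9121:9121'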
Application metrics
The NestJS application exposes custom Prometheus metrics at /metrics:
HTTP Metrics
mytelevision_http_requests_total{method, path, status_code}
mytelevision_http_request_duration_seconds{method, path, status_code}
mytelevision_http_requests_in_flight
Authentication Metrics
mytelevision_auth_login_total{provider, status}
mytelevision_auth_failed_login_total{reason}
mytelevision_auth_active_sessions_total
mytelevision_auth_token_refresh_total
mytelevision_auth_token_expired_total
mytelevision_rate_limit_hits_total{endpoint}
Business Metrics
mytelevision_content_views_total{content_type, access_type}
mytelevision_active_streams_total{content_type}
mytelevision_reactions_total{reaction_type, content_type}
mytelevision_favorites_added_total{content_type}
mytelevision_favorites_removed_total{content_type}
mytelevision_registrations_total{source}
mytelevision_active_users_total{access_type}
mytelevision_premium_subscribers_total
mytelevision_payment_transactions_total{status, payment_method}
mytelevision_payment_amount_total{status}
mytelevision_subscription_created_total{plan_type}
Profile & Account Metrics
mytelevision_profiles_created_total{profile_type}
mytelevision_profile_switches_total
mytelevision_device_registrations_total{device_type}
mytelevision_risk_events_total{risk_type, action}
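For dashboards and alerting, these counters and histograms are usually queried through rate() and histogram_quantile(); a sketch of recording rules built on the HTTP metrics above (group and rule names are illustrative, not taken from rules/alerts.yml):
# sketch: recording rules over the HTTP metrics (names are illustrative)
groups:
  - name: mytelevision-api-recording
    rules:
      - record: mytelevision:http_error_rate:ratio_5m
        expr: sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(mytelevision_http_requests_total[5m]))
      - record: mytelevision:http_latency_p95:5m
        expr: histogram_quantile(0.95, sum by (le) (rate(mytelevision_http_request_duration_seconds_bucket[5m])))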
Alert rules
Critical alerts (immediate)
| Alert | Condition | Action |
|---|---|---|
| APIDown | API not responding for 1m | Page on-call |
| PostgreSQLDown | Database not responding for 1m | Page DBA + on-call |
| RedisDown | Cache not responding for 2m | Page on-call |
| HighCriticalErrorRate | Error rate > 5% for 2m | Page on-call |
| DiskSpaceCritical | Disk usage > 95% | Page on-call |
# Example critical alert rules
- alert: APIDown
  expr: up{job="mytelevision-api"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: 'API is down'
- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
Warning alerts (repeated every 4h)
| Alert | Condition | Action |
|---|---|---|
| HighErrorRate | Error rate > 1% for 5m | Notify team |
| HighResponseTime | P95 latency > 500ms for 5m | Notify team |
| HighMemoryUsage | Memory > 85% for 5m | Notify team |
| HighCPUUsage | CPU > 80% for 5m | Notify team |
| DatabaseConnectionsHigh | DB connections > 80% | Notify DBA |
# Example warning alert rules
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(mytelevision_http_request_duration_seconds_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning
- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 > 1024
  for: 15m
  labels:
    severity: warning
Security alerts
| Alert | Condition | Action |
|---|---|---|
| PossibleBruteForce | > 10 failed logins/min | Notify security |
| HighTokenExpiration | > 50 expired tokens in 5m | Investigate |
| PaymentFailuresHigh | Payment failure rate > 5% | Notify payments |
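No rule examples are given for these; a hedged sketch of what the brute-force alert could look like, built on the auth metrics listed earlier (the for duration and the team label are assumptions):
# sketch: possible brute-force rule (threshold from the table above)
- alert: PossibleBruteForce
  expr: sum(rate(mytelevision_auth_failed_login_total[5m])) * 60 > 10
  for: 2m
  labels:
    severity: warning
    team: security
  annotations:
    summary: 'More than 10 failed logins per minute'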
Using the MetricsService in application code
import { Injectable } from '@nestjs/common';
import { MetricsService } from '@infrastructure/metrics';

@Injectable()
export class MyService {
  constructor(private readonly metricsService: MetricsService) {}

  async processPayment(amount: number) {
    try {
      // ... payment logic
      this.metricsService.recordPayment('success', 'card', amount);
    } catch (error) {
      this.metricsService.recordPayment('failed', 'card', amount);
      throw error;
    }
  }

  async trackContentView(contentType: string, isPremium: boolean) {
    this.metricsService.recordContentView(
      contentType,
      isPremium ? 'premium' : 'free',
    );
  }

  async trackLogin(provider: string, success: boolean) {
    this.metricsService.recordLogin(provider, success);
  }
}
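The real MetricsService lives in @infrastructure/metrics; as an illustration of how the counters above are typically backed by prom-client, here is a minimal sketch (only the method names come from the usage example, everything else is an assumption):
// sketch: a prom-client backed MetricsService (not the project's actual implementation)
import { Injectable } from '@nestjs/common';
import { Counter, register } from 'prom-client';

@Injectable()
export class MetricsService {
  private readonly paymentTransactions = new Counter({
    name: 'mytelevision_payment_transactions_total',
    help: 'Payment transactions by status and payment method',
    labelNames: ['status', 'payment_method'],
  });
  private readonly paymentAmount = new Counter({
    name: 'mytelevision_payment_amount_total',
    help: 'Total payment amount by status',
    labelNames: ['status'],
  });

  recordPayment(status: string, paymentMethod: string, amount: number): void {
    this.paymentTransactions.inc({ status, payment_method: paymentMethod });
    this.paymentAmount.inc({ status }, amount);
  }

  // serialized by the /metrics endpoint
  getMetrics(): Promise<string> {
    return register.metrics();
  }
}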
Environment variables
Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=your_secure_password
GRAFANA_ROOT_URL=http://localhost:3001
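These values are usually passed through to Grafana's own GF_* variables in docker-compose.monitoring.yml; the mapping below is a sketch, not taken from the compose file:
# sketch: forwarding the values to Grafana's GF_* variables
grafana:
  environment:
    GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER}
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    GF_SERVER_ROOT_URL: ${GRAFANA_ROOT_URL}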
Alertmanager (Email)
SMTP_HOST=smtp.example.com:587
[email protected]
SMTP_USER=your_user
SMTP_PASSWORD=your_password
Alertmanager (Slack)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
Alert Recipients
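The recipient addresses used by the receivers in alertmanager.yml can be parameterized as well; the variable names below are illustrative only, not taken from the repository:
CRITICAL_ALERT_EMAIL=[email protected]
WARNING_ALERT_EMAIL=[email protected]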
Production considerations
Retention & storage
- Prometheus data retention: 30 days, configurable in docker-compose.monitoring.yml (see the sketch below)
- Consider remote storage (Thanos, Cortex) for long-term retention
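The retention period itself is set through Prometheus' storage flag; a sketch of the relevant command entry (the surrounding service definition is assumed):
# sketch: retention flag on the Prometheus service
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'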
Resource limits for the Prometheus container can also be set in docker-compose.monitoring.yml:
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G
Security
- Change the default Grafana credentials
- Enable HTTPS for all services
- Restrict network access to the monitoring ports
- Use secrets for sensitive configuration
High availability
For production, consider:
- Prometheus with remote write to long-term storage (see the sketch below)
- Grafana backed by an external PostgreSQL database
- An Alertmanager cluster for HA
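A remote_write block in prometheus.yml is enough to forward samples to such a backend; the endpoint below is illustrative (a Thanos Receive default), not part of this stack:
# sketch: remote_write to long-term storage (URL is illustrative)
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'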
Troubleshooting
Prometheus is not scraping
- Check target health: http://localhost:9090/targets
- Check network connectivity between containers
- Check firewall rules
A Grafana dashboard does not load
- Check the Prometheus datasource: Settings > Data Sources
- Check that Prometheus is running
- Check that the dashboard JSON is valid
Alerts are not firing
- Check the alert rules: http://localhost:9090/alerts
- Check the Alertmanager config: http://localhost:9093/#/status
- Check the SMTP/Slack webhook configuration
High-cardinality issues
If Prometheus uses too much memory:
- Review metric labels and avoid high-cardinality labels (the query sketch below can help find the worst offenders)
- Increase the scrape interval
- Use recording rules for frequently queried metrics
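To see which metrics contribute the most series, the following PromQL query is a common starting point (run it in the Prometheus UI; it can be expensive on large instances):
topk(10, count by (__name__)({__name__=~".+"}))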