
Monitoring Stack

Complete monitoring stack with Prometheus, Grafana, and Alertmanager for the MyTelevision API.

Architecture

```
                 ┌─────────────────┐
                 │     Grafana     │
                 │   (Port 3001)   │
                 └────────┬────────┘
                          │
                 ┌────────▼────────┐
                 │   Prometheus    │
                 │   (Port 9090)   │
                 └────────┬────────┘
                          │
     ┌──────────────┬─────┴────────┬───────────────┐
     │              │              │               │
┌────▼──────┐ ┌─────▼───────┐ ┌────▼────────┐ ┌────▼─────────┐
│  API App  │ │Node Exporter│ │ PG Exporter │ │Redis Exporter│
│(Port 3000)│ │ (Port 9100) │ │ (Port 9187) │ │ (Port 9121)  │
└───────────┘ └─────────────┘ └──────┬──────┘ └──────┬───────┘
                                     │               │
                              ┌──────▼──────┐ ┌──────▼──────┐
                              │ PostgreSQL  │ │    Redis    │
                              │ (Port 5432) │ │ (Port 6379) │
                              └─────────────┘ └─────────────┘
```

Quick Start

Start the monitoring stack (development)

```bash
# Start the base services (PostgreSQL, Redis, API)
docker compose up -d

# Start the monitoring stack
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
```

Accessing the dashboards

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3001 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| API Metrics | http://localhost:3000/metrics | - |

Components

1. Prometheus (metrics collection)

Location: monitoring/prometheus/

Configuration files:

  • prometheus.yml - scrape target configuration
  • rules/alerts.yml - alerting rules

Scrape targets:

| Target | Port | Metrics |
|---|---|---|
| mytelevision-api | 3000 | HTTP requests, business metrics |
| node-exporter | 9100 | Host CPU, memory, disk, network |
| postgres-exporter | 9187 | PostgreSQL connections, queries, cache |
| redis-exporter | 9121 | Redis memory, commands, connections |
| cadvisor | 8080 | Container CPU, memory, network |

Configuration:

```yaml
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'mytelevision-api'
    static_configs:
      - targets: ['api:3000']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```

2. Grafana (visualization)

Location: monitoring/grafana/

Pre-configured dashboards:

| Dashboard | UID | Description |
|---|---|---|
| API Overview | api-overview | Request rate, latency, errors, memory |
| Auth & Sessions | auth-sessions | Logins, sessions, security events |
| Database | database-perf | PostgreSQL performance, queries |
| Business Metrics | business-metrics | Users, payments, engagement |
| Infrastructure | infrastructure | System resources, containers |

Datasources:

  • Prometheus (default)
  • PostgreSQL (direct queries)
  • Redis (cache metrics)
  • Alertmanager (alerts)
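
These datasources are typically loaded via Grafana's file-based provisioning; a minimal sketch for the Prometheus entry (the file path and datasource names here are assumptions, not taken from this repo):

```yaml
# monitoring/grafana/provisioning/datasources/datasources.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```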

3. Alertmanager (alert routing)

Location: monitoring/alertmanager/

Alert receivers:

| Receiver | Channel | Alerts |
|---|---|---|
| critical-alerts | Email + Slack | APIDown, PostgreSQLDown, RedisDown |
| warning-alerts | Email + Slack | High latency, high error rate |
| database-team | Email + Slack | Database-specific alerts |
| security-team | Email + Slack | Auth/security alerts |
| payments-team | Email + Slack | Payment failures |

Configuration:

```yaml
# monitoring/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'

  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
```
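
The Slack side of the receivers would be wired in with `slack_configs`; a hedged sketch (the channel name is an assumption, and the webhook URL would come from `SLACK_WEBHOOK_URL`):

```yaml
- name: 'critical-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'  # SLACK_WEBHOOK_URL
      channel: '#alerts-critical'  # hypothetical channel name
      send_resolved: true
  email_configs:
    - to: '[email protected]'
```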

4. Exporters

Node Exporter (Host Metrics)

  • CPU usage by mode
  • Memory usage and availability
  • Disk I/O and space
  • Network traffic

PostgreSQL Exporter

  • Active connections
  • Transaction rates
  • Cache hit ratio
  • Table statistics

Redis Exporter

  • Memory usage
  • Connected clients
  • Commands per second
  • Key statistics

cAdvisor (Container Metrics)

  • Container CPU usage
  • Container memory usage
  • Container network I/O

Application metrics

The NestJS application exposes custom Prometheus metrics at /metrics:

HTTP Metrics

mytelevision_http_requests_total{method, path, status_code}
mytelevision_http_request_duration_seconds{method, path, status_code}
mytelevision_http_requests_in_flight
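
These counters and histograms combine into the usual rate and latency queries; for example, in PromQL (the `5..` status-code matcher assumes values like `500`):

```promql
# Request rate per second over 5 minutes
sum(rate(mytelevision_http_requests_total[5m]))

# Error ratio (5xx responses)
sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(mytelevision_http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95,
  sum(rate(mytelevision_http_request_duration_seconds_bucket[5m])) by (le))
```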

Authentication Metrics

mytelevision_auth_login_total{provider, status}
mytelevision_auth_failed_login_total{reason}
mytelevision_auth_active_sessions_total
mytelevision_auth_token_refresh_total
mytelevision_auth_token_expired_total
mytelevision_rate_limit_hits_total{endpoint}

Business Metrics

mytelevision_content_views_total{content_type, access_type}
mytelevision_active_streams_total{content_type}
mytelevision_reactions_total{reaction_type, content_type}
mytelevision_favorites_added_total{content_type}
mytelevision_favorites_removed_total{content_type}
mytelevision_registrations_total{source}
mytelevision_active_users_total{access_type}
mytelevision_premium_subscribers_total
mytelevision_payment_transactions_total{status, payment_method}
mytelevision_payment_amount_total{status}
mytelevision_subscription_created_total{plan_type}

Profile & Account Metrics

mytelevision_profiles_created_total{profile_type}
mytelevision_profile_switches_total
mytelevision_device_registrations_total{device_type}
mytelevision_risk_events_total{risk_type, action}

Alert rules

Critical alerts (immediate)

| Alert | Condition | Action |
|---|---|---|
| APIDown | API not responding for 1m | Page on-call |
| PostgreSQLDown | Database not responding for 1m | Page DBA + on-call |
| RedisDown | Cache not responding for 2m | Page on-call |
| HighCriticalErrorRate | Error rate > 5% for 2m | Page on-call |
| DiskSpaceCritical | Disk usage > 95% | Page on-call |
```yaml
# Examples of critical rules
- alert: APIDown
  expr: up{job="mytelevision-api"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: 'API is down'

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
```
Warning alerts (repeated every 4h)

| Alert | Condition | Action |
|---|---|---|
| HighErrorRate | Error rate > 1% for 5m | Notify team |
| HighResponseTime | P95 latency > 500ms for 5m | Notify team |
| HighMemoryUsage | Memory > 85% for 5m | Notify team |
| HighCPUUsage | CPU > 80% for 5m | Notify team |
| DatabaseConnectionsHigh | DB connections > 80% | Notify DBA |
```yaml
# Examples of warning rules
- alert: HighLatency
  expr: histogram_quantile(0.95, http_request_duration_seconds_bucket) > 1
  for: 10m
  labels:
    severity: warning

- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 > 1024
  for: 15m
  labels:
    severity: warning
```

Security alerts

| Alert | Condition | Action |
|---|---|---|
| PossibleBruteForce | > 10 failed logins/min | Notify security |
| HighTokenExpiration | > 50 expired tokens in 5m | Investigate |
| PaymentFailuresHigh | Payment failure rate > 5% | Notify payments |
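
A rule matching the PossibleBruteForce condition above could look like this (a sketch; the exact `for` duration and team label are assumptions):

```yaml
- alert: PossibleBruteForce
  expr: sum(rate(mytelevision_auth_failed_login_total[1m])) * 60 > 10
  for: 2m
  labels:
    severity: warning
    team: security
  annotations:
    summary: 'More than 10 failed logins per minute'
```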

Using the MetricsService in code

```typescript
import { Injectable } from '@nestjs/common';
import { MetricsService } from '@infrastructure/metrics';

@Injectable()
export class MyService {
  constructor(private readonly metricsService: MetricsService) {}

  async processPayment(amount: number) {
    try {
      // ... payment logic
      this.metricsService.recordPayment('success', 'card', amount);
    } catch (error) {
      this.metricsService.recordPayment('failed', 'card', amount);
      throw error;
    }
  }

  async trackContentView(contentType: string, isPremium: boolean) {
    this.metricsService.recordContentView(
      contentType,
      isPremium ? 'premium' : 'free',
    );
  }

  async trackLogin(provider: string, success: boolean) {
    this.metricsService.recordLogin(provider, success);
  }
}
```
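
Such a service is usually a thin wrapper around prom-client counters. Purely as an illustration of the method shape, here is a minimal in-memory stand-in (the real MetricsService in this codebase may be implemented differently):

```typescript
// Minimal stand-in for MetricsService: counts events in memory,
// keyed the way Prometheus labels would key a counter series.
// A real implementation would delegate to prom-client Counter.inc().
class InMemoryMetrics {
  private counts = new Map<string, number>();

  private inc(name: string, labels: Record<string, string>, value = 1): void {
    // Serialize name + sorted labels into a stable series key.
    const labelKey = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',');
    const key = `${name}{${labelKey}}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + value);
  }

  recordPayment(status: string, paymentMethod: string, amount: number): void {
    this.inc('mytelevision_payment_transactions_total', {
      status,
      payment_method: paymentMethod,
    });
    this.inc('mytelevision_payment_amount_total', { status }, amount);
  }

  recordLogin(provider: string, success: boolean): void {
    this.inc('mytelevision_auth_login_total', {
      provider,
      status: success ? 'success' : 'failure',
    });
  }

  get(key: string): number {
    return this.counts.get(key) ?? 0;
  }
}

const metrics = new InMemoryMetrics();
metrics.recordPayment('success', 'card', 9.99);
console.log(
  metrics.get(
    'mytelevision_payment_transactions_total{payment_method="card",status="success"}',
  ),
); // → 1
```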

Environment variables

Grafana

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=your_secure_password
GRAFANA_ROOT_URL=http://localhost:3001
```

Alertmanager (Email)

```bash
SMTP_HOST=smtp.example.com:587
[email protected]
SMTP_USER=your_user
SMTP_PASSWORD=your_password
```
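
These values typically feed the Alertmanager `global` block (a sketch; how the variables are substituted into the config file depends on how this setup templates it):

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'  # SMTP_HOST
  smtp_from: '[email protected]'      # SMTP_FROM
  smtp_auth_username: 'your_user'         # SMTP_USER
  smtp_auth_password: 'your_password'     # SMTP_PASSWORD
```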

Alertmanager (Slack)

```bash
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
```

Alert Recipients

Production considerations

Retention & storage

  • Prometheus data retention: 30 days (configurable in docker-compose.monitoring.yml)
  • Consider remote storage (Thanos, Cortex) for long-term retention

Resource limits (excerpt from docker-compose.monitoring.yml):

```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G
```

Security

  1. Change Grafana's default credentials
  2. Enable HTTPS for all services
  3. Restrict network access to the monitoring ports
  4. Use secrets for sensitive configuration

High availability

For production, consider:

  • Prometheus with remote write to long-term storage
  • Grafana backed by an external PostgreSQL database
  • An Alertmanager cluster for HA
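
The remote-write option in the first bullet is a small addition to the Prometheus config (the endpoint URL below is a placeholder, not from this repo):

```yaml
# monitoring/prometheus/prometheus.yml
remote_write:
  - url: 'https://thanos-receive.example.com/api/v1/receive'  # placeholder endpoint
```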

Troubleshooting

Prometheus is not scraping

  1. Check target health: http://localhost:9090/targets
  2. Check network connectivity between containers
  3. Check firewall rules

Grafana dashboard does not load

  1. Check the Prometheus datasource: Settings > Data Sources
  2. Check that Prometheus is running
  3. Check that the dashboard JSON is valid

Alerts are not firing

  1. Check the alert rules: http://localhost:9090/alerts
  2. Check the Alertmanager config: http://localhost:9093/#/status
  3. Check the SMTP/Slack webhook configuration

High-cardinality issues

If Prometheus uses too much memory:

  1. Review metric labels (avoid high-cardinality labels)
  2. Increase the scrape interval
  3. Use recording rules for frequently queried metrics
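
A recording rule for the P95 latency query, as suggested in step 3 (the file path and rule name follow common conventions and are assumptions, not from this repo):

```yaml
# monitoring/prometheus/rules/recording.yml (hypothetical file)
groups:
  - name: api-latency
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: >
          histogram_quantile(0.95,
            sum(rate(mytelevision_http_request_duration_seconds_bucket[5m])) by (le))
```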

Related documentation