Monitoring Stack
A complete monitoring stack for the MyTelevision API, built on Prometheus, Grafana, and Alertmanager.
Architecture
┌─────────────┐
│   Grafana   │
│ (Port 3001) │
└──────┬──────┘
       │
┌──────▼──────┐
│ Prometheus  │
│ (Port 9090) │
└──────┬──────┘
       │
       ├─────────────────┬──────────────────┬───────────────────┐
       │                 │                  │                   │
┌──────▼──────┐  ┌───────▼───────┐  ┌───────▼───────┐  ┌────────▼───────┐
│   API App   │  │ Node Exporter │  │  PG Exporter  │  │ Redis Exporter │
│ (Port 3000) │  │  (Port 9100)  │  │  (Port 9187)  │  │  (Port 9121)   │
└─────────────┘  └───────────────┘  └───────┬───────┘  └────────┬───────┘
                                            │                   │
                                     ┌──────▼──────┐     ┌──────▼──────┐
                                     │ PostgreSQL  │     │    Redis    │
                                     │ (Port 5432) │     │ (Port 6379) │
                                     └─────────────┘     └─────────────┘
Quick Start
Start the monitoring stack (development)
# Start the base services (PostgreSQL, Redis, API)
docker compose up -d
# Start the monitoring stack
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
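Once the containers are up, a quick sanity check (the ports and the /metrics path are the ones documented below):
# check that all monitoring containers are running
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps
# check that the API exposes Prometheus metrics
curl -s http://localhost:3000/metrics | head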
Dashboard access
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3001 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| API Metrics | http://localhost:3000/metrics | - |
Components
1. Prometheus (metrics collection)
Location: monitoring/prometheus/
Configuration files:
- prometheus.yml - scrape target configuration
- rules/alerts.yml - alert rules
Scrape targets:
| Target | Port | Metrics |
|---|---|---|
| mytelevision-api | 3000 | HTTP requests, business metrics |
| node-exporter | 9100 | Host CPU, memory, disk, network |
| postgres-exporter | 9187 | PostgreSQL connections, queries, cache |
| redis-exporter | 9121 | Redis memory, commands, connections |
| cadvisor | 8080 | Container CPU, memory, network |
Configuration:
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'mytelevision-api'
    static_configs:
      - targets: ['api:3000']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
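For the rules in rules/alerts.yml to be evaluated and delivered to Alertmanager, prometheus.yml also needs rule_files and alerting sections; a minimal sketch, assuming the Alertmanager container is reachable as alertmanager:9093:
# assumed additions to monitoring/prometheus/prometheus.yml
rule_files:
  - 'rules/alerts.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']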
2. Grafana (visualization)
Location: monitoring/grafana/
Pre-configured dashboards:
| Dashboard | UID | Description |
|---|---|---|
| API Overview | api-overview | Request rate, latency, errors, memory |
| Auth & Sessions | auth-sessions | Logins, sessions, security events |
| Database | database-perf | PostgreSQL performance, queries |
| Business Metrics | business-metrics | Users, payments, engagement |
| Infrastructure | infrastructure | System resources, containers |
Datasources:
- Prometheus (default)
- PostgreSQL (direct queries)
- Redis (cache metrics)
- Alertmanager (alerts)
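Datasources like these are usually provisioned from files rather than configured by hand; a minimal sketch for the Prometheus datasource, assuming a standard provisioning layout under monitoring/grafana/ (the exact path is an assumption):
# sketch: monitoring/grafana/provisioning/datasources/prometheus.yml (path assumed)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true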
3. Alertmanager (alert routing)
Location: monitoring/alertmanager/
Alert receivers:
| Receiver | Channel | Alerts |
|---|---|---|
| critical-alerts | Email + Slack | APIDown, PostgreSQLDown, RedisDown |
| warning-alerts | Email + Slack | High latency, high error rate |
| database-team | Email + Slack | Database-specific alerts |
| security-team | Email + Slack | Auth/security alerts |
| payments-team | Email + Slack | Payment failures |
Configuration:
# monitoring/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
4. Exporters
Node Exporter (Host Metrics)
- CPU usage by mode
- Memory usage and availability
- Disk I/O and space
- Network traffic
PostgreSQL Exporter
- Active connections
- Transaction rates
- Cache hit ratio
- Table statistics
Redis Exporter
- Memory usage
- Connected clients
- Commands per second
- Key statistics
cAdvisor (Container Metrics)
- Container CPU usage
- Container memory usage
- Container network I/O
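For reference, the database and cache exporters are typically wired into docker-compose.monitoring.yml roughly as follows (the images are the commonly used ones; credentials and options are assumptions):
# sketch: exporter services in docker-compose.monitoring.yml (credentials assumed)
postgres-exporter:
  image: prometheuscommunity/postgres-exporter
  environment:
    DATA_SOURCE_NAME: 'postgresql://monitoring:change_me@postgres:5432/mytelevision?sslmode=disable'
  ports:
    - '9187:9187'
redis-exporter:
  image: oliver006/redis_exporter
  environment:
    REDIS_ADDR: 'redis://redis:6379'
  ports:
    - '9121:9121'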
Application metrics
The NestJS application exposes custom Prometheus metrics at /metrics:
HTTP Metrics
mytelevision_http_requests_total{method, path, status_code}
mytelevision_http_request_duration_seconds{method, path, status_code}
mytelevision_http_requests_in_flight
Authentication Metrics
mytelevision_auth_login_total{provider, status}
mytelevision_auth_failed_login_total{reason}
mytelevision_auth_active_sessions_total
mytelevision_auth_token_refresh_total
mytelevision_auth_token_expired_total
mytelevision_rate_limit_hits_total{endpoint}
Business Metrics
mytelevision_content_views_total{content_type, access_type}
mytelevision_active_streams_total{content_type}
mytelevision_reactions_total{reaction_type, content_type}
mytelevision_favorites_added_total{content_type}
mytelevision_favorites_removed_total{content_type}
mytelevision_registrations_total{source}
mytelevision_active_users_total{access_type}
mytelevision_premium_subscribers_total
mytelevision_payment_transactions_total{status, payment_method}
mytelevision_payment_amount_total{status}
mytelevision_subscription_created_total{plan_type}
Profile & Account Metrics
mytelevision_profiles_created_total{profile_type}
mytelevision_profile_switches_total
mytelevision_device_registrations_total{device_type}
mytelevision_risk_events_total{risk_type, action}
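For dashboards and alerting, these counters and histograms are usually queried through rate() and histogram_quantile(); a sketch of recording rules built on the HTTP metrics above (group and rule names are illustrative, not taken from rules/alerts.yml):
# sketch: recording rules over the HTTP metrics (names are illustrative)
groups:
  - name: mytelevision-api-recording
    rules:
      - record: mytelevision:http_error_rate:ratio_5m
        expr: sum(rate(mytelevision_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(mytelevision_http_requests_total[5m]))
      - record: mytelevision:http_latency_p95:5m
        expr: histogram_quantile(0.95, sum by (le) (rate(mytelevision_http_request_duration_seconds_bucket[5m])))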
Alert rules
Critical alerts (immediate)
| Alert | Condition | Action |
|---|---|---|
| APIDown | API not responding for 1m | Page on-call |
| PostgreSQLDown | Database not responding for 1m | Page DBA + on-call |
| RedisDown | Cache not responding for 2m | Page on-call |
| HighCriticalErrorRate | Error rate > 5% for 2m | Page on-call |
| DiskSpaceCritical | Disk usage > 95% | Page on-call |
# Example critical alert rules
- alert: APIDown
  expr: up{job="mytelevision-api"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: 'API is down'
- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 1m
  labels:
    severity: critical
Warning alerts (repeated every 4h)
| Alert | Condition | Action |
|---|---|---|
| HighErrorRate | Error rate > 1% for 5m | Notify team |
| HighResponseTime | P95 latency > 500ms for 5m | Notify team |
| HighMemoryUsage | Memory > 85% for 5m | Notify team |
| HighCPUUsage | CPU > 80% for 5m | Notify team |
| DatabaseConnectionsHigh | DB connections > 80% | Notify DBA |
# Example warning alert rules
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(mytelevision_http_request_duration_seconds_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning
- alert: HighMemoryUsage
  expr: process_resident_memory_bytes / 1024 / 1024 > 1024
  for: 15m
  labels:
    severity: warning
Security alerts
| Alert | Condition | Action |
|---|---|---|
| PossibleBruteForce | > 10 failed logins/min | Notify security |
| HighTokenExpiration | > 50 expired tokens in 5m | Investigate |
| PaymentFailuresHigh | Payment failure rate > 5% | Notify payments |
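No rule examples are given for these; a hedged sketch of what the brute-force alert could look like, built on the auth metrics listed earlier (the for duration and the team label are assumptions):
# sketch: possible brute-force rule (threshold from the table above)
- alert: PossibleBruteForce
  expr: sum(rate(mytelevision_auth_failed_login_total[5m])) * 60 > 10
  for: 2m
  labels:
    severity: warning
    team: security
  annotations:
    summary: 'More than 10 failed logins per minute'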
Using the MetricsService in application code
import { Injectable } from '@nestjs/common';
import { MetricsService } from '@infrastructure/metrics';

@Injectable()
export class MyService {
  constructor(private readonly metricsService: MetricsService) {}

  async processPayment(amount: number) {
    try {
      // ... payment logic
      this.metricsService.recordPayment('success', 'card', amount);
    } catch (error) {
      this.metricsService.recordPayment('failed', 'card', amount);
      throw error;
    }
  }

  async trackContentView(contentType: string, isPremium: boolean) {
    this.metricsService.recordContentView(
      contentType,
      isPremium ? 'premium' : 'free',
    );
  }

  async trackLogin(provider: string, success: boolean) {
    this.metricsService.recordLogin(provider, success);
  }
}
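The real MetricsService lives in @infrastructure/metrics; as an illustration of how the counters above are typically backed by prom-client, here is a minimal sketch (only the method names come from the usage example, everything else is an assumption):
// sketch: a prom-client backed MetricsService (not the project's actual implementation)
import { Injectable } from '@nestjs/common';
import { Counter, register } from 'prom-client';

@Injectable()
export class MetricsService {
  private readonly paymentTransactions = new Counter({
    name: 'mytelevision_payment_transactions_total',
    help: 'Payment transactions by status and payment method',
    labelNames: ['status', 'payment_method'],
  });
  private readonly paymentAmount = new Counter({
    name: 'mytelevision_payment_amount_total',
    help: 'Total payment amount by status',
    labelNames: ['status'],
  });

  recordPayment(status: string, paymentMethod: string, amount: number): void {
    this.paymentTransactions.inc({ status, payment_method: paymentMethod });
    this.paymentAmount.inc({ status }, amount);
  }

  // serialized by the /metrics endpoint
  getMetrics(): Promise<string> {
    return register.metrics();
  }
}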
Environment variables
Grafana
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=your_secure_password
GRAFANA_ROOT_URL=http://localhost:3001
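These values are usually passed through to Grafana's own GF_* variables in docker-compose.monitoring.yml; the mapping below is a sketch, not taken from the compose file:
# sketch: forwarding the values to Grafana's GF_* variables
grafana:
  environment:
    GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER}
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    GF_SERVER_ROOT_URL: ${GRAFANA_ROOT_URL}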
Alertmanager (Email)
SMTP_HOST=smtp.example.com:587
[email protected]
SMTP_USER=your_user
SMTP_PASSWORD=your_password
Alertmanager (Slack)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
Alert Recipients
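The recipient addresses used by the receivers in alertmanager.yml can be parameterized as well; the variable names below are illustrative only, not taken from the repository:
CRITICAL_ALERT_EMAIL=[email protected]
WARNING_ALERT_EMAIL=[email protected]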
Production considerations
Retention & storage
- Prometheus data retention: 30 days, configurable in docker-compose.monitoring.yml (see the sketch below)
- Consider remote storage (Thanos, Cortex) for long-term retention
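The retention period itself is set through Prometheus' storage flag; a sketch of the relevant command entry (the surrounding service definition is assumed):
# sketch: retention flag on the Prometheus service
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'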
Resource limits for the Prometheus container can also be set in docker-compose.monitoring.yml:
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G
Security
- Change the default Grafana credentials
- Enable HTTPS for all services
- Restrict network access to the monitoring ports
- Use secrets for sensitive configuration
High availability
For production, consider:
- Prometheus with remote write to long-term storage (see the sketch below)
- Grafana backed by an external PostgreSQL database
- An Alertmanager cluster for HA
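A remote_write block in prometheus.yml is enough to forward samples to such a backend; the endpoint below is illustrative (a Thanos Receive default), not part of this stack:
# sketch: remote_write to long-term storage (URL is illustrative)
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'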
Troubleshooting
Prometheus is not scraping
- Check target health: http://localhost:9090/targets
- Check network connectivity between containers
- Check firewall rules
A Grafana dashboard does not load
- Check the Prometheus datasource: Settings > Data Sources
- Check that Prometheus is running
- Check that the dashboard JSON is valid
Alerts are not firing
- Check the alert rules: http://localhost:9090/alerts
- Check the Alertmanager config: http://localhost:9093/#/status
- Check the SMTP/Slack webhook configuration
High-cardinality issues
If Prometheus uses too much memory:
- Review metric labels and avoid high-cardinality labels (the query sketch below can help find the worst offenders)
- Increase the scrape interval
- Use recording rules for frequently queried metrics
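To see which metrics contribute the most series, the following PromQL query is a common starting point (run it in the Prometheus UI; it can be expensive on large instances):
topk(10, count by (__name__)({__name__=~".+"}))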