SRE Tools & Frameworks

The essential tools for keeping production systems reliable.

📊 Monitoring & Observability

Prometheus

Open source monitoring and alerting system, the de facto standard for Kubernetes.

Time series DB, PromQL, Pull model

Open Source, CNCF
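To make the pull model concrete: Prometheus periodically scrapes an HTTP endpoint (conventionally `/metrics`) that returns plain-text samples. The sketch below renders one counter in that text format using only the standard library. The function name `render_counter` and the sample metric are illustrative assumptions, and label-value escaping is left out for brevity; real services would normally use an official client library.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Hypothetical helper: label values are not escaped here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is deterministic.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "get", "code": "200"}, 1027)],
    ))
    # Once scraped, a PromQL query such as
    #   rate(http_requests_total[5m])
    # would give the per-second request rate over the last five minutes.
```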

Grafana

Dashboards and visualization for a wide range of data sources.

Dashboards, Alerting, Multi-source

Open Source, Standard

Datadog

Full SaaS observability platform: metrics, logs, APM.

SaaS, Enterprise

New Relic

Full-stack observability with APM, logs, and infrastructure monitoring.

SaaS, Free tier

📝 Logging

Elastic Stack (ELK)

Elasticsearch, Logstash, Kibana: the reference stack for logs.

Open Source, Self-hosted

Loki

Grafana's log aggregation system, designed to be simple and efficient: it indexes labels rather than full log content.

Label-based, Grafana native

Open Source
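As an illustration of the label-based model, the sketch below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`): each stream is identified by a label set, and each entry is a pair of nanosecond timestamp (as a string) and log line. The helper name `build_loki_push` and the sample labels are assumptions for illustration; sending the request is left out.

```python
import json
import time


def build_loki_push(labels, lines, now_ns=None):
    """Build a Loki push payload for one stream (hypothetical helper)."""
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {
                "stream": dict(labels),  # the label set identifies the stream
                "values": [[ts, line] for line in lines],
            }
        ]
    }


if __name__ == "__main__":
    payload = build_loki_push({"app": "api", "env": "prod"}, ["server started"])
    print(json.dumps(payload, indent=2))
```

Only the labels are indexed, so a small, low-cardinality label set is usually the safer choice.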

Splunk

Enterprise platform for analyzing machine data.

Enterprise, On-prem/Cloud

🔍 Distributed Tracing

Jaeger

Open source distributed tracing with native OpenTelemetry support.

Open Source, CNCF

Tempo

Grafana's tracing backend, easy to operate.

Open Source, Grafana

OpenTelemetry

Standard for collecting telemetry (traces, metrics, logs).

Vendor neutral, Standard

CNCF
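To show what ties the spans of a distributed trace together: OpenTelemetry SDKs typically propagate context between services with the W3C Trace Context `traceparent` header, formatted as `version-traceid-parentid-flags`. The sketch below generates and parses such a header with the standard library only. The function names are illustrative assumptions, not part of any OpenTelemetry API.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Create a traceparent header value for a new trace (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split a traceparent header into its fields, or raise ValueError."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


if __name__ == "__main__":
    header = new_traceparent()
    print(header, parse_traceparent(header))
```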

🚨 Incident Management

PagerDuty

Leader in incident management and on-call scheduling.

On-call, Escalation, Integrations

SaaS, Leader
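As a rough idea of how a monitoring system hands an alert to PagerDuty, the sketch below builds a trigger event shaped for the Events API v2 (posted as JSON to `https://events.pagerduty.com/v2/enqueue`). The helper name and the sample values are assumptions; the routing key comes from a service integration and is a placeholder here, and the HTTP call itself is omitted.

```python
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build a PagerDuty Events API v2 trigger event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


if __name__ == "__main__":
    print(build_trigger_event(
        "YOUR_ROUTING_KEY",  # placeholder, not a real key
        "High error rate on checkout",
        "checkout-api",
        dedup_key="checkout-error-rate",
    ))
```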

OpsGenie

Atlassian's solution for alerting and incident management.

SaaS, Atlassian

Incident.io

Modern incident management with native Slack integration.

SaaS, Slack-first

Rootly

Incident management platform with automation.

SaaS

🎯 SLO Management

Sloth

SLO generator for Prometheus; SLOs are defined in YAML.

Open Source, Prometheus
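The arithmetic these SLO tools automate can be sketched in a few lines: an SLO target implies an error budget over a time window, and the burn rate compares the observed error ratio with that budget. The function names and the 30-day default window are assumptions for illustration, not Sloth's own configuration or output.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full unavailability allowed by the SLO over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(error_ratio, slo_target):
    """How fast the budget is consumed; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo_target)


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # A 1% error ratio against a 99.9% target burns the budget about 10x too fast.
    print(round(burn_rate(0.01, 0.999), 1))
```

Alerting on burn rate rather than on raw error counts keeps alerts tied to how quickly the budget would run out.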

Nobl9

SaaS platform dedicated to SLO management.

SaaS, Multi-source

Pyrra

SLOs with Prometheus, with automatic dashboards and alerts.

Open Source

💥 Chaos Engineering

Chaos Monkey

Netflix's original tool for randomly terminating instances.

Netflix, Classic
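The core idea of random instance termination fits in a few lines. The sketch below only selects a victim from a list of instance IDs; it is a toy model under stated assumptions, not Chaos Monkey's implementation. The names `pick_victim` and `probability` are hypothetical, and the actual termination call to a cloud provider is deliberately left out.

```python
import random


def pick_victim(instances, rng=None, probability=1.0):
    """Return one instance ID to terminate, or None to skip this round.

    probability is the chance that a termination happens at all,
    which lets an operator dial the experiment up gradually.
    """
    rng = rng or random.Random()
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    group = ["i-aaa", "i-bbb", "i-ccc"]  # made-up instance IDs
    victim = pick_victim(group, probability=0.5)
    print("terminate:", victim if victim else "nothing this round")
```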

LitmusChaos

Cloud-native chaos engineering platform for Kubernetes.

Open Source, CNCF

Gremlin

Enterprise chaos engineering platform.

SaaS, Enterprise

Chaos Mesh

Chaos engineering platform for Kubernetes.

Open Source, CNCF

⚡ Load Testing

k6

Modern load testing tool with test scripts written in JavaScript.

Open Source, Grafana

Locust

Load testing with Python scripts, easy to use.

Open Source, Python

Gatling

High-performance load testing tool written in Scala.

Open Source
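The tools in this section share the same core loop: fire many concurrent requests and summarize the latencies. The sketch below shows that loop with the standard library only; it is a stand-in for illustration and does not use the k6, Locust, or Gatling APIs. `run_load` and `request_fn` are hypothetical names, and the percentile is a rough rank-based estimate.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times across a thread pool.

    Returns the request count plus p95 and max latency in seconds.
    total_requests must be at least 1.
    """
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    # Rough rank-based p95, clamped to the last sample.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "count": len(latencies),
        "p95_seconds": latencies[p95_index],
        "max_seconds": latencies[-1],
    }


if __name__ == "__main__":
    # Simulated request; a real test would call the system under test.
    print(run_load(lambda: time.sleep(0.01), total_requests=50, concurrency=10))
```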

📢 Status Pages

Statuspage

Status pages hosted by Atlassian.

SaaS, Atlassian

Cachet

Self-hosted open source status page.

Open Source, Self-hosted

Upptime

Status page hosted on GitHub Pages, free of charge.

Open Source, Free