This document provides an overview of topics related to RabbitMQ monitoring.

Monitoring RabbitMQ and the applications that use it is critically important. Monitoring helps detect issues before they affect the rest of the environment and, eventually, the end users. Many aspects of the system can be monitored, and this guide groups them into a handful of categories. Log aggregation across all nodes and applications is closely related to monitoring and is also mentioned in this guide.

A number of popular tools, both open source and commercial, can be used to monitor RabbitMQ. The combination of Prometheus and Grafana is one highly recommended option. In this guide we define monitoring as the process of capturing the behavior of a system via health checks and metrics over time.

This helps detect anomalies: when the system is unavailable, experiences an unusual load, has exhausted certain resources or otherwise does not behave within its normal (expected) parameters. Monitoring involves collecting and storing metrics for the long term, which is important not only for anomaly detection but also for root cause analysis, trend detection and capacity planning. Monitoring systems typically integrate with alerting systems. When a monitoring system detects an anomaly, an alarm of some sort is typically passed to an alerting system, which notifies interested parties such as the technical operations team.

Having monitoring in place means that important deviations in system behavior, from degraded service in some areas to complete unavailability, are easier to detect and the root cause takes much less time to find. Operating a distributed system without monitoring is a bit like trying to get out of a forest without a GPS navigator device or a compass.

A health check is the most basic aspect of monitoring. It involves a command or set of commands that collect a few essential metrics of the monitored system over time and test them. For example, a check can test whether the operating system process of a RabbitMQ node is running. The metric in this case is "is an OS process running". The normal operating parameters are "the process must be running". Finally, there is an alerting step: if the test fails, interested parties are notified.
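
The three parts of a health check described above (a metric, its normal operating parameters, and an alerting step) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the probe relies on POSIX signal semantics, and the "alerting system" is just a callable that receives a message.

```python
import os
from typing import Callable


def process_is_running(pid: int) -> bool:
    """Metric: is an OS process with this PID running? (assumes a POSIX system)"""
    try:
        # Signal 0 sends nothing; the OS only checks that the target exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


def health_check(
    pid: int,
    probe: Callable[[int], bool] = process_is_running,
    alert: Callable[[str], None] = print,
) -> bool:
    """Collect the metric, test it against "the process must be running", alert on failure."""
    running = probe(pid)
    if not running:
        alert(f"health check failed: process {pid} is not running")
    return running
```

A scheduler such as cron or a monitoring agent would call `health_check` periodically with the PID of the monitored node and with an `alert` callable that forwards the message to the alerting system in use.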

Of course, there are more varieties of health checks. Which ones are most appropriate depends on the definition of a "healthy node" used, so it is a system- and team-specific decision. RabbitMQ CLI tools provide commands that can serve as useful health checks. They will be covered later in this guide.
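
As a rough sketch of how a CLI command can be used as a health check from a script, the helper below runs a command and treats a zero exit code as "healthy". The helper name and the `runner` parameter are hypothetical, and the zero-exit-code convention is an assumption that should be verified for each command used. `rabbitmq-diagnostics -q ping` is one example of a command that could be passed in.

```python
import subprocess
from typing import Callable, Sequence


def run_cli_health_check(
    command: Sequence[str],
    runner: Callable = subprocess.run,
    timeout_seconds: float = 30.0,
) -> bool:
    """Run a CLI command and report success if it exits with code 0.

    Assumption: the command signals a failed check with a non-zero exit code.
    """
    try:
        result = runner(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The tool is missing, not executable, or did not finish in time.
        return False
    return result.returncode == 0
```

For example, `run_cli_health_check(["rabbitmq-diagnostics", "-q", "ping"])` would return `False` when the tool is not installed, when the command times out, or when it exits with a non-zero code.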

While health checks are a useful tool, they only provide so much insight into the state of the system. By design they focus on one or a handful of metrics, usually on a single node, and can only reason about the state of that node at a particular moment in time. For a more comprehensive assessment, collect more metrics over time.
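
To illustrate why metrics collected over time reveal more than a point-in-time check, here is a minimal sketch that keeps timestamped samples of one metric and detects a sustained upward trend. The class and method names are hypothetical, and the "strictly increasing over the last N samples" rule is a deliberately simple stand-in for the trend detection a real monitoring system would provide.

```python
from collections import deque
from typing import Deque, Tuple


class MetricSeries:
    """Bounded in-memory history of (timestamp, value) samples for one metric."""

    def __init__(self, max_samples: int = 1000) -> None:
        # Oldest samples are dropped automatically once the limit is reached.
        self.samples: Deque[Tuple[float, float]] = deque(maxlen=max_samples)

    def record(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))

    def is_steadily_growing(self, window: int) -> bool:
        """True if each of the last `window` values is larger than the one before it."""
        recent = [value for _, value in list(self.samples)[-window:]]
        if window < 2 or len(recent) < window:
            return False  # not enough history to judge a trend
        return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A single sample of, say, a queue length may look unremarkable on its own, while `is_steadily_growing` over the last several samples can show that the value has been climbing for a while, which is the kind of signal a one-off health check cannot provide.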



