Monitoring
==========
There are two production mechanisms for monitoring components of the
correlator: `katcp`_ sensors and `Prometheus`_ metrics. These have different
strengths and weaknesses. Prometheus is highly scalable (so can handle a large
number of metrics), supports metric labels, and has a rich ecosystem (such as
`Grafana`_ for easily visualising metrics). On the other hand, katcp sensors
can be more precisely timestamped and support arbitrary string values (often
containing structured data) rather than just floating-point.

In general, we use katcp sensors for information that should be archived
alongside the data to facilitate its interpretation, as well as sensors that
are needed by other subsystems in the MeerKAT telescope. Prometheus metrics
are used for detailed health and performance monitoring. However, this rule is
not hard-and-fast, and some information is reported via sensors for historical
reasons.

.. _katcp: https://katcp-python.readthedocs.io/en/latest/_downloads/361189acb383a294be20d6c10c257cb4/NRF-KAT7-6.0-IFCE-002-Rev5-1.pdf
.. _Prometheus: https://prometheus.io/
.. _Grafana: https://grafana.com

There is a third monitoring mechanism (event monitoring) intended for
development purposes.

.. _monitoring-sensors:

katcp sensors
-------------
The katcp sensors are self-documenting: issuing a ``?sensor-list`` request to
any of the servers will return a list of the sensors with descriptions. We
thus limit this section to sensors that need a more detailed explanation.

Note that a large number of sensors describing static configuration (number of
channels, accumulation length and so on) are provided by the product
controller rather than this module.

``steady-state-timestamp``
    This sensor is provided by ``dsim``, ``fgpu`` and ``xbgpu``. It can be
    used to synchronise katcp requests with the data. After issuing a katcp
    request that will alter the data stream (such as ``?signals``, ``?gain``
    or ``?delay``), query the sensor. It will contain an ADC timestamp. Any
    data received with that timestamp or greater will be up to date with the
    effects of all prior requests.

    Note that a ``?delay`` request with a future load time is considered to
    have taken effect when the delay model has been updated, even if that
    load time has not yet been reached.

    Note also that the sensor value from a ``dsim`` does not take into account
    any delays applied by F-engines. One should add the delay of the
    corresponding F-engine (or an upper bound on it) to obtain a safe
    timestamp for post–F-engine data streams.

``signals``, ``period`` and ``dither-seed`` (dsim)
    To reproduce the output of the dsim exactly, it is necessary to save all
    three of these sensors, and pass the ``dither-seed`` back to the
    :option:`!--dither-seed` command-line option and the other two to the
    ``?signals`` request (as well as keeping other command-line arguments the
    same).

Prometheus metrics
------------------
To enable Prometheus metrics for any of the services, pass
:option:`!--prometheus-port` to specify the port number. The metrics are then
made available at the ``/metrics`` HTTP endpoint on that port. Pointing a web
browser at that endpoint will show the available metrics with their
documentation.

Event monitoring
----------------
It is also possible to perform very detailed monitoring of events occurring
within fgpu and xbgpu, particularly related to time spent waiting on queues.
This is intended for debugging performance issues rather than production use,
as it has much higher overhead than the other monitoring mechanisms.

To activate it, pass :option:`!--monitor-log` with a filename to the process.
It will write a file with a JSON record per line. The helper script in
:program:`scratch/plot.py` can be used to show a visualisation of the various
queues over time. It's not recommended for more than a few seconds of data.
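
For a quick look at the file written with ``--monitor-log`` without plotting
it, the sketch below reads one JSON record per line and tallies which
top-level keys occur. The record schema is not described here, so the code
assumes nothing about field names; ``summarise_records`` is a hypothetical
helper and the sample records are made up purely for illustration.

```python
import collections
import io
import json


def summarise_records(lines):
    """Count records and tally top-level keys in JSON-lines input.

    `lines` is any iterable of strings, such as an open text file.
    """
    key_counts = collections.Counter()
    n_records = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        n_records += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return n_records, key_counts


# Made-up records standing in for a real monitor log (not the real schema).
sample = io.StringIO('{"a": 1, "b": 2}\n{"a": 3}\n')
n_records, key_counts = summarise_records(sample)
```

With a real log, an open file object (for example from
``open(filename, encoding="utf-8")``) can be passed in place of ``sample``.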