Monitoring

There are two production mechanisms for monitoring components of the correlator: katcp sensors and Prometheus metrics.

These have different strengths and weaknesses. Prometheus is highly scalable (so it can handle a large number of metrics), supports metric labels, and has a rich ecosystem (such as Grafana for easily visualising metrics). On the other hand, katcp sensors can be more precisely timestamped and support arbitrary string values (often containing structured data) rather than only floating-point values.

In general, we use katcp sensors for information that should be archived alongside the data to facilitate its interpretation, as well as sensors that are needed by other subsystems in the MeerKAT telescope. Prometheus metrics are used for detailed health and performance monitoring. However, this rule is not hard-and-fast, and some information is reported via sensors for historical reasons.

There is a third monitoring mechanism (event monitoring) intended for development purposes.

katcp sensors

The katcp sensors are self-documenting: issuing a ?sensor-list request to any of the servers will return a list of the sensors with descriptions. We thus limit this section to sensors that need a more detailed explanation.

Note that many sensors describing static configuration (number of channels, accumulation length and so on) are provided by the product controller rather than by this module.

steady-state-timestamp

This sensor is provided by both dsim and fgpu. It can be used to synchronise katcp requests with the data. After issuing a katcp request that will alter the data stream (such as ?signals, ?gain or ?delay), query the sensor. It will contain an ADC timestamp. Any data received with that timestamp or greater will be up to date with the effects of all prior requests.

Note that a ?delay request with a future load time is considered to have taken effect once the delay model has been updated, even if that load time has not yet been reached.

Note also that the sensor value from a dsim does not take into account any delays applied by F engines. To obtain a safe timestamp for post-F engine data streams, add the delay of the corresponding F engine (or an upper bound on it) to the sensor value.
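The timestamp comparison and the F-engine adjustment described above can be sketched as follows. This is a minimal illustration, not part of the package: the function names are hypothetical, and it assumes the F-engine delay bound is known in seconds and is converted to ADC samples using the ADC sample rate.

```python
import math


def is_up_to_date(data_timestamp: int, steady_state_timestamp: int) -> bool:
    """Return True if data with this ADC timestamp reflects all prior requests."""
    return data_timestamp >= steady_state_timestamp


def safe_post_f_timestamp(
    dsim_steady_state: int, max_delay_s: float, adc_sample_rate: float
) -> int:
    """Shift a dsim steady-state timestamp by an upper bound on the F-engine delay.

    Assumption: the delay bound is given in seconds and the timestamp counts
    ADC samples, so the bound is converted using the ADC sample rate and
    rounded up to stay on the safe side.
    """
    delay_samples = math.ceil(max_delay_s * adc_sample_rate)
    return dsim_steady_state + delay_samples
```

For example, with a steady-state timestamp of 1000, a delay bound of 0.25 s and a (deliberately unrealistic) sample rate of 1024 Hz, the safe timestamp is 1256.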

signals, period and dither-seed (dsim)

To reproduce the output of the dsim exactly, it is necessary to save all three of these sensors. Pass the dither-seed value back to the --dither-seed command-line option and the other two values to the ?signals request, and keep all other command-line arguments the same.
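One way to keep the three values together is to store them as a small JSON document and later turn them back into the command-line option and the request. The sketch below is illustrative only: the helper names are hypothetical, the sensor values are assumed to have been read already, and the order of the ?signals arguments (signals first, then period) is an assumption that should be checked against the request's own help text.

```python
import json

# The three sensors named in this section.
DSIM_STATE_SENSORS = ("signals", "period", "dither-seed")


def save_dsim_state(sensors: dict) -> str:
    """Serialise just the three sensors needed to reproduce dsim output."""
    keep = {name: sensors[name] for name in DSIM_STATE_SENSORS}
    return json.dumps(keep, sort_keys=True)


def replay_plan(saved: str) -> tuple[list[str], tuple[str, ...]]:
    """Return (extra command-line arguments, katcp request words).

    Assumption: ?signals takes the signals value followed by the period.
    """
    state = json.loads(saved)
    cli_args = ["--dither-seed", str(state["dither-seed"])]
    request = ("signals", str(state["signals"]), str(state["period"]))
    return cli_args, request
```

The returned command-line arguments are added to the otherwise unchanged dsim command line, and the request words are sent once the dsim is running.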

Prometheus metrics

To enable Prometheus metrics for any of the services, pass --prometheus-port to specify the port number. The metrics are then made available at the /metrics HTTP endpoint on that port. Pointing a web browser at that endpoint will show the available metrics with their documentation.
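The same information can be retrieved from a script. The following sketch uses only the Python standard library to fetch the endpoint and extract each metric's documentation from the `# HELP` lines of the Prometheus text exposition format. The function names, host and port are placeholders for illustration.

```python
import urllib.request


def fetch_metrics(host: str, port: int, timeout: float = 5.0) -> str:
    """Fetch the raw text served at /metrics (port is the --prometheus-port value)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_docs(text: str) -> dict[str, str]:
    """Map each metric name to its documentation string.

    Documentation lines have the form "# HELP <name> <docstring>".
    """
    docs = {}
    for line in text.splitlines():
        parts = line.split(" ", 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            docs[parts[2]] = parts[3] if len(parts) == 4 else ""
    return docs
```

Calling `metric_docs(fetch_metrics("localhost", 7150))` would list the documented metrics of a service started with `--prometheus-port 7150` (the port number here is arbitrary).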

Event monitoring

It is also possible to perform very detailed monitoring of events occurring within fgpu and xbgpu, particularly related to time spent waiting on queues. This is intended for debugging performance issues rather than production use, as it has much higher overhead than the other monitoring mechanisms. To activate it, pass --monitor-log with a filename to the process. It will write a file with a JSON record per line. The helper script in scratch/plot.py can be used to show a visualisation of the various queues over time. It’s not recommended for more than a few seconds of data.
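Because the file holds one JSON record per line, it can also be inspected without the plotting script. The sketch below reads the records and counts them by a chosen field. The record schema is not described here, so the field name is left to the caller, and the `"type"` key used in the usage note is purely hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by(records: Iterable[dict], key: str) -> Counter:
    """Count records by the value of `key` (None where the key is absent)."""
    return Counter(record.get(key) for record in records)
```

With a log written via `--monitor-log monitor.json`, opening the file and calling `count_by(read_records(f), "type")` would summarise the records by that field, if it exists in the records.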