Monitoring
There are two production mechanisms for monitoring components of the correlator: katcp sensors and Prometheus metrics.
These have different strengths and weaknesses. Prometheus is highly scalable (so can handle a large number of metrics), supports metric labels, and has a rich ecosystem (such as Grafana for easily visualising metrics). On the other hand, katcp sensors can be more precisely timestamped and support arbitrary string values (often containing structured data) rather than just floating-point.
In general, we use katcp sensors for information that should be archived alongside the data to facilitate its interpretation, as well as sensors that are needed by other subsystems in the MeerKAT telescope. Prometheus metrics are used for detailed health and performance monitoring. However, this rule is not hard-and-fast, and some information is reported via sensors for historical reasons.
There is a third monitoring mechanism (event monitoring) intended for development purposes.
katcp sensors
The katcp sensors are self-documenting: issuing a ?sensor-list request to
any of the servers will return a list of the sensors with descriptions. We
thus limit this section to sensors that need a more detailed explanation.
Note that many sensors describing static configuration (number of channels, accumulation length and so on) are provided by the product controller rather than this module.
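As a rough illustration of what a ?sensor-list reply contains, the sketch below decodes a single #sensor-list inform as it appears on the wire. The field order (name, description, units, type) and the escape sequences follow the katcp protocol specification. The helper names are hypothetical, and in practice a katcp client library would normally do this decoding.

```python
import re

# Escape sequences defined by the katcp wire protocol.
_ESCAPES = {"\\": "\\", "_": " ", "0": "\0", "n": "\n",
            "r": "\r", "e": "\x1b", "t": "\t", "@": ""}


def unescape(arg: str) -> str:
    """Decode katcp backslash escapes in a single argument."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES[m.group(1)], arg)


def parse_sensor_list_inform(line: str) -> dict:
    """Split one '#sensor-list' inform into its leading fields.

    Hypothetical helper: any type-specific parameters after the
    sensor type are ignored.
    """
    fields = re.split(r"[ \t]+", line.strip())
    # The message name may carry an optional numeric message ID.
    if not re.fullmatch(r"#sensor-list(\[\d+\])?", fields[0]):
        raise ValueError("not a #sensor-list inform")
    name, description, units, sensor_type = map(unescape, fields[1:5])
    return {"name": name, "description": description,
            "units": units, "type": sensor_type}
```

For example, parse_sensor_list_inform(r"#sensor-list steady-state-timestamp Example\_description \@ integer") yields a dictionary whose description is "Example description" and whose units are empty. That inform line is invented for illustration and is not copied from a real server.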
steady-state-timestamp
This sensor is provided by both dsim and fgpu. It can be used to synchronise katcp requests with the data. After issuing a katcp request that will alter the data stream (such as ?signals, ?gain or ?delay), query the sensor. It will contain an ADC timestamp. Any data received with that timestamp or greater will be up to date with the effects of all prior requests.
It should be noted that a ?delay request with a future load time is considered to have taken effect when the delay model has been updated, even if that load time has not yet been reached.
It should also be noted that the sensor value from a dsim does not take into account any delays applied by F engines. One should add the delay of the corresponding F engine (or an upper bound on it) to obtain a safe timestamp for post-F engine data streams.
signals, period and dither-seed (dsim)
To reproduce the output of the dsim exactly, it is necessary to save all three of these sensors, and pass the dither-seed back to the --dither-seed command-line option and the other two to the ?signals request (as well as keeping other command-line arguments the same).
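The two recipes in this section can be sketched as a pair of small helpers. This is a minimal illustration rather than code from the package. It assumes that ADC timestamps count samples (so a delay in seconds is converted using the ADC sample rate), and that ?signals takes the signal specification followed by the period. The function names and argument order are assumptions.

```python
import math


def safe_post_f_timestamp(dsim_steady_state: int,
                          max_delay_s: float,
                          adc_sample_rate_hz: float) -> int:
    """Pad a dsim steady-state-timestamp by an F-engine delay bound.

    Assumes timestamps are in units of ADC samples. The delay is
    rounded up so the result errs on the safe (later) side.
    """
    delay_samples = math.ceil(max_delay_s * adc_sample_rate_hz)
    return dsim_steady_state + delay_samples


def dsim_replay_arguments(sensors: dict) -> tuple:
    """Turn saved sensor values into what is needed to replay a dsim.

    `sensors` maps the sensor names 'signals', 'period' and
    'dither-seed' to their saved values. Returns the extra
    command-line arguments and the katcp request as a tuple of
    strings (argument order of ?signals is an assumption).
    """
    cli_args = ["--dither-seed", str(sensors["dither-seed"])]
    request = ("signals", str(sensors["signals"]), str(sensors["period"]))
    return cli_args, request
```

Data from a post-F engine stream would then be treated as up to date once its timestamp is at least the value returned by safe_post_f_timestamp.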
Prometheus metrics
To enable Prometheus metrics for any of the services, pass --prometheus-port
to specify the port number. The metrics are then made available at the
/metrics HTTP endpoint on that port. Pointing a web browser at that endpoint
will show the available metrics with their documentation.
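The same information can be read programmatically. The sketch below fetches the /metrics endpoint with the Python standard library and extracts the documentation strings from the "# HELP" lines of the Prometheus text exposition format. The function names are hypothetical, and the host and port are whatever the service was started with. Escapes inside the help text are left undecoded.

```python
import urllib.request


def fetch_metrics(host: str, port: int) -> str:
    """Download the raw metrics text from a running service."""
    with urllib.request.urlopen(f"http://{host}:{port}/metrics") as resp:
        return resp.read().decode("utf-8")


def metric_help(text: str) -> dict[str, str]:
    """Map each metric name to its '# HELP' documentation string."""
    docs = {}
    for line in text.splitlines():
        # A help line looks like: "# HELP <metric_name> <docstring>"
        parts = line.split(" ", 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            docs[parts[2]] = parts[3] if len(parts) == 4 else ""
    return docs
```

With a service started using --prometheus-port, metric_help(fetch_metrics(host, port)) gives a dictionary of metric names and descriptions, using the host and port of that service.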
Event monitoring
It is also possible to perform very detailed monitoring of events occurring
within fgpu and xbgpu, particularly related to time spent waiting on queues.
This is intended for debugging performance issues rather than production use,
as it has much higher overhead than the other monitoring mechanisms. To
activate it, pass --monitor-log with a filename to the process. It will
write a file with a JSON record per line. The helper script in
scratch/plot.py can be used to show a visualisation of the various
queues over time. It’s not recommended for more than a few seconds of data.
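Because the log holds one JSON record per line, it can also be inspected with a few lines of Python. The sketch below makes no assumption about the record schema beyond each line being a JSON object. The function names are hypothetical, and the key passed to count_by_key must be one that actually appears in the log.

```python
import json
from collections import Counter


def read_events(path):
    """Yield one decoded record per non-empty line of the log file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def count_by_key(path, key):
    """Count records grouped by the value stored under `key`.

    Records that lack the key are counted under None.
    """
    return Counter(record.get(key) for record in read_events(path))
```

Reading the file lazily keeps memory use modest even if the log is larger than intended.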