Operation ========= There are two main scenarios involved in starting up and interacting with katgpucbf and its constituent engines: #. the instantiation and running of a complete end-to-end correlator, and #. the invocation of individual engines (dsim, fgpu, xbgpu, vgpu) for more fine-grained testing and debugging. The first requires a mechanism to orchestrate the simultaneous spin-up of a correlator's required components — that is, some combination of engines. For this purpose, katgpucbf utilises the infrastructure provided by `katsdpcontroller`_ — discussed in the following section. Regarding the testing and debugging of individual engines, more detailed explanations of their inner workings are discussed in their respective, more dedicated discussion documents. The main thing to note is that, in both methods of invocation (via orchestration and individually), the engines support control via katcp requests issued to their ``:``. ``netcat`` (`nc`_) is likely the most readily-available tool for this job, but `ntsh`_ neatens up these exchanges and generally makes it easier to interact with. .. _katsdpcontroller: https://github.com/ska-sa/katsdpcontroller .. _nc: https://www.commandlinux.com/man-page/man1/nc.1.html .. _ntsh: https://pypi.org/project/ntsh/ .. _katsdpcontroller-discussion: katsdpcontroller ---------------- This package (katgpucbf) provides the components of a correlator (engines and simulators), but not the mechanisms to start up and orchestrate all the components as a cohesive unit. That is provided by `katsdpcontroller`_. For production use it is strongly recommended that katsdpcontroller is used to manage the correlator. Nevertheless, it is possible to run the individual pieces manually, or to implement an alternative controller. The remaining sections in this chapter describe the interfaces that are used by katsdpcontroller to communicate with the correlator components. There are two parts to katsdpcontroller: a :dfn:`master controller` and a :dfn:`product controller`. There is a single product controller per instantiated correlator. It is responsible for: - starting up the appropriate correlator components with suitable arguments, given a high-level description of the desired correlator configuration; - monitoring the health of those components; - registering them with `Consul`_, so that infrastructure such as `Prometheus`_ can discover them; - proxying their :ref:`monitoring-sensors`, so that clients need only subscribe to sensors from the product controller rather than individual components; - in some cases, aggregating or renaming those sensors, to present a correlator-wide suite of sensors, without clients needing to know about the individual engines; - providing additional correlator-wide katcp sensors; - providing correlator-wide katcp requests, which are implemented by issuing similar but finer-grained requests to the individual engines. .. _Consul: https://www.consul.io/ .. _Prometheus: https://prometheus.io/ The master controller manages product controllers (and hence correlators), starting them up and shutting them down on request from the user. In a system supporting subarrays, there will typically be a single master controller and zero or more product controllers at any one time. It is worth noting that katsdpcontroller was originally written to control the MeerKAT Science Data Processor and later extended to control correlators, so it has a number of features, requests and sensors that are not relevant to correlators. Starting the correlator ----------------------- The katgpucbf repository comes with a ``scratch/`` directory, under which you will find handy scripts for correlator and engine invocation. Granted, the layout and usage of these scripts is tailored to SARAO DSP's internal lab development environment (e.g. host and interface names) and don't necessarily go through the same reviewing rigour as the actual codebase. For these reasons, it is recommended that these scripts are used more as an example of how to run components of katgpucbf, rather than set-in-stone modi operandi. End-to-end correlator startup ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you intend on starting up a correlator with :program:`sim_correlator.py`, you will require a running master controller in accordance with :ref:`katsdpcontroller-discussion`. The script itself provides an array of options for you to start your correlator; running ``./sim_correlator.py --help`` gives a brief explanation of the arguments required. Below is an example of a full command to run a 4096-channel, 4-antenna, L-band correlator:: scratch/sim_correlator.py -a 4 -c 4096 -i 0.5 --band l \ --name my_test_correlator lab-mc.sdp.kat.ac.za The execution of this command contacts the master controller to request a new correlator product to be configured. The master controller figures out how many of each respective engine is required based on these input parameters, and launches them accordingly across the pool of processing nodes available. .. _indiv-engine-startup: Individual engine startup ^^^^^^^^^^^^^^^^^^^^^^^^^ The arguments required for individual engine invocation can be seen by running one of ``{dsim, fgpu, xbgpu, vgpu} --help`` in an appropriately-configured terminal environment. There are a few mandatory ones, and ultimately stitching the entire incantation together by hand can become tiresome. For this reason, the scripts under ``scratch/{fgpu, xbgpu}`` have been shipped with the module. The scripts for standalone engine usage are pre-populated with typical configuration values for your convenience, and are usually named :program:`run-{dsim, fsim, fgpu, xbgpu}.sh`. It is important to note that the F- and XB-Engines can run in a standalone manner, but will require some form of stimulus to truly exercise the engine. For example, ``fgpu`` requires a corresponding ``dsim`` to produce data for ingest. Similarly, ``xbgpu`` requires an appropriately-configured ``fsim``. Basically, the engines will do nothing until explicitly asked to. .. todo:: ``NGC-730`` Add comments on the scripts themselves to make them easier to follow. .. note:: Before considering which engine you intend on testing, note the number of GPUs available in the target processing node. The `CUDA`_ library acknowledges the presence of a ``CUDA_VISIBLE_DEVICES`` environment variable, similar to that discussed by :external+katsdpsigproc:std:ref:`katsdpsigproc `. You can simply ``export CUDA_VISIBLE_DEVICES=0`` in your terminal environment for the engine invocation to acknowledge your intention of using a particular GPU. .. _CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars To test a 4096-channel, 4-antenna XB-Engine processing L-band data, use the following commands in separate terminals on two separate servers. This will launch a single :ref:`feng-packet-sim` on ``host1`` and a single :program:`xbgpu` instance on ``host2``:: [Connect to host1 and activate the local virtual environment] (katgpucbf) user@host1:~/katgpucbf$ spead2_net_raw fsim --interface --ibv \ --array-size 4 --channels 4096 \ --channels-per-substream 1024 \ 239.10.10.10+1:7148 . . . [Connect to host2 and activate the local virtual environment] (katgpucbf) user@host2:~/katgpucbf$ spead2_net_raw numactl -C 1 xbgpu \ --recv-affinity 0 --recv-comp-vector 0 \ --send-affinity 1 --send-comp-vector 1 \ --recv-interface \ --send-interface \ --recv-ibv --send-ibv \ --adc-sample-rate 1712e6 --array-size 4 \ --channels 4096 \ --channels-per-substream 1024 \ --samples-between-spectra 8192 \ --katcp-port 7150 \ 239.10.10.10:7148 239.10.11.10:7148 Naturally, it is up to the user to ensure command-line parameters are consistent across the components under test, e.g. using the same :option:`!--array-size` is for the data generated (in the :program:`fsim`) and the :program:`xbgpu` instance. .. note:: ibverbs requires ``CAP_NET_RAW`` capability on Linux hosts. See :external+spead2:std:ref:`spead2's discussion ` on ensuring this is configured correctly for your usage. Pinning thread affinities """"""""""""""""""""""""" .. todo:: ``NGC-730`` Update ``run-{dsim, fsim, fpgu, xbgpu}.sh`` scripts to standardise over usage of either ``numactl`` or ``taskset``. :external+spead2:doc:`spead2's performance tuning discussion ` outlines the need to set the affinity of all threads that aren't specifically pinned by :option:`!--{src, dst}-affinity`. This is often the main Python thread, but libraries like CUDA tend to spin up helper threads. Testing without a high-speed data network """"""""""""""""""""""""""""""""""""""""" katgpucbf allows the user to develop, debug and test its engines without the use of a high-speed e.g. 100GbE data network. The omission of :option:`!--{src, dst}-ibv` command-line parameters avoids receiving data via the Infiniband Verbs API. This means that if you wish to e.g. capture engine data on a machine that doesn't support ibverbs, you could use :manpage:`tcpdump(8)`. .. note:: The data rates you intend to process are still limited by the NIC in your host machine. To truly take advantage of running engines without a high-speed data network, consider reducing the :option:`!--adc-sample-rate` by e.g. a factor of ten as this value greatly affects the engine's data transmission rate. Testing without astropy downloading the leap seconds table """""""""""""""""""""""""""""""""""""""""""""""""""""""""" If you need `ASTROPY`_ to run offline, set the IERS auto-download settings in the Astropy config file. .. code-block:: ini :caption: astropy/astropy.cfg :linenos: [utils.iers.iers] auto_download = False Set ``auto_download = False`` in the astropy.cfg resouce file as illustrated above. Use the environment variable `XDG_CONFIG_HOME` to point to the folder containing the astropy config directory. If the environment variable :envvar:`XDG_CONFIG_HOME` is not set, Astropy will use the configuration found at `~/.astropy/config/` in the home directory of the user running the process. .. _ASTROPY: https://docs.astropy.org/en/stable/api/astropy.config.get_config_dir.html Controlling the correlator -------------------------- The correlator components are controlled using `katcp`_. A user can connect to the ``:`` and issue a ``?help`` to see the full range of requests available. The ```` and ```` values for individual engines are configurable at runtime, whereas the ```` and ```` values for the correlator's *product controller* are yielded by the master controller after startup. Standard katcp requests (such as querying and subscribing to sensors) are not covered here; only application-specific requests are listed. Sensors are described in :ref:`monitoring-sensors`. .. _katcp: https://katcp-python.readthedocs.io/en/latest/_downloads/361189acb383a294be20d6c10c257cb4/NRF-KAT7-6.0-IFCE-002-Rev5-1.pdf dsim ^^^^ This describes version ``katgpucbf-dsim-1.0`` of the katcp-device API. :samp:`?signals {spec} [{period}]` Change the signals that are generated. The signal specification is described in :ref:`dsim-dsl`. The resulting signal will be periodic with a period of :samp:`{period}` samples. The given period must divide into the :option:`!--max-period` command-line argument, which is also the default period if none is specified. The dither that is applied is cached on startup, but is independent for the different streams. Repeating the same request thus gives the same results, provided any randomised terms (such as ``wgn``) use fixed seeds. It returns an ADC timestamp, which indicates the next sample which is generated with the new signals. This is kept for backwards compatibility, but the same information can be found in the ``steady-state-timestamp`` sensor. ``?time`` Return the current UNIX timestamp on the server running the dsim. This can be used to get an approximate idea of which data is in flight, without depending on the dsim host and the client having synchronised clocks. fgpu ^^^^ This describes version ``katgpucbf-fgpu-1.0`` of the katcp-device API. :samp:`?gain {stream} {input} [{values}...]` Set the complex gains. This has the same semantics as the equivalent katsdpcontroller request, but :samp:`{input}` must be 0 or 1 to select the input polarisation. :samp:`?gain-all {stream} {values}...` Set the complex gains for both inputs. This has the same semantics as the equivalent katsdpcontroller request. :samp:`?delays {stream} {start-time} {values}...` Set the delay polynomials. This has the same semantics as the equivalent katsdpcontroller request, but takes exactly two delay model specifications (for the two polarisations). xbgpu ^^^^^ This describes version ``katgpucbf-xbgpu-1.0`` of the katcp-device API. :samp:`?capture-start {stream} [{timestamp}]`, :samp:`?capture-stop {stream}` Enable or disable transmission of output data. This does not affect transmission of descriptors, which cannot be disabled. In the initial state transmission is disabled, unless the :option:`!--send-enabled` command-line option has been passed. If :samp:`{timestamp}` is specified, heaps with timestamps less than this ADC timestamp will not be transmitted. :samp:`?beam-weights {stream} {weights}...`, :samp:`?beam-delays {stream} {delays}...`, :samp:`?beam-quant-gains {stream} {gain}` These have the same semantics as the equivalent katsdpcontroller requests. vgpu ^^^^ This describes version ``katgpucbf-vgpu-1.0`` of the katcp-device API. :samp:`?capture-start [{timestamp}]`, :samp:`?capture-stop` Enable or disable transmission of output data. In the initial state transmission is disabled. If :samp:`{timestamp}` is specified, input heaps with timestamps less than this ADC timestamp will be discarded. :samp:`?vlbi-delay {delay}` This has the same semantics as the equivalent katsdpcontroller request. Shutting down the correlator ---------------------------- End-to-end correlator shutdown ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A user can issue a ``?product-deconfigure`` request to the correlator's product controller by connecting to its ``:``. This request triggers the stop procedure of all engines running in the target correlator. More specifically: * the product controller instructs the orchestration software to stop the containers running the engines, * which is received by the engines as a ``SIGTERM``, * finally triggering a ``halt`` in the engines for a graceful shutdown. The shutdown procedures are broadly similar between the engines. Ultimately they all: * finish calculations on data currently in their pipelines, * stop the transmission of their SPEAD descriptors, and * in the case of engines that receive data, stop their ``spead2`` receivers, which allows for a more natural ending of internal processing operations. Individual engine shutdown ^^^^^^^^^^^^^^^^^^^^^^^^^^ Once you've sufficiently tested, debugged and/or reached the desired level of confusion, there are two options for engine shutdown: #. simply issue a ``Ctrl + C`` in the terminal window where the engine was invoked, or #. connect to the engine's ``:`` and issue a ``?halt``. After either of these approaches are executed, the engine will shutdown cleanly and quietly according to their common :ref:`engines-shutdown-procedure`.