Operation
There are two main scenarios involved in starting up and interacting with katgpucbf and its constituent engines:
the instantiation and running of a complete end-to-end correlator, and
the invocation of individual engines (dsim, fgpu, xbgpu) for more fine-grained testing and debugging.
The first requires a mechanism to orchestrate the simultaneous spin-up of a correlator’s required components - that is, some combination of dsim(s), F-Engine(s) and XB-Engine(s). For this purpose, katgpucbf utilises the infrastructure provided by katsdpcontroller - discussed in the following section.
Regarding the testing and debugging of individual engines, their inner workings are explained in more detail in their respective dedicated documents.
The main thing to note is that, in both methods of invocation (via
orchestration and individually), the engines support control via katcp commands
issued to their <host>:<port>. netcat (nc) is likely the most
readily-available tool for this job, but ntsh neatens up these exchanges
and generally makes them easier to work with.
katsdpcontroller
This package (katgpucbf) provides the components of a correlator (engines and simulators), but not the mechanisms to start up and orchestrate all the components as a cohesive unit. That is provided by katsdpcontroller.
For production use it is strongly recommended that katsdpcontroller is used to manage the correlator. Nevertheless, it is possible to run the individual pieces manually, or to implement an alternative controller. The remaining sections in this chapter describe the interfaces that are used by katsdpcontroller to communicate with the correlator components.
There are two parts to katsdpcontroller: a master controller and a product controller. There is a single product controller per instantiated correlator. It is responsible for:
starting up the appropriate correlator components with suitable arguments, given a high-level description of the desired correlator configuration;
monitoring the health of those components;
registering them with Consul, so that infrastructure such as Prometheus can discover them;
proxying their katcp sensors, so that clients need only subscribe to sensors from the product controller rather than individual components;
in some cases, aggregating or renaming those sensors, to present a correlator-wide suite of sensors, without clients needing to know about the individual engines;
providing additional correlator-wide katcp sensors;
providing correlator-wide katcp requests, which are implemented by issuing similar but finer-grained requests to the individual engines.
The master controller manages product controllers (and hence correlators), starting them up and shutting them down on request from the user. In a system supporting subarrays, there will typically be a single master controller and zero or more product controllers at any one time.
It is worth noting that katsdpcontroller was originally written to control the MeerKAT Science Data Processor and later extended to control correlators, so it has a number of features, requests and sensors that are not relevant to correlators.
Starting the correlator
The katgpucbf repository comes with a scratch/ directory, under which you
will find handy scripts for correlator and engine invocation. Granted, the
layout and usage of these scripts are tailored to SARAO DSP’s internal lab
development environment (e.g. host and interface names), and the scripts don’t
necessarily go through the same reviewing rigour as the actual codebase. For
these reasons, it is recommended that these scripts are used more as an example
of how to run components of katgpucbf, rather than set-in-stone modi operandi.
End-to-end correlator startup
If you intend to start up a correlator with sim_correlator.py,
you will require a running master controller, as described under
katsdpcontroller. The script provides a range of options for starting
your correlator; running ./sim_correlator.py --help
gives a brief explanation of the arguments required. Below is an example of a
full command to run a 4k, 4-antenna, L-band correlator:
./sim_correlator.py -a 4 -c 4096 -i 0.5 \
--adc-sample-rate 1712e6 \
--name my_test_correlator \
--image-override katgpucbf:harbor.sdp.kat.ac.za/dpp/katgpucbf:latest \
lab5.sdp.kat.ac.za
The execution of this command contacts the master controller to request a new correlator product to be configured. The master controller figures out how many of each respective engine is required based on these input parameters, and launches them accordingly across the pool of processing nodes available.
Individual engine startup
The arguments required for individual engine invocation can be seen by
running one of {dsim, fgpu, xbgpu} --help in an appropriately-configured
terminal environment. There are a few mandatory ones, and ultimately stitching
the entire incantation together by hand can become tiresome. For this reason,
the scripts under scratch/{fgpu, xbgpu} have been shipped with the module.
The scripts for standalone engine usage are prepopulated with typical
configuration values for your convenience, and are usually named
run-{dsim, fgpu, xbgpu}.sh. It is important to note that the F- and
XB-Engines can run in a standalone manner, but will require some form of
stimulus to truly exercise the engine. For example, fgpu requires a
corresponding dsim to produce data for ingest. Similarly, xbgpu
requires an appropriately-configured fsim. In short, the engines do
nothing until they are given input data.
Todo
NGC-730
Update scratch directory to have a single config sub-directory. Also add
comments on the scripts themselves to make it easier to follow.
Note
Before deciding which engine you intend to test, note the number of GPUs
available in the target processing node. The CUDA library honours the
CUDA_VISIBLE_DEVICES environment variable, similar to that
discussed by katsdpsigproc.
You can simply export CUDA_VISIBLE_DEVICES=0 in your terminal environment
so that the engine invocation uses a particular GPU.
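When an engine is launched from a script rather than an interactive shell, the same effect can be had by setting the variable only in the child process's environment. The following is a minimal sketch, not part of katgpucbf: the helper name gpu_environment is hypothetical, and a small Python child process stands in for the engine so that the example runs anywhere.

```python
import os
import subprocess
import sys


def gpu_environment(device_index: int) -> dict:
    """Return a copy of the current environment that exposes one GPU only.

    Hypothetical helper for illustration; it just sets CUDA_VISIBLE_DEVICES.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


# A stand-in child process that reports what it sees. In practice the
# command would be the engine invocation (e.g. one of the run-*.sh scripts).
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = subprocess.run(
    child, env=gpu_environment(0), capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

Setting the variable per child process, rather than exporting it in the shell, makes it harder to accidentally leave a stale GPU selection behind in a long-lived terminal session.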
To test a 4k, 4-antenna XB-Engine processing L-band data, use the following
commands in separate terminals on two separate servers. This will launch a
single F-Engine Packet Simulator on host1 and a single xbgpu
instance on host2:
[Connect to host1 and activate the local virtual environment]
(katgpucbf) user@host1:~/katgpucbf$ spead2_net_raw fsim --interface <interface name> --ibv \
--array-size 4 --channels 4096 \
--channels-per-substream 1024 \
239.10.10.10+1:7148
.
.
.
[Connect to host2 and activate the local virtual environment]
(katgpucbf) user@host2:~/katgpucbf$ spead2_net_raw numactl -C 1 xbgpu \
--src-affinity 0 --src-comp-vector 0 \
--dst-affinity 1 --dst-comp-vector 1 \
--src-interface <interface name> \
--dst-interface <interface name> \
--src-ibv --dst-ibv \
--adc-sample-rate 1712e6 --array-size 4 \
--channels 4096 \
--channels-per-substream 1024 \
--samples-between-spectra 8192 \
--katcp-port 7150 \
239.10.10.10:7148 239.10.11.10:7148
Naturally, it is up to the user to ensure command-line parameters are
consistent across the components under test, e.g. using the same
--array-size for the data generated (in the fsim) and
the xbgpu instance.
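One way to keep the two invocations consistent is to generate both command lines from a single set of shared values. The sketch below is illustrative only: the helper names are hypothetical, the flags and addresses are copied from the example commands above, and "eth0" is a placeholder interface name.

```python
import shlex

# Parameters that must agree between the fsim and the xbgpu instance.
SHARED = {
    "--array-size": "4",
    "--channels": "4096",
    "--channels-per-substream": "1024",
}


def shared_args(params: dict) -> list:
    """Flatten {flag: value} into [flag, value, flag, value, ...]."""
    args = []
    for flag, value in params.items():
        args += [flag, value]
    return args


def build_commands(interface: str, params: dict):
    """Return (fsim, xbgpu) argument lists that share the same parameters."""
    common = shared_args(params)
    fsim = ["spead2_net_raw", "fsim", "--interface", interface, "--ibv",
            *common, "239.10.10.10+1:7148"]
    xbgpu = ["spead2_net_raw", "numactl", "-C", "1", "xbgpu",
             "--src-affinity", "0", "--src-comp-vector", "0",
             "--dst-affinity", "1", "--dst-comp-vector", "1",
             "--src-interface", interface, "--dst-interface", interface,
             "--src-ibv", "--dst-ibv",
             "--adc-sample-rate", "1712e6",
             *common,
             "--samples-between-spectra", "8192",
             "--katcp-port", "7150",
             "239.10.10.10:7148", "239.10.11.10:7148"]
    return fsim, xbgpu


fsim_cmd, xbgpu_cmd = build_commands("eth0", SHARED)  # "eth0" is a placeholder
print(shlex.join(fsim_cmd))
print(shlex.join(xbgpu_cmd))
```

Changing a value in SHARED then changes it for both components at once, so the two command lines cannot drift apart.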
Note
ibverbs requires the CAP_NET_RAW capability on Linux hosts. See
spead2’s discussion on
ensuring this is configured correctly for your usage.
Pinning thread affinities
Todo
NGC-730
Update run-{dsim, fgpu, xbgpu}.sh scripts to standardise over usage
of either numactl or taskset.
spead2’s performance tuning discussion outlines
the need to set the affinity of all threads that aren’t specifically pinned by
--{src, dst}-affinity. This is often the main Python thread, but
libraries like CUDA tend to spin up helper threads.
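For launchers written in Python, a comparable effect to taskset can be sketched with os.sched_setaffinity (Linux only). A process's affinity mask is inherited by the threads it creates, so setting it in the child before the program starts also covers helper threads. The helper name run_pinned is hypothetical, and a small Python child stands in for an engine.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cpus):
    """Run cmd with its initial CPU affinity restricted to cpus.

    Roughly what launching via taskset does: the mask is applied in the
    child before exec, so every thread the program later creates inherits
    it unless that thread is pinned elsewhere explicitly.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
        capture_output=True,
        text=True,
        check=True,
    )


# Choose a CPU this process is allowed to use, so the example runs on any host.
cpu = min(os.sched_getaffinity(0))
report = [sys.executable, "-c", "import os; print(sorted(os.sched_getaffinity(0)))"]
result = run_pinned(report, {cpu})
print(result.stdout.strip())
```

Threads that the engine pins itself (via --{src, dst}-affinity) are unaffected by this default, which only applies to threads that would otherwise be free to run anywhere.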
Testing without a high-speed data network
katgpucbf allows the user to develop, debug and test its engines without the
use of a high-speed (e.g. 100GbE) data network. Omitting the
--{src, dst}-ibv command-line parameters avoids transferring data via
the InfiniBand Verbs API. This means that if you wish to, for example, capture
engine data on a machine that doesn’t support ibverbs, you could use
tcpdump(8).
Note
The data rates you intend to process are still limited by the NIC in your
host machine. To truly take advantage of running engines without a
high-speed data network, consider reducing the --adc-sample-rate
by e.g. a factor of ten as this value greatly affects the engine’s data
transmission rate.
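As a rough, back-of-the-envelope illustration of why the sample rate matters, the raw payload rate of a digitiser stream scales linearly with it. The sketch below assumes 10-bit samples, which is an assumption made for illustration and not a value taken from this document, and it ignores packet headers and other overhead.

```python
def raw_sample_rate_gbps(adc_sample_rate: float, sample_bits: int = 10) -> float:
    """Payload rate of one stream of raw samples, in Gb/s.

    sample_bits=10 is an assumed value for illustration; packet overhead
    is not included.
    """
    return adc_sample_rate * sample_bits / 1e9


full = raw_sample_rate_gbps(1712e6)         # the L-band rate used in this chapter
reduced = raw_sample_rate_gbps(1712e6 / 10)  # reduced by a factor of ten
print(f"full: {full:.2f} Gb/s, reduced: {reduced:.3f} Gb/s")
```

Under these assumptions a single full-rate stream already exceeds what a 10GbE NIC can carry, while the reduced rate fits comfortably.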
Controlling the correlator
The correlator components are controlled using katcp. A user can connect to
the <host>:<port> and issue a ?help to see the full range of commands
available. The <host> and <port> values for individual engines are
configurable at runtime, whereas the <host> and <port> values for the
correlator’s product controller are yielded by the master controller after
startup. Standard katcp requests (such as querying and subscribing to sensors)
are not covered here; only application-specific requests are listed. Sensors
are described in katcp sensors.
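Because katcp is a line-based text protocol, a request can also be issued from a few lines of Python using only the standard library. The following is a minimal sketch, not a full katcp client: the function names are hypothetical, message IDs are not used, and the reply is simply taken to be the first line that starts with ! followed by the request name.

```python
import socket

# katcp escape sequences for bytes that cannot appear literally in an argument.
_ESCAPES = {
    b"\\": b"\\\\",
    b" ": b"\\_",
    b"\0": b"\\0",
    b"\n": b"\\n",
    b"\r": b"\\r",
    b"\x1b": b"\\e",
    b"\t": b"\\t",
}


def escape_argument(arg: str) -> bytes:
    """Encode one argument, escaping special bytes; empty becomes \\@."""
    raw = arg.encode("utf-8")
    if not raw:
        return b"\\@"
    return b"".join(_ESCAPES.get(bytes([b]), bytes([b])) for b in raw)


def format_request(name: str, *args: str) -> bytes:
    """Build the wire form of a request, e.g. b'?help\\n'."""
    parts = [b"?" + name.encode("ascii")] + [escape_argument(a) for a in args]
    return b" ".join(parts) + b"\n"


def send_request(host: str, port: int, name: str, *args: str, timeout: float = 5.0):
    """Send one request and return all lines received up to and including the reply."""
    reply_prefix = b"!" + name.encode("ascii")
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(format_request(name, *args))
        with sock.makefile("rb") as stream:
            for line in stream:
                line = line.rstrip(b"\r\n")
                lines.append(line.decode("utf-8", errors="replace"))
                if line.split(b" ", 1)[0] == reply_prefix:
                    break  # the reply message ends the exchange
    return lines


print(format_request("help"))
```

For example, with the xbgpu instance from the earlier example listening on --katcp-port 7150, send_request("host2", 7150, "help") would return the informs sent on connection, the help informs and the final reply line (the host name here is a placeholder).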
dsim
?signals spec [period]
Change the signals that are generated. The signal specification is described in Signal specification. The resulting signal will be periodic with a period of period samples. The given period must divide into the --max-period command-line argument, which is also the default period if none is specified.
The dither that is applied is cached on startup, but is independent for the different streams. Repeating the same command thus gives the same results, provided any randomised terms (such as wgn) use fixed seeds.
It returns an ADC timestamp, which indicates the next sample which is generated with the new signals. This is kept for backwards compatibility, but the same information can be found in the steady-state-timestamp sensor.
?time
Return the current UNIX timestamp on the server running the dsim. This can be used to get an approximate idea of which data is in flight, without depending on the dsim host and the client having synchronised clocks.
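The constraint on the period argument of ?signals can be checked on the client side before a request is sent. This is a small sketch with a hypothetical helper name; the value 1024 used for the maximum period is purely illustrative and is not the dsim's default.

```python
def check_period(period: int, max_period: int) -> int:
    """Return period if it is positive and divides max_period, else raise."""
    if period <= 0 or max_period % period != 0:
        raise ValueError(f"period {period} does not divide max period {max_period}")
    return period


MAX_PERIOD = 1024  # illustrative value only, standing in for --max-period

print(check_period(256, MAX_PERIOD))  # accepted: 1024 = 4 * 256
try:
    check_period(300, MAX_PERIOD)     # rejected: 1024 is not a multiple of 300
except ValueError as exc:
    print(exc)
```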
fgpu
?gain stream input [values...]
Set the complex gains. This has the same semantics as the equivalent katsdpcontroller command, but input must be 0 or 1 to select the input polarisation.
?gain-all stream values...
Set the complex gains for both inputs. This has the same semantics as the equivalent katsdpcontroller command.
?delays stream start-time values...
Set the delay polynomials. This has the same semantics as the equivalent katsdpcontroller command, but takes exactly two delay model specifications (for the two polarisations).
xbgpu
?capture-start, ?capture-stop
Enable or disable transmission of output data. This does not affect transmission of descriptors, which cannot be disabled. In the initial state transmission is disabled, unless the --tx-enabled command-line option has been passed.
Shutting down the correlator
End-to-end correlator shutdown
A user can issue a ?product-deconfigure command to the correlator’s
product controller by connecting to its <host>:<port>. This command
triggers the stop procedure of all engines and dsims running in the target
correlator. More specifically:
the product controller instructs the orchestration software to stop the containers running the engines,
which is received by the engines as a SIGTERM,
finally triggering a halt in the engines for a graceful shutdown.
The shutdown procedures are broadly similar between the dsim, fgpu and xbgpu. Ultimately they all:
finish calculations on data currently in their pipelines,
stop the transmission of their SPEAD descriptors, and
in the case of fgpu and xbgpu, stop their spead2 receivers, which allows for a more natural ending of internal processing operations.
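The general pattern of turning a SIGTERM into a graceful stop can be sketched with asyncio. This is a generic illustration of the idea and not katgpucbf's actual implementation: the function name and the simulated work are hypothetical, and the handler simply sets a flag so that in-flight work completes before the loop exits.

```python
import asyncio
import os
import signal


async def run_until_halted(work_items):
    """Process items until SIGTERM or SIGINT arrives, then stop cleanly.

    The item being processed when the signal arrives is finished first;
    no new items are started afterwards.
    """
    loop = asyncio.get_running_loop()
    halt = asyncio.Event()
    signals = (signal.SIGTERM, signal.SIGINT)
    for sig in signals:
        loop.add_signal_handler(sig, halt.set)
    processed = []
    try:
        for item in work_items:
            if halt.is_set():
                break
            await asyncio.sleep(0.01)  # stand-in for real processing
            processed.append(item)
    finally:
        for sig in signals:
            loop.remove_signal_handler(sig)
    return processed


async def _demo():
    # Simulate the orchestration software stopping the container.
    loop = asyncio.get_running_loop()
    loop.call_later(0.05, os.kill, os.getpid(), signal.SIGTERM)
    return await run_until_halted(range(1000))


done = asyncio.run(_demo())
print(f"processed {len(done)} items before halting")
```

Handling SIGINT the same way means a Ctrl + C in the terminal follows the same path as a SIGTERM from the orchestration software.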
Individual engine shutdown
Once you’ve sufficiently tested, debugged and/or reached the desired level of confusion, there are two options for engine shutdown:
simply issue a Ctrl + C in the terminal window where the engine was invoked, or
connect to the engine’s <host>:<port> and issue a ?halt.
After either of these approaches is executed, the engine will shut down cleanly
and quietly according to the common shutdown procedures. As the
F-Engine Packet Simulator is a simple CLI utility, the fsim just
requires a Ctrl + C to end operations; no katcp commands are supported
here.