Beamforming
GPU kernel
For the low number of beams required for MeerKAT (8 single-pol beams), the beamforming operation is entirely limited by the GPU DRAM bandwidth. There is thus no benefit to using tensor cores, and potentially significant downsides (implementation complexity and reduced accuracy for steering coefficients). The kernel thus uses standard single-precision floating-point operations, and is written to maximise bandwidth usage.
The beamforming operation consists of two phases: coefficient generation and coefficient application. To minimise use of system memory, the coefficients are generated on the fly in the same kernel, rather than computed in a separate kernel and communicated through global memory. This design allows for more dynamic coefficients in the future, such as delays that are updated much more frequently according to a formula.
The external interface to the beamformer has four parameters per beam-antenna pair: a weight, a quantisation gain (common to all antennas), a delay and a fringe-rate. All except the delay are combined by the CPU into a single (complex) weight per beam-antenna pair, and the delay is scaled into a convenient unit for computing the phase slope. The final coefficient applied to channel \(c\), beam \(b\), antenna \(a\) is
\[W_{abc} = w_{ab} e^{i c d_{ab}}\]
where \(w_{ab}\) and \(d_{ab}\) are the weight and delay values passed to the kernel.
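As a minimal Python sketch of the per-coefficient arithmetic (the function name is illustrative, and the exact phase convention, including any constant factor folded into the scaled delay, is an assumption):

```python
import cmath

def coefficient(w: complex, d: float, c: int) -> complex:
    """Steering coefficient for one (antenna, beam) pair at channel c.

    w is the combined complex weight prepared on the CPU (weight,
    quantisation gain and fringe term folded together) and d is the
    delay, pre-scaled so that the phase slope per channel is c * d.
    """
    return w * cmath.exp(1j * c * d)
```

With a zero delay the coefficient reduces to the plain weight, which is a useful sanity check when testing the kernel.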
Each work group of the kernel handles multiple spectra and all beams and antennas, but only a single channel. Conceptually, the kernel first computes \(W_{abc}\) for all antennas and beams and stores it to local memory, then applies it to all antennas and beams. Each input sample is loaded once and then used for all beams, with an accumulator maintained per beam. Since each coefficient is used many times after it is computed (the exact count depends on the work group size, a tuning parameter for which 64-256 is reasonable), the cost of computing coefficients is amortised.
In practice, this would cause local memory usage to scale without bound as the number of antennas increases. To keep it bounded (for a fixed number of beams), the antennas are processed in batches, computing then applying \(W_{abc}\) for each batch before starting the next batch. Larger batch sizes have two advantages:
- The two phases in each batch must be separated by a barrier to coordinate access to the local memory, and larger batches reduce the number of barriers.
- If the batch size is small, the number of coefficients to compute is also small, so there is not enough work to keep all the work items busy, making the coefficient computation less efficient.
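The two-phase, antenna-batched structure can be modelled on the CPU in Python (all names are illustrative; this is a functional model of one work group's loop structure, not the kernel's actual code):

```python
import cmath

def beamform_channel(samples, weights, delays, c, batch_size):
    """Model of beamforming one channel with antenna batching.

    samples[a][t]: complex voltage for antenna a, spectrum t.
    weights[b][a], delays[b][a]: per beam-antenna parameters.
    Returns beams[b][t], the accumulated beam outputs.
    """
    n_ant = len(samples)
    n_spectra = len(samples[0])
    n_beams = len(weights)
    beams = [[0j] * n_spectra for _ in range(n_beams)]
    for start in range(0, n_ant, batch_size):
        batch = range(start, min(start + batch_size, n_ant))
        # Phase 1: compute coefficients for this antenna batch
        # (in the kernel these are stored to local memory).
        coeff = {
            (b, a): weights[b][a] * cmath.exp(1j * c * delays[b][a])
            for b in range(n_beams) for a in batch
        }
        # (in the kernel, a barrier separates the two phases)
        # Phase 2: apply the coefficients; each input sample is
        # loaded once and used for all beams.
        for t in range(n_spectra):
            for a in batch:
                x = samples[a][t]
                for b in range(n_beams):
                    beams[b][t] += coeff[(b, a)] * x
    return beams
```

With unit weights and zero delays the output of each beam is simply the sum over antennas, which makes the model easy to verify.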
Higher beam counts
The design above works well for small numbers of beams (up to about 64 single-pol beams), but the register usage scales with the number of beams and eventually the registers spill to memory, causing very poor performance.
To handle more beams, the kernel batches over beams, just as it does over antennas. The beam batch loop becomes an outer loop, with the rest of the kernel operating as before but only on a single batch.
This does mean that the inputs are loaded multiple times, but caches help significantly here, and the kernel tends to be more compute-bound in this regime.
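The beam-batched variant amounts to wrapping the original kernel body in an outer loop over beam batches. A minimal sketch, where `process_batch` is a hypothetical stand-in for the original per-batch kernel body:

```python
def beamform_beam_batched(samples, process_batch, n_beams, beam_batch):
    """Outer loop over beam batches.

    process_batch(samples, beam_range) runs the original kernel body
    restricted to the given beams; the inputs are re-read for each
    batch, which caches make relatively cheap in practice.
    """
    results = []
    for start in range(0, n_beams, beam_batch):
        beams = range(start, min(start + beam_batch, n_beams))
        results.extend(process_batch(samples, beams))
    return results
```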
Dithering
To improve linearity, a random value in the interval \((-0.5, 0.5)\) is added to each component (real and imaginary) before quantisation. These values are generated using curand, with its underlying XORWOW generator. It is designed for parallel use, with each thread having the same seed but a different sequence parameter to curand_init(). This minimises correlation between sequences generated by different threads. The sequence numbers are also chosen to be distinct between the different engines, to avoid correlation between channels.
Floating-point rounding issues make it tricky to get a perfectly zero-mean distribution. While it is probably inconsequential, simply using curand_uniform(state) - 0.5f will not give zero mean. We solve this by mapping the \(2^{32}\) possible return values of curand() to the range \((-2^{31}, 2^{31})\) with zero represented twice, before scaling to convert to a real value in \((-0.5, 0.5)\). While this is still a deviation from uniformity, it does give a symmetric distribution.
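One consistent instantiation of such a mapping can be sketched in Python (the split point and direction are assumptions; the source specifies only that zero is represented twice and the result is symmetric):

```python
def dither(r: int) -> float:
    """Map a raw uint32 (as returned by curand()) to (-0.5, 0.5).

    The 2^32 inputs map onto the integers in (-2^31, 2^31) with zero
    hit twice, so the integer distribution is exactly symmetric and
    zero-mean; scaling by 2^-32 is exact in floating point.
    """
    HALF = 1 << 31
    if r < HALF:
        v = r - (HALF - 1)   # [0, 2^31)    -> [-(2^31 - 1), 0]
    else:
        v = r - HALF         # [2^31, 2^32) -> [0, 2^31 - 1]
    return v * 2.0 ** -32
```

Because the scaling factor is a power of two and the integers fit in a double's mantissa, no rounding occurs, so the symmetry of the integer mapping carries over exactly to the floats.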