Beamforming
GPU kernel
For the low number of beams required for MeerKAT (8 single-pol beams), the beamforming operation is entirely limited by the GPU DRAM bandwidth. There is thus no benefit to using tensor cores, and potentially significant downsides (implementation complexity and reduced accuracy for steering coefficients). The kernel thus uses standard single-precision floating-point operations, and is written to maximise bandwidth usage.
The beamforming operation consists of two phases: coefficient generation and coefficient application. To minimise use of system memory, the coefficients are generated on the fly in the same kernel, rather than computed in a separate kernel and communicated through global memory. This design allows for more dynamic coefficients in the future, such as delays that are updated much more frequently according to a formula.
The external interface to the beamformer has four parameters per beam-antenna pair: a weight, a quantisation gain (common to all antennas), a delay and a fringe-rate. All except the delay are combined by the CPU into a single (complex) weight per beam-antenna pair, and the delay is scaled into a convenient unit for computing the phase slope. The final coefficient applied to channel \(c\), beam \(b\), antenna \(a\) is
\[W_{abc} = w_{ab} e^{i c d_{ab}}\]
where \(w_{ab}\) and \(d_{ab}\) are the weight and delay values passed to the kernel.
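As a minimal Python sketch of the per-coefficient arithmetic (the function name is illustrative, and the exact phase convention, including any constant factor folded into the scaled delay, is an assumption):

```python
import cmath

def coefficient(w: complex, d: float, c: int) -> complex:
    """Steering coefficient for one (antenna, beam) pair at channel c.

    w is the combined complex weight prepared on the CPU (weight,
    quantisation gain and fringe term folded together) and d is the
    delay, pre-scaled so that the phase slope per channel is c * d.
    """
    return w * cmath.exp(1j * c * d)
```

With a zero delay the coefficient reduces to the plain weight, which is a useful sanity check when testing the kernel.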
Each work group of the kernel handles multiple spectra and all beams and antennas, but only a single channel. Conceptually, the kernel first computes \(W_{abc}\) for all antennas and beams and stores it to local memory, then applies it to all antennas and beams. Each input sample is loaded once and then used for all beams, with an accumulator maintained per beam. Since each coefficient is used many times after it is computed (the exact count depends on the work group size, a tuning parameter for which 64-256 is reasonable), the cost of computing coefficients is amortised.
In practice, this would cause local memory usage to scale without bound as the number of antennas increases. To keep it bounded (for a fixed number of beams), the antennas are processed in batches, computing then applying \(W_{abc}\) for each batch before starting the next batch. Larger batch sizes have two advantages:
- The two phases in each batch must be separated by a barrier to coordinate access to the local memory, and larger batches reduce the number of barriers.
- If the batch size is small, the number of coefficients to compute is also small, so there is not enough work to keep all the work items busy, making the coefficient computation less efficient.
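The two-phase, antenna-batched structure can be modelled on the CPU in Python (all names are illustrative; this is a functional model of one work group's loop structure, not the kernel's actual code):

```python
import cmath

def beamform_channel(samples, weights, delays, c, batch_size):
    """Model of beamforming one channel with antenna batching.

    samples[a][t]: complex voltage for antenna a, spectrum t.
    weights[b][a], delays[b][a]: per beam-antenna parameters.
    Returns beams[b][t], the accumulated beam outputs.
    """
    n_ant = len(samples)
    n_spectra = len(samples[0])
    n_beams = len(weights)
    beams = [[0j] * n_spectra for _ in range(n_beams)]
    for start in range(0, n_ant, batch_size):
        batch = range(start, min(start + batch_size, n_ant))
        # Phase 1: compute coefficients for this antenna batch
        # (in the kernel these are stored to local memory).
        coeff = {
            (b, a): weights[b][a] * cmath.exp(1j * c * delays[b][a])
            for b in range(n_beams) for a in batch
        }
        # (in the kernel, a barrier separates the two phases)
        # Phase 2: apply the coefficients; each input sample is
        # loaded once and used for all beams.
        for t in range(n_spectra):
            for a in batch:
                x = samples[a][t]
                for b in range(n_beams):
                    beams[b][t] += coeff[(b, a)] * x
    return beams
```

With unit weights and zero delays the output of each beam is simply the sum over antennas, which makes the model easy to verify.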
Higher beam counts
The design above works well for small numbers of beams (up to about 64 single-pol beams), but the register usage scales with the number of beams and eventually the registers spill to memory, causing very poor performance.
To handle more beams, the kernel batches over beams, just as it does over antennas. The beam batch loop becomes an outer loop, with the rest of the kernel operating as before but only on a single batch.
This does mean that the inputs are loaded multiple times, but caches help significantly here, and the kernel tends to be more compute-bound in this regime.
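The beam-batched variant amounts to wrapping the original kernel body in an outer loop over beam batches. A minimal sketch, where `process_batch` is a hypothetical stand-in for the original per-batch kernel body:

```python
def beamform_beam_batched(samples, process_batch, n_beams, beam_batch):
    """Outer loop over beam batches.

    process_batch(samples, beam_range) runs the original kernel body
    restricted to the given beams; the inputs are re-read for each
    batch, which caches make relatively cheap in practice.
    """
    results = []
    for start in range(0, n_beams, beam_batch):
        beams = range(start, min(start + beam_batch, n_beams))
        results.extend(process_batch(samples, beams))
    return results
```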
Dithering
To improve linearity, a random value in the interval \((-0.5, 0.5)\) is added to each component (real and imaginary) before quantisation. These values are generated using curand, with its underlying XORWOW generator. It is designed for parallel use, with each thread having the same seed but a different sequence parameter to curand_init(). This minimises correlation between sequences generated by different threads. The sequence numbers are also chosen to be distinct between the different engines, to avoid correlation between channels.
Floating-point rounding issues make it tricky to get a perfectly zero-mean distribution. While it is probably inconsequential, simply using curand_uniform(state) - 0.5f will not give zero mean. We solve this by mapping the \(2^{32}\) possible return values of curand() to the range \((-2^{31}, 2^{31})\) with zero represented twice, before scaling to convert to a real value in \((-0.5, 0.5)\). While this is still a deviation from uniformity, it does give a symmetric distribution.
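One consistent instantiation of such a mapping can be sketched in Python (the split point and direction are assumptions; the source specifies only that zero is represented twice and the result is symmetric):

```python
def dither(r: int) -> float:
    """Map a raw uint32 (as returned by curand()) to (-0.5, 0.5).

    The 2^32 inputs map onto the integers in (-2^31, 2^31) with zero
    hit twice, so the integer distribution is exactly symmetric and
    zero-mean; scaling by 2^-32 is exact in floating point.
    """
    HALF = 1 << 31
    if r < HALF:
        v = r - (HALF - 1)   # [0, 2^31)    -> [-(2^31 - 1), 0]
    else:
        v = r - HALF         # [2^31, 2^32) -> [0, 2^31 - 1]
    return v * 2.0 ** -32
```

Because the scaling factor is a power of two and the integers fit in a double's mantissa, no rounding occurs, so the symmetry of the integer mapping carries over exactly to the floats.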