Beamforming =========== GPU kernel ---------- For the low number of beams required for MeerKAT (8 single-pol beams), the beamforming operation is entirely limited by the GPU DRAM bandwidth. There is thus no benefit to using tensor cores, and potentially significant downsides (implementation complexity and reduced accuracy for steering coefficients). The kernel thus uses standard single-precision floating-point operations, and is written to maximise bandwidth usage. The beamforming operation consists of two phases: coefficient generation and coefficient application. To minimise use of system memory, the coefficients are generated on the fly in the same kernel, rather than computed in a separate kernel and communicated through global memory. This design allows for more dynamic coefficients in the future, such as delays that are updated much more frequently according to a formula. The external interface to the beamformer has four parameters per beam-antenna pair: a weight, a quantisation gain (common to all antennas), a delay and a fringe-rate. All except the delay are combined by the CPU into a single (complex) weight per beam-antenna pair, and the delay is scaled into a convenient unit for computing the phase slope. The final coefficient applied to channel :math:`c`, beam :math:`b`, antenna :math:`a` is .. math:: W_{abc} = w_{ab} e^{j\pi cd_{ab}} where :math:`w_{ab}` and :math:`d_{ab}` are the weight and delay values passed to the kernel. Each work-group of the kernel handles multiple spectra and all beams and antennas, but only a single channel. Conceptually, the kernel first computes :math:`W_{abc}` for all antennas and beams and stores it to local memory, then applies it to all antennas and beams. Each input sample is loaded once before it is used for all beams. An accumulator is maintained for each beam. Since each coefficient is used many times (the number depends on the work-group size, which is a tuning parameter, but 64-256 is reasonable) after it is computed, the cost for computing coefficients is amortised. In practice, this would cause local memory usage to scale without bound as the number of antennas increases. To keep it bounded (for a fixed number of beams), the antennas are processed in batches, computing then applying :math:`W_{abc}` for each batch before starting the next batch. Larger batch sizes have two advantages: 1. The two phases in each batch need to be separated by a barrier to coordinate access to the shared memory. Larger batches reduce the number of barriers. 2. If the batch size is small, the number of coefficients to compute is also small, and there is not enough work to keep all the work-items busy, making the coefficient computation less efficient. Higher beam counts ^^^^^^^^^^^^^^^^^^ The design above works well for small numbers of beams (up to about 64 single-pol beams), but the register usage scales with the number of beams and eventually the registers spill to memory, causing very poor performance. To handle more beams, the kernel batches over beams, just as it does over antennas. The beam batch loop becomes an outer loop, with the rest of the kernel operating as before but only on a single batch. This does mean that the inputs are loaded multiple times, but caches help significantly here, and the kernel tends to be more compute-bound in this domain. Data flow --------- The host side of the beamforming is simpler than for correlation because there is no accumulation. For simplicity, the output chunk size (in time) is set to the same as the input chunk size. Because the kernel operates on all beams together, there is only one instance of the :class:`.BPipeline` class, and it handles all the beams. This is in contrast to :class:`.XPipeline`, which only handles a single :class:`Output`. Unlike the correlation pipeline, the beamformer pipeline can be controlled dynamically by setting weights and gains. Transferring these to the GPU is somewhat expensive and should not be done for every chunk. Instead, there is an associated version number. When the weights/gains are updated, the master version number is incremented. Before processing a chunk on the GPU, this version number is compared to a version number associated with the GPU copy. If they differ, an update is performed. Missing data handling ^^^^^^^^^^^^^^^^^^^^^ When some input data is not received, we wish to exclude it from the beam sums. At present this is handled by zeroing out missing data when it is received in :meth:`.XBEngine._receiver_loop`. This is potentially expensive if a lot of data goes missing. Should it prove too expensive, we could zero the data on the GPU, or pass metadata to the kernel to indicate which values should not be used. The presence of heaps is also tracked to allow the ``beam_ants`` SPEAD item to be populated on transmission. This item is a compromise: ideally we'd like to indicate exactly which antennas were present, but this would require more than the 48 bits available in a SPEAD immediate item.