Correlation

Implementation details of the correlation kernel

The correlation kernel of the XB-engine is implemented using a modified version of code originally written by John Romein of ASTRON.

For an overview of the tensor-core correlator compute kernel, see the GTC presentation. This section gets into the low-level details of the implementation. Familiarity with CUDA (including warp matrix multiplies) as well as the function of a correlator are assumed.

Complex multiplications

The tensor cores can only perform calculations on real numbers. The correlation kernel takes complex numbers as adjacent pairs of real and imaginary parts, and constructs matrices in a way that allows complex multiplications to be performed. The GTC presentation shows diagrams of this, so only a brief explanation will be given.

Consider a correlation product \(A \times B^H\). The real part of \((a_r + ja_i)(\overline{b_r + jb_i})\) is \(a_rb_r + a_ib_i\), so simply by placing the real and imaginary parts in alternate columns of \(A\) and \(B\), each dot product between rows will give the desired real part.

The imaginary part is \(a_ib_r - a_rb_i\), so we could construct a second matrix \(B'\) containing \((-b_i, b_r)\) in each pair of adjacent real entries (instead of \((b_r, b_i)\)), and then the product would give the imaginary part. If instead of using two separate matrices, we interleave the rows of \(B\) and \(B'\), the resulting product will interleave the original products (by column) and hence place the real and imaginary components together again.
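To make this concrete, consider a single sample: with a row of \(A\) holding \((a_r, a_i)\), and the interleaved rows of \(B\) and \(B'\) holding \((b_r, b_i)\) and \((-b_i, b_r)\), the row-by-row dot products amount to

\[
\begin{pmatrix} a_r & a_i \end{pmatrix}
\begin{pmatrix} b_r & -b_i \\ b_i & b_r \end{pmatrix}
=
\begin{pmatrix} a_r b_r + a_i b_i & a_i b_r - a_r b_i \end{pmatrix},
\]

giving the real and imaginary parts in adjacent outputs.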

In the code this transformation is implemented by conj_perm(). It uses bit manipulation to operate on a 32-bit integer that potentially holds multiple samples. This code assumes a little-endian architecture, which is all that CUDA supports. Consider the code for NR_BITS of 8:

__byte_perm(v, 0x00FF00FF - (v & 0xFF00FF00), 0x2705);

In this case v contains two complex numbers. The mask selects the imaginary components. These are effectively subtracted from 0 to negate them; the left-hand side of the subtraction is 0x00FF00FF rather than 0 so that a borrow from the lower component cannot propagate into the higher component. The other two bytes of the result are not relevant. Finally, __byte_perm() selects the four desired bytes in the appropriate order.
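As a concrete trace (sample values chosen arbitrarily for illustration):

// v = 0x03FC0201 holds (1 + 2j, -4 + 3j) as bytes (re0, im0, re1, im1)
//   = (0x01, 0x02, 0xFC, 0x03) in little-endian order.
//
//   v & 0xFF00FF00           == 0x03000200  // imaginary bytes isolated
//   0x00FF00FF - 0x03000200  == 0xFDFEFEFF  // byte 1 = -2, byte 3 = -3;
//                                           // bytes 0 and 2 are junk
//   __byte_perm(v, 0xFDFEFEFF, 0x2705) == 0xFCFD01FE
//
// i.e. the bytes are now (-2, 1, -3, -4): each (re, im) pair has become
// (-im, re), as required for the imaginary-part rows of the expanded B.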

It should be noted that this transformation maps -128 to itself (rather than to +128, which cannot be represented in a two’s-complement int8). The caller is responsible for ensuring that this input value is not used.

Parameters and constants

NR_BITS specifies the type of the incoming samples:

  • 4 means each byte contains a packed 4-bit real and 4-bit imaginary.

  • 8 means the real and imaginary components are signed bytes.

  • 16 means the real and imaginary components are half-precision floats.

NR_SAMPLES_PER_CHANNEL is the number of samples in time processed in a single call to the kernel. These are divided into groups of NR_TIMES_PER_BLOCK, which is the number loaded into shared memory at a time before computing with them. Changing NR_TIMES_PER_BLOCK would require substantial changes to the loading code: the fetches are hard-coded to use a certain number of bits of the thread ID to index this dimension. Increasing it significantly (e.g., to match the 256 that is native to MeerKAT) would probably require too much shared memory.

NR_RECEIVERS_PER_BLOCK refers to the size of the subsets of antenna data in the input matrix which will be correlated per thread block. It has three possible values (32, 48 and 64) which correspond to processing 32×32, 48×48 or 64×32 (not 64×64) regions of the correlation matrix. The kernel uses NR_RECEIVERS_PER_BLOCK_X for the second dimension.

NR_CHANNELS is the number of channels over which to correlate, but there seems to be little need for this to be baked into the kernel. It only forms the outermost dimension of the inputs and outputs, and the Y axis of the thread grid, and could just as easily be dynamic.

NR_RECEIVERS_PER_TCM_X and NR_RECEIVERS_PER_TCM_Y are the number of (dual-pol) receivers per warp matrix multiply. Keeping in mind that the “Y” receiver corresponds to rows (and to aSamples temporary storage, with “X” corresponding to bSamples), this is 8×4 (8×2 for 4-bit samples). With dual-pol receivers that equates to 16×8 inputs. The reason it is not 16×16 (to match the matrix shape supported by the tensor cores) is the expansion of the B matrix for complex multiplication as described above.

In doCorrelateRectangle(), nrFragmentsX and nrFragmentsY indicate the number of “fragments” (tensor-core matrices) that the warp (not the thread block) is responsible for along each dimension.

Thread indexing

There is a hard-coded value of 4 warps per block, arranged as 32×2×2. The first axis simply determines the position within a warp. The other two axes are used for different purposes in different parts of the code. Most typically, they subdivide the output block into quadrants (so for example a 64×32 output block is divided into four 32×16 output blocks, with one warp responsible for computing each). In loading code, the threadIdx is flattened into a 1D index (tid).

The thread grid is 2D. The y axis indicates the channel, while the x axis selects an output block within the output triangle. Some trickery with square roots is used to perform this mapping.
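The mapping inverts the triangular-number formula. As a minimal sketch (not the exact upstream code; the function name is illustrative):

// Map a linear block index within the lower triangle back to (x, y)
// coordinates, where block == y * (y + 1) / 2 + x and 0 <= x <= y.
// sqrtf() is single precision, so the rounding would need care for
// very large block counts.
__device__ void triangleCoordinates(unsigned block, unsigned &x, unsigned &y)
{
    y = (unsigned) ((sqrtf(8.0f * block + 1.0f) - 1.0f) * 0.5f);
    x = block - y * (y + 1) / 2;
}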

When NR_RECEIVERS_PER_BLOCK is 32 or 48, the output space is dealt with in square blocks, in doCorrelateRectangle(). The correlation matrix is conjugate symmetric, so this involves computing some redundant elements, which are discarded as part of storeVisibilities(). When it is 64, things get more complicated: certain blocks are processed with doCorrelateTriangle(), which is optimised for blocks that lie on the main diagonal.

Figure: Block, warp and fragment layout when NR_RECEIVERS_PER_BLOCK is 64 and NR_BITS is 8 or 16.

The figure above illustrates the arrangement for a 192-antenna array. The numbers in white boxes are the block IDs (blockIdx.x). Each green block is processed with doCorrelateRectangle(); it is shown divided into four quadrants (corresponding to the warps) and further subdivided into the fragments computed by each warp. The red/blue blocks are processed with doCorrelateTriangle(). The three blue regions are processed using warps 1-3 (a lookup table indicates the starting position), while the three red areas in each triangle are handled by warp 0.

When NR_BITS is 4 the situation is very similar, but the fragments are 8×2 instead of 8×4.

Data loading

A batch of voltage samples is loaded into shared memory, then used from there. Since each warp is computing multiple output fragments, each voltage is used by multiple matrix multiplies, and so caching them in shared memory reduces global memory traffic. The shared memory is also double-buffered, which is presumably to increase instruction-level parallelism and reduce the number of synchronisations required.

Rather than performing loads using the natural type of the samples, the code performs them using wide types like int4, type-casting pointers to access the raw memory, presumably to make more efficient use of the memory system. It should be noted that this sort of type-punning is undefined behaviour in C++, but there doesn’t seem to be a safer alternative (memcpy is safe, but it works one byte at a time, which destroyed performance).
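As a sketch of the pattern (not the actual kernel code): sixteen consecutive 8-bit samples can be fetched in a single 16-byte transaction by reinterpreting the pointer:

// Undefined behaviour in standard C++, but works in practice with nvcc.
// The source pointer must be 16-byte aligned.
__device__ int4 loadSamples(const signed char *samples)
{
    return *reinterpret_cast<const int4 *>(samples);
}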

Loading is implemented using the FetchData class. At construction time it takes thread-specific offsets for the receiver (antenna), polarisation and time. The load() member function takes a base channel, time and receiver that are uniform across the block. If the element to be accessed is out of bounds, the data is not loaded and is left as zero.

Asynchronous loading

Note

The asynchronous loading support has been removed in the katgpucbf fork, as it was not really compatible with the axis reordering. This section is left as a reference should it be brought back in future.

When there is support in hardware (Compute Capability 8.0 or later, i.e., Ampere) and a new enough CUDA version, an asynchronous memory copy is used for extra latency hiding (or possibly to reduce register pressure). It’s implemented using an experimental (and deprecated) version of the API; for reference one needs to read the CUDA 11.1 programming guide rather than the latest version.

The READ_AHEAD macro is slightly confusing. Let’s assume a large enough NR_SAMPLES_PER_CHANNEL that READ_AHEAD is 2 and NR_SHARED_BUFFERS is 4. Then the following can all be occurring simultaneously:

  1. Reading from shared buffer i to do the computations.

  2. Asynchronous copies to shared buffers i + 1 to i + 3, inclusive (note that this accounts for 3 buffers, not 2).

Within a single thread there can only be two async copies outstanding while doing the computations, because before starting computation on a buffer it waits for the copy targeting that buffer to complete. But because there is no call to __syncthreads() between the end of computation and the scheduling of the following copy, the scenario above can occur overall, with different threads in different parts of the loop. This explains why 4 buffers are needed.
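A hedged sketch of this loop structure (not the actual kernel; copyToShared() and computeFrom() are illustrative placeholders for the __pipeline_memcpy_async() calls and the tensor-core computations):

#include <cuda_pipeline.h>

#define READ_AHEAD        2
#define NR_SHARED_BUFFERS 4

extern __device__ void copyToShared(unsigned buffer, unsigned block);
extern __device__ void computeFrom(unsigned buffer);

__device__ void pipelineLoop(unsigned nrBlocks)
{
    for (unsigned i = 0; i < READ_AHEAD && i < nrBlocks; i++)
    {
        copyToShared(i % NR_SHARED_BUFFERS, i);
        __pipeline_commit();
    }
    for (unsigned i = 0; i < nrBlocks; i++)
    {
        // Schedule the copy READ_AHEAD iterations ahead. There is no
        // __syncthreads() between the previous iteration's compute and
        // this point, so a fast thread can issue the copy for buffer
        // i + 2 while a slow thread is still computing on buffer i - 1:
        // buffers i, i + 1 and i + 2 are then all being written while
        // buffer i - 1 is still being read, hence the four buffers.
        if (i + READ_AHEAD < nrBlocks)
            copyToShared((i + READ_AHEAD) % NR_SHARED_BUFFERS, i + READ_AHEAD);
        __pipeline_commit();  // commit even when empty, to keep counts simple
        // Wait until this thread's copy into buffer i has completed, then
        // synchronise so that every other thread's copy has too.
        __pipeline_wait_prior(READ_AHEAD);
        __syncthreads();
        computeFrom(i % NR_SHARED_BUFFERS);
    }
}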

Result storage

The result storage is particularly complicated in an attempt to optimise the process. CUDA says that the fragment type has implementation-defined memory layout, and the individual matrix elements can only be portably read by using store_matrix_sync() to write the results to shared or global memory. The memory layouts supported by this function don’t correspond to the packed triangular shape the kernel wants, so some extra steps are required.

For a set of recognised architectures, the elements of the fragment class are read directly, using knowledge of the architecture-specific memory layout. In the fallback case (where PORTABLE is defined), the fragment is written to shared-memory scratch space then read back to extract the elements.
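A hedged sketch of the fallback path (a minimal reconstruction, not the actual code):

#include <mma.h>
using namespace nvcuda::wmma;

// Write the accumulator fragment to per-warp shared scratch with
// store_matrix_sync(), then read elements back at known (row, column)
// positions. scratch must hold at least 16 * 16 ints.
__device__ int readElement(const fragment<accumulator, 16, 16, 16, int> &frag,
                           int *scratch, unsigned row, unsigned col)
{
    store_matrix_sync(scratch, frag, 16, mem_row_major);
    __syncwarp();
    return scratch[row * 16 + col];
}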

The upstream code is designed to do all the accumulation inside the kernel, by passing in all the data for the entire dump. While this is efficient (only writing results to global memory once), it would limit the dump period based on the available memory. In katgpucbf, the code has been modified so that results are added to the existing values in global memory.

Accumulations, Dumps and Output Data

The input data is accumulated before being output. For every output heap, multiple input heaps are received.

A heap from a single F-Engine consists of a fixed number of spectra, referred to as spectra_per_heap. Each spectrum corresponds to a different point in time, and the timestamp difference between consecutive spectra is equal to the value of --samples-between-spectra. The timestamp difference between two consecutive heaps from the same F-Engine is therefore equal to:

heap_timestamp_step = spectra_per_heap * samples_between_spectra.

The value of spectra_per_heap is not set explicitly on the command line, but rather inferred from the --jones-per-batch argument. The latter is the product of spectra_per_heap and the stream’s channel count (and thus, the number of Jones vectors in an F-engine output batch).
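As a purely illustrative example (plausible MeerKAT-like numbers, not a prescribed configuration): with 4096 channels and spectra_per_heap = 256, one would pass --jones-per-batch=1048576 (256 × 4096), and with samples_between_spectra = 8192 the heap timestamp step would be 256 × 8192 = 2097152.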

A batch of heaps is a collection of heaps from different F-Engines with the same timestamp. A chunk consists of multiple consecutive batches (the number is given by the option --heaps-per-fengine-per-chunk). Correlation generally occurs one chunk at a time, accumulating results. The correlation kernel is modified in several ways to support this:

  • The FetchData class splits the time index into a batch index and an offset within the batch, so that the rest of the code can be oblivious to batches, and just work in time “blocks” (NR_TIMES_PER_BLOCK spectra).

  • The various functions take a runtime range of blocks to process.

  • To provide more parallelism (important when each engine is processing only a few channels), the grid has an extra Z dimension. The time blocks to be processed are divided amongst the values of blockIdx.z. There is a trade-off here: one wants to serially accumulate over as many blocks as possible to reduce the final global memory traffic for writing back results. To avoid race conditions in accumulation, each value of blockIdx.z uses a separate global-memory accumulator.

An accumulation period is called an accumulation and the data output from that accumulation is normally called a dump — the terms are used interchangeably. Once all the data for a dump has been correlated, the separate accumulators are added together (“reduced”) to produce a final result. This reduction process also converts from 64-bit to 32-bit integers, saturating if necessary, and counts the number of saturated visibilities.
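A hedged sketch of such a reduction (not katgpucbf’s actual kernel, just the shape of the operation):

#include <climits>

__device__ int clampToInt32(long long v, bool &sat)
{
    if (v < INT_MIN) { sat = true; return INT_MIN; }
    if (v > INT_MAX) { sat = true; return INT_MAX; }
    return (int) v;
}

// Sum the per-blockIdx.z accumulators for one visibility, saturate from
// 64-bit to 32-bit, and count how many visibilities saturated.
__global__ void reduceAccumulators(const longlong2 *accum, int2 *out,
                                   unsigned *nrSaturated,
                                   unsigned nrAccumulators, unsigned nrVis)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nrVis)
        return;
    long long re = 0, im = 0;
    for (unsigned z = 0; z < nrAccumulators; z++)
    {
        re += accum[z * nrVis + i].x;
        im += accum[z * nrVis + i].y;
    }
    bool sat = false;
    out[i] = make_int2(clampToInt32(re, sat), clampToInt32(im, sat));
    if (sat)
        atomicAdd(nrSaturated, 1u);
}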

The number of batches to accumulate in an accumulation is given by the --heap-accumulation-threshold flag. The timestamp difference between successive dumps is therefore equal to:

timestamp_difference = spectra_per_heap * samples_between_spectra * heap_accumulation_threshold

The output heap timestamp is aligned to an integer multiple of timestamp_difference (equivalent to the current SKARAB “auto-resync” logic). The total accumulation time is equal to:

accumulation_time_s = timestamp_difference / adc_sample_rate(Hz) seconds.
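As a worked example (illustrative numbers): timestamps count ADC samples, so with the 1712 MSps MeerKAT digitisers a timestamp_difference of 856000000 would correspond to 856000000 / 1712000000 = 0.5 s; in practice timestamp_difference is a whole number of batches, so the achieved dump period is only approximately the requested one.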

The output heap contains multiple packets and these packets are distributed over the entire accumulation_time_s interval to reduce network burstiness. The default configuration in katgpucbf.xbgpu.main is for 0.5 second dumps when using the MeerKAT 1712 MSps L-band digitisers.

The dump boundaries are aligned to whole batches, but may fall in the middle of a chunk. In this case, each invocation of the correlation kernel will only process a subset of the batches in the chunk.

Output Heap Payload Composition

Each correlation product contains a real and an imaginary sample (both 32-bit integers), for a combined size of 8 bytes per correlation product. The ordering of the correlation products is given in the xeng-stream-name-bls-ordering sensor in the product controller, but can be calculated deterministically: get_baseline_index() indicates the ordering of the baselines, and the four individual correlation products are always ordered aa, ba, ab, bb, where a and b can be either vertical or horizontal polarisation (v or h), depending on the configuration of the instrument.
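For illustration, the baseline ordering is presumably the usual packed-triangle scheme; the sketch below is a guess at the formula rather than a quotation, and get_baseline_index() itself should be treated as authoritative.

// Hypothetical reconstruction: the baseline for antennas (a, b) with
// a <= b is packed at triangular index b * (b + 1) / 2 + a.
__host__ __device__ unsigned baselineIndex(unsigned a, unsigned b)
{
    return b * (b + 1) / 2 + a;
}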

All the baselines for a single channel are grouped together contiguously in the heap, and each X-engine correlates a contiguous subset of the entire spectrum. For example, in an 80-antenna, 8192-channel array with 64 X-engines, each X-engine output heap contains 8192/64 = 128 channels.

The heap payload size in this example is equal to

channels_per_heap * correlation_products * complex_sample_size = 128 * 12960 * 8 = 13,271,040 bytes or 12.656 MiB.

Missing Data Handling

As with fgpu, metadata indicating which input heaps are present or missing is passed down the pipeline alongside the data. If some input data is missing, processing is performed as normal. Unlike fgpu, which suppresses transmissions for which some input data was missing, xbgpu will replace affected baselines with a “magic number” of (-2**31, 1), so that unaffected baselines can still be transmitted while the receiver knows that those baselines are invalid. If a dump is affected by missing data on all antennas, it will still be transmitted, but will contain only the magic value and no useful data.