DSP Engine Design

Terminology

We will use OpenCL terminology, as it is more generic. If you’re more familiar with CUDA terminology, katsdpsigproc’s introduction has a table mapping the most important concepts. For definitions of the concepts, refer to chapter 2 of the OpenCL specification. A summary of the most relevant concepts can also be found here.

Glossary

This section aims to clarify some potentially confusing terms used within the source code.

Chunk

An array of data and associated metadata, including a timestamp. Chunks are the granularity at which data is managed within an engine (e.g., for transfer between CPU and GPU). To amortise per-chunk costs, chunks typically contain many SPEAD heaps.
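
For illustration, a chunk can be pictured roughly as follows. This is a minimal sketch: the field names are hypothetical, and the real class carries additional bookkeeping (such as flags recording which heaps actually arrived).

    from dataclasses import dataclass

    import numpy as np

    @dataclass
    class Chunk:
        data: np.ndarray       # payload covering many consecutive SPEAD heaps
        timestamp: int         # ADC sample count of the first sample in data
        present: np.ndarray    # one boolean per heap: did that heap arrive?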

Command Queue

Channel for submitting work to a GPU. See katsdpsigproc.abc.AbstractCommandQueue.

Device

GPU or other OpenCL accelerator device (which in general could even be the CPU). See katsdpsigproc.abc.AbstractDevice.

Engine

A single process which consumes and/or produces SPEAD data, and is managed by katcp. An F-engine processes data for one antenna; an XB-engine processes data for a configurable subset of the correlator’s bandwidth. It is expected that a correlator will run more than one engine per server.

Event

Used for synchronisation between command queues or between a command queue and the host. See katsdpsigproc.abc.AbstractEvent.

Heap

Basic message unit of SPEAD. Heaps may comprise one or more packets.

Queue

See asyncio.Queue. Not to be confused with Command Queues.

Queue Item

See QueueItem. These are passed around on Queues.

Stream

A stream of SPEAD data. The scope is somewhat flexible, depending on the viewpoint, and might span one or many multicast groups. For example, one F-engine sends to many XB-engines (using many multicast groups), and this is referred to as a single stream in the fgpu code. Conversely, an XB-engine receives data from many F-engines (but using only one multicast group), and that is also called “a stream” within the xbgpu code.

This should not be confused with a CUDA stream, which corresponds to a Command Queue in OpenCL terminology.

Stream Group

A group of incoming streams whose data are combined in chunks (see spead2.recv.ChunkStreamRingGroup). Stream groups can be logically treated like a single stream, but allow receiving to be scaled across multiple CPU cores (with one member stream per thread).

Timestamp

Timestamps are expressed in units of ADC (analogue-to-digital converter) samples, measured from a configurable “sync time”. When a timestamp is associated with a collection of data, it generally reflects the timestamp of the first ADC sample that forms part of that data.
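
As a worked example (the sample rate and sync time below are assumed values, purely for illustration), converting such a timestamp to UNIX time follows directly from this definition:

    adc_sample_rate = 1712e6        # ADC samples per second (assumed)
    sync_time = 1_600_000_000.0     # UNIX time of the sync epoch (assumed)
    timestamp = 8_192_000_000       # ADC samples since the sync time

    unix_time = sync_time + timestamp / adc_sample_rate
    # 8_192_000_000 / 1712e6 is about 4.785 s, so unix_time is about 1600000004.785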

Operation

The general operation of the DSP engines is illustrated in the diagram below:

[Figure: Data flow. Double-headed arrows represent data passed through a queue and returned via a free queue.]

The F-engine uses two input streams and aligns the two incoming polarisations; the XB-engine has only one input stream.

There are not always multiple processing pipelines. Where they exist, they support multiple outputs generated from the same input, such as wide- and narrow-band F-engine outputs, or correlation products and beams. Separate outputs use separate output streams so that they can interleave their outputs while transmitting at different rates. They share a thread to reduce the number of CPU cores required.

Chunking

GPUs are massively parallel, and exploiting them fully requires large batch sizes (millions of elements). To accommodate this, the input packets are grouped into “chunks” of a fixed size. There is a tradeoff in the chunk size: large chunks use more memory, add more latency to the system, and reduce LLC (last-level cache) hit rates, while smaller chunks limit parallelism and, in the case of the F-engine, increase the overheads associated with overlapping PFB (polyphase filter bank) windows.

Chunking also helps reduce the impact of slow Python code. Digitiser output heaps consist of only a single packet, and while F-engine output heaps can span multiple packets, they are still rather small and involving Python on a per-heap basis would be far too slow. We use spead2.recv.ChunkRingStream or spead2.recv.ChunkStreamRingGroup to group heaps into chunks, which means Python code is only run per-chunk.
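
The essence of the grouping is a placement computation: given a heap's timestamp, work out which chunk it belongs to and where within that chunk its payload should land. The sketch below shows the idea in plain Python with assumed constants; in practice spead2 runs the placement logic in a compiled callback so that no Python code executes per heap.

    CHUNK_SAMPLES = 2**24    # ADC samples per chunk (assumed)
    SAMPLE_BITS = 10         # bits per digitiser sample (assumed)

    def place(timestamp: int) -> tuple[int, int]:
        """Return (chunk index, byte offset within the chunk) for a heap."""
        chunk_idx = timestamp // CHUNK_SAMPLES
        sample_offset = timestamp % CHUNK_SAMPLES
        return chunk_idx, sample_offset * SAMPLE_BITS // 8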

Queues

Both engines consist of several components which run independently of each other, either in threads (spead2's C++ code) or in Python's asyncio framework. The general pattern is that adjacent components are connected by a pair of queues: one carrying full buffers of data forward, and one returning free buffers. This approach allows all memory to be allocated up front. Slow components thus cause back-pressure on upstream components by not returning buffers through the free queue fast enough. The number of buffers needs to be large enough to smooth out jitter in processing times.
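
The following is a minimal sketch of this pattern using asyncio.Queue; the stage behaviour and buffer sizes are invented for illustration.

    import asyncio

    import numpy as np

    async def producer(free_queue: asyncio.Queue, data_queue: asyncio.Queue) -> None:
        for i in range(10):
            buf = await free_queue.get()   # back-pressure: waits if the consumer is slow
            buf.fill(i)                    # stand-in for receiving/processing data
            await data_queue.put(buf)
        await data_queue.put(None)         # sentinel: no more data

    async def consumer(data_queue: asyncio.Queue, free_queue: asyncio.Queue) -> None:
        while (buf := await data_queue.get()) is not None:
            ...                            # use the buffer
            await free_queue.put(buf)      # return it for reuse

    async def main() -> None:
        free_queue: asyncio.Queue = asyncio.Queue()
        data_queue: asyncio.Queue = asyncio.Queue()
        for _ in range(4):                 # all buffers are allocated up front
            free_queue.put_nowait(np.empty(1024, np.uint8))
        await asyncio.gather(producer(free_queue, data_queue),
                             consumer(data_queue, free_queue))

    asyncio.run(main())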

A special case is the split from the receiver into multiple processing pipelines. In this case each processing pipeline has an incoming queue with new data (and each buffer is placed in each of these queues), but a single queue for returning free buffers. Since a buffer can only be placed on the free queue once it has been processed by all the pipelines, a reference count is held with the buffer to track how many usages it has. This should not be confused with the Python interpreter’s reference count, although the purpose is similar.
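
A sketch of that reference count follows (the class and attribute names are illustrative, not the actual implementation). Because all pipelines run on a single asyncio event loop, no locking is needed.

    import asyncio

    class SharedBuffer:
        """Buffer shared by several pipelines.

        It is returned to the free queue only once every pipeline has
        released it.
        """

        def __init__(self, data, refcount: int) -> None:
            self.data = data
            self._refcount = refcount

        def release(self, free_queue: asyncio.Queue) -> None:
            self._refcount -= 1
            if self._refcount == 0:
                free_queue.put_nowait(self)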

Transfers and events

To achieve the desired throughput it is necessary to overlap transfers to and from the GPU with its computations. Transfers are done using separate command queues, and a CUDA/OpenCL event (see the glossary) is associated with the completion of each transfer. Where possible, these events are passed to the device to be waited for, so that the CPU does not need to block. The CPU does need to wait for host-to-device transfers before putting the buffer onto the free queue, and for device-to-host transfers before transmitting results, but this is deferred as long as possible.
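
The sketch below shows the shape of this approach using katsdpsigproc's abstractions. The specific names used here (create_some_context, create_command_queue, HostArray, DeviceArray, set_async, enqueue_marker, enqueue_wait_for_events) are recalled from memory and should be treated as assumptions rather than a verified example.

    import numpy as np
    from katsdpsigproc import accel

    ctx = accel.create_some_context()
    upload_queue = ctx.create_command_queue()
    compute_queue = ctx.create_command_queue()

    host = accel.HostArray((1024,), np.float32, context=ctx)   # pinned host memory
    device = accel.DeviceArray(ctx, (1024,), np.float32)

    # Start the host-to-device transfer on its own command queue...
    device.set_async(upload_queue, host)
    upload_done = upload_queue.enqueue_marker()    # event marking transfer completion

    # ...and make the compute queue wait for it on the device, so the CPU
    # does not block here.
    compute_queue.enqueue_wait_for_events([upload_done])
    # (kernel launches on compute_queue would follow here)

    # The CPU itself only waits when it must reuse the host buffer, e.g.
    # before putting it back on the free queue.
    upload_done.wait()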

The above concepts are illustrated in the following figure:

[Figure: GPU command queues, showing the upload, processing and download command queues, and the events used for synchronisation.]

Common features

Shutdown procedures

The dsim, fgpu and xbgpu all make use of the aiokatcp server's on_stop feature, which allows any engine-specific clean-up to take place before the engine comes to a final halt.

The on_stop procedure is broadly similar across the dsim, fgpu and xbgpu:

  • The dsim simply stops its internal processes for calculating data and for sending descriptors.

  • fgpu and xbgpu both stop their respective spead2 receivers, which allows internal processing to wind down naturally:

    • Each stage of processing passes None on to the next stage,

    • This eventually results in the engine sending a SPEAD stop heap across its output streams (see the sketch below).
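
A minimal sketch of the hook, assuming an engine that holds a spead2 receiver; the attribute names other than on_stop are hypothetical.

    import aiokatcp

    class EngineServer(aiokatcp.DeviceServer):
        VERSION = "engine-0.1"
        BUILD_STATE = "engine-0.1.0"

        async def on_stop(self) -> None:
            # Stopping the receiver lets downstream stages drain: each stage
            # passes None to the next, and the sender finally emits a SPEAD
            # stop heap on the output streams.
            self.receiver.stop()          # hypothetical attribute
            await self.pipeline_task      # hypothetical: wait for the drain to finish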