V-Engine Design

Note

This has not yet been implemented, so it is still speculative.

The processing is largely handled by katcbf-vlbi-resample. This means that GPU memory is managed by cupy rather than being allocated entirely up front. The design is also much simpler than that of the other engines because we do not try to overlap CPU-GPU transfers with GPU computations [1]. The processing steps are the same as those documented for the mk_vlbi_resample script, except that there is no mixer because the incoming data stream is expected to already have the correct centre frequency.

To avoid unbounded memory usage if transmission cannot keep up with reception, a pull model is used. The main loop repeatedly obtains a completed chunk from an asynchronous iterator then transmits it. That iterator obtains input from an upstream iterator, which uses another iterator and so on, until eventually an iterator blocks on being able to get a chunk from the receive queue.
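
The pull model can be illustrated with a minimal sketch built on plain asyncio async generators. This is not the V-engine code: the names receive, process and main_loop, the use of integers as chunks, and the None end-of-stream marker are all assumptions made for illustration.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator


async def receive(queue: asyncio.Queue[int | None]) -> AsyncIterator[int]:
    """Innermost iterator: blocks until the receive queue has a chunk."""
    # None is a hypothetical end-of-stream marker used only in this sketch.
    while (chunk := await queue.get()) is not None:
        yield chunk


async def process(upstream: AsyncIterator[int]) -> AsyncIterator[int]:
    """Hypothetical processing stage: pulls one chunk at a time from upstream."""
    async for chunk in upstream:
        yield chunk * 2  # stand-in for real processing


async def main_loop(queue: asyncio.Queue[int | None]) -> list[int]:
    sent = []
    # Each request made here propagates down the chain of iterators, so no
    # stage runs ahead of transmission and nothing piles up between stages.
    async for chunk in process(process(receive(queue))):
        sent.append(chunk)  # stand-in for transmitting the chunk
    return sent


async def demo() -> list[int]:
    # The receive queue is bounded, so a slow consumer stalls the producer.
    queue: asyncio.Queue[int | None] = asyncio.Queue(maxsize=2)

    async def producer() -> None:
        for i in range(5):
            await queue.put(i)
        await queue.put(None)

    task = asyncio.create_task(producer())
    result = await main_loop(queue)
    await task
    return result
```

Because every stage only does work when asked for its next chunk, the only place data can accumulate in this sketch is the bounded receive queue.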

While there is no overlap between GPU work and CPU-GPU transfers, CPU work (specifically networking) must still proceed during transfers rather than pausing: on the receive side we do not want to miss arriving packets, and on the send side we want to send packets continuously at a steady rate rather than in bursts with gaps between them. On the receive side this happens naturally due to spead2’s design, with a separate thread handling reception of packets and assembly into chunks. On the send side the overlap is created by the AsNumpy class, which proactively starts transferring chunks from upstream before they are requested.
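
The idea of proactively fetching from upstream can be sketched as a generic asyncio wrapper. This is only an illustration of the prefetching pattern, not the AsNumpy implementation: the function name prefetch is hypothetical, and the sleeps stand in for device-to-host copies and transmission.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator
from typing import TypeVar

T = TypeVar("T")


async def prefetch(upstream: AsyncIterator[T]) -> AsyncIterator[T]:
    """Yield items from upstream, always keeping one fetch in flight."""
    it = upstream.__aiter__()
    # Start fetching the first item immediately.
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            try:
                item = await pending
            except StopAsyncIteration:
                return
            # Start the next fetch before handing this item to the consumer,
            # so upstream makes progress while the consumer is busy.
            pending = asyncio.ensure_future(it.__anext__())
            yield item
    finally:
        pending.cancel()  # no effect if the fetch has already finished


async def demo() -> list[str]:
    log: list[str] = []

    async def upstream() -> AsyncIterator[int]:
        for i in range(3):
            log.append(f"start {i}")
            await asyncio.sleep(0)  # stand-in for a device-to-host copy
            yield i

    async for item in prefetch(upstream()):
        await asyncio.sleep(0.01)  # stand-in for transmitting the item
        log.append(f"sent {item}")
    return log
```

In the demo, the fetch of item 1 starts before item 0 has finished being "sent", which is the overlap the wrapper is meant to create. The cost is that one extra item is held in memory at any time.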

Delays

The ?vlbi-delay request is implemented simply by adjusting the time base (sync epoch) of the first iterator in the processing chain.
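
As a rough illustration of why shifting the time base is enough, the sketch below models a time base as a sync epoch plus a sample rate. The class TimeBase, the function apply_delay and the sign convention (a positive delay makes every sample appear later) are assumptions for this example and are not taken from the V-engine code.

```python
from dataclasses import dataclass, replace
from fractions import Fraction


@dataclass(frozen=True)
class TimeBase:
    """Hypothetical time base: maps sample indices to UNIX time."""

    sync_epoch: Fraction  # UNIX time of sample 0, in seconds
    sample_rate: Fraction  # samples per second

    def time_of(self, sample: int) -> Fraction:
        return self.sync_epoch + sample / self.sample_rate


def apply_delay(time_base: TimeBase, delay: Fraction) -> TimeBase:
    # Assumed sign convention: a positive delay shifts all timestamps later.
    return replace(time_base, sync_epoch=time_base.sync_epoch + delay)


base = TimeBase(sync_epoch=Fraction(1000), sample_rate=Fraction(100))
delayed = apply_delay(base, Fraction(1, 4))
# Every sample is relabelled by the same amount; the sample data is untouched.
shift = delayed.time_of(50) - base.time_of(50)
```

Since only the labelling of samples changes, no sample data needs to be moved or resampled to apply the delay.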

One consequence of this choice is that the delay cannot be changed without discarding the processing chain and creating a new one, because the rechunking steps require contiguous data. This is handled by using a new _CaptureSession object for each ?capture-start / ?capture-stop cycle, which has its own processing chain. This is different to the F- and XB-engines, which run the processing continuously and only use ?capture-start and ?capture-stop to gate the output transmission.

An alternative approach would have been a model similar to the F-engine, where contiguous data is sourced by reading from an input buffer with an offset (possibly skipping or duplicating samples); in retrospect this may have been a better approach.

Reception

Initially we also gated the receiver on ?capture-start and ?capture-stop, but this meant that when not capturing there was no way to tell whether the V-engine had a healthy network connection capable of receiving the full incoming bandwidth. To address this, a somewhat complicated DiscardingIterator wrapper is used, which allows data to still be received while discarding it when not capturing.

Transmission

While the other engines use spead2 to send SPEAD data, the output of the V-engine is VDIF and so we are not able to use the high-speed kernel bypass and packet pacing capabilities of spead2. Instead, packet pacing is re-implemented in Python, following essentially the same design as used by spead2. There are a few changes to specialise things to the use case:

When the time to sleep is less than a threshold (1ms at the time of writing), we omit the sleep, as wakeups in Python can carry significant overhead.
Instead of buffering up packets to a given burst size, we treat each frameset (group of frames with the same timestamp but different thread IDs) as a burst that is transmitted without intervening sleeps. This will typically create smaller bursts than the default in spead2, but combined with the point above, the actual number of bytes sent between sleeps can be quite large.
The burst (catch-up) rate is set significantly higher than the default in spead2 to compensate for potentially long pauses. These pauses can be caused by Python’s stop-the-world garbage collector and by asyncio multiplexing work onto a single kernel thread (rather than having a dedicated thread for transmission).
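
The three points above can be sketched as a small pacer. This is a simplified illustration loosely modelled on the idea of tracking both a nominal rate and a burst rate. It is not the V-engine or spead2 code, and the names Pacer, sleep_for, sent and send_framesets, as well as the rates used, are hypothetical.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Pacer:
    rate: float  # target long-term rate, bytes per second
    burst_rate: float  # catch-up rate, bytes per second (higher than rate)
    min_sleep: float = 1e-3  # sleeps shorter than this are skipped
    send_time: float = 0.0  # when the next burst is due at `rate`
    send_time_burst: float = 0.0  # earliest time permitted by `burst_rate`

    def sleep_for(self, now: float) -> float:
        """Seconds to sleep before the next burst (0.0 means send now)."""
        delay = max(self.send_time, self.send_time_burst) - now
        return delay if delay >= self.min_sleep else 0.0

    def sent(self, now: float, nbytes: int) -> None:
        """Record that a burst of `nbytes` was transmitted at time `now`."""
        self.send_time += nbytes / self.rate
        # After a long pause send_time lags behind `now`, so the burst-rate
        # clock is what limits how quickly the backlog is cleared.
        self.send_time_burst = max(self.send_time_burst, now) + nbytes / self.burst_rate


async def send_framesets(framesets, pacer, send, clock=time.monotonic):
    """Send each frameset as one burst, sleeping only between framesets."""
    pacer.send_time = pacer.send_time_burst = clock()
    for frameset in framesets:
        delay = pacer.sleep_for(clock())
        if delay > 0:
            await asyncio.sleep(delay)
        nbytes = 0
        for frame in frameset:  # no sleeps inside a frameset
            send(frame)
            nbytes += len(frame)
        pacer.sent(clock(), nbytes)
```

With a rate of 1000 bytes per second, sending 100 bytes at time 0 makes the next burst due at 0.1 s. If the sender is then stalled until 1.0 s, subsequent bursts are spaced by the burst rate rather than the nominal rate until the schedule has caught up.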

Initially we tried to perform transmission serially with the iterator over the processed frames, on the assumption that the asynchronous buffering in AsNumpy would allow GPU work to proceed in parallel with data transmission. However, we found that this did not work well, as some requests for the next frame would block for hundreds of milliseconds, during which no packets were being transmitted. Instead, VDIFSender uses a queue of packets and a background task to service them concurrently with data processing.
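
A minimal sketch of the queue-plus-background-task arrangement is shown below. It is not the VDIFSender implementation: the class name QueuedSender, the bounded queue size, the None shutdown marker and the send callback are all assumptions, and pacing is omitted for brevity.

```python
import asyncio


class QueuedSender:
    """Hypothetical sender that transmits from a queue in a background task.

    Must be constructed while an event loop is running, because it starts
    its background task immediately.
    """

    def __init__(self, send, maxsize: int = 16) -> None:
        self._send = send
        # Bounded (an assumption) so that a slow network applies backpressure.
        self._queue: asyncio.Queue = asyncio.Queue(maxsize)
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        # None is a shutdown marker used only in this sketch.
        while (frameset := await self._queue.get()) is not None:
            for frame in frameset:
                self._send(frame)
            # A real sender would pace here; this just yields to the loop.
            await asyncio.sleep(0)

    async def put(self, frameset) -> None:
        await self._queue.put(frameset)  # waits while the queue is full

    async def close(self) -> None:
        await self._queue.put(None)
        await self._task


async def demo() -> list:
    sent: list = []
    sender = QueuedSender(sent.append, maxsize=2)
    for i in range(4):
        await asyncio.sleep(0.001)  # stand-in for waiting on the next frameset
        await sender.put([bytes([i]), bytes([i])])
    await sender.close()
    return sent
```

While the producer is waiting for its next frameset, the background task keeps draining whatever is already queued, so a slow request upstream does not by itself stop transmission.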