V-Engine Design
Note
This has not yet been implemented, so it is still speculative.
The processing is largely handled by katcbf-vlbi-resample. This means that GPU memory is managed by cupy rather than being allocated entirely up front. The design is also much simpler than that of the other engines because we do not try to overlap CPU-GPU transfers with GPU computations [1]. The processing steps are the same as those documented for the mk_vlbi_resample script, except that there is no mixer because the incoming data stream is expected to already have the correct centre frequency.
To avoid unbounded memory usage if transmission cannot keep up with reception, a pull model is used. The main loop repeatedly obtains a completed chunk from an asynchronous iterator then transmits it. That iterator obtains input from an upstream iterator, which uses another iterator and so on, until eventually an iterator blocks on being able to get a chunk from the receive queue.
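The chain of iterators can be sketched with plain asyncio. This is only an illustration of the pull model, not the engine's code: the names receive, scale and main_loop, the None end-of-stream marker and the list-of-int chunks are all assumptions made for the example.

```python
import asyncio
from collections.abc import AsyncIterator


async def receive(queue: asyncio.Queue) -> AsyncIterator[list[int]]:
    """Innermost iterator: blocks until the receive queue has a chunk."""
    # None is a hypothetical end-of-stream marker used only in this sketch.
    while (chunk := await queue.get()) is not None:
        yield chunk


async def scale(upstream: AsyncIterator[list[int]], factor: int) -> AsyncIterator[list[int]]:
    """A processing stage: it pulls from upstream only when asked for data."""
    async for chunk in upstream:
        yield [factor * x for x in chunk]


async def main_loop(chunks: AsyncIterator[list[int]], sent: list[list[int]]) -> None:
    """Main loop: obtain a completed chunk, then 'transmit' it."""
    async for chunk in chunks:
        await asyncio.sleep(0)  # stand-in for transmission
        sent.append(chunk)


async def demo() -> list[list[int]]:
    # A bounded queue: the producer stalls when the consumer falls behind,
    # so memory usage cannot grow without limit.
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)

    async def producer() -> None:
        for item in ([1, 2], [3, 4], None):
            await queue.put(item)

    sent: list[list[int]] = []
    producer_task = asyncio.create_task(producer())
    await main_loop(scale(receive(queue), 10), sent)
    await producer_task
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Nothing in the chain runs unless the main loop asks for the next chunk, so a slow transmitter slows down every stage upstream of it.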
While there is no overlap between GPU work and CPU-GPU transfers, CPU work
(specifically networking) must still be overlapped rather than paused during
transfers: on the receive side we do not want to miss packets arriving, and on
the send side we want to send packets continuously at a steady rate rather
than in bursts with gaps in between. On the receive side that happens naturally due
to spead2’s design, with a separate thread handling reception of packets and
assembly into chunks. On the send side the overlap is created by the
AsNumpy class, which proactively starts transferring chunks from
upstream before they are requested.
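The idea of proactively starting the next transfer can be sketched as a wrapper that always keeps one upstream request in flight. This is a minimal sketch and not the AsNumpy class itself: the prefetch function, the _END sentinel and the sleeps standing in for transfers and transmission are all hypothetical.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any

_END = object()  # hypothetical sentinel marking exhaustion of upstream


async def _next_or_end(it: AsyncIterator[Any]) -> Any:
    """Fetch the next item, turning exhaustion into the _END sentinel."""
    try:
        return await it.__anext__()
    except StopAsyncIteration:
        return _END


async def prefetch(upstream: AsyncIterator[Any]) -> AsyncIterator[Any]:
    """Yield items from upstream, requesting the next one before it is asked for."""
    it = upstream.__aiter__()
    pending = asyncio.create_task(_next_or_end(it))
    try:
        while True:
            item = await pending
            if item is _END:
                return
            # Start the next fetch now, so that it runs while the consumer
            # is still busy with the item we are about to yield.
            pending = asyncio.create_task(_next_or_end(it))
            yield item
    finally:
        pending.cancel()  # no effect if the task has already finished


async def demo() -> list[str]:
    events: list[str] = []

    async def upstream() -> AsyncIterator[int]:
        for i in range(3):
            events.append(f"start {i}")
            await asyncio.sleep(0.01)  # stand-in for a device-to-host transfer
            yield i

    async for item in prefetch(upstream()):
        events.append(f"got {item}")
        await asyncio.sleep(0.05)  # stand-in for transmitting the chunk
        events.append(f"sent {item}")
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the transfer of chunk 1 starts while chunk 0 is still being transmitted, which is the overlap this paragraph describes.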
Reception
One difference from the F- and XB-engines is that capture-start and
capture-stop gate the entire processing chain rather than just
transmission. This approach was chosen because changing the VLBI delay can
make samples non-contiguous in time, which interacts poorly with rechunking
steps in katcbf-vlbi-resample. An alternative approach would have been a model
similar to the F-engine, where contiguous data is sourced by reading from an
input buffer with an offset (possibly skipping or duplicating samples); in
retrospect this may have been a better approach.
Initially we also gated the receiver on capture-start and
capture-stop, but this meant that when not capturing there was no way to
tell whether the V-engine had a healthy network connection capable of
receiving the full incoming bandwidth. To address this, a somewhat complicated
DiscardingIterator wrapper is used so that data is still received when
not capturing, but is then discarded.
Transmission
While the other engines use spead2 to send SPEAD data, the output of the V-engine is VDIF, so we are not able to use the high-speed kernel-bypass and packet-pacing capabilities of spead2. Instead, packet pacing is re-implemented in Python, following essentially the same design as used by spead2. There are a few changes to specialise it to this use case:
When the time to sleep is less than a threshold (1 ms at the time of writing), we omit the sleep, as wakeup overheads can be quite high in Python.
Instead of buffering up packets to a given burst size, we treat each frameset (group of frames with the same timestamp but different thread IDs) as a burst that is transmitted without intervening sleeps. This will typically create smaller bursts than the default in spead2, but combined with the point above, the actual number of bytes sent between sleeps can be quite large.
The burst (catch-up) rate is set significantly higher than the default in spead2 to compensate for potentially long pauses. These pauses can be caused by Python’s stop-the-world garbage collector and by asyncio multiplexing work onto a single kernel thread (rather than having a dedicated thread for transmission).
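The three points above can be illustrated with a small pacing calculation. This is a simplified sketch and not the actual implementation: the Pacer class, its field names and the rates used in the usage note are assumptions, and the bookkeeping is only loosely modelled on the two-rate scheme described here (a long-term rate plus a higher catch-up rate).

```python
from dataclasses import dataclass


@dataclass
class Pacer:
    """Hypothetical pacer: decides how long to sleep after each frameset."""

    rate: float  # long-term target rate, in bytes per second
    burst_rate: float  # catch-up rate, in bytes per second (higher than rate)
    min_sleep: float = 1e-3  # sleeps shorter than this are skipped
    send_time: float = 0.0  # when the next frameset is due at `rate`
    send_time_burst: float = 0.0  # earliest time permitted by `burst_rate`

    def next_delay(self, now: float, nbytes: int) -> float:
        """Account for a frameset of `nbytes` sent at `now`; return the sleep time."""
        self.send_time += nbytes / self.rate
        self.send_time_burst += nbytes / self.burst_rate
        target = max(self.send_time, self.send_time_burst)
        # Remember when this frameset really went out, so that catching up
        # after a long pause is still limited to `burst_rate`.
        self.send_time_burst = max(now, target)
        delay = target - now
        # Skip short sleeps: the wakeup overhead would dominate.
        return delay if delay >= self.min_sleep else 0.0
```

With rate=1e6 and burst_rate=4e6, an 8000-byte frameset sent on schedule gives a sleep of about 8 ms. After a long stall the next frameset gives no sleep at all, and subsequent ones are spaced about 2 ms apart (the catch-up rate) until the schedule has been recovered. All times passed in must come from the same clock, with send_time and send_time_burst initialised to the time of the first transmission.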
Initially we tried to perform transmission serially with the iterator over the
processed frames, on the assumption that the asynchronous buffering in
AsNumpy would allow GPU work to
proceed in parallel with data transmission. However, we found that this did
not work well, as some requests for the next frame would block for hundreds of
milliseconds, during which no packets were being transmitted. Instead,
VDIFSender uses a queue of packets and a background task to service
them concurrently with data processing.
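The queue-plus-background-task arrangement can be sketched as follows. This is not the VDIFSender class: QueuedSender, its methods, the None shutdown marker and the queue size are hypothetical, and pacing is reduced to a placeholder.

```python
import asyncio
from collections.abc import Callable


class QueuedSender:
    """Hypothetical sender that transmits from a queue in a background task."""

    def __init__(self, send: Callable[[bytes], None], maxsize: int = 64) -> None:
        # Must be constructed while an event loop is running.
        self._send = send
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        # None is a shutdown marker used only in this sketch.
        while (packet := await self._queue.get()) is not None:
            self._send(packet)
            await asyncio.sleep(0)  # placeholder for packet pacing

    async def put(self, packet: bytes) -> None:
        """Queue a packet; waits when the queue is full, which bounds memory."""
        await self._queue.put(packet)

    async def close(self) -> None:
        """Flush everything that has been queued, then stop the background task."""
        await self._queue.put(None)
        await self._task


async def demo() -> list[bytes]:
    sent: list[bytes] = []
    sender = QueuedSender(sent.append, maxsize=4)
    for i in range(10):
        await asyncio.sleep(0.001)  # stand-in for waiting on the next processed frame
        await sender.put(bytes([i]))
    await sender.close()
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the background task drains the queue whenever the producer is waiting, a slow request for the next frame no longer stops transmission, as long as enough packets are already queued.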