liblarod  3.1.28
Working with dma-buf

Background

Some backends in larod support DMA (Direct Memory Access) buffers. The idea behind such buffers is to minimize copying of data between buffers when running larod jobs. How DMA buffers are handled is up to each backend, but they generally improve performance; refer to preprocessing.md and nn-inference.md to find out which backends have dma-buf support. For more detailed information about DMA buffers, refer to Linux's official documentation.

larod is able to process DMA buffers as either input or output tensors when running jobs; these are the only two cases where DMA buffers are relevant for larod. This buffer type is identified by the LAROD_FD_PROP_DMABUF property, which is typically specified when the buffer is created.

Usage

Even though DMA buffers are generally faster, they introduce more complexity. The owner of a DMA buffer must make sure to sync the CPU cache at the right moment before reading from or writing to the buffer.

In general, the following steps are required when running a preprocessing or inference job and dealing with DMA buffers:

  1. Start CPU access of the input buffer if it is DMA, otherwise skip this step.
  2. Write the desired data to the input buffer, e.g. using memcpy.
  3. End CPU access of the input buffer if it is DMA, otherwise skip this step. The input buffer is now flushed and ready to be processed by larod.
  4. Run the preprocessing or inference job.
  5. Start CPU access of the output buffer if it is DMA, otherwise skip this step.
  6. Read from the output buffer, e.g. using memcpy.
  7. End CPU access of the output buffer if it is DMA, otherwise skip this step.

The following code snippet shows an example of how to do steps 1-3 in the numbered list above.

#include <errno.h>
#include <linux/dma-buf.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#include "larod.h"

bool copyToTensor(larodTensor* tensor, size_t tensorSz, uint8_t* data,
                  size_t dataSz) {
    uint32_t props = 0;
    if (!larodGetTensorFdProps(tensor, &props, NULL)) {
        return false;
    }

    int fd = larodGetTensorFd(tensor, NULL);
    if (fd < 0) {
        return false;
    }

    int64_t tensorOffset = larodGetTensorFdOffset(tensor, NULL);
    if (tensorOffset < 0) {
        return false;
    }

    void* tensorData =
        mmap(NULL, tensorSz, PROT_WRITE, MAP_SHARED, fd, (off_t) tensorOffset);
    if (tensorData == MAP_FAILED) {
        return false;
    }

    // Step 1: Start CPU access if the buffer is a dma-buf.
    if (props & LAROD_FD_PROP_DMABUF) {
        struct dma_buf_sync sync;
        memset(&sync, 0, sizeof(sync));
        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
        int tries = 5;
        int ret = -1;
        do {
            ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
        } while (--tries > 0 && ret == -1 &&
                 (errno == EAGAIN || errno == EINTR));
        if (ret) {
            munmap(tensorData, tensorSz);
            return false;
        }
    }

    // Step 2: Write the data, clamped to the tensor size.
    memcpy(tensorData, data, (dataSz <= tensorSz) ? dataSz : tensorSz);

    // Step 3: End CPU access; this flushes the CPU cache so that the written
    // data is visible to the device.
    if (props & LAROD_FD_PROP_DMABUF) {
        struct dma_buf_sync sync;
        memset(&sync, 0, sizeof(sync));
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        int tries = 5;
        int ret = -1;
        do {
            ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
        } while (--tries > 0 && ret == -1 &&
                 (errno == EAGAIN || errno == EINTR));
        if (ret) {
            munmap(tensorData, tensorSz);
            return false;
        }
    }

    munmap(tensorData, tensorSz);
    return true;
}