Optimization tips

This document provides optimization tips for each Deep Learning Processing Unit (DLPU). It is intended for advanced users who want the best possible performance from their model and are willing to invest more time in the optimization process. If you are not an expert, we recommend using only the models listed in the recommended model architecture section of this documentation or in the Axis Model Zoo.

ARTPEC-7

The ARTPEC-7 DLPU has its own dedicated memory. To maximize performance, use lightweight models that fit in this memory, such as SSD MobileNet v2 300x300.

When compiling your model for the Edge TPU with the Edge TPU Compiler (edgetpu_compiler), you may receive warnings about operations that cannot be executed on the TPU. Avoid using these operations in your model. Per-channel and per-tensor quantization offer similar performance, but per-channel quantization is recommended because it gives better accuracy. For more details on how to optimize your model for the ARTPEC-7 DLPU, refer to the Edge TPU documentation.
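To illustrate why per-channel quantization tends to preserve accuracy better, the following is a minimal sketch of symmetric int8 quantization in plain Python. It is not what the converter does internally; the weight values, the function names, and the choice of relative error as a metric are all assumptions made for illustration.

```python
# Illustrative sketch only: symmetric int8 quantization in plain Python.
# Real converters choose scales and zero points internally.

def quantize_dequantize(values, scale):
    """Round each value to an int8 step of size `scale`, then map it back."""
    restored = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_relative_error(weights, scales):
    """Largest relative error when channel i is quantized with scales[i].

    Assumes no weight is exactly zero.
    """
    worst = 0.0
    for channel, scale in zip(weights, scales):
        restored = quantize_dequantize(channel, scale)
        for original, approx in zip(channel, restored):
            worst = max(worst, abs(original - approx) / abs(original))
    return worst


# Two hypothetical output channels with very different value ranges.
weights = [
    [0.010, -0.020, 0.015],  # small-magnitude channel
    [1.500, -2.000, 0.750],  # large-magnitude channel
]

# Per-tensor: one scale shared by every channel.
tensor_scale = max(abs(v) for ch in weights for v in ch) / 127
per_tensor_scales = [tensor_scale] * len(weights)

# Per-channel: each output channel gets its own scale.
per_channel_scales = [max(abs(v) for v in ch) / 127 for ch in weights]

per_tensor_error = max_relative_error(weights, per_tensor_scales)
per_channel_error = max_relative_error(weights, per_channel_scales)
print(f"per-tensor:  {per_tensor_error:.4f}")
print(f"per-channel: {per_channel_error:.4f}")
```

With a shared scale, the small-magnitude channel is squeezed into one or two quantization steps, so its relative error is much larger than with a scale of its own.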

ARTPEC-8

The ARTPEC-8 DLPU performs well with models of any size, as long as they fit in the device's memory. This allows for better performance with larger models compared to ARTPEC-7. Here are some additional optimizations to enhance DLPU performance:

  • Use per-tensor quantization.
  • Prefer regular convolutions over depth-wise convolutions. This means that architectures like ResNet-18 are more efficient than MobileNet.
  • Optimal kernel size is 3x3.
  • Use stride 2 whenever possible as it is natively supported by the convolution engine. For other cases, consider using pooling.
  • Ensure the number of filters per convolution block is a multiple of 6.
  • Applying ReLU as the activation function after a convolution will result in a faster fused layer.
  • Sparsification can improve the performance of the model.
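The layer-level tips above can be turned into a simple design-time check. The sketch below is hypothetical: the function names and the way a layer is described (kernel, stride, filters, and so on) are assumptions for illustration, not part of any Axis tool. It reads "for other cases, consider using pooling" as applying to strides larger than 2.

```python
def round_up_to_multiple(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return ((value + multiple - 1) // multiple) * multiple


def artpec8_conv_hints(kernel, stride, filters, depthwise=False, activation="relu"):
    """Return optimization hints for one hypothetical convolution layer."""
    hints = []
    if depthwise:
        hints.append("prefer a regular convolution over a depth-wise one")
    if kernel != 3:
        hints.append("the optimal kernel size is 3x3")
    if stride > 2:
        hints.append("use stride 2, or pooling, for downsampling")
    if filters % 6 != 0:
        suggestion = round_up_to_multiple(filters, 6)
        hints.append(f"use a filter count that is a multiple of 6, e.g. {suggestion}")
    if activation != "relu":
        hints.append("ReLU after the convolution gives a faster fused layer")
    return hints


# A layer that follows every tip produces no hints.
print(artpec8_conv_hints(kernel=3, stride=2, filters=66))

# A layer that breaks several tips.
for hint in artpec8_conv_hints(kernel=5, stride=4, filters=64,
                               depthwise=True, activation="swish"):
    print("-", hint)
```

For example, a block with 64 filters would be widened to 66 under this rule, since 66 is the next multiple of 6.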

ARTPEC-9

The ARTPEC-9 DLPU performs well with models of any size, as long as they fit in the device's memory, but it has some limits on the input and output tensor depths. Here are some additional optimizations to enhance DLPU performance:

  • Average pooling is most efficient when size is 3x3 with stride 1 and padding 1.
  • If concatenating along the channel dimension, the channel dimension of every input tensor must be a multiple of 16.
  • For convolution 2D and depthwise convolution 2D, the supported kernel heights and widths are: 1, 2, 3, 5, 7, 9. The supported strides (the height and width strides must match) are: 1, 2. For kernels with height or width >7, only a stride of 1 is supported.
  • Ensure the number of filters per convolution block is a multiple of 16.
  • Max pooling is most efficient with these configurations:
    • 1x1 pooling size, 2x2 stride (equivalent to 2x2 downsampling).
    • 2x2 pooling size, 2x2 stride, VALID padding. Input sizes must be even.
    • 2x2 pooling size, 2x2 stride, SAME padding. Input sizes must be odd.
    • 3x3 pooling size, 2x2 stride, VALID padding. Input sizes must be even and the maximum tensor width is 417.
    • 3x3 pooling size, 2x2 stride, SAME padding. Input sizes must be odd and the maximum tensor width is 417.
    • 1x1 stride with pooling sizes up to 9x9 for VALID padding.
    • 1x1 stride with pooling sizes up to 17x17 for SAME padding. Input size must not be smaller than the pooling size.
  • Padding:
    • Only zero padding in the H and W dimensions is supported.
    • Padding of up to 7 on each side of the tensor in those dimensions is supported.
    • Padding amounts can differ per side, e.g. pad of 1 before the tensor in the H dimension and a pad of 3 after the tensor in the H dimension.
    • Quantization for input and output tensors must be identical.
  • For more optimizations, see Arm operator support.
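Several of the ARTPEC-9 constraints above are simple arithmetic rules, so they can be checked before a model is converted. The following is a minimal sketch with hypothetical helper names. It assumes square pooling windows and equal strides in both dimensions, and it reads the "input size must not be smaller than the pooling size" rule as applying to both stride-1 cases.

```python
MAX_PAD = 7            # zero-padding limit per side, H and W only
CHANNEL_MULTIPLE = 16  # for concatenation inputs and filter counts
MAX_POOL3_WIDTH = 417  # width limit for 3x3 pooling with stride 2


def channels_aligned(channel_counts):
    """True if every channel count is a multiple of 16."""
    return all(c % CHANNEL_MULTIPLE == 0 for c in channel_counts)


def padding_ok(top, bottom, left, right):
    """True if each side's padding is within 0..7. Sides may differ."""
    return all(0 <= p <= MAX_PAD for p in (top, bottom, left, right))


def max_pool_is_efficient(pool, stride, padding, height, width):
    """True if a max pooling layer matches one of the listed configurations."""
    if padding not in ("VALID", "SAME"):
        raise ValueError("padding must be 'VALID' or 'SAME'")
    sizes = (height, width)
    if stride == 2:
        if pool == 1:
            return True  # plain 2x2 downsampling
        if padding == "VALID":
            parity_ok = all(s % 2 == 0 for s in sizes)  # even inputs
        else:
            parity_ok = all(s % 2 == 1 for s in sizes)  # odd inputs
        if pool == 2:
            return parity_ok
        if pool == 3:
            return parity_ok and width <= MAX_POOL3_WIDTH
        return False
    if stride == 1:
        limit = 9 if padding == "VALID" else 17
        return pool <= limit and min(sizes) >= pool
    return False


print(channels_aligned([16, 32, 48]))                     # inputs to a concat
print(padding_ok(top=1, bottom=3, left=0, right=7))       # asymmetric padding
print(max_pool_is_efficient(2, 2, "VALID", 224, 224))     # even input, VALID
print(max_pool_is_efficient(2, 2, "SAME", 224, 224))      # even input, SAME
```

The same `channels_aligned` check can be applied to the filter count of a convolution block by passing a one-element list.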

CV25

For CV25, model quantization and optimizations are mainly performed by the compiler. Refer to the documentation provided with the Ambarella SDK for more information.

One important consideration is to use a model whose input size is a multiple of 32. Otherwise, the input must be padded, which makes the conversion process slightly more complex and the model less efficient.
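As a quick way to see how much padding a given input would need, the sketch below rounds an input resolution up to the next multiple of 32. The function names are hypothetical, and the sketch assumes the rule applies to both height and width.

```python
def padding_to_multiple(size, multiple=32):
    """Pixels to add so that `size` becomes a multiple of `multiple`."""
    return (-size) % multiple


def padded_input_shape(height, width, multiple=32):
    """Input shape after rounding both dimensions up to a multiple of `multiple`."""
    return (
        height + padding_to_multiple(height, multiple),
        width + padding_to_multiple(width, multiple),
    )


# A 300x300 input is not aligned and would need padding.
print(padded_input_shape(300, 300))

# A 480x640 input is already aligned, so nothing changes.
print(padded_input_shape(480, 640))
```

Choosing an aligned resolution at training time avoids the padding step entirely.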