The integration of the DMA-350 into the nanosoc re-usable SoC architecture will improve the transfer bandwidth on DMA channels within the SoC. This project integrates the DMA-350 into nanosoc, validates the integration and functionality of the DMA-350, and compares its performance with that of the PL230, which was the initial DMA controller integrated into nanosoc.
Configuring the DMA
The DMA-350 has a number of configuration options. These include: bus data and address widths, number of channels, number of input/output triggers, number of stream interfaces, number of GPOs, the use of an additional AXI master port, and the use of TrustZone security. All of these optional features come with a trade-off in power and area. To keep the area to a sensible size for a nanosoc implementation, we have configured the DMA as follows:
|No. of channels
|Channel FIFO depth
|AXI M1 present
With this configuration, each of the 2 channels has stream and trigger interfaces. The DMA can also use a different AXI master for each channel, or use both AXI masters on a single channel. To choose the FIFO depth, tests were run with different FIFO depths and a transfer size of 256; the results are as follows:
A FIFO depth of 32 gives the best performance, and increasing it further makes no difference, at least for this transfer size.
AXI to AHB
Arm provide the XHB-500 for AXI to AHB conversion. Its configuration is relatively minimal: the data and address widths must be consistent, and the ID widths must be correct for the configuration of the DMA-350. The XHB-500 converts AXI5 to AHB5, but nanosoc is an AHB-Lite system. To achieve this conversion, the hnonsec, hexcl, hqos, hregion and hnsaid signals can be ignored, and the hexokay signal must be tied to 0.
The AHB signals from the sldma350_ahb.v module are then connected into the nanosoc_ss_dma.v module. So that users can still choose between the PL230 and the DMA-350, some defines are used. These allow a choice between 1x DMA-350, 1x PL230 or 2x PL230 (set by DMAC_DMA350, DMAC_0_PL230 and DMAC_1_PL230).
Changes to nanoSoC
To allow the integration of the DMA-350 into nanosoc, the APB address map had to be altered. Until this point all the APB completers required 12-bit address widths, whereas the DMA-350 requires 13-bit addresses. To accommodate this extra space, the DMA-350 spans 2 APB regions. The changes are highlighted in the address map below:
The sysctrl address space is offset to 0x40000000, with the DMA-350 at 0x4000C000 - 0x4000DFFF.
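The region arithmetic behind this address range can be checked with a short sketch. It assumes each APB region corresponds to 12 address bits, as described above; the function and constant names are illustrative and are not taken from the nanosoc sources.

```python
# Illustrative check of the APB address arithmetic; all names are hypothetical.
APB_REGION_BITS = 12        # existing APB completers use 12-bit addresses
DMA350_ADDR_BITS = 13       # the DMA-350 needs 13-bit addresses
SYSCTRL_BASE = 0x40000000   # sysctrl address space base (from the text)
DMA350_BASE = 0x4000C000    # DMA-350 base address (from the text)


def apb_regions_spanned(addr_bits, region_bits=APB_REGION_BITS):
    """Number of APB regions needed to cover addr_bits of address space."""
    return max(1, (1 << addr_bits) >> region_bits)


def address_range(base, addr_bits):
    """Inclusive (first, last) byte addresses of a block at base."""
    return base, base + (1 << addr_bits) - 1


regions = apb_regions_spanned(DMA350_ADDR_BITS)
first, last = address_range(DMA350_BASE, DMA350_ADDR_BITS)
# Index of the first APB region used, counted from the sysctrl base.
first_region = (DMA350_BASE - SYSCTRL_BASE) >> APB_REGION_BITS

print(regions, hex(first), hex(last), first_region)
```

Under these assumptions the DMA-350 occupies two consecutive regions, starting at region index 12, which matches the 0x4000C000 - 0x4000DFFF range.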
Some additional changes were made to allow the use of stream interfaces. These interfaces are passed to the expansion space so they can be used by your accelerator.
Validating the integration
To validate the functional behaviour of the DMA-350 within nanosoc, test code has been written to run on the Cortex-M0. The tests currently included are:
- 1D transfer from EXPRAM0 to EXPRAM1 with interrupts
- 1D transfer from EXPRAM0 to EXPRAM1 without interrupts
- 1D transfer from EXPRAM0 to EXPRAM1 without interrupts, and disabling burst transfers
- 1D transfer from EXPRAM0 to EXPRAM1 without interrupts, using M1 AXI interface
- 1D transfer from EXPRAM0 to EXPRAM1 without interrupts, using software triggering
Additional tests include the use of both M0 and M1 together, and the hardware triggering interface. To use the M1 interface, the memory must be addressed at an offset of 0x08000000.
To compare the performance of the DMA-350 and the PL230, 64 x 32-bit transfers were run on both and the transfer time was measured in simulation. Four DMA configurations were tested. The first is the PL230 as provided by Arm. The second is an upgraded version of the PL230, which includes caches for the source, destination and control registers so that it does not have to fetch data from the DMA data structure as often. The third is the DMA-350 using just a single AXI master, and the fourth is the DMA-350 using 2 masters.
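As a rough illustration of the M1 addressing rule, the sketch below computes the address to program for an M1 transfer. It assumes the offset is simply added to the normal address; the helper name and the example address are hypothetical placeholders, not values from the nanosoc memory map.

```python
M1_OFFSET = 0x08000000  # offset for the M1 interface (from the text)


def m1_address(addr):
    """Return addr shifted by the M1 offset.

    Assumption: the offset is added to the normal address. The result is
    checked against the 32-bit address space.
    """
    aliased = addr + M1_OFFSET
    if aliased > 0xFFFFFFFF:
        raise ValueError("aliased address exceeds 32 bits")
    return aliased


# Hypothetical placeholder address, NOT the real EXPRAM0 base.
EXAMPLE_ADDR = 0x80000000
print(hex(m1_address(EXAMPLE_ADDR)))
```

In firmware the same calculation would be applied to the source or destination address before it is written to the channel registers.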
|Cycles for 64 transfers
The soclabs PL230 version outperforms the Arm version. In the default Arm-supplied version of the PL230, the DMA first fetches the source and destination information from the DMA data structure, then performs a single read and a single write, and it has to repeat this for each word transferred. The soclabs version creates a cached copy of the transfer information and then performs 16 read-write cycles. This gives a significant improvement, as it does not have to fetch data from the DMA data structure as often.
The DMA-350, however, has an internal data structure, so it does not have to fetch any information while running. It also has FIFOs for each channel, so it can perform sequential reads followed by writes rather than single read-writes as the PL230 does. This gives about a 15% improvement in bandwidth; however, the DMA-350 has a significantly larger synthesised area (about 30x larger than the enhanced PL230). The best performance comes from using both of the AXI masters. Because nanosoc has a 4x7 bus matrix, each master effectively has its own bus, so reading and writing can happen concurrently. This gives an increase in bandwidth of over 100% compared to the enhanced PL230, and of 85% compared to the DMA-350 with a single master.
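The difference in fetch overhead can be made concrete by counting how often the transfer information is fetched in the 64-word test. This is a minimal sketch that assumes the soclabs version re-fetches once per group of 16 read-write cycles; the function name is illustrative, and no cycle counts are implied.

```python
def descriptor_fetches(words, words_per_fetch):
    """Fetches of transfer information needed to move `words` words,
    when one fetch covers `words_per_fetch` read-write cycles."""
    return -(-words // words_per_fetch)  # integer ceiling division


WORDS = 64  # 64 x 32-bit transfers, as in the comparison

arm_fetches = descriptor_fetches(WORDS, 1)       # fetch before every word
soclabs_fetches = descriptor_fetches(WORDS, 16)  # assumed: one fetch per 16 words

print(arm_fetches, soclabs_fetches)
```

Under that assumption the cached version fetches the transfer information 4 times instead of 64.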
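The quoted percentages can be cross-checked by composing them. The sketch below assumes the 15% figure is measured against the enhanced PL230 and that bandwidth gains multiply; the function name is illustrative.

```python
def compose_gains(*gains_percent):
    """Combine successive percentage bandwidth gains into one factor."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1.0 + gain / 100.0
    return factor


# Single-master DMA-350 vs enhanced PL230 (assumed baseline), then
# dual-master DMA-350 vs single-master DMA-350.
dual_vs_pl230 = compose_gains(15, 85)
dual_gain_percent = (dual_vs_pl230 - 1.0) * 100.0

print(round(dual_vs_pl230, 4), round(dual_gain_percent, 2))
```

A 15% gain followed by an 85% gain is roughly a 2.13x factor, which is consistent with the stated increase of over 100% relative to the enhanced PL230.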
Using the DMA350 in your project
Currently, the default option for DMA in nanosoc is the enhanced PL230. If you would like to use the DMA-350 in your project, you can include it by adding the DMAC_DMA350 definition to NANOSOC_DEFINES in the makefile in nanosoc_tech. This will replace the PL230; you cannot instantiate both at the same time. You will also need to include the DMA-350 IP in the system level flist (accelerator-project/flist/system.flist). This can be done by uncommenting the line that includes the sldma350_ahb.flist. It is suggested that the sldma230_ip.flist is commented out to avoid the inclusion of that IP in simulation etc.
Some drivers are provided for test code development in the nanosoc_tech/software/drivers/ directory, and you can use nanosoc_tech/software/common/validation/dma350_tests.c as a baseline for firmware development.