Real-Time Edge AI SoC: High-Speed Low Complexity Reconfigurable-Scalable Architecture for Deep Neural Networks
Modern Convolutional Neural Networks (CNNs) are computationally and memory intensive owing to their ever-deepening structures. Because neural network requirements are constantly evolving, a reconfigurable design is crucial to tackling this challenge. The proposed architecture adapts to the needs of the neural network and is flexible enough to compute various networks with input images of any size. System-on-Chip (SoC) development around this accelerator is required to provide a single-chip solution for CNN edge inference that incorporates data preprocessing and layerwise control over the inputs and outputs. Integrating our accelerator, the NPU (Neural Processing Unit), into a SoC enables tight coupling between the accelerator and other components, such as memory interfaces and peripherals. This makes data movement more efficient and lowers latency, which enhances the overall performance of deep learning tasks. A SoC also allows the host processing system, such as a CPU or GPU, to be seamlessly integrated with the NPU.
Through this connection, computationally demanding Deep Learning (DL) workloads can be transferred to the accelerator, offloading the host CPU for other tasks and improving system efficiency. A SoC-based NPU also offers a path for future extensions and upgrades: as algorithms and technology evolve, the SoC can incorporate superior accelerator architectures or additional hardware components to keep pace with developments in DL. By using customized algorithms on the CPU to fragment the image, and a dedicated controller to keep track of the scheduling tasks to be run on the NPU, this SoC can handle a range of image sizes as CNN inputs, making it an image-scalable and reconfigurable NPU-based SoC. Additionally, with the aid of an Ethernet connector, the SoC can access cloud resources, allowing it to apply effective neural network preprocessing methods such as network adaptation, post-training quantization, and pruning for CNNs with billions of weights in a digital-twin setup.
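As a sketch of the CPU-side image fragmentation described above, the snippet below splits an input image into fixed-size tiles for the NPU. The tile dimensions and the zero-padding of edge tiles are illustrative assumptions, not the actual algorithm used on this SoC.

```python
import numpy as np

# Hypothetical tile dimensions; the real values would depend on the
# NPU's on-chip buffer sizes, which are not given here.
TILE_H, TILE_W = 32, 32

def fragment_image(image, tile_h=TILE_H, tile_w=TILE_W):
    """Split an H x W x C image into (row, col, tile) entries,
    zero-padding edge tiles so every tile has the same shape."""
    h, w, c = image.shape
    tiles = []
    for r in range(0, h, tile_h):
        for col in range(0, w, tile_w):
            tile = np.zeros((tile_h, tile_w, c), dtype=image.dtype)
            patch = image[r:r + tile_h, col:col + tile_w, :]
            tile[:patch.shape[0], :patch.shape[1], :] = patch
            tiles.append((r, col, tile))
    return tiles
```

Uniform tile shapes keep the NPU's control logic simple: every scheduled task moves the same number of bytes, which suits burst-oriented DMA transfers.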
Project Milestones
Design upgradations of NPU | Target Date | Completed Date
- Designing dedicated data paths for NPU's control logic, kernels, and input activation ports with AXI4-Lite and AXI4-Stream.
- Adding DMA support to the NPU for real-time data transfers from the processor and DDR memory.
- Distributed memory banks to process a higher number of feature kernels per image tile for better throughput.
- Creation of an address map in the NPU's main memory for burst transactions, enabling loads and fetches from the processor.
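The address-map milestone above could look something like the following sketch. All region names, base addresses, and sizes here are hypothetical placeholders; the actual NPU address map is not given in this post.

```python
# Hypothetical NPU memory regions as (base, size) pairs.
# Real bases and sizes depend on the SoC's memory map.
NPU_REGIONS = {
    "ctrl_regs":  (0x0000_0000, 0x0000_1000),  # AXI4-Lite control/status
    "kernel_mem": (0x0000_1000, 0x0000_8000),  # weight/kernel banks
    "input_mem":  (0x0000_9000, 0x0000_8000),  # input activation banks
    "output_mem": (0x0001_1000, 0x0000_8000),  # output activation banks
}

def region_for(addr):
    """Return the name of the region containing addr, or None."""
    for name, (base, size) in NPU_REGIONS.items():
        if base <= addr < base + size:
            return name
    return None
```

Keeping the regions contiguous and power-of-two aligned makes it easier for the host to issue AXI burst transactions that stay within one region.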
Unit-level verification of NPU | Target Date | Completed Date
- Generation of the golden reference to build the verification benches for various test cases.
- Verifying NPU with Xilinx Vivado and Cadence Genus with layerwise benchmarking of CNN at the simulation level.
- Identification of timing violations (WNS/hold) on the worst paths, with the necessary updates in the RTL.
- Final verification of NPU with the test vectors after interface-level integration for power and area estimates.
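A golden reference for convolution, of the kind used to build the verification benches above, can be as simple as a naive direct loop. This sketch assumes integer activations and "valid" padding; the project's actual reference model is not shown here and may differ.

```python
import numpy as np

def conv2d_golden(x, w, stride=1):
    """Naive direct convolution used as a golden reference.
    x: (H, W, Cin) input activations; w: (K, K, Cin, Cout) kernels.
    'Valid' padding, integer accumulation in int64 to avoid overflow."""
    H, W, Cin = x.shape
    K, _, _, Cout = w.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    y = np.zeros((Ho, Wo, Cout), dtype=np.int64)
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i*stride:i*stride+K, j*stride:j*stride+K, :]
            for co in range(Cout):
                # One MAC-reduction per output element.
                y[i, j, co] = np.sum(patch.astype(np.int64) * w[:, :, :, co])
    return y
```

A deliberately slow but obviously correct model like this is compared bit-exactly against the RTL simulation output, layer by layer.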
FPGA Prototyping of NPU; Backend Runs of NPU | Target Date | Completed Date
- Benchmarking of NPU with the CNNs for throughput and latency analysis.
- Optimization of the RTL based on the benchmark results and re-run of unit-level verification.
- Gate-level simulations of NPU with Pre- and Post-Route netlists with timing analysis.
- Reoptimization of RTL and memories with verification and FPGA prototyping.
Configuration of Corstone and DMA350; Development of NPU Drivers | Target Date | Completed Date
- Generation of a valid Corstone configuration with a pros-and-cons analysis.
- Out-of-the-box testing of the entire SoC with Arm's test benches.
- Identification of the integration ports for the NPU and DMA in the NIC450 address space.
- Driver development for the NPU to convert user instructions into the NPU's ISA.
- Improvement of NPU's performance with optimizations in data path from the host.
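The driver's translation from user instructions to the NPU's ISA might be sketched as a simple pack/unpack of instruction words. The 32-bit layout and opcode names below are invented for illustration only; the real NPU ISA is not described in this post.

```python
# Hypothetical 32-bit instruction layout:
#   [31:28] opcode   [27:16] length (12 bits)   [15:0] address
OPCODES = {"LOAD_W": 0x1, "LOAD_A": 0x2, "CONV": 0x3, "STORE": 0x4}
NAMES = {v: k for k, v in OPCODES.items()}

def encode(op, length, addr):
    """Pack one driver-level command into a 32-bit NPU instruction word."""
    assert op in OPCODES and length < (1 << 12) and addr < (1 << 16)
    return (OPCODES[op] << 28) | (length << 16) | addr

def decode(word):
    """Unpack a 32-bit instruction word back into (op, length, addr)."""
    return NAMES[(word >> 28) & 0xF], (word >> 16) & 0xFFF, word & 0xFFFF
```

Round-tripping through encode/decode is a cheap way to unit-test the driver's instruction packing before it ever touches hardware.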
NPU-SoC Compiler with Neural Network Optimization | Target Date | Completed Date
- Scaling down total MACs of CNN based on hardware resources available on NPU-SoC through novel model compression techniques.
- Latency improvement with reduction in data precision of input and intermediate activations.
- Performance and accuracy improvement by eliminating computations that are not important for the current layer's outputs, identified through back-tracing of layer dependencies.
- We concluded that there is a significant reduction in total MACs, thereby lowering power consumption.
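The reduced-precision step above can be illustrated with symmetric per-tensor post-training quantization to int8. This is a generic sketch of the standard technique, not the specific compression method developed for this compiler.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training quantization.
    Returns (q, scale) such that w is approximately q * scale,
    with q stored as int8 in [-127, 127]."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale
```

Storing int8 weights instead of float32 cuts weight memory by 4x and lets the NPU's MAC array operate on narrow integers, which is where the latency improvement comes from.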
DMA350 and NPU Integration with SoC (with Verification) | Target Date | Completed Date
- DMA350 integrated into the external system harness port of the Corstone SSE, thereby adding it to the NIC450 address space.
- Functionality of the DMA350's AXI4-Stream channels verified in loopback mode.
- Integration of the NPU with the DMA350 through the AXI4-Stream channels, with a functionality check using stimulus from host0 of the Corstone.
- Testing performed along the host-OCVM-NIC450-DMA350-NPU path, then the computed values are written back to the OCVM.
- The entire CNN has been tested through the above method to obtain a performance analysis.
Arm Cortex-A53 Configuration and Synthesis | Target Date | Completed Date
- ASIC synthesis of the standalone Arm Cortex-A53 is complete.
- The area estimate has been reported as 4.9 sq. mm without using technology-dependent cells.
DMC340 (DDR Controller) and PLL Integration into SoC | Target Date
- Adding APB and AXI4 (Full) interfaces to the NIC450 using AMBA Designer to integrate the DMC340 IP.
- Development of the corresponding unit-level test cases, with full-system verification.
- Performance analysis of the system with practical workloads.
- Integration of a PLL IP to support multiple frequencies.
Final RTL Ready | Target Date
- Replacement of generic cells with technology-dependent memories.
- External-interface-based data transfers for real-time data acquisition.
- Removal of unwanted subsystem blocks and release of the final version of the RTL.
- ASIC synthesis and pre-route gate-level simulations (including LEC, lint, etc.).
DRC Checks, STA, and Physical Design [Backend Design] | Target Date
Tapeout | Target Date
Team
Comments
Use of CoreLink NIC-450
The CoreLink NIC-450 you have identified provides some Arm blocks such as QoS-400 and QVN-400 that support Quality of Service protocols for specific latency concerns and other capabilities to better manage different data flows.
You have identified data movement efficiency and latency as key issues to overall performance of your deep learning tasks. You also mention using the CPU to fragment the image. Do you see the Quality of Service as important to your design?
Welcome and an exciting project
Thanks for joining the contest and this looks like an exciting project. We look forward to seeing it develop.