Competition 2023
Competition: Hardware Implementation

Real-Time Edge AI SoC: High-Speed Low Complexity Reconfigurable-Scalable Architecture for Deep Neural Networks

Modern Convolutional Neural Networks (CNNs) are computationally and memory intensive owing to their ever-deepening structure. Because neural network requirements are constantly evolving, a reconfigurable design is crucial to tackling this difficulty. The proposed architecture adapts to the needs of the neural network and is flexible enough to compute various networks with input images of any size. System-on-Chip (SoC) development for this accelerator is required to provide a single-chip solution for CNN edge inference that incorporates data preprocessing and layerwise control over the inputs and outputs. Integrating our accelerator, the NPU (Neural Processing Unit), into an SoC enables tight coupling between the accelerator and other components, such as memory interfaces and peripherals. This makes data movement more efficient and lowers latency, which improves the overall performance of deep learning tasks. An SoC also allows the host processing system, such as a CPU or GPU, to be seamlessly integrated with the NPU.

Through this connection, computationally demanding Deep Learning (DL) tasks can be transferred to the accelerator, freeing the host CPU for other work and improving system efficiency. An SoC-based NPU also offers the advantage of future extension and upgrades: as algorithms and technology evolve, the SoC can be upgraded with improved accelerator architectures or other hardware components to keep up with developments in DL. By using customized algorithms on the CPU to fragment the input image and a dedicated controller to track the scheduling tasks executed on the NPU, this SoC can handle a range of image sizes as inputs to the CNN, making it an image-scalable and reconfigurable NPU-based SoC. Additionally, an Ethernet connection gives the SoC access to cloud resources, allowing it to apply effective neural network preprocessing methods such as network adaptation, post-training quantization, and pruning for CNNs with billions of weights in a digital-twin setup.
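
As a rough sketch of the CPU-side image fragmentation described above, the Python snippet below tiles an arbitrarily sized image into fixed blocks with a one-pixel halo so that a 3x3 convolution computed per tile remains valid when the tiles are stitched back together. The tile size, halo width, and kernel size are illustrative assumptions; the actual values depend on the NPU's on-chip buffer dimensions.

import numpy as np

TILE = 64  # assumed NPU tile height/width in pixels
HALO = 1   # overlap needed for a 3x3 convolution (kernel radius = 1)

def fragment(image):
    """Yield (row, col, tile) triples covering an arbitrary-size image.

    Each tile carries a HALO-pixel border so that a 3x3 convolution
    computed on the tile is valid everywhere inside its TILE x TILE core.
    """
    h, w = image.shape
    padded = np.pad(image, HALO, mode="edge")
    for r in range(0, h, TILE):
        for c in range(0, w, TILE):
            # Core region plus halo; edge tiles are clipped automatically.
            yield r, c, padded[r : r + TILE + 2 * HALO,
                               c : c + TILE + 2 * HALO]

# Any input resolution maps onto the same fixed-size NPU tile buffer.
img = np.random.randint(0, 256, size=(225, 300), dtype=np.uint8)
for r, c, tile in fragment(img):
    pass  # each tile would be handed to the DMA for transfer to the NPU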

Project Milestones

  1. Design Upgrades of NPU

    • Designing dedicated data paths for the NPU's control logic, kernels, and input-activation ports using AXI4-Lite and AXI4-Stream.
    • Adding DMA support to the NPU for real-time data transfers from the processor and DDR memory.
    • Distributing memory banks to process a higher number of feature kernels per image tile for better throughput.
    • Creating an address map in the NPU's main memory for burst transactions, enabling loads and fetches from the processor.
  2. Unit-level verification of NPU

    • Generation of the golden reference to build the verification benches for various test cases (see the golden-reference sketch after the milestone list).
    • Verification of the NPU with Xilinx Vivado and Cadence Genus, with layerwise benchmarking of the CNN at the simulation level.
    • Identification of timing violations (WNS/hold) on the worst paths, with the necessary RTL updates.
    • Final verification of the NPU with the test vectors after interface-level integration, for power and area estimates.
  3. FPGA Prototyping of NPU; Backend Runs of NPU

    • Benchmarking of the NPU with the CNNs for throughput and latency analysis.
    • Optimization of the RTL based on the benchmark results, and re-run of unit-level verification.
    • Gate-level simulations of the NPU with pre- and post-route netlists, with timing analysis.
    • Reoptimization of the RTL and memories, with verification and FPGA prototyping.
  4. Configuration of Corstone and DMA350; Development of NPU Drivers

    • Generation of a valid Corstone configuration with a pros-and-cons analysis.
    • Out-of-box testing of the entire SoC with Arm's testbenches.
    • Identification of the integration port for the NPU and DMA in the NIC-450 address space.
    • Driver development for the NPU to translate user instructions into the NPU's ISA (see the driver sketch after the milestone list).
    • Improvement of the NPU's performance through optimizations in the data path from the host.
  5. NPU-SoC Compiler with Neural Network Optimization

    • Scaling down the total MACs of the CNN, based on the hardware resources available on the NPU-SoC, through novel model-compression techniques (see the MAC-budget sketch after the milestone list).
    • Latency improvement through reduced data precision of the input and intermediate activations.
    • Performance and accuracy improvement by eliminating computations that do not contribute to the current layer's outputs, found by back-tracing layer dependencies.
    • A significant reduction in total MACs was observed, thereby lowering power consumption.
  6. DMA350 and NPU Integration with SoC (with Verification)

    • DMA350 integrated into the external system harness port of the Corstone-SSE, thereby adding it to the NIC-450 address space.
    • Functionality of the DMA350's AXI4-Stream channels verified in loopback mode.
    • Integration of the NPU with the DMA350 through the AXI4-Stream channels, with a functionality check using stimulus from host0 of the Corstone.
    • Testing performed along the host, OCVM, NIC-450, DMA350, NPU path, with the computed values written back to the OCVM.
    • The entire CNN has been tested through the above method to obtain a performance analysis.
  7. Arm Cortex-A53 Configuration and Synthesis

    • ASIC synthesis of the standalone Arm Cortex-A53 is complete.
    • The area estimate has been reported as 4.9 mm² without using technology-dependent cells.
  8. DMC340 (DDR Controller) and PLL Integration into SoC

    • Addition of APB and full AXI interfaces in the NIC-450 using AMBA Designer to integrate the DMC340 IP.
    • Development of the corresponding unit-level test cases, with full-system verification.
    • Performance analysis of the system with practical workloads.
    • Integration of a PLL IP to support multiple operating frequencies.
  9. Final RTL Ready

    • Replacement of generic cells with technology-dependent memories.
    • External-interface-based data transfers for real-time data acquisition.
    • Removal of unwanted subsystem blocks and release of the final version of the RTL.
    • ASIC synthesis and pre-route gate-level simulations (including LEC, lint, etc.).
  10. DRC Checks, STA, and Physical Design [Backend Design]

  11. Tapeout

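The following is a minimal sketch of the kind of golden-reference generation mentioned in milestone 2, assuming the NPU computes standard 2-D convolutions on int8 data. The file names, fixed-point shift, and 3x3 kernel are illustrative assumptions rather than the project's actual test setup.

import numpy as np

def conv2d_golden(act, weights, shift=0):
    """Bit-accurate reference: int8 activations x int8 weights -> int32,
    optionally right-shifted to model the NPU's requantization step."""
    kh, kw = weights.shape
    h, w = act.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.int32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = act[i:i + kh, j:j + kw].astype(np.int32)
            out[i, j] = int(np.sum(window * weights.astype(np.int32))) >> shift
    return out

# Dump stimulus and expected outputs for the testbench to replay.
rng = np.random.default_rng(0)
act = rng.integers(-128, 128, size=(8, 8), dtype=np.int8)
wts = rng.integers(-128, 128, size=(3, 3), dtype=np.int8)
np.savetxt("act_in.txt", act, fmt="%d")
np.savetxt("golden_out.txt", conv2d_golden(act, wts), fmt="%d")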
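
Below is a minimal sketch of the driver path from milestone 4, assuming the NPU's AXI4-Lite control interface is memory-mapped into the host's address space. The base address, register offsets, and instruction encoding are hypothetical placeholders; the project's actual register map and ISA will differ.

import mmap, os, struct

NPU_BASE   = 0x40000000  # hypothetical base address within the NIC-450 map
REG_CTRL   = 0x00        # hypothetical control register (bit 0 = start)
REG_STATUS = 0x04        # hypothetical status register (bit 0 = busy)
REG_INSTR  = 0x08        # hypothetical instruction-word register

class NpuDriver:
    """Converts a user-level layer request into NPU instruction words."""
    def __init__(self, page_size=4096):
        self.fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
        self.regs = mmap.mmap(self.fd, page_size, offset=NPU_BASE)

    def _write32(self, off, val):
        self.regs[off:off + 4] = struct.pack("<I", val)

    def _read32(self, off):
        return struct.unpack("<I", self.regs[off:off + 4])[0]

    def run_layer(self, opcode, kernel_addr, act_addr):
        # Hypothetical encoding: opcode word, then operand addresses.
        self._write32(REG_INSTR, opcode)
        self._write32(REG_INSTR, kernel_addr)
        self._write32(REG_INSTR, act_addr)
        self._write32(REG_CTRL, 1)           # kick off the layer
        while self._read32(REG_STATUS) & 1:  # poll the busy bit
            pass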
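
As a simple illustration of the MAC-budget analysis behind milestone 5, the sketch below counts the multiply-accumulates of a few convolution layers and shows the effect of uniform channel pruning. The layer shapes and 50% pruning ratio are invented for illustration and are not the project's actual model.

# A conv layer performs out_h * out_w * out_ch * in_ch * k * k MACs.
def conv_macs(out_h, out_w, in_ch, out_ch, k):
    return out_h * out_w * out_ch * in_ch * k * k

# Hypothetical 3-layer network: (out_h, out_w, in_ch, out_ch, kernel).
baseline = [(112, 112, 3, 32, 3), (56, 56, 32, 64, 3), (28, 28, 64, 128, 3)]

# Uniform 50% channel pruning halves in_ch and out_ch of inner layers,
# roughly quartering their MACs; the image input keeps in_ch = 3.
pruned = [(112, 112, 3, 16, 3), (56, 56, 16, 32, 3), (28, 28, 32, 64, 3)]

total = sum(conv_macs(*layer) for layer in baseline)
total_pruned = sum(conv_macs(*layer) for layer in pruned)
print(f"baseline MACs: {total:,}")
print(f"pruned MACs:   {total_pruned:,} ({total_pruned / total:.0%} of baseline)")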

Team

• Research Scholar: Neuromorphic IC Design & Hardware Acceleration of Deep Learning
• Research Scholar: Neuromorphic Chip Design, VLSI Architecture Design, AI/ML
• Research Assistant: Neural Networks Acceleration

Comments

The CoreLink NIC-450 you have identified provides some Arm blocks such as QoS-400 and QVN-400 that support Quality of Service protocols for specific latency concerns and other capabilities to better manage different data flows.

You have identified data movement efficiency and latency as key issues for the overall performance of your deep learning tasks. You also mention using the CPU to fragment the image. Do you see Quality of Service as important to your design?


Project Creator
Amit Acharyya

Professor at Indian Institute of Technology Hyderabad (IITH)
Research areas: VLSI systems for resource-constrained applications, Low Power Design Techniques, Machine Learning Hardware Design, Signal Processing Algorithms and VLSI Architectures, Digital Arithmetic, Biomedical Devices, AI/ML, Nanoscience & Technology
ORCID Profile
