Competition 2024
Competition: Hardware Implementation
Accelerated Tiny-Transformer IP

FPGA-Powered Acceleration for NLP Tasks

Project Overview:

Natural Language Processing (NLP) transforms how machines understand and interact with human language. Whether predicting the next word in a sentence, translating languages in real-time, or understanding contextual information from a body of text, NLP applications are increasingly prevalent in various fields such as virtual assistants, translation services, and automated customer support. To meet the growing demand for efficient and real-time NLP processing in embedded systems, we propose designing and implementing a Tiny Transformer Intellectual Property (IP) core. This core will be integrated with an ARM Cortex IP, leveraging the strengths of both the processor system (PS) and programmable logic (PL) parts of a System on Chip (SoC) to create a highly efficient solution for real-time NLP tasks.

Objectives:

1. Design and Implementation of Tiny Transformer IP:
  - Develop a compact and efficient transformer IP core using high-level synthesis (HLS), tailored for resource-constrained environments.
  - Include essential components such as an encoder, decoder, attention blocks, normalization layers, and feed-forward neural networks.

2. Integration with ARM Cortex IP:
  - Utilize the ARM Cortex IP as the processing system (PS) for handling high-level control and preprocessing tasks.
  - Integrate the Tiny Transformer IP as the programmable logic (PL) part to accelerate computationally intensive transformer operations.
  - Establish seamless communication between the PS and PL using the AXI interface.

3. System Architecture Development:
  - Implement a host CPU that interacts with the Tiny Transformer IP via PCIe and manages data flow.
  - Integrate BRAM for intermediate storage and a DDR controller for main memory access.
  - Optimize the data path and memory hierarchy to ensure low-latency and high-throughput processing.

4. Performance Evaluation:
  - Benchmark the integrated system against conventional CPU-only implementations to demonstrate improved performance.
  - Assess power consumption and resource utilization to validate the efficiency of the Tiny Transformer IP in embedded scenarios.
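
Both the attention blocks named in objective 1 and the CPU-only baseline in objective 4 can draw on a small software golden model. The following is a minimal sketch of single-head scaled dot-product attention in plain Python, written with explicit loops so that it can be compared line by line with an HLS C/C++ version. The function names, the single-head simplification, and the use of floating point are illustrative assumptions, not the project's actual implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)  # subtract the maximum so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head attention: softmax(q . k^T / sqrt(d_k)) . v

    q: list of query rows, each of length d_k
    k: list of key rows, each of length d_k
    v: list of value rows, one per key row
    """
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q_row in q:
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output column at a time.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

Outputs from the accelerator can then be compared element by element against this model, within a tolerance chosen to match the hardware's numeric precision.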

Expected Outcomes:

The successful completion of this project will result in a highly optimized Tiny Transformer IP core integrated with an ARM Cortex IP. The project will generate a complete RTL to GDSII flow, enabling the tape-out of our accelerator on a 65nm technology node. This integration will provide a robust solution for deploying transformer-based models in resource-constrained devices, enabling real-time processing of NLP tasks with significantly reduced latency and power consumption. This advancement will pave the way for sophisticated applications in IoT devices, edge computing, and mobile platforms, making advanced NLP capabilities more accessible and efficient.
 

Project Milestones

  1. Architectural Design

    Project Kickoff:

    • Define project objectives and scope.
    • Review existing technologies and research relevant to Tiny Transformers and ARM Cortex integration.
  2. Behavioural Design

    Design Phase:

    • Develop initial architecture for Tiny Transformer IP.
    • Begin high-level synthesis (HLS) of essential transformer components (encoder, decoder, attention blocks).

       

  3. Behavioural Design

    Implementation Phase:

    • Complete HLS of the input embedding components of the Tiny Transformer block.
    • Develop communication protocols between PS and PL parts of the SoC.
  4. Behavioural Design

    System Architecture Development:

    • Implement attention block and normalization block.
  5. Accelerator Design Flow

    Implemented the encoder block, with hardware utilization of 22% of LUTs and 7% of BRAM.

Team

Comments

This item on Cortex M voice solutions might be useful in determining the system requirements and some of the decisions needed in establishing the SoC Architecture.

Breaking the processing down into its core parts, such as the inputs and data requirements of the ML processing, helps to identify the system resource requirements, including the data transfers for the input and the model through the system.

There are some approximate figures, for example a "minimum of 300KB for model storage (assuming 100s of domain specific utterances)". Do you have any views on model sizes and throughput?
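
To relate a figure like this to a transformer, a rough parameter count can help. The sketch below estimates the weight storage of an encoder-only model. The layer breakdown (four attention projections with biases, a two-layer feed-forward network, two layer norms per layer, and a token embedding table), the function names, and the convention of 1 KB = 1024 bytes are all assumptions made for illustration. Positional embeddings and any output head are ignored.

```python
def encoder_param_count(vocab_size, d_model, d_ff, num_layers):
    """Approximate number of weights in an encoder-only transformer."""
    embedding = vocab_size * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two dense layers: d_model -> d_ff -> d_model, with biases.
    feed_forward = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two layer norms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return embedding + num_layers * (attention + feed_forward + layer_norms)

def model_size_kb(params, bits_per_weight):
    """Storage in KB (1 KB = 1024 bytes) at a given weight precision."""
    return params * bits_per_weight / 8 / 1024
```

For a hypothetical configuration of vocab_size=1000, d_model=64, d_ff=256 and num_layers=2, this gives 163,968 parameters: about 160 KB at 8 bits per weight and about 640 KB at 32 bits. Under these assumptions only the 8-bit variant would fit within the 300KB figure quoted above.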

We look forward to hearing from you.

On-chip memory is a critical aspect of SoC area and cost. A recent paper, "Optimizing the Deployment of Tiny Transformers on Low-Power MCUs", considered the high memory footprint of intermediate results and the frequent data marshaling involved. It discusses memory constraints and data movement techniques such as the use of Direct Memory Access (DMA).
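
As a rough way to put numbers on intermediate results, the sketch below estimates the main buffers of one encoder layer and how many DMA transfers are needed to move a buffer through a fixed-size on-chip memory. The buffer breakdown, the names, and the default of one byte per value are illustrative assumptions, not figures from the paper or from this project.

```python
import math

def intermediate_bytes(seq_len, d_model, d_ff, num_heads, bytes_per_value=1):
    """Approximate size of the main intermediate buffers in one encoder layer."""
    counts = {
        "qkv": 3 * seq_len * d_model,                      # Q, K and V matrices
        "attention_scores": num_heads * seq_len * seq_len,  # one score matrix per head
        "ffn_hidden": seq_len * d_ff,                      # feed-forward hidden layer
    }
    return {name: n * bytes_per_value for name, n in counts.items()}

def dma_transfers(total_bytes, buffer_bytes):
    """Number of transfers needed to stream total_bytes through one buffer."""
    return math.ceil(total_bytes / buffer_bytes)
```

Note that the attention score buffer grows with the square of the sequence length, so the sequence length has a strong effect on how much on-chip memory is needed.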

If you can provide some idea of data sizes and movement requirements it will help. 

 

 

Prototype with Zynq: We'll start by prototyping the IP on the Zynq MPSoC FPGA, integrating the HLS-generated IP with the Zynq and other IP.
PYNQ Overlay: Create a PYNQ overlay for easy design space exploration and benchmarking.
Application Testing: Run the application to evaluate performance and make necessary adjustments.
This approach will allow us to iterate quickly and gather critical insights before moving to the physical design process.
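
The benchmarking step can be sketched independently of the hardware. The helper below times any callable, so the same code could wrap a CPU-only model or a call into the overlay. The function names, run counts, and the energy estimate (average power multiplied by latency) are assumptions for illustration only.

```python
import statistics
import time

def measure_latency(fn, runs=20, warmup=3):
    """Return the median wall-clock latency of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s

def energy_per_inference_j(avg_power_w, latency_s):
    """Energy in joules, assuming constant average power over the run."""
    return avg_power_w * latency_s
```

The median is used rather than the mean so that occasional slow runs, for example from operating system scheduling, have less influence on the reported figure.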
 

At present the model is quite large, on the order of megabytes, so we are trying to reduce its size using some techniques, and we would like to know what the ideal model size should be. Would we be good to go, given that there would be some trade-offs in accuracy?
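
One common size-reduction technique is weight quantization. The comment above does not say which techniques are in use, so the following is only an illustrative sketch of symmetric per-tensor 8-bit quantization, with hypothetical function names. It shows where the accuracy trade-off comes from: each weight is stored as an integer in [-127, 127] plus one shared scale, and the rounding error per weight is at most half the scale.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 range."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [x * scale for x in q]
```

Going from 32-bit floats to 8-bit integers reduces weight storage by roughly a factor of four, ignoring the small overhead of storing the scale.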

As you develop the design options for the communication protocols between various parts of the SoC you should consider the necessary SoC infrastructure beyond the FPGA implementation when the project moves towards the stages that will 'generate a complete RTL to GDSII flow, enabling the tape-out of our accelerator on a 65nm technology node'. 

Accelerator Design Flow milestone diagram

In later stages there is no separation between PS and PL parts. The operational characteristics of BRAM differ from those of actual on-chip 65nm SRAM blocks. In the diagram above, the interaction between the DRAM (do we assume this memory is off chip, not on the same ASIC die as the main system?) and the memory (SRAM blocks) close to the transformer IP needs a clearer data movement strategy for the SoC. The nanoSoC designs taped out so far on the 65nm node, using the nanoSoC reference design, target small data volumes and small accelerators with an on-chip SoC architecture, and have used a variety of DMA approaches. Depending on the model sizes you are planning, a more capable architecture may be required. You might want to look at last year's entry from IITH and the comment I made about on-chip memory requirements.

I hope this helps.

Following up on the comments from John D, I suggest there are three distinct phases for this project:

  1. HLS synthesis and validation and benchmarking in (Zynq) FPGA which has ample Processor and DDR Memory resources
  2. Partitioning analysis for accelerator, bulk memory and control processing - both for bandwidth and area
  3. Feasibility study for PCIe (hardware and driver stack), DDR memory and on-chip memory/processing (which would likely require specialized PHY and advanced semiconductor technology)

I think the project proposal as it stands would be very hard to complete successfully within a limited time (and budget) if the assumption is simply to map the transformer and memory system architecture to a 65nm process technology(?).

Project Creator
Abhishek

at Indian Institute of Technology Jodhpur

Technology

Accelerators
