Member for: 3 months 3 weeks
Name:
Points: 50
SoC Labs Roles: Registered User

Projects

Title: FPGA-Powered Acceleration for NLP Tasks
Updated date: 3 weeks 2 days ago
Comment count: 12

Articles

Interests

Design Flow

Technology

Authored Comments

Subject Comment Link to Comment
Tiny-Trans

Team Members:

Abhishek Yadav (yadav.49@iitj.ac.in)

Ayush Dixit (m23eev006@iitj.ac.in)

Binod Kumar (binod@iitj.ac.in)

view
FPGA Prototyping

Prototype with Zynq: We'll start by prototyping the IP on the Zynq MPSoC FPGA, integrating the HLS-generated IP with the Zynq processing system and other IP blocks.
PYNQ Overlay: Create a PYNQ overlay for easy design space exploration and benchmarking.
Application Testing: Run the application to evaluate performance and make necessary adjustments.
This approach will allow us to iterate quickly and gather critical insights before moving to the physical design process.
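For the application-testing and benchmarking step, a minimal latency harness might look like the sketch below. It is a pure-software sketch: `reference_kernel` is a hypothetical placeholder for the call into the accelerator (e.g. through a PYNQ overlay driver), used here so the harness itself can run anywhere.

```python
import time
import statistics

def reference_kernel(x):
    # Placeholder for the accelerator invocation (e.g. a PYNQ overlay call);
    # here it just squares each element in software.
    return [v * v for v in x]

def benchmark(fn, data, runs=100):
    """Return the median latency in seconds over `runs` invocations."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

lat = benchmark(reference_kernel, list(range(1024)))
print(f"median latency: {lat * 1e6:.1f} us")
```

Taking the median rather than the mean makes the figure robust against occasional scheduler hiccups on the PS-side Linux.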
 

As of now the model size is quite large (on the order of MBs), so we are trying to reduce it using some compression techniques, and we would like to know what the ideal model size should be. Would we be good to go, given that there will be some trade-offs in accuracy?
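On the model-size question, a quick back-of-envelope footprint calculation helps frame the quantisation trade-off. The 10M parameter count below is a hypothetical figure for illustration, not the actual Tiny-Trans size.

```python
# Back-of-envelope weight-storage footprint at different precisions.

def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Return the weight-storage footprint in MB (1 MB = 2**20 bytes)."""
    return num_params * bits_per_weight / 8 / 2**20

params = 10_000_000  # hypothetical 10M-parameter transformer

fp32 = model_size_mb(params, 32)   # ~38.1 MB
int8 = model_size_mb(params, 8)    # ~9.5 MB
int4 = model_size_mb(params, 4)    # ~4.8 MB

print(f"fp32: {fp32:.1f} MB, int8: {int8:.1f} MB, int4: {int4:.1f} MB")
```

Quantising from fp32 to int8 cuts the footprint by 4x before any pruning, which is why most FPGA accelerator flows quantise first and prune second.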

view
Query regarding interfacing an Arm SoC with an accelerator

I have been working with Xilinx PYNQ and the Zynq-7000 SoC to implement an accelerator using memory-mapped AXI interfaces between the PS and PL parts of the SoC.
I would like to understand how to interface my accelerator with the SoC architecture in this new setup. Specifically:
1) How can I replicate the memory-mapped AXI communication approach that I used on the FPGA in an ASIC flow?
2) Which Arm SoC would be most suitable for integrating a custom accelerator, with good support for AXI interfaces, data transfers, and power/performance optimisation, and what is the procedure for interfacing it with the accelerator?
3) What modifications are needed to efficiently integrate the accelerator with the SoC's memory and CPU subsystems?
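On the first question, the software-visible register protocol is largely the same in FPGA and ASIC flows; what changes is the bus fabric and base address, not the driver sequence. Below is a minimal Python model of the Vitis-HLS-style `ap_ctrl` start/done handshake. The register offsets (0x00 control, 0x10/0x18 argument pointers) follow the usual HLS convention but should be checked against the generated driver header for a real IP, and `FakeAccel` is a stand-in for PYNQ's `MMIO` object (FPGA) or a mapped peripheral window (ASIC).

```python
# Software model of a memory-mapped accelerator handshake.
AP_START, AP_DONE = 0x1, 0x2

class FakeAccel:
    """Stand-in for a memory-mapped accelerator: doubles each input value."""
    def __init__(self):
        self.regs = {}
        self.mem = {}          # models DMA-visible memory

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == 0x00 and value & AP_START:
            src = self.regs[0x10]
            dst = self.regs[0x18]
            self.mem[dst] = [2 * x for x in self.mem[src]]
            self.regs[0x00] = AP_DONE   # model finishes "instantly"

    def read(self, offset):
        return self.regs.get(offset, 0)

def run(accel, data, src=0x1000, dst=0x2000):
    accel.mem[src] = data
    accel.write(0x10, src)          # input buffer physical address
    accel.write(0x18, dst)          # output buffer physical address
    accel.write(0x00, AP_START)     # kick off the kernel
    while not accel.read(0x00) & AP_DONE:
        pass                        # poll the status bit until done
    return accel.mem[dst]

print(run(FakeAccel(), [1, 2, 3]))  # [2, 4, 6]
```

In an ASIC flow the same write/poll sequence would go through a pointer into the peripheral's AXI-Lite address window instead of a PYNQ `MMIO` handle, so the driver code ports over largely unchanged.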

Additionally, I am familiar with HLS flows such as hls4ml and frameworks like TensorFlow for developing accelerators, but they do not currently support transformers.
We would be happy to schedule a video call at your convenience to discuss interfacing the Arm SoC with the accelerator.

view

User statistics

My contributions: 6
My comments: 5
Overall contributor: #21

Comments

Hello,

It would be good to understand what is of interest for you in SoC Labs. We look forward to hearing from you. You can simply reply to this comment to let us know.

John.

You asked about the model size in MBs and the ideal size to target versus the trade-off in accuracy.

On-chip SRAM is a limiting factor in SoC design due to the high area cost of SRAM. While the hierarchical memory system for classical compute has been optimised, from off-chip DRAM all the way through the cache levels, the same is not true for custom acceleration. One approach we are using to reduce fabrication costs is chiplet-based SRAM dies, which can be added to an SoC from a stock of pre-fabricated dies rather than adding to the die cost of a custom accelerator.

Classical compute caches are in the low-MB range.
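To put that in parameter-count terms, a low-MB SRAM budget translates into a weight budget as sketched below. The 2 MB figure is an illustrative assumption, not a SoC Labs recommendation.

```python
# How many weights fit in a given on-chip SRAM budget at each precision.

def params_that_fit(sram_mb: float, bits_per_weight: int) -> int:
    """Return the number of weights a budget of sram_mb MB can hold."""
    return int(sram_mb * 2**20 * 8 // bits_per_weight)

budget = 2.0  # MB, hypothetical on-chip SRAM budget
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit weights: {params_that_fit(budget, bits):,} params")
```

So a 2 MB budget holds roughly half a million fp32 weights but about two million int8 weights, which is why aggressive quantisation is usually the first step before considering off-chip or chiplet SRAM.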
