FastkNN: A hardware implementation of a k-Nearest Neighbours classifier for accelerated inference
Project Motivation and Goals
The k-Nearest Neighbours (kNN) algorithm is a popular Machine Learning technique that can be used for a variety of supervised classification tasks. In contrast to other machine learning algorithms, which "encode" the knowledge gained from training data into a set of parameters such as weights and biases, the parameter set of a kNN classifier consists simply of the labelled training examples. An unlabelled example is classified by calculating its Euclidean distance (or another distance metric) from all the stored training examples; its label is then determined by a majority vote over the labels of the k closest examples.
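The classification rule described above can be sketched in a few lines of pure-software Python (a reference sketch only, not the accelerated implementation; the function name and data layout are illustrative):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled
    examples. `train` is a list of (vector, label) pairs."""
    # Sort stored examples by Euclidean distance from the query
    scored = sorted(train, key=lambda ex: math.dist(ex[0], query))
    # Majority vote over the labels of the k closest examples
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

The sorted scan over every stored example is exactly the distance-computation bottleneck that the accelerator is intended to address.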
In contrast to other Machine Learning methods, kNN classification doesn't require an explicit training phase, which makes it a potential candidate for learning directly from data captured by an embedded system. However, when implemented in software, the computation of the distance metrics between a new unclassified example and the labelled examples stored as the classifier's parameter set acts as a bottleneck in the classification process. By developing dedicated hardware to calculate the required distance metrics, it is possible to reduce inference time.
The goal of this project is to build a hardware implementation of a kNN classifier that achieves reduced inference time compared to a kNN classifier implemented fully in software, through the parallelisation of distance metric computations. The project has an educational purpose, and it is expected that by its completion the contributors will have gone through the complete design cycle of an SoC.
Design Specification
The first step in defining the design specification for the accelerator was to determine the type of data it needs to process. The Fashion-MNIST dataset has been selected as the primary data source; therefore, the accelerator should be able to handle grayscale image vectors of 28 x 28 pixels. We assume that some of these images will be labelled (the training set), while the rest will not (the test set). The accelerator should help compute the distance between every labelled image and each unknown image in the unlabelled set.
Data to be processed: N 28 x 28 8-bit (grayscale) labelled image vectors and M unlabelled image vectors, stored in SRAM.
Image vectors a^{(i)}, b^{(j)} have D elements, where D = 28 x 28 = 784. Each element is represented as an 8-bit unsigned integer value. If N = 200 and M = 50, the total storage requirement for matrices A and B is:
N x 785 + M x 784 = 200 x 785 + 50 x 784 = 196,200 bytes, which is approximately 192 kB. One byte of storage is assumed for each label, which is why the number of labelled examples N is multiplied by 785 rather than 784.
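The storage arithmetic above can be reproduced with a quick sketch (variable names are illustrative):

```python
D = 28 * 28          # elements per image vector (784)
N, M = 200, 50       # labelled / unlabelled example counts

# One byte per pixel, plus one extra byte per labelled example for its label
labelled_bytes = N * (D + 1)    # 200 * 785 = 157,000
unlabelled_bytes = M * D        # 50 * 784  =  39,200
total = labelled_bytes + unlabelled_bytes

print(total)         # 196200 bytes
print(total / 1024)  # ~191.6 KiB, i.e. roughly 192 kB
```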
We would like to compute the squared Euclidean distance between each pair of vectors a^{(i)}, b^{(j)}. This computation expands into a sum of three dot products, as shown below:

d^2(a^{(i)}, b^{(j)}) = ||a^{(i)} - b^{(j)}||^2 = a^{(i)}⋅a^{(i)} - 2 a^{(i)}⋅b^{(j)} + b^{(j)}⋅b^{(j)}
The self-dot products a^{(i)}⋅a^{(i)} and b^{(j)}⋅b^{(j)} need only be computed once each and can then be reused across all subsequent distance computations.
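As a sanity check, a short sketch can confirm that the three-dot-product expansion matches the direct squared-distance computation (function names are illustrative):

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sq_euclidean_direct(a, b):
    # Direct element-wise form: sum of squared differences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sq_euclidean_dots(a, b):
    # Expanded form: a.a - 2 a.b + b.b
    return dot(a, a) - 2 * dot(a, b) + dot(b, b)

random.seed(0)
a = [random.randrange(256) for _ in range(784)]  # 8-bit pixel values
b = [random.randrange(256) for _ in range(784)]
assert sq_euclidean_direct(a, b) == sq_euclidean_dots(a, b)
```

The expanded form is what makes the self-dot-product caching worthwhile: for N x M pairings, each a^{(i)}⋅a^{(i)} and b^{(j)}⋅b^{(j)} is computed once, leaving only the cross term per pair.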
Design Rationale
Because of the high dimensionality of the image vectors, it is proposed to divide the computation of the full dot products into a series of rounds. In each round, partial dot products are computed over a small number of elements. The partial dot product from each round can be cached in an accumulator and reused in the next round to incrementally build up the final full dot product. In the last round, the full dot products are moved to SRAM.
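The round-based accumulation scheme might be modelled behaviourally as follows (a software sketch of the hardware accumulator, with an illustrative function name and a configurable round width):

```python
def dot_in_rounds(a, b, elems_per_round=4):
    """Compute a full dot product as a series of partial dot products,
    accumulating the running sum across rounds as the hardware would."""
    assert len(a) == len(b)
    acc = 0  # models the accumulator register
    for start in range(0, len(a), elems_per_round):
        # One round: partial dot product over a small slice of elements
        partial = sum(x * y for x, y in zip(a[start:start + elems_per_round],
                                            b[start:start + elems_per_round]))
        acc += partial  # cache and reuse in the next round
    return acc  # after the last round, this value is written back to SRAM
```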
In cases where the full self-dot products are already known, the accelerator modules that compute them should be disabled through configuration bits.
Accelerator Input Data Packet
Each round of computation operates on 4 vector elements from each image vector. At 4 elements per round, the complete computation finishes after 784 / 4 = 196 rounds. Each input packet for the accelerator wrapper is built from 32-bit words, so a data packet requires (2 x 4 x 8) / 32 = 2 words.
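One possible packing of a round's data into the two 32-bit words could look like this (the word assignment and the byte ordering within a word are assumptions for illustration, not part of the specification):

```python
def pack_round(a_elems, b_elems):
    """Pack one round's data (4 pixels from each vector, 8 bits each)
    into two 32-bit words: word 0 from vector a, word 1 from vector b."""
    def pack_word(elems):
        word = 0
        for i, px in enumerate(elems):
            assert 0 <= px < 256          # 8-bit unsigned pixel value
            word |= px << (8 * i)         # little-endian byte order (assumed)
        return word
    return pack_word(a_elems), pack_word(b_elems)
```

For example, `pack_round([1, 2, 3, 4], [5, 6, 7, 8])` yields the word pair `(0x04030201, 0x08070605)` under this assumed layout.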
Project Milestones

First abstract description of project goals
Rough outline of the project's goals, hardware accelerator idea and motivation.

Architectural description and familiarisation with design tools
 Specify the number format used for the arithmetic operations of the accelerator (i.e. 8-bit unsigned integers for grayscale images).
 Select the distance metric used for kNN classification (i.e. Euclidean distances).
 Determine the type of building blocks (i.e. adders, multipliers) required to perform the arithmetic operations.
 Capture the architectural design in block diagram format, specifying the necessary inputs, outputs, registers and the overall data flow.
 Complete the lab exercises from the "Introduction to SoC design" course to gain familiarity with Xilinx Vivado.

Milestone #3

Milestone #4

Milestone #5

Milestone #6
Team
Comments
How did you get on with your May 1st milestone?
Hi,
It would be great if you could let us know how you are getting on with your May 1st milestone, and whether you need any help planning the next milestones, or any support with the generic reference design technology or the design flow environment.
Hi, I've been having some…
Hi, I've been having some issues while trying to familiarise myself with the tools introduced in the labs of the Intro to SoC design course offered by Arm Education Media. The presented workflow introduces Xilinx Vivado and Arm Keil, both of which require a Windows environment (I think Vivado can also run on Linux). As a Mac user, I have had trouble running the software. I have tried using the Soclabs servers, but the interface sometimes becomes unresponsive. I have been told that this might be caused by a screensaver setting, but I still need to check.
Hopefully the advice on the…
Hopefully the advice on the local server has got you past that problem, and you can move on to the process of design. The process of design, like the design itself, can benefit from reuse. Design is a conceptual process, and we often reuse techniques such as top-down decomposition or bottom-up composition, iterating between them to build our understanding. We can arm ourselves with questions that help rationalise the conceptual design.
In this case we have a known approach, the k-Nearest Neighbours (kNN) algorithm, and you have started to decompose it (top-down). You also need to consider how you will calculate distances (bottom-up). What is needed for each calculation, and how does the SoC deliver that so the calculations can be undertaken efficiently? What limitations does the SoC system impose, and how best do we translate the mathematical algorithm onto it?
In the example accelerator we show some custom logic for performing calculations: data moves across the bus of the SoC into the accelerator, and results come back out. In your case, how symmetric is the data needed versus the result, and does that have any implications? What is the pattern of data flowing across the bus? Asking such questions helps develop our understanding of the design. Design is as much about generating questions as it is about generating answers.
Perhaps you can share how you are developing your conceptual understanding of the Architectural Design space?
Design Flow help
David and Daniel have added some items in https://soclabs.org/designflow/socdesigncontest2023exampleflow that pose some helpful questions, such as: can we at this stage identify any 'system bottlenecks'? Do other design decisions, such as the use of the 'reference design', constrain our design space, e.g. how much data can we flow through the accelerator, and what memory constraints/organisation apply? What area are we looking to deploy? How many calculations can we undertake in parallel? We can start to lay out questions and answers as part of the conceptual understanding of a design.
Architectural design phase
It is good to see you working through the Architectural design phase and building the conceptual understanding of the design and design space for potential alternative designs.
It is interesting to consider the flow of 32-bit words over the bus into the accelerator, how these are structured compared to the original view of the labelled and unknown examples, how you can sequence the calculations, and which intermediate results you can hold. Unlike the example SHA-2 accelerator, which needs data aggregated from 32-bit words before it can undertake a calculation, in this case you may be able to undertake calculation more efficiently.
Patterns and how they help
Identifying both Data Patterns and Design Patterns can really help in the design phase. Look at how the data is structured in the calculations, in relation to the labelled data set and the unknown examples, how they interact, and the implications for their storage in memory and their flow across the SoC. That will help identify the Design Patterns for the IP blocks.
Data Patterns
Does the Fashion-MNIST dataset provide any natural advantages for processing? For example, does the background being black mean that the value 0 represents the background?
Milestone Planning
Hi,
It would be good to try to lay out your initial plans for milestones. You might want to look at the steps in the generic design flows and allocate them to milestones. You might also want to look at the SoC Design Contest 2023 example flow for some help.
John.