Artificial Neural Network Implementation in FPGA

Ramya BN 1, Dr. Seema Singh 2

1. Assistant Professor, EWIT, Bangalore

2. Associate Professor, BMSIT, Bangalore

Abstract – The capability of neural networks to mimic the structures and operating principles found in the information-processing systems of humans and other living creatures has made the Artificial Neural Network (ANN) a technical folk legend. Using an FPGA (Field Programmable Gate Array) for neural network implementation provides flexibility in programmable systems. For real-time, neural-network-based instrument prototypes, conventional application-specific VLSI neural chip design suffers limitations in time and cost. Convolutional neural networks (CNNs) outperform older methods in accuracy, but require vast amounts of computation and memory. Here we implement a basic ANN in an FPGA. An FPGA implementation exploits parallelism, which speeds up processing time, and the hardware implementation also saves power. Research in machine learning demonstrates the potential of very low precision CNNs, i.e., CNNs with binarized weights and activations. In this paper, we present the design of a BNN accelerator that is synthesized using Verilog. Finally, the results are simulated in Xilinx tools to evaluate cost, performance, and power consumption.

Introduction

An artificial neural network (ANN) is a model inspired by the biological neural network of the human brain. An ANN is based on a set of algorithms that attempt to model high-level abstractions in data using multiple processing layers, and it can automatically infer rules for expected results. One of the main arguments for hardware implementation is the exploitation of the parallelism inherent in neural networks, which can be very fast, especially for well-defined signal processing tasks. The rest of the paper is organized as follows. Section 2 describes artificial neural networks. The proposed design is given in Section 3. Section 4 presents the system architecture, Section 5 gives the experimental results, and Section 6 gives conclusions and future enhancements.

2. Artificial Neural Network

An ANN is an information-processing system wherein neurons process information. The main components of an artificial neuron are illustrated in Figure 2.1. The total synaptic input, a, to the neuron is given by the inner product of the input and weight vectors:

a = Σ_{j=1}^{n} w_j p_j

where we assume that the threshold of the activation is incorporated in the weight vector.
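The weighted-sum computation above is simple to sketch in software. The following is a minimal illustration (hypothetical names, not the paper's hardware), with the threshold folded in as a weight on a constant +1 input, exactly as assumed above:

```python
# Sketch of the total synaptic input a = sum_j w_j * p_j, with the
# threshold (bias) incorporated as an extra weight w_0 paired with a
# constant input p_0 = 1. Names are illustrative.
def synaptic_input(weights, inputs):
    """Inner product of the weight and input vectors."""
    return sum(w * p for w, p in zip(weights, inputs))

# A bias of -0.5 incorporated as a weight on a constant +1 input:
weights = [-0.5, 0.2, 0.8]   # [bias, w1, w2]
inputs  = [1.0, 1.0, 1.0]    # [1, p1, p2]
a = synaptic_input(weights, inputs)  # -0.5 + 0.2 + 0.8 = 0.5
```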

Fig 2.1: Artificial Neural Network

An important characteristic of an ANN is that its behavior is modified through the existing connections between neurons. This modification changes the values associated with those connections: the network learns by modifying the values of its weights. The weights are associated with the synapses and can increase or decrease the signals that arrive at a synapse. When a spike arrives, the neuron has basically two functions: to generate a potential according to the input, and to check whether the sum of all potentials at this instant of time exceeds a threshold; in that case, it generates a pulse through the axon.

3. Proposed Method

Proposed design is divided into four major units such as Input Unit, Firing Unit, Control Unit and Output Unit as shown in figure 3.1.

Figure3.1 Proposed design of ANN

Our proposed design has five basic inputs, as shown in Figure 3.1: IN1 and IN2, referred to as inputs; W1 and W2, referred to as dendrites from other neurons along with their connection weights; and finally a clock signal. All input signals are digital, so the clock is used for overall system synchronization. The single output corresponds to the axon: when the neuron fires, the output generates a pulse that propagates to other neurons through the axon. The aim of the processing unit of a neuron is to generate a potential according to the arrival of spikes and to evaluate this potential to generate an output.

Firing Unit – This unit is determined by the exponential function

f(x) = x · e^{-x}, where x is the input.

Hence the neuron is said to be fired or potentially activated.

Control Unit – Inputs IN1 and IN2 are summed together and, as a result, a pulse is generated. If the potential is over the threshold, the output voltage returns a value and the resting potential is held for the refractory time. Input Unit – The initial unit consists of inputs IN1 and IN2 with a control clock. A timer is also used in this unit: during the refractory time, the neuron ignores the arrival of spikes from other cells. To do this, the timer waits 2 milliseconds for the arrival of pulses. The pseudocode for ANN firing is as follows:

Pseudo Code:

Fire = 2 + ?;

if (Fire) begin
    Neuron = 1;
end
else begin
    Neuron = 0;
end

The algorithm of our design focuses on Q-learning. Q-learning is a reinforcement learning technique that selects an optimal action based on an action-selection function called the Q-function. The Q-function determines the utility of selecting an action: it accounts for the optimal action (a) selected in a state (s) that leads to the maximum discounted future reward when the optimal action is selected from that point onward. The Q-function can be represented as follows:

Q(s_t, a_t) = max R_{t+1}

The future rewards are obtained through continuous iterations: selecting one of the actions based on the existing Q-values, performing the action, and updating the Q-function in the current state. Let us assume that we pick the optimal action in the future based on the maximum Q-value. The following equation gives the action policy selected based on the maximum Q-value:

a(s) = argmax_a Q_{t+1}(s, a)

After selecting an optimal action, we move on to a new state, obtaining rewards during this process. Let us assume the new state to be s_{t+1}; the optimal value for the next state is found by iterating through all the Q-values in that state.
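The iteration just described (select the argmax action, observe a reward, then move the current Q-value toward the reward plus the discounted maximum Q-value of the next state) can be sketched in a few lines. This is an illustrative tabular version with hypothetical names, not the hardware implementation:

```python
# Illustrative tabular Q-learning step. Q is a list of per-state lists
# of Q-values (states x actions); gamma is the discount factor and lr
# the learning rate. Names and values are assumptions for illustration.
def select_action(Q, s):
    # action policy: argmax over the Q-values in state s
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def q_update(Q, s, a, r, s_next, gamma=0.9, lr=0.5):
    # Q-error = r + gamma * max_a' Q(s', a') - Q(s, a)
    q_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += lr * q_error     # move Q(s, a) toward the target
    return q_error

Q = [[0.0, 0.0], [1.0, 0.5]]            # 2 states x 2 actions
best = select_action(Q, 1)               # action 0 (Q[1][0] = 1.0)
err = q_update(Q, 0, 0, r=1.0, s_next=1) # error = 1 + 0.9*1.0 - 0 = 1.9
```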

3.1 Optimized BNN Model

As with the design of conventional CNN accelerators, a key optimization we made to the BNN model is parameter quantization. While the weights are already binarized, the biases and batch norm parameters are real numbers. When quantized, the bias magnitudes were all less than 1. Given that the inputs have magnitude 1, we tried setting the biases to zero and observed no effect on accuracy. We then retrained the network with biases removed from the model and reached a test error of 11.32%. For the rest of the paper we use this as the baseline error rate.

Furthermore, the BNN always binarizes immediately after batch norm. Thus we do not need the magnitude of y, only the sign, which allows us to scale k and h by any multiplicative constant. We exploit this property during quantization by scaling each k and h to be within the representable range of our fixed-point implementation. Empirical testing showed that k and h can be quantized to 16 bits with negligible accuracy loss, while being a good fit for power-of-2 word sizes.
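A rough sketch of how such a quantization might work: since only the sign of k·y + h matters, both parameters can be scaled by one common positive constant until they fit the 16-bit fixed-point range. The scheme and names below are assumptions for illustration, not the paper's exact procedure:

```python
# Assumed scheme: scale (k, h) by one positive factor so both fit a
# signed 16-bit fixed-point word, then round. A common positive scale
# preserves the sign of k*y + h, which is all the BNN needs.
def quantize_bn(k, h, frac_bits=8, word_bits=16):
    lim = 2 ** (word_bits - 1) - 1            # max signed magnitude: 32767
    scale = 1.0
    m = max(abs(k), abs(h)) * (1 << frac_bits)
    if m > lim:                                # shrink into range if needed
        scale = lim / m
    kq = round(k * scale * (1 << frac_bits))
    hq = round(h * scale * (1 << frac_bits))
    return kq, hq

kq, hq = quantize_bn(3.5, -1.25)
# sign(kq*x + hq) agrees with sign(3.5*x - 1.25) for x = +1 and x = -1
assert (kq * 1 + hq > 0) == (3.5 * 1 - 1.25 > 0)
assert (kq * -1 + hq > 0) == (3.5 * -1 - 1.25 > 0)
```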

One complication in the BinaryNet model is the interaction between binarization and edge padding. The model binarizes each activation to -1 or +1, but each input fmap is edge padded with zeros, meaning that a convolution can see up to 3 values: -1, 0, or +1. Thus the BinaryNet model actually requires some 2-bit operators (though the fmap data can still be stored in binary form). The +1 padded BNN achieves a test error of 12.27% in FPGA.

For our FPGA implementation we used the 0 padded BNN, as the resource savings of the +1 padded version were not particularly relevant for the target device.

Source   Model                  Padding   Test Error

FPGA     no-bias, fixed-point   0         11.46%
FPGA     no-bias, fixed-point   +1        12.27%

Table 1: Accuracy of the BNN with various changes. No-bias refers to retraining after removing biases from all layers, and fixed-point refers to quantization of the inputs and batch norm parameters.

4. System Architecture

4.1 Perceptron Q Learning Accelerator

A perceptron is a single neuron in a multi-layer neural network. It has a multi-dimensional input and a single output. A perceptron contains a weight for each of its inputs and a single bias, as shown in Figure 4.1.

Figure 4.1. Perceptron architecture

The weighted sum of all the inputs is calculated using Equation 5, which is modeled as a combination of a multiplier and an accumulator in hardware:

a = Σ_{i ∈ N} w_i x_i (5)

The output of a perceptron, also called the firing rate, is calculated by passing the value of a through the activation function as follows:

φ = f(a) = 1 / (1 + e^{-a}) (6)

Many hardware schematics exist for implementing the activation function. The size of the ROM plays a major role in the accuracy of the output value: as the resolution of the stored values increases, the lookup time also increases.
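The accuracy/lookup trade-off can be illustrated with a small software model of a ROM-based sigmoid: the table is pre-computed offline, and a runtime evaluation reduces to an address calculation and a single lookup. The table depth and input range below are arbitrary choices for illustration:

```python
import math

# Hypothetical ROM-based activation: pre-compute sigmoid values over a
# fixed input range; at runtime, a table lookup replaces the exponential.
STEPS, LO, HI = 256, -8.0, 8.0
ROM = [1.0 / (1.0 + math.exp(-(LO + (HI - LO) * i / (STEPS - 1))))
       for i in range(STEPS)]

def sigmoid_lut(x):
    x = min(max(x, LO), HI)                        # clamp to table range
    i = round((x - LO) / (HI - LO) * (STEPS - 1))  # nearest ROM address
    return ROM[i]

# A deeper ROM (larger STEPS) gives better accuracy at the cost of a
# larger memory and, in hardware, potentially a slower lookup.
```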

Figure 4.2 Feed Forward Step Hardware Schematic

The feed-forward step is run twice to update a single Q-value: once to calculate the Q-values in the current state, and once to calculate the Q-values in the next state. During step 4, the error is determined by buffering out all the FIFO Q-values of the current and next state in parallel, calculating the maximum value of the next-state Q-values, and applying Equation 4. The hardware schematic of our implementation is shown in Figure 4.3.

Figure 4.3 Error Capture Hardware Schematic Block

The control and data path implementation for the module is shown in Figure 4.4. Two buffers have been implemented to store the Q-values for all actions in a state: one stores the Q-values for the current state and the other stores the Q-values for the next state. For a perceptron, a single backpropagation block is implemented to update the weights and biases of the neural network.

Figure 4.4 Control and Data path for single neuron architecture

Q-learning differs from other supervised learning in that backpropagation happens not through predefined errors but through estimates of the error. The propagated error value is given by Equation 7, where f'(a) is the derivative of the sigmoid function. The derivative of the sigmoid is also implemented using a lookup table (ROM) with a pre-calculated set of values.

δ = f'(a) · Q_error (7)

The Q-error is propagated backwards and is used to train the neural network. The value of the Q-error is determined by Equation 8, where Q(t+1) are the Q-values present in the next-state buffer.

Q_error = r_t + γ · max Q(t+1) − Q(s, a) (8)

The weights are updated using the following equations, where C is the learning factor and O is the input associated with the weight:

ΔW = C (δ · O) (9)

W_new = W_old + ΔW (10)
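Taken together with Equation 7, the update step amounts to a few multiply-adds per weight. The following is a minimal software sketch of that step; the names, the learning factor C, and the sigmoid-derivative form f'(a) = φ(1 − φ) are standard conventions used here for illustration, not details taken from the hardware:

```python
# Sketch of the weight update: delta = f'(a) * Q_error,
# dW = C * delta * O, W_new = W_old + dW, where O is the input seen by
# each weight and C is the learning factor. Names are illustrative.
def sigmoid_derivative(phi):
    # derivative expressed via the activation value phi = f(a)
    return phi * (1.0 - phi)

def update_weights(weights, inputs, phi, q_error, C=0.1):
    delta = sigmoid_derivative(phi) * q_error
    return [w + C * delta * o for w, o in zip(weights, inputs)]

w_new = update_weights([0.5, -0.2], [1.0, 1.0], phi=0.5, q_error=1.0)
# delta = 0.25, so each weight moves by C * 0.25 * 1.0 = 0.025
```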

The weights are updated by reading the weight values from the buffer, updating them using Equation 9, and writing them back to the FIFO buffer. A schematic of the implemented architecture for Q-learning is presented in Figure 4.5. Throughput values differ between the fixed- and floating-point implementations of the architecture, as shown in Table 2 below.

Figure 4.5 Single Neuron Q-learning accelerator architecture

Architecture             Throughput

Fixed Point Simple       2340 kQ/second
Floating Point Simple    290 kQ/second
Fixed Point Complex      530 kQ/second
Floating Point Complex   10 kQ/second

Table 2: Throughput Calculation

4.2 Binarization model

Figure 4.2.1: Binarization model (Convolution → Batch Norm → Pool → Binarize, transforming N integer maps into M binary maps)

A BNN is essentially a CNN whose weights and fmap pixels are binarized to -1 or +1; BNNs can be seen as an extreme example of the quantized, reduced-precision CNN models commonly used for hardware acceleration. In this paper, the weights are binarized along with the fmaps; we focus on this version and refer to it as the BinaryNet architecture/model. This architecture achieves near state-of-the-art results on CIFAR-10. In the BinaryNet model, the weights and outputs of both conv and FC layers are binarized using the Sign function (i.e., positive values are set to +1 and negative values to -1).
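The Sign binarization itself is a one-line operation, sketched below. The convention for exact zeros is an assumption here, since the text only specifies positive and negative values:

```python
# Binarization with the Sign function as described above: positive
# values map to +1 and negative values to -1 (zero is mapped to +1 by
# convention here; the paper does not specify this case).
def binarize(values):
    return [1 if v >= 0 else -1 for v in values]

weights = [0.7, -0.3, 0.0, -1.2]
binary_weights = binarize(weights)   # [1, -1, 1, -1]
```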

The system architecture consists of three compute units, data and weight buffers, a direct memory access (DMA) system for off-chip memory transfer, and an FSM controller. The three compute units work on different types of layers: the FP-Conv unit for the (non-binary) first conv layer, the Bin-Conv unit for the five binary conv layers, and the Bin-FC unit for the three binary FC layers.

Fig 4.2.2: Architecture of the Bin-Conv unit with input and output parallelization factors fin = 2 and fout = 3.

Bin-Conv: The binary conv unit is the most critical component of the accelerator, as it is responsible for the five binary conv layers, which take up the vast majority of the runtime. The unit must maintain high throughput and resource efficiency while handling different input widths at runtime, and it can support larger power-of-two widths with minor changes. To efficiently compute a convolution, multiple rows of input pixels must be buffered for simultaneous access.
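One reason binary conv layers map so well to FPGA fabric is that a {-1, +1} dot product reduces to XNOR followed by a popcount. A small software illustration of that identity follows; the bit encoding (1 represents +1, 0 represents -1) is an assumption for the sketch:

```python
# Illustration of why binary convolution is cheap: with weights and
# pixels in {-1, +1} encoded as bits (1 -> +1, 0 -> -1), the dot product
# equals 2 * popcount(XNOR(a, w)) - n over n elements.
def binary_dot(a_bits, w_bits, n):
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")  # XNOR + popcount
    return 2 * matches - n

# 4-element example: a = 0b1011 -> [+1, -1, +1, +1] (MSB first),
#                    w = 0b1101 -> [+1, +1, -1, +1]
# elementwise products: +1, -1, -1, +1 -> sum = 0
result = binary_dot(0b1011, 0b1101, 4)
```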

The storage of intermediate data in our accelerator differs from most existing designs. In full-precision CNN accelerators, the size of a single set of fmaps between two layers typically exceeds the size of the FPGA on-chip storage. This necessitates the continuous transfer of fmaps to and from off-chip RAM. Our design uses two in-out data buffers A and B of equal size. One layer reads from A and writes its outputs to B; then (without any off-chip data transfers) the next layer can read from B and write to A. Thus, off-chip memory transfers are only needed for the input image, the output prediction, and loading each layer's weights.
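The A/B ping-pong scheme can be sketched abstractly: each layer reads one buffer and writes the other, and the two simply swap roles between layers. This is a hypothetical software model of the data flow, not the RTL:

```python
# Ping-pong buffering sketch: buffer A holds the current fmaps, buffer B
# receives the next layer's outputs, and the two swap roles after each
# layer, so no off-chip fmap transfer is needed between layers.
def run_layers(layers, input_fmaps):
    buf_a = list(input_fmaps)
    buf_b = [None] * len(buf_a)
    for layer in layers:
        for i, x in enumerate(buf_a):
            buf_b[i] = layer(x)          # read A, write B
        buf_a, buf_b = buf_b, buf_a      # swap roles for the next layer
    return buf_a

# Two toy "layers" applied to two toy "fmaps":
out = run_layers([lambda x: x + 1, lambda x: x * 2], [1, 2])
```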

Unlike the fmaps, there is only enough memory on-chip to store a portion of a layer's weights. Multiple accelerator invocations may be needed for a layer; in each invocation we load a new set of weights and produce a new set of fmaps. The next invocation produces the next set of fmaps, and so on, until all output fmaps have been generated and stored in the on-chip data buffer. Invoking the accelerator requires passing it arguments such as pointers to the weights, the layer type and size, the fmap size, and whether pooling should be applied. Inside the accelerator, the controller decodes these inputs and coordinates the other modules.

4.2.1 Compute Unit Architectures

In our accelerator, each compute unit must store binarized data to the on-chip RAMs at the end of its execution. The first operation of a conv or FC layer transforms the binary inputs to integers, and each unit also performs the subsequent batch norm, pooling, and binarization before writing data out to the buffers. One of our design goals is to limit the amount of integer-valued intermediate data buffered inside each compute unit. The pseudocode below implements a pipeline that reads and performs convolution on one input word each cycle.

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (i = 0; i < n_input_words; i++) {
  // read input word, update line buffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);
  // update the weights each time we begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];
  // perform conv across the line buffer
  for (c = 0; c < LINE_BUF_COLS; c++) {
    #pragma HLS unroll
    outbuf[i % words_per_fmap][c] += conv(c, linebuf, wts);
  }
}

Xilinx SDSoC has been used as the primary design tool for our BNN application. SDSoC takes as input a software program with certain functions marked as "hardware". It invokes Vivado HLS under the hood to synthesize the "hardware" portion into RTL. In addition, it automatically generates the data motion network and DMA necessary for memory transfer between the CPU and the FPGA based on the specified software-hardware partitioning.

5. Experimental Results

The presented architectures are simulated using Xilinx tools on a Virtex-7 FPGA. The following results were obtained.

Architecture                Power (W)   Advantage

FPGA – Virtex 7, Fixed      5.6         1.3x
FPGA – Virtex 7, Floating   7.1         1x

Figure 5.1 Power Consumption for Simple Multilayer Perceptron (MLP)

The design was implemented on a ZedBoard, which uses a low-cost Xilinx Zynq-7000 SoC containing an XC7Z020 FPGA alongside an ARM Cortex-A9 embedded processor.

Performance comparison: Conv1 is the first FP conv layer, Conv2-5 are the binary conv layers, and FC1-3 are the FC layers. A '–' indicates a value that is not measured. Numbers with * are sourced from datasheets.

The last row shows power efficiency in throughput per Watt.

Execution time per image (ms):

                mGPU    CPU     GPU     FPGA
Conv1           –       0.68    0.01    1.13
Conv2-5         –       13.2    0.68    2.68
FC1-3           –       0.92    0.04    2.13
Total           90      14.8    0.73    5.94
Speedup         1.0x    6.1x    123x    15.1x
Power (Watt)    3.6     95*     235*    4.7
imgs/sec/Watt   3.09    0.71    5.83    35.8

The BNN accelerator beats the best known FPGA accelerators in pure throughput, and it is also much more resource- and power-efficient.

6. Conclusions and Future Work

BNNs feature potentially reduced storage requirements and binary arithmetic operations, making them well suited to the FPGA fabric. However, these characteristics also render conventional CNN design constructs, such as input tiles and line buffers, ineffective. We introduced new design constructs, such as a variable-width line buffer, to address these challenges, creating an accelerator radically different from conventional CNN accelerators.

Future BNN work should focus on both algorithmic and architectural improvements. On the architectural side, one action item is to implement a low-precision network for ImageNet, which would involve a much larger and more complicated accelerator design.

7. References

1. Dejan Markovic, Borivoje Nikolic, Robert W. Brodersen, "Power and Area Efficient VLSI Architectures for Communication Signal Processing", IEEE, 2006.

2. Dhananjay Kumar, Dileep Kumar, J. R. Shinde, Amit Kumar, Vineet Kumar, "VLSI Architecture for Neural Network", International Journal of Engineering Research and Applications (IJERA), ISSN 2248-9622, International Conference on Industrial Automation and Computing (ICIAC), 12-13 April 2014.

3. Fengbin Tu, Shouyi Yin, Peng Ouyang, Shibin Tang, Leibo Liu, Shaojun Wei, "Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns", IEEE, Vol. 25, No. 8, August 2015.

4. Jitesh R. Shinde, S. Salankar, "Multi-objective Optimization for VLSI Implementation of Artificial Neural Network", International Conference on Advances in Computing, Communications and Informatics, ISSN 2279-0535, Vol. 04, Issue 03, pages 1694-1700, 2015.

5. Maurizio Valle, "Analog VLSI Implementation of Artificial Neural Networks with Supervised On-Chip Learning", 2002, ISSN 1004-23.

6. Prashant D. Deotale, Lalit Dole, "Design of FPGA Based General Purpose Neural Networks", ICICES, IEEE, 2014, ISBN 978-1-4799-3834-6.

7. Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, Zhiru Zhang, "Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs", Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 15-24, 2017.

8. K. P. Sridhar, B. Vignesh, S. Saravanan, M. Lavanya, V. Vaithiyanathan, "Design and Implementation of Neural Network Based Circuits for VLSI", Data Mining and Soft Computing Techniques, World Applied Sciences Journal 29, pages 113-117, 2014, ISSN 1818-4952.

9. Pranay Reddy Gankidi, Jekan Thangaveltham, "FPGA Architecture for Deep Learning and its Application to Planetary Robotics", Space and Terrestrial Robotic Exploration (SpaceTREx) Lab, Arizona State University.

10. Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks", Int'l Symp. on Computer Architecture (ISCA), June 2016.

11. M. Motamedi, P. Gysel, V. Akella, S. Ghiasi, "Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks", Asia and South Pacific Design Automation Conf. (ASP-DAC), pages 575-580, January 2016.

12. J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network", Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 26-35, February 2016.

13. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine Learning", Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2014.

14. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks", Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 161-170, February 2015.

15. A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems (NIPS), pages 1097-1105, 2012.