# Programming FPGAs for Economics: An Introduction to Electrical Engineering Economics

# TUTORIAL

[WORK IN PROGRESS]

by

# BHAGATH CHEELA

University of Pennsylvania, Electrical and Systems Engineering cheelabhagath@gmail.com

# ANDRÉ DeHON

University of Pennsylvania, Electrical and Systems Engineering andre@seas.upenn.edu

# JESÚS FERNÁNDEZ-VILLAVERDE

University of Pennsylvania, Economics jesusfv@econ.upenn.edu

# ALESSANDRO PERI

University of Colorado, Boulder, Economics alessandro.peri@colorado.edu

Last Update: Friday 6th January, 2023

# Acknowledgements

First, we wish to thank Syed Ahmed (UPenn, Electrical and Systems Engineering). The material in Chapters 2 and 3 builds on the teaching material that Syed created for the ESE 532 class offered at UPenn. Chapter 3 draws on the tutorial created by Xilinx, Inc. Second, we wish to thank Lucas Ladenburger and Marina Leah Mccann (CU Boulder, Economics) for helping to build this tutorial. Last but not least, we wish to thank Giuseppe Bruno and Riccardo Russo (Bank of Italy) for their help in testing a previous version of this tutorial. This project was funded by the Undergraduate Research Experiences for Diversity Grant, 2021, Institute of Behavioral Science, University of Colorado, USA. This project used the RMACC Summit supercomputer, supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University.

# Contents

| Section | Title |
|---------|-------|
| 1 | Setup and Walk-through |
| 1.1 | Getting Started with Vitis on Amazon F1 Instance |
| 1.2 | Step 1: Launch the build instance |
| 1.3 | Step 2: Setup remote desktop |
| 1.4 | |
| 1.5 | Step 4: Edit Source Files in Build Instance |
| 1.6 | Step 5: Build Phase |
| 1.6.1 | Initialize the Environment |
| 1.6.2 | Create a Project in Vitis HLS |
| 1.6.3 | C Simulation and Code Debugging |
| 1.6.4 | Synthesis in Vitis HLS |
| 1.6.5 | HLS Kernel Optimization using the *Vitis HLS* IDE |
| 1.6.6 | Compile the Hardware Function |
| 1.7 | Step 6: Runtime Phase |
| 1.7.1 | Set up a runtime instance |
| 1.7.2 | Run the application on the FPGA |
| 2 | Matrix Multiplier |
| 2.1 | Directory Structure |
| 2.2 | |
| 2.2.1 | Host.cpp: the main |
| 2.2.2 | MatrixMultiplication.cpp: the kernel |
| 2.2.3 | design.cfg: Compiler Flags |
| 2.2.4 | xrt.ini: Vitis Analyzer |
| 2.3 | CPU implementation |
| 2.4 | Create a Project in Vitis |
| 2.5 | C Simulation and Code Debugging |
| 2.6 | Synthesis in Vitis HLS |
| 2.6.1 | Synthesis Report |
|   |                   | 2.6.2 Latency                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 17 |  |  |  |  |  |  |
|   |                   | 2.6.3 Scheduler View                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 17 |  |  |  |  |  |  |
|   |                   | 2.6.4 Data Flow                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | 17 |  |  |  |  |  |  |
|   | 2.7               | HLS Kernel Optimization: Loop Unrolling                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 17 |  |  |  |  |  |  |
|   |                   | 2.7.1 Resource Profile | 18 |  |  |  |  |  |  |
|   |                   | 2.7.2 Full Unroll                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 18 |  |  |  |  |  |  |
|   | 2.8               | HLS Kernel Optimization: Pipelining                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 19 |  |  |  |  |  |  |


|      |        |                             |    |
|------|--------|-----------------------------|----|
| 3.8  |        | Run on the FPGA             | 72 |
| 3.9  |        | Makefile                    | 74 |
| 3.10 |        | Command Guidelines          |    |
|      | 3.10.1 | OpenCL Commands Description | 76 |
|      | 3.10.2 | Error Management            | 78 |
|      | 3.10.3 | Pragmas Description         | 78 |

# CHAPTER 1 Setup and Walk-through

To implement a function in hardware (e.g., the Krusell and Smith (1998) algorithm), it will ultimately be necessary to perform low-level placement and routing of the hardware onto the FPGA substrate. That is, the tools must decide which particular instance of each primitive is used (placement) and which wires to use for connections (routing). These tasks typically take much longer (at least 30 minutes, sometimes hours) than compiling software (a few minutes). This means you will need to plan your time carefully for these tutorials. One way to optimize development time is to be careful about when we invoke low-level placement and routing and when we can avoid it. The content of this chapter was curated by Syed Ahmed.<sup>1</sup>

# 1.1 Getting Started with Vitis on Amazon F1 Instance

Make sure you complete the following pre-requisites before continuing with this tutorial:

1. You have an AWS account and know how to create AWS instances. Check Getting Started on Amazon EC2 for a refresher.



2. Read about Vitis here.

<sup>1</sup>University of Pennsylvania, Electrical and System Engineering. *email:* stahmed@seas.upenn.edu

In this tutorial, we will use two instances:

- z1d.2xlarge, referred to as the **build** instance, where we will compile and build our FPGA binary. It costs 0.744 \$/hr. You can create this instance in any AWS region.
- f1.2xlarge, referred to as the **runtime** instance, where we will run our FPGA binary. It costs 1.65 \$/hr. We can only use us-east-1 (N. Virginia) for this instance.

# 1.2 Step 1: Launch the build instance

- 1. Navigate to the AWS Marketplace
- 2. Click on Continue to Subscribe
- 3. Accept the EULA and click Continue to Configuration
- 4. Select version v1.10.0 and US East (N. Virginia)
- 5. Click on Continue to Launch
- 6. Select Launch through EC2 in the Choose Action drop-down and click Launch
- 7. Search and select FPGA Developer AMI
- 8. Select z1d.2xlarge Instance type from the Filter All instance families
- 9. At the top of the console, click on 6. Configure Security Groups
- 10. Click Add Rule. Note: Add a new rule. Do NOT modify the existing rule.
  - (a) Select Custom TCP Rule from the Type pull-down menu
  - (b) Type 8443 in the **Port Range** field
  - (c) Select **Anywhere** from the Source pull-down

Note: This step will enable us to install a NICE DCV Server on the instance.

- 11. Click **Review and Launch**. This brings up the review page.
- 12. Click Launch to launch your instance.
- 13. Select a valid key pair and **check** the acknowledge box at the bottom of the dialog
- 14. Select Launch Instances. This brings up the launch status page
- 15. When ready, select View Instances at the bottom of the page
- 16. Login to your build instance by doing:

ssh -i <AWS key pairs.pem> centos@<IPv4 Public IP of EC2 instance>

# 1.3 Step 2: Setup remote desktop

We will use **NICE DCV** as our remote desktop server on Amazon. We will use the remote desktop to work with several **Vitis GUI** utilities. For the setup we follow the Amazon GUI FPGA Development Environment with NICE DCV Tutorial.

- 1. Attach the **NICE DCV** license to your z1d.2xlarge instance by doing the following:
  - (a) Sign in to the AWS Management Console and open the IAM console at link.
  - (b) In the navigation pane of the IAM console, choose **Roles**, and then choose **Create role**.
  - (c) For Select type of trusted entity, choose AWS service.
  - (d) For Choose a use case, select EC2 and then click Next: Permissions.
  - (e) Click on Next: Tags to move forward.
  - (f) Click on Next: Review to move forward.
  - (g) Enter a name, e.g. "DCVLicenseAccessRole" and click Create role.
  - (h) Click on **Policies** in the left menu.
  - (i) Click on **Create policy**.
  - (j) Click on the **JSON** tab and paste the following:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::dcv-license.us-east-1/*"
    }
  ]
}
```

*Note:* The NICE DCV software needs to access the NICE DCV license, which is located in the S3 bucket named in the policy. Change us-east-1 to the region you are using (if different). For more information, see link.

- (k) Click on Next: Tags to move forward.
- (l) Click on Next: Review to move forward.
- (m) Enter a name, e.g. "DCVLicensePolicy" and click Create policy.
- (n) Search for your new policy and click on it to open it.
- (o) Click on **Policy usage** and then on **Attach**.

- (p) Enter your DCV role name, select the role and click on Attach policy.
- (q) Go to your console home page and click on **Instances**.
- (r) Right-click on your z1d.2xlarge instance and click on *Security* and then Modify IAM role.
- (s) From the drop-down menu, select your DCV role name and click save. Your instance will now be able to use the server.
- 2. Login to your z1d.2xlarge instance and install the NICE DCV pre-requisites. More info at link.
  - sudo yum update
  - sudo yum install kernel-devel
  - sudo yum groupinstall "GNOME Desktop"
  - sudo yum install glx-utils

*Note:* You may receive the message: **Failed to set locale, defaulting to C**. Locales define language and country-specific settings for your programs and shell session. If you want to fix it (not required), you can follow the instructions at this link.

- 3. Also install the crudini RPM package, which is used to modify the NICE DCV server configuration preferences (see more here).
  - sudo yum install crudini
- 4. Install NICE DCV Server. More info at link.

```
sudo rpm --import https://s3-eu-west-1.amazonaws.com/nice-dcv-publish/NICE-GPG-KEY
wget https://d1uj6qtbmh3dt5.cloudfront.net/2019.0/Servers/nice-dcv-2019.0-7318-el7.tgz
tar xvf nice-dcv-2019.0-7318-el7.tgz
cd nice-dcv-2019.0-7318-el7
sudo yum install nice-dcv-server-2019.0.7318-1.el7.x86_64.rpm
sudo yum install nice-xdcv-2019.0.224-1.el7.x86_64.rpm
cd ~
sudo systemctl enable dcvserver
sudo systemctl start dcvserver
```

5. Set up a password

sudo passwd centos

- 6. Change firewall settings: disable the firewall to allow all connections
  - sudo systemctl stop firewalld
  - sudo systemctl disable firewalld
- 7. Create a virtual session to connect to.

*Note:* You will have to create a new session if you restart your instance. Add the command below to your ~/.bashrc so that a session is created automatically on login.

dcv create-session --type virtual --user centos centos

- 8. Connect to the DCV Remote Desktop session
  - Download and install the DCV Client on your computer<sup>2</sup>.
  - Use the Public IP address to connect
- 9. Logging in should show you your new GUI Desktop

# 1.4 Step 3: Setup AWS CLI

- 1. Go to the Amazon AWS Console and then from the top right, select your account name, and then **My Security Credentials**.
- 2. Click on Access Keys and Create New Access Key.
- 3. Note down your Access Key ID and Secret Access Key.
- 4. Login to your z1d.2xlarge instance and issue the following command:

aws configure

5. Enter your Access Key ID and Secret Access Key, set the default region to us-east-1, and set the output format to json.

# 1.5 Step 4: Edit Source Files in Build Instance

To edit your source files, you can use vim or emacs directly in the remote terminal. Alternatively, you can connect over SSH from an editor on your local machine and edit the files remotely. For instance, see Remotely edit files using SSH from VS Code in Mac/Linux/Windows.

# 1.6 Step 5: Build Phase

The build phase is conducted entirely in the z1d.2xlarge instance. The build phase consists of

- **Profiling of the Code**, where you use the *Vitis Analyzer* to figure out bottlenecks in your application. To learn how to use *Vitis Analyzer* read here.
- **Synthesis of the Code**, which creates the Amazon FPGA Image (AFI) that you can run on the f1.2xlarge instance.

<sup>2</sup>**IMPORTANT:** use the 2020.2 version. The latest version is not compatible with this setup.

In order to profile and synthesize your code, you need to use the *Vitis HLS* software. This section guides you through launching Vitis HLS and creating a *Project*. The next chapters discuss code profiling and synthesis in the context of the different applications.

## 1.6.1 Initialize the Environment

If you are just starting a new project from scratch,

1. Login to your instance and initialize your environment as follows:

```
tmux
git clone https://github.com/aws/aws-fpga.git $AWS_FPGA_REPO_DIR
source $AWS_FPGA_REPO_DIR/vitis_setup.sh
export PLATFORM_REPO_PATHS=$(dirname $AWS_PLATFORM)
```

Note: Make sure to run under tmux! It will save you hours.

- 2. Clone your git repository using the following command:
  - git clone GETYOURREPO

These are one-time operations which you do not need to repeat later.

## 1.6.2 Create a Project in Vitis HLS

Creating a new project in Vitis HLS is explained here. Make sure you enter the **top-level function** during the creation of the project (although you can also change it later). The top-level function is the function that will be called by the part of your application that runs in software. *Vitis HLS* needs it for synthesis. You can also indicate which files you want to create. It is wise to add a **Testbench file** too, while you are creating the project, to check that your application runs correctly.
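As an illustration of what a top-level function can look like, here is a minimal sketch for a small matrix multiplication. The name `mmult` matches the kernel name used later in Host.cpp; the element type, the dimension `N`, and the row-major layout are assumptions made for this example and may differ from the definitions in the tutorial's own headers.

```cpp
// Hypothetical definitions for illustration; the tutorial's own headers may differ.
typedef float matrix_type;
const int N = 4;                 // assumed matrix dimension (N x N)
const int MATRIX_SIZE = N * N;   // number of elements per matrix

// Top-level function: the entry point that Vitis HLS synthesizes into hardware.
// extern "C" keeps the symbol name unmangled so the tools can find it by name.
extern "C" void mmult(const matrix_type *in1, const matrix_type *in2, matrix_type *out)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix_type sum = 0;
            // Inner product of row i of in1 and column j of in2 (row-major storage).
            for (int k = 0; k < N; k++) {
                sum += in1[i * N + k] * in2[k * N + j];
            }
            out[i * N + j] = sum;
        }
    }
}
```

With a function like this, you would enter `mmult` as the top function when creating the project.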

- 1. To get started
  - (a) Launch (or restart) your z1d.2xlarge in AWS
  - (b) In a terminal, ssh into your z1d.2xlarge instance (wait for the instance to be ready!). Start the DCV server using the following:

dcv create-session --type virtual --user centos centos

*Note:* This command launches a DCV session in the build instance to which you can connect remotely from your computer.

- (c) Open the NICE DCV Viewer on your computer
  - Enter the public IP address of the z1d.2xlarge instance.
  - Enter centos as user and the password you set during DCV setup.

You should now see the desktop of your build instance.

- 2. To launch the Vitis HLS Software
  - (a) On the desktop of your build instance, select *Applications > System Tools > Terminal*
  - (b) Launch *Vitis HLS* by typing vitis\_hls & in the terminal. You should now see the Integrated Development Environment (IDE).
- 3. To create a *New Project* 
  - In the drop-down click on *File* and select *New Project*
  - Give a name to the Project and select the location where to store the project.
  - Specify TBD as top function.
  - Add to the source files
    - all the .c files
    - all the .h files
  - Add Testbench.cpp to the TestBench files
  - Select the xcvu9p-flgb2104-2-i in the device selection.
  - Use a #CLOCK SPEED ns clock, and select Vitis Kernel Flow Target.
  - Click Finish.

We will specialize the Project creation depending on the target application in the Chapters to come.

## 1.6.3 C Simulation and Code Debugging

We encourage you to implement a testbench file (e.g. Testbench.cpp) to debug your code. A testbench application is no different from any other software application written in C:

- it has a main function that is invoked
- the main function includes any functionality needed to test your function, including calling the top function that you would like to test
- it returns 0 if the function under test is correct; otherwise it returns another value
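The structure described above can be sketched as follows. This is a minimal example under stated assumptions, not the tutorial's Testbench.cpp: `vector_add` is a hypothetical top function, `LEN` is an assumed size, and the test logic is wrapped in a helper called `run_testbench` so that it can be reused. In an actual testbench, `main` would simply return the value of that helper.

```cpp
#include <cmath>
#include <cstdio>

const int LEN = 8;  // assumed vector length for this example

// Hypothetical top function standing in for the hardware function under test.
void vector_add(const float *a, const float *b, float *out)
{
    for (int i = 0; i < LEN; i++) {
        out[i] = a[i] + b[i];
    }
}

// Runs the test and returns 0 on success, 1 on failure.
// In Testbench.cpp this would be driven by: int main() { return run_testbench(); }
int run_testbench()
{
    float a[LEN], b[LEN], out[LEN];
    for (int i = 0; i < LEN; i++) {
        a[i] = 0.5f * i;
        b[i] = 2.0f * i;
    }

    vector_add(a, b, out);  // call the top function

    // Compare against a reference value computed independently in software.
    for (int i = 0; i < LEN; i++) {
        float expected = 2.5f * i;
        if (std::fabs(out[i] - expected) > 1e-6f) {
            std::printf("Mismatch at index %d: got %f, expected %f\n", i, out[i], expected);
            return 1;
        }
    }
    std::printf("Test passed\n");
    return 0;
}
```

Returning a non-zero value on a mismatch is what lets the C simulation report the test as failed.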

To run Testbench.cpp:

- 1. Select *Project* → *Run C Simulation* from the menu.
  - A window should pop up. The default settings of the dialog should be fine. You can dismiss the dialog by pressing *OK*.
- 2. You can see in the *Console* whether your test has passed.
- 3. If your test fails, you can run the test in debug mode.
  - Repeat the same procedure, but check the box in front of *Launch Debugger* before you dismiss the dialog.
  - This takes you to the *Debug* perspective, where you can set breakpoints and use the step into/step over buttons to debug.
  - You can go back to the original perspective by pressing the *Synthesis* button in the top-right corner. To rebuild the code, go back to the Synthesis perspective and click *Run C Simulation* again.

## 1.6.4 Synthesis in Vitis HLS

Once you have verified that the code is free of bugs, run *Solution* → *Run C Synthesis* → *Active Solution* from the menu to synthesize your design.

*C/RTL Cosimulation.* You can also verify the synthesized version of your accelerator with your testbench. If you choose to do so, Vitis HLS runs your accelerator in a simulator, which is why this method is called C/RTL Cosimulation. The cycle-level simulation it uses is much slower than real-time execution, so this method may not be practical for every testbench. On the other hand, it avoids the need to run low-level placement and routing and gives you more visibility into the behavior of your design. You can start it by choosing *Solution* → *Run C/RTL Cosimulation* from the menu.

### The Vitis HLS Kernel

- The RTL export produces an .xo file (the Vitis Kernel).
- Then go to the terminal and use the Makefile to create the xclbin.

The **Vitis Kernel** is a Xilinx object file (.xo) that describes the hardware implementation of our application.

The next section discusses how to optimize it.

## 1.6.5 HLS Kernel Optimization using the Vitis HLS IDE

The optimization follows a bottom-up approach:

- 1. Profile the code using the *Vitis Analyzer*. To learn how to use the *Vitis Analyzer* read here.
- 2. Optimize your hardware function using the *Vitis HLS* IDE.
  - *Vitis HLS* controls the hardware implementation with **#pragma** directives. Examples:
  - #pragma HLS unroll factor=2
  - #pragma HLS pipeline

The different **#pragma** directives that you can use are listed in the Vitis HLS User Guide.

- 3. Re-compile it.
- 4. Once you are happy with the result, you are ready to move the code to the FPGA.
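To show where such pragmas sit in the source, here is a minimal sketch of a hardware function that uses both directives. The function `dot_product` and the length `LEN` are hypothetical, chosen only for this example, and the unroll directive is written with the `factor=` option. Standard C++ compilers ignore the HLS pragmas (possibly with a warning), so the same code can be run in C simulation.

```cpp
const int LEN = 16;  // assumed vector length for this example

// Hypothetical hardware function: dot product of two fixed-length vectors.
float dot_product(const float a[LEN], const float b[LEN])
{
    float partial[LEN];

    // Pipelined loop: a new iteration can start before the previous one has finished.
    for (int i = 0; i < LEN; i++) {
#pragma HLS pipeline
        partial[i] = a[i] * b[i];
    }

    float sum = 0;
    // Unrolled by a factor of 2: the loop body is replicated so that
    // two elements are processed per iteration of the generated hardware loop.
    for (int i = 0; i < LEN; i++) {
#pragma HLS unroll factor=2
        sum += partial[i];
    }
    return sum;
}
```

The pragmas change only the generated hardware, not the result of the computation, so the testbench can stay the same while you experiment with different directives.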

#### Note

We are using the GUI mode of *Vitis HLS* (using NICE DCV) so that we can see the HLS schedule. If your remote desktop connection is lagging, you can run *Vitis HLS* from the command line. You can learn more about the TCL commands from: link 1, link 2. Note that the only way to see the HLS schedule is through the GUI. If you are unable to use the GUI in AWS, try installing the Vitis toolchain locally.

## 1.6.6 Compile the Hardware Function

Once you are happy with your *Vitis HLS* acceleration:

- 1. **Export the Vitis Kernel.** When you have obtained a satisfactory hardware description in *Vitis HLS*, export it as a Vitis Kernel, i.e. a Xilinx object file (.xo). We will then link this object file/kernel into our existing Vitis application.
- 2. **Compile a hardware function.** Build the hardware function by running make afi EMAIL=<your email>, substituting your email address. Depending on the complexity of your function, this build can take hours. At the end:
  - The build waits for you to confirm a subscription from your email account.
  - Open your email, confirm the subscription, and wait to receive an email saying that your Amazon FPGA Image (AFI) is available (this takes about 30 minutes to an hour).
- 3. **Copy binaries to the runtime instance.**
  - Create a GitHub repository and clone it in your z1d.2xlarge instance.
  - Add the host, mmult.awsxclbin and xrt.ini files to the repository; commit and push.

# 1.7 Step 6: Runtime Phase

Once you have created your executable and have your AFI, it is time to run your application on the f1.2xlarge instance.

## 1.7.1 Set up a runtime instance

Follow the steps from Section 1.2, but instead of choosing a z1d.2xlarge instance, choose f1.2xlarge.

## 1.7.2 Run the application on the FPGA

To run your application, execute the following commands in your f1.2xlarge instance:

```
source $AWS_FPGA_REPO_DIR/vitis_runtime_setup.sh
# Wait till the MPD service has initialized. Check systemctl status mpd
./host ./mmult.awsxclbin
```

After the run, you should see the following files:

- profile\_summary.csv
- timeline\_trace.csv
- xclbin.run\_summary

Note: Make sure to shut down your F1 instance! It costs 1.65 \$/hr.

# CHAPTER 2 Matrix Multiplier

This chapter

- illustrates the use of Vitis HLS
- discusses the main parallelism pragmas

in the context of a matrix multiplication algorithm. The content of this chapter was curated by Syed Ahmed.<sup>1</sup>

# 2.1 Directory Structure

- code/
  - Makefile
  - design.cfg
  - xrt.ini
  - common/
    - Constants.h
    - EventTimer.h
    - EventTimer.cpp
    - Utilities.cpp
    - Utilities.h
  - hls/
    - export_hls_kernel.sh
    - run_hls.tcl
    - MatrixMultiplication.h
    - MatrixMultiplication.cpp
    - Testbench.cpp
  - Host.cpp

# 2.2 The code

- There are 5 targets in the Makefile. Use make help to learn about them.
- design.cfg defines several options for the *v++ compiler*. Learn more about it here.
- xrt.ini defines the options necessary for Vitis Analyzer.
- The common folder has header files and helper functions.
- The hls/MatrixMultiplication.cpp file has the function that gets compiled to a hardware function (known as a kernel in Vitis).
- The Host.cpp file has the "driver" code (see Section 2.2.1).

<sup>1</sup>University of Pennsylvania, Electrical and System Engineering. *email:* stahmed@seas.upenn.edu

## 2.2.1 Host.cpp: the main

The Host.cpp file has the "driver" code that transfers the data to the FPGA, runs the kernel, fetches back the result from the kernel and then verifies it for correctness.

```
#include "Utilities.h"

// ------------------------------------------------------------------------------------
// Main program
// ------------------------------------------------------------------------------------
int main(int argc, char** argv)
{
    // Initialize an event timer we'll use for monitoring the application
    EventTimer timer;

    // ------------------------------------------------------------------------------------
    // Step 1: Initialize the OpenCL environment
    // ------------------------------------------------------------------------------------
    timer.add("OpenCL Initialization");
    cl_int err;
    std::string binaryFile = argv[1];
    unsigned fileBufSize;
    std::vector<cl::Device> devices = get_xilinx_devices();
    devices.resize(1);
    cl::Device device = devices[0];
    cl::Context context(device, NULL, NULL, NULL, &err);
    char* fileBuf = read_binary_file(binaryFile, fileBufSize);
    cl::Program::Binaries bins{{fileBuf, fileBufSize}};
    cl::Program program(context, devices, bins, NULL, &err);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
    cl::Kernel krnl_mmult(program, "mmult", &err);

    // ------------------------------------------------------------------------------------
    // Step 2: Create buffers and initialize test values
    // ------------------------------------------------------------------------------------
    timer.add("Allocate contiguous OpenCL buffers");
    // Create the buffers and allocate memory
    cl::Buffer in1_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, sizeof(matrix_type) * MATRIX_SIZE, NULL, &err);
    cl::Buffer in2_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, sizeof(matrix_type) * MATRIX_SIZE, NULL, &err);
    cl::Buffer out_buf_hw(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY, sizeof(matrix_type) * MATRIX_SIZE, NULL, &err);

    timer.add("Set kernel arguments");
    // Map buffers to kernel arguments, thereby assigning them to specific device memory banks
    krnl_mmult.setArg(0, in1_buf);
    krnl_mmult.setArg(1, in2_buf);
    krnl_mmult.setArg(2, out_buf_hw);

    timer.add("Map buffers to userspace pointers");
    // Map host-side buffer memory to user-space pointers
    matrix_type *in1 = (matrix_type *)q.enqueueMapBuffer(in1_buf, CL_TRUE, CL_MAP_WRITE, 0, sizeof(matrix_type) * MATRIX_SIZE);
    matrix_type *in2 = (matrix_type *)q.enqueueMapBuffer(in2_buf, CL_TRUE, CL_MAP_WRITE, 0, sizeof(matrix_type) * MATRIX_SIZE);
    matrix_type *out_sw = Create_matrix();

    timer.add("Populating buffer inputs");
    // Initialize the vectors used in the test
    Randomize_matrix(in1);
    Randomize_matrix(in2);

    // ------------------------------------------------------------------------------------
    // Step 3: Run the kernel
    // ------------------------------------------------------------------------------------
    timer.add("Set kernel arguments");
    // Set kernel arguments
    krnl_mmult.setArg(0, in1_buf);
    krnl_mmult.setArg(1, in2_buf);
    krnl_mmult.setArg(2, out_buf_hw);

    // Schedule transfer of inputs to device memory, execution of kernel, and transfer of outputs back to host memory
    timer.add("Memory object migration enqueue host->device");
    cl::Event event_sp;
    q.enqueueMigrateMemObjects({in1_buf, in2_buf}, 0 /* 0 means from host*/, NULL, &event_sp);
    clWaitForEvents(1, (const cl_event *)&event_sp);

    timer.add("Launch mmult kernel");
    q.enqueueTask(krnl_mmult, NULL, &event_sp);

    timer.add("Wait for mmult kernel to finish running");
    clWaitForEvents(1, (const cl_event *)&event_sp);

    timer.add("Read back computation results (implicit device->host migration)");
    matrix_type *out_hw = (matrix_type *)q.enqueueMapBuffer(out_buf_hw, CL_TRUE, CL_MAP_READ, 0, sizeof(matrix_type) * MATRIX_SIZE);
    timer.finish();

    // ------------------------------------------------------------------------------------
    // Step 4: Check Results and Release Allocated Resources
78
79
       multiply_gold(in1, in2, out_sw);
80
81
       bool match = Compare_matrices(out_sw, out_hw);
82
      Destroy_matrix(out_sw);
       delete [] fileBuf ;
83
       q.enqueueUnmapMemObject(in1_buf, in1);
84
       q.enqueueUnmapMemObject(in2_buf, in2);
85
       q.enqueueUnmapMemObject(out_buf_hw, out_hw);
86
       q. finish ();
87
88
       std :: cout << "-----" << std::endl;
89
       timer.print ();
90
91
       std :: cout << "TEST " << (match ? "PASSED" : "FAILED") << std :: endl;</pre>
92
       return (match ? EXIT_SUCCESS : EXIT_FAILURE);
93
94
  ł
```

Listing 2.1: Host.cpp

## 2.2.2 MatrixMultiplication.cpp: the kernel

The MatrixMultiplication.cpp file has the function that gets compiled to a hardware function (known as a kernel in Vitis).

```
#include "MatrixMultiplication.h"

void mmult(const matrix_type Input_1[MATRIX_WIDTH * MATRIX_WIDTH],
           const matrix_type Input_2[MATRIX_WIDTH * MATRIX_WIDTH],
           matrix_type Output[MATRIX_WIDTH * MATRIX_WIDTH]) {
#pragma HLS INTERFACE m_axi port=Input_1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=Input_2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=Output bundle=aximm1

  // Local copies of the two input matrices
  matrix_type Buffer_1[MATRIX_WIDTH][MATRIX_WIDTH];
  matrix_type Buffer_2[MATRIX_WIDTH][MATRIX_WIDTH];

  Init_loop_i: for (int i = 0; i < MATRIX_WIDTH; i++)
    Init_loop_j: for (int j = 0; j < MATRIX_WIDTH; j++) {
      Buffer_1[i][j] = Input_1[i * MATRIX_WIDTH + j];
      Buffer_2[i][j] = Input_2[i * MATRIX_WIDTH + j];
    }

  Main_loop_i: for (int i = 0; i < MATRIX_WIDTH; i++)
    Main_loop_j: for (int j = 0; j < MATRIX_WIDTH; j++) {
      matrix_type Result = 0;
      Main_loop_k: for (int k = 0; k < MATRIX_WIDTH; k++) {
        Result += Buffer_1[i][k] * Buffer_2[k][j];
      }
      Output[i * MATRIX_WIDTH + j] = Result;
    }
}
```

Listing 2.2: MatrixMultiplication.cpp

## 2.2.3 design.cfg: Compiler Flags

The design.cfg file defines several options for the *v++ compiler*. Learn more about it here.

```
platform=xilinx_aws-vu9p-f1_shell-v04261818_201920_2
debug=1
profile_kernel=data:all:all:all
save-temps=1

[connectivity]
nk=mmult:1:mmult_1
sp=mmult_1.Input_1:DDR[1]
sp=mmult_1.Input_2:DDR[2]
sp=mmult_1.Output:DDR[1]
```



## 2.2.4 xrt.ini: Vitis Analyzer

The xrt.ini file defines the options necessary for Vitis Analyzer.

```
[Debug]
profile=true
timeline_trace=true
data_transfer_trace=fine
stall_trace=all
```

| Key execution times                                             | Time (ms) |
|-----------------------------------------------------------------|-----------|
| OpenCL Initialization                                           | 83.500    |
| Allocate contiguous OpenCL buffers                              | 0.043     |
| Set kernel arguments                                            | 0.164     |
| Map buffers to userspace pointers                               | 1.058     |
| Populating buffer inputs                                        | 0.119     |
| Set kernel arguments                                            | 0.020     |
| Memory object migration enqueue host->device                    | 0.255     |
| Launch mmult kernel                                             | 0.130     |
| Wait for mmult kernel to finish running                         | 1.385     |
| Read back computation results (implicit device->host migration) | 0.169     |

TEST PASSED

Figure 2.1: CPU Implementation

# 2.3 CPU Implementation

To set a benchmark for our HLS acceleration, let us first run our application on the CPU. Connect to your z1d.2xlarge and execute the following commands from the terminal to run your application on the CPU.

```
# compile
source $AWS_FPGA_REPO_DIR/vitis_setup.sh
export PLATFORM_REPO_PATHS=$(dirname $AWS_PLATFORM)
make all TARGET=sw_emu

# run
source $AWS_FPGA_REPO_DIR/vitis_runtime_setup.sh
export XCL_EMULATION_MODE=sw_emu
./host mmult.xclbin
```

The total latency is 86.93 ms, which provides our benchmark.

*Note:* The .xclbin is a binary format optimized for the FPGA. Yet, you can run it as a normal application on your CPU (although you would not usually do so, as it is not optimized for the CPU).

# 2.4 Create a Project in Vitis

1. Launch the build instance z1d.2xlarge and Vitis HLS following the steps in Section 1.6.2.
2. Create a *Project* in *Vitis HLS* as follows:
   - In the drop-down menu, click on *File* and select *New Project*.
   - Give a name to the project and select the location where to store it.
   - Specify mmult as the top function.
   - Add to the source files:
     - hw5/fpga/hls/MatrixMultiplication.cpp
     - hw5/fpga/hls/MatrixMultiplication.h
   - Add Testbench.cpp to the TestBench files.
   - Select the xcvu9p-flgb2104-2-i in the device selection.
   - Use an 8 ns clock, and select Vitis Kernel Flow Target.
   - Click Finish.

*Vitis HLS* automatically does loop pipelining. For the purpose of this project, we will turn it off, since we are going to do it ourselves. To do so,

- Right-click on solution 1 and select Solution Settings.
- In the *General* tab, click on *Add*.
- Select config\_compile command and set pipeline\_loops to 0.

# 2.5 C Simulation and Code Debugging

We will now follow the steps in Section 1.6.3 to debug the code using **Testbench.cpp** in **Vitis HLS**. *Note:* The testbench generates random matrices and multiplies them using both our mmult function (the hardware function) and a standard software matrix multiply function. The testbench then compares the two outputs and makes sure they are exactly the same.

- Run C simulation by right-clicking on the project on the Explorer view
- Figure 2.2 verifies that the test passes

Figure 2.2: Testbench Console
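To make the testbench logic concrete, the following is a minimal, self-contained sketch of the comparison it performs. It is not the actual Testbench.cpp: the width of 4, the `Matrix` alias, and the helper names (`reference_mmult`, `candidate_mmult`, `randomize`, `run_testbench`) are assumptions made for this illustration, and `candidate_mmult` merely stands in for the kernel. The inputs are small integer values so that the float arithmetic is exact and an exact comparison is safe.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative values; the real ones are defined in MatrixMultiplication.h.
constexpr std::size_t WIDTH = 4;
using matrix_type = float;
using Matrix = std::vector<matrix_type>;  // row-major, WIDTH * WIDTH entries

// Software reference: plain triple-loop multiply.
void reference_mmult(const Matrix &a, const Matrix &b, Matrix &out) {
    for (std::size_t i = 0; i < WIDTH; i++)
        for (std::size_t j = 0; j < WIDTH; j++) {
            matrix_type result = 0;
            for (std::size_t k = 0; k < WIDTH; k++)
                result += a[i * WIDTH + k] * b[k * WIDTH + j];
            out[i * WIDTH + j] = result;
        }
}

// Stand-in for the kernel: copies the inputs into local buffers first, as mmult does.
void candidate_mmult(const Matrix &a, const Matrix &b, Matrix &out) {
    matrix_type buf_a[WIDTH][WIDTH];
    matrix_type buf_b[WIDTH][WIDTH];
    for (std::size_t i = 0; i < WIDTH; i++)
        for (std::size_t j = 0; j < WIDTH; j++) {
            buf_a[i][j] = a[i * WIDTH + j];
            buf_b[i][j] = b[i * WIDTH + j];
        }
    for (std::size_t i = 0; i < WIDTH; i++)
        for (std::size_t j = 0; j < WIDTH; j++) {
            matrix_type result = 0;
            for (std::size_t k = 0; k < WIDTH; k++)
                result += buf_a[i][k] * buf_b[k][j];
            out[i * WIDTH + j] = result;
        }
}

// Fill a matrix with small integer values so that float arithmetic stays exact.
void randomize(Matrix &m, std::mt19937 &gen) {
    std::uniform_int_distribution<int> dist(-8, 8);
    for (auto &v : m) v = static_cast<matrix_type>(dist(gen));
}

// Returns true when both implementations agree element by element.
bool run_testbench(unsigned seed) {
    std::mt19937 gen(seed);
    Matrix a(WIDTH * WIDTH), b(WIDTH * WIDTH), sw(WIDTH * WIDTH), hw(WIDTH * WIDTH);
    randomize(a, gen);
    randomize(b, gen);
    reference_mmult(a, b, sw);
    candidate_mmult(a, b, hw);
    return sw == hw;
}
```

Calling `run_testbench` with any seed should return true here, since both functions perform the same operations in the same order. With arbitrary floating-point inputs and a reordered kernel, a tolerance-based comparison may be needed instead.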

# 2.6 Synthesis in Vitis HLS

Let us now synthesize our code using *Vitis HLS*. To do so, run *Solution*  $\rightarrow$  *Run C Synthesis*  $\rightarrow$  *Active Solution* from the menu to synthesize your design.

#### 2.6.1 Synthesis Report

To open the Synthesis Report

- Expand the *solution 1* tab in the *Explorer View*
- Browse to *syn/report* and open the .rpt file.

| Property    | Value                            |
|-------------|----------------------------------|
| Line Number | 22                               |
| Name        | mul                              |
| Opcode      | fmul                             |
| Op Latency  | 1                                |
| RTL Name    | fmul_32ns_32ns_32_2_max_dsp_1_U2 |
| Source File | hls/MatrixMultiplication.cpp     |
| Topo Index  | 74                               |

Figure 2.3: Scheduler View

## 2.6.2 Latency

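The total latency of the hardware accelerator reported by the synthesis is **16.976** ms. The CPU benchmark (**86.93** ms) is dominated by the OpenCL initialization; the kernel itself ran in about 1.4 ms on the CPU (Figure 2.1), so the accelerator is slower than the CPU. The reason is that the current implementation does not exploit any parallelism in the computation, which the software baseline may do. Table 2.1 reports the resource utilization.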

| Resources | BRAM | DSP Units | Flip-Flops | LUTs |
|-----------|------|-----------|------------|------|
| Usage     | 20   | 5         | 1793       | 1933 |

Table 2.1: Resource Utilization

## 2.6.3 Scheduler View

Use the *Scheduler View* under the *Analysis Perspective* to analyze how the computations are scheduled in time. From the *Scheduler View*, the multiplication appears to take 1 cycle (Figure 2.4).

## 2.6.4 Data Flow

Dataflow and FSM diagram for main loop of MatrixMultiplication.cpp

# 2.7 HLS Kernel Optimization: Loop Unrolling

- Go back to the *Synthesis perspective*
- Unroll the loop with label Main\_loop\_k 2 times using **#pragma HLS UNROLL**.

```
Main_loop_k: for (int k = 0; k < MATRIX_WIDTH; k++) {
#pragma HLS unroll factor=2
    Result += Buffer_1[i][k] * Buffer_2[k][j];
}
```

Listing 2.4: MatrixMultiplication.cpp with **#pragma HLS UNROLL** 

For other examples see here.



Figure 2.4: Scheduler View

- Synthesize the code
- Look at the Scheduler View

The unroll saves cycles by performing the multiplies in parallel. (The original loop had to wait for the next read before performing another multiply.) To understand how the unrolling works, notice that we could have performed the unrolling manually, as shown here:

```
Main_loop_k: for (int k = 0; k < MATRIX_WIDTH; k=k+2) {
    Result += Buffer_1[i][k] * Buffer_2[k][j] + Buffer_1[i][k+1] * Buffer_2[k+1][j];
}
```

Listing 2.5: MatrixMultiplication.cpp
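To check that a factor-2 unrolling preserves the result, the rolled and manually unrolled loops can be compared on the host. The sketch below is an illustration only: the function names and the trip count of 8 are assumptions, integers are used so that the comparison is exact, and the trip count is assumed to be even.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 8;  // illustrative trip count; must be even for factor-2 unrolling

// Rolled version: one multiply-accumulate per iteration.
int dot_rolled(const int (&a)[N], const int (&b)[N]) {
    int result = 0;
    for (std::size_t k = 0; k < N; k++)
        result += a[k] * b[k];
    return result;
}

// Manually unrolled by 2: two independent multiplies per iteration.
int dot_unrolled2(const int (&a)[N], const int (&b)[N]) {
    static_assert(N % 2 == 0, "factor-2 unrolling assumes an even trip count");
    int result = 0;
    for (std::size_t k = 0; k < N; k += 2)
        result += a[k] * b[k] + a[k + 1] * b[k + 1];
    return result;
}
```

Both functions return the same value for integer inputs. The two multiplies inside the unrolled body do not depend on each other, which is what allows the hardware to schedule them in parallel.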

### 2.7.1 Resource Profile

Now use the *Resource Profile* view of the *Analysis Perspective* to inspect the resource usage. As we unroll more and more:

- the number of fadd's increases, but
- the number of fmul's does not.

This implies that the fmul's are shared by multiple operations!

### 2.7.2 Full Unroll

- Unroll the loop with label Main\_loop\_k completely.
- Synthesize the design again.

You may notice that the estimated clock period in the *Synthesis Report* is shown in red. Due to variation among *Vitis HLS* versions, sometimes it works and nothing is flagged.

#### Change the clock

Change the clock period to 20 ns, and synthesize the design again. The new latency is 4.062 ms.

#### Resources

| Resources | BRAM | DSP Units | Flip-Flops | LUTs |
|-----------|------|-----------|------------|------|
| Usage     | 20   | 14        | 5586       | 5174 |

Table 2.2: Resource Utilization

*Note:* You may have noticed that all floating-point additions are scheduled in series. This suggests that they cannot be parallelized. Floating-point addition is non-associative; this forces us to perform the additions in the original serial order to guarantee the same result as the original, serial C code. In contrast, integer and fixed-point additions are associative, giving the compiler more freedom to reorder operations and exploit parallelism.
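The non-associativity of floating-point addition can be seen with three values of very different magnitude. The following sketch is an illustration under our own assumptions (the function names are hypothetical, and `volatile` is used only to keep the intermediate sum rounded to single precision):

```cpp
#include <cassert>

// Sum three floats left to right: (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;  // intermediate rounded to float
    return t + c;
}

// Sum the same values right to left: a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;  // intermediate rounded to float
    return a + t;
}
```

With a = 1e20f, b = -1e20f, and c = 1.0f, `sum_left` returns 1 because a and b cancel first, while `sum_right` returns 0 because c is lost when it is added to the much larger b. A tool that reordered the additions could therefore change the result, which is why the serial order is kept.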

# 2.8 HLS Kernel Optimization: Pipelining

Pipeline using **#pragma HLS PIPELINE** 

- Remove the unroll pragma, and pipeline the Main\_loop\_j loop with the minimal initiation interval (II) of 1 using the **#pragma HLS PIPELINE**. (Xilinx link)
- Restore the clock period to 8ns.
- Synthesize the design again.

### 2.8.1 Understanding the Initiation Interval (II)

Note that the initiation interval is **32** for the pipelined loop *j*. To understand this result, Figure 2.5 draws a schematic of the data path of Main\_loop\_j and shows how it is connected to the memories. You can find the variables that are mapped onto memories in the *Resource Profile* view of the *Analysis Perspective*. Each of the buffers is stored in one bank, made of 8 BRAMs, and there are only two ports to read from. Assuming a continuous flow of input data, each loop iteration needs to read a full row of Buffer\_1, meaning 64 values. The BRAM lets us read at most 2 words per cycle, but we need 64 per loop iteration, which results in a delay (II) of 32.
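The arithmetic behind this bound can be written down explicitly. The helper below is a hypothetical illustration, not something reported by the tool: it computes the lower bound on the II imposed by memory ports alone, as the ceiling of the reads per iteration over the available ports.

```cpp
#include <cassert>

// Lower bound on the initiation interval imposed by memory ports:
// ceil(reads_per_iteration / ports_available), using integer arithmetic.
constexpr int min_initiation_interval(int reads_per_iteration, int ports_available) {
    return (reads_per_iteration + ports_available - 1) / ports_available;
}

// One bank with 2 read ports and 64 reads per iteration, as in the text.
static_assert(min_initiation_interval(64, 2) == 32, "64 reads over 2 ports");
// If 64 ports were available (e.g. 32 dual-port banks), the bound would drop to 1.
static_assert(min_initiation_interval(64, 64) == 1, "64 reads over 64 ports");
```

The bound only accounts for port contention; other dependencies in the loop may raise the achieved II further.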



Figure 2.5: Scheduler View

#### 2.8.2 Partitioning Arrays to Improve Pipelining

To improve the II of the pipelined loop, we can partition Buffer\_1 and Buffer\_2. We partition Buffer\_1 into 32 pairs of columns. This way, the two ports can read both values in each BRAM at once, and we get all 64 values in 1 cycle. For Buffer\_2, we need to read all the rows of one column at once, so we partition it into 32 pairs of rows. To partition the buffers we use **#pragma HLS ARRAY\_PARTITION**. For examples on how to use the pragma see here.

### 2.8.3 Export the Vitis Kernel

To conclude, also pipeline the Init\_loop\_j loop with an II of 1.

```
#include "MatrixMultiplication.h"

void mmult(const matrix_type Input_1[MATRIX_WIDTH * MATRIX_WIDTH],
           const matrix_type Input_2[MATRIX_WIDTH * MATRIX_WIDTH],
           matrix_type Output[MATRIX_WIDTH * MATRIX_WIDTH]) {
#pragma HLS INTERFACE m_axi port=Input_1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=Input_2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=Output bundle=aximm1

  matrix_type Buffer_1[MATRIX_WIDTH][MATRIX_WIDTH];
  matrix_type Buffer_2[MATRIX_WIDTH][MATRIX_WIDTH];
#pragma HLS ARRAY_PARTITION variable=Buffer_1 complete dim=2
#pragma HLS ARRAY_PARTITION variable=Buffer_2 complete dim=1

  Init_loop_i: for (int i = 0; i < MATRIX_WIDTH; i++)
    Init_loop_j: for (int j = 0; j < MATRIX_WIDTH; j++) {
      Buffer_1[i][j] = Input_1[i * MATRIX_WIDTH + j];
      Buffer_2[i][j] = Input_2[i * MATRIX_WIDTH + j];
    }

  Main_loop_i: for (int i = 0; i < MATRIX_WIDTH; i++)
    Main_loop_j: for (int j = 0; j < MATRIX_WIDTH; j++) {
#pragma HLS PIPELINE II=1
      matrix_type Result = 0;
      Main_loop_k: for (int k = 0; k < MATRIX_WIDTH; k++) {
        Result += Buffer_1[i][k] * Buffer_2[k][j];
      }
      Output[i * MATRIX_WIDTH + j] = Result;
    }
}
```

Listing 2.6: MatrixMultiplication.cpp

- Synthesize your design.
- Export your synthesized design:
  - Right-click on *solution 1* and then select *Export RTL*.
  - Choose Vitis Kernel (.xo) as the Format.
  - Select the output location to be your directory.
  - Select OK.
- Save your design and quit *Vitis HLS*.
- Open a terminal and go to your directory. Make sure your terminal environment is initialized as follows:
  - source \$AWS\_FPGA\_REPO\_DIR/vitis\_setup.sh
  - export PLATFORM\_REPO\_PATHS=\$(dirname \$AWS\_PLATFORM)

# 2.9 Run on the FPGA

Connect to your f1.2xlarge and execute the following commands from the terminal to run your application on the FPGA.

```
source $AWS_FPGA_REPO_DIR/vitis_runtime_setup.sh
# Wait until the MPD service has initialized. Check with: systemctl status mpd
./host ./mmult.awsxclbin
```

You should see the following files generated after the run:

```
profile_summary.csv
timeline_trace.csv
xclbin.run_summary
```

Listing 2.7: FPGA Run Output

Add, commit, and push these files to the repository you created, and then shut down your F1 instance.

Note: Make sure to shut down your F1 instance! It costs 1.65 \$/hr.

# 2.10 Additional Documentation

- Read this to learn about the syntax of the code in hls/MatrixMultiplication.cpp.
- Read this to learn about how the hardware function is utilized in Host.cpp.

- Read this to learn about simple memory allocation and OpenCL execution.
- Read this to learn about aligned memory allocation with OpenCL.

# CHAPTER 3 Krusell Smith (1998)

This section describes the FPGA acceleration of the Krusell and Smith (1998) algorithm in Cheela et al. (2022).

# 3.1 Directory Structure

The directory is structured in four folders. The folder common contains code shared by FPGA and CPU acceleration. The folders cpu and fpga contain code which is specific to the two acceleration platforms. Results are stored in the folder results.

| Line | Entry              | Line | Entry             |
|------|--------------------|------|-------------------|
| 1    | code/              | 31   | definitions.h     |
| 2    | /common            | 32   | dev_options.h     |
| 3    | /libs              | 33   | init.cpp          |
| 4    | ap_common.h        | 34   | init.h            |
| 5    | ap_decl.h          | 35   | stopwatch.h       |
| 6    | ap_fixed_base.h    | 36   | /cpu              |
| 7    | ap_fixed_ref.h     | 37   | sw.cpp            |
| 8    | ap_fixed_special.h | 38   | sw.h              |
| 9    | ap_fixed.h         | 39   | /executables      |
| 10   | ap_int_base.h      | 40   | /fpga_afi         |
| 11   | ap_int_ref.h       | 41   | /host executables |
| 12   | ap_int_special.h   | 42   | /fpga             |
| 13   | ap_int.h           | 43   | design.cfg        |
| 14   | xcl2.cpp           | 44   | hls_config.tcl    |
| 15   | xcl2.hpp           | 45   | hw.cpp            |
| 16   | xcl2.mk            | 46   | hw.h              |
| 17   | /shocks            | 47   | /results          |
| 18   | agshock.txt        | 48   | /fpga             |
| 19   | idshock.txt        | 49   | /double           |
| 20   | /util              | 50   | /fixed            |
| 21   | 2run_me.sh         | 51   | /matlab           |
| 22   | compare_results.py | 52   | /openmpi          |
| 23   | input_pack.py      | 53   | /double           |
| 24   | matlab_compare.m   | 54   | /power usage      |
| 25   | OpenMPI_install.sh | 55   | /double           |
| 26   | power.sh           | 56   | /seq_cpu          |
| 27   | save_results.sh    | 57   | /double           |
| 28   | app.cpp            | 58   |                   |
| 29   | app.h              | 59   | Makefile          |
| 30   | cons.h             |      | README.md         |
|      |                    |      | xrt.ini           |

# 3.2 The Code

- **Makefile.** Run the Makefile to execute the application. The Makefile has 3 main targets that allow you to choose the execution mode:
  - Serial execution on CPU: make cpu,
  - Parallel execution on CPU using Open MPI: make openmpi,
  - Specified FPGA target and device: make fpga.

  There are other auxiliary targets. Execute make help to learn more about them. See section 3.3 for a complete guide on how to set up and launch the application.

- **Main.** The /common/app.cpp is the main file: it initializes the variables, transfers the data to the FPGA, launches the execution (CPU serial, CPU parallel, or FPGA), and fetches back the result from the kernel.
- **Kernel.** The /fpga/hw.cpp contains the Vitis kernel for FPGA execution. The /cpu/sw.cpp contains the kernel executed on the CPU.
- **Results.** Results are stored in /results.
- **Header Files.** Header files and helper functions are contained in the following directories:
  - /common: files shared by FPGA and CPU codes
  - /common/libs: libraries for FPGA software emulation
  - /cpu: files unique to CPU execution
  - /fpga: files unique to FPGA execution
- **Hardware Design.**
  - design.cfg and hls\_config.tcl define several options for the v++ compiler. Learn more about it here.
  - xrt.ini defines the options necessary for Vitis Analyzer.

# 3.3 Setup and Launch

This section summarizes the steps required to compile and run the application under the different acceleration modes provided in the Makefile.

#### 3.3.1 Shared Instructions

1. Open /code/common/app.cpp and set the number of models N\_MODEL you want to compute (1200 in our benchmark specification):

```
#define N_MODEL 6 // total number of models
```

2. Open /code/common/definitions.h and set the grid size:

```
#define NKGRID 100 ///< number of grid points
#define NKM_GRID 4 ///< number of grid points for the mean of capital distribution grid
```

   In the FPGA execution the user can only choose NKGRID  $\in$  {100, 200, 300} and NKM\_GRID  $\in$  {4, 8}.

3. Open /code/common/dev\_options.h and set:

```
// Set only one of the below 4 to 1. For best performance, set _ACROSS_ECONOMY to 1 and the rest to 0
#define _BASELINE 0
#define _PIPELINE 0
#define _WITHIN_ECONOMY 0
#define _ACROSS_ECONOMY 1
```
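The restrictions stated in steps 2 and 3 could be checked at compile time, so that an unsupported configuration fails early instead of at run time. The sketch below is not part of the repository: the helper names `valid_nkgrid`, `valid_nkm_grid`, and `exactly_one` are hypothetical, and the values simply mirror the listings above.

```cpp
#include <cassert>

// Hypothetical helpers, not part of the repository.
constexpr bool valid_nkgrid(int n) { return n == 100 || n == 200 || n == 300; }
constexpr bool valid_nkm_grid(int n) { return n == 4 || n == 8; }
// True when exactly one of four 0/1 options is switched on.
constexpr bool exactly_one(int a, int b, int c, int d) { return a + b + c + d == 1; }

// Values as set in the listings above.
#define NKGRID 100
#define NKM_GRID 4

static_assert(valid_nkgrid(NKGRID), "FPGA execution supports NKGRID in {100, 200, 300}");
static_assert(valid_nkm_grid(NKM_GRID), "FPGA execution supports NKM_GRID in {4, 8}");
// Baseline, pipeline, within-economy, across-economy options, in that order.
static_assert(exactly_one(0, 0, 0, 1), "set only one acceleration option to 1");
```

Such checks would only be meaningful for the FPGA build, since the grid restriction is stated for the FPGA execution only.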

## 3.3.2 Serial execution on CPU

- **Setup.** Complete the steps 1-3 in section 3.3.1. To use the jump search algorithm similar to that implemented in the FPGA, select \_CUSTOM\_BINARY\_SEARCH in /code/common/dev\_options.h:

```
// Set only one of the below 3 to 1. For best CPU performance, set _CUSTOM_BINARY_SEARCH to 1 and the rest to 0
#define _LINEAR_SEARCH 0
#define _BINARY_SEARCH 0
#define _CUSTOM_BINARY_SEARCH 1
```

- **Compile and run.** Go to the directory /code. From there, use the following terminal instructions to compile and run the application:

```
make cpu
./app
```

# 3.3.3 Parallel execution on CPU using Open MPI

- **Install.** Install Open MPI using the provided script:

```
sh OpenMPI_install.sh
```

- Set the environment by executing the following commands in the terminal from the parent directory:

```
export PATH=$PATH:$HOME/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/openmpi/lib
```

- **Setup.** Complete the steps 1-3 in section 3.3.1.
- **Compile and run.** Go to the directory /code. From there, use the following terminal instructions to compile and run the application:

```
make openmpi
mpirun -n N ./openmpi_app   # replace N with the number of CPU cores
```

#### 3.3.4 FPGA execution

- Go to the directory /code.
- Setup. Complete the steps 1-3 in section 3.3.1
- The FPGA execution has two running modalities: the software emulation and the hardware image generation

#### 1. Software emulation

- Description. The main goal of software emulation (sw\_emu) is to ensure functional correctness of the host program and kernels. Software emulation provides a purely functional execution, without any modeling of timing delays, or latency; it does not give any indication of the accelerator performance. Hence, the sw\_emu target can be built and executed on the build instance which may not have an FPGA connected to it. Click here to know more about this.
- **Compile and Run.** From the folder /code, execute the following instructions in the terminal to compile and run the application:

```
# set up the environment
source $AWS_FPGA_REPO_DIR/vitis_setup.sh
export PLATFORM_REPO_PATHS=$(dirname $AWS_PLATFORM)
# build the target
make fpga TARGET=sw_emu
# run
source $AWS_FPGA_REPO_DIR/vitis_runtime_setup.sh
export XCL_EMULATION_MODE=sw_emu
./host ./fpga/build/runOnfpga.xclbin
```

#### 2. System Hardware Target

- **Description.** When the build target is the hardware, v++ builds the FPGA binary for the Xilinx device by running Vivado synthesis and implementation on the design. It is normal for this build target to take longer than generating either the software or hardware emulation targets in the Vitis IDE. Therefore, we recommend using a lower cost build instance (z1d.2xlarge) to generate the FPGA target. Click here to know more about this.

- **Compile.** From the folder /code, execute the following instructions in the terminal to generate the host and the FPGA target files on the build instance:

```
make clean
# set up the environment
source $AWS_FPGA_REPO_DIR/vitis_setup.sh
export PLATFORM_REPO_PATHS=$(dirname $AWS_PLATFORM)
export XCL_EMULATION_MODE=hw
# build the target
make afi EMAIL=<email address>
```

- **Run.** Launch a new runtime instance (f1.2xlarge) and copy the host and the FPGA target files (host, runOnfpga.awsxclbin) from the build instance to the runtime instance. Then execute the following commands to set up the Vitis environment and run on the FPGA device. If you would like to recreate the results from the paper using the FPGA binaries from the git repository, refer to section 3.8, which makes use of a bash script file.

  1. git clone https://github.com/aws/aws-fpga.git \$AWS\_FPGA\_REPO\_DIR (AWS repo)
  2. git clone https://github.com/AleP83/KS-FPGA.git -b "dev\_accel" (KS-FPGA project)
  3. source \$AWS\_FPGA\_REPO\_DIR/vitis\_setup.sh
  4. source \$AWS\_FPGA\_REPO\_DIR/vitis\_runtime\_setup.sh
  5. export PLATFORM\_REPO\_PATHS=\$(dirname \$AWS\_PLATFORM)
  6. ./host ./runOnfpga.awsxclbin

## 3.4 Header Files

#### File: /code/common/definitions.h

**Description:** This is the main header file. It defines all variables and structures, and it defines and initializes the model parameters, the simulation parameters, the number of states, the tolerance for convergence, the number of iterations, and the file paths, among others.

**Note.** The file describes the main structures:

- env\_t: stores model parameters, stochastic transition matrix, grids, wealth function, tax rate, wage, interest rate, and auxiliary variables for the agents' optimization problem;
- input\_t: stores aggregate and idiosyncratic shocks;
- var\_t: stores equilibrium individual capital holdings, cross-sectional distribution, coefficients of the aggregate law of motion of capital, and time series of aggregate capital holdings;
- out\_t: stores the computed results of the cross-sectional distribution, individual capital policy functions, coefficients for good and bad states, and r2 values;
- preinit\_t: stores the initial values of the aggregate capital and wealth.

#### File: /code/common/dev\_options.h

**Description:** This header file defines the macros used for the hardware acceleration, including: unrolling factors, finite precision of operations, and associated debugging macros.

#### File: /code/common/app.h

**Description:** This header file contains auxiliary C libraries in support of I/O operations, math operations, timing etc.

#### File: /code/common/cons.h

**Description:** This header file stores as constants the encoded aggregate and idiosyncratic shocks used in the Krusell and Smith simulation.

#### Files: /code/common/libs/\*.h

**Description:** This folder contains a collection of header files which provide both integer and fixed-point arbitrary precision data types for the OpenCL C++ API. The advantage of arbitrary precision data types is that they allow the C code to be updated to use variables with smaller bit-widths and then the C simulation to be re-executed to validate that the functionality remains identical or acceptable.

#### File: /code/fpga/hw.h

**Description:** This header file declares variables and functions in support of the FPGA acceleration kernel. In particular it declares:

- the kernel function runOnfpga;
- the structure hw\_env\_t, which is a stripped-down version of env\_t that keeps only the necessary structure members. This can be removed in the future by utilizing the definition from definitions.h;
- the regression functions;
- the linear interpolation function hw\_findrange and its variations;
- auxiliary math functions.

#### File: /code/cpu/sw.h

**Description:** This header file declares variables and functions in support of the CPU acceleration kernel, and it is comparable to hw.h for the FPGA.

#### File: /code/cpu/init.h

**Description:** This header file declares the functions used in init.cpp.

#### File: /code/cpu/stopwatch.h

**Description:** This header file contains the class definition for the stopwatch timer which is used for measuring all latencies.

# 3.5 Main: app.cpp

The file /common/app.cpp is the main file. The application uses the following macros to activate the alternative acceleration options: serial CPU (\_SERIAL\_CPU\_MODE), Open MPI parallel CPU cores (\_OPENMPI\_MODE), and FPGA acceleration (\_FPGA\_MODE).

```
#ifdef _OPENMPI_MODE
#define OMPI_MODE 1 // 1 ON, 0 OFF
#elif _FPGA_MODE
#define FPGA_MODE 1 // 1 ON, 0 OFF
#elif _SERIAL_CPU_MODE
#define SERIAL_CPU_MODE 1 // 1 ON, 0 OFF
#endif
```

When we issue one of the make commands (make cpu, make openmpi, make fpga), the corresponding macro is defined through the compiler's -D flag, so that only one of the above modes is set.
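As an illustration of this mechanism, the sketch below selects a mode from a macro passed on the compiler command line (for example, a flag such as -D\_FPGA\_MODE; the exact flags are set in the Makefile, which is not shown here). The function `selected_mode` is a hypothetical helper, not part of app.cpp, and the sketch tests the macros with `defined(...)`, a slightly more defensive variant of the listing above.

```cpp
#include <cassert>
#include <string>

// Build with, for example, -D_FPGA_MODE to select the FPGA branch.
#if defined(_OPENMPI_MODE)
#define OMPI_MODE 1
#elif defined(_FPGA_MODE)
#define FPGA_MODE 1
#elif defined(_SERIAL_CPU_MODE)
#define SERIAL_CPU_MODE 1
#endif

// Hypothetical helper that reports which mode was compiled in.
std::string selected_mode() {
#if defined(OMPI_MODE)
    return "openmpi";
#elif defined(FPGA_MODE)
    return "fpga";
#elif defined(SERIAL_CPU_MODE)
    return "serial_cpu";
#else
    return "none";
#endif
}
```

Because the branches form a single `#if`/`#elif` chain, at most one mode is active even if several macros are defined by mistake.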

# 3.5.1 Overview

The rest of the section describes the FPGA acceleration associated with \_FPGA\_MODE.

1. Setting up the OpenCL environment
2. Allocating the buffers
3. Setting up the kernels and initializing the buffers
4. Buffer transfer to the FPGA
5. Kernel execution on the FPGA
6. Buffer transfer from the FPGA
7. Event synchronization
8. Post-processing and release of resources

# 3.5.2 Setting up the OpenCL environment

The host code in the Vitis core development kit follows the OpenCL programming paradigm. To setup the runtime environment properly, the host application needs to initialize the standard OpenCL structures: target platform, devices, context, command queue, and program.

*Note:* Users can follow the native OpenCL C API. In this tutorial, however, we use the OpenCL C++ wrapper API, which is supported by XRT and used by many of the Vitis Examples. For more information on this C++ wrapper API, refer to this link. For the CPU implementation, we only use the C programming language, apart from the object-oriented class in the stopwatch.h file.

It is good coding practice to check for errors after each OpenCL API call. This helps when debugging the host and kernel code in the emulation flow or during hardware execution.

```
cl_int err = CL_SUCCESS;
```
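One way to apply this practice is a small helper that is called after every API call and reports the failing step. The sketch below is an assumption of ours, not code from app.cpp: `status_t`, `STATUS_OK`, and `check_status` are hypothetical stand-ins so that the example compiles without the OpenCL headers. In the real host code, the error variable is a cl\_int and the success value is CL\_SUCCESS, which the OpenCL headers define as 0.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins so the sketch compiles without the OpenCL headers.
using status_t = int;
constexpr status_t STATUS_OK = 0;  // plays the role of CL_SUCCESS

// Throws with the name of the failing step if the status is not OK.
void check_status(status_t err, const std::string &step) {
    if (err != STATUS_OK)
        throw std::runtime_error(step + " failed with error code " + std::to_string(err));
}
```

A call such as check\_status(err, "create command queue") placed after each constructor that receives &err would stop the program at the first failure rather than letting a later call fail in a less obvious way.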

The first command-line argument to the host executable (argv[1]) stores the path to the FPGA binary file (.xclbin or .awsxclbin).

```
std::string binaryFile = argv[1];
```

After a Xilinx platform is found, the application needs to identify the corresponding Xilinx devices. In the case of larger f1 instances, this may go up to 8 devices.

```
auto devices = xcl::get_xil_devices();
```

We then count them.

```
auto device_count = devices.size();
int NUM_DEVICES = (int) device_count;
```

The OpenCL program is written such that it automatically scales up depending on the number of FPGA devices that are found attached to the instance. Since each of the FPGAs can be individually programmed, we create one-dimensional vectors of contexts, programs, queues, and binaries. In the code example, the cl::Context API is used to create a context for each device.

vector<cl::Context> contexts(device\_count);

Create a program from a vector of source strings and the default context. Does not compile or link the program.

vector<cl::Program> programs(device\_count);

Create a vector of command queues, one for each of the FPGA devices.

vector<cl::CommandQueue> queues(device\_count);

Create a vector of kernels. Since the design makes use of three kernel compute units per FPGA device, we create a vector of 3 kernels for each device.

```
vector<vector<cl::Kernel>> kernels(device_count, vector<cl::Kernel>(NUM_KERNELS));
```

Assign a device name to each FPGA device.

```
vector<std::string> device_name(device_count);
```

cl::Program creates an OpenCL program object for a context and loads the binary bits specified by the binary in each element of the vector binaries into the program object.

```
vector<cl::Program::Binaries> bins(device_count);
```

Upon initialization, the host application needs to identify a platform composed of one or more Xilinx devices. The command cl::Platform::get stores the list of available platforms in the vector *platform*.

```
vector<cl::Platform> platform;
```

Our application assigns NUM\_KERNELS kernels to each device. Each FPGA-kernel compute unit is therefore in charge of computing COMP\_PER\_DEVICE economies sequentially.

```
int COMP_PER_DEVICE = ceil(N_MODEL/(NUM_DEVICES*NUM_KERNELS));
```

For example, in our baseline application we execute N\_MODEL = 1200 models. When we accelerate using the f1.16xlarge instance, we can launch 3 kernels on each of the 8 devices in parallel. Each of the 24 FPGA-kernel compute units is then in charge of computing $1200/(8 \times 3) = 50$ economies sequentially.
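The split above can be checked with a small integer ceiling division. The following is a minimal sketch, not the tutorial's host code; the helper names `ceil_div` and `economies_per_unit` are hypothetical. It also avoids a common pitfall: when both operands are integers, the division truncates before `ceil` is applied, so the rounding up has to be done in integer arithmetic (or after casting to a floating-point type).

```cpp
#include <cassert>

// Hypothetical helper: smallest integer >= n_items / n_workers.
inline int ceil_div(int n_items, int n_workers) {
    assert(n_items >= 0 && n_workers > 0);
    return (n_items + n_workers - 1) / n_workers;
}

// Number of economies each FPGA-kernel compute unit solves sequentially.
inline int economies_per_unit(int n_model, int num_devices, int num_kernels) {
    return ceil_div(n_model, num_devices * num_kernels);
}
```

With 1200 models, 8 devices, and 3 kernels per device, `economies_per_unit` returns 50; with 1201 models it returns 51, so no economy is left unassigned.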

## 3.5.3 Allocate the Buffers and Events

In the OpenCL API, data transfer between the host and the device (FPGA) is achieved by creating buffers with the cl::Buffer API and assigning a data pointer to each of them. To create these buffers in the stack memory, we need the size of the buffers (in bytes). The variable hw\_iter\_size sets the length of the buffer used to keep track of the number of IHP iterations. Since the hardware expects a fixed-size buffer, we arbitrarily choose 300 elements for our algorithm.

```
const size_t hw_iter_size = 300; ///< arbitrary number chosen to represent max iterations
```

To determine the number of bytes allocated per buffer, we multiply the total number of elements by the size of the data type used to represent the data.

```
const size_t hw_preinit_size_bytes = sizeof(preinit_t);
const size_t hw_out_size_bytes = sizeof(out_t);
const size_t hw_iter_size_bytes = sizeof(int) * (hw_iter_size);
```

Initialize a 2D vector array for inputs and outputs. In this example, we run the same economy several times, so we only need to initialize the input once; it can then be sent several times to different kernels on different FPGAs. The output of each FPGA kernel is copied to a separate file and stored.

```
vector<vector<preinit_t> > hw_preinit(NUM_DEVICES, vector<preinit_t>(NUM_KERNELS));
vector<vector<out_t> > hw_out(NUM_DEVICES, vector<out_t>(NUM_KERNELS));
```

Initialize a 3D vector array, in which the size of the 1st dimension is the number of devices, the size of the 2nd dimension is the number of kernels (per device), and the size of the 3rd dimension is the length of each variable.

For example, in the previous code, we instantiate a 2D vector of structures of type preinit\_t. The dimensions of this vector are NUM\_DEVICES x NUM\_KERNELS, the number of FPGA-kernel compute units.

Initialize two-dimensional vectors of OpenCL buffers, one for each variable that needs to be transferred between the host and the device.

```
vector<vector<cl::Buffer> > buffer_agshock(device_count, vector<cl::Buffer>(NUM_KERNELS));
vector<vector<cl::Buffer> > buffer_idshock(device_count, vector<cl::Buffer>(NUM_KERNELS));
vector<vector<cl::Buffer> > buffer_preinit(device_count, vector<cl::Buffer>(NUM_KERNELS));
vector<vector<cl::Buffer> > buffer_out(device_count, vector<cl::Buffer>(NUM_KERNELS));
vector<vector<cl::Buffer> > buffer_hw_iter(device_count, vector<cl::Buffer>(NUM_KERNELS));
```

Vectors of events are created to coordinate the read, compute, and write operations so that iterations are independent of each other, which allows data transfer and compute to overlap.

```
vector<vector<vector<cl::Event> > > memory_read_events(NUM_DEVICES, vector<vector<cl::Event> >(NUM_KERNELS, std::vector<cl::Event>(1)));
vector<vector<vector<cl::Event> > > task_events(NUM_DEVICES, vector<vector<cl::Event> >(NUM_KERNELS, std::vector<cl::Event>(1)));
vector<vector<vector<cl::Event> > > memory_write_events(NUM_DEVICES, vector<vector<cl::Event> >(NUM_KERNELS, std::vector<cl::Event>(1)));
```

For example, in the above code, we instantiate a 3D vector of type cl::Event that holds the read events used in later sections. The dimensions of this vector are NUM\_DEVICES x NUM\_KERNELS x 1.

## 3.5.4 Set Up Kernels and Initialize Buffers

After setting up the runtime environment, such as identifying devices, creating the context, command queue, and program, the host application should identify the kernels that will execute on the device, and set up the kernel arguments.

The OpenCL context, queue, and device name are initialized for each FPGA.

```
OCL_CHECK(err, contexts[d] = cl::Context(devices[d], props, nullptr, nullptr, &err));
OCL_CHECK(err, queues[d] = cl::CommandQueue(contexts[d], devices[d], CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err));
OCL_CHECK(err, device_name[d] = devices[d].getInfo<CL_DEVICE_NAME>(&err));
```

Each FPGA device needs to be loaded and programmed with a binary file.

```
fileBuf[d] = xcl::read_binary_file(binaryFile);
bins[d].push_back({ fileBuf[d].data(), fileBuf[d].size() });
programs[d] = load_cl2_binary(bins[d], devices[d], contexts[d]);
```

The OpenCL API cl::Kernel should be used to access the kernels contained within the .xclbin file (the "program"). The cl::Kernel object identifies a kernel in the program loaded into the FPGA that can be run by the host application. In our paper, we propose a design that instantiates at most three kernels, one in each of the three compute units (SLRs) of our FPGA device. We therefore identify each of the three kernels with the suffix shown below. The kernel names are those defined in the *design.cfg* file. For example, in the code below NUM\_KERNELS is set to 3, so the three kernels implemented on a single FPGA are named runOnfpga\_1, runOnfpga\_2, and runOnfpga\_3. Buffers are created separately for each FPGA device, as shown below.

```
for (int k = 0; k < NUM_KERNELS; k++) {
    if (k % 5 == 0) {
        OCL_CHECK(err, kernels[d][k] = cl::Kernel(programs[d], "runOnfpga:{runOnfpga_1}", &err));
    }
    if (k % 5 == 1) {
        OCL_CHECK(err, kernels[d][k] = cl::Kernel(programs[d], "runOnfpga:{runOnfpga_2}", &err));
    }
    if (k % 5 == 2) {
        OCL_CHECK(err, kernels[d][k] = cl::Kernel(programs[d], "runOnfpga:{runOnfpga_3}", &err));
    }
}
```
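The name strings passed to cl::Kernel above follow the pattern `kernel:{compute_unit}`. As a minimal sketch, assuming the compute units are numbered consecutively from 1 as in this design, the string could be built from the loop index instead of being written out per case. The helper name `compute_unit_name` is hypothetical and is not part of the tutorial's host code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds "<kernel>:{<kernel>_<k+1>}" for the
// zero-based compute-unit index k, e.g. "runOnfpga:{runOnfpga_1}" for k = 0.
inline std::string compute_unit_name(const std::string &kernel, int k) {
    assert(k >= 0);
    return kernel + ":{" + kernel + "_" + std::to_string(k + 1) + "}";
}
```

The result could then be handed to the kernel constructor through `compute_unit_name("runOnfpga", k).c_str()`.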

Interactions between the host program and hardware kernels rely on creating buffers and transferring data to and from the memory in the device. This process makes use of functions like cl::Buffer and clEnqueueMigrateMemObjects. There are two methods for allocating memory buffers, and transferring data:

- 1. Letting XRT Allocate Buffers
- 2. Using Host Pointer Buffers

In the case where XRT allocates the buffer, use cl::CommandQueue::enqueueMapBuffer to capture the buffer handle. In the second case, allocate the buffer directly with CL\_MEM\_USE\_HOST\_PTR, so you do not need to capture the handle.

On data center platforms, it is more efficient to allocate memory aligned on 4K page boundaries. On embedded platforms, it is more efficient to perform contiguous memory allocation. In either case, you can let XRT allocate host memory when creating the buffers. This is done by using the CL\_MEM\_ALLOC\_HOST\_PTR flag when creating the buffers, and then mapping the allocated memory to user-space pointers using cl::CommandQueue::enqueueMapBuffer. With this approach, it is not necessary to create a host-space pointer aligned to the 4K boundary.
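When host pointer buffers are used instead, the 4K alignment mentioned above has to be arranged by the application. The following is a minimal sketch under the assumption of a POSIX system (it relies on `posix_memalign`); the helper names are hypothetical and are not taken from the tutorial's host code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

constexpr std::size_t kPageSize = 4096;  // 4K page boundary

// Returns a block of at least `bytes` bytes aligned to 4K, or nullptr on
// failure. The caller releases it with free().
inline void *alloc_page_aligned(std::size_t bytes) {
    assert(bytes > 0);
    void *ptr = nullptr;
    if (posix_memalign(&ptr, kPageSize, bytes) != 0) {
        return nullptr;
    }
    return ptr;
}

// True when the pointer sits on a 4K page boundary.
inline bool is_page_aligned(const void *ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % kPageSize == 0;
}
```

A pointer obtained this way could be passed as the host pointer of a buffer created with CL\_MEM\_USE\_HOST\_PTR.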

The cl::CommandQueue::enqueueMapBuffer API maps the specified buffer and returns a pointer created by XRT to this mapped region. Then, fill the host-side pointer with your data and call cl::CommandQueue::enqueueMigrateMemObjects to transfer the data to and from the device. The following code example uses this style for the shock buffers, and host pointer buffers for the remaining ones:

```
std::cout << "Creating Buffers [" << d << "] [" << k << "]... " << std::endl;
OCL_CHECK(err, buffer_agshock[d][k] = cl::Buffer(contexts[d], CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, (cl::size_type) AGSHOCK_ARR_SIZE, NULL, &err));
OCL_CHECK(err, buffer_idshock[d][k] = cl::Buffer(contexts[d], CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, (cl::size_type) IDSHOCK_ARR_SIZE, NULL, &err));
OCL_CHECK(err, buffer_preinit[d][k] = cl::Buffer(contexts[d], CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, hw_preinit_size_bytes, &hw_preinit[d][k], &err));
OCL_CHECK(err, buffer_out[d][k] = cl::Buffer(contexts[d], CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, hw_out_size_bytes, &hw_out[d][k], &err));
OCL_CHECK(err, buffer_hw_iter[d][k] = cl::Buffer(contexts[d], CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, hw_iter_size_bytes, hw_iter[d][k].data(), &err));
```

There are two main parts of a cl\_mem object: the host-side pointer and the device-side pointer. Before the kernel starts its operation, the device-side pointer is implicitly allocated in device memory (for example, at a specific location inside the device global memory) and the buffer becomes resident on the device. With cl::CommandQueue::enqueueMigrateMemObjects, this allocation and data transfer occur upfront, well ahead of the kernel execution. This especially helps to enable software pipelining if the host executes the same kernel multiple times: the data transfer for the next transaction can happen while the kernel is still operating on the previous data set, which hides the data transfer latency of successive kernel executions.

In the Vitis software platform, two types of arguments can be set for kernel objects:

- 1. Scalar arguments are used for small data transfer, such as constant or configuration type data. These are write-only arguments from the host application perspective, meaning they are inputs to the kernel.
- 2. Memory buffer arguments are used for large data transfer. The value is a pointer to a memory object created with the context associated with the program and kernel objects. These can be inputs to, or outputs from the kernel.

Kernel arguments can be set using the cl::Kernel::setArg command, as shown in the following example, which sets the five buffer arguments of each kernel and then maps the two shock buffers to host pointers.

```
for (int d = 0; d < NUM_DEVICES; d++) {
    for (int k = 0; k < NUM_KERNELS; k++) {
        OCL_CHECK(err, err = kernels[d][k].setArg(0, buffer_agshock[d][k]));
        OCL_CHECK(err, err = kernels[d][k].setArg(1, buffer_idshock[d][k]));
        OCL_CHECK(err, err = kernels[d][k].setArg(2, buffer_preinit[d][k]));

        OCL_CHECK(err, err = kernels[d][k].setArg(3, buffer_out[d][k]));
        OCL_CHECK(err, err = kernels[d][k].setArg(4, buffer_hw_iter[d][k]));
        std::cout << "Completed Setting Arguments" << std::endl;
        agshock_ptr[d][k] = (unsigned char *) queues[d].enqueueMapBuffer(buffer_agshock[d][k], CL_TRUE, CL_MAP_WRITE, 0,
            AGSHOCK_ARR_SIZE);
        idshock_ptr[d][k] = (unsigned char *) queues[d].enqueueMapBuffer(buffer_idshock[d][k], CL_TRUE, CL_MAP_WRITE, 0,
            IDSHOCK_ARR_SIZE);
    }
}
```

We then allocate NUM\_DEVICES x NUM\_KERNELS inputs, which we reuse to launch these kernels COMP\_PER\_DEVICE times.

```
env_t env[NUM_DEVICES][NUM_KERNELS];
input_t in[NUM_DEVICES][NUM_KERNELS];
vars_t vars[NUM_DEVICES][NUM_KERNELS];
```

For each economy, we initialize the inputs that will be transferred to the FPGA device.

```
init_all(&env[d][k], &in[d][k], &vars[d][k]);

for (int i = 0; i < NSTATES; i++) {
    hw_preinit[d][k].kprime[i] = vars[d][k].kprime_a[i];
}

for (int i = 0; i < NSTATES; i++) {
    hw_preinit[d][k].wealth[i] = env[d][k].wealth[i];
}

memcpy(agshock_ptr[d][k], in[d][k].agshock, AGSHOCK_ARR_SIZE);
memcpy(idshock_ptr[d][k], in[d][k].idshock, IDSHOCK_ARR_SIZE);
```

# 3.5.5 Copy Input from Host to Device

Transfer the data from the host to global memory using the OpenCL API call enqueueMigrateMemObjects. The definition of this API can be found here.

```
printf("Migrating buffers to kernel\n");
if (i == 0) {
    OCL_CHECK(err,
        err = queues[d].enqueueMigrateMemObjects(
            {buffer_agshock[d][k], buffer_idshock[d][k], buffer_preinit[d][k]},
            0 /* 0 means from host */, nullptr, &memory_read_events[d][k][0]));
} else {
    OCL_CHECK(err,
        err = queues[d].enqueueMigrateMemObjects(
            {buffer_agshock[d][k], buffer_idshock[d][k], buffer_preinit[d][k]},
            0 /* 0 means from host */, &memory_write_events[d][k], &memory_read_events[d][k][0]));
}
```

# 3.5.6 Submit Kernel for Execution

Often the compute intensive task required by the host application can be defined inside a single kernel, and the kernel is executed only once to work on the entire data range. Though the kernel is executed only one time, and works on the entire range of the data, the parallelism is achieved on the FPGA inside the kernel hardware. If properly coded, the kernel is capable of achieving parallelism by various techniques such as instruction-level parallelism (loop pipeline) and function-level parallelism (dataflow).

In this tutorial, to keep things simple, we create a single kernel for each of the SLR compute units in the FPGA device(s). Therefore, we can have a maximum of 24 independent kernels (on the f1.16xlarge) running in parallel. Each device has its own command queue. When allocating economies across kernels, it is advisable to split them equally among all available kernels. In this case, an out-of-order command queue can determine how the kernel tasks are processed, as explained in Command Queues.

```
OCL_CHECK(err,
    err = queues[d].enqueueTask(kernels[d][k], &memory_read_events[d][k],
                                &task_events[d][k][0]));
```

## 3.5.7 Copy the results back

After the kernel computation is completed, the host code can initiate the read-back of the computed results. Depending on whether the kernel tasks are launched in order or out of order, the results are read back once the cl::Event indicates that the data is ready, as explained in the next sections.

```
OCL_CHECK(err,
    err = queues[d].enqueueMigrateMemObjects({buffer_out[d][k], buffer_hw_iter[d][k]},
        CL_MIGRATE_MEM_OBJECT_HOST, &task_events[d][k], &memory_write_events[d][k][0]));
```

# 3.5.8 Event Synchronization

All OpenCL enqueue-based API calls are asynchronous: they return immediately after the command is enqueued in the command queue. To pause the host program to wait for results, or to resolve dependencies among commands, an API call such as clFinish or clWaitForEvents can be used to block execution of the host program.

```
queues[d].finish();
```

Note how the commands have been used in the example above:

- 1. The clFinish API (called here through cl::CommandQueue::finish) has been explicitly used to block the host execution until the kernel execution is finished. This is necessary because otherwise the host can attempt to read back from the FPGA buffer too early and may read garbage data.
- 2. cl::Event objects express the dependencies among commands: in the snippets above, the kernel task waits on memory\_read\_events, and the read-back of the results waits on task\_events.

## 3.5.9 Printing Results

We copy the results into text files and store the values of each computed economy.

```
for (int d = 0; d < NUM_DEVICES; d++) {
    for (int k = 0; k < NUM_KERNELS; k++) {
        FILE *cfile;
        char FileName[512];
        printf("Migrating buffers from kernel\n"); // add kgrid, km grid to file names
        sprintf(FileName, "%sfpga_nkM%d-nk%d_i%d_d%d_k%d.txt", KP_OUT_FILE, NKM_GRID, NKGRID, i, d, k);
        cfile = fopen(FileName, "w");
        for (int i = 0; i < NSTATES; i++) {
            fprintf(cfile, "%.15lf\n", hw_out[d][k].kprime[i]);
        }
        fclose(cfile);
        // ...
    }
}
```
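The file-name pattern used above can also be produced with a bounded write. The following is a minimal sketch that swaps `sprintf` for `std::snprintf` so that an overly long prefix cannot overflow the buffer; the helper name `result_file_name` and its parameters are hypothetical and only mirror the format string in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats "<prefix>fpga_nkM<nkm>-nk<nk>_i<i>_d<d>_k<k>.txt".
inline std::string result_file_name(const std::string &prefix, int nkm_grid,
                                    int nk_grid, int i, int d, int k) {
    char buf[512];
    const int n = std::snprintf(buf, sizeof(buf),
                                "%sfpga_nkM%d-nk%d_i%d_d%d_k%d.txt",
                                prefix.c_str(), nkm_grid, nk_grid, i, d, k);
    // snprintf returns the length it wanted to write; check nothing was cut off.
    assert(n >= 0 && static_cast<std::size_t>(n) < sizeof(buf));
    return std::string(buf);
}
```

The returned string can be passed to `fopen` through `.c_str()`.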

In addition to storing several values, we print some of the main results on the console for a quick check.

```
for (int d = 0; d < NUM_DEVICES; d++) {
    for (int k = 0; k < NUM_KERNELS; k++) {
        printf("i=%d d=%d k=%d Bad Coeff 0: %.15lf\n", i, d, k, hw_out[d][k].coeff[0]);
        printf("i=%d d=%d k=%d Bad Coeff 1: %.15lf\n", i, d, k, hw_out[d][k].coeff[1]);
        printf("i=%d d=%d k=%d Bad R2: %.15lf\n", i, d, k, hw_out[d][k].r2[0]);
        printf("i=%d d=%d k=%d Good Coeff 0: %.15lf\n", i, d, k, hw_out[d][k].coeff[2]);
        printf("i=%d d=%d k=%d Good Coeff 1: %.15lf\n", i, d, k, hw_out[d][k].coeff[3]);
        printf("i=%d d=%d k=%d Good R2: %.15lf\n\n", i, d, k, hw_out[d][k].r2[1]);
        printf("i=%d d=%d k=%d Total EGM iter: %d\n", i, d, k, total_egm_iter[d][k]);
        printf("i=%d d=%d k=%d Total Main loop iter: %d\n\n", i, d, k, hw_iter[d][k][0]);
    }
}
```

**Free resources.** At the end of the host code, all resources allocated in heap memory should be released. If the resources are not properly released, the Vitis core development kit might not be able to generate correct performance-related profile and analysis reports. Most of the OpenCL C++ API classes define a destructor, so we do not have to deallocate most of them explicitly.

```
for (int d = 0; d < NUM_DEVICES; d++) {
    for (int k = 0; k < NUM_KERNELS; k++) {
        free_all(&in[d][k]);
    }
}
```

# 3.5.10 Open MPI

This subsection describes the Open MPI-specific code associated with \_OPENMPI\_MODE.

Begin by initializing the MPI environment.

```
mpi_enabled = MPI_Init(NULL, NULL);
```

Collect the number of processes (available cores).

```
int n_tasks;
MPI_Comm_size(MPI_COMM_WORLD, &n_tasks);
```

Collect the rank of the calling process.

```
int id_task;
MPI_Comm_rank(MPI_COMM_WORLD, &id_task);
```

Block all processes in the communicator MPI\_COMM\_WORLD until all processes have called it.

```
MPI_Barrier(MPI_COMM_WORLD);
```

Specify the range of models for each process to compute. We assign the economies equally across processes.

```
// Range of tasks per processor.
int i_min_task_id, i_max_task_id;

// Define the Block to be assigned to each task
parameters_range_pertask(0, N_MODEL-1, n_tasks, id_task, &i_min_task_id, &i_max_task_id);
```
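The tutorial does not list the body of parameters\_range\_pertask. As an illustration only, one common way to split an inclusive index range into contiguous blocks of nearly equal size is sketched below; the function `block_range` is a hypothetical stand-in with a similar argument order and is not the authors' implementation.

```cpp
#include <cassert>

// Hypothetical stand-in: splits the inclusive range [first, last] into
// n_tasks contiguous blocks whose sizes differ by at most one, and returns
// the block of task id_task as [*lo, *hi].
inline void block_range(int first, int last, int n_tasks, int id_task,
                        int *lo, int *hi) {
    assert(n_tasks > 0 && id_task >= 0 && id_task < n_tasks && last >= first);
    const int total = last - first + 1;
    const int base = total / n_tasks;   // minimum block size
    const int extra = total % n_tasks;  // the first `extra` tasks get one more
    const int before = id_task * base + (id_task < extra ? id_task : extra);
    const int size = base + (id_task < extra ? 1 : 0);
    *lo = first + before;
    *hi = *lo + size - 1;               // hi < lo signals an empty block
}
```

With 1200 models and 24 processes, task 0 receives models 0 to 49 and task 23 receives models 1150 to 1199.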

Next, the processes compute their assigned economies in parallel.

```
for (int i = i_min_task_id; i <= i_max_task_id; i++) {
    // ...
    env_t env;
    input_t in;
    vars_t vars;
    out_t out;
    int hw_iter[500];

    init_all(&env, &in, &vars);

    runOncpu(&env, &vars, in.agshock, in.idshock, &out, hw_iter);
}
```

Save the results of each computed model.

```
FILE *cfile;
```

Print the final R2 scores and coefficient values of each model in the terminal.

```
printf("Total EGM iter: %d\n", total_egm_iter);
printf("Total Main loop iter: %d\n", hw_iter[0]);
printf("Bad Coeff 0: %.15lf\n", out.coeff[0]);
printf("Bad Coeff 1: %.15lf\n", out.coeff[1]);
printf("Good Coeff 0: %.15lf\n", out.coeff[2]);
printf("Good Coeff 1: %.15lf\n", out.coeff[3]);
printf("Bad R2: %.15lf\n", out.r2[0]);
printf("Good R2: %.15lf\n", out.r2[1]);
```

After the processes have completed their assigned economies, terminate the MPI environment and exit.

```
MPI_Finalize();
```

# 3.6 Kernel: hw.cpp

The file /common/hw.cpp contains the hardware design of the kernel.

# 3.6.1 Common HLS Optimization Pragmas

This section describes the main pragmas used to design the hardware acceleration of our algorithm.

#### 3.6.1.1 #pragma HLS ARRAY\_PARTITION

Each memory block (BRAM, URAM) has a limited number of ports for reading from or writing to the memory. For example, a BRAM block usually has 2 ports. When data is stored contiguously in a dual-port BRAM block, we can read at most 2 elements in the same clock cycle. This may create a bottleneck when we want to access more than two elements simultaneously. To overcome this challenge, Xilinx suggests storing the data across multiple memory blocks instead of contiguously. By partitioning an array across *N* memory blocks, each of which can have up to 2 memory ports, we enable a maximum of 2*N* memory accesses in a single cycle. We can instruct the Vitis compiler to split the elements of an array and map them to smaller arrays using **#pragma HLS ARRAY\_PARTITION**. There are 3 main ways to partition an array, as described in Figure 3.1. *Source:* Xilinx link.





*Note:* Array partition using the three types: (*i*) Block; (*ii*) Cyclic; and (*iii*) Complete. The image is taken from Xilinx UG1393.
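To make the difference between block and cyclic partitioning concrete, the following minimal sketch computes which bank (smaller array) an element lands in under each scheme. It is plain C++ for illustration, not HLS code, and the helper names are hypothetical. Complete partitioning corresponds to one bank per element.

```cpp
#include <cassert>

// Cyclic partitioning: consecutive elements are interleaved across banks,
// so neighbouring indices land in different banks.
inline int cyclic_bank(int i, int factor) {
    assert(i >= 0 && factor > 0);
    return i % factor;
}

// Block partitioning: the array is cut into `factor` blocks of consecutive
// elements, so neighbouring indices usually land in the same bank.
inline int block_bank(int i, int n, int factor) {
    assert(i >= 0 && i < n && factor > 0);
    const int block = (n + factor - 1) / factor;  // elements per bank, rounded up
    return i / block;
}
```

For an 8-element array split into 4 banks, elements 4 and 5 fall into banks 0 and 1 under cyclic partitioning, but both fall into bank 2 under block partitioning. This is why a loop that reads neighbouring elements in the same cycle typically benefits from the cyclic scheme.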

| Partition                                   | Resulting arrays |
|---------------------------------------------|------------------|
| my_array[10][6][4], partition dimension 3   | my_array_0[10][6]<br>my_array_1[10][6]<br>my_array_2[10][6]<br>my_array_3[10][6] |
| my_array[10][6][4], partition dimension 1   | my_array_0[6][4]<br>my_array_1[6][4]<br>my_array_2[6][4]<br>my_array_3[6][4]<br>my_array_4[6][4]<br>my_array_5[6][4]<br>my_array_6[6][4]<br>my_array_7[6][4]<br>my_array_8[6][4]<br>my_array_9[6][4] |
| my_array[10][6][4], partition dimension 0   | 10x6x4 = 240 elements |

Figure 3.2: Partitioning Dimensions of an Array

*Note:* This figure shows how the same array can be partitioned across different dimensions (0, 1, 3), resulting in 240, 10, and 4 separate arrays, respectively. The image is taken from Xilinx UG1393.

#### 3.6.1.2 #pragma HLS UNROLL

In order to make use of the FPGA resources, the designer can spatially unroll loops to create multiple independent operations rather than a single collection of operations. The **#pragma HLS UNROLL** transforms loops by creating multiple copies of the loop body in hardware so that they can all execute in parallel. By default, the loop is unrolled completely; the user can request partial unrolling by a specific factor using the factor option. *Source:* Xilinx link.
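As a minimal sketch of how unrolling and array partitioning work together, the function below sums 16 values with the loop unrolled by a factor of 4 and the local buffer cyclically partitioned into 4 banks, so that the 4 copies of the loop body can read from different banks. The function name and sizes are assumptions chosen for illustration. A regular C++ compiler ignores the HLS pragmas, so the function can also be checked on the CPU.

```cpp
#include <cassert>

constexpr int N = 16;

// Sums N values. The pragmas are hints for Vitis HLS only.
inline int sum16(const int in[N]) {
    assert(in != nullptr);
    int buf[N];
#pragma HLS array_partition variable=buf cyclic factor=4 dim=1
    for (int i = 0; i < N; i++) {
        buf[i] = in[i];  // copy into the on-chip buffer
    }
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS unroll factor=4
        acc += buf[i];   // 4 copies of this statement per loop iteration
    }
    return acc;
}
```

Matching the unroll factor to the partition factor is a design choice: with fewer banks than loop-body copies, the copies would compete for memory ports.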

#### 3.6.1.3 #pragma HLS PIPELINE

A pipelined function (or loop) processes new inputs every *N* clock cycles, where *N* is the Initiation Interval (II) of the loop or a function. By default, the II for the **#pragma HLS PIPELINE** is set to 1. However, a user can specify the required value using the II option for the pragma.

Figure 3.4 shows a case where placing the pipeline pragma at different loop levels results in three different unrollings of the inner loops, with correspondingly different hardware resources and memory accesses. The user needs to make a conscious choice about the placement of the pipeline pragma. If the data accessed inside the loop cannot be processed in a single cycle, the II of the loop changes from 1 to N, where N is the number of clock cycles after which the data of the next loop iteration can be accessed.

Loop pipelining can be prevented by loop-carried dependencies or by inner loops with variable bounds. It can also be limited if the required data cannot be accessed in a single clock cycle. In that case, the designer can solve the problem by using the **#pragma HLS ARRAY\_PARTITION** discussed in the previous section. *Source:* Xilinx link.
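The effect of the II on latency can be estimated with a standard back-of-the-envelope formula: the first iteration takes the full pipeline depth, and each later iteration starts II cycles after the previous one. The sketch below encodes this approximation; it ignores loop entry and exit overhead, and the function names are hypothetical.

```cpp
#include <cassert>

// Approximate cycle count of a pipelined loop with n_iter iterations,
// initiation interval ii, and an iteration latency of `depth` cycles.
inline long pipelined_cycles(long n_iter, long ii, long depth) {
    assert(n_iter >= 0 && ii >= 1 && depth >= 1);
    return n_iter == 0 ? 0 : depth + ii * (n_iter - 1);
}

// The same loop without pipelining: iterations run back to back.
inline long sequential_cycles(long n_iter, long depth) {
    assert(n_iter >= 0 && depth >= 1);
    return n_iter * depth;
}
```

For 100 iterations of a 5-cycle loop body, this gives 500 cycles without pipelining, 104 cycles with II=1, and 203 cycles with II=2, which shows why a dependency that raises the II roughly doubles the loop latency.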





*Note:* This figure shows how unrolling by different factors decreases the overall latency of the loop while increasing the hardware resources.





Figure 3.5: Data dependency preventing II=1

*Note:* This figure shows how a data dependency in a loop prevents the pipeline from achieving II=1.

#### 3.6.1.4 #pragma HLS LOOP\_TRIPCOUNT

This pragma does not perform any optimization and has no impact on the results of synthesis. However, for loops with undefined bounds, it can be applied to manually specify the expected number of iterations.

When generating the output binary file, after the first step of C synthesis, Vitis HLS provides synthesis reports. These reports contain important information about the latencies of all the major loops. Wherever a loop bound depends on a data-dependent variable, the tool is unable to estimate the latency. The pragma instructs the tool to calculate the latencies for the given number of iterations. This information helps us keep track of the results of the optimizations that we perform.

In this example, loop\_1 is specified to have minimum, average, and maximum trip counts of 12, 14, and 16, respectively. Without this pragma, the tool cannot determine the loop latency.

```
void foo (num_samples, ...) {
    int i;
    ...
    loop_1: for(i=0;i< num_samples;i++) {
        #pragma HLS loop_tripcount min=12 max=16 avg=14
        ...
        result = a + b;
    }
}
```

Source: Xilinx link.

#### 3.6.1.5 #pragma HLS INLINE

This pragma removes a function as a separate entity in the hierarchy. This reduces the overhead of the function call and can allow the function to be optimized into the caller. When you inline, you obtain a separate set of hardware for each place where the function is inlined. *Source:* Xilinx link.
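As a minimal sketch, the helper below is marked for inlining so that its multiply-add logic is merged into the calling loop instead of being synthesized as a separate module. The function names are assumptions chosen for illustration. A regular C++ compiler ignores the HLS pragma, so the code can also be checked on the CPU.

```cpp
#include <cassert>

// Small helper that HLS is asked to dissolve into each caller.
inline int mac(int acc, int a, int b) {
#pragma HLS inline
    return acc + a * b;
}

// Dot product of two length-4 vectors built on the inlined helper.
inline int dot4(const int x[4], const int y[4]) {
    assert(x != nullptr && y != nullptr);
    int acc = 0;
    for (int i = 0; i < 4; i++) {
        acc = mac(acc, x[i], y[i]);
    }
    return acc;
}
```

Because each call site receives its own copy of the hardware, inlining a large function in many places trades area for the removal of call overhead.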

#### 3.6.2 Overview

The kernel is organized in:

- a parent function that manages data transfers from and to the host and executes the fixed point algorithm: runOnfpga;
- four functions that execute the KS algorithm: hw\_sim\_alm, hw\_sim\_ihp, hw\_sim\_ast, sim\_alm\_coeff;
- auxiliary functions that support or accelerate the algorithm: hw\_pow, hw\_exp, hw\_log, hw\_sqrt, hw\_fabs, hw\_init\_env, hw\_rail\_values, hw\_fxd\_rail\_values, hw\_findrange, hw\_findrange\_n4, hw\_findrange\_n100, regression, RSqauredCalc.

# 3.6.3 Parent Kernel Function: runOnfpga

runOnfpga is the parent kernel function which:

- 1. manages the FPGA interface
- 2. manages the memory allocation
- 3. executes the nested fixed point algorithm
- 4. sends the results back to the host

#### 3.6.3.1 Memory Management

The kernel function name of the complete synthesized logic is **runOnfpga**. The code snippet below lists the parameters that are passed to the kernel from the host. Most of the parameters are pointers to off-chip DRAM, which resides in the external DDR memory in the data center. The latency of an off-chip memory access is extremely large, and such an access costs a lot of energy compared to an on-chip memory access. Therefore, the first step is to allocate on-chip memories for all the data variables that are accessed multiple times, and then to initialize the on-chip memories with the data from the off-chip memory. We discuss the memory allocation of several variables, which make use of different on-chip memory resources such as BRAM, URAM, and registers.

```
void runOnfpga(
    const unsigned char *hw_agshock,
    const unsigned char *hw_idshock,
    preinit_t *preinit,
    out_t *results,
    int *hw_iter)
```

Structure variables declared outside the main function are treated as static variables, and their data is retained across multiple invocations of the kernel. It is recommended to limit the usage of global variables.

```
/** Static on-PL memories */
static hw_env_t st_env;
```

Throughout the program, we make use of the structure variable st\_env of type hw\_env\_t, which holds the calibration parameters and some temporary data variables, as defined in the file hw.h.

We can create local variables whose scope is limited to the function that they are allocated in. In our program, we allocate the following variables that are common across different functions. By default, the Vitis compiler would try to choose a memory type depending on the data access patterns. For example, if the program only reads a value from a pre-initialized data variable, the tool may choose to synthesize that variable using single ported BRAM. This consumes less hardware resources as compared to the dual port BRAM resources. Most of the default memory allocations work well with the designs. However, the user is free to change the default memory types as per their requirement using the **#pragma HLS BIND\_STORAGE**.

We optimize the memory resource used to store the individual shocks, declared here as idshock. The program uses an #if condition that checks PACK\_IDS. If it is enabled in the dev\_options.h file, we instruct the tool to allocate NEW\_IDSHOCK\_SIZE rows of width 72 bits. On x86 machines, we are usually limited to a double to store large numbers. On the FPGA, however, we can use a custom fixed-point number that is wider than 64 bits. More details are given below. When PACK\_IDS is disabled, the tool is free to choose a suitable memory, which we observe to be BRAM18.

```
unsigned char agshock[AGSHOCK_ARR_SIZE];

#if PACK_IDS
ap_uint<72> idshock[NEW_IDSHOCK_SIZE] = {0};
#else
unsigned char idshock[IDSHOCK_ARR_SIZE] = {0};
#endif
real st_kcross[N_AGENTS];
real st_kprimes[NUM_KPRIMES][NSTATES];
real kmts[SIM_STEPS];
real r2[NSTATES_AG];
real kmprime[NSTATES_AG * NKM_GRID];
real coeff[NCOEFF] = {0, 1, 0, 1};
real metric_coeff = 1000; // some large number

#if PACK_IDS
#pragma HLS bind_storage variable = idshock type = RAM_1P impl = URAM
#endif
```

In our program, we optimize the memory usage for some of the data variables with **#pragma HLS BIND\_STORAGE**. The variable is specified using the keyword variable, the type of memory is selected using type, and the implementation using impl. Xilinx provides a complete list of possible combinations that can be found here. With these options, the tool uses URAM of type single-port RAM to implement the idshock variable. We choose a single-port RAM because we write the data to this variable only once and read from it at most once per clock cycle. The data read for idshock is further explained in the section (hw\_sim\_ast). Note that the size of every array needs to be specified for it to be synthesized.

```
    #pragma HLS array_partition variable = st_env.k complete dim = 1
    #pragma HLS array_partition variable = st_env.km complete dim = 1
```

The memory containing the individual capital and the mean of the aggregate capital distribution needs to be accessed multiple times in the same clock cycle. Therefore, these two variables are partitioned completely.

After allocating the on-chip memories for the different data variables, we need to initialize them with the data from the off-chip memory before we start using them in Eq. ??. To perform this step efficiently, Xilinx recommends using *Burst Transfer*. Burst transfer refers to reading or writing chunks of data from or to the global memory in a single request. This is the most effective optimization for reads and writes to external memory, which is usually the DDR. The code below copies the aggregate shocks from the location in external memory pointed to by hw\_agshock to the variable agshock, which resides in on-chip memory.

```
for (int i = 0; i < AGSHOCK_ARR_SIZE; i++)
{
    agshock[i] = hw_agshock[i];
}
```

Similarly, we now burst transfer the id shocks. The code snippet below provides two options to demonstrate the improvement obtained by using URAM. When PACK\_IDS is enabled, we instruct the compiler to copy 8 consecutive input elements, each 8 bits wide, into a single element of the on-chip unsigned data type, filling 64 bits. As a result, accessing a single element of idshock gives access to 64 bits of idshocks. Otherwise, the compiler uses the default BRAM to store idshock, and each access to an array element returns at most 8 idshocks.

```
#if PACK_IDS
    // use URAM to store the idshocks
    // 8 idshocks are packed into 1 byte -> (1,100 * 10,000 / 8) = 1,375,000 bytes
    // copy to data variable of size 64 bits. Hence, 8 input bytes are copied to one element
main_2: // loop over each of the 1,100 time steps. (10,000 / 8) = 1250 bytes per time step
    for (int i = 0, j = 0; i < IDSHOCK_ARR_SIZE; i = i + 1250)
    {
    main_2_2: // for each time step, copy 8 bytes into a single element of size 64 bits
        for (int k = 0; k < 1250; j++)
        {
            // handle edge case where the last 2 bytes remain since 1,250 is not divisible by 8
            if (k == 1248)
            {
                // j (not i) indexes the packed on-chip array
                idshock[j] = (hw_idshock[i + k + 1] << 8) | (hw_idshock[i + k]);
                k = k + 2;
            }
            else
            {
                idshock[j] = (((ap_uint<72>)hw_idshock[i + k + 7] << 56) |
                              ((ap_uint<72>)hw_idshock[i + k + 6] << 48) |
                              ((ap_uint<72>)hw_idshock[i + k + 5] << 40) |
                              ((ap_uint<72>)hw_idshock[i + k + 4] << 32) |
                              ((ap_uint<72>)hw_idshock[i + k + 3] << 24) |
                              ((ap_uint<72>)hw_idshock[i + k + 2] << 16) |
                              ((ap_uint<72>)hw_idshock[i + k + 1] << 8) |
                              ((ap_uint<72>)hw_idshock[i + k + 0]));
                k = k + 8;
            }
        }
    }
#else
    // use BRAM to store the idshocks
main_2:
    for (int i = 0; i < IDSHOCK_ARR_SIZE; i++)
    {
        idshock[i] = hw_idshock[i];
    }
#endif
```
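
The packing scheme can be tried on a host machine without the HLS types. The following is a minimal sketch, assuming plain `std::uint64_t` words in place of `ap_uint<72>` and a least-significant-byte-first layout as in the shifts above; `pack_bytes` and `unpack_byte` are hypothetical helper names that do not appear in the tutorial's sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the tutorial's sources): packs the bytes of
// one time step into 64-bit words, least-significant byte first.
std::vector<std::uint64_t> pack_bytes(const std::vector<std::uint8_t> &bytes)
{
    std::vector<std::uint64_t> words;
    words.reserve((bytes.size() + 7) / 8);
    for (std::size_t k = 0; k < bytes.size(); k += 8)
    {
        std::uint64_t word = 0;
        // A short final group (2 bytes when the size is 1250) leaves the
        // upper bytes of the word at zero.
        for (std::size_t b = 0; b < 8 && k + b < bytes.size(); ++b)
        {
            word |= static_cast<std::uint64_t>(bytes[k + b]) << (8 * b);
        }
        words.push_back(word);
    }
    return words;
}

// Hypothetical helper: extracts byte b (0..7) from a packed word.
std::uint8_t unpack_byte(std::uint64_t word, unsigned b)
{
    assert(b < 8);
    return static_cast<std::uint8_t>((word >> (8 * b)) & 0xFF);
}
```

With 1,250 input bytes per time step, this sketch produces 157 words: 156 full words plus one word that holds the last 2 bytes.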

Further, we call a function that initializes the remaining data variables.

```
hw_top_init(st_kprimes, st_kcross);
```

The capital function for t = 0, kprimes, and kcross are burst-copied from the global memory.

```
void hw_top_init(
    real st_kprimes[NUM_KPRIMES][NSTATES], real st_kcross[N_AGENTS])
{
init_1:
    for (int j = 0; j < NSTATES; ++j)
    {
        real val = kp_in[j];
        for (int k = 0; k < NUM_KPRIMES; ++k)
        {
            st_kprimes[k][j] = val;
        }
    }
init_2:
    for (int j = 0; j < N_AGENTS; ++j)
    {
        st_kcross[j] = env_kss;
    }
    // (the function body continues in the next snippet)
```

Note that the initialization from here onwards can be moved to the host side, with the initialized data then sent to the device. This is left for future experiments. To reduce the cost of initializing some of the variables that are set only once, we pre-compute their values and store them locally.

```
    st_env.irate_factor[0] = 0.35640000000000;
    st_env.irate_factor[1] = 0.36360000000000;

    st_env.wage_factor[0] = 0.6336000000000;
    st_env.wage_factor[1] = 0.64640000000000;

    st_env.cons2_factor[0] = 0.1500000000000;
    st_env.cons2_factor[1] = 1.09444444444445;
    st_env.cons2_factor[2] = 0.1500000000000;
    st_env.cons2_factor[3] = 1.10486111111111;

    hw_init_env();
    return;
}
```
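
The pre-computed constants can be reproduced from the calibration. The sketch below is a host-side check, not the authors' derivation: alpha, mu, delta\_a and the inverse time endowment come from the comments in hw\_init\_env, while the unemployment rates (0.10 and 0.04) and the tax formula are assumptions chosen because they reproduce the listed numbers. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Illustrative calibration. alpha, mu, delta_a and 1/l_bar are taken from the
// comments in hw_init_env; the unemployment rates are an ASSUMPTION made here.
struct Calibration
{
    double alpha = 0.36;         // output capital share
    double mu = 0.15;            // unemployment benefits in terms of wages
    double delta_a = 0.01;       // size of the aggregate productivity shock
    double l_bar_inv = 0.9;      // inverse of the time endowment
    double ur[2] = {0.10, 0.04}; // ASSUMED unemployment rates (bad, good state)
};

// Aggregate productivity in state ia (0 = bad, 1 = good), as in st_env.ag.
double productivity(const Calibration &c, int ia)
{
    assert(ia == 0 || ia == 1);
    return ia == 0 ? 1.0 - c.delta_a : 1.0 + c.delta_a;
}

// alpha * a: the state-dependent constant in front of the interest rate.
double irate_factor(const Calibration &c, int ia)
{
    return c.alpha * productivity(c, ia);
}

// (1 - alpha) * a: the state-dependent constant in front of the wage.
double wage_factor(const Calibration &c, int ia)
{
    return (1.0 - c.alpha) * productivity(c, ia);
}

// (1 - tau) * l_bar for an employed agent, with the ASSUMED balanced-budget
// tax rate tau = mu * u / (l_bar * (1 - u)).
double employed_income_factor(const Calibration &c, int ia)
{
    assert(ia == 0 || ia == 1);
    const double l_bar = 1.0 / c.l_bar_inv;
    const double tau = c.mu * c.ur[ia] / (l_bar * (1.0 - c.ur[ia]));
    return (1.0 - tau) * l_bar;
}
```

Under these assumptions, irate\_factor and wage\_factor match the values above, and cons2\_factor appears to alternate between mu (unemployed) and the employed income factor for each aggregate state.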

After all the burst reads, we initialize the global **env** structure variable using the following code.

```
void hw_init_env()
{
#pragma HLS inline
    st_env.alpha = env_alpha;     // 0.36 (Output capital share)
    st_env.beta = env_beta;       // 0.99 (Quarterly subjective discount factor)
    st_env.delta = env_delta;     // 0.025 (Quarterly depreciation rate)
    st_env.mu = env_mu;           // 0.15 (Unemployment benefits in terms of wages)
    st_env.l_bar = env_l_bar;
    st_env.delta_a = env_delta_a; // 0.01

    st_env.l_bar_inv = env_l_bar_inv; // 0.9 (Time endowment) ?

    st_env.gamma_inv = env_gamma_inv;
    st_env.gamma_neg = env_gamma_neg;
    st_env.gamma_neg_inv = env_gamma_neg_inv;

    st_env.epsilon_u = env_epsilon_u;
    st_env.epsilon_e = env_epsilon_e;

    st_env.ur[0] = env_ur_0;
    st_env.er[0] = (1 - st_env.ur[0]);

    st_env.ur[1] = env_ur_1;
    st_env.er[1] = (1 - st_env.ur[1]);
    st_env.er_inv[0] = 1 / st_env.er[0];
    st_env.er_inv[1] = 1 / st_env.er[1];

    // st_env.kss = hw_pow((1./st_env.beta-(1.-st_env.delta))/st_env.alpha, 1./(st_env.alpha-1));
    st_env.kss = env_kss;

    // transition
    st_env.P[0] = 0.525;
    st_env.P[1] = 0.35;
    st_env.P[2] = 0.03125;
    st_env.P[3] = 0.09375;
    st_env.P[4] = 0.038889;
    st_env.P[5] = 0.836111;
    st_env.P[6] = 0.002083;
    st_env.P[7] = 0.122917;
    st_env.P[8] = 0.09375;
    st_env.P[9] = 0.03125;
    st_env.P[10] = 0.291667;
    st_env.P[11] = 0.583333;
    st_env.P[12] = 0.009115;
    st_env.P[13] = 0.115885;
    st_env.P[14] = 0.024306;
    st_env.P[15] = 0.850694;

    // parmshocks
    st_env.epsilon[0] = st_env.epsilon_u;
    st_env.epsilon[1] = st_env.epsilon_e;
#if AST_UNROLL
    for (int k = 0; k < NUM_KCROSS; ++k)
    {
#pragma HLS pipeline off
        st_env.epsilon2[k][0] = 0;
        st_env.epsilon2[k][1] = 1;
    }
#else
    st_env.epsilon2[0] = 0;
    st_env.epsilon2[1] = 1;
#endif

    st_env.ag[0] = 1 - st_env.delta_a;
    st_env.ag[1] = 1 + st_env.delta_a;
    st_env.ag2[0] = 0;
    st_env.ag2[1] = 1;
    return;
}
```

#### 3.6.3.2 Fixed Point Algorithm

The following variables keep track of the total number of iterations required for the ALM coefficients to converge (hw\_main\_iter), the number of iterations of the individual household problem (**IHP**) in the current call (curr\_ihp\_iter), and, in an array, the number of **IHP** iterations at every iteration of the ALM coefficient loop (hw\_ihp\_iter). These variables (among others) are used in the validation phase to debug and compare the results with the MATLAB code.

```
int hw_main_iter = 0; // total number of ihp calls
int curr_ihp_iter = 0; // number of ihp iterations in each ihp call
int hw_ihp_iter[300] = {0}; // local mem array to store the number of ihp iterations
```

After completing all the memory initialization, the runOnfpga function launches the nested fixed point algorithm:

- hw\_sim\_alm: updates the expectations about the first moment of the capital distribution, m';
- hw\_sim\_ihp: solves the individual household problem (IHP);
- hw\_sim\_ast: performs the stochastic simulation;
- sim\_alm\_coeff: updates the estimates of the Aggregate Law of Motion coefficients.

```
while (metric_coeff > TOLL_COEFF)
{
    hw_main_iter++;
    hw_sim_alm(kmprime, coeff); // step 1

    curr_ihp_iter = 0;
    hw_sim_ihp(st_kprimes, kmprime, curr_ihp_iter); // step 2
    hw_ihp_iter[hw_main_iter] = curr_ihp_iter;      // start from 1st element of hw_ihp_iter

    real kcross_l[N_AGENTS];
    kc_t kcross_mean = 0;
ast_kcross:
    for (int is = 0; is < N_AGENTS; is++)
    {
        kcross_l[is] = st_kcross[is];
        kcross_mean += (kc_t)st_kcross[is];
    }
    hw_sim_ast(kmts, st_kprimes, kcross_l, agshock, idshock, kcross_mean); // step 3

    sim_alm_coeff(kmts, coeff, &metric_coeff, r2, agshock); // step 4

    if (metric_coeff > TOLL_COEFF * 100)
    {
        // Replace the old with new capital distribution
        for (int j = 0; j < N_AGENTS; j++)
        {
            st_kcross[j] = kcross_l[j];
        }
    }

#if PRINT_LOOP_CNT
    iter_main++;
    printf("main loop iter = %d\n", iter_main);
#endif
}
```

### 3.6.4 Aggregate Law of Motion: hw\_sim\_alm

**Description**: This function computes the next period expected aggregate physical capital.

**Acceleration:** None / Instruction Level Parallelism.

```
void hw_sim_alm(real *kmprime, real *coeff)
{
    small_idx_t cidx = 0;
    real c0, c1;
    small_idx_t kidx = 0;

alm_1:
    for (int ia = 0; ia < NSTATES_AG; ++ia)
    {
        c0 = coeff[cidx];
        c1 = coeff[cidx + 1];
        cidx += REGRESSORS;
    alm_2:
        for (int ikm = 0; ikm < NKM_GRID; ++ikm)
        {
#pragma HLS unroll factor = 1
            // add pipeline registers to split the computation into multiple stages
            real t_log = hw_log(km_grid[ikm]);
            real t_mul = c1 * t_log;
            real t_add = c0 + t_mul;
            real val = hw_exp(t_add);             // hw_exp(c0 + c1 * st_env.log_env_km[ikm])
            hw_rail_values(&val, KM_MAX, KM_MIN); // eq 15
            kmprime[kidx++] = val;
        }
    }
    return;
}
```

The function computes the next period expected aggregate physical capital. The key step in the code snippet above is the evaluation of the law of motion at each point of the aggregate capital grid: the function takes the logarithm of the grid point, applies the coefficients, exponentiates the result, and stores it in kmprime. The exponential operator consumes a large number of resources, and this function takes only a small fraction of the total compute time. Therefore, we instruct the Vitis compiler to create only 1 copy of the inner loop using the unroll pragma. Further, we increase the number of pipeline registers in this inner loop by storing the intermediate results in separate variables, thereby improving the setup and hold timing.
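
The forecasting rule can be checked in software with a few lines. The following is a minimal sketch, assuming two regressors per aggregate state (an intercept and a slope on log capital) and standard-library `std::log` and `std::exp` in place of hw\_log and hw\_exp; `rail` and `alm_forecast` are hypothetical names standing in for hw\_rail\_values and hw\_sim\_alm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for hw_rail_values: clamps val to [lo, hi].
double rail(double val, double hi, double lo)
{
    return std::min(std::max(val, lo), hi);
}

// Evaluates m' = exp(c0 + c1 * log(m)) on every grid point and for every
// aggregate state. coeff is laid out as [c0, c1] per aggregate state
// (two regressors per state are ASSUMED).
std::vector<double> alm_forecast(const std::vector<double> &coeff,
                                 const std::vector<double> &km_grid,
                                 double km_min, double km_max)
{
    assert(coeff.size() % 2 == 0);
    std::vector<double> kmprime;
    kmprime.reserve((coeff.size() / 2) * km_grid.size());
    for (std::size_t ia = 0; ia < coeff.size() / 2; ++ia)
    {
        const double c0 = coeff[2 * ia];
        const double c1 = coeff[2 * ia + 1];
        for (double km : km_grid)
        {
            // Intermediate results are kept separate, mirroring the staged
            // computation in the hardware version.
            const double t_log = std::log(km);
            const double t_mul = c1 * t_log;
            const double t_add = c0 + t_mul;
            kmprime.push_back(rail(std::exp(t_add), km_max, km_min));
        }
    }
    return kmprime;
}
```

With c0 = 0 and c1 = 1 the rule reduces to m' = m, which gives a quick sanity check of the indexing.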

### 3.6.5 Individual Household Problem: hw\_sim\_ihp

**Description**: This function solves the individual agent problem

$$\begin{aligned}
k' ={}& \left[\mu(1-\epsilon) + (1-\tau)\overline{l}\epsilon\right] w + (1-\delta+r)k \\
& - \left\{\lambda + \beta \mathbb{E}\left[\frac{1-\delta+r'}{\left((\mu(1-\epsilon') + (1-\tau')\overline{l}\epsilon')w' + (1-\delta+r')k' - k'(k')\right)^{\gamma}}\right]\right\}^{-1/\gamma}
\end{aligned} \tag{3.1}$$

at every state,  $k, \epsilon, m, A \in \mathbf{K} \times \{0, 1\}_{\epsilon} \times \mathbf{M} \times \mathbf{A}$ . More compactly,

$$\hat{k'}_{i+1} = \Phi k'_i$$

Acceleration: Array Partition, Pipeline, Unroll.

#### 3.6.5.1 Memory Management.

```
#if (NUM_KPRIMES == 8 && _WITHIN_ECONOMY)
#pragma HLS array_partition variable = st_env.P complete
#pragma HLS array_partition variable = st_kprimes complete dim = 1
#pragma HLS bind_storage variable = st_kprimes type = RAM_1WNR impl = BRAM
#else
#pragma HLS array_partition variable = st_kprimes complete dim = 1
#pragma HLS bind_storage variable = st_kprimes type = RAM_2P impl = BRAM
#endif
#if AST_UNROLL
#pragma HLS array_partition variable = st_env.epsilon2 complete dim = 1
#endif
```

As we will see later, st\_env.P is accessed once per iteration of the innermost loop. When the outer loop ihp\_2 is pipelined, the loops nested inside it are unrolled, so the four inner iterations read st\_env.P in the same cycle and the array needs at least 4 read ports. Since this structure member consists of only 16 elements, we partition it completely, although a cyclic partition with a factor of 4 would be sufficient.

```
/** Lookup tables */
// substitute for IXV call
static const small_idx_t li_2d_aux_idx_base [4] = {
0,
NKGRID,
NKM_GRID * NSTATES_ID * NKGRID,
(NKGRID + NKM_GRID * NSTATES_ID * NKGRID)};
#pragma HLS array_partition variable = li_2d_aux_idx_base complete
```

```
// Local kprime/new copies
real kprime_new[NSTATES];
real metric = 1;
#if PRINT_LOOP_CNT
unsigned int iter_cnt = 0;
#endif
```

We then initialize a lookup table used to calculate the indexes of the nested loops and partition it completely. Further, we allocate memory for kprime\_new without any memory optimization: it is accessed only once in every iteration of ihp\_2, so a single memory port is sufficient.

#### 3.6.5.2 Individual Household Problem (IHP) Loop.

This loop iterates until the individual capital-holdings policy functions,  $k'(k, \epsilon, m, A) : \mathbf{K} \times \{0, 1\}_{\epsilon} \times \mathbf{M} \times \mathbf{A} \to \mathbb{R}_+$, converge; hw\_egm\_iter = *i* is the number of iterations required. **Endogenous convergence**: this modality determines the policy functions. TOLL\_K stores the convergence tolerance  $\varepsilon_k$, while metric is initialized to 1 and updated at every iteration.

```
// Convergence loop: 4 x NSTATES interp over kprime[]
ihp_1:
    while (metric > (real)TOLL_K) // eq 14
    {
        hw_ihp_iter++;
```

Since the number of iterations of the ihp\_1 loop is data dependent, the Vitis compiler cannot estimate the loop latency, as discussed in section ??. Hence, we use **#pragma HLS LOOP\_TRIPCOUNT** to inform the compiler about the minimum, average, and maximum number of iterations. This pragma only affects the latency reports; it does not change the synthesized hardware.

```
#pragma HLS loop_tripcount min = 1 avg = 200 max = 2000
```

Initializations. Before executing the IHP Iteration Step (in the next section):

```
spread_t spread_scalar = VERY_SMALL_SCALAR;

// Reset index values for [1600] loop
pidx_t p_idx_outer = 0b0100; // 4
small_idx_t hundreds_cnt = NKGRID;
small_idx_t kp_iter_cnt = (NSTATES_ID * NKGRID);
small_idx_t kidx = 0;
```

- we initialize spread\_scalar to a small number. spread\_scalar stores the maximum absolute difference (across the state space) between the guessed policy function and the policy function implied by Equation (3.1), $\max_{(k,\epsilon,m,A)\in\mathbf{K}\times\{0,1\}_{\epsilon}\times\mathbf{M}\times\mathbf{A}} |k'_{i+1} - k'_{i}|$. This variable is updated in the next loop.
- we reset the indexes.

At each iteration, the loop sweeps over the states, and convergence is reached when

$$\rho\left(k_{i+1}',k_{i}'\right) = \max_{(k,\epsilon,m,A)\in\mathbf{K}\times\{0,1\}_{\epsilon}\times\mathbf{M}\times\mathbf{A}}|k_{i+1}'-k_{i}'| < \varepsilon_{k} = 10^{-8}$$

#### 3.6.5.3 IHP Iteration Step.

This loop over the state space  $(k, \epsilon, m, A) \in \mathbf{K} \times \{0, 1\}_{\epsilon} \times \mathbf{M} \times \mathbf{A}$ 

```
ihp_2:
    for (small_idx_t is = 0; is < NSTATES; ++is)
    {
#if SMALL_PL
#pragma HLS unroll factor = 1
#else
#pragma HLS pipeline
#endif
```

takes as given:

- tomorrow's predicted aggregate capital kmp = m', as computed in hw\_sim\_alm
- the guessed individual capital-holding policy function,  $\mathbf{kp} = k'_i(k, \epsilon, m, A)$

and uses Equation (3.1) to update the guess

$$\begin{aligned}
\hat{k'}_{i+1} &= \Phi k'_i && \text{(a) Solve (3.1)} \\
k'_{i+1} &= \eta_k \hat{k'}_{i+1} + (1 - \eta_k) k'_i && \text{(b) Update guess}
\end{aligned}$$

To do so, the IHP Iteration Step performs the following operations:

1. Index Handling (Technical).

```
        pidx_t p_idx_inner = 0; // IIDP x IAP
        real kmp, temp_base;
        emu_s_t emu_s = 0.;
        real kp = st_kprimes[0][is];

        // Index handling
        if (++kp_iter_cnt >= NSTATES_ID * NKGRID)
        {
            kp_iter_cnt = 0;
            kmp = kmprime[kidx++];
            temp_base = kmp * (real)env_l_bar_inv;
        }
        if (++hundreds_cnt >= NKGRID)
        {
            hundreds_cnt = 0;
            // (changes between 0 and 4 for every 100 iterations until is = 800,
            // and changes between 8 and 12 for every 100 iterations until is = 1600)
            // (XOR at every bit) 0100 ^ 0100 = 0000 -> 0 (explicit conversion to short) decimal value
            p_idx_outer ^= (pidx_t)0b0100;
        }
        if (is == (NKM_GRID * NSTATES_ID * NKGRID)) // 800 ia
            // (OR at every bit) 0000 | 1000 = 1000 -> 8 (explicit conversion to short) decimal value
            p_idx_outer |= (pidx_t)0b1000;
```
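
The counter logic can be replayed on the host to see which values p\_idx\_outer takes. The sketch below assumes grid sizes inferred from the code comments (NKGRID = 100, NSTATES\_ID = 2, NKM\_GRID = 4, NSTATES\_AG = 2, so NSTATES = 1600); reading bit 2 as the current idiosyncratic state and bit 3 as the current aggregate state is an interpretation rather than something the source states, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Grid sizes ASSUMED for illustration (inferred from the code comments).
constexpr int NKGRID = 100;
constexpr int NSTATES_ID = 2;
constexpr int NKM_GRID = 4;
constexpr int NSTATES_AG = 2;
constexpr int NSTATES = NKGRID * NSTATES_ID * NKM_GRID * NSTATES_AG; // 1600

// Replays the counter logic and records p_idx_outer for every flattened state.
std::vector<unsigned> replay_p_idx_outer()
{
    std::vector<unsigned> trace;
    trace.reserve(static_cast<std::size_t>(NSTATES));
    unsigned p_idx_outer = 0b0100u; // starts at 4 so that the first toggle yields 0
    int hundreds_cnt = NKGRID;      // forces a toggle at is == 0
    for (int is = 0; is < NSTATES; ++is)
    {
        if (++hundreds_cnt >= NKGRID)
        {
            hundreds_cnt = 0;
            p_idx_outer ^= 0b0100u; // flip bit 2 every NKGRID states
        }
        if (is == NKM_GRID * NSTATES_ID * NKGRID)
        {
            p_idx_outer |= 0b1000u; // set bit 3 from is = 800 onwards
        }
        trace.push_back(p_idx_outer);
    }
    return trace;
}

// Closed form of the same sequence, useful as a cross-check.
unsigned p_idx_closed_form(int is)
{
    assert(is >= 0 && is < NSTATES);
    const unsigned id_bit = static_cast<unsigned>((is / NKGRID) % NSTATES_ID);
    const unsigned ag_bit = static_cast<unsigned>(is / (NKM_GRID * NSTATES_ID * NKGRID));
    return (ag_bit << 3) | (id_bit << 2);
}
```

The counters avoid the divisions and modulo operations of the closed form, which would be more expensive to implement in hardware.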

2. Compute the conditional expectation emu\_s:

$$\mathbb{E}\left[\frac{1-\delta+r'}{\left(\left(\mu(1-\epsilon')+(1-\tau')\bar{l}\epsilon'\right)w'+(1-\delta+r')k'_{i}-k'_{i}(k'_{i})\right)^{\gamma}}\right]$$

To compute the conditional expectation the algorithm iterates over next period aggregate and idiosyncratic shocks' states:

(a) For each tomorrow's aggregate-shock state, A', compute wages, interest rate and labor-income taxes:

```
ihp_3:
        for (int iap = 0; iap < NSTATES_AG; ++iap)
        {
#if SMALL_PL
#pragma HLS unroll factor = 1
#endif
            real temp = temp_base * st_env.er_inv[iap];
            real irate = st_env.irate_factor[iap] * hw_pow(temp, env_alpha_c);
            real imrt = env_delta_c + irate;
            real wage = st_env.wage_factor[iap] * hw_pow(temp, env_alpha);
            small_idx_t kpb = iap << 2;
```

(b) For each tomorrow's aggregate shock, A', and idiosyncratic-shock,  $\epsilon'$ , state

```
ihp_4:
            for (int iidp = 0; iidp < NSTATES_ID; ++iidp)
            {
```

(c) Use a linear interpolation scheme to determine tomorrow's individual capital-holding choice  $fp = k'' = k'_i(k'_i) = k'_i\left(k'_i(k, \epsilon, m, A), \epsilon', m', A'\right)$

```
                small_idx_t i1_min = hw_findrange((fixed_t)kmp, fxd_km_grid, NKM_GRID);
                small_idx_t i1_max = i1_min + 1;
                real i1_min_val = km_grid[i1_min];
                real i1_max_val = km_grid[i1_max];
                small_idx_t i2_min = hw_findrange((fixed_t)kp, fxd_k_grid, NKGRID);
                small_idx_t i2_max = i2_min + 1;
                real i2_min_val = k_grid[i2_min];
                real i2_max_val = k_grid[i2_max];
                small_idx_t idx_base = li_2d_aux_idx_base[p_idx_inner];
                small_idx_t i1_min_base = idx_base + (NSTATES_ID * NKGRID * i1_min);
                small_idx_t i1_max_base = idx_base + (NSTATES_ID * NKGRID * i1_max);
                real tz_num = (kmp - i1_min_val);
                real tz_den = (i1_max_val - i1_min_val);
                real tz = tz_num / tz_den;
                real tw_num = (kp - i2_min_val);
                real tw_den = (i2_max_val - i2_min_val);
                real tw = tw_num / tw_den;
                real sub_tz = (1.0 - tz);
                real sub_tw = (1.0 - tw);
                real sub_tz_sub_tw = sub_tz * sub_tw;
                real tz_tw = tz * tw;
                real sub_tz_tw = sub_tz * tw;
                real tz_sub_tw = tz * sub_tw;
#if (NUM_KPRIMES == 1)
                real fp_1 = st_kprimes[0][i1_min_base + i2_min] * sub_tz_sub_tw +
                            st_kprimes[0][i1_min_base + i2_max] * sub_tz_tw;
                real fp_2 = st_kprimes[0][i1_max_base + i2_min] * tz_sub_tw +
                            st_kprimes[0][i1_max_base + i2_max] * tz_tw;
#elif (NUM_KPRIMES == 4)
                real fp_1 = st_kprimes[0][i1_min_base + i2_min] * sub_tz_sub_tw +
                            st_kprimes[1][i1_min_base + i2_max] * sub_tz_tw;
                real fp_2 = st_kprimes[2][i1_max_base + i2_min] * tz_sub_tw +
                            st_kprimes[3][i1_max_base + i2_max] * tz_tw;
#elif (NUM_KPRIMES == 8)
                real fp_1 = st_kprimes[kpb + 0][i1_min_base + i2_min] * sub_tz_sub_tw +
                            st_kprimes[kpb + 1][i1_min_base + i2_max] * sub_tz_tw;
                real fp_2 = st_kprimes[kpb + 2][i1_max_base + i2_min] * tz_sub_tw +
                            st_kprimes[kpb + 3][i1_max_base + i2_max] * tz_tw;
#endif
                real fp = fp_1 + fp_2;
```

*Note:* The algorithm implements a fixed-size, parallel search algorithm, as discussed in the paper.
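
The interpolation step can be prototyped on the host as follows. This is a minimal sketch under stated assumptions: `find_range` is a hypothetical stand-in for hw\_findrange (whose actual implementation is not shown in this section), written so that every comparison is independent of the others, and `bilinear` assumes a row-major table indexed first by the aggregate capital grid and then by the individual capital grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for hw_findrange: returns the index i of the grid cell
// [grid[i], grid[i + 1]] that contains x, clamped to the valid cells. The loop
// has a fixed trip count and independent comparisons, so hardware could
// evaluate all of them in parallel and sum the results.
std::size_t find_range(double x, const std::vector<double> &grid)
{
    assert(grid.size() >= 2);
    std::size_t count = 0;
    for (std::size_t i = 1; i + 1 < grid.size(); ++i)
    {
        count += (x >= grid[i]) ? 1 : 0; // interior nodes at or below x
    }
    return count;
}

// Bilinear interpolation of f at (kmp, kp).
// f is stored row-major: f[i1 * k_grid.size() + i2] (layout ASSUMED).
double bilinear(double kmp, double kp,
                const std::vector<double> &km_grid,
                const std::vector<double> &k_grid,
                const std::vector<double> &f)
{
    assert(f.size() == km_grid.size() * k_grid.size());
    const std::size_t i1 = find_range(kmp, km_grid);
    const std::size_t i2 = find_range(kp, k_grid);
    const double tz = (kmp - km_grid[i1]) / (km_grid[i1 + 1] - km_grid[i1]);
    const double tw = (kp - k_grid[i2]) / (k_grid[i2 + 1] - k_grid[i2]);
    const std::size_t lo = i1 * k_grid.size();
    const std::size_t hi = (i1 + 1) * k_grid.size();
    // Same four-term weighting as fp_1 and fp_2 in the hardware code.
    const double fp_1 = f[lo + i2] * (1.0 - tz) * (1.0 - tw) + f[lo + i2 + 1] * (1.0 - tz) * tw;
    const double fp_2 = f[hi + i2] * tz * (1.0 - tw) + f[hi + i2 + 1] * tz * tw;
    return fp_1 + fp_2;
}
```

Because bilinear interpolation reproduces affine functions exactly, tabulating a function such as 2m + 3k and interpolating between nodes is a convenient correctness check.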

(d) Given tomorrow's individual capital-holding choice fp and tomorrow's wealth, compute tomorrow's consumption  $cons2 = (\mu(1 - \epsilon') + (1 - \tau')\bar{l}\epsilon')w' + (1 - \delta + r')k'_i - k'_i(k'_i)$  and the marginal utility of tomorrow's consumption mu2

$$\frac{1-\delta+r'}{\left(\left(\mu(1-\epsilon')+(1-\tau')\bar{l}\epsilon'\right)w'+(1-\delta+r')k'_i-k'_i(k'_i)\right)^{\gamma}}$$

(e) Compute the expectation  $\mathbb{E}$  by accumulating over the ihp\_3 and ihp\_4 loops into emu\_s, which is initialized to 0 before the loops.

3. Compute the RHS of Equation (3.1) and store it in new\_kp =  $\hat{k'}_{i+1}$

$$\begin{aligned}
\hat{k'}_{i+1} ={}& \left[ \mu(1-\epsilon) + (1-\tau)\overline{l}\epsilon \right] w + (1-\delta+r)k \\
& - \left\{ \lambda + \beta \mathbb{E} \left[ \frac{1-\delta+r'}{\left( (\mu(1-\epsilon') + (1-\tau')\overline{l}\epsilon') w' + (1-\delta+r')k'_i - k'_i(k'_i) \right)^{\gamma}} \right] \right\}^{-1/\gamma}
\end{aligned}$$

real new\_kp = init\_wealth[is] - hw\_pow(env\_beta \* (real)emu\_s, env\_gamma\_neg\_inv); // eq 10

*Note:* Following Maliar et al. (2010), we set the multiplier  $\lambda$  to 0.

#### 3.6.5.4 Closing the IHP Loop.

1. Update the guess.

```
ihp_5:
    for (small_idx_t is = 0; is < NSTATES; ++is)
    {
#pragma HLS pipeline
        real updated_kp = UPDATE_K * kprime_new[is] + UPDATE_K_C * st_kprimes[0][is]; // eq 13
        for (small_idx_t k = 0; k < NUM_KPRIMES; ++k)
            st_kprimes[k][is] = updated_kp;
    }
```

$$k'_{i+1} = \eta_k \hat{k'}_{i+1} + (1 - \eta_k) k'_i$$

*Note:* To reduce the memory ports access bottleneck we created NUM\_KPRIMES copies of the policy function guess  $k'_i$ , which all need to be initialized with the new guess.

2. Update the metric =  $\rho(k'_{i+1}, k'_i)$.

$$\rho\left(k_{i+1}',k_{i}'\right) = \max_{(k,\epsilon,m,A)\in\mathbf{K}\times\{0,1\}_{\epsilon}\times\mathbf{M}\times\mathbf{A}}|k_{i+1}'-k_{i}'| < \varepsilon_{k} = 10^{-8}$$

The metric is then updated. Before the next iteration starts, it is compared with TOLL\_K ( $\varepsilon_k$ ): if the metric is lower than or equal to the tolerance, the loop exits.

// ~ Update metric
metric = (real) spread\_scalar;
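
The closing step of the loop, a damped update followed by a convergence check, can be illustrated on a toy problem. The sketch below is not the tutorial's solver: `toy_phi` is a hypothetical operator with a known fixed point (k = 2) that stands in for Equation (3.1), `eta` plays the role of UPDATE\_K, and the metric is taken to be the largest absolute gap between the candidate and the current guess, following the description of spread\_scalar.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Blends the candidate k_hat into the guess with weight eta and returns the
// largest absolute gap between candidate and old guess (the role ASSUMED here
// for spread_scalar).
double damped_update(std::vector<double> &guess, const std::vector<double> &k_hat, double eta)
{
    assert(guess.size() == k_hat.size());
    double spread = 0.0;
    for (std::size_t i = 0; i < guess.size(); ++i)
    {
        spread = std::max(spread, std::fabs(k_hat[i] - guess[i]));
        guess[i] = eta * k_hat[i] + (1.0 - eta) * guess[i];
    }
    return spread;
}

// Toy operator standing in for Equation (3.1): Phi(k) = 0.5 * k + 1,
// whose fixed point is k = 2. Purely illustrative.
std::vector<double> toy_phi(const std::vector<double> &k)
{
    std::vector<double> out(k.size());
    for (std::size_t i = 0; i < k.size(); ++i)
    {
        out[i] = 0.5 * k[i] + 1.0;
    }
    return out;
}

// Iterates until the metric falls to the tolerance or max_iter is reached;
// returns the number of iterations performed.
int solve_fixed_point(std::vector<double> &guess, double eta, double tol, int max_iter)
{
    double metric = 1.0; // initialized above the tolerance, as in the text
    int iter = 0;
    while (metric > tol && iter < max_iter)
    {
        metric = damped_update(guess, toy_phi(guess), eta);
        ++iter;
    }
    return iter;
}
```

A damping weight below 1 slows each step but can stabilize iterations that would otherwise oscillate; the iteration cap is a safeguard added in this sketch and is not part of the hardware loop shown above.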

### 3.6.6 Stochastic Simulation: hw\_sim\_ast

**Description**: This function simulates the time series of the cross-sectional average (per-capita) stock of capital  $\{m_t\}_{t=1}^{1100}$  which is then used by the aggregate law of motion function sim\_alm\_coeff to estimate the expected evolution of the capital distribution.

Acceleration: Array Partition, Pipeline, Unroll.

#### 3.6.6.1 Memory Management.

We first determine the number of reads for each of the arrays and partition them accordingly with array\_partition. For example, st\_kcross is a double-precision 1D array with 10,000 (N\_AGENTS) elements. As we will see in a later section of the code, every iteration of the innermost loop performs one read and one write, which requires at least 2 ports per pipeline. The baseline model uses 8 parallel pipelines, which translates into 16 ports. In the code below, where PARTITION\_KCROSS is set to 8, we partition the array cyclically with a factor of 8, which gives us 16 ports. Since we explicitly specify the memory type as RAM\_S2P, we obtain 8 read ports and 8 write ports, all of which can be accessed in the same clock cycle.

```
#if (PARTITION_KCROSS == 1)
#pragma HLS array_partition variable = st_kcross type = cyclic factor = 1
#elif (PARTITION_KCROSS == 4)
#pragma HLS array_partition variable = st_kcross type = cyclic factor = 4
#elif (PARTITION_KCROSS == 8)
#pragma HLS array_partition variable = st_kcross type = cyclic factor = 8
#endif
#pragma HLS bind_storage variable = st_kcross type = RAM_S2P impl = BRAM
```

The interpolated values are read 4 times, at data-dependent addresses, in each pipeline. In the baseline model, we have 8 parallel pipelines. Therefore, we allocate two arrays, each of which holds NUM\_KCROSS copies of the interpolated values, for a total of  $\text{NUM\_KCROSS} \times 2 = 16$  copies. When we partition them across the first dimension using dual-port RAMs, we get 32 read ports, which satisfies our requirement of 4 reads in each of the 8 pipelines.

```
#if AST_UNROLL
real kprime_interp0[NUM_KCROSS][NSTATES_ID * NKGRID];
real kprime_interp1[NUM_KCROSS][NSTATES_ID * NKGRID];
#pragma HLS array_partition variable = kprime_interp0 complete dim = 1
#pragma HLS array_partition variable = kprime_interp1 complete dim = 1
#pragma HLS array_partition variable = st_env.epsilon2 complete dim = 1
#else
real kprime_interp0[NSTATES_ID * NKGRID];
real kprime_interp1[NSTATES_ID * NKGRID];
#endif
```

As discussed in section ??, we provide an option to optimize the memory usage for storing the IDSHOCKS when the PACK\_IDS is enabled. In the below code, we set the count to start from the number of IDSHOCKS stored in each of the array elements.

```
#if PACK_IDS
small_idx_t idshock_cnt = 64;
ap_uint<72> temp_ids = idshock[0];
#else
small_idx_t idshock_cnt = 8;
#endif
```

The temporary variables are declared to keep track of the shocks.

```
int idshock_idx = 0;
idx_t agshock_idx = 0;
shock_t curr_ids;
shock_t curr_ags;
small_idx_t ags_phase = AGS_PACK_FACTOR;
```

The initial value of the moment of the capital distribution is passed into this function. For every subsequent iteration, this value is calculated at the end of the previous iteration. The value is then constrained to lie within the bounds [30, 50].

```
real curr_kmts = (real)kcross_mean * N_AGENTS_INV;
hw_rail_values(&curr_kmts, KM_MAX, KM_MIN);
```

#### 3.6.6.2 Loop.

For each time period  $t \in \{0, \ldots, 1099\}$<sup>1</sup>

```
ast_1:
    for (int t = 0; t < SIM_STEPS; ++t)
    {
```

1. **Interpolation.** For each individual j = 1, ..., 10,000, use an interpolation scheme to determine the next period individual capital holdings, given the period t idiosyncratic  $(k_{t,j}, \epsilon_{t,j})$  and aggregate  $(m_t, A_t)$  state.

Begin by initializing the values of the aggregate shock  $A_t$  and the average of individual capital holdings  $m_t$  for interpolation.

```
        kmts[t] = curr_kmts;

        // Read next packed agshock value when needed
        if (++ags_phase >= AGS_PACK_FACTOR)
        {
            curr_ags = agshock[agshock_idx++];
            ags_phase = 0;
        }
        bool p0 = (curr_ags & 0b1) ? 0b1 : 0b0;
        curr_ags >>= 1;

        real p1 = kmts[t];
        small_idx_t i2_min = hw_findrange((fixed_t)p1, fxd_km_grid, NKM_GRID);
        small_idx_t i2_max = i2_min + 1;
        real i2_min_val = km_grid[i2_min];
        real i2_max_val = km_grid[i2_max];
        real ty = (p1 - i2_min_val) / (i2_max_val - i2_min_val);
        real P = (p0 == 1) ? 0 : (1.0 - ty);
        real Q = (p0 == 1) ? 0 : (ty);
        real R = (p0 == 1) ? (1.0 - ty) : 0;
        real S = (p0 == 1) ? (ty) : 0;
        small_idx_t i1_min_base = 0;      // L4D_D3 * i1.min (0)
        small_idx_t i1_max_base = L4D_D3; // L4D_D3 * i1.max
        small_idx_t i2_min_base = L4D_D2 * i2_min;
        small_idx_t i2_max_base = L4D_D2 * i2_max;
        small_idx_t i12_min_min = i1_min_base + i2_min_base;
        small_idx_t i12_min_max = i1_min_base + i2_max_base;
        small_idx_t i12_max_min = i1_max_base + i2_min_base;
        small_idx_t i12_max_max = i1_max_base + i2_max_base;
        small_idx_t kpi_idx = 0;
```

<sup>1</sup> Notice the recasting of the time indexes from  $\{1, ..., 1100\}$  to  $\{0, ..., 1099\}$  in order to accommodate the array indexing convention in C.

Initialize values for interpolation given each idiosyncratic shock to the employment status,  $\epsilon_{t,j} \in \{0, 1\}_{\epsilon}$

```
            small_idx_t i3_min_base = 0;      // L4D_D1 * i3.min (0)
            small_idx_t i3_max_base = L4D_D1; // L4D_D1 * i3.max (1)
            real tz = st_env.epsilon[iid];
```

Initialize values for interpolation given each point in the individual capital holdings grid,  $k_{t,j} \in \mathbf{K}$ 

```
        ast_3:
            for (int ik = 0; ik < NKGRID; ++ik)
            {
#pragma HLS pipeline
                int i4_min = ik;
                real p = (1.0 - tz);
                real r = tz;
```

Use linear interpolation to determine the next period individual capital holdings fp =  $k'(k, \epsilon, m, A)$ 

```
                small_idx_t kp_idx_0 = i4_min + i3_min_base + i12_min_min;
                small_idx_t kp_idx_2 = i4_min + i3_max_base + i12_min_min;
                small_idx_t kp_idx_4 = i4_min + i3_min_base + i12_min_max;
                small_idx_t kp_idx_6 = i4_min + i3_max_base + i12_min_max;
                small_idx_t kp_idx_8 = i4_min + i3_min_base + i12_max_min;
                small_idx_t kp_idx_10 = i4_min + i3_max_base + i12_max_min;
                small_idx_t kp_idx_12 = i4_min + i3_min_base + i12_max_max;
                small_idx_t kp_idx_14 = i4_min + i3_max_base + i12_max_max;
// ** LI3D
#if ((NUM_KPRIMES == 4) || (NUM_KPRIMES == 8))
                real fp = st_kprimes[0][kp_idx_0] * P * p +
                          st_kprimes[0][kp_idx_2] * P * r +
                          st_kprimes[1][kp_idx_4] * Q * p +
                          st_kprimes[1][kp_idx_6] * Q * r +
                          st_kprimes[2][kp_idx_8] * R * p +
                          st_kprimes[2][kp_idx_10] * R * r +
                          st_kprimes[3][kp_idx_12] * S * p +
                          st_kprimes[3][kp_idx_14] * S * r;
#elif (NUM_KPRIMES == 1)
                real fp = st_kprimes[0][kp_idx_0] * P * p +
                          st_kprimes[0][kp_idx_2] * P * r +
                          st_kprimes[0][kp_idx_4] * Q * p +
                          st_kprimes[0][kp_idx_6] * Q * r +
                          st_kprimes[0][kp_idx_8] * R * p +
                          st_kprimes[0][kp_idx_10] * R * r +
                          st_kprimes[0][kp_idx_12] * S * p +
                          st_kprimes[0][kp_idx_14] * S * r;
#endif
```

Store the solution given each point in the capital holdings grid as kprime\_interp0 and kprime\_interp1

```
#if AST_UNROLL
                for (int k = 0; k < NUM_KCROSS; ++k)
                {
                    kprime_interp0[k][kpi_idx] = fp;
                    kprime_interp1[k][kpi_idx] = fp;
                }
#else
                kprime_interp0[kpi_idx] = fp;
                kprime_interp1[kpi_idx] = fp;
#endif
                ++kpi_idx;
```

Initialise the aggregate capital to 0

```
        // aggregate capital initialized to 0
        kc_t agg_capital = 0;
```

Iterate over N\_AGENTS using 8 parallel pipelines. The **#pragma HLS PIPELINE** unrolls the inner loop completely, creating 8 pipelines. When PACK\_IDS is enabled, each element of idshock contains 64 shocks; hence, a new element is fetched from the array only once every 8 iterations of ast\_4.

```
        small_idx_t kidx = 0;
        // Loop 1.3: AST agents interp over kprime_interp
        // Unroll factor dictated by inner loop over k
#if PACK_IDS
        idshock_cnt = 8;
#endif
    ast_4:
        for (int j = 0; j < (N_AGENTS / IDS_PACK_FACTOR) / IDS_AGG_X; j++)
        {
#pragma HLS pipeline
#if PACK_IDS
            if (idshock_cnt >= 8)
            {
                idshock_cnt = 0;
                temp_ids = idshock[idshock_idx];
                idshock_idx++;
            }
            curr_ids = temp_ids & 0xFF;
            idshock_cnt++;
            temp_ids >>= 8;
#else
            curr_ids = idshock[idshock_idx++];
#endif
```
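
The fetch-and-shift pattern above can be mirrored on the host to recover one flag per agent from the packed words. This is a minimal sketch, assuming `std::uint64_t` words, least-significant byte and bit first, and a bit value of 1 meaning "employed" (an interpretation based on the index selection in the interpolation code); `unpack_shocks` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands packed words into one flag per agent.
// Layout ASSUMED: least-significant byte first, least-significant bit first.
std::vector<bool> unpack_shocks(const std::vector<std::uint64_t> &words, std::size_t n_agents)
{
    std::vector<bool> employed;
    employed.reserve(n_agents);
    std::size_t word_idx = 0;
    unsigned byte_cnt = 8; // forces a fetch on the first iteration
    std::uint64_t temp = 0;
    while (employed.size() < n_agents)
    {
        if (byte_cnt >= 8)
        {
            // A new word is needed only once every 8 bytes (64 shocks).
            assert(word_idx < words.size());
            temp = words[word_idx++];
            byte_cnt = 0;
        }
        std::uint8_t curr = static_cast<std::uint8_t>(temp & 0xFF);
        temp >>= 8;
        ++byte_cnt;
        // Each byte carries the shocks of 8 agents, one bit per agent.
        for (int b = 0; b < 8 && employed.size() < n_agents; ++b)
        {
            employed.push_back((curr & 0x1) != 0);
            curr >>= 1;
        }
    }
    return employed;
}
```

In the hardware version, the 8 bits of a byte are consumed by the 8 parallel pipelines in the same iteration, whereas this sketch walks through them sequentially.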

Initialize values for interpolation over kprime\_interp0 and kprime\_interp1 from above

```
real p1b = st_kcross[kidx];
small_idx_t i2b_min = hw_findrange((fixed_t)st_kcross[kidx], fxd_k_grid, NKGRID);
small_idx_t i2b_max = i2b_min + 1;
real i2b_min_val = k_grid[i2b_min];
real i2b_max_val = k_grid[i2b_max];
bool p0b = (curr_ids & 0b1) ? 0b1 : 0b0;
curr_ids >>= 1;
small_idx_t i1b_min_base = 0; // NKGRID * i1b_min(0)
small_idx_t i1b_max_base = NKGRID; // NKGRID * i1b_max(1)
real bw = (p1b - i2b_min_val) / (i2b_max_val - i2b_min_val);
real sub_bw = (1.0 - bw);
real bz_bw = (p0b == 1) ? bw : 0;
real sub_bz_sub_bw = (p0b == 1) ? 0 : sub_bw;
real bz_sub_bw = (p0b == 1) ? sub_bw : 0;
real bw_sub_bz = (p0b == 1) ? 0 : bw;
```

Use linear interpolation to compute and store next period aggregate capital given each agent's individual savings decision.

```
real fbp_1 = (kprime_interp0[k][i1b_min_base + i2b_min] * sub_bz_sub_bw) +
             (kprime_interp0[k][i1b_min_base + i2b_max] * bw_sub_bz);
real fbp_2 = (kprime_interp1[k][i1b_max_base + i2b_min] * bz_sub_bw) +
             (kprime_interp1[k][i1b_max_base + i2b_max] * bz_bw);
kc_t fpb = kc_t(fbp_1 + fbp_2);

hw_fxd_rail_values(&fpb, KMAX, KMIN);
st_kcross[kidx] = (real)fpb;
agg_capital += fpb;
kidx++;
```
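The per-agent update above can be summarized in plain C++. This is a simplified sketch under stated assumptions, not the kernel code: it uses double precision instead of the fixed-point types, a sequential search instead of hw\_findrange, and two separate policy slices (one per idiosyncratic shock) instead of the offset-based indexing. The function name `interp_next_capital` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone version of the per-agent update: linear
// interpolation in capital on the policy slice selected by the shock bit,
// followed by clamping ("railing") the result to [k_min, k_max].
// Assumes k_grid is sorted in increasing order and has at least two nodes.
double interp_next_capital(double k, bool shock,
                           const std::vector<double>& k_grid,
                           const std::vector<double>& policy0, // slice for shock = 0
                           const std::vector<double>& policy1, // slice for shock = 1
                           double k_min, double k_max)
{
    // Locate the bracketing interval [k_grid[lo], k_grid[lo + 1]].
    std::size_t lo = 0;
    while (lo + 2 < k_grid.size() && k > k_grid[lo + 1])
        ++lo;
    const std::size_t hi = lo + 1;

    // Weight on the upper node (bw in the listing); 1 - w is sub_bw.
    const double w = (k - k_grid[lo]) / (k_grid[hi] - k_grid[lo]);
    const std::vector<double>& policy = shock ? policy1 : policy0;
    const double k_next = policy[lo] * (1.0 - w) + policy[hi] * w;

    return std::min(std::max(k_next, k_min), k_max);
}
```

The kernel avoids the branch on the shock by multiplying both slices with weights that are zero for the inactive slice, which keeps the datapath identical for every agent.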

2. Accumulation. For each time period $t$, compute $m_t$, the cross-sectional average of individual capital holdings:

$$m_t = \frac{1}{\mathcal{J}} \sum_{j=1}^{\mathcal{J}} k_{j,t}.$$

curr\_kmts = ((real) agg\_capital \* N\_AGENTS\_INV);

Values that fall outside the range of the aggregate capital grid, $\mathbf{M} = [m_{\min}, m_{\max}]$, are set to the nearest bound of the range.

```
hw_rail_values(&curr_kmts, KM_MAX, KM_MIN);
```
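As a compact illustration of the accumulation step, the sketch below computes the cross-sectional mean and clamps it to the grid bounds. It is a plain C++ approximation of the idea, assuming double precision throughout; the function name `clamped_mean_capital` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: cross-sectional mean of individual capital holdings,
// clamped to the aggregate grid bounds [m_min, m_max].
// Assumes k_cross is not empty.
double clamped_mean_capital(const std::vector<double>& k_cross,
                            double m_min, double m_max)
{
    double agg_capital = 0.0;
    for (double k : k_cross)
        agg_capital += k; // accumulate across agents

    const double m = agg_capital / static_cast<double>(k_cross.size());
    return std::min(std::max(m, m_min), m_max);
}
```

The kernel multiplies by the precomputed constant N\_AGENTS\_INV rather than dividing, since a multiplication is cheaper than a division in hardware.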

## 3.6.7 Aggregate Law of Motion: sim\_alm\_coeff

**Description**: This function estimates the *i*-iteration ALM coefficients  $\hat{b}^i(a) = (\hat{b}^i_1(a), \hat{b}^i_2(a))$  and updates them.

**Acceleration**: Array Partitioning, Pipelining.

1. **Housekeeping.** Store the old coefficients $b_l^i(a)$, $a \in \{a_b, a_g\}$, and prevent automatic array partitioning of the coeff array.

```
real coeff[NCOEFF] = {0.};
sim_alm_1:
for (small_idx_t i = 0; i < NCOEFF; i++)
{
#pragma HLS pipeline off
    coeff[i] = coeff_updated[i];
}
```

#### Initializations

```
small_idx_t agshock_idx = 0;
small_idx_t ags_phase = AGS_PACK_FACTOR;
shock_t curr_ags = 0;
shock_t curr_shock_val = 0;
real coeff_new[NCOEFF] = {0.};
real x_good_v[1000] = {0.};
real y_good_v[1000] = {0.};
real x_bad_v[1000] = {0.};
real y_bad_v[1000] = {0.};
int ibad = 0;
int igood = 0;
agshock_idx = 0;
ags_phase = AGS_PACK_FACTOR;
```

```
sim_alm_2:
for (int t = 0; t < SIM_STEPS; t++)
{
#pragma HLS pipeline off
#pragma HLS unroll factor = 1
    // Read new value when needed
    if (++ags_phase >= AGS_PACK_FACTOR)
    {
        curr_ags = agshock[agshock_idx++];
        ags_phase = 0;
    }
    curr_shock_val = curr_ags & 0b1; // take the least significant bit from the byte
    curr_ags >>= 1; // right shift by 1
    // Discard first 100
sim_alm_3:
    if (t < NDISCARD || t > SIM_STEPS - 2)
        continue;
```

2. **Organize the time series.** The best linear approximation of the conditional expectation of next-period log aggregate capital depends on the aggregate shock. So, after discarding the first 100 observations, the code splits the simulated data $\{m_t\}_{t=100}^{1100}$ into two time series to estimate the coefficients:

• when the aggregate shock is $a_t = a_b$, $\{b_1(a_b), b_2(a_b)\}$

$$E[\ln m_{t+1}|a_t = a_b] = b_1(a_b) + b_2(a_b) \ln m_t, \quad t = 100, \dots, 1100$$

```
sim_alm_4:
    if (curr_shock_val == 0)
    {
        y_bad_v[ibad] = hw_log(kmts[t + 1]);
        x_bad_v[ibad] = hw_log(kmts[t]);
        ibad++;
    }
```

it collects

 $\{\ln m_{l+1}, \ln m_l\}_{l \in \{t \in \{100, \dots, 1100\}: a_t = a_b\}}$ 

• when the aggregate shock is $a_t = a_g$, $\{b_1(a_g), b_2(a_g)\}$

$$E[\ln m_{t+1}|a_t = a_g] = b_1(a_g) + b_2(a_g) \ln m_t, \quad t = 100, \dots, 1100$$

```
    else
    {
        y_good_v[igood] = hw_log(kmts[t + 1]);
        x_good_v[igood] = hw_log(kmts[t]);
        igood++;
    }
```

it collects

 $\{\ln m_{l+1}, \ln m_l\}_{l \in \{t \in \{100, \dots, 1100\}: a_t = a_g\}}$ 
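The splitting of the simulated series by aggregate state can be sketched in a few lines of standard C++. This is an illustrative version only: it uses `std::vector` and `std::log` in place of the fixed-size arrays and hw\_log, and the names `RegimeSample` and `collect_pairs` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Regressor (x = ln m_t) and regressand (y = ln m_{t+1}) for one aggregate state.
struct RegimeSample
{
    std::vector<double> x;
    std::vector<double> y;
};

// Hypothetical helper: collect the pairs (ln m_t, ln m_{t+1}) for all periods t
// in which the aggregate shock equals `state`, skipping the first n_discard
// periods and the last period (which has no successor).
RegimeSample collect_pairs(const std::vector<double>& m,
                           const std::vector<int>& shock,
                           int state, std::size_t n_discard)
{
    RegimeSample sample;
    for (std::size_t t = n_discard; t + 1 < m.size(); ++t)
    {
        if (shock[t] == state)
        {
            sample.x.push_back(std::log(m[t]));
            sample.y.push_back(std::log(m[t + 1]));
        }
    }
    return sample;
}
```

Calling the helper once with the bad state and once with the good state yields the two samples that are passed to the regressions.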

```
real badcoeff[2] = {0.}; // initialize to prevent garbage values
real goodcoeff[2] = {0.};
regression(badcoeff, x_bad_v, y_bad_v, ibad);
regression(goodcoeff, x_good_v, y_good_v, igood);
real rbad = RSquaredCalc(badcoeff, x_bad_v, y_bad_v, ibad);
real rgood = RSquaredCalc(goodcoeff, x_good_v, y_good_v, igood);
coeff_new[0] = badcoeff[0]; // bb
coeff_new[1] = badcoeff[1];
coeff_new[2] = goodcoeff[0];
coeff_new[3] = goodcoeff[1];

R2[0] = rbad;
R2[1] = rgood;
```

**Estimate the coefficients.** For each aggregate state $a \in \{a_b, a_g\}$, the code calls the regression function to run the OLS regressions

$$\ln m_{l+1} = b_1(a_b) + b_2(a_b) \ln m_l + \epsilon_{l+1}, \qquad l \in \{t \in \{100, \dots, 1100\} : a_t = a_b\}$$

$$\ln m_{l+1} = b_1(a_g) + b_2(a_g) \ln m_l + \epsilon_{l+1}, \qquad l \in \{t \in \{100, \dots, 1100\} : a_t = a_g\}$$

and estimates the coefficients governing the transition from a bad state, badcoeff = $\{b_1(a_b), b_2(a_b)\}$, and from a good state, goodcoeff = $\{b_1(a_g), b_2(a_g)\}$.

```
// Update metric for convergence test (eq 17)
real norm = 0.;
sim_alm_5:
for (int ib = 0; ib < NCOEFF; ++ib)
{
#pragma HLS pipeline off
    norm += (coeff_new[ib] - coeff[ib]) * (coeff_new[ib] - coeff[ib]);
}
*metric = hw_sqrt(norm);
```

Compute the Euclidean norm of the change in the coefficients and check it against the tolerance $\varepsilon_b$.

$$\sqrt{\sum_{l \in \{1,2\}, a \in \{a_b, a_g\}} (b_l^{i+1}(a) - b_l^i(a))^2} < \varepsilon_b = 10^{-8}$$

```
// Update ALM coefficients vector
sim_alm_6:
for (int ib = 0; ib < NCOEFF; ++ib)
{
    coeff_updated[ib] = coeff_new[ib] * UPDATE_B + coeff[ib] * (1. - UPDATE_B);
}
```

Update the Coefficients.

$$b_l^{i+1}(a) = \eta_b \hat{b}_l^i(a) + (1 - \eta_b) b_l^i(a), \qquad l \in \{1, 2\}, \quad a \in \{a_b, a_g\}$$
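The convergence metric and the damped update can be reproduced on the CPU with the short sketch below. It follows the code listings in this section (distance between the newly estimated and the current coefficients, then a convex combination with weight $\eta_b$). The names `kNumCoeff`, `alm_distance`, and `alm_update` are hypothetical, and the assumption of four coefficients (two per aggregate state) is taken from the regressions above.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kNumCoeff = 4; // assumed: two coefficients per aggregate state

using Coeffs = std::array<double, kNumCoeff>;

// Euclidean distance between the newly estimated and the current coefficients.
double alm_distance(const Coeffs& b_hat, const Coeffs& b_old)
{
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < kNumCoeff; ++i)
    {
        const double diff = b_hat[i] - b_old[i];
        sum_sq += diff * diff;
    }
    return std::sqrt(sum_sq);
}

// Damped update: b_new = eta * b_hat + (1 - eta) * b_old.
Coeffs alm_update(const Coeffs& b_hat, const Coeffs& b_old, double eta)
{
    Coeffs b_new{};
    for (std::size_t i = 0; i < kNumCoeff; ++i)
        b_new[i] = eta * b_hat[i] + (1.0 - eta) * b_old[i];
    return b_new;
}
```

A smaller $\eta_b$ moves the coefficients more slowly toward the new estimate, which typically trades speed for stability of the outer iteration.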

#### 3.6.7.1 Regression Coefficients: regression

**Description**: This function computes the estimated coefficients. Because mathematical operators such as pow and div consume a significant amount of hardware resources, and the execution time of this function is small, we turn off automatic pipelining so that the hardware resources remain available for more time-consuming tasks. We instruct the compiler with **#pragma HLS UNROLL** to unroll the loop by a factor of 1 (that is, not to unroll it) and use **#pragma HLS LOOP\_TRIPCOUNT** to specify the number of loop iterations.

Acceleration: No acceleration.

```
void regression(real *resultmatrix, real *x, real *y, int ndim)
{
    real twobytwo[4] = {0, 0, 0, 0};
RG_1:
    for (int i = 0; i < ndim; i++)
    {
#pragma HLS loop_tripcount min = 100 avg = 494 max = 1000
#pragma HLS unroll factor = 1
#pragma HLS pipeline off
        twobytwo[0] += 1;
        twobytwo[1] += x[i];
        twobytwo[2] += x[i];
        twobytwo[3] += hw_pow(x[i], 2);
    }
    // get inverse
    real a = twobytwo[0]; // switching indices and multiplying by determinant
    real b = twobytwo[1];
    real c = twobytwo[2];
    real d = twobytwo[3];

    real det = (a * d - b * c);

    real inv_det = (1.0 / det);
    real inv_d = inv_det * d;
    real inv_b = inv_det * (b) * -1;
    real inv_c = inv_det * (c) * -1;
    real inv_a = inv_det * a;
    real acc1 = resultmatrix[0];
    real acc2 = resultmatrix[1];

    // multiply by transpose of matrix and y
RG_2:
    for (int i = 0; i < ndim; i++)
    {
#pragma HLS loop_tripcount min = 100 avg = 494 max = 1000
#pragma HLS unroll factor = 1
#pragma HLS pipeline off
        real acc_t1 = inv_b * x[i];
        real acc_t2 = inv_d + acc_t1;
        acc1 += acc_t2 * y[i];
    }
    resultmatrix[0] = acc1;
RG_3:
    for (int i = 0; i < ndim; i++)
    {
#pragma HLS loop_tripcount min = 100 avg = 494 max = 1000
#pragma HLS unroll factor = 1
#pragma HLS pipeline off
        real acc2_t1 = inv_a * x[i];
        real acc2_t2 = inv_c + acc2_t1;
        acc2 += acc2_t2 * y[i];
    }
    resultmatrix[1] = acc2;

    return;
}
```

#### 3.6.7.2 Regression *R* squared: RSquaredCalc

**Description**: This function calculates the R squared coefficient.

Acceleration: No Acceleration.

Initialize the temporary variables and compute the R squared using minimal hardware resources. Since this computation involves several complex mathematical operators, **#pragma HLS PIPELINE** is explicitly set to off and **#pragma HLS UNROLL** is set to use a factor of 1. R2\_1 computes the mean of the dependent variable (y\_mean), and R2\_2 computes the fitted values, the sum of squared residuals (rss), and the total sum of squares (tss).

```
real RSquaredCalc(real *coeff, real *x, real *y, int ndim)
{
    real r_value = 0;
    real predict[1000] = {0};
    real rss = 0;
    real tss = 0;
    real y_mean = 0;
R2_1:
    for (int i = 0; i < ndim; i++)
    {
#pragma HLS pipeline off
#pragma HLS unroll factor = 1
#pragma HLS loop_tripcount min = 100 avg = 494 max = 1000
        y_mean += y[i];
    }
    y_mean = (y_mean / ndim);

R2_2:
    for (int i = 0; i < ndim; i++)
    {
#pragma HLS pipeline off
#pragma HLS unroll factor = 1
#pragma HLS loop_tripcount min = 100 avg = 494 max = 1000
        predict[i] = (coeff[0] + (coeff[1] * x[i]));
        rss += hw_pow((predict[i] - y[i]), 2);
        tss += hw_pow((y[i] - y_mean), 2);
    }
    r_value = (1.0 - (rss / tss));

    return r_value;
}
```

## 3.6.8 Math Functions

Collection of double-precision operations (hw\_exp, hw\_log, hw\_sqrt, hw\_fabs, hw\_pow).

When the math operators are implemented on the FPGA, they use the bit-approximate HLS math library functions, which may rely on different underlying algorithms than the standard C functions and therefore do not have the same accuracy. Their accuracy is within 1-4 ULP (Unit of Least Precision). If the standard math.h is used, the C simulation results and the RTL co-simulation results can differ, because the two rely on different function definitions. If we use the Vitis HLS Math Library (hls\_math.h) instead, there is no difference between the C simulation and the RTL co-simulation. However, as hls\_math.h is not optimized to run on a CPU, using the HLS mathematical operators results in longer execution times during sw\_emu. For example, in the hw\_exp function below, hls::exp is the function from hls\_math.h. This function is also inlined.

```
real hw_exp(real b)
{
#pragma HLS inline
#if USE_HLS_LIB
    return hls::exp(b);
#else
    return exp(b);
#endif
}
```

## 3.6.9 Linear Interpolation

#### 3.6.9.1 hw\_findrange

**Description**: This function uses an optimized routine to find the interpolation range. The function comes in five versions, which differ in the size of the interpolation grids: new\_hw\_findrange\_n4, hw\_findrange\_n8, hw\_findrange\_n100, hw\_findrange\_n200, hw\_findrange\_n300.

**Acceleration:** Unrolling, Pipelining.

```
small_idx_t hw_findrange(fixed_t p, const fixed_t *src, int n_elem)
{
#if !_BASELINE
#pragma HLS inline
#if (NKM_GRID == 4)
    if (n_elem == 4)
        return hw_findrange_n4(p, src);
#elif (NKM_GRID == 8)
    if (n_elem == 8)
        return hw_findrange_n8(p, src);
#endif
#if (NKGRID == 100)
    else if (n_elem == 100)
        return hw_findrange_n100(p, src);
#elif (NKGRID == 200)
    else if (n_elem == 200)
        return hw_findrange_n200(p, src);
#elif (NKGRID == 300)
    else if (n_elem == 300)
        return hw_findrange_n300(p, src);
#endif
    else
        return 0;
#else
    small_idx_t result = 1;
    for (signed short i = (n_elem - 1); i > 0; --i)
    {
        if (p <= src[i])
        {
            result = i - 1;
        }
    }
    return result;
#endif
}
```

Based on the values of NKGRID and NKM\_GRID, only the matching functions are synthesized; the rest are disabled. A generic function that works efficiently for all grid sizes could be designed, but we leave it for future experiments.

We accelerate interpolation as follows. First, we declare the loop bounds of the individual and aggregate capital grids (namely,  $\{0, N_k\}$  and  $\{0, N_M\}$ ) as fixed constants, allowing the compiler to autonomously physically *place* the required CL resources (*space dimension*). Next, we implement a jump search algorithm to find the interpolation interval over the individual capital grid. The compiler instructs the hardware to pipeline a parallel reduce tree algorithm with three stages.

Each stage determines the index of the smallest grid value larger than the interpolation point  $k'(k, \epsilon, m, A)$  by performing comparisons in parallel. The number of comparisons varies by stage and grid size and ensures that the entire grid is examined,  $i = \{0, ..., N_k\}$ . The winner of each stage determines the search area of the successive stage. Since the result of this operation is part of a pipeline where the only dependence on subsequent loop iterations is through a final accumulation, we achieve an **II** of 1.
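The staged search can be emulated sequentially on a CPU to check its logic. The sketch below is an independent illustration of the idea for a grid of 100 nodes (5 comparisons at stride 20, then 4 at stride 5, then 5 at stride 1), not a transcription of the kernel code: it uses double instead of the fixed-point type, and the names `kGridSize` and `staged_upper_index` are hypothetical. On the FPGA the comparisons within a stage run in parallel, whereas here they run one after another.

```cpp
#include <array>

constexpr int kGridSize = 100; // assumed grid size for this sketch

// Hypothetical sequential emulation of a three-stage search on a sorted grid.
// Returns the smallest index i with p <= grid[i], or kGridSize - 1 if p is
// larger than every node.
int staged_upper_index(double p, const std::array<double, kGridSize>& grid)
{
    // Stage 1: 5 comparisons at stride 20 (nodes 99, 79, 59, 39, 19).
    int block_end = kGridSize - 1;
    for (int i = kGridSize - 1; i > 0; i -= 20)
        if (p <= grid[i])
            block_end = i; // the last assignment keeps the smallest winner

    // Stage 2: 4 comparisons at stride 5 inside the winning block of 20 nodes.
    int sub_end = block_end;
    for (int i = block_end; i > block_end - 20; i -= 5)
        if (p <= grid[i])
            sub_end = i;

    // Stage 3: 5 comparisons at stride 1 inside the winning block of 5 nodes.
    int result = sub_end;
    for (int i = sub_end; i > sub_end - 5; --i)
        if (p <= grid[i])
            result = i;

    return result;
}
```

With 14 comparisons in total, the three stages cover all 100 nodes; the lower end of the interpolation interval is the returned index minus one, floored at zero.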

Notice that the input to this function is of a fixed-point data type rather than standard double precision. Floating-point comparisons are implemented with the dcmp (double-precision comparator) operator, which consumes a significant amount of hardware resources. Therefore, we cast the input to a fixed-point data type and use a grid of values stored in fixed-point representation, so that all the comparisons are performed with icmp (integer comparator), which consumes minimal resources.
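The idea of replacing floating-point comparisons with integer comparisons can be illustrated with a scaled-integer representation. This is a sketch under assumptions: the fractional width `kFracBits` is chosen for illustration only and is not the width used by the project's fixed\_t type, and the helper names `to_fixed` and `fixed_leq` are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 16; // assumed fractional width, for illustration only

// Hypothetical conversion: scale by 2^kFracBits and round to the nearest integer.
std::int64_t to_fixed(double value)
{
    return static_cast<std::int64_t>(
        std::llround(value * static_cast<double>(1LL << kFracBits)));
}

// Compare two values through their integer (fixed-point) representations.
bool fixed_leq(double p, double node)
{
    return to_fixed(p) <= to_fixed(node);
}
```

Two values closer than $2^{-16}$ can map to the same integer in this sketch, so the fractional width has to be fine enough relative to the grid spacing for the integer comparison to preserve the ordering of grid nodes.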

Importantly for context, the CPU cannot physically place CL resources to make these comparisons in parallel, as its silicon is pre-manufactured and cannot be programmed. We could potentially implement the described parallel-search algorithm using multiple cores. But this design would be very inefficient, as the data transfer overhead costs would dominate the increase in performance. Conversely, our single FPGA vs. single CPU core and multi-core CPU benchmarking exercises are efficient, as they keep all CPU cores busy, minimizing data transfer overhead costs.<sup>2</sup>

```
small_idx_t hw_findrange_n100(fixed_t p, const fixed_t *src)
{
#pragma HLS pipeline
    small_idx_t result_1 = 0;
    small_idx_t result_2 = 0;
    small_idx_t result_3 = 0;
    small_idx_t result = 0;
fr100_1:
    for (signed short i = 99; i > 0; i = i - 20) // 5 comparators
    {
    fr100_2:
        if (p <= src[i])
        {
            result_1 = i; // send the max index
        }
    }
fr100_3:
    for (signed short i = 4; i > 0; i--) // 4 comparators
    {
    fr100_4:
        if (p <= src[result_1])
        {
            result_2 = result_1; // send the max index
        }
        result_1 = result_1 - (small_idx_t);
    }
fr100_5:
    for (signed short i = 5; i > 0; i--) // 5 comparators
    {
    fr100_6:
        if (p <= src[result_2--])
        {
            result_3 = result_2; // send the min index
        }
    }
    result = (p == src[0]) ? (small_idx_t)0 : result_3;
    return result;
}
```

<sup>2</sup> The C++ to CPU compiler can autonomously decide to perform these operations in parallel, but this step is not controlled by the coder.

#### 3.6.9.2 hw\_rail\_values

**Description**: This function sets values that fall outside the range to the nearest bound of the range.

**Acceleration:** Inline.

The **#pragma HLS INLINE** synthesizes separate hardware each time the function is called.

```
void hw_rail_values(real *val, const real max, const real min)
{
#pragma HLS inline
    real src = *val;
    bool over_max = (src > max);
    bool under_min = (src < min);

hw_rail_1:
    if (over_max)
        *val = max;
    else if (under_min)
        *val = min;
    return;
}
```

# 3.7 FPGA Configuration & Runtime Initialization

## 3.7.1 Configuration File: design.cfg

**Description.** Vitis allows the user to control the behavior of the compiler and the linker through a configuration file. More information regarding the different options can be found in the Xilinx Vitis documentation (UG1393).

```
#check if the platform is the latest version
platform=xilinx_aws-vu9p-f1_shell-v04261818_201920_3
debug=1
profile_kernel=data:all:all:all
save-temps=1

[hls]
pre_tcl=hls_config.tcl
```

In our baseline model, we use three kernels, so the three kernel names are defined under *connectivity*. We then assign each kernel to an SLR (Super Logic Region) and specify its DDR ports. The xclbin utility reports which DDR ports are attached to each SLR; assigning each kernel to the ports of its own SLR minimizes SLR crossings. If these details are omitted from the configuration file, the compiler assigns the ports automatically, which may not be optimal.

After setting the environment variables, the following commands can be executed in the terminal to obtain the DDR port information.

```
source $AWS_FPGA_REPO_DIR/vitis_setup.sh
export PLATFORM_REPO_PATHS=$(dirname $AWS_PLATFORM)
platforminfo -$AWS_FPGA_REPO_DIR
```

```
#Enable either single kernel or three kernel

# [connectivity]
# nk=runOnfpga:1:runOnfpga_1

[connectivity]
nk=runOnfpga:3:runOnfpga_1.runOnfpga_2.runOnfpga_3

# slr=<compute_unit_name>:<slr_ID>
slr=runOnfpga_1:SLR2
slr=runOnfpga_2:SLR1
slr=runOnfpga_3:SLR0

# [connectivity]
sp=runOnfpga_1.hw_agshock:DDR[1]
sp=runOnfpga_1.hw_idshock:DDR[1]
sp=runOnfpga_1.preinit:DDR[1]
sp=runOnfpga_1.results:DDR[1]
sp=runOnfpga_1.hw_iter:DDR[1]

sp=runOnfpga_2.hw_agshock:DDR[0]
sp=runOnfpga_2.hw_idshock:DDR[0]
sp=runOnfpga_2.preinit:DDR[0]
sp=runOnfpga_2.results:DDR[0]
sp=runOnfpga_2.hw_iter:DDR[0]

sp=runOnfpga_3.hw_agshock:DDR[3]
sp=runOnfpga_3.hw_idshock:DDR[3]
sp=runOnfpga_3.preinit:DDR[3]
sp=runOnfpga_3.results:DDR[3]
sp=runOnfpga_3.hw_iter:DDR[3]

[vivado]
#prop=run.impl_1.strategy=Performance_Explore
#prop=run.impl_1.strategy=Performance_NetDelay_high
#prop=run.impl_1.strategy=Performance_WLBlockPlacementFanoutOpt
#prop=run.impl_1.strategy=Performance_WLBlockPlacement
#prop=run.impl_1.strategy=Performance_ExploreWithRemap
#prop=run.impl_1.strategy=Performance_BalanceSLRs
#prop=run.impl_1.strategy=Performance_EarlyBlockPlacement
prop=run.impl_1.strategy=Performance_ExtraTimingOpt
#prop=run.impl_1.strategy=Performance_NetDelay_low
#prop=run.impl_1.strategy=Congestion_SpreadLogic_low

#param=place.runPartPlacer=0
```

## 3.7.2 Configuration File: hls\_config.tcl

**Description.** When implementing the logic, some mathematical operators consume a considerably large number of hardware resources. The user needs to be conscious of the number of pipelines to be implemented whenever they involve several mathematical operators. As discussed for the hw.cpp file, the functions sim\_alm\_coeff, regression, RSquaredCalc, and hw\_sim\_alm consume substantial resources when compiled with the default settings. Therefore, we instruct the compiler to limit the number of hardware operators using the following directives. For example, we limit the number of instances of the regression function inside sim\_alm\_coeff to 1. This implies that if regression is called 3 times, the compiler implements the logic only once and reuses it three times.

| Line | Directive |
|------|-----------|
| 1 | config_interface -m_axi_max_widen_bitwidth 512 |
| 2 | set_directive_allocation -limit 1 -type function sim_alm_coeff regression |
| 3 | set_directive_allocation -limit 1 -type function sim_alm_coeff RSquaredCalc |
| 4 | set_directive_allocation -limit 1 -type function sim_alm_coeff hw_log |
| 5 | set_directive_allocation -limit 1 -type function regression hw_pow |
| 6 | set_directive_allocation -limit 1 -type function RSquaredCalc hw_pow |
| 7 | set_directive_allocation -limit 1 -type function hw_sim_alm hw_exp |
| 8 | set_directive_allocation -limit 1 -type function hw_sim_alm hw_log |
| 9 | set_param route.enableGlobalHoldIter true |

## 3.7.3 Xilinx Runtime Library: xrt.ini

**Description.** The Xilinx runtime (XRT) uses various parameters to control execution flow, debug, profiling, and message logging during host application and kernel execution in software emulation, hardware emulation, and system run on the acceleration board. These control parameters are optionally specified in a runtime initialization file xrt.ini. This file needs to be created manually and saved to the same directory as the host executable. The runtime library checks if xrt.ini exists in the same directory as the host executable and automatically reads the file to configure the runtime.

In our program, we place this file in the parent directory. Alternatively, the file can be placed in a different location and the following command can be used to set the directory of the xrt.ini file.

```
1 export XRT_INI_PATH=/path/to/xrt.ini
```

The xrt.ini snippet below enables profiling, the timeline trace, the OpenCL summary, the device counters, and the OpenCL trace, and sets the data transfer trace to coarse.

\#Start of Debug group

[Debug]

profile=true

timeline\_trace=true

data\_transfer\_trace=coarse

opencl\_summary=true

opencl\_device\_counter=true

opencl\_trace=true

# 3.8 Run on the FPGA

Connect to your f1.2xlarge instance and execute the following commands from the terminal to set up the Xilinx environment and clone the project.

git clone https://github.com/aws/aws-fpga.git \$AWS\_FPGA\_REPO\_DIR //AWS repo

git clone https://github.com/AleP83/KS-FPGA.git -b "dev\_accel" //KS-FPGA Project

Navigate to the parent directory (KS-FPGA/baseline/codes/accel/src/fpga) within the cloned KS-FPGA folder and execute the following command to compute the results of the baseline economy for 1200 computations.

make results

Once the results are computed, execute the following command to copy all the logs, reports, and summary files into a single archive (single.zip), and download it to your local PC to analyze the results.

make zip

Note: Make sure to terminate your F1 instance! It costs \$1.65/hr.

| Bus SP Tag | Segment Index | Consumption | SP Tag | SLR  | Max Masters |
|------------|---------------|-------------|--------|------|-------------|
| DDR        |               | automatic   | bank0  | SLR1 | 15          |
| DDR        | 1             | automatic   | bank1  | SLR2 | 15          |
| DDR        | 2             |             | bank2  | SLR1 | 15          |
| DDR        | 3             |             |        | SLR0 |             |
| PLRA       |               |             |        | SLR2 |             |
| PLRA       |               |             |        | SLR1 |             |
| PLRA       |               |             |        | SLR0 | 15          |

Figure 3.6: Information from xclbinutil

# 3.9 Makefile

This file is in the parent directory (KS-FPGA/baseline/codes/accel/src/fpga) within the cloned KS-FPGA project. A Makefile is the input to make, a tool that we use to compile source code into executable programs, run scripts, and parse and combine files. It is designed to automatically update the outputs when any of the dependencies change. A simple tutorial for the Makefile can be found here.

In the code snippet below, we show the build process of the AWSXCLBIN file that can be executed on an AWS F1 instance. We start by defining the variables that we use in the later sections of the code.

```
TARGET := hw
MPICXX := mpic++
CC := g++
INCLUDES := -L/common -L/common/libs -L/cpu -I/fpga -I/ -I$(XILINX_XRT)/include/ -I$(XILINX_VIVADO)/include/
PLATFORM := xilinx_aws-vu9p-f1_shell-v04261818_201920_3
HOST_EXE := host
CPU_EXE := app
OPENMPI_EXE := openmpi_app
XO := ./fpga/build/runOnfpga.xo
XCLBIN := ./fpga/build/runOnfpga.xclbin
S3_BUCKET_NAME := ksfpga=$(shell aws sts get-caller-identity | grep "Account" | tr -dc '0-9')
S3_LOG_DIR := vitis-dcps
SHELL := /bin/bash
CPU_CORES := 1 #set the number of CPU cores
```

These three flags are defined so that the host program can determine the target application. Notice that -D lets us pass a particular flag during compilation. As we see that the below code is for fpga, the FPGA\_FLAG is being passed while building the host program.

```
OPENMPI_FLAG := -D_OPENMPI_MODE
FPGA_FLAG := -D_FPGA_MODE
SERIAL_CPU_FLAG := -D_SERIAL_CPU_MODE
```
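To illustrate how a macro defined with -D can steer the host program, the following minimal C++ sketch selects a code path with preprocessor conditionals. It only shows the mechanism and is not the project's actual host code. The function target\_name is hypothetical, and the macro is defined inside the snippet only so that it compiles on its own.

```cpp
#include <string>

// Normally supplied on the compiler command line (e.g. -D_FPGA_MODE).
// Defined here only so that this sketch compiles on its own.
#if !defined(_FPGA_MODE) && !defined(_OPENMPI_MODE) && !defined(_SERIAL_CPU_MODE)
#define _FPGA_MODE
#endif

// Hypothetical helper: reports which target was selected at compile time.
inline std::string target_name() {
#if defined(_FPGA_MODE)
    return "fpga";        // host code that talks to the FPGA would go here
#elif defined(_OPENMPI_MODE)
    return "openmpi";     // MPI-parallel CPU code path
#elif defined(_SERIAL_CPU_MODE)
    return "serial_cpu";  // single-core CPU code path
#endif
}
```

Only the branch matching the defined macro is compiled, so the same source files can produce the FPGA, OpenMPI, and serial CPU executables.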

The script below is drawn from the tutorial provided by AWS. We use the scripts provided by AWS to generate the .AWSXCLBIN file from the .XCLBIN file.

```
.PHONY: afi
afi: afigen
	source $(AWS_FPGA_REPO_DIR)/hdk_setup.sh
	pip install --user --upgrade boto3
	wait_for_afi.py --afi $(shell cat *afi_id.txt | sed -n '2p' | tr -d '", ' | sed 's/.*://') --notify --email $(EMAIL) &

.PHONY: afigen
afigen: fpga
	aws s3 mb s3://$(S3_BUCKET_NAME) --region us-east-1
	touch FILES_GO_HERE.txt
	aws s3 cp FILES_GO_HERE.txt s3://$(S3_BUCKET_NAME)/$(S3_DCP_DIR)
	touch LOGS_FILES_GO_HERE.txt
	aws s3 cp LOGS_FILES_GO_HERE.txt s3://$(S3_BUCKET_NAME)/$(S3_LOG_DIR)
	rm -rf to_aws
	$(VITIS_DIR)/tools/create_vitis_afi.sh -xclbin=$(XCLBIN) -s3_bucket=$(S3_BUCKET_NAME) -s3_dcp_key=$(S3_DCP_DIR) -s3_logs_key=$(S3_LOG_DIR)

fpga: $(XO) $(XCLBIN) $(HOST_EXE) emconfig
```

The dependencies for the following code snippet are shown in Figure 3.7.

```
# Building kernel
$(XO): ./fpga/hw.cpp
	v++ -I./common -I./fpga -I./ $(FPGA_FLAG) $(EGM_UNTIL_CONV_FLAG) $(KRNL_COMPILE_OPTS) -c -k runOnfpga -o'$@' '$<'
$(XCLBIN): $(XO)
	v++ -I./common -I./fpga -I./ $(KRNL_LINK_OPTS) -l -o'$@' $(+)

# Building fpga Host for EGM until convergence
$(HOST_EXE): ./common/libs/xcl2.cpp ./common/app.cpp ./common/init.cpp
	$(CC) $(FPGA_FLAG) $(EGM_UNTIL_CONV_FLAG) $(CXXFLAGS) $^ -o $@ $(CXXFLAGS2)
```





# 3.10 Command Guidelines

# 3.10.1 OpenCL Commands Description

This section lists the OpenCL commands used to design the communication between the host and the FPGA device(s) and the computation workflow. *Source:* OpenCL Official Manual; Xilinx Documentation - UG1393; Khronos OpenCL Documentation.

## 3.10.1.1 Gathering information about platforms

- Command: cl::Context
- **Description:** The cl::Context API is used to create a context that contains a Xilinx device that will communicate with the host machine.
- Command: cl::Platform
- **Description:** Upon initialization, the host application needs to identify a platform composed of one or more Xilinx devices.
- Command: cl::Platform::get
- **Description:** Gets a list of available platforms.

## 3.10.1.2 Programming the device

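- Command: cl::Program::Binaries
- **Description:** Container type that holds the contents of one or more device binary files read by the host; it is passed to the cl::Program constructor.
- Command: cl::Program
- **Description:** Program interface that implements cl\_program.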

## 3.10.1.3 Command Queue

- Command: cl::CommandQueue
- **Description:** The cl::CommandQueue API creates one or more command queues for each device. The FPGA can contain multiple kernels, which may be copies of the same kernel or different kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:
  - Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. XRT dispatches kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
  - Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, XRT dispatches kernels from the different command queues, improving performance by running them concurrently on the device.

The following is an example of standard API calls to create in-order and out-of-order command queues.

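```
// Out-of-order Command Queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
```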

## 3.10.1.4 Kernels

- Command: cl::Kernel
- **Description:** Identifies a kernel in the program loaded into the FPGA that can be run by the host application.

## 3.10.1.5 Buffers

- Command: cl::Buffer
- **Description:** Interactions between the host program and hardware kernels rely on creating buffers and transferring data to and from the memory in the device. cl::Buffer constructs a buffer in a specified context.
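When a buffer wraps existing host memory (the CL\_MEM\_USE\_HOST\_PTR flag), Xilinx recommends aligning the host pointer to a 4 KB page boundary so that the runtime can avoid an extra copy. The sketch below shows one way to obtain such memory for a std::vector. The names PageAlignedAllocator and is\_page\_aligned are hypothetical, the 4096-byte page size is an assumption, and posix\_memalign requires a POSIX system. It does not reproduce the allocator used by the project.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator that hands out memory aligned to 4096 bytes
// (assumed page size), suitable for use with std::vector.
template <typename T>
struct PageAlignedAllocator {
    using value_type = T;

    PageAlignedAllocator() = default;
    template <typename U>
    PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* ptr = nullptr;
        // posix_memalign returns 0 on success.
        if (posix_memalign(&ptr, 4096, n * sizeof(T)) != 0) {
            throw std::bad_alloc();
        }
        return static_cast<T*>(ptr);
    }

    void deallocate(T* p, std::size_t) { std::free(p); }
};

// All instances are interchangeable because the allocator holds no state.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
    return false;
}

// Returns true when the pointer sits on a 4096-byte boundary.
inline bool is_page_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 4096 == 0;
}
```

A host vector declared as std::vector<float, PageAlignedAllocator<float>> then exposes a page-aligned pointer through data(), which is the pointer a host program would hand to cl::Buffer.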

## 3.10.1.6 Events

- Command: cl::Event
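- **Description:** Class interface for cl\_event. An event tracks the status of an enqueued command and can be used to express dependencies between commands.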

## 3.10.1.7 Memory Transfer & Kernel Computation Management

- Command: cl::enqueueMigrateMemObjects
- **Description:** Enqueues a command to indicate which device a set of memory objects should be associated with. Using this API, memory migration can be explicitly performed ahead of the dependent commands.
- Command: cl::enqueueTask
- **Description:** When the kernel is compiled to a single hardware instance (or compute unit, CU) on the FPGA, the simplest way to execute it is cl::enqueueTask, which enqueues a command to execute the kernel on a device.

# 3.10.2 Error Management

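- cl\_int err: Variable that receives the status code reported by an OpenCL call.
- OCL\_CHECK(err, buffer\_in\_coeffs[d][k] = cl::Buffer(contexts[d], CL\_MEM\_USE\_HOST\_PTR | CL\_MEM\_READ\_ONLY, hw\_coeff\_size\_bytes, in\_coeff[d][k].data(), &err)); Macro that executes the call and then checks the value stored in err.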
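The pattern behind OCL\_CHECK can be sketched without the OpenCL headers. The code below is an illustrative stand-in, not the macro's actual definition: status\_t, STATUS\_OK, CHECK\_STATUS, and make\_buffer are hypothetical names, and the sketch throws an exception where a production macro might instead print a message and terminate the program.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-ins for cl_int and CL_SUCCESS (assumption: the real definitions come
// from the OpenCL headers).
using status_t = int;
constexpr status_t STATUS_OK = 0;

// Hypothetical macro in the spirit of OCL_CHECK: run `call`, then inspect
// the status code that the call stored in `error`.
#define CHECK_STATUS(error, call)                                      \
    call;                                                              \
    if ((error) != STATUS_OK) {                                        \
        std::ostringstream msg;                                        \
        msg << __FILE__ << ":" << __LINE__ << " error calling " #call  \
            << ", error code is: " << (error);                         \
        throw std::runtime_error(msg.str());                           \
    }

// Mock API call that reports its status through an output argument, in the
// way an OpenCL constructor reports it through its final pointer parameter.
inline int make_buffer(int size_bytes, status_t* err) {
    *err = (size_bytes > 0) ? STATUS_OK : -61;  // -61 is CL_INVALID_BUFFER_SIZE
    return size_bytes;
}
```

Wrapping each call this way keeps the error handling next to the call that can fail, so the reported file and line number point at the offending statement.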

## 3.10.2.1 Computation Flow

# 3.10.3 Pragmas Description

This section lists the pragmas used to accelerate the code.

- Command: #pragma HLS PIPELINE
- What it does: The PIPELINE pragma tells the compiler to start each iteration of the loop as soon as possible rather than waiting for the previous iteration to finish. This allows multiple loop iterations to run concurrently on the same hardware, decreasing runtime. (See the Xilinx documentation.)
- Command: #pragma HLS ARRAY\_PARTITION
- What it does: Partitions an array into smaller arrays or individual elements. This can allow the on-chip memories to perform more reads in parallel. (See the Xilinx documentation.)
- Command: #pragma HLS UNROLL
- What it does: The UNROLL pragma transforms loops by creating multiple copies of the loop body in the RTL design, which allows some or all loop iterations to occur in parallel. (See the Xilinx documentation.)
- Command: #pragma HLS BIND\_STORAGE
- What it does: The BIND\_STORAGE pragma assigns a variable (array or function argument) in the code to a specific memory type in the RTL. (See the Xilinx documentation.)
- Command: #pragma HLS LOOP\_TRIPCOUNT
- What it does: When manually applied to a loop, specifies the total number of iterations performed by the loop. This helps the tools estimate the performance of the application; it is used for analysis only and does not change the synthesized hardware. (See the Xilinx documentation.)
- Command: #pragma HLS INLINE
- What it does: Removes a function as a separate entity in the hierarchy. This reduces the overhead of the function call and can allow the function to be optimized into the caller. When you inline, a separate set of hardware is created at each place where the function is inlined. (See the Xilinx documentation.)
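As an illustration of how several of these pragmas can be combined, the sketch below computes a dot product in an HLS-friendly style. It is a generic example and not code from the KS-FPGA kernel: the function dot\_product, the length VEC\_LEN, and the choice of four partial sums are assumptions made for the example. A standard C++ compiler ignores the HLS pragmas (possibly with a warning), so the function can also be checked on the CPU.

```cpp
constexpr int VEC_LEN = 64;  // hypothetical vector length (multiple of 4)

// Dot product with four independent partial sums, so that the four
// multiply-accumulate operations in the inner loop do not depend on each other.
inline float dot_product(const float a[VEC_LEN], const float b[VEC_LEN]) {
    float partial[4] = {0.0f, 0.0f, 0.0f, 0.0f};
// Split the accumulator array into registers so all four can be updated at once.
#pragma HLS ARRAY_PARTITION variable=partial complete

    for (int i = 0; i < VEC_LEN; i += 4) {
// Reporting hint only: the loop runs VEC_LEN / 4 = 16 times.
#pragma HLS LOOP_TRIPCOUNT min=16 max=16
// Request that a new outer iteration start every clock cycle, if possible.
#pragma HLS PIPELINE II=1
        for (int j = 0; j < 4; ++j) {
// Replicate the loop body so the four products are computed in parallel.
#pragma HLS UNROLL
            partial[j] += a[i + j] * b[i + j];
        }
    }
    // Combine the partial sums with a small adder tree.
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

Whether the requested initiation interval is actually met depends on the tool, the target device, and the latency of the floating-point accumulation, so the synthesis report should be consulted before relying on it.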