

# Preparing for Extreme Heterogeneity in High Performance Computing

Jeffrey S. Vetter

With many contributions from ACSR Section and Colleagues

19th Workshop on HPC in Meteorology 21 Sep 2021 ECMWF (Virtual)

ORNL is managed by UT-Battelle, LLC for the US Department of Energy





## Congratulations on your new data center!

Wish I could be there in person to chat and for a tour :)



September 14, 2021

In January 2020, the European Centre for Medium-Range Weather Forecasts (ECMWF) – a juggernaut in the weather forecasting scene – signed a four-year, \$89-million contract with European tech firm Atos to quintuple its supercomputing capacity. With the deal approaching the two-year mark, ECMWF has announced that the datacenter that will host the center's new supercomputers has opened in Bologna. This new facility, along with new ECMWF offices in Bonn, also marks the ECMWF's expansion from its sole site in the UK into a multi-site organization with international presence.





## **Highlights**

## Recent trends in computing paint an ambiguous future for architectures

- Power constraints initially drove architectural changes
- Now, vendors are forced to use one of only 2-3 foundries if they want access to leading-edge CMOS production
  - Forces vendors to add value with domain specific architectures by specializing processors, node design, memory systems, I/O
- Explosion of new architectures
  - Devices: GPUs, FPGAs, DSPs, SoCs
  - Deployment: HPC, AI, Edge, Cloud
  - OpenHW: RISC-V
- Entering an era of Extreme Heterogeneity



## As a result, applications and software systems are all reaching a state of crisis

- Proliferation of diverse and often immature programming ecosystems
  - In fact, programming and operating systems need major investment to address current and future architectural changes
- Applications will not be functionally or performance portable across architectures
- Additionally, procurements, acceptance testing, and operations of today's new platforms depend on performance prediction and benchmarking.
- Complexity is our main challenge
- This is a crisis!

## Programming systems must provide performance portability (beyond functional portability)!!

- Ultimately, we should strive for 'Write once, perform satisfactorily anywhere'
  - Descriptive models of parallelism and data movement
  - Introspective runtime systems
  - Layered, modular, open-source approaches required
  - Performance prediction tools for design, procurement, and operations
- Examples
  - ECP investments in LLVM
    - FORTRAN with GPU offloading
  - Introspective Runtime Systems
  - Programming FPGAs
    - Without Verilog



Time for a short poll...



## History (circa 2010)

Q: Think back 10 years.

How many of you would have predicted that many of our top HPC systems would be heterogeneous (GPU-based) architectures?





## Future (circa 2030)

Q: Think forward 10 years.

How many of you predict that our top 100 HPC systems will have the following architectural features?

Assume general purpose multicore CPU

**GPU** 

FPGA/Reconfigurable processor

Neuromorphic processor

Deep learning processor

Quantum processor

RISC-V processor

Some new unknown processor

All/some of the above in one SoC



## **Implications for Applications Teams**

Q: Now, imagine you are building a new application with an expected ~3M LOC and 20 team members over the next 10 years.

What on-node programming model/system do you use to future-proof your app?

Assume C and C++

#### Fortran XX

CUDA, cu\*\*\*, HIP, OpenCL

Directives: OpenMP, OpenACC

Python, Julia, Rust, R, Matlab, etc

Metaprogramming, DSEL, etc (e.g., AMP, Kokkos, RAJA, SYCL)

Domain Specific Language (e.g., Claw, Halide, PySL) or Domain Specific Framework (e.g., PETSc, AMReX)

Legion, Charm++, HPX, OmpSs, Star-P, etc

Some new unknown programming approach

Some combination of the above



## **Motivating Trends**





[Figure: collage of news headlines on foundry consolidation and process delays, including "Foundries' Sales Show Hard Times Continuing", "GlobalFoundries Forfeit 7nm Manufacturing" (EE Times Asia), "Another Step Toward the End of Moore's Law", "TSMC's 5-Nanometer Process on Track for First Half of 2020", and "Intel-powered exascale supercomputer will come online later than expected."]

[Figure: "Number of Semiconductor Manufacturers with a Cutting Edge Logic Fab", listing manufacturers per process node from 65 nm through 5 nm.]

## Business climate reflects this uncertainty, cost, complexity, consolidation



## **Sixth Wave of Computing**







# Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

# Architectural Specialization and Integration

- Use CMOS more effectively for specific workloads
- Integrate components to boost performance and eliminate inefficiencies
- Workload specific memory+storage system design

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices





# Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

# Architectural Specialization and Integration

- Use CMOS more effectively for specific workloads
- Integrate components to boost performance and eliminate inefficiencies
- Workload specific memory+storage system design

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - DNA Storage
  - Advanced Digital
  - Emerging Memory Devices



## Quantum computing: Qubit design and fabrication have made recent progress but still face challenges

Science 354, 1091 (2016) - 2 December

#### A bit of the action

In the race to build a quantum computer, companies are pursuing many types of quantum bits, or qubits, each with its own strengths and weaknesses.











#### Superconducting loops

A resistance-free current oscillates back and forth around a circuit loop. An injected microwave signal excites the current into superposition states.

#### Trapped ions

Electrically charged atoms, or ions, have quantum energies that depend on the location of electrons. Tuned lasers cool and trap the ions, and put them in superposition states.

#### Silicon quantum dots

These "artificial atoms" are made by adding an electron to a small piece of pure silicon. Microwaves control the electron's quantum state.

#### Topological qubits

Quasiparticles can be seen in the behavior of electrons channeled through semiconductor structures. Their braided paths can encode quantum information.

#### Diamond vacancies

A nitrogen atom and a vacancy add an electron to a diamond lattice. Its quantum spin state, along with those of nearby carbon nuclei, can be controlled with light.

|                     | Superconducting loops | Trapped ions | Silicon quantum dots | Topological qubits | Diamond vacancies |
|---------------------|-----------------------|--------------|----------------------|--------------------|-------------------|
| Longevity (seconds) | 0.00005 | >1000 | 0.03 | N/A | 10 |
| Logic success rate  | 99.4% | 99.9% | | N/A | 99.2% |
| Number entangled    | | | | | |
| Company support     | Google, IBM, Quantum Circuits | ionQ | | Microsoft, Bell Labs | Quantum Diamond Technologies |
| Pros                | Fast working. Build on existing semiconductor industry. | Very stable. Highest achieved gate fidelities. | Stable. Build on existing semiconductor industry. | Greatly reduce errors. | Can operate at room temperature. |
| Cons                | Collapse easily and must be kept cold. | Slow operation. Many lasers are needed. | Only a few entangled. Must be kept cold. | Existence not yet confirmed. | Difficult to entangle. |

Note: Longevity is the record coherence time for a single qubit superposition state, logic success rate is the highest reported gate fidelity for logic operations on two qubits, and number entangled is the maximum number of qubits entangled and capable of performing two-qubit operations.

#### The National Academies of Sciences, Engineering, and Medicine, Consensus Study Report: **Quantum Computing: Progress and Prospects**

[Figure: "Milestones for Quantum Computing". Legible milestone labels include "Small (10s of qubits) gate-based QC", "Gate-based QC with 100's of qubits running QEC", ">1000 logical qubit QC, single module", and "Large, fault-tolerant modular QC"; step labels include "Scale to 50+ qubits while maintaining gate fidelity", "Improve qubit quality", "Scale number of qubits while maintaining fidelity", and "More qubits, better fidelity".]

FIGURE 7.4 An illustration of potential milestones of progress in quantum computing. The arrangement of milestones corresponds to the order in which the committee thinks they are likely to be achieved; however, it is possible that some will not be achieved, or that they will not be achieved in the order indicated.

## **DNA Storage**

#### DNA data storage might sound futuristic, but it's on the immediate horizon

By Joel Khalili June 28, 2021

A single gram of DNA is capable of storing 215 PB of data













(Image credit: Pixabay)

As the quantity of data generated worldwide continues to expand at an aggressive rate, researchers are looking for ultra-dense and ultra-durable storage technologies capable of housing it all.

For example, Microsoft is examining the possibility of using lasers to etch data into quartz glass, or storing information in hologram form inside crystals. New developments in the field of tape storage, the current leading choice for archival use cases, are also promising.

However, one new storage medium in particular appears to have all the necessary attributes: deoxyribonucleic acid, or DNA. Researchers have found a single gram of DNA is capable of storing 215 PB (220,000 TB) of data.

- One gram of DNA can store 215 PB of data
- "Early adopters of DNA data storage are likely to be applications where they have
  - Write Once, Read Never (WORN) or
  - Write Once, Read Seldom if Ever (WORSE) data."
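
As a quick sanity check on the density figure quoted above, the arithmetic can be sketched in a few lines of Python. This assumes binary prefixes (1 PB = 1024 TB, 1 EB = 1024 PB), which is an assumption on our part; it happens to reproduce the article's rounded "220,000 TB" figure.

```python
# Density quoted in the article: 215 PB per gram of DNA.
PB_PER_GRAM = 215

# Assumption: binary prefixes, so each step (TB -> PB -> EB) is a factor of 1024.
PREFIX_STEP = 1024

# Terabytes stored in one gram.
tb_per_gram = PB_PER_GRAM * PREFIX_STEP

# Grams of DNA needed to hold one exabyte (1024 PB) at that density.
grams_per_exabyte = PREFIX_STEP / PB_PER_GRAM

print(tb_per_gram, "TB per gram")
print(round(grams_per_exabyte, 2), "grams per EB")
```

Under these assumptions an exabyte-scale archive would need only a few grams of DNA, which is why the medium is attractive for write-once archival data despite slow access.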

#### **DNA DATA STORAGE ALLIANCE**

Our mission is to create and promote an interoperable storage ecosystem based on DNA as a data storage medium

The digital data to DNA pipeline: Coding → Synthesis → Storage → Retrieval → Sequencing → Decoding

## Fun Question about Future Technologies: when was the field effect transistor patented?



Lilienfeld's patent for a "method and appara electric currents" was granted on January 28

According to the patent, his invention was f flow of electric current between two termin conducting solid by establishing a third pot amplification of oscillating currents like the

Moral of this story

It may take decades for a new technology to be manufacturable, economical, and usable, if ever.

https://www.edn.com/electronics-blogs/edn-moments/4422371/Lilienfeld-patents-field-effect-transistor--October-8--1926

# Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

# Architectural Specialization and Integration

- Use CMOS (compatible) technology more effectively for specific workloads
- Integrate components to boost performance and eliminate inefficiencies
- Workload specific memory and storage system design

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices



## Various Markets are Already Specialized – HPC underway



Qualcomm 855 SoC (SM8510P) Snapdragon™

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group



- Snapdragon X50 5G (external) modem (for 5G devices)
- Qualcomm Wi-Fi 6-ready mobile platform: (802.11ax-ready, 802.11ac Wave 2, 802.11ay, 802.11ad)
- Qualcomm 60 GHz Wi-Fi mobile platform: (802.11ay, 802.11ad)
- Bluetooth Version: 5.0
- Bluetooth Speed: 2 Mbps
- High accuracy location with dual-frequency GNSS



https://excl.ornl.gov/

- Connected Qualcomm board to HP Z820 through USB
- Development Environment: Android SDK/NDK
- Login to mcmurdo machine
  - \$ ssh -Y mcmurdo
- Setup Android platform tools and development environment
  - \$ source /home/ngx/setup\_android.source
- Run Hello-world on ARM cores
  - \$ git clone https://code.ornl.gov/ngx/helloworld-android
  - \$ make compile push run
- Run OpenCL example on GPU
  - \$ git clone https://code.ornl.gov/nox/opencl-img-processing
  - Run Sobel edge detection


#### **NVIDIA Jetson AGX Xavier SoC**

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group

NVIDIA Jetson AGX Xavier:

- High-performance system on a chip for autonomous machines
- Heterogeneous SoC contains:
  - Eight-core 64-bit ARMv8.2 CPU cluster (Carmel)
  - 1.4 CUDA TFLOPS (FP32) GPU with additional inference optimizations (Volta)
  - 11.4 DL TOPS (INT8) Deep learning accelerator (NVDLA)
  - 1.7 CV TOPS (INT8) 7-slot VLIW dual-processor Vision accelerator (PVA)
  - A set of multimedia accelerators (stereo, LDC, optical flow)
- Provides researchers access to an advanced high-performance SoC environment











## **Example: Qualcomm Snapdragon Architecture and Programming**

#### **Snapdragon Programming Models**

- ARM Kryo CPU: C/C++ with OpenMP
- Adreno GPU: OpenCL, OpenGL, Vulkan, and DirectX
- Hexagon DSP: C/C++ and Assembly language
- Spectra ISP, MDSP, NPU: custom Qualcomm API

#### **Hexagon DSP SDK**





Qualcomm Snapdragon Architecture

## Future -> Open Source Hardware Enables a Rapid Design of **Specialized Chips and Effectively Mass Customization**



### **RISC-V Ecosystem**

#### Software

**Open-source software:** Gcc, binutils, glibc, Linux, BSD, LLVM, QEMU, FreeRTOS, ZephyrOS, LiteOS, SylixOS, ...

**Commercial software:** Lauterbach, Segger, Micrium, ExpressLogic, ...

#### ISA specification

#### **Hardware**

**Open-source cores:** Rocket, BOOM, RI5CY, Ariane, PicoRV32, Piccolo, SCR1, Hummingbird, ...

**Commercial cores:** Andes, Bluespec, Codasip, Cortus, Nuclei, SiFive, Sv

- \$ git clone https://github.com/darpa/idea
- \$ git clone https://github.com/darpa/posh
- \$ cd posh
- \$ make soc42



Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)

A. Olofsson, 2018

#### Open-source computing

#### A new blueprint for microprocessors challenges the industry's giants

RISC-V is an alternative to proprietary designs



Print edition | Science and technology









set architectures (ISAs), which are owned either by Intel, an American giant, or by Arm, a Japanese one. Intel's ISAs power desktop computers, servers and laptops. Arm's power phones, watches and other mobile devices. Together, these two firms dominate the market. Almost every one of the 5.1bn mobile phones on the planet, for example, relies on an Arm-designed ISA. The past year, however, has seen a boomlet in chips made using an ISA called RISC-V. If boomlet becomes boom, it may change the chip industry dramatically, to the detriment of Arm and Intel, because unlike the ISAs from those two firms, which are proprietary, RISC-V is available to anyone, anywhere, and is free.

An ISA is a standardised description of how a chip works at the most basic level, and instructions for writing software to run on it. To draw an analogy, a house might have two floors or three, five bedrooms or six, one bathroom or two. That is up to the architect. An ISA, however, is the equivalent of insisting that the same sorts of electrical sockets and water inlets and outlets be put in the same places in every appropriate room, so that an electrician or a plumber can find them instantly and carry the correct kit to connect to them.



## DOE HPC Roadmap to Exascale Systems





## Frontier: The OLCF 2021 Exascale computer





## **Frontier Node Architecture**



AMD GPU (ORNL)





## Take away message → During this Sixth Wave transition, Complexity is our major challenge!

Architecture

- How do we design future systems so that they are better than current systems on important applications?
- Simulation and modeling are more difficult
- Entirely possible that the new system will be slower than the old system!
- Expect 'disaster' procurements

Programmability

- How do we design applications with some level of performance portability?
- Software lasts much longer than transient hardware platforms
- Proper abstractions for flexibility and efficiency
- Adapt or die



- Final Report on Workshop on Extreme Heterogeneity: https://doi.org/10.2172/1473756
- Semiconductor Research Corp Decadal Plan: https://www.src.org/about/decadal-plan/



# **Software Strategies for Extreme Heterogeneity**



# Strategies for Programming Systems in this Era of Rapidly Designed, Diverse Architectures

#### Goals

- Strive for 'Write once, run anywhere'
- Descriptive models of parallelism and data movement that enable effective code generation
- Introspective runtime systems
- Layered, modular, open source approaches required
  - One organization can't do it all

#### **Examples**

- Contributing to LLVM
  - FORTRAN with GPU offloading
- Programming FPGAs
  - Without Verilog
- Memory systems are changing too
  - Language support for NVM

# The Open Source LLVM Compiler Ecosystem for Heterogeneous Computing



## The three technical areas in ECP have the necessary components to meet national goals

#### Performant mission and science applications @ scale

Foster application development

Ease of use

Diverse architectures

HPC leadership

#### Application Development (AD)

Develop and enhance the predictive capability of applications critical to the DOE

25 applications ranging from national security, to energy, earth systems, economic security, materials, and data

#### Software Technology (ST)

Produce expanded and vertically integrated software stack to achieve full potential of exascale computing

80+ unique software products spanning programming models and run times, math libraries, data and visualization

#### Hardware and Integration (HI)

Integrated delivery of ECP products on targeted systems at leading DOE computing facilities

6 vendors supported by PathForward focused on memory, node, connectivity advancements; deployment to facilities



## ECP is Improving the LLVM Compiler Ecosystem



#### LLVM

- Very popular open-source compiler infrastructure
- Permissive license
- Modular, well-defined IR allows use by a lot of different languages, ML frameworks, etc.
- Backend infrastructure allowing the efficient creation of backends for new (heterogeneous) hardware.
- A state-of-the-art C++ frontend, CUDA support, scalable LTO, sanitizers and other debugging capabilities, and more.

#### + SOLLVE

- Enhancing the implementation of OpenMP in LLVM
- Unified memory
- Prototype OMP features for LLVM
- OMP Optimizations
- OMP test suite
- Tracking OMP implementation quality
- Training

#### + PROTEAS-TUNE

- Core optimization improvements to LLVM
- OpenMP offload
- OpenACC capability for LLVM
  - Clacc
  - Flacc
- Autotuning for OpenACC and OpenMP in LLVM
  - Integration with TAU performance tools
- SYCL characterizing and benchmarking
- Leading LLVM-DOE fork
- Training

#### + FLANG

- Developing an open-source, production Fortran frontend
- Upstream to LLVM public release
- Support for OpenMP and OpenACC
- Recently approved by LLVM
- Initial implementation of serial F77 compiler for CPUs under review

#### + HPCToolkit

- Improvements to OpenMP profiling interface OMPT
- OMPT specification improvements
- Refine HPCT for OMPT improvements

#### + NNSA

- Enhancing LLVM to optimize template expansion for FleCSI, Kokkos, RAJA, etc.
- Flang testing and evaluation
- Kitsune and Tapir

#### **Vendors**

- Increasing dependence on LLVM
- Many vendors import and redistribute LLVM
- Contributions and collaborations with many vendors through LLVM
  - AMD
  - ARM
  - Cray
  - HPE
  - IBM
  - Intel
  - NVIDIA

Active involvement with broad LLVM community: LLVM Dev, EuroLLVM

ECP personnel had 10+ presentations at the 2020 Dev Meeting





## Leveraging LLVM Ecosystem to Meet a Critical ECP (community) need: FORTRAN

- Fortran support continues to be an ongoing requirement
- Flang project started with NNSA funding NVIDIA/PGI to open-source its compiler frontend into the LLVM ecosystem
- SOLLVE is improving OpenMP dialect, implementation, and core optimizations
- PROTEAS-TUNE is creating OpenACC dialect and improving MLIR
- ECP projects are contributing many changes upstream to LLVM core, MLIR, etc
- Many others are contributing: backends for processors, optimizations in toolchain, ...
  - Google contributed MLIR





## Performance of OpenMP Offload in QMCPACK

https://github.com/QMCPACK/miniqmc/wiki/OpenMP-offload#passfail-dashboard

| Compiler                   | Clang 11 |
|----------------------------|----------|
| device                     | NVIDIA   |
| math header conflict       | Pass     |
| complex arithmetic         | Pass     |
| math linker error          | Pass     |
| declare target static data | Pass     |
| static linking             | Fail     |
| Async tasking              | FC       |
| multiple stream            | Pass     |
| check_spo                  | Pass     |
| check_spo_batched          | Pass     |
| miniqmc_sync_move          | Pass     |

Most compilers meet functionality needs.



LLVM/Clang outperforms IBM XL on Summit and is widely available.

- → QMCPACK tracks performance of latest OpenMP implementations available on ECP systems
- → Rapid development of LLVM OpenMP has shown significant promise; better performance than IBM XL



## ECP LLVM Integration and Deployment

- Projects contribute directly to LLVM monorepo
- Develop an integrated DOE LLVM distribution
  - Integrating different ECP projects using LLVM
  - CI on target architectures
  - Shared vehicle for improvements in LLVM
  - Increased collaboration within ECP and DOE
- Operations
  - DOE LLVM distro will be a closely maintained fork of the LLVM monorepo
  - Multiple branches of individual ECP projects will exist as git branches
  - Branches will be integrated into DOE LLVM
- functionality to LLVM monorepo





## **Introspective Runtime Systems**



## No De Facto Standard for Heterogeneous Programming

ORNL Experimental Computing Laboratory (ExCL) systems\*

| Systems | Snapdragon | Jetson | Zynq   | DGX    | Oswald |       | Summit | Frontier |     |
|---------|------------|--------|--------|--------|--------|-------|--------|----------|-----|
| CPU     | ARM        | ARM    | ARM    | 1 1 1  | 1 1    | 1     | IBM    | AMD      |     |
| GPU     | Qualcomm   | NVIDIA |        | NVIDIA | NV     | NV    | NVIDIA | AMD      | AMD |
| FPGA    |            |        | Xilinx |        | Intel  | Intel |        |          |     |
| DSP     | Qualcomm   |        |        |        |        |       |        |          |     |





## We Need Portability in Heterogeneous Programming

Programs are not portable across different HW configurations

| Systems | Snapdragon | Jetson | Zynq   | DGX        | Oswald |       | Summit | Frontier |     |
|---------|------------|--------|--------|------------|-------|-------|--------|----------|-----|
| CPU     | ARM        | ARM    | ARM    | 1 1 1      | 1 1   | 1     | IBM    | AMD      |     |
| GPU     | Qualcomm   | NVIDIA |        | NVIDIA     | NV    | NV    | NVIDIA | AMD      | AMD |
| FPGA    |            |        | Xilinx |            | Intel | Intel |        |          |     |
| DSP     | Qualcomm   |        |        |            |       |       |        |          |     |







## IRIS Runtime System: Orchestrating Multiple Architectures and Programming Systems

The IRIS Architecture



- Compilers
  - High level application → IRIS unified host code + native kernels
- Dynamic Platform Loader
  - Automatically discover all available accelerators and their programming systems
- Task Scheduler
  - Task: memory copy + kernel launch
  - DAG-style task graph across multiple devices
  - Device selection policies
- Shared Virtual Device Memory (SVDM)
  - An illusion of a single logical device memory across all physical device memories
  - Multiple local copies on multiple device memories (relaxed consistency model)
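
The shared virtual device memory idea can be illustrated with a minimal Python sketch. This is not the IRIS implementation or API: the class `VirtualBuffer`, its `acquire` method, and the device names are hypothetical, and the invalidate-on-write scheme is only one plausible way to realize "multiple local copies with relaxed consistency".

```python
class VirtualBuffer:
    """One logical buffer backed by per-device local copies (hypothetical)."""

    def __init__(self, host_data):
        self.copies = {"host": list(host_data)}
        self.valid = {"host"}     # locations that hold the latest data
        self.transfers = 0        # number of device-to-device copies performed

    def acquire(self, device, write=False):
        """Return the local copy on `device`, copying only if it is stale."""
        if device not in self.valid:
            source = next(iter(self.valid))          # any valid copy will do
            self.copies[device] = list(self.copies[source])
            self.transfers += 1
            self.valid.add(device)
        if write:
            # A writer becomes the sole owner; other copies are refreshed lazily.
            self.valid = {device}
        return self.copies[device]


buf = VirtualBuffer([1.0, 2.0, 3.0])

gpu_view = buf.acquire("gpu0", write=True)   # host -> gpu0, gpu0 now owns data
for i in range(len(gpu_view)):               # stand-in for a kernel launch
    gpu_view[i] *= 10.0

cpu_view = buf.acquire("cpu0")               # stale on cpu0: gpu0 -> cpu0
cpu_again = buf.acquire("cpu0")              # already valid: no transfer
print(cpu_view, buf.transfers)
```

The application only ever names the logical buffer; the runtime decides when a physical copy is needed, which is what lets a task run on any device.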



# Unified Host + Multiple Native Kernels + Shared VDM → Flexible Task Scheduling & Portable Application



- A task can be freely scheduled and run on any device.
- An IRIS application is portable across different heterogeneous systems.



#### Task Scheduling in IRIS

- A task
  - A scheduling unit
  - Contains multiple in-order commands
    - Kernel launch command
    - Memory copy command (device-to-host, host-to-device)
  - May have DAG-style dependencies with other tasks
  - Enqueued to the application task queue with a device selection policy
    - Available device selection policies
      - Specific Device (compute device #)
      - Device Type (CPU, GPU, FPGA, XeonPhi)
      - Profile-based
      - Locality-aware
      - Ontology-based
      - Performance models (Aspen)
      - Any, All, Random, 3rd-party users' custom policies
- The task scheduler dispatches the tasks in the application task queue to available compute devices
  - Select the optimal target compute device according to task's device selection policy





## **SAXPY Example on Xavier**

- Computation
  - S[] = A \* X[] + Y[]
- Two tasks
  - S[] = A \* X[] on NVIDIA GPU (CUDA)
  - S[] += Y[] on ARM CPU (OpenMP)
    - S[] is shared between two tasks
    - Read-after-write (RAW), true dependency
- Low-level Python IRIS host code + CUDA/OpenMP kernels
  - saxpy.py
  - kernel.cu
  - kernel.openmp.h
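
Before looking at device code, it can help to have a plain-Python reference for the two-task decomposition, useful for checking that any device run produces the expected values. This is a sketch only; the function names mirror the kernel names used in the slides, but the array size of 8 is an assumption chosen to keep the output short.

```python
def saxpy0(s, a, x):
    """Task 0: S[] = A * X[] (the GPU task in the slides)."""
    for i in range(len(s)):
        s[i] = a * x[i]


def saxpy1(s, y):
    """Task 1: S[] += Y[] (the CPU task); reads what task 0 wrote to S."""
    for i in range(len(s)):
        s[i] += y[i]


SIZE = 8          # assumed small size for illustration
A = 10.0
x = [float(i) for i in range(SIZE)]
y = [float(i) for i in range(SIZE)]
s = [0.0] * SIZE

saxpy0(s, A, x)   # must finish first: task 1 has a RAW dependency on S
saxpy1(s, y)

expected = [A * xi + yi for xi, yi in zip(x, y)]
print(s)
```

Running the two tasks in the opposite order would give a different result, which is why the runtime must honor the read-after-write dependency on `S[]`.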





## **SAXPY: Python host code & CUDA kernel code**

#### saxpy.py (1/2)

```
#!/usr/bin/env python
import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

# Host arrays: inputs x and y, output s
x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print('X', x)
print('Y', y)

# Device memory objects sized to match the host arrays
mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)
```

#### saxpy.py (2/2)

```
# Task 0: S[] = A * X[], submitted to the GPU
kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [0]
ndr = [SIZE]

task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

# Task 1: S[] += Y[], submitted to the CPU; copies S back to the host
kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

print('S =', A, '* X + Y', s)

iris.finalize()
```

#### kernel.cu (CUDA)

```
// Task 0 kernel: S[i] = A * X[i], one thread per element
extern "C" __global__ void saxpy0(float* S, float A, float* X) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] = A * X[id];
}

// Task 1 kernel: S[i] += Y[i], one thread per element
extern "C" __global__ void saxpy1(float* S, float* Y) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] += Y[id];
}
```
```
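
The index expression used in both kernels, `blockIdx.x * blockDim.x + threadIdx.x`, can be checked with a short Python emulation of a 1-D launch. The block size of 256 is an assumption for illustration; the slides do not state the launch configuration that the runtime chooses.

```python
def global_ids(grid_dim, block_dim):
    """Enumerate id = blockIdx.x * blockDim.x + threadIdx.x for a 1-D launch."""
    return [block * block_dim + thread
            for block in range(grid_dim)
            for thread in range(block_dim)]


SIZE = 1024
BLOCK_DIM = 256                  # assumed block size, not taken from the slides
GRID_DIM = SIZE // BLOCK_DIM     # 4 blocks cover the array exactly

ids = global_ids(GRID_DIM, BLOCK_DIM)
print(len(ids), ids[0], ids[-1])
```

Each element index appears exactly once, so every thread touches a distinct element. Note that the kernels have no `id < n` guard, so this only holds when the launch size matches the array length, as it does here.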



#### **Evaluation**



[Figure: IRIS evaluation results. Legible labels include IRIS, CUDA, HIP, OpenCL, Hexagon DSP, IBM, 6GPU, and CPU+6GPU.]

## Reconfigurable Computing with FPGAs

#### **FPGA Computing**

#### Benefits

- Can provide excellent performance and power advantages
  - For specific workloads
  - For certain real-time and signal processing tasks, they are indispensable
- Can provide domain-specific acceleration
  - Mass customization
  - 17b math?



#### Challenges

- Programmability and Portability Issues
  - Best performance for FPGAs requires writing Hardware Description Languages (HDLs) such as VHDL and Verilog; too complex and low-level
    - HDL requires substantial knowledge of hardware (digital circuits).
    - Programmers must think in terms of a state machine.
    - HDL programming is a kind of digital circuit design.
  - High-Level Synthesis (HLS) to provide better FPGA programmability
    - SRC platforms, Handel-C, Impulse C-to-FPGA compiler, Xilinx Vivado (AutoPilot), FCUDA, etc.
    - None of these use a portable, open standard.
- Reconfiguration time
- Cost (for high end FPGAs)



#### Standard, Portable Programming Models for Heterogeneous Computing

- OpenCL
  - Open standard portable across diverse heterogeneous platforms (e.g., CPUs, GPUs, DSPs, Xeon Phis, FPGAs, etc.)
  - Much higher level than HDL, but still complex for typical programmers.
- Directive-based accelerator programming models
  - OpenACC, OpenMP4, etc.
  - Provide higher abstraction than OpenCL.
  - Most existing OpenACC/OpenMP4 compilers target only specific architectures; none supports FPGAs.



# Directive-based Strategy with OpenARC: Open Accelerator Research Compiler

- Open-Sourced, High-Level Intermediate Representation (HIR)-Based, Extensible Compiler Framework.
  - Perform source-to-source translation from OpenACC C to target accelerator models.
    - Support full features of OpenACC V1.0 ( + array reductions and function calls)
    - Support both CUDA and OpenCL as target accelerator models
  - Provide common runtime APIs for various backends
  - Can be used as a research framework for various studies on directive-based accelerator computing.
    - Built on top of Cetus compiler framework, equipped with various advanced analysis/transformation passes and built-in tuning tools.
    - OpenARC's IR provides an AST-like syntactic view of the source program, easy to understand, access, and transform the input program.





## FPGAs | Approach

- Design and implement an OpenACC-to-FPGA translation framework, which is the first work to use a standard and portable directive-based, high-level programming system for FPGAs.
- Propose FPGA-specific optimizations and novel pragma extensions to improve performance.
- Evaluate the functional and performance portability of the framework across diverse architectures (Altera FPGA, NVIDIA GPU, AMD GPU, and Intel Xeon Phi).



## **Baseline Translation of OpenACC-to-FPGA**

- Use OpenCL as the output model and the Altera Offline Compiler (AOC) as its backend compiler.
- Translate the input OpenACC program into host code containing HeteroIR constructs and device-specific kernel code.
  - Use the same HeteroIR runtime system as the existing OpenCL backends, except for device initialization.
  - Reuse most of the compiler passes for kernel generation.



#### **FPGA OpenCL Architecture**





## **Kernel-Pipelining Transformation Optimization**

- Kernel execution model in OpenACC
  - Device kernels can communicate with each other only through the device global memory.
  - Synchronizations between kernels are at the granularity of a kernel execution.
- Altera OpenCL channels
  - Allow kernels to pass data to each other and synchronize with high efficiency and low latency



Kernel communications through global memory in OpenACC



Kernel communications with Altera channels



## **Kernel-Pipelining Transformation Optimization (2)**

#### (a) Input OpenACC code

```
#pragma acc data copyin (a) create (b) copyout (c)
{
    #pragma acc kernels loop gang worker present (a, b)
    for(i=0; i<N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker present (b, c)
    for(i=0; i<N; i++) { c[i] = b[i]; }
}
```
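
For readers who want to compile the fragment, a self-contained sketch follows. The function name `square_two_kernels`, the explicit array sections (`a[0:n]`), and the heap-allocated temporary are assumptions added here so that the code works with pointer arguments; the slide itself shows only the fragment.

```c
#include <assert.h>
#include <stdlib.h>

/* Computes c[i] = a[i] * a[i] in two kernels, using b as a device-only
 * temporary. Returns 0 on success, -1 if the temporary cannot be allocated. */
int square_two_kernels(const float *a, float *c, int n)
{
    assert(n > 0 && a != NULL && c != NULL);

    float *b = malloc((size_t)n * sizeof *b);
    if (b == NULL) {
        return -1;
    }

    #pragma acc data copyin(a[0:n]) create(b[0:n]) copyout(c[0:n])
    {
        /* Kernel 1: the producer writes b to device global memory. */
        #pragma acc kernels loop gang worker present(a[0:n], b[0:n])
        for (int i = 0; i < n; i++) { b[i] = a[i] * a[i]; }

        /* Kernel 2: the consumer runs only after kernel 1 has finished. */
        #pragma acc kernels loop gang worker present(b[0:n], c[0:n])
        for (int i = 0; i < n; i++) { c[i] = b[i]; }
    }

    free(b);
    return 0;
}
```

Without an OpenACC compiler the pragmas are ignored and both loops run on the host, which is convenient for checking results before targeting an accelerator.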



#### (b) Altera OpenCL code with channels

```
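// Altera channels are a vendor extension and must be enabled explicitly.
#pragma OPENCL EXTENSION cl_altera_channels : enable

// FIFO that carries b from kernel1 to kernel2, bypassing global memory.
channel float pipe_b;

// Producer: computes a[i]*a[i] and writes it to the channel.
__kernel void kernel1(__global float* a) {
   int i = get_global_id(0);
   write_channel_altera(pipe_b, a[i]*a[i]);
}

// Consumer: reads each value from the channel and stores it in c.
__kernel void kernel2(__global float* c) {
   int i = get_global_id(0);
   c[i] = read_channel_altera(pipe_b);
}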
```
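
The effect of a channel can be modeled on the host in plain C. The sketch below is an illustration only, not the Altera runtime: the FIFO type, the depth of 4, and the rule that the consumer runs every other step are all assumptions chosen to show back-pressure. The point is that the intermediate array `b` of length N is replaced by a small bounded buffer.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_DEPTH 4  /* hypothetical channel depth, chosen for illustration */

typedef struct {
    float data[FIFO_DEPTH];
    int head;   /* index of the oldest element */
    int count;  /* number of elements currently stored */
} fifo_t;

static bool fifo_push(fifo_t *f, float v)
{
    if (f->count == FIFO_DEPTH) return false;      /* full: producer stalls */
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

static bool fifo_pop(fifo_t *f, float *v)
{
    if (f->count == 0) return false;               /* empty: consumer stalls */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Host-side model of the two pipelined kernels. Each loop iteration is one
 * time step in which both kernels may make progress. Returns the peak number
 * of values that were in flight between the kernels. */
int pipelined_square(const float *a, float *c, int n)
{
    assert(n >= 0);
    fifo_t pipe_b = { .head = 0, .count = 0 };
    int produced = 0, consumed = 0, peak = 0;

    for (int step = 0; consumed < n; step++) {
        /* kernel1: push a[i]*a[i] unless the channel is full. */
        if (produced < n && fifo_push(&pipe_b, a[produced] * a[produced])) {
            produced++;
        }
        if (pipe_b.count > peak) peak = pipe_b.count;

        /* kernel2: deliberately slower, it pops only on odd steps. */
        float v;
        if (step % 2 == 1 && fifo_pop(&pipe_b, &v)) {
            c[consumed++] = v;
        }
    }
    return peak;
}
```

However large `n` is, the storage between the two stages never exceeds `FIFO_DEPTH` elements, and synchronization happens per element rather than per kernel execution.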





## **Kernel-Pipelining Transformation Optimization (3)**

#### (a) Input OpenACC code

```
#pragma acc data copyin (a) create (b) copyout (c)
{
    #pragma acc kernels loop gang worker present (a, b)
    for(i=0; i<N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker present (b, c)
    for(i=0; i<N; i++) { c[i] = b[i]; }
}
```





Kernel-pipelining transformation





#### (c) Modified OpenACC code for kernel-pipelining

```
#pragma acc data copyin (a) pipe (b) copyout (c)
{
    #pragma acc kernels loop gang worker pipeout (b) present (a)
    for(i=0; i<N; i++) { b[i] = a[i]*a[i]; }
    #pragma acc kernels loop gang worker pipein (b) present (c)
    for(i=0; i<N; i++) { c[i] = b[i]; }
}
```





## **FPGA-specific Optimizations**

- Single work-item
- Collapse
- Reduction
- Sliding window
- (Branch-variant code motion)
- (Custom unrolling)



#### Overall Performance of OpenARC FPGA Evaluation



FPGAs favor applications with deep execution pipelines (e.g., FFT-1D and FFT-2D), where they perform much better than other accelerators.

For traditional HPC applications with abundant parallel floating-point operations, it appears difficult for FPGAs to beat the performance of other accelerators, even though FPGAs can be much more power-efficient.

The tested FPGA does not contain dedicated, embedded floating-point cores, whereas the other accelerators have fully optimized floating-point units.

Current and upcoming high-end FPGAs are equipped with hardened floating-point operators, whose performance will be comparable to other accelerators, while remaining power-efficient.



#### Recent results from Arria 10 and Stratix 10



Figure 1: Runtime performance (in seconds) of Sobel with different FPGA-specific optimizations applied. Black bars indicate the multi-threaded approach, and orange bars indicate the single work-item approach (smaller is better).





Figure 3: Runtime performance (in seconds) of SRAD with different FPGA-specific optimizations applied. Blue bars indicate the multi-threaded approach, and orange bars indicate the single work-item approach (smaller is better).

#### HotSpot Performance



Figure 2: Runtime performance (in seconds) of HotSpot with different FPGA-specific optimizations applied. Blue bars indicate the multi-threaded approach, and orange bars indicate the single work-item approach (smaller is better).

- nd: multi-threaded kernel
- numcX: number of compute units (X: replication factor)
- simdX: vectorization (X: replication factor)
- elim: kernel boundary elimination optimization
- coll: collapse optimization
- swi: single work-item kernel
- redX: reduction optimization (X: unroll factor)
- swX: sliding window optimization (X: unroll factor)
- hoist: code motion optimization
- flat: 2D arrays manually flattened to 1D array.
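
The `coll` and `flat` labels both concern turning a nested 2D traversal into a single flat one. A minimal sketch in plain C follows; the function names and the scaling operation are hypothetical, and the exact code OpenARC emits for its collapse optimization is not shown in the source.

```c
#include <assert.h>

/* Nested form: two loop levels over a 2D grid stored row-major in 1D. */
void scale_nested(float *grid, int rows, int cols, float s)
{
    assert(rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            grid[i * cols + j] *= s;
        }
    }
}

/* Collapsed form: one loop of rows * cols iterations. The row and column
 * indices are carried as counters instead of being recomputed with / and %,
 * since integer division tends to be costly in FPGA logic (an assumption
 * of this sketch, not a measurement from the talk). */
void scale_collapsed(float *grid, int rows, int cols, float s)
{
    assert(rows >= 0 && cols >= 0);
    int i = 0, j = 0;
    for (int idx = 0; idx < rows * cols; idx++) {
        grid[i * cols + j] *= s;
        if (++j == cols) { j = 0; i++; }
    }
}
```

A single loop gives the compiler one pipeline to schedule rather than an inner pipeline that is drained and refilled for every row.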



## Recent trends in computing paint an ambiguous future for architectures

- Power constraints initially drove architectural changes
- Now, vendors are forced to use 2-3 foundries if they want access to leading-edge CMOS production
  - Forces vendors to add value with domain specific architectures by specializing processors, node design, memory systems, I/O
- Explosion of new architectures
  - Devices: GPUs, FPGAs, DSPs, SoCs
  - Deployment: HPC, AI, Edge, Cloud
  - OpenHW: RISC-V
- Entering an era of Extreme Heterogeneity



## As a result, applications and software systems are all reaching a state of crisis

- Proliferation of diverse and often immature programming ecosystems
  - In fact, programming and operating systems need major investment to address current and future architectural changes
- Applications will not be functionally or performance portable across architectures
- Additionally, procurements, acceptance testing, and operations of today's new platforms depend on performance prediction and benchmarking.
- Complexity is our main challenge
- This is a crisis!

## Programming systems must provide performance portability (beyond functional portability)!!

Recap

- Ultimately, we should strive for 'Write once, perform satisfactorily anywhere'
- Descriptive models of parallelism and data movement
- Introspective runtime systems
- Layered, modular, open-source approaches required
- Performance prediction tools for design, procurement, and operations
- Examples
  - ECP investments in LLVM
  - FORTRAN with GPU offloading
  - Introspective runtime systems
  - Programming FPGAs without Verilog

- Experimental Computing Lab
  - Lots of emerging archs
  - https://excl.ornl.gov
- Visit us (post-COVID)
  - We host interns and other visitors year round
    - Faculty, grad, undergrad, high school, industry
- Jobs at ORNL
  - Visit https://jobs.ornl.gov
- Contact me: vetter@ornl.gov







## Final Report on Workshop on Extreme Heterogeneity

- Maintaining and improving programmer productivity
  - Flexible, expressive, programming models and languages
  - Intelligent, domain-aware compilers and tools
  - Composition of disparate software components
- Managing resources intelligently
  - Automated methods using introspection and machine learning
  - Optimize for performance, energy efficiency, and availability
- Modeling & predicting performance
  - Evaluate impact of potential system designs and application mappings
  - Model-automated optimization of applications
- Enabling reproducible science despite non-determinism & asynchrony
  - Methods for validation on non-deterministic architectures
  - Detection and mitigation of pervasive faults and errors
- Facilitating Data Management, Analytics, and Workflows
  - Mapping of science workflows to heterogeneous hardware and software services
  - Adapting workflows and services to meet facility-level objectives through learning approaches



























