Quick Notes - Hardware-aware programming on macOS

Short notes and pointers on programming for speed and efficiency, leveraging hardware-specific features available on Apple Silicon (ARM). macOS-focused in most places, although the ideas are easy to generalise.

Purpose

This document is a quick overview and reference for developers on the Apple platform who want to know what facilities are available for making optimal use of their hardware, at a time when AI-focused workloads have made parallel processing a fundamental need. The ideas are general: developers should be able to map them to other platforms, and know what to look for when faced with unfamiliar hardware.

Key Terms and Techniques

SIMD - Single Instruction, Multiple Data

SIMD allows a program to process more data in fewer CPU cycles, by making full use of the available registers (typically under-utilised) and the bandwidth between the CPU and memory (again, typically under-utilised). Recall that moving data around counts as I/O, which is generally orders of magnitude slower than compute.
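As a taste of the idea without any platform-specific intrinsics, GCC and Clang offer vector extensions: declaring a 128-bit vector type lets a single C-level operator express four 32-bit additions, which the compiler lowers to one SIMD instruction where the hardware supports it. A minimal sketch (the type name v4si and the helper add4 are just illustrative choices):

```c
#include <stdint.h>

/* GCC/Clang vector extension: v4si packs four 32-bit ints into one
 * 128-bit value, the same width as a NEON Q register. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* One C-level "+" on vector types performs four lane-wise additions;
 * on Apple Silicon this compiles down to a single NEON add. */
v4si add4(v4si a, v4si b) {
    return a + b;
}
```

Lanes are read back with ordinary indexing: add4((v4si){1, 2, 3, 4}, (v4si){10, 20, 30, 40}) has lanes 11, 22, 33 and 44.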

It’s useful here to quote Wikipedia:

  • ARM NEON

    Single Instruction Multiple Data (SIMD) architecture extension for the A-profile and R-profile processors. Neon extends the basic SIMD instruction set to cover wider-width values (like 64- and 128-bit values).

    Note: NEON is a trademark of Arm, not Apple. Apple Silicon implements the ARM architecture under licence, so knowledge of NEON carries over to the wider ARM ecosystem, not just Apple's corner of it.

Metal

Metal is Apple's low-level graphics and compute API; besides rendering, it is used for general-purpose GPU programming (GPGPU). It is Apple-specific.

Accelerate

Accelerate is not related to the GPU. It provides a unified computing framework across devices in the Apple ecosystem, using the hardware as efficiently as possible, including hardware-specific SIMD instructions. It removes the need for the programmer to write different code for different underlying hardware.

Loop Unrolling technique

Do more work in each loop iteration: every iteration carries overhead (increment, compare, branch), and a single iteration may not fully utilise the execution resources available to it.
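A minimal sketch of the technique (function names are my own): the unrolled version pays the loop overhead once per four elements, and its four independent accumulators give the CPU's multiple execution units more instruction-level parallelism to exploit.

```c
#include <stddef.h>

/* Baseline: one addition per iteration, plus the per-iteration
 * overhead of increment, compare and branch. */
long sum_simple(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: loop overhead is paid once per four elements, and
 * the four independent accumulators can be advanced in parallel. */
long sum_unrolled(const int *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)  // leftover tail when n % 4 != 0
        s += a[i];
    return s;
}
```

Modern compilers often unroll (and vectorise) simple loops like this on their own at -O2/-O3, so it is worth measuring before unrolling by hand.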

Data prefetching technique

Loading data into the cache before it is needed, so that wait times are reduced when the data is actually operated on.
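With GCC and Clang this can be expressed through the __builtin_prefetch hint. A sketch (the distance of 16 elements is an illustrative guess; the right value depends on the cache-line size and the cost of the per-element work):

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* illustrative; tune for the workload */

long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint: start fetching a[i + 16] now, for reading (second
         * argument 0) with high temporal locality (third argument 3).
         * It is only a hint - it never changes the result. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

Hardware prefetchers already handle simple sequential scans like this one well; explicit prefetching tends to pay off for irregular access patterns such as linked structures.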

ARM Neon

On AArch64 (the only mode Apple Silicon runs), the NEON register bank consists of thirty-two 128-bit registers, V0-V31. The NEON unit can view the same register bank as

  • thirty-two 128-bit quadword registers, Q0-Q31
  • thirty-two 64-bit doubleword registers, D0-D31 (each the lower half of the corresponding V register)

The Q and D prefixes indicate which view is in use. (The sixteen-register Q0-Q15 numbering sometimes quoted applies to 32-bit ARM, where the bank is thirty-two 64-bit registers.)

An example program to add two vectors

#include <arm_neon.h>

void add_vectors(const int32_t* a, const int32_t* b, int32_t* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);   // load 4 lanes from a
        int32x4_t vb = vld1q_s32(b + i);   // load 4 lanes from b
        int32x4_t vc = vaddq_s32(va, vb);  // 4 additions in one instruction
        vst1q_s32(c + i, vc);              // store 4 lanes into c
    }
    for (; i < n; i++)  // scalar tail when n is not a multiple of 4
        c[i] = a[i] + b[i];
}

The function name vld1q_s32 is a mnemonic:

  • v - vector operation
  • ld1 - load 1 vector
  • q - quadword operation; uses a 128-bit register
  • s32 - 32-bit signed integer values

Other function names (like vst1q_s32, to store) can be broken down similarly. The type int32x4_t denotes a vector of four 32-bit integers.

While ld1 denotes loading of 1 vector, ld2, ld3 and ld4 load 2, 3 and 4 vectors respectively, and do so in an interleaved manner. So if you want to load vectors sequentially, call vld1q_s32 (or the variant appropriate for your type) multiple times. Interleaved data is common in practice (coordinate pairs, audio channels, RGB pixels), and the interleaving load routines exist to support it.
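A portable scalar sketch of what the interleaving variant does (the function name below is mine; the real NEON call is vld2q_s32, which performs the same de-interleaving in a single instruction):

```c
#include <stdint.h>

/* What vld2q_s32 does, written out in scalar C: read 8 consecutive
 * int32 values and de-interleave them into two 4-lane vectors, so
 * {x0, y0, x1, y1, x2, y2, x3, y3} becomes xs = {x0..x3} and
 * ys = {y0..y3}. Useful for interleaved data such as (x, y)
 * coordinate pairs. */
void load2_deinterleaved(const int32_t *src, int32_t xs[4], int32_t ys[4]) {
    for (int lane = 0; lane < 4; lane++) {
        xs[lane] = src[2 * lane];      // even elements -> first vector
        ys[lane] = src[2 * lane + 1];  // odd elements  -> second vector
    }
}
```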

Metal

Knowledge of the fundamentals of GPU programming helps substantially when working with the Metal APIs. Here are the key things to keep in mind when using Metal, vis-a-vis standard systems programming in a language like C.

Terms/Concepts

Device (MTLDevice)
Represents the GPU; a device reference is the starting point for everything else in Metal
Command Queue (MTLCommandQueue)
Orders command buffers and schedules them for execution on the device
Command Buffer (MTLCommandBuffer)
A container you manage, in which the commands to be executed on the GPU are encoded and stored

Command Encoders
  • Render Command Encoder (MTLRenderCommandEncoder): Encodes rendering commands
  • Compute Command Encoder (MTLComputeCommandEncoder): Encodes compute commands for GPGPU programming
  • Blit Command Encoder (MTLBlitCommandEncoder): Encodes commands for copying and managing resources

Pipeline State
  • Render Pipeline State (MTLRenderPipelineState): Defines configurations for rendering operations
  • Compute Pipeline State (MTLComputePipelineState): Defines configurations for compute operations

Shaders
Programs written in the Metal Shading Language (MSL)
  • Vertex Shader: Processes vertices
  • Fragment Shader: Processes each fragment (pixel)
  • Compute Shader: General purpose computation
Buffers (MTLBuffer)
Memory storage for data accessible by the GPU
Textures (MTLTexture)
Image data used in rendering or computing
Libraries (MTLLibrary)
Collections of shader functions that can be linked into pipeline states

Typically, the following steps are executed in a full-fledged computation on the GPU:

  1. Initialization - create a device reference in code
  2. Command queue creation
  3. Buffer creation
  4. Loading and compiling shaders
  5. Creating pipeline states - can be for render or compute
  6. Encoding commands - again, for render or compute
  7. Submitting command buffers

Accelerate

Accelerate abstracts programming at a higher level for various application areas, such as scientific computing, machine learning and image processing. It offers a convenient way to implement computationally intensive logic without writing code conditioned on hardware differences, and it aims for high performance at low energy consumption by using the hardware's instructions well.

The key components and uses of the Accelerate framework are

  • Vector Digital Signal Processing (vDSP)
    • Provides functions for DSP tasks such as convolution, correlation and Fourier transforms
    • Optimised for both scalar and vector operations
  • Image Processing (vImage)
    • Optimised image-processing routines, including geometric transformations
  • Linear Algebra Library
    • BLAS (Basic Linear Algebra Subprograms)
    • LAPACK (Linear Algebra Package)
  • Basic Neural Network Subroutines (BNNS)
  • Quadrature - Efficient routines for approximating the definite integrals of one-dimensional functions.

Here’s a sample piece of self-contained code to show how straightforward it is to use Accelerate.

#include <stdio.h>
#include <math.h>
#include <Accelerate/Accelerate.h>

int main() {
    const int n = 1000;  // number of intervals
    double a = 0.0;
    double b = M_PI;
    double h = (b - a) / n;

    // Create arrays for x values and sin(x) values
    double x[n+1], y[n+1];

    // Generate x values
    vDSP_vrampD(&a, &h, x, 1, n+1);

    // Compute sin(x) for each x. vvsin takes the element count by
    // pointer, and we need all n+1 values here, not n.
    const int count = n + 1;
    vvsin(y, x, &count);

    // Compute the sum of y values (excluding first and last)
    double sum;
    vDSP_sveD(y+1, 1, &sum, n-1);

    // Add half of first and last y values
    sum += 0.5 * (y[0] + y[n]);

    // Multiply by h to get the final result
    double result = h * sum;

    printf("Approximate integral of sin(x) from 0 to pi using Accelerate: %f\n", result);
    printf("Exact result should be 2.0\n");

    return 0;
}

Compile the above as follows

clang -framework Accelerate filename.c # assuming code saved in filename.c

Acknowledgements

ChatGPT was used extensively over extended periods, alongside reading the documentation and other searches. ChatGPT (and, for some purposes, Claude when ChatGPT struggled) was extremely helpful when used in a conversational style, with multiple rounds of diving deeper and clarifying.