API Overview#

AOCL-DLP provides a comprehensive set of APIs for high-performance deep learning primitives optimized for AMD processors. This overview introduces the key concepts, design principles, and usage patterns that apply across all APIs.

Core Concepts#

Matrix Operations#

AOCL-DLP is built around optimized matrix operations, primarily General Matrix Multiplication (GEMM):

\[C = \text{post\_ops}(\alpha \cdot \mathrm{op}(A) \cdot \mathrm{op}(B) + \beta \cdot C)\]

Where:

  • \(\mathrm{op}(X)\) can be \(X\) (no transpose) or \(X^T\) (transpose)

  • \(\alpha, \beta\) are scalar multipliers

  • \(\text{post\_ops}\) represents fused post-processing operations
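A naive reference implementation makes the formula concrete. This sketch is illustrative only: it omits \(\mathrm{op}()\) transposes and post-ops, and bears no resemblance to the library's optimized kernels; `gemm_reference` is a hypothetical name, not an AOCL-DLP API.

```c
#include <stddef.h>

/* Reference for C = alpha * A * B + beta * C, with row-major
 * A (m x k), B (k x n), and C (m x n). Transposes and post-ops
 * are omitted for clarity. */
static void gemm_reference(size_t m, size_t n, size_t k,
                           float alpha, const float *A,
                           const float *B, float beta, float *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```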

Data Type Support#

AOCL-DLP supports multiple precision formats to balance accuracy and performance:

Supported Data Types#

| Type     | Size (bits) | Range       | Use Case                                   |
|----------|-------------|-------------|--------------------------------------------|
| float32  | 32          | ±3.4×10³⁸   | High-precision training and inference      |
| bfloat16 | 16          | ±3.4×10³⁸   | Memory-efficient inference with good range |
| int8     | 8           | -128 to 127 | Quantized weights and activations          |
| uint8    | 8           | 0 to 255    | Quantized activations (unsigned)           |
| int4     | 4           | -8 to 7     | Extreme quantization (packed in int8)      |
| int32    | 32          | ±2.1×10⁹    | Accumulation and intermediate results      |
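The table notes that int4 values are stored packed in int8. One plausible packing scheme, two signed 4-bit values per byte with the low nibble first, can be sketched as below; AOCL-DLP's actual int4 storage layout may differ, and `pack_int4`/`unpack_int4` are illustrative names only.

```c
#include <stdint.h>

/* Pack two signed 4-bit values (each in [-8, 7]) into one byte,
 * low nibble first. Illustrative only: the library's real int4
 * layout may differ. */
static uint8_t pack_int4(int8_t lo, int8_t hi)
{
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Recover the signed value from the low (which = 0) or high
 * (which = 1) nibble, sign-extending from 4 bits. */
static int8_t unpack_int4(uint8_t byte, int which)
{
    int8_t nib = (int8_t)((byte >> (which ? 4 : 0)) & 0x0F);
    return (nib >= 8) ? (int8_t)(nib - 16) : nib;
}
```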

API Categories#

Core GEMM Operations#

The fundamental matrix multiplication APIs:

// Float32 precision
aocl_gemm_f32f32f32of32(...)

// BFloat16 with float32 accumulation
aocl_gemm_bf16bf16f32of32(...)

// Quantized integer operations
aocl_gemm_u8s8s32os32(...)
aocl_gemm_s8s8s32os8(...)

Batch Operations#

For processing multiple GEMM operations efficiently:

// Batch processing
aocl_batch_gemm_f32f32f32of32(...)
aocl_batch_gemm_bf16bf16f32of32(...)

Element-wise Operations#

For applying operations without matrix multiplication:

// Element-wise operations
aocl_gemm_eltwise_ops_f32of32(...)
aocl_gemm_eltwise_ops_bf16of32(...)

Utility Functions#

Standalone mathematical operations:

// Activation functions
aocl_gemm_gelu_tanh_f32(...)
aocl_gemm_gelu_erf_f32(...)
aocl_gemm_softmax_f32(...)

Matrix Reordering#

For optimizing repeated operations:

// Get buffer size for reordering
aocl_get_reorder_buf_size_f32f32f32of32(...)

// Reorder matrix for optimal access
aocl_reorder_f32f32f32of32(...)

Library Management#

For configuration and feature detection:

// Thread configuration
dlp_thread_set_num_threads(...)
dlp_thread_set_ways(...)

// Hardware feature detection
dlp_aocl_enable_instruction_query()

API Design Principles#

Consistent Naming#

All APIs follow a systematic naming convention:

aocl_[operation]_[input_types]o[output_type]

Examples:

  • aocl_gemm_f32f32f32of32: GEMM with float32 inputs and output

  • aocl_gemm_u8s8s32os8: GEMM with uint8/int8 inputs, int32 accumulation, int8 output

  • aocl_batch_gemm_bf16bf16f32of32: batch GEMM with bfloat16 inputs, float32 output

Memory Layout Flexibility#

All APIs support multiple memory layouts:

Memory Layout Options#

| Format             | Description          | Use Case                                        |
|--------------------|----------------------|-------------------------------------------------|
| Row-major ('R')    | C-style layout       | Most common; cache-friendly for many operations |
| Column-major ('C') | Fortran-style layout | Interoperability with Fortran/BLAS libraries    |
| Reordered ('R')    | Optimized layout     | Repeated operations with the same matrix        |

Note that the two 'R' flags do not conflict: row-major versus column-major is selected by the storage-order argument, while reordered versus unreordered ('N') is selected by the separate per-matrix memory-format argument, as the GEMM examples later on this page show.
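The difference between the two storage orders comes down to the index formula for element (i, j); the accessor names below are illustrative, not library APIs:

```c
#include <stddef.h>

/* Element (i, j) of an m x n matrix under each storage order.
 * ld is the leading dimension: >= n for row-major, >= m for
 * column-major. */
static float get_row_major(const float *a, size_t ld, size_t i, size_t j)
{
    return a[i * ld + j];   /* rows are contiguous */
}

static float get_col_major(const float *a, size_t ld, size_t i, size_t j)
{
    return a[j * ld + i];   /* columns are contiguous */
}
```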

Parameter Validation#

AOCL-DLP APIs perform robust parameter validation:

  • Dimension checks: Ensure matrix dimensions are compatible

  • Pointer validation: Handle NULL pointers gracefully

  • Range validation: Check for valid enumeration values

  • Memory format validation: Verify supported layout combinations

Hardware Abstraction#

The library provides automatic hardware optimization:

  • Feature detection: Runtime detection of CPU capabilities

  • Automatic fallbacks: Graceful degradation when features unavailable

  • Optimal path selection: Choose best implementation for current hardware

Post-Operations Framework#

AOCL-DLP supports fusing common operations with GEMM to improve performance:

Operation Types#

Post-Operation Categories#

| Category       | Operations                              | Performance Benefit                 |
|----------------|-----------------------------------------|-------------------------------------|
| Activation     | ReLU, PReLU, GeLU, Tanh, Sigmoid, SWISH | Eliminates separate activation pass |
| Scaling        | Scale, Clip                             | Fuses quantization/normalization    |
| Addition       | Bias, Matrix Add                        | Combines common DNN operations      |
| Multiplication | Matrix Multiply                         | Enables element-wise scaling        |

Usage Pattern#

// Initialize post-operations
aocl_post_op post_ops;
aocl_post_op_init(&post_ops);

// Add bias operation
aocl_post_op_bias bias_op = {.bias = bias_vector};
aocl_post_op_append_bias(&post_ops, &bias_op);

// Add activation
aocl_post_op_eltwise relu_op = {.algo = AOCL_ELTWISE_RELU};
aocl_post_op_append_eltwise(&post_ops, &relu_op);

// Use in GEMM
aocl_gemm_f32f32f32of32(..., &post_ops);

Performance Optimization#

Hardware Utilization#

AOCL-DLP automatically leverages available CPU features:

Hardware Features#

| Feature     | Availability | Benefit                              |
|-------------|--------------|--------------------------------------|
| AVX2/FMA3   | AMD Zen 1+   | Vectorized floating-point operations |
| AVX512      | AMD Zen 4+   | Wider vector operations              |
| AVX512_VNNI | AMD Zen 4+   | Accelerated integer GEMM             |
| AVX512_BF16 | AMD Zen 4+   | Native bfloat16 operations           |

Memory Optimization#

Key strategies for optimal memory performance:

  1. Data Layout: Use row-major layout when possible

  2. Alignment: Align matrices to cache line boundaries (64 bytes)

  3. Reordering: Reorder frequently-used matrices

  4. Batch Processing: Group similar operations

  5. Memory Bandwidth: Consider bandwidth limitations for large matrices
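Strategy 2 (cache-line alignment) can be implemented with C11 aligned_alloc; the helper name is ours. Note that aligned_alloc requires the requested size to be a multiple of the alignment, so the byte count is rounded up first:

```c
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

/* Allocate a float buffer aligned to a 64-byte cache line.
 * C11 aligned_alloc requires size to be a multiple of the
 * alignment, so round the byte count up to the next multiple
 * of 64 before allocating. */
static float *alloc_aligned_matrix(size_t rows, size_t cols)
{
    size_t bytes = rows * cols * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;
    return (float *)aligned_alloc(64, rounded);
}
```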

Threading Configuration#

Optimize parallel execution:

// Set thread count (typically number of CPU cores)
dlp_thread_set_num_threads(8);

// Configure workload distribution
dlp_thread_set_ways(2, 2, 2);  // 3D parallelization

Common Usage Patterns#

Neural Network Inference#

Typical workflow for neural network layers:

// 1. Initialize weights (once)
float *weights = load_weights();

// 2. Reorder weights for optimal performance (once)
size_t reorder_size = aocl_get_reorder_buf_size_f32f32f32of32(...);
float *weights_reordered = malloc(reorder_size);
aocl_reorder_f32f32f32of32(..., weights, weights_reordered, ...);

// 3. Set up post-operations (bias + activation)
aocl_post_op post_ops;
setup_post_ops(&post_ops, bias, activation_type);

// 4. Process inputs (repeated)
for (int batch = 0; batch < num_batches; batch++) {
    aocl_gemm_f32f32f32of32(
        'R', 'N', 'N', batch_size, output_dim, input_dim,
        1.0f, input[batch], input_dim, 'N',
        weights_reordered, output_dim, 'R',
        0.0f, output[batch], output_dim,
        &post_ops
    );
}

Quantized Inference#

Workflow for quantized neural networks:

// 1. Load quantized weights and scales
int8_t *weights_q = load_quantized_weights();
float *scales = load_scales();

// 2. Set up quantization post-ops
aocl_post_op post_ops;
setup_quantization_post_ops(&post_ops, scales, zero_points);

// 3. Process quantized inputs
aocl_gemm_u8s8s32os8(
    'R', 'N', 'N', m, n, k,
    1, input_q, k, 'N',
    weights_q, n, 'N',
    0, output_q, n,
    &post_ops
);

Batch Processing#

Efficient processing of multiple similar operations:

// Prepare batch data
float **a_array = malloc(batch_count * sizeof(float*));
float **b_array = malloc(batch_count * sizeof(float*));
float **c_array = malloc(batch_count * sizeof(float*));

// Fill arrays with matrix pointers
for (int i = 0; i < batch_count; i++) {
    a_array[i] = &input_matrices[i * m * k];
    b_array[i] = &weight_matrices[i * k * n];
    c_array[i] = &output_matrices[i * m * n];
}

// Process batch
aocl_batch_gemm_f32f32f32of32(
    'R', 'N', 'N', m, n, k,
    1.0f, a_array, k,
    b_array, n,
    0.0f, c_array, n,
    batch_count, NULL
);

Error Handling#

AOCL-DLP uses defensive programming practices:

Parameter Validation#

// APIs validate parameters and handle gracefully
if (m <= 0 || n <= 0 || k <= 0) {
    // No operation performed, function returns safely
    return;
}

if (a == NULL || b == NULL || c == NULL) {
    // NULL pointers handled without crash
    return;
}

Hardware Compatibility#

// Check hardware support
if (!dlp_aocl_enable_instruction_query()) {
    printf("Warning: Some optimizations not available\n");
    // Library will use fallback implementations
}

Best Practices#

  1. Choose Appropriate Precision: use the lowest precision that meets your accuracy requirements, and consider mixed precision (e.g., bf16 inputs with f32 accumulation).

  2. Optimize Memory Access: prefer row-major layout, align matrices to cache-line boundaries, and reorder matrices used in repeated operations.

  3. Leverage Hardware Features: use feature detection to select optimal algorithms, and validate performance on the target hardware.

  4. Fuse Operations: use post-operations to minimize memory traffic, and group related computations.

  5. Profile and Validate: measure performance with representative workloads, and verify numerical accuracy for your use case.

Migration Guide#

From Other BLAS Libraries#

AOCL-DLP APIs are designed to be familiar to BLAS users:

BLAS to AOCL-DLP Mapping#

| BLAS Function         | AOCL-DLP Equivalent          | Key Differences                      |
|-----------------------|------------------------------|--------------------------------------|
| sgemm                 | aocl_gemm_f32f32f32of32      | Additional post-operations support   |
| dgemm                 | Use the f32f32f32of32 variant | AOCL-DLP focuses on single precision |
| Custom quantized GEMM | aocl_gemm_u8s8s32os8         | Built-in quantization support        |

From Previous AOCL-DLP Versions#

When upgrading:

  1. Check API compatibility: Review function signatures

  2. Update post-operations: New post-op framework may require changes

  3. Validate performance: Re-benchmark with new version

  4. Test accuracy: Verify numerical results remain acceptable

See Also#