API Overview#
AOCL-DLP provides a comprehensive set of APIs for high-performance deep learning primitives optimized for AMD processors. This overview introduces the key concepts, design principles, and usage patterns that apply across all APIs.
Core Concepts#
Matrix Operations#
AOCL-DLP is built around optimized matrix operations, primarily General Matrix Multiplication (GEMM):

\[C = \text{post_ops}(\alpha \cdot op(A) \cdot op(B) + \beta \cdot C)\]

Where:
- \(op(X)\) can be \(X\) (no transpose) or \(X^T\) (transpose)
- \(\alpha, \beta\) are scalar multipliers
- \(\text{post_ops}\) represents fused post-processing operations
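As a concrete reference for the formula above, here is a naive scalar implementation with ReLU standing in as the fused post-op. This is an illustration of the semantics only; the library's kernels are blocked and vectorized, and `reference_gemm_relu` is a hypothetical name, not an AOCL-DLP API.

```c
#include <stddef.h>

/* Naive reference for C = post_ops(alpha * A * B + beta * C), with
 * ReLU as the example post-op. A is m x k, B is k x n, C is m x n,
 * all row-major and untransposed for simplicity. */
static void reference_gemm_relu(size_t m, size_t n, size_t k,
                                float alpha, const float *A,
                                const float *B, float beta, float *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            float v = alpha * acc + beta * C[i * n + j];
            C[i * n + j] = v > 0.0f ? v : 0.0f;  /* fused ReLU post-op */
        }
    }
}
```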
Data Type Support#
AOCL-DLP supports multiple precision formats to balance accuracy and performance:
| Type | Size (bits) | Range | Use Case |
|---|---|---|---|
| `float32` | 32 | ±3.4×10³⁸ | High-precision training and inference |
| `bfloat16` | 16 | ±3.4×10³⁸ | Memory-efficient inference with good range |
| `int8` | 8 | -128 to 127 | Quantized weights and activations |
| `uint8` | 8 | 0 to 255 | Quantized activations (unsigned) |
| `int4` | 4 | -8 to 7 | Extreme quantization (packed in int8) |
| `int32` | 32 | ±2.1×10⁹ | Accumulation and intermediate results |
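The int4 row notes that two 4-bit values are packed per int8 byte. A sketch of one possible packing follows; the nibble layout shown is an assumption for illustration, and AOCL-DLP's actual int4 packing may differ.

```c
#include <stdint.h>

/* Pack two signed 4-bit values (-8..7) into one byte: 'lo' in the low
 * nibble, 'hi' in the high nibble. Layout chosen for illustration. */
static uint8_t pack_s4(int8_t lo, int8_t hi)
{
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Unpack, sign-extending each nibble back to -8..7. */
static void unpack_s4(uint8_t byte, int8_t *lo, int8_t *hi)
{
    *lo = (int8_t)(((byte & 0x0F) ^ 0x08) - 8);  /* sign-extend low nibble */
    *hi = (int8_t)(((byte >> 4) ^ 0x08) - 8);    /* sign-extend high nibble */
}
```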
API Categories#
Core GEMM Operations#
The fundamental matrix multiplication APIs:
```c
// Float32 precision
aocl_gemm_f32f32f32of32(...)

// BFloat16 with float32 accumulation
aocl_gemm_bf16bf16f32of32(...)

// Quantized integer operations
aocl_gemm_u8s8s32os32(...)
aocl_gemm_s8s8s32os8(...)
```
Batch Operations#
For processing multiple GEMM operations efficiently:
```c
// Batch processing
aocl_batch_gemm_f32f32f32of32(...)
aocl_batch_gemm_bf16bf16f32of32(...)
```
Element-wise Operations#
For applying operations without matrix multiplication:
```c
// Element-wise operations
aocl_gemm_eltwise_ops_f32of32(...)
aocl_gemm_eltwise_ops_bf16of32(...)
```
Utility Functions#
Standalone mathematical operations:
```c
// Activation functions
aocl_gemm_gelu_tanh_f32(...)
aocl_gemm_gelu_erf_f32(...)
aocl_gemm_softmax_f32(...)
```
Matrix Reordering#
For optimizing repeated operations:
```c
// Get buffer size for reordering
aocl_get_reorder_buf_size_f32f32f32of32(...)

// Reorder matrix for optimal access
aocl_reorder_f32f32f32of32(...)
```
Library Management#
For configuration and feature detection:
```c
// Thread configuration
dlp_thread_set_num_threads(...)
dlp_thread_set_ways(...)

// Hardware feature detection
dlp_aocl_enable_instruction_query()
```
API Design Principles#
Consistent Naming#
All APIs follow a systematic naming convention:
`aocl_[operation]_[input_types]o[output_type]`

Examples:
- `aocl_gemm_f32f32f32of32`: GEMM with float32 inputs, accumulation, and output
- `aocl_gemm_u8s8s32os8`: GEMM with uint8/int8 inputs, int32 accumulation, int8 output
- `aocl_batch_gemm_bf16bf16f32of32`: batch GEMM with bfloat16 inputs, float32 accumulation and output
Memory Layout Flexibility#
All APIs support multiple memory layouts:
| Format | Description | Use Case |
|---|---|---|
| Row-major (`'R'` storage order) | C-style layout | Most common, cache-friendly for many operations |
| Column-major (`'C'` storage order) | Fortran-style layout | Interoperability with Fortran/BLAS libraries |
| Reordered (`'R'` memory-format flag) | Optimized layout | Repeated operations with the same matrix |
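The practical difference between row-major and column-major storage reduces to the addressing formula; a minimal sketch, where `ld` is the leading dimension:

```c
#include <stddef.h>

/* Linear index of element (i, j) of a matrix under the two storage
 * orders. 'ld' is the leading dimension: at least the row length for
 * row-major, at least the column length for column-major. */
static size_t idx_row_major(size_t i, size_t j, size_t ld) { return i * ld + j; }
static size_t idx_col_major(size_t i, size_t j, size_t ld) { return j * ld + i; }
```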
Parameter Validation#
AOCL-DLP APIs perform robust parameter validation:
- **Dimension checks**: ensure matrix dimensions are compatible
- **Pointer validation**: handle NULL pointers gracefully
- **Range validation**: check for valid enumeration values
- **Memory format validation**: verify supported layout combinations
Hardware Abstraction#
The library provides automatic hardware optimization:
- **Feature detection**: runtime detection of CPU capabilities
- **Automatic fallbacks**: graceful degradation when features are unavailable
- **Optimal path selection**: choose the best implementation for the current hardware
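The library performs this selection internally; the general pattern can be sketched with function-pointer dispatch over a stubbed feature query. `select_dot` and `has_avx512` are hypothetical names for illustration; real detection would use CPUID.

```c
#include <stdbool.h>
#include <stddef.h>

/* Baseline scalar kernel. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* In a real library this variant would use AVX-512 intrinsics; here it
 * is identical so the example stays portable. */
static float dot_wide(const float *a, const float *b, size_t n)
{
    return dot_scalar(a, b, n);
}

typedef float (*dot_fn)(const float *, const float *, size_t);

/* Dispatch once at startup: pick the best kernel the CPU supports.
 * 'has_avx512' is a stub for a real CPUID-based query. */
static dot_fn select_dot(bool has_avx512)
{
    return has_avx512 ? dot_wide : dot_scalar;
}
```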
Post-Operations Framework#
AOCL-DLP supports fusing common operations with GEMM to improve performance:
Operation Types#
| Category | Operations | Performance Benefit |
|---|---|---|
| Activation | ReLU, PReLU, GeLU, Tanh, Sigmoid, SWISH | Eliminates separate activation pass |
| Scaling | Scale, Clip | Fuses quantization/normalization |
| Addition | Bias, Matrix Add | Combines common DNN operations |
| Multiplication | Matrix Multiply | Enables element-wise scaling |
Usage Pattern#
```c
// Initialize post-operations
aocl_post_op post_ops;
aocl_post_op_init(&post_ops);

// Add bias operation
aocl_post_op_bias bias_op = {.bias = bias_vector};
aocl_post_op_append_bias(&post_ops, &bias_op);

// Add activation
aocl_post_op_eltwise relu_op = {.algo = AOCL_ELTWISE_RELU};
aocl_post_op_append_eltwise(&post_ops, &relu_op);

// Use in GEMM
aocl_gemm_f32f32f32of32(..., &post_ops);
```
Performance Optimization#
Hardware Utilization#
AOCL-DLP automatically leverages available CPU features:
| Feature | Availability | Benefit |
|---|---|---|
| AVX2/FMA3 | AMD Zen 1+ | Vectorized floating-point operations |
| AVX512 | AMD Zen 4+ | Wider vector operations |
| AVX512_VNNI | AMD Zen 4+ | Accelerated integer GEMM |
| AVX512_BF16 | AMD Zen 4+ | Native bfloat16 operations |
Memory Optimization#
Key strategies for optimal memory performance:
- **Data Layout**: use row-major layout when possible
- **Alignment**: align matrices to cache line boundaries (64 bytes)
- **Reordering**: reorder frequently used matrices
- **Batch Processing**: group similar operations
- **Memory Bandwidth**: consider bandwidth limitations for large matrices
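The alignment point above can be implemented with C11 `aligned_alloc`, which requires the requested size to be a multiple of the alignment:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a float buffer aligned to a 64-byte cache line. C11
 * aligned_alloc requires the size to be a multiple of the alignment,
 * so round it up first. Free with free(). */
static float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;
    return (float *)aligned_alloc(64, rounded);
}
```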
Threading Configuration#
Optimize parallel execution:
```c
// Set thread count (typically number of CPU cores)
dlp_thread_set_num_threads(8);

// Configure workload distribution
dlp_thread_set_ways(2, 2, 2); // 3D parallelization
```
Common Usage Patterns#
Neural Network Inference#
Typical workflow for neural network layers:
```c
// 1. Initialize weights (once)
float *weights = load_weights();

// 2. Reorder weights for optimal performance (once)
size_t reorder_size = aocl_get_reorder_buf_size_f32f32f32of32(...);
float *weights_reordered = malloc(reorder_size);
aocl_reorder_f32f32f32of32(..., weights, weights_reordered, ...);

// 3. Set up post-operations (bias + activation)
aocl_post_op post_ops;
setup_post_ops(&post_ops, bias, activation_type);

// 4. Process inputs (repeated)
for (int batch = 0; batch < num_batches; batch++) {
    aocl_gemm_f32f32f32of32(
        'R', 'N', 'N', batch_size, output_dim, input_dim,
        1.0f, input[batch], input_dim, 'N',
        weights_reordered, output_dim, 'R',
        0.0f, output[batch], output_dim,
        &post_ops
    );
}
```
Quantized Inference#
Workflow for quantized neural networks:
```c
// 1. Load quantized weights and scales
int8_t *weights_q = load_quantized_weights();
float *scales = load_scales();

// 2. Set up quantization post-ops
aocl_post_op post_ops;
setup_quantization_post_ops(&post_ops, scales, zero_points);

// 3. Process quantized inputs
aocl_gemm_u8s8s32os8(
    'R', 'N', 'N', m, n, k,
    1, input_q, k, 'N',
    weights_q, n, 'N',
    0, output_q, n,
    &post_ops
);
```
Batch Processing#
Efficient processing of multiple similar operations:
```c
// Prepare batch data
float **a_array = malloc(batch_count * sizeof(float *));
float **b_array = malloc(batch_count * sizeof(float *));
float **c_array = malloc(batch_count * sizeof(float *));

// Fill arrays with matrix pointers
for (int i = 0; i < batch_count; i++) {
    a_array[i] = &input_matrices[i * m * k];
    b_array[i] = &weight_matrices[i * k * n];
    c_array[i] = &output_matrices[i * m * n];
}

// Process batch
aocl_batch_gemm_f32f32f32of32(
    'R', 'N', 'N', m, n, k,
    1.0f, a_array, k,
    b_array, n,
    0.0f, c_array, n,
    batch_count, NULL
);
```
Error Handling#
AOCL-DLP uses defensive programming practices:
Parameter Validation#
```c
// APIs validate parameters and handle errors gracefully
if (m <= 0 || n <= 0 || k <= 0) {
    // No operation performed, function returns safely
    return;
}

if (a == NULL || b == NULL || c == NULL) {
    // NULL pointers handled without crash
    return;
}
```
Hardware Compatibility#
```c
// Check hardware support
if (!dlp_aocl_enable_instruction_query()) {
    printf("Warning: Some optimizations not available\n");
    // Library will use fallback implementations
}
```
Best Practices#
- **Choose Appropriate Precision**: use the lowest precision that meets your accuracy requirements; consider mixed precision (e.g., bf16 inputs, f32 accumulation)
- **Optimize Memory Access**: prefer row-major layout, align matrices to cache boundaries, and use reordering for repeated operations
- **Leverage Hardware Features**: use feature detection to select optimal algorithms, and validate on the target hardware
- **Fuse Operations**: use post-operations to minimize memory traffic and group related computations
- **Profile and Validate**: measure performance with representative workloads and verify numerical accuracy for your use case
Migration Guide#
From Other BLAS Libraries#
AOCL-DLP APIs are designed to be familiar to BLAS users:
| BLAS Function | AOCL-DLP Equivalent | Key Differences |
|---|---|---|
| `sgemm` | `aocl_gemm_f32f32f32of32` | Additional post-operations support |
| `dgemm` | Use `aocl_gemm_f32f32f32of32` | AOCL-DLP focuses on single precision |
| Custom quantized GEMM | `aocl_gemm_u8s8s32os32` | Built-in quantization support |
From Previous AOCL-DLP Versions#
When upgrading:
- **Check API compatibility**: review function signatures
- **Update post-operations**: the new post-op framework may require changes
- **Validate performance**: re-benchmark with the new version
- **Test accuracy**: verify numerical results remain acceptable
See Also#
- GEMM Operations: detailed GEMM API documentation
- Post-Operations: post-operations reference