Triton-XDNA Examples

These examples demonstrate how to write Triton kernels that compile and run on AMD XDNA™ NPUs via the MLIR-AIR compilation flow.

End-to-End Models

These are complete models rather than single operators: each is built by composing Triton kernels and runs across the iGPU and NPU.

Model	Description	Datatype(s)	Example
GPT-2	End-to-end GPT-2 inference (all four sizes: small/medium/large/xl) composed from Triton kernels, running across iGPU and NPU.	bf16	gpt2/
Qwen2.5	End-to-end Qwen2.5-Instruct inference (0.5B/1.5B) composed from Triton kernels, with KV-cached autoregressive generation running across iGPU and NPU.	bf16	qwen2_5/

Operator Dashboard

Category	Operation	Datatype(s)	AIE2	AIE2P	Example
Matrix	Matrix Multiplication (BF16)	bf16	✅	✅	matmul_bf16_m64_n64_k64/
Matrix	Padded Matrix Multiplication (F32, A Transposed)	f32 (bf16 emulation)	—	✅	matmul_f32_m64_n32_k16_padded_atransposed/
Matrix	Matrix Multiplication (INT8)	i8	—	✅	matmul_i8_m64_n64_k64/
Matrix	Matrix Multiplication (INT8, Large Tile)	i8	—	✅	matmul_i8_m128_n64_k64/
Matrix	Matrix Multiplication (Autotune)	bf16	✅	—	autotune-matmul/
Element-wise	ReLU	bf16	✅	✅	relu/
Element-wise	Sigmoid	bf16	✅	✅	sigmoid/
Element-wise	SiLU	bf16	✅	✅	silu/
Element-wise	GELU	bf16	—	✅	gelu/
Element-wise	Leaky ReLU	bf16	✅	✅	leaky_relu/
Element-wise	SwiGLU	bf16	✅	✅	swiglu/
Element-wise	AXPY	bf16	✅	✅	axpy/
Element-wise	Vector Add	bf16	✅	✅	vec-add/
Normalization	RMS Normalization	bf16	—	✅	rms_norm/
Normalization	Weighted RMS Normalization	bf16	—	✅	weighted_rms_norm/
Normalization	Softmax	bf16	✅	✅	test_softmax/
Normalization	Layer Normalization	f32	✅	✅	test_layernorm/
Pooling	Average Pool	bf16	✅	✅	average_pool/
Special	2D Block Load	f32	—	—	load_2d_block/
Special	Multi-Driver	bf16	✅	✅	multi_drivers/

Legend

✅ Transform file available (device target supported)
— Not yet available

AIE2 = AMD Ryzen™ AI (Phoenix, NPU1)
AIE2P = AMD Ryzen™ AI (Strix, NPU2)
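If you want to check device support programmatically, the dashboard can be mirrored in a small lookup table. The sketch below copies a handful of rows from the table above. The `SUPPORT` dictionary and the `examples_for` helper are hypothetical names used for illustration and are not part of triton-xdna.

```python
# Device support copied from a few rows of the dashboard; extend as needed.
# SUPPORT and examples_for are hypothetical helpers, not part of triton-xdna.
SUPPORT = {
    "matmul_bf16_m64_n64_k64": {"aie2", "aie2p"},
    "matmul_i8_m64_n64_k64": {"aie2p"},
    "autotune-matmul": {"aie2"},
    "gelu": {"aie2p"},
    "relu": {"aie2", "aie2p"},
    "load_2d_block": set(),  # no transform file listed for either device
}


def examples_for(device: str) -> list[str]:
    """Return the example directories marked as supported on `device`."""
    return sorted(name for name, devices in SUPPORT.items() if device in devices)


print(examples_for("aie2"))
# ['autotune-matmul', 'matmul_bf16_m64_n64_k64', 'relu']
```

Keeping the table as plain data makes it easy to skip examples that have no transform file for the device at hand.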

Running Examples

Before running an example, source the XRT setup script and activate a virtual environment that has triton-xdna installed (see the top-level README):

source /opt/xilinx/xrt/setup.sh

# Run an example on AIE2 (NPU1):
cd matmul_bf16_m64_n64_k64
AIR_TRANSFORM_TILING_SCRIPT=transform_aie2.mlir python matmul_bf16_m64_n64_k64.py

# Run the same example on AIE2P (NPU2), from the same directory:
AIR_TRANSFORM_TILING_SCRIPT=transform_aie2p.mlir python matmul_bf16_m64_n64_k64.py
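The commands above can also be wrapped in a small Python launcher, which may be convenient when scripting several runs. This is a minimal sketch under two assumptions that hold for `matmul_bf16_m64_n64_k64` but are not guaranteed for every example: the script is named after its directory, and the transform files are called `transform_aie2.mlir` and `transform_aie2p.mlir`. The helper names `build_run` and `run_example` are hypothetical.

```python
import os
import subprocess
import sys
from pathlib import Path

# Transform script names as they appear in the commands above.
TRANSFORMS = {"aie2": "transform_aie2.mlir", "aie2p": "transform_aie2p.mlir"}


def build_run(example_dir: str, device: str) -> tuple[list[str], dict[str, str], Path]:
    """Return (command, env, cwd) for running one example on `device`."""
    example = Path(example_dir)
    # Copy the current environment and select the transform script for the device.
    env = dict(os.environ, AIR_TRANSFORM_TILING_SCRIPT=TRANSFORMS[device])
    # Assumption: the script shares its directory's name.
    command = [sys.executable, f"{example.name}.py"]
    return command, env, example


def run_example(example_dir: str, device: str) -> int:
    """Run the example from inside its directory and return the exit code."""
    command, env, cwd = build_run(example_dir, device)
    transform = cwd / env["AIR_TRANSFORM_TILING_SCRIPT"]
    if not transform.is_file():
        raise FileNotFoundError(f"{transform} not found; {device} may not be supported")
    return subprocess.run(command, env=env, cwd=cwd).returncode
```

For instance, `run_example("matmul_bf16_m64_n64_k64", "aie2p")` mirrors the second command above. The launcher inherits the calling shell's environment, so XRT still has to be sourced before Python starts.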

Running All Tests

python scripts/run_tests.py --device aie2 --verbose
python scripts/run_tests.py --device aie2p --verbose
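For use in automation, the test invocation can be built and checked from Python. The sketch below uses only the `--device` and `--verbose` flags shown above. It assumes that it is launched from the directory containing `scripts/` and that `run_tests.py` exits with a nonzero status on failure; neither is confirmed here. The function names `suite_command` and `suite_passes` are hypothetical.

```python
import subprocess
import sys


def suite_command(device: str, verbose: bool = True) -> list[str]:
    """Build the run_tests.py command line for one device."""
    command = [sys.executable, "scripts/run_tests.py", "--device", device]
    if verbose:
        command.append("--verbose")
    return command


def suite_passes(device: str) -> bool:
    """Run the suite for `device`; True if the process exits with status 0."""
    # Assumption: run_tests.py signals failure through a nonzero exit code.
    return subprocess.run(suite_command(device)).returncode == 0
```

Calling `suite_passes("aie2p")` on a Strix machine would then correspond to the second command above.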