VOOZH about

URL: https://gpuopen.com/learn/wmma-guide-amd-rdna-4-gpus-part-3/

⇱ WMMA guide for AMD RDNA 4 architecture GPUs - part 3 - AMD GPUOpen


WMMA guide for AMD RDNA 4 architecture GPUs - part 3

Originally posted:

In-register matrix transpose for RDNA 4 architecture GPUs

In-register matrix transpose is an important optimization technique in FFT and Neural Texture Compression. This article explores practical implementations using WMMA on AMD RDNA™ 4 architecture graphics cards.

1. Problem description

Matrix transpose loading is critical in modern GPGPU computing. However, RDNA 4 architecture GPUs lack both shared-memory transpose loading and in-register matrix transpose capabilities, making it difficult for programmers to achieve peak performance. Consequently, an efficient in-register matrix transpose solution is essential for RDNA 4.

2. Transpose by warp shuffle instruction

A straightforward approach uses warp shuffle instructions to exchange data between threads in the warp. However, as the RDNA 4 WMMA layout and shuffle instructions only support constant index of a value, requiring redundant data exchanges that limit performance. Here is a sample code:

#defineWMMA_DATA_WIDTH8
typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ frag_type shfl_movmatrix(const frag_type& t) {
frag_type ret;
auto lane_id =__lane_id();
auto v_src = lane_id %8;
auto TLayout = [] (autolane_id) {
constexprunsigned shape[] = {8, 2, 2};
constexprunsigned stride[] = {0, 16, 8};
unsigned result =0;
for(int i =0; i <sizeof(shape) /sizeof(shape[0]); ++i) {
result += lane_id % shape[i] * stride[i];
lane_id /= shape[i];
}
return result;
};
auto t_trans =TLayout(lane_id);
#pragmaunroll
for(int v =0; v <8; ++v) {
auto t_src = t_trans + v;
frag_type reg;
uint32_t* in_reg = (uint32_t*)(&t);
uint32_t* out_reg = (uint32_t*)(&reg);
static_assert(sizeof(reg) %sizeof(*in_reg) ==0, "frag_type must be dividend by uint32_t evenly");
#pragmaunroll
for(int tv =0; tv <sizeof(reg) /sizeof(*in_reg); ++tv) {
out_reg[tv] =__shfl(in_reg[tv], t_src);
}
ret[v] = reg[v_src];
}
return ret;
}

3. Transpose by WMMA

To implement in-register matrix transpose effectively, one must first understand the Wide Matrix Multiply Accumulate (WMMA) layout in RDNA 4. Using the Matrix Cores of AMD RDNA 4 architecture GPUs provides a comprehensive introduction to this topic.

The following illustrates the WMMA layout in RDNA 4. All data types—including FP16, INT8, and INT4—utilize this unified layout:

MatrixPositionDimensions
ALower leftM rows × K columns
BUpper rightK rows × N columns
DLower rightM rows × N columns

👁 Image

Both matrix A and B are K-major, with each thread holding 8 contiguous elements. This layout enables efficient 128-bit vectorized loads. Matrix D is M-major. So, matrix D is the transposed version of matrix A.

Leveraging the RDNA 4 architecture WMMA layout, we can construct an identity matrix in register B while loading the source matrix into register A. A single WMMA operation then performs the transpose entirely in-register—no additional memory operations required.

4. Sample code

The following code demonstrates the in-register matrix transpose implementation on RDNA 4.

#include<hip/hip_runtime.h>
#include<hip/hip_fp16.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
usingnamespacestd;
#defineWMMA_DATA_WIDTH8
typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ voidmake_identity(frag_type& m) {
constint lIdx =__lane_id();
constint row = lIdx /16;
constint col = lIdx %16/8;
const __half num = row == col;
constint idx = lIdx %16%8;
m[idx] = num;
}
__global__ voidwmma_movmatrix(__half* a, __half* c) {
constint gIdx = blockIdx.x * blockDim.x + threadIdx.x;
constint lIdx = threadIdx.x;
constint lane = lIdx %16;
constint laneGroup = lIdx /16;
frag_type a_frag = {0};
frag_type b_frag = {0};
frag_type c_frag = {0};
for(int ele =0; ele < WMMA_DATA_WIDTH; ++ele) {
a_frag[ele] = a[16* lane + ele+laneGroup * WMMA_DATA_WIDTH];
}
make_identity(b_frag);
c_frag =__builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, c_frag);
for( int ele =0; ele < WMMA_DATA_WIDTH; ++ele ) {
c[16* lane + ele+laneGroup * WMMA_DATA_WIDTH] = c_frag[ele];
}
}
intmain(intargc, char*argv[]) {
thrust::host_vector<__half>h_a(16*16);
thrust::host_vector<__half>h_c(16*16);
for(int i =0; i < h_a.size(); ++i) {
h_a[i] = (float)i;
}
thrust::device_vector<__half> d_a = h_a;
thrust::device_vector<__half> d_c = h_c;
wmma_movmatrix<<<dim3(1), dim3(32, 1, 1), 0, 0>>>(d_a.data().get(), d_c.data().get());
h_c = d_c;
printf("original:\n");
for (int i =0; i <16; ++i) {
for (int j =0; j <16; ++j) {
printf("%3i ", (int)h_a[i*16+ j]);
}
printf("\n");
}
printf("transposed:\n");
for (int i =0; i <16; ++i) {
for (int j =0; j <16; ++j) {
printf("%3i ", (int)h_c[i*16+ j]);
}
printf("\n");
}
return0;
}

5. Conclusion

Using WMMA for in-register matrix transpose on AMD RDNA 4 architecture GPUs provides a lightweight alternative to CUDA’s ldmatrix.trans and movmatrix instructions, particularly when matrix core utilization is low.

This technique has been deployed in Llama.cpp to implement ldmatrix_trans in Flash Attention on RDNA 4, serving as a real-world validation of the approach.

Footnotes

Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-97.

Hui Zhang

Zhang Hui is a Member of Technical Staff in the AMD Devtech team where he focuses on helping developers utilize AMD CPU cores efficiently and make deep learning solutions for AMD AI products.

Related software

HIP Ray Tracing
HIP RT is a ray tracing library for HIP, making it easy to write ray tracing applications in HIP.
Orochi
Orochi is a library which loads HIP and CUDA® APIs dynamically, allowing the user to switch APIs at runtime.

Related news and technical articles

WMMA guide for AMD RDNA 4 architecture GPUs - part 2
Achieve peak AMD RDNA™ 4 architecture memory bandwidth for low-precision GEMM by fusing WMMA to double the K dimension, enabling 128-bit loads for FP8/INT8, and matching hipBLAS results bit-for-bit.
WMMA guide for AMD RDNA 4 architecture GPUs - part 1
Practical guide to fusing GEMMs on AMD RDNA™ 4 architecture, covering WMMA layout, a transpose-by-swapping A/B technique, HIP sample code, and hipBLAS-verified results used in Llama.cpp.
Introducing HIP RT v2.2
With the release of v2.2, HIP RT now support multi-level instancing. Multi-level instancing can help to reduce memory requirements, allowing you to render large scenes with limited memory.
Jacobi Solver with HIP and OpenMP offloading
In this blog, we explore GPU offloading using HIP and OpenMP target directives and discuss their relative merits in terms of implementation efforts and performance.