On this page
WMMA guide for AMD RDNA 4 architecture GPUs - part 3
In-register matrix transpose for RDNA 4 architecture GPUs
In-register matrix transpose is an important optimization technique in FFT and Neural Texture Compression. This article explores practical implementations using WMMA on AMD RDNA™ 4 architecture graphics cards.
1. Problem description
Matrix transpose loading is critical in modern GPGPU computing. However, RDNA 4 architecture GPUs lack both shared-memory transpose loading and in-register matrix transpose capabilities, making it difficult for programmers to achieve peak performance. Consequently, an efficient in-register matrix transpose solution is essential for RDNA 4.
2. Transpose by warp shuffle instruction
A straightforward approach uses warp shuffle instructions to exchange data between threads in the warp. However, as the RDNA 4 WMMA layout and shuffle instructions only support constant index of a value, requiring redundant data exchanges that limit performance. Here is a sample code:
#defineWMMA_DATA_WIDTH8typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));__device__ __forceinline__ frag_type shfl_movmatrix(const frag_type& t) {frag_type ret;auto lane_id =__lane_id();auto v_src = lane_id %8;auto TLayout = [] (autolane_id) {constexprunsigned shape[] = {8, 2, 2};constexprunsigned stride[] = {0, 16, 8};unsigned result =0;for(int i =0; i <sizeof(shape) /sizeof(shape[0]); ++i) {result += lane_id % shape[i] * stride[i];lane_id /= shape[i];}return result;};auto t_trans =TLayout(lane_id);#pragmaunrollfor(int v =0; v <8; ++v) {auto t_src = t_trans + v;frag_type reg;uint32_t* in_reg = (uint32_t*)(&t);uint32_t* out_reg = (uint32_t*)(®);static_assert(sizeof(reg) %sizeof(*in_reg) ==0, "frag_type must be dividend by uint32_t evenly");#pragmaunrollfor(int tv =0; tv <sizeof(reg) /sizeof(*in_reg); ++tv) {out_reg[tv] =__shfl(in_reg[tv], t_src);}ret[v] = reg[v_src];}return ret;}3. Transpose by WMMA
To implement in-register matrix transpose effectively, one must first understand the Wide Matrix Multiply Accumulate (WMMA) layout in RDNA 4. Using the Matrix Cores of AMD RDNA 4 architecture GPUs provides a comprehensive introduction to this topic.
The following illustrates the WMMA layout in RDNA 4. All data types—including FP16, INT8, and INT4—utilize this unified layout:
| Matrix | Position | Dimensions |
|---|---|---|
| A | Lower left | M rows × K columns |
| B | Upper right | K rows × N columns |
| D | Lower right | M rows × N columns |
Both matrix A and B are K-major, with each thread holding 8 contiguous elements. This layout enables efficient 128-bit vectorized loads. Matrix D is M-major. So, matrix D is the transposed version of matrix A.
Leveraging the RDNA 4 architecture WMMA layout, we can construct an identity matrix in register B while loading the source matrix into register A. A single WMMA operation then performs the transpose entirely in-register—no additional memory operations required.
4. Sample code
The following code demonstrates the in-register matrix transpose implementation on RDNA 4.
#include<hip/hip_runtime.h>#include<hip/hip_fp16.h>#include<thrust/host_vector.h>#include<thrust/device_vector.h>usingnamespacestd;#defineWMMA_DATA_WIDTH8typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));__device__ __forceinline__ voidmake_identity(frag_type& m) {constint lIdx =__lane_id();constint row = lIdx /16;constint col = lIdx %16/8;const __half num = row == col;constint idx = lIdx %16%8;m[idx] = num;}__global__ voidwmma_movmatrix(__half* a, __half* c) {constint gIdx = blockIdx.x * blockDim.x + threadIdx.x;constint lIdx = threadIdx.x;constint lane = lIdx %16;constint laneGroup = lIdx /16;frag_type a_frag = {0};frag_type b_frag = {0};frag_type c_frag = {0};for(int ele =0; ele < WMMA_DATA_WIDTH; ++ele) {a_frag[ele] = a[16* lane + ele+laneGroup * WMMA_DATA_WIDTH];}make_identity(b_frag);c_frag =__builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, c_frag);for( int ele =0; ele < WMMA_DATA_WIDTH; ++ele ) {c[16* lane + ele+laneGroup * WMMA_DATA_WIDTH] = c_frag[ele];}}intmain(intargc, char*argv[]) {thrust::host_vector<__half>h_a(16*16);thrust::host_vector<__half>h_c(16*16);for(int i =0; i < h_a.size(); ++i) {h_a[i] = (float)i;}thrust::device_vector<__half> d_a = h_a;thrust::device_vector<__half> d_c = h_c;wmma_movmatrix<<<dim3(1), dim3(32, 1, 1), 0, 0>>>(d_a.data().get(), d_c.data().get());h_c = d_c;printf("original:\n");for (int i =0; i <16; ++i) {for (int j =0; j <16; ++j) {printf("%3i ", (int)h_a[i*16+ j]);}printf("\n");}printf("transposed:\n");for (int i =0; i <16; ++i) {for (int j =0; j <16; ++j) {printf("%3i ", (int)h_c[i*16+ j]);}printf("\n");}return0;}5. Conclusion
Using WMMA for in-register matrix transpose on AMD RDNA 4 architecture GPUs provides a lightweight alternative to CUDA’s ldmatrix.trans and movmatrix instructions, particularly when matrix core utilization is low.
This technique has been deployed in Llama.cpp to implement ldmatrix_trans in Flash Attention on RDNA 4, serving as a real-world validation of the approach.
Footnotes
Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-97.
