On this page

WMMA guide for AMD RDNA 4 architecture GPUs - part 3

Originally posted: June 2, 2026

Hui Zhang

In-register matrix transpose for RDNA 4 architecture GPUs

In-register matrix transpose is an important optimization technique in FFT and Neural Texture Compression. This article explores practical implementations using WMMA on AMD RDNA™ 4 architecture graphics cards.

1. Problem description

Matrix transpose loading is critical in modern GPGPU computing. However, RDNA 4 architecture GPUs lack both shared-memory transpose loading and in-register matrix transpose capabilities, making it difficult for programmers to achieve peak performance. Consequently, an efficient in-register matrix transpose solution is essential for RDNA 4.

2. Transpose by warp shuffle instruction

A straightforward approach uses warp shuffle instructions to exchange data between threads in the warp. However, as the RDNA 4 WMMA layout and shuffle instructions only support constant index of a value, requiring redundant data exchanges that limit performance. Here is a sample code:

#defineWMMA_DATA_WIDTH8
typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ frag_type shfl_movmatrix(const frag_type& t) {
frag_type ret;
auto lane_id =__lane_id();
auto v_src = lane_id %8;
auto TLayout = [] (autolane_id) {
constexprunsigned shape[] = {8, 2, 2};
constexprunsigned stride[] = {0, 16, 8};
unsigned result =0;
for(int i =0; i <sizeof(shape) /sizeof(shape[0]); ++i) {
result += lane_id % shape[i] * stride[i];
lane_id /= shape[i];
}
return result;
};
auto t_trans =TLayout(lane_id);
#pragmaunroll
for(int v =0; v <8; ++v) {
auto t_src = t_trans + v;
frag_type reg;
uint32_t* in_reg = (uint32_t*)(&t);
uint32_t* out_reg = (uint32_t*)(&reg);
static_assert(sizeof(reg) %sizeof(*in_reg) ==0, "frag_type must be dividend by uint32_t evenly");
#pragmaunroll
for(int tv =0; tv <sizeof(reg) /sizeof(*in_reg); ++tv) {
out_reg[tv] =__shfl(in_reg[tv], t_src);
}
ret[v] = reg[v_src];
}
return ret;
}

3. Transpose by WMMA

To implement in-register matrix transpose effectively, one must first understand the Wide Matrix Multiply Accumulate (WMMA) layout in RDNA 4. Using the Matrix Cores of AMD RDNA 4 architecture GPUs provides a comprehensive introduction to this topic.

The following illustrates the WMMA layout in RDNA 4. All data types—including FP16, INT8, and INT4—utilize this unified layout:

Matrix	Position	Dimensions
A	Lower left	M rows × K columns
B	Upper right	K rows × N columns
D	Lower right	M rows × N columns

👁 Image

Both matrix A and B are K-major, with each thread holding 8 contiguous elements. This layout enables efficient 128-bit vectorized loads. Matrix D is M-major. So, matrix D is the transposed version of matrix A.

Leveraging the RDNA 4 architecture WMMA layout, we can construct an identity matrix in register B while loading the source matrix into register A. A single WMMA operation then performs the transpose entirely in-register—no additional memory operations required.

4. Sample code

The following code demonstrates the in-register matrix transpose implementation on RDNA 4.

#include<hip/hip_runtime.h>
#include<hip/hip_fp16.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
usingnamespacestd;
#defineWMMA_DATA_WIDTH8
typedef_Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ voidmake_identity(frag_type& m) {
constint lIdx =__lane_id();
constint row = lIdx /16;
constint col = lIdx %16/8;
const __half num = row == col;
constint idx = lIdx %16%8;
m[idx] = num;
}
__global__ voidwmma_movmatrix(__half* a, __half* c) {
constint gIdx = blockIdx.x * blockDim.x + threadIdx.x;
constint lIdx = threadIdx.x;
constint lane = lIdx %16;
constint laneGroup = lIdx /16;
frag_type a_frag = {0};
frag_type b_frag = {0};
frag_type c_frag = {0};
for(int ele =0; ele < WMMA_DATA_WIDTH; ++ele) {
a_frag[ele] = a[16* lane + ele+laneGroup * WMMA_DATA_WIDTH];
}
make_identity(b_frag);
c_frag =__builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, c_frag);
for( int ele =0; ele < WMMA_DATA_WIDTH; ++ele ) {
c[16* lane + ele+laneGroup * WMMA_DATA_WIDTH] = c_frag[ele];
}
}
intmain(intargc, char*argv[]) {
thrust::host_vector<__half>h_a(16*16);
thrust::host_vector<__half>h_c(16*16);
for(int i =0; i < h_a.size(); ++i) {
h_a[i] = (float)i;
}
thrust::device_vector<__half> d_a = h_a;
thrust::device_vector<__half> d_c = h_c;
wmma_movmatrix<<<dim3(1), dim3(32, 1, 1), 0, 0>>>(d_a.data().get(), d_c.data().get());
h_c = d_c;
printf("original:\n");
for (int i =0; i <16; ++i) {
for (int j =0; j <16; ++j) {
printf("%3i ", (int)h_a[i*16+ j]);
}
printf("\n");
}
printf("transposed:\n");
for (int i =0; i <16; ++i) {
for (int j =0; j <16; ++j) {
printf("%3i ", (int)h_c[i*16+ j]);
}
printf("\n");
}
return0;
}

5. Conclusion

Using WMMA for in-register matrix transpose on AMD RDNA 4 architecture GPUs provides a lightweight alternative to CUDA’s ldmatrix.trans and movmatrix instructions, particularly when matrix core utilization is low.

This technique has been deployed in Llama.cpp to implement ldmatrix_trans in Flash Attention on RDNA 4, serving as a real-world validation of the approach.

Footnotes

Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-97.

👁 Hui Zhang's avatar

Hui Zhang

Zhang Hui is a Member of Technical Staff in the AMD Devtech team where he focuses on helping developers utilize AMD CPU cores efficiently and make deep learning solutions for AMD AI products.

Related software

👁 HIP Ray Tracing

HIP Ray Tracing

HIP RT is a ray tracing library for HIP, making it easy to write ray tracing applications in HIP.

👁 Orochi

Orochi

Orochi is a library which loads HIP and CUDA® APIs dynamically, allowing the user to switch APIs at runtime.

URL: https://gpuopen.com/learn/wmma-guide-amd-rdna-4-gpus-part-3/