Voozh

Summary

Graphics cards are specialized hardware designed for parallel processing in tasks like graphics rendering and scientific research.
The key components of a graphics card include the GPU core, GPU memory, VRM, display interfaces, and cooling system.
Modern GPUs have advanced architectures that allow for efficient processing, including unified shader architecture, instruction pipelining, cache hierarchy, and multi-level parallelism.

To understand how a graphics card works, it's essential to dive deep into the components and processes that enable these devices to render images, videos, and animations on our screens. A graphics card, often referred to as a GPU (Graphics Processing Unit), is specialized hardware designed to accelerate the creation and rendering of images, videos, and animations. It works differently from the CPU (Central Processing Unit), excelling at parallel processing. This is key for tasks like graphics rendering and other computations that require handling many operations at once.

👁 AMD Radeon RX 7800 XT fans

3 reasons why 2023 was a colossal disappointment for graphics cards

Coming out of the GPU crisis, 2023 was supposed to bring a breath of fresh air. It did bring something fresh, but not what we expected

By Tanveer Singh

Core components of a Graphics Card👁 An image showing the ASRock Challenger RX 7700 XT 12G OC GPU kept on a white-colored PC case.

A graphics card is an essential piece of hardware for any computer, without which image output would have been impossible. It translates the data from the CPU into images that can be displayed on your monitor. The performance and capabilities of a graphics card depend largely on its core components, which are:

GPU (Graphics Processing Unit) Core: At the core of every graphics card, the GPU stands as a dedicated processor optimized for accelerating graphics rendering. With its ability to process thousands of operations simultaneously, the GPU excels in tasks that require parallel processing. This makes it indispensable not just for gaming and visual applications, but also for specific computational tasks in areas such as scientific research and machine learning.
GPU Memory: Graphics cards are equipped with a specialized form of memory, known as Video RAM (VRAM), which is optimized to meet the high-speed, high-bandwidth demands of processing and rendering visuals. VRAM is used to store textures, frame buffers, and other essential data needed for rendering images, contributing directly to the speed and quality of the graphics produced.
VRM: To meet their high-power draws, high-end graphics cards often require more power than the motherboard can supply. This leads to the necessity for direct power connections from the computer’s power supply unit (PSU) via one or more dedicated 8-pin connectors. The PSU provides the additional power over a 12v rail which is then converted to the ~1V required by the GPU die, as well as to various other voltages needed for components like memory on the card. This conversion is handled by the Voltage Regulation Module (VRM).
Display interfaces: The main objective of graphics cards is to output the rendered images to display devices. They are equipped with various output interfaces such as HDMI, DisplayPort, DVI, and VGA ports. These interfaces enable connectivity with a wide range of display devices, including monitors, TVs, and projectors, catering to different needs and ensuring compatibility with various display technologies.
Cooling: High-performance graphics cards generate a substantial amount of heat due to their intense processing tasks. To counter this, a robust cooling system is used, which may comprise heat sinks to dissipate heat, fans to circulate air, and in some cases, liquid cooling solutions. These systems work together to ensure the GPU operates within safe temperature ranges, thus maintaining performance and longevity.

What are the components of the Graphics Processing Unit (GPU) core?👁 An open pc case, with a Gigabyte graphics card installed next to an AMD CPU heatsink and a Samsung SSD.

The GPU core is the brains of the entire graphics operation of your PC and there are a few key components of the GPU that help it in successfully doing its job.

Stream Processors or CUDA Cores are the functional working units within a GPU core, dedicated to performing the FP32 shader operations and computational work required for graphics rendering and other parallel processing tasks. The abundance of stream processors or CUDA cores (depending on your GPU) enables the GPU to handle multiple operations simultaneously. The greater the number of these cores, the faster the GPU.
Memory Interface: The GPU's memory bandwidth determines data transfer speeds between the GPU core and VRAM. This is dependent on two factors: memory interface width (bits) and VRAM transfer rate (measured in Gbps). Bandwidth, which is measured in GB/s, is calculated by multiplying the bus width with the transfer rate and then dividing by 8 bits per byte. For instance, a GPU with a 320-bit bus width and 14 Gbps VRAM transfer rate yields a bandwidth of 560 GB/s.
ROPs (Raster Operations Pipelines) and TMUs (Texture Mapping Units): ROPs play quite an important role in producing the final pixel output on the screen, executing tasks such as antialiasing to enhance image quality. TMUs, on the other hand, are responsible for applying textures to 3D models. Both components are vital for achieving high-quality visual effects and rendering performance, contributing significantly to the overall visual experience.
RT Cores: These dedicated units accelerate real-time ray tracing tasks, such as bounding volume hierarchy traversal and ray-triangle intersections. By offloading these specific computations to specialized hardware, GPUs achieve a significant reduction in load for rendering complex lighting effects.
Tensor Cores: Specialized for accelerating matrix multiplications, these cores are crucial in deep learning and neural network computations. They leverage mixed-precision computing to enhance throughput while incorporating mechanisms (FP16 for computation, FP32 for accumulation) to boost performance without sacrificing accuracy to ensure precision.

The advanced architecture of modern GPUs👁 nvidia geforce rtx 4080 super fe stood up in front of its packaging

The advanced architecture of the modern day serves as the foundation behind the stunning visuals you see in games and the quick calculations needed for AI, along with other scientific research and even content production. They've gotten good at doing lots of tasks at the same time, thanks to some smart design choices. Here's a quick look at what makes them tick:

Unified shader architecture: Modern GPUs adopt a unified shader architecture. This flexible framework allows the same shader units to process various types of shaders, be it vertex, pixel, or geometry. Shaders are tailored to the task at hand. This adaptability increases processing efficiency.
Instruction pipelining and parallelism: At the heart of a GPU's speed and efficiency is its ability to execute multiple instructions simultaneously through instruction pipelining. This method layers the execution stages, keeping each core active and engaged, speeding up data processing and rendering times.
Cache hierarchy and memory management: Effective memory management is important to maintain a GPU's performance. With a proper cache hierarchy, including L1 and L2 caches, GPUs minimize latency and make efficient use of bandwidth. This design ensures quick access to frequently used data, which ensures smooth rendering.
Multi-Level Parallelism: By using parallelism in multiple layers, hardware, thread, and instruction, GPUs achieve a level of efficiency that's unparalleled. This multi-tiered approach allows for a large number of operations to be conducted in unison.
SIMD and SIMT Architectures: The concepts of SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads) are central to a GPU's capability to process multiple data points or threads simultaneously. This is especially effective for vector operations.
Execution Units and Warp Scheduling: The way GPUs manage threads and execution units is through something called warp schedulers. These schedulers organize threads into groups known as warps or wavefronts (depending on your GPU). These schedulers ensure that each execution unit is utilized efficiently.
Register files and shared memory: The inclusion of expansive register files and shared memory within each compute unit provides a fast, accessible storage solution for threads. This design facilitates swift variable access and inter-thread communication, cutting down on the need for global memory access and, thereby, enhancing processing speeds.
Asynchronous compute engines: The integration of asynchronous compute engines in some GPUs allows for the simultaneous execution of graphics and compute tasks. This dual-processing capability is especially crucial in applications requiring complex simulations alongside graphics rendering, providing a more streamlined and efficient use of resources.

How do memory architecture and optimization work on a GPU?

The memory interface width (e.g., 256-bit, 384-bit) and the type of VRAM used (GDDR6X, GDDR7, HBM) are critical factors in determining the GPU's memory bandwidth. Higher bandwidth enables faster data transfer rates between the GPU and memory, is crucial for high-resolution textures, detailed 3D models, and complex scenes. GDDR7 and HBM2E memory technologies stand out for their innovative approaches to increasing bandwidth and reducing latency. Here’s how these technologies are shaping the future of graphics memory:

GDDR6X and GDDR7: GDDR6X introduced PAM4 (Pulse Amplitude Modulation with 4 levels) signaling, effectively doubling the data rate per pin compared to the NRZ (Non-Return to Zero) signaling used by earlier versions. This was tailored to meet the increasing demands for high-resolution gaming and intricate graphic rendering. However, taking a new direction, GDDR7 shifts to PAM3 (3 signal levels) signaling. This change positions PAM3 as an intermediary between the complexity of PAM4 and the simplicity of NRZ, optimizing for both speed and signal integrity, as well as enhancing power efficiency. Samsung's GDDR7 DRAM, as the industry's first, promises unprecedented performance with speeds up to 37Gbps per pin on a 384-bit bus reaching a bandwidth of 1.8 terabytes-per-second (TBps)—a substantial improvement over GDDR6’s 1.0TBps. Additionally, it introduces a 20% improvement in energy efficiency and significantly reduces heat generation.
HBM2E Memory: HBM2E (High Bandwidth Memory 2E) takes a different approach by stacking memory dies vertically and utilizing a wide interface, connected by through-silicon vias (TSVs), and placing them on the same package as the GPU. This design reduces the physical distance data must travel, drastically increasing bandwidth and reducing power consumption. This structure significantly increases bandwidth by providing a direct pathway for data to travel between the memory and the GPU, making it especially beneficial for applications that handle vast amounts of data.
Cache coherency and memory compression: As GPUs grow faster, efficient cache management becomes increasingly crucial. Modern GPUs tackle this challenge with advanced cache coherency protocols, ensuring that data across all cache levels remains consistent and quickly accessible. This coherency is critical for multithreaded operations where the same data might be accessed and modified by different processes simultaneously. Additionally, memory compression techniques optimize data transfer by compressing the data before it moves between the GPU and memory. These algorithms significantly reduce the bandwidth needed for data transfer, enhancing overall performance while also conserving power.
Challenges in Cache coherency: Ensuring cache coherency across a GPU's complex memory hierarchy presents significant challenges. With multiple cores accessing and modifying shared data, maintaining a consistent state across all caches is essential to prevent data corruption and performance degradation. GPUs address these challenges through sophisticated cache coherency protocols like MOESI (Modified, Owner, Exclusive, Shared, Invalid) that manage the state of data in caches, ensuring consistency and minimizing latency. Implementing these protocols requires careful balancing to avoid overhead that could negate the benefits of coherency.
Data Compression Algorithms in GPUs: Data compression plays a vital role in optimizing the bandwidth and storage efficiency of GPUs. Techniques like delta color compression (DCC) and block compression (BC) are commonly used. DCC works by storing only the differences in color values between adjacent pixels rather than the full-color data, which is particularly effective for images with gradual color changes. BC, on the other hand, compresses blocks of pixels into smaller data sets based on similar patterns and colors. These algorithms reduce the amount of data that needs to be transferred and stored, significantly improving performance and reducing power consumption.

Semiconductor Fabrication and Process Nodes

The architecture of modern GPUs showcases the advancements in semiconductor manufacturing processes and microarchitecture design, resulting in the development of specialized processing units for specific tasks. This evolution reflects a continuous effort to balance performance, power efficiency, and thermal management.

Semiconductor manufacturing processes

The foundation of today's GPU performance lies in the semiconductor manufacturing process, often quantified in nanometers (nm). As the industry has moved to smaller process nodes from 7nm, 5nm, and now approaching 3nm, the potential for packing more transistors into the same die space has surged. This miniaturization boosts both performance and energy efficiency while mitigating heat production. Two major developments in transistor design, FinFET (Fin Field-Effect Transistor) and the more recent GAAFET (Gate-All-Around Field-Effect Transistor) have been instrumental in these developments. They enhance control over the transistor's channel, diminishing leakage current and improving switching performance.

While the shrinkage of process nodes brings about significant advantages, it's not without its complications. As we push the boundaries of miniaturization, yield issues become more prominent, and manufacturing complexities escalate. The precision required for developing chips at these scales introduces a higher probability of defects, which can affect the overall yield of viable chips from each wafer.

Microarchitecture Design: The Role of SMs and CUs

At the microarchitectural level, GPUs are organized into Streaming Multiprocessors (SMs) and Compute Units (CUs), which are clusters of cores executing instructions in tandem. The architecture of each SM/CU is balanced to optimize throughput for a wide array of tasks. An easy fix for this situation would be to just increase the core density, but that brings on issues with power efficiency. The trade-off between increasing the number of cores for parallel processing and managing the resultant rise in power consumption and heat is a critical consideration. Achieving an optimal performance-per-watt ratio is a primary goal for GPU architects. The efficiency of executing threads in groups, which is known as warps in NVIDIA and wavefronts in AMD, is crucial for maximizing core utilization. GPUs use scheduling algorithms to adapt dynamically to varying demands, enhancing overall efficiency.

Parallel processing & Compute shaders

Parallelism is achieved through an architectural design vastly different from that of traditional CPUs. GPUs differentiate themselves by their architecture, featuring thousands of smaller, efficient cores designed for parallel processing, as opposed to CPUs, which are optimized for sequential execution with a far lower number of cores. This design enables GPUs to handle numerous tasks simultaneously, making them ideally suited for applications requiring high computational power. The GPU cores operate in a multithreaded manner, allowing for the simultaneous processing of multiple data streams. This is particularly effective for tasks that can be broken down into smaller, independent tasks, such as pixel or vertex processing in graphics rendering, or parallelizable computations in scientific research.

How does a GPU work: Rendering pipeline👁 An image showing AMD's Radeon 7900 XTX and 7900 XT GPUs next to each other.

The GPU rendering pipeline is a complex sequence of stages designed to convert 3D scenes into the 2D images we see on our screens. This process involves several key steps, each handling a different aspect of the rendering task:

Application Stage: The journey begins with the CPU preparing and sending instructions along with 3D scene data (comprising geometric shapes, usually triangles or polygons, and textures) to the GPU. This stage sets the groundwork for rendering by defining the objects and their properties within the scene.
Vertex Processing: At this stage, vertex shaders process each vertex's attributes, such as position, color, and texture coordinates. The vertices are transformed from their original 3D space (world space) to a 2D projection on the screen (screen space) through a series of transformations. Lighting calculations are also performed to determine how light sources within the scene affect the color and brightness of vertices.
Tessellation: An optional but powerful stage in GPUs that dynamically adds detail to objects based on the viewer's distance. It subdivides coarse mesh into finer polygons, enabling higher visual fidelity without excessively burdening the GPU with complex models that are far away and less noticeable.
Geometry Shading: This stage allows for the manipulation of geometry. Geometry shaders can add or modify vertices and primitives (the basic shapes that form 3D models), enabling effects like an explosion, grass swaying in the wind, or even the generation of complex shapes on the fly without burdening the CPU.
Rasterization: A conversion process that transforms the 3D geometric representations into pixels (or fragments) on a 2D screen. It determines which pixels on the screen are covered by each primitive, preparing them for further processing. This stage also involves clipping, where only the parts of the scene within the camera's view are processed.
Fragment Processing: Also known as pixel shading, this stage calculates the final color of each pixel by applying textures, shading effects, and lighting models. Advanced effects like bump mapping, reflections, shadows, and transparency are applied here, significantly contributing to the realism of the scene.
Output Merger: The concluding stage of the pipeline, where all the processed fragments are combined to form the final image. It resolves which fragments are visible (through depth testing) and how they blend with others (alpha blending), producing the pixels that will be displayed on the screen. Throughout these stages, the GPU employs parallel processing, allowing vast numbers of vertices and pixels to be processed simultaneously.
Enhanced realism: Utilizing RT cores, GPUs can now trace the paths of individual light rays in real time, allowing for the creation of highly realistic images with accurate shadows, reflections, and refractions. This method mimics the physical properties of light, significantly enhancing the visual quality of 3D environments.
Global Illumination: Complementing real-time ray tracing, global illumination algorithms simulate the complex behavior of light as it bounces off multiple surfaces before reaching the observer. This technique adds depth and realism to scenes by accurately portraying how light diffuses across different materials and textures.
Tensor Cores: Specialized Tensor Cores expedite matrix operations, crucial in deep learning and AI applications. By performing mixed-precision arithmetic, these cores enable rapid computation and efficient power usage, which is important for processing large neural networks and other AI models.
Deep Learning Super Sampling (DLSS): DLSS employs AI to intelligently upscale lower-resolution images in real time. This process allows for smoother frame rates and enhanced visual quality, showcasing how AI can revolutionize rendering techniques by optimizing performance without compromising on image detail.
Raster Operations Pipelines (ROPs): ROPs are critical in the final image composition. They manage the last stage of rendering, where fragment shader outputs are merged, depth and stencil testing are conducted, and the final pixel values are written to the frame buffer. Operations like blending and anti-aliasing are handled here, ensuring the visual output is both accurate and aesthetically pleasing.
Texture Mapping Units (TMUs): TMUs are responsible for applying textures to the 3D models, a process that involves filtering and mapping texture data onto the surfaces of objects. This stage is vital for adding detail and realism to the scene, as textures give objects their color, appearance, and surface qualities.

Additional features

Adaptive Shading: This technology optimizes rendering workload by varying shading rates across different areas of the scene, focusing on processing power where it's most needed. This can lead to performance improvements without noticeable loss in visual quality.
Mesh Shading: A newer approach that allows the GPU to more efficiently process large amounts of geometry. By offloading complex culling and geometry processing tasks to the GPU, mesh shading can significantly improve performance in scenes with dense geometric detail.
Variable Rate Shading (VRS): VRS allows GPUs to allocate varying amounts of shading resources to different areas of the frame based on their visual complexity or importance, optimizing performance by reducing the detail in less noticeable areas.

Software and algorithmic optimizations

Hardware doesn't quite account for the entire process of the inner workings of a GPU. The result you see on the screen comes from the synergy between hardware and software.

Graphics APIs and shader languages

DirectX 12, Vulkan, and CUDA: These APIs provide low-level access to GPU resources, enabling developers to craft highly optimized code that taps into the full potential of the GPU. DirectX 12 and Vulkan, in particular, offer fine-grained control over hardware resources, facilitating more efficient execution of graphics and compute tasks. CUDA, while focused on NVIDIA GPUs, provides a rich set of programming tools and libraries specifically designed for GPU-accelerated applications.

Parallel computing frameworks and libraries

CUDA and OpenCL: CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that extends the power of their GPUs to general-purpose computing. It allows developers to use C, C++, and Fortran to develop software that can run on NVIDIA GPUs. OpenCL (Open Computing Language) is an open standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices, and embedded platforms. OpenCL provides a framework for writing programs that execute across heterogeneous platforms, including CPUs, GPUs, DSPs (Digital Signal Processors), and more.
TensorFlow and Other Libraries: TensorFlow is an open-source machine learning library developed by Google, which can leverage GPUs to accelerate neural network training and inference. It considers the complexities of parallel computation, making it easier for developers to implement and scale machine learning models. Other libraries and frameworks, such as PyTorch and Microsoft's CNTK, similarly support GPU acceleration, further increasing access to high-performance computing resources for AI research and development.

👁 An image showing an AMD Radeon RX 7900 XTX GPU that's installed on a test bench.
👁 An image showing a Zotac Gaming GeForce RTX 4070 Super Trinity Black Edition GPU installed on a computer.power.
👁 An image showing the Zotac branding on its RTX 4070 Super Trinity Black Edition GPU backplate.
👁 An image showing the backplate of the Zotac Gaming GeForce RTX 4070 Super Trinity Black Edition GPU.
👁 An image showing the top of the Zotac Gaming GeForce RTX 4070 Super Trinity Black Edition GPU.
👁 An image showing a person holding the Zotac Gaming GeForce RTX 4070 Super GPU to show its side profile.

Bringing things together

The versatility of GPUs has expanded significantly, impacting machine learning, scientific computing, and beyond. Originally designed for accelerating 3D graphics, GPUs now play a crucial role in deep learning by efficiently parallelizing matrix multiplications and significantly cutting down neural network training times. Recent technological advancements have introduced features like real-time ray tracing for realistic lighting in graphics and AI-driven optimizations for tasks such as image upscaling and noise reduction. This evolution underscores the GPU's transition from a graphics-focused unit to a multifaceted processor driving advancements in various fields, including artificial intelligence and scientific research, highlighting its essential role in modern computing.

URL: https://www.xda-developers.com/how-does-a-graphics-card-actually-work/

⇱ How do graphics cards work under the hood to bring your games to life?