Semiconductor research firm SemiAnalysis has conducted a systematic micro-benchmarking study of NVIDIA's Blackwell GPU architecture, which represents one of the most significant GPU microarchitecture changes in recent years yet still lacks a comprehensive official white paper. The study, conducted over several months, publicly reveals for the first time the hardware performance limits of the architecture under AI workloads.
Test results indicate that the Blackwell architecture approaches theoretical peak performance in key areas such as Tensor Core throughput, memory subsystem bandwidth, and the new 2SM MMA instructions. However, performance is highly dependent on instruction shape configuration, with noticeable bandwidth bottlenecks in certain scenarios. This finding provides a direct reference for AI infrastructure investors and chip purchasers, and highlights that fully realizing the architecture's potential depends on meticulous software-level optimization.
SemiAnalysis has open-sourced the relevant benchmark code repository. The B200 nodes used for testing were provided by Nebius and Verda. The research team also announced plans to extend benchmarking to TPU Pallas kernels, Trainium NKI kernels, and AMD CDNA4 assembly in subsequent phases.
A core architectural change is the introduction of Tensor Memory (TMEM) for storing MMA accumulators, alongside the new 2SM MMA instructions. In moving from Hopper to Blackwell, NVIDIA made several important adjustments to the PTX abstraction layer for MMA-related instructions, with TMEM the most notable. In previous architectures, threads implicitly held MMA operation results; on Blackwell, software explicitly manages TMEM within the MMA scope, altering the ownership relationship between threads and computation results.
Concurrently, tcgen05 operations are now issued by a single thread on behalf of the entire CTA (Cooperative Thread Array), whereas on Hopper they were issued per warp or warpgroup. This change is directly reflected in CuTe MMA atoms. Blackwell also introduces TPC-scoped TMA and MMA, allowing two cooperative CTAs to execute tcgen05.mma across SMs while sharing operands; this yields MMA instructions with higher arithmetic intensity while reducing per-CTA shared memory bandwidth demands. The architecture natively supports sub-byte data types with micro-scaling and introduces Cluster Launch Control (CLC) as hardware support for dynamic work scheduling in persistent CTA kernels.
Through reverse engineering, SemiAnalysis uncovered the physical topology of the B200 chip. By reading the PTX %smid special register while launching clusters of varying sizes, the team inferred the SM-to-GPC (Graphics Processing Cluster) mapping. Results indicate that some TPCs in the B200 are exclusive to logical GPCs and are never co-scheduled with other TPCs.
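The inference step described above can be sketched as a grouping problem: CTAs of one cluster must land in one GPC, so every set of SM ids observed hosting a single cluster belongs to the same GPC, and repeated launches let the groups be merged. A minimal sketch with union-find follows; the observation data is hypothetical, not real B200 measurements.

```python
# Hedged sketch of SM -> GPC inference from cluster co-residency.
# The %smid samples below are invented for illustration only.

class DisjointSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # path halving keeps trees shallow
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Each entry: the set of SM ids (%smid values) that hosted the CTAs of
# one cluster launch. All SMs in one entry share a GPC.
observations = [
    {0, 2, 4, 6},   # hypothetical placements
    {4, 8, 10},
    {1, 3, 5},
    {5, 7, 9},
]

ds = DisjointSet()
for obs in observations:
    first = next(iter(obs))
    for sm in obs:
        ds.union(first, sm)

groups = {}
for sm in {s for obs in observations for s in obs}:
    groups.setdefault(ds.find(sm), set()).add(sm)

print(sorted(sorted(g) for g in groups.values()))
# -> [[0, 2, 4, 6, 8, 10], [1, 3, 5, 7, 9]]: two inferred GPC groups
```

With enough launches of varying cluster sizes, the merged groups converge to the logical GPC partition, which is the mapping the study reports.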
By having each SM traverse pointer-chasing arrays that fill the L2 cache and measuring access latency between SMs, the team constructed an inter-SM distance matrix. This matrix clearly revealed two distinct groups of SMs, with an average L2 access latency difference exceeding 300 clock cycles, corresponding to the cross-die access penalty between the two dies.
Based on this, the research team inferred the die-level TPC distribution for the B200 as follows: Die A contains GPCs with 10, 10, 10, and 9 TPCs respectively; Die B contains GPCs with 9, 9, 9, and a split of 5+3 TPCs. This physical layout difference means that even two GPUs with identical logical configurations can have different physical SM distributions, constituting a potential source of performance non-determinism.
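The inferred layout can be sanity-checked against the B200's published SM count of 148 (2 SMs per TPC); the per-die GPC figures quoted above should sum to 74 TPCs. A quick check, assuming that 148-SM figure:

```python
# Consistency check of the inferred die-level TPC distribution.
# Assumes B200's published total of 148 SMs (74 TPCs x 2 SMs/TPC).
die_a_gpcs = [10, 10, 10, 9]       # TPCs per GPC, die A
die_b_gpcs = [9, 9, 9, 5 + 3]      # die B; last GPC split 5+3

tpcs = sum(die_a_gpcs) + sum(die_b_gpcs)
sms = tpcs * 2
print(tpcs, sms)  # 74 148
```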
Memory subsystem testing focused on two types of asynchronous copy instructions: LDGSTS and TMA. For LDGSTS, testing covered typical configurations of the FlashInfer multi-head attention (MHA) kernel. Results show that LDGSTS memory throughput saturates at 32 KiB of in-flight bytes, with a peak of approximately 6.6 TB/s. Latency tests revealed a baseline LDGSTS latency of about 600 nanoseconds, which nearly doubles when in-flight bytes exceed 8 KiB, attributed to numerous threads stalling due to MIO (Memory Input/Output) throttling.
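The reported saturation point is roughly what Little's law predicts: bytes in flight needed to saturate equals bandwidth times latency. A back-of-envelope check, assuming the 6.6 TB/s aggregate divides evenly across 148 SMs (an assumption for illustration, not a measured per-SM figure):

```python
# Little's law estimate: in-flight bytes = per-SM bandwidth x latency.
peak_bw = 6.6e12        # B/s, measured aggregate LDGSTS throughput
n_sms = 148             # B200 SM count (assumed even split of bandwidth)
latency = 600e-9        # s, baseline LDGSTS latency

per_sm_bw = peak_bw / n_sms
inflight = per_sm_bw * latency
print(f"{inflight / 1024:.1f} KiB per SM")  # ~26 KiB, same order as the 32 KiB saturation point
```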
TMA reaches peak throughput at significantly larger in-flight sizes than LDGSTS. Below 32 KiB of in-flight data, LDGSTS asynchronous copies achieve slightly higher throughput than TMA; beyond this threshold, TMA catches up and continues to scale up to 128 KiB. On latency, LDGSTS holds a slight advantage below 12 KiB of in-flight data, beyond which its latency rises substantially and TMA pulls ahead. TMA multicast testing showed that explicit TMA multicast perfectly eliminates redundant L2 traffic, achieving the ideal "1/cluster size" L2 byte ratio.
Tensor Core performance testing, a core part of the study, revealed a high sensitivity of Blackwell MMA performance to instruction shape. For throughput, 1SM MMA with M=64 configuration reaches only 50% of theoretical peak, while M=128 approaches 100%, confirming that M=64 utilizes only half the data path. For 2SM MMA, M=128 with N=64 achieves 90% of peak throughput, with other N sizes nearing 100%. M=256 maintains near-peak throughput across all configurations as it is equivalent to each SM processing M=128, fully utilizing the data path.
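The M=64 result is consistent with a simple utilization model in which the 1SM Tensor Core data path is 128 rows wide, so an instruction with M rows can occupy at most M/128 of it. This model is an inference from the measurements above, not an official specification:

```python
# Minimal data-path utilization model for 1SM MMA, assuming a
# 128-row-wide Tensor Core data path (inferred, not documented).
def datapath_utilization(m: int, width: int = 128) -> float:
    return min(m, width) / width

print(datapath_utilization(64))   # 0.5 -> ~50% of peak, as measured
print(datapath_utilization(128))  # 1.0 -> ~100% of peak
```

The same logic explains why M=256 on 2SM MMA stays at peak: each SM effectively processes M=128, filling its data path.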
The layout of input matrices A and B also has a significant impact. When both input matrices are stored in shared memory (SS mode), M=128 with N<128 shows a clear SMEM bandwidth bottleneck. For example, with FP16, the hardware can execute 8192 MMA FLOPs per cycle, while SMEM bandwidth is 128 B/cycle. Calculations show that for an M=128 N=64 K=16 configuration, SMEM requires 48 cycles, while math operations need only 32 cycles, indicating the instruction is limited by SMEM bandwidth. This pattern holds for all data types: MMA instructions with both operands in SMEM are constrained by SMEM bandwidth when N<128.
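The 48-versus-32-cycle claim can be reproduced directly from the figures in the paragraph, assuming SMEM traffic counts both operands (A is MxK, B is NxK, 2 bytes per FP16 element):

```python
# Worked check of the SMEM-bound example: FP16, SS mode,
# M=128, N=64, K=16, 128 B/cycle SMEM bandwidth, 8192 FLOP/cycle.
M, N, K = 128, 64, 16
bytes_per_elem = 2                               # FP16

smem_bytes = (M * K + N * K) * bytes_per_elem    # operands A and B
smem_cycles = smem_bytes / 128                   # 128 B/cycle SMEM bandwidth
math_cycles = 2 * M * N * K / 8192               # 2*M*N*K FLOPs at 8192 FLOP/cycle

print(smem_cycles, math_cycles)  # 48.0 32.0 -> SMEM-bound
```

Doubling N to 128 doubles the math cycles while adding only operand-B bytes, which is why the bottleneck disappears at N>=128.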
The 2SM MMA achieves perfect weak scaling: with twice the computational resources, it delivers a 2x speedup over 1SM MMA. In SS mode with small shape configurations, the speedup even exceeds 2x because operand B is partitioned across the two SMs. The conclusion is clear: for a given SMEM tile size, always use the largest available instruction shape to achieve maximum throughput.
Regarding latency, it increases linearly as N grows from 64 to 128 across all configurations, with a jump at N=256, and the ordering by data type is consistent. Tests with realistic instruction queues show that with 1 to 4 in-flight MMA instructions, the typical scenario for kernels, the throughput ceiling at 4 in-flight MMAs is about 78% to 80% of theoretical peak, with 1SM MMA roughly 5 percentage points higher than 2SM MMA.