Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP
By Jakub Antkiewicz
2026-06-11T12:11:18Z
PyTorch Profiling Reveals Limits of Compiling Single Operations
A deep-dive analysis of PyTorch's fundamental `nn.Linear` module finds that `torch.compile` offers negligible performance gains for this operation in isolation. The finding, demonstrated on an NVIDIA A100 GPU, challenges the common practice of reflexively compiling individual model components and underscores the importance of profiling to identify true bottlenecks. Eager-mode execution of `nn.Linear` is already highly optimized: it uses a single fused kernel that handles both the matrix multiplication and the bias addition, leaving little for the compiler to improve.
The technical investigation shows that `nn.Linear` in eager mode dispatches a single `aten::addmm` operation, which uses a specialized cuBLAS GEMM kernel. This kernel incorporates the bias addition as an "epilogue": a final computation performed on the result before it is written to memory, which avoids a second pass over the output just to add the bias. When `torch.compile` is applied, the GPU executes the exact same kernel. The only optimization is on the CPU side, where the compiler eliminates an `aten::t` (transpose) call by pre-computing the weight tensor's memory strides. This saves a few microseconds of CPU dispatch overhead but does not change the GPU's workload.
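The dispatch described above can be checked with `torch.profiler`. The following is a minimal sketch, not the article's own benchmark: the layer sizes and batch size are arbitrary assumptions, and it falls back to the CPU when no GPU is present, in which case no cuBLAS kernel is involved and only the ATen-level op names are meaningful.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
layer = nn.Linear(64, 32).to(device)
x = torch.randn(8, 64, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    layer(x)  # warm-up, so one-time initialization is not profiled
    with profile(activities=activities) as prof:
        out = layer(x)
        if device == "cuda":
            torch.cuda.synchronize()  # make sure GPU work is captured

# Names of every event the profiler recorded (ATen ops, plus kernels on GPU).
op_names = {evt.key for evt in prof.key_averages()}
print(sorted(name for name in op_names if name.startswith("aten::")))

# The layer's output matches one addmm call: bias + x @ weight^T.
reference = torch.addmm(layer.bias, x, layer.weight.t())
matches_addmm = torch.allclose(out, reference, atol=1e-5)
print("matches a single addmm:", matches_addmm)
```

On a GPU, `prof.key_averages().table(sort_by="cuda_time_total")` additionally lists the kernel names, which is where the single GEMM kernel would be expected to appear.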
- Eager Mode Efficiency: `nn.Linear` already uses a fused `addmm` kernel with a bias epilogue, not separate matmul and add kernels.
- Compiler Impact: `torch.compile` does not change the GPU kernel for a single `nn.Linear` layer.
- CPU vs. GPU Work: The primary benefit of compiling a single `nn.Linear` is the removal of a CPU-side metadata operation (`aten::t`), not a reduction in GPU compute time.
- True Fusion: Significant performance gains are observed when compiling larger graphs, like a multi-layer perceptron (MLP), where PyTorch's Inductor backend can fuse multiple distinct operations (e.g., GEMMs, activations, element-wise ops) into fewer kernels.
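To see why the `aten::t` call mentioned in the list counts as CPU-side metadata work rather than GPU compute, it helps to look at what a transpose does to a tensor. The sketch below uses a standalone tensor with an assumed shape of `(32, 64)`, mirroring how `nn.Linear` stores its weight as `(out_features, in_features)`; it suggests that the transpose only swaps strides and copies no data.

```python
import torch

# nn.Linear stores its weight as (out_features, in_features);
# the shape here is an arbitrary illustration.
weight = torch.randn(32, 64)
weight_t = weight.t()  # the same op that shows up as aten::t in a profile

# The transpose is a view: both tensors point at the same memory.
shares_storage = weight_t.data_ptr() == weight.data_ptr()
print("shares storage:", shares_storage)

# Only the shape and strides differ.
print(weight.shape, weight.stride())      # torch.Size([32, 64]) (64, 1)
print(weight_t.shape, weight_t.stride())  # torch.Size([64, 32]) (1, 64)
```

Because no elements move, removing this call saves dispatch time on the CPU but leaves the GPU's work unchanged.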
This analysis encourages a more nuanced approach to performance optimization within the AI developer community. Rather than applying `torch.compile` indiscriminately, the findings suggest its power lies in holistic graph optimization. For developers building models in PyTorch, the lesson is to profile first and focus compilation efforts on sequences of operations or entire models, where kernel fusion can genuinely reduce memory traffic and GPU kernel-launch overhead. This shifts the performance narrative from optimizing individual building blocks to optimizing the connections between them.
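A "profile first" workflow might look like the sketch below, which counts the ATen ops issued by one forward pass of a small eager-mode MLP. The helper name `count_aten_ops`, the two-layer architecture, and the sizes are all hypothetical choices for illustration, not taken from the original analysis.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


def count_aten_ops(module, x):
    """Profile one forward pass and return {ATen op name: call count}."""
    activities = [ProfilerActivity.CPU]
    if x.is_cuda:
        activities.append(ProfilerActivity.CUDA)
    with torch.no_grad():
        module(x)  # warm-up (for a compiled module, this triggers compilation)
        with profile(activities=activities) as prof:
            module(x)
            if x.is_cuda:
                torch.cuda.synchronize()
    return {
        evt.key: evt.count
        for evt in prof.key_averages()
        if evt.key.startswith("aten::")
    }


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Hypothetical two-layer MLP: Linear -> GELU -> Linear.
mlp = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64)).to(device)
x = torch.randn(8, 64, device=device)

eager_counts = count_aten_ops(mlp, x)
for name in ("aten::addmm", "aten::gelu"):
    print(name, eager_counts.get(name, 0))
```

In eager mode, each `nn.Linear` issues its own `addmm` and the activation runs as a separate op between them, which is the kind of sequence a graph compiler can work on. Assuming a working compiler toolchain, the same helper can be pointed at `torch.compile(mlp)` for comparison, though the op names reported for a compiled module are not guaranteed to line up one-to-one with eager mode; on a GPU, the kernel entries in the profiler table are the more direct evidence of fusion.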
The efficiency of modern deep learning libraries means that performance bottlenecks are rarely in single, optimized operations like `nn.Linear`. Significant gains from compilers like PyTorch's Inductor are realized by fusing multiple operations across a computational graph, not by re-optimizing already-fused kernels.