High-Performance Computing in Deep Learning: Distributed Training Strategies for Transformer Models in Natural Language Processing
Keywords:
distributed training, transformer models, heterogeneous clusters, gradient sparsification, fault tolerance
Abstract
Distributed training of large Transformer models is increasingly conducted on heterogeneous high-performance computing (HPC) clusters, where variability in compute capacity and network topology degrades efficiency and stability. Existing systems rely on static partitioning or uniform gradient compression, leading to communication bottlenecks, suboptimal convergence, and poor fault tolerance. To address these limitations, we propose an adaptive distributed training framework that integrates topology-aware model placement, layer-wise adaptive sparsification based on gradient variance, and error feedback with hybrid parallelism. Evaluated on a 1.3-billion-parameter Transformer across 32 GPUs (including RTX 4090 and V100), our method achieves a throughput of 2,268 ± 29 samples/sec (23.1% higher than Megatron-LM) and reduces time to target validation loss (<2.85) to 12.8 ± 0.2 hours (12.9% shorter than Megatron-LM and 25.1% shorter than DeepSpeed ZeRO-2; p < 0.001). Communication volume is lowered to 2.03 ± 0.02 GB/step (approximately 58% lower than Megatron-LM), and the robustness score reaches 0.92 ± 0.01. The approach maintains competitive out-of-domain perplexity (PubMed: 14.2; GitHub: 18.7) and recovers from 5% node failures in 30 ± 3 steps. These results demonstrate a practical path toward efficient, stable, and deployable large-model training in shared, heterogeneous infrastructure.
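The core compression idea named in the abstract, layer-wise adaptive sparsification driven by gradient variance combined with error feedback, can be sketched as follows. This is an illustrative minimal version, not the paper's implementation: the keep-ratio bounds (`base_ratio`, `max_ratio`), the linear variance-to-ratio mapping, and the function name `adaptive_sparsify` are all assumptions introduced here for clarity.

```python
import numpy as np

def adaptive_sparsify(grads, residuals, base_ratio=0.01, max_ratio=0.10):
    """Layer-wise top-k sparsification with error feedback (illustrative sketch).

    grads, residuals: dicts mapping layer name -> flat gradient array.
    Layers with higher gradient variance keep a larger fraction of entries;
    the dropped remainder is stored and re-added at the next step.
    """
    # Normalize each layer's gradient variance to pick a keep-ratio
    # in [base_ratio, max_ratio] (linear mapping, assumed for illustration).
    variances = {name: float(np.var(g)) for name, g in grads.items()}
    vmax = max(variances.values()) or 1.0
    sparse, new_residuals = {}, {}
    for name, g in grads.items():
        # Error feedback: fold in the residual dropped on the previous step.
        corrected = g + residuals.get(name, np.zeros_like(g))
        ratio = base_ratio + (max_ratio - base_ratio) * variances[name] / vmax
        k = max(1, int(ratio * corrected.size))
        # Keep the k largest-magnitude entries; zero out the rest.
        idx = np.argpartition(np.abs(corrected), -k)[-k:]
        kept = np.zeros_like(corrected)
        kept[idx] = corrected[idx]
        sparse[name] = kept
        # What was not transmitted becomes the residual for the next step.
        new_residuals[name] = corrected - kept
    return sparse, new_residuals
```

In this sketch, only the `sparse` dict would be communicated (e.g. all-reduced), which is how variance-adaptive sparsification can reduce per-step communication volume; error feedback preserves the dropped mass so convergence is not biased by the compression.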
Published
2026-02-17