Presentation
Scalable Neural Network Training: Distributed Data-Parallel Approaches
Description
Training large neural networks is computationally demanding and often limited by synchronization overhead in distributed environments. Traditional data-parallel frameworks, such as Horovod or PyTorch DDP, average gradients at every batch, which can limit scalability due to communication bottlenecks.
In this work, we propose two novel data-parallel strategies that reduce synchronization by averaging weights and biases only at the end of each epoch. These methods are implemented using the PyCOMPSs task-based programming model and integrated into dislib, enabled by a new distributed tensor abstraction (ds-tensor) that supports multidimensional data structures suitable for deep learning workloads.
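The core idea of epoch-level synchronization can be illustrated with a minimal sketch: each worker trains independently for a full epoch, and only then are the per-layer weights and biases averaged across workers. The function and variable names below are illustrative, not the actual dislib/PyCOMPSs API; in the real system each averaging step would run as a distributed PyCOMPSs task over ds-tensor blocks.

```python
import numpy as np

def average_parameters(worker_params):
    """Epoch-end synchronization: elementwise-average each layer's
    weights and biases across workers. `worker_params` is a list of
    per-worker parameter lists (one array per layer). This is a local
    sketch; the proposed strategies execute the averaging as PyCOMPSs
    tasks on distributed ds-tensors rather than in-process."""
    n = len(worker_params)
    return [sum(layer) / n for layer in zip(*worker_params)]

# Two hypothetical workers, each holding one weight matrix and one bias.
w0 = [np.array([[1.0, 2.0]]), np.array([0.0])]
w1 = [np.array([[3.0, 4.0]]), np.array([2.0])]

avg = average_parameters([w0, w1])
# avg[0] -> [[2.0, 3.0]], avg[1] -> [1.0]
```

Because this averaging happens once per epoch rather than once per batch, communication volume drops by a factor roughly equal to the number of batches per epoch, at the cost of workers' models diverging between synchronization points.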
We evaluate our approach on classification and regression tasks using real-world datasets and federated learning scenarios. Results show up to 95% training time reduction and strong scalability up to 64 workers, while maintaining or improving model accuracy. Our strategies enable asynchronous, communication-efficient training and are well-suited for heterogeneous and large-scale HPC systems.


