Distributed Deep Learning

HAL: Computer System for Scalable Deep Learning

V. Kindratenko, D. Mu, Y. Zhan, J. Maloney, S. Hashemi, B. Rabe, K. Xu, R. Campbell, J. Peng, and W. Gropp.

My Contributions

  • Distributed training on HAL with PyTorch and NVIDIA Apex

  • ImageNet benchmark experiments for performance analysis

  • Member of the NCSA HAL cluster admin team (now called NCSA CAII)

  • NCSA HAL cluster tutorial series on distributed deep learning