SC16 Salt Lake City, UT

75. Toward Portable Machine Learning Kernels for Deep Neural Networks with Autotuning on Top of OpenCL and High Bandwidth Memory GPUs


Authors: Yaohung Tsai (University of Tennessee), Piotr Luszczek (University of Tennessee), Jakub Kurzak (University of Tennessee), Jack Dongarra (University of Tennessee)

Abstract: We present a portable, highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach combines, in novel ways, existing HPC techniques such as autotuning, data layout, and low-level optimizations that, when applied simultaneously, achieve performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries. The former was done inside the maxDNN implementation, and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs to be restarted when it stagnates due to, among other reasons, diminishing gradients and getting stuck in local minima. For our performance tests we used a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack, which can match a number of server-grade devices at a fraction of the price; this attests to the portability of our approach and implementation.

Poster: pdf
Two-page extended abstract: pdf

