Lyu 2019
Authorized licensed use limited to: Carleton University. Downloaded on September 20,2020 at 17:51:46 UTC from IEEE Xplore. Restrictions apply.

Abstract—As deep learning applications become popular, the design of deep learning systems is critical to unleashing the computing power of the underlying hardware. Aside from the computing hardware, the network is also a key factor that affects the delivered performance. For a large and complex model, the scalability of the system depends heavily on the design of the networks as well as on software behavior. In this paper, we propose a profile-data-guided performance prediction method that estimates the performance of a system with the desired high-speed interconnects, based on profiling data obtained in a previous run. In particular, we leverage the open-source profiling tool SOFA to characterize the activities of deep learning software running on a computer cluster, and the characterized information is used to build the performance model for the model training process. When estimating the performance, SOFA is used to capture the performance-critical factors that the model needs to make its predictions. To evaluate the proposed method, four popular deep learning models, ResNet50, Inception3, Alexnet, and VGG16, are adopted in our experiments, and a four-node computer cluster is used to profile the training of these models on TensorFlow. We ran a scalability analysis to study the cluster size and the suitable networks for the models. Comparing the predicted data with those measured on the cluster, our model achieves 95% or higher accuracy in most cases, with a maximum error rate of 10%.

Index Terms—Profiling tool, Deep learning, Distributed training, Timing Model, Network.

I. INTRODUCTION

With the success of deep learning methods in diverse applications, various computer systems have been created to facilitate deep learning computations, whose performance is often boosted by multicore processors and/or acceleration hardware on each machine node. In particular, the communication networks of such systems are specially designed to rapidly transfer the large amounts of data required by the computations. For instance, high-speed Ethernet and InfiniBand are often used for inter-node interconnections, whereas Peripheral Component Interconnect Express (PCIe) is adopted as the interconnect for the acceleration hardware within a machine node. One notable example is the DGX series of systems developed by NVIDIA, where 100 Gb InfiniBand and the custom-designed NVLink are used for inter- and intra-node communications, respectively.

It is challenging to design a computer system that satisfies the requirements of the target applications while meeting the project budget, since each application has distinct computation and communication patterns, and a mismatch between computing and communication resources often results in unwanted idle time. While adopting an over-designed system may avoid such inefficiency, it leads to huge ownership and operating expenses, as such systems usually consist of many machine nodes, each equipped with costly hardware. Therefore, estimating the delivered performance of a computer system under different configurations is important to avoid over-designing the system. While there have been several attempts to predict the performance of deep learning systems, they do not tackle the performance issues incurred by the communication network. PALEO [3] assumes a scaling constant to roughly estimate the inefficiency of the communication network. Machine-learning-based performance prediction is proposed in [14] and [6], where the inefficiency of the communication network is hidden inside the black box of machine learning. SCALE-Sim [11] is a cycle-accurate DNN accelerator simulator that models the microarchitectural features inside the accelerator, but the communication network outside the accelerator is not fully addressed at the system level.

In this work, we propose a hybrid approach that combines performance measurement and estimation to project the delivered performance of a computer system under different configurations. We leverage the profiling tool SOFA [10] to characterize the computation/communication patterns of the target application, and develop performance models that estimate the delivered performance from the obtained performance data. Our proposed method is independent of the deep learning framework used to create the application and can be applied to the major deep learning frameworks, such as TensorFlow [1], Mxnet, PyTorch, and Caffe2 [5].

Based on data profiled on a single machine node equipped with an x86-based multicore processor and two NVIDIA GPUs, our experimental results show that the performance models are able to project the performance when the system adopts more machine nodes (i.e., two and four nodes) and different high-speed interconnect networks (i.e., 10/40/100 Gigabit Ethernet and InfiniBand) for four popular deep learning models, resnet50 [4], inception3 [13], vgg16 [12], and alexnet [7], with an error rate of 5% in most scenarios and a maximum error rate of 10% against the data measured on real systems. The results show that the proposed method is able to predict the delivered performance of the training processes of various deep learning models and is useful for exploring configurations when designing decentralized systems that meet the application requirements.

The rest of the paper is organized as follows. Section II introduces distributed deep learning and the performance profiling tool SOFA. Section III elaborates on the performance estimation models. Section IV evaluates and discusses the performance of our proposed modeling techniques with the four popular DNN models. Finally, we conclude our work and summarize future work in Section V.

II. BACKGROUND

In this section, we introduce the performance characteristics of deep learning applications and the system-wide performance profiling tool SOFA (Swarm-Oriented Function Calls Analysis), which aims at analyzing program runtime behaviors on top of a complex system consisting of a deep software stack and heterogeneous computing hardware.

... our analysis, the hot-spots of the deep learning computations are the forward and backward propagation, which cost 99% of the time [15]. With the performance profile in mind, we can choose the best possible hardware to accelerate the computation, provided the selection fits the system design requirements.

On the other hand, the parameters of deep learning models are worth investigating, since the parameters occupy memory and the computation cannot be performed efficiently without enough memory space. Furthermore, when a computer cluster is used for the computations, the updated parameter values are moved among the computing hardware within a node or across machine nodes before the computation of the next training iteration. Hence, we analyzed the structures of the deep learning models and their parameters. It is interesting to note that the size of the parameters is not positively correlated with the number of layers of a given model. According to Table I, Vgg16 has fewer layers than Resnet50, but its parameter size is larger. This is because Vgg16 contains a big fully-connected layer that holds most of its parameters. Discovering the network structure of the given model and the size of its parameters helps characterize the communication behaviors and further facilitates the modeling of the delivered performance of the system.

TABLE I
LAYERS AND PARAMETER SIZES OF FOUR POPULAR DNN MODELS
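The remark that one fully-connected layer dominates Vgg16's parameter count can be checked with a short calculation. The sketch below is only an illustration: the layer shapes are an assumption (the commonly used 16-layer VGG configuration with 3x3 convolutions and a 224x224 input) and are not taken from Table I, and the helper names are hypothetical.

```python
# Assumed layer shapes of the standard VGG16 configuration (not from Table I).

def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution layer."""
    return c_in * c_out * k * k + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# (input channels, output channels) of the 13 convolution layers.
conv_channels = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
# (inputs, outputs) of the 3 fully-connected layers.
fc_shapes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(i, o) for i, o in conv_channels)
fc_counts = [fc_params(i, o) for i, o in fc_shapes]
total = conv_total + sum(fc_counts)
first_fc_share = fc_counts[0] / total

print(f"total parameters: {total}")
print(f"first FC layer:   {fc_counts[0]} ({first_fc_share:.1%} of the model)")
```

Under these assumed shapes, the first fully-connected layer alone accounts for roughly three quarters of the parameters, which is why the parameter volume to be communicated does not track the layer count.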
... mpstat, blktrace, and pyflame. perf_events correlates the occurred hardware and software events with the resolved function symbols, and the performance data helps pinpoint the code sites for further performance analysis. tcpdump is used for tracking inter-process or inter-processor communication activities. nvprof is the proprietary tool offered by NVIDIA to track GPU-related runtime activities and performance information. mpstat gives the processor status information, and vmstat provides the virtual memory statistics. blktrace collects detailed information about disk I/O as it goes to the block device layer. pyflame allows profiling a Python process without modifying the Python code.

III. METHODOLOGY

Our performance model takes three kinds of input data to make the predictions: 1) the neural network model description, for the model size and connections of trainable parameters, 2) the GPU computation time of the iteratively executed program parts, and 3) the hardware specification of the target system, considered as the input parameters to the performance model, such as the bandwidth of the interconnection network, the bus bandwidth between a processor and the system memory, and the number of machine nodes. The first is obtained by calling the APIs provided by the deep learning frameworks, the second is provided by SOFA, and the third is supplied manually by the developers. A concrete example of the inputs to the model is illustrated in Figure 1.

In the following subsections, we characterize the execution flow of deep learning models with SOFA, which helps identify the key execution stages of deep learning software. Moreover, we analyze the data communications taking place among the machine nodes during the computations, as well as the communications among the processing hardware within a machine node. Based on the above information, we introduce the performance model that we build for performance predictions.

Fig. 1. The Flow of Performance Estimation

A. Discovering Deep Learning Software Execution Stages with SOFA

We give a concrete example to illustrate the performance profiling data provided by SOFA. In particular, the parameter server scheme [2] [8] is adopted in the experiment, where a parameter server (denoted as machine: 101) and a computation machine node (denoted as machine: 100) are involved. Our experimental results show that SOFA reveals repeated patterns. Figure 2 gives an impression of the data generated by SOFA by visualizing the performance trace of one iteration of the model training process, where the occurrences of the tracked events are represented by dots in different colors. For ease of understanding, the interpreted logical steps of the training process are written at the bottom of the figure.

As can be seen in the figure, a training iteration starts by transferring the parameters from the server to the computation node (dark gray dots). Then, within the computation node, the data is copied from the CPU to the GPU (red dots). During the GPU computations of the forward and backward propagation (green dots), data is exchanged between the GPUs (purple dots). Next, when the GPU computation is done, the resulting data is copied back to the CPU (brown dots). Finally, at the end of the iteration, the intermediate data is transferred back to the parameter server from the computation node (dark gray dots).

A major advantage of SOFA is that each type of activity is represented by a distinct color, so developers can see the big picture of the software execution flow. From the differently colored dots, we identify the key execution stages within a training iteration (e.g., the order of the inter- and intra-node data movements and the forward and backward propagation, as described in the previous paragraph). Furthermore, we also learn that the backward propagation overlaps with the GPU-to-CPU communication. This is illustrated in Figure 2, where the brown dots are on top of the green dots. The phenomenon is also reported in previous work [15], and the concurrent execution of the GPU computation (for backward propagation) and the data movements from the GPU to the CPU is supported by CUDA streams.

In addition to the visualized data, SOFA also reports summarized performance information. For example, the execution time taken by the GPU can be read from the GPU kernel events (green dots) in Figure 2, and it is about 0.25 seconds according to the horizontal time axis. The GPU execution time per iteration can also be obtained by dividing the total GPU execution time by the number of iterations. For example, Table II lists the performance data of the training processes for the four deep learning models, Resnet50, Inception3, Alexnet, and Vgg16. The number of images processed per second in Resnet50 is stable after 100 iterations. It takes 50.10 seconds for 100 iterations and 55.16 seconds for 120 iterations. Hence, it takes 5.06 seconds for 20 iterations, or 0.25 seconds for one iteration, which matches the result from the visualized data of SOFA. More data reported by SOFA are given in Table III regarding the training process of deep learning models across three NVIDIA GPUs, GTX 1080, Titan X, and Tesla P100, with CNN benchmarks [9].

B. Characterizing Communication Behaviors

As the sizes of model parameters affect the data transferring performance, we used TensorFlow to report the parameter
Fig. 2. The runtime activities of one iteration for the model training process, where the parameter server scheme is adopted.
TABLE II
PERFORMANCE PROFILES OF FOUR DEEP LEARNING MODELS ON GPU.

Model       Warm-up     Iterations  Execution  Img/sec  Iteration     Iterations
            iterations              time (s)            time (s) (C)  per second 1/(C)
Resnet50    100         100         50.10      122.72   0.25          3.95
            100         120         55.16      123.52
Inception3  100         100         70.49      87.75    0.34          2.9
            100         120         77.38      86.31
Alexnet     600         600         40.26      910.31   0.04          28.01
            600         720         44.54      907.47
Vgg16       100         100         75.91      86.55    0.37          2.73
            100         120         83.25      86.33
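The iteration time (C) in Table II can be reproduced by differencing the two runs of each model, which cancels the warm-up cost. The following is a minimal sketch using the execution times listed in the table; the function name is hypothetical and the calculation is our reading of how column (C) is derived.

```python
# (iterations, execution time in seconds) of the shorter and the longer run,
# copied from Table II.
RUNS = {
    "Resnet50":   ((100, 50.10), (120, 55.16)),
    "Inception3": ((100, 70.49), (120, 77.38)),
    "Alexnet":    ((600, 40.26), (720, 44.54)),
    "Vgg16":      ((100, 75.91), (120, 83.25)),
}

def steady_state_iteration_time(short_run, long_run):
    """Seconds per iteration; warm-up time cancels out in the difference."""
    (n_short, t_short), (n_long, t_long) = short_run, long_run
    return (t_long - t_short) / (n_long - n_short)

for model, (short_run, long_run) in RUNS.items():
    c = steady_state_iteration_time(short_run, long_run)
    print(f"{model:<10} {c:.3f} s/iteration, {1.0 / c:.2f} iterations/s")
```

For Resnet50 this gives (55.16 - 50.10) / 20, about 0.25 seconds per iteration, in line with the value read from the SOFA trace.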
TABLE III
TIME RATIO OF FORWARD PROPAGATION AND BACKWARD PROPAGATION.
TABLE V
COMPUTATION AND COMMUNICATION TIME FOR GPU.

model       HtoD data  Avg. achieved   Data comm.  DtoH data  Avg. achieved   Data comm.  Kernel time  Memcpy time
            (MB)       HtoD BW (GB/s)  time (A)    (MB)       DtoH BW (GB/s)  time (B)    (seconds)    (A)+(B)
resnet50    2044.73    5.90            0.35        2048.97    6.10            0.33        5.46         0.68
inception3  1905.56    5.90            0.32        1908.30    5.80            0.33        8.58         0.65
alexnet     5194.76    6.00            0.87        5194.76    6.50            0.80        0.80         1.66
vgg16       11068.93   6.00            1.85        11068.93   6.50            1.70        9.60         3.55
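The communication times (A) and (B) in Table V are consistent with dividing the transferred volume by the average achieved bandwidth. The sketch below recomputes them from the table; treating 1 GB as 1000 MB and reading the times as seconds are assumptions, and the small deviations from the table come from the rounded bandwidth figures.

```python
# model -> (HtoD MB, HtoD GB/s, DtoH MB, DtoH GB/s, kernel time in seconds),
# copied from Table V.
TABLE_V = {
    "resnet50":   (2044.73, 5.90, 2048.97, 6.10, 5.46),
    "inception3": (1905.56, 5.90, 1908.30, 5.80, 8.58),
    "alexnet":    (5194.76, 6.00, 5194.76, 6.50, 0.80),
    "vgg16":      (11068.93, 6.00, 11068.93, 6.50, 9.60),
}

def copy_time(megabytes, gigabytes_per_second):
    """Transfer time in seconds, assuming 1 GB = 1000 MB."""
    return megabytes / 1000.0 / gigabytes_per_second

for model, (h2d_mb, h2d_bw, d2h_mb, d2h_bw, kernel_s) in TABLE_V.items():
    a = copy_time(h2d_mb, h2d_bw)   # host-to-device time (A)
    b = copy_time(d2h_mb, d2h_bw)   # device-to-host time (B)
    print(f"{model:<10} (A)={a:.2f}s (B)={b:.2f}s memcpy={a + b:.2f}s "
          f"memcpy/kernel={(a + b) / kernel_s:.2f}")
```

The memcpy-to-kernel ratio printed in the last column makes the contrast visible: for alexnet the copies take longer than the kernels, whereas for resnet50 and inception3 they are a small fraction.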
[Figure 5 labels: PS0-WeightA, PS1-WeightB, Remote PS1-WeightB, Worker 0, Worker 1, Host: Weight A+B]
Fig. 5. The intra-node communications among a CPU and two GPUs of the computation node (PS0), which receives the parameters, WeightB, from the parameter server node (PS1). There are a total of eleven iterations performed in this experiment.
\[ \dots \; 2 \times (N - 1) \times \frac{V}{N} \]

\[ T_{D2H} = \frac{V}{\mathit{D2H}} \]

Modeling the parameter-server scheme. We assume every machine node within the cluster participates in the computations.

\[ T_{STEP} = T_{NET} + T_{H2D} + T_{P2P} + T_{Forward} + T_{D2H} \]

where:

\[ T_{NET} = \frac{V_{remote} \times (W - 1) \times 2 + (V - V_{remote})}{B} \]

\[ T_{H2D} = \frac{V_{local}}{\mathit{H2D}} \]

\[ T_{P2P} = \frac{V_{local}}{\mathit{P2P}} \]

\[ T_{D2H} = \frac{V}{\mathit{D2H}} \]

V : the parameter size of the model
V_remote : the parameter size held by the node that holds the most parameters among all remote parameter servers
V_local : the parameter size held by the node that holds the most parameters among all local parameter servers
N : the number of GPUs in all nodes
W : the number of computing nodes
G : the number of GPUs in each computing node
B : the uni-directional network bandwidth between two nodes
H2D : the bus bandwidth from CPU to GPU
D2H : the bus bandwidth from GPU to CPU
P2P : the bus bandwidth from GPU to GPU

IV. EVALUATION AND DISCUSSION

We built a computer cluster, and the configuration of each cluster node is listed in Table VI. We ran the experiments with a single machine node to capture the performance characteristics of the key stages (mentioned in Section III-A) in the model training process. Based on the obtained performance data, our performance models are used to estimate the performance of the same workloads running on two-node and four-node clusters, where each node is equipped with one or two GPUs.

Estimating errors of the predicted performance. Tables VII and VIII give the predicted (predicted image/sec) and measured (real image/sec) performance under the two parameter distribution schemes, ring-allreduce and parameter server, respectively. The error rates listed in the right-most columns of the tables are under 5% in most cases, and the maximum error is 9.7%.
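The parameter-server step-time equations of Section III and the error rate used in this section can be written down directly. The code below is a minimal sketch rather than the authors' implementation: the class and function names are hypothetical, all sizes are in bytes and bandwidths in bytes per second, and the error rate is assumed to be the relative difference against the measured value, which the text does not define explicitly.

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    """Bandwidths in bytes per second (field names are illustrative)."""
    net: float  # B: uni-directional network bandwidth between two nodes
    h2d: float  # H2D: bus bandwidth from CPU to GPU
    d2h: float  # D2H: bus bandwidth from GPU to CPU
    p2p: float  # P2P: bus bandwidth from GPU to GPU

def ps_step_time(v, v_remote, v_local, workers, t_forward, hw):
    """T_STEP = T_NET + T_H2D + T_P2P + T_Forward + T_D2H."""
    t_net = (v_remote * (workers - 1) * 2 + (v - v_remote)) / hw.net
    t_h2d = v_local / hw.h2d
    t_p2p = v_local / hw.p2p
    t_d2h = v / hw.d2h
    return t_net + t_h2d + t_p2p + t_forward + t_d2h

def error_rate(predicted, measured):
    """Relative error of a prediction (assumed definition)."""
    return abs(predicted - measured) / measured

# Bandwidths from Table VI; the parameter sizes below are placeholders.
node = Hardware(net=1e9 / 8, h2d=6.0e9, d2h=6.5e9, p2p=6.0e9)
step = ps_step_time(v=100e6, v_remote=60e6, v_local=40e6,
                    workers=2, t_forward=0.25, hw=node)
print(f"predicted step time: {step:.3f} s")
```

Because the network term is divided by the smallest bandwidth in the system, it dominates the step time at 1 Gbps for any sizeable parameter volume.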
TABLE VI
THE SW/HW CONFIGURATIONS OF A CLUSTER NODE.

Software
  DNN models: Resnet50, Inception3, Alexnet, Vgg16
  Batch size: 32
  DNN Framework: TensorFlow v.1.7.0
  OS: CentOS 7.3
Hardware
  Intel Xeon E5-2620 v4 @ 2.1 GHz w/ 64GB memory
  NVIDIA GTX-1080 (w/ 8GB memory) * 2
  Network bandwidth: 1 Gbps
  Bus bandwidths:
    CPU to GPU: 6.0 GB/s
    GPU to CPU: 6.5 GB/s
    GPU to GPU: 6.0/10.0 GB/s (uni-direction/bi-direction)

Characterizing the system performance. Our performance models are able to reflect the performance characteristics of model training with the two parameter distribution schemes. When the model parameters are distributed across different model layers (i.e., Resnet50 and Inception3, as depicted in Figure 3), the performance scales when adding more GPUs. It is interesting to note that the parameter server scheme achieves linear speedups as the number of GPUs grows, and it is better than the other scheme, which costs more network time and hence limits the performance.

In contrast, for models whose parameters are concentrated in certain layers (i.e., Alexnet and Vgg16), the ring-allreduce scheme delivers better performance than the parameter server scheme, since the performance of the latter is dominated by the node that is assigned to handle the big layers.

Choosing the proper network configuration. The predicted performance helps make better decisions for the system design. For example, the network time grows with the number of GPUs, as shown in Tables VII and VIII. It is clear that the current network setting, Gigabit Ethernet, is not sufficient for our workloads. While simply choosing the network with the highest bandwidth may deliver the best results, it may lead to an over-designed and over-budget system.

We use Figure 6 as an example to show how our proposed work helps determine which network bandwidth is sufficient for the workloads using the ring-allreduce scheme. Note that in this experiment, a maximum of 32 GPUs is available in the system formed by eight machine nodes; the configuration with a single GPU is the baseline, and the speedups achieved by 2 to 32 GPUs are reported, where the maximum speedup is 32. The figure plots the predicted performance of the design alternatives with different network bandwidths: 1 Gbps, 10 Gbps, 40 Gbps, 100 Gbps, and unlimited.

The results show that Inception3 can take full advantage of the underlying hardware by achieving a 32x speedup when the 10 Gbps network is adopted, so using a higher-bandwidth network would be a waste. This is not the case for Alexnet, which needs as much network bandwidth as possible to achieve better performance. The reason is that the computation time of Alexnet is relatively short (e.g., 0.01s in Table VII), and it requires frequent data exchanges over the network whenever the computation is done.

V. CONCLUSION

We present performance estimation models for distributed deep learning systems, which are based on the communication schemes and the profiling data. According to the evaluation results with popular communication schemes and popular DNN models, the estimated performance is very close to the results measured on real machines. The error rate is under 5% in most scenarios, and the maximum error rate is below 10%. This shows the feasibility of predicting the performance of distributed deep learning systems before any real machines are implemented and deployed. It saves the trial-and-error cost and helps ensure that the computation power and the network bandwidth satisfy the deep learning applications.

ACKNOWLEDGMENTS

This work was financially supported by the Ministry of Science and Technology of Taiwan under grants MOST No. 107-2218-E-002-053- and MOST No. 107-2218-E-002-003-.

REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing. High-performance distributed ml at scale through parameter server consistency models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 79–87. AAAI Press, 2015.
[3] H. Qi, E. R. Sparks, and A. Talwalkar. Paleo: A performance model for deep neural networks. ICLR, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[6] D. Justus, J. Brennan, S. Bonner, and A. S. McGough. Predicting the computational cost of deep learning models. In 2018 IEEE International Conference on Big Data (Big Data), pages 3873–3882, Dec 2018.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[8] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, 2014. USENIX Association.
[9] C.-Y. Liu. Scout j-bench. https://github.com/jcjohnson/cnn-benchmarks, 2018.
[10] C.-Y. Liu. Sofa. https://github.com/cyliustack/sofa.git, 2018.
[11] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna. SCALE-Sim: Systolic CNN Accelerator Simulator. arXiv e-prints, page arXiv:1811.02883, Oct 2018.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[14] C. Witt, M. Bux, W. Gusew, and U. Leser. Predictive performance modeling for distributed computing using black-box monitoring and machine learning. CoRR, abs/1805.11877, 2018.
[15] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, et al. Poseidon: An efficient communication architecture for distributed deep learning on gpu clusters. arXiv preprint arXiv:1706.03292, 2017.
TABLE VII
ERROR RATES OF PERFORMANCE ESTIMATION IN RING-ALLREDUCE SCHEME
TABLE VIII
ERROR RATES OF PERFORMANCE ESTIMATION IN PARAMETER SERVER SCHEME
Fig. 6. Predicting the performance speedups when training the models with the ring-allreduce scheme under five different network bandwidths with 1 to 32
GPUs.
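The kind of bandwidth sweep plotted in Fig. 6 can be approximated with a few lines of code. This is a simplified sketch under stated assumptions, not the model evaluated in the paper: it assumes each GPU exchanges 2 x (N - 1) / N of the parameter volume per step (the usual ring-allreduce traffic), that communication is not overlapped with computation, and that bus transfers inside a node are ignored. The parameter size is a placeholder and the function name is hypothetical.

```python
import math

def ring_allreduce_speedup(num_gpus, param_bytes, t_compute, bandwidth):
    """Predicted speedup over a single GPU without compute/network overlap."""
    if num_gpus == 1:
        return 1.0
    # Per-step network time: 2 * (N - 1) * V / N bytes at `bandwidth` bytes/s.
    t_net = 2 * (num_gpus - 1) * (param_bytes / num_gpus) / bandwidth
    return num_gpus * t_compute / (t_compute + t_net)

GBPS = 1e9 / 8  # bytes per second in one gigabit per second
bandwidths = {"1Gbps": 1 * GBPS, "10Gbps": 10 * GBPS, "40Gbps": 40 * GBPS,
              "100Gbps": 100 * GBPS, "unlimited": math.inf}

PARAM_BYTES = 100e6  # placeholder model size, not a value from the paper
T_COMPUTE = 0.25     # seconds per iteration (Resnet50 in Table II)

for name, bw in bandwidths.items():
    curve = [ring_allreduce_speedup(n, PARAM_BYTES, T_COMPUTE, bw)
             for n in (1, 2, 4, 8, 16, 32)]
    print(f"{name:<10}", [round(s, 1) for s in curve])
```

With unlimited bandwidth the curve reaches the ideal 32x, and a model with a short computation time per step falls away from that line much sooner, which is the qualitative behavior described for Alexnet.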