Lyu 2019

Modeling Interprocessor Communication and Performance Scalability for Distributed Deep Learning Systems
Yi-Hong Lyu (1), Cheng-Yueh Liu (1), Chen-Pang Lee (1), Chia-Heng Tu (2) and Shih-Hao Hung (1)
(1) National Taiwan University, Taipei 10617, Taiwan
(2) National Cheng Kung University, Tainan 70101, Taiwan
Email: [email protected]; [email protected]; [email protected];
[email protected]; [email protected]
Abstract—As deep learning applications become popular, the design of deep learning systems becomes a critical task for unleashing the computing power of the underlying hardware. Aside from the computing hardware, the computer network is also a key factor that affects the delivered performance. For a large and complex model, the scalability of the system depends heavily on the design of the networks as well as on the software behaviors. In this paper, we propose a profile-data-guided performance prediction method that estimates the performance of a system with the desired high-speed interconnects, based on the profiling data obtained in a previous run. In particular, we leverage the open-source profiling tool SOFA to characterize the activities of deep learning software running in a computer cluster, and the characterized information is used to build the performance model for the model training process. When estimating the performance, SOFA captures the performance-critical factors that the model needs to make its predictions. To evaluate the proposed method, four popular deep learning models, ResNet50, Inception3, Alexnet, and VGG16, are adopted in our experiments, where a four-node computer cluster is used to profile the training of these models on TensorFlow. We ran a scalability analysis to determine the size of the cluster and the suitable computer networks for the models. Comparing the predicted data with those measured on the cluster, our model achieves 95% accuracy or better in most cases, with a maximum error rate of 10%.

Index Terms—Profiling tool, Deep learning, Distributed training, Timing Model, Network.

978-1-7281-4484-9/19/$31.00 ©2019 IEEE

I. INTRODUCTION

With the success of deep learning methods in diverse applications, various computer systems have been created to facilitate deep learning computations, whose performance is often boosted by multicore processors and/or acceleration hardware on each machine node. In particular, the communication networks of such systems are specially designed to rapidly transfer the large amounts of data required by the computations. For instance, high-speed Ethernet and InfiniBand are often used for inter-node interconnections, whereas Peripheral Component Interconnect Express (PCIe) is adopted as the interconnect for the acceleration hardware within a machine node. A notable example is the DGX series of systems developed by NVIDIA, where 100 Gb InfiniBand and the custom-designed NVLink are used for inter- and intra-node communications, respectively.

It is challenging to design a computer system that satisfies the requirements of the target applications while meeting the project budget, since each application has distinct computation and communication patterns, and a misallocation of computing and communication resources often results in unwanted idle time. While adopting an over-designed system may avoid the inefficient use of resources, it leads to huge ownership and operating expenses, as such systems usually consist of many machine nodes, each equipped with costly hardware. Therefore, estimating the delivered performance of a computer system under different configurations is important for avoiding overdesign. While there have been several attempts to predict the performance of deep learning systems, they do not tackle the performance issues incurred by the communication network. PALEO [3] assumes a scaling constant to roughly estimate the inefficiency of the communication network. Machine learning has been proposed to predict the performance in [14] and [6], where the inefficiency of the communication network is hidden in the black box of machine learning. SCALE-Sim [11] is a cycle-accurate DNN accelerator simulator that models the microarchitectural features inside the accelerator, but the communication network outside the accelerator is not fully discussed at the system level.

In this work, we propose a hybrid approach that combines performance measurement and estimation to project the delivered performance of a computer system under different configurations. We leverage the profiling tool SOFA [10] to characterize the computation/communication patterns of the target application, and develop performance models that estimate the delivered performance from the obtained performance data. Our proposed method is independent of the deep learning framework used to create the application and can be applied to the major frameworks, such as TensorFlow [1], Mxnet, PyTorch, and Caffe2 [5].

Based on the data profiled on a single machine node, which is equipped with an x86-based multicore processor and two NVIDIA GPUs, our experimental results show that the performance models are able to report the projected
performance when the system adopts more machine nodes (i.e., two and four nodes) and different high-speed interconnect networks (i.e., 10/40/100 Gigabit Ethernet and InfiniBand) for four popular deep learning models, resnet50 [4], inception3 [13], vgg16 [12], and alexnet [7], with an error rate of 5% in most scenarios and a maximum error rate of 10% against the data measured on the real systems. The results show that the proposed method is able to predict the delivered performance of the training processes of varied deep learning models and is useful for exploring configurations when designing decentralized systems that meet the application requirements.

The rest of the paper is organized as follows. Section II introduces distributed deep learning and the performance profiling tool, SOFA. Section III elaborates the performance estimation models. Section IV evaluates and discusses the performance of our proposed modeling techniques with the four popular DNN models. Finally, we conclude our work and summarize future work in Section V.

II. BACKGROUND

In this section, we introduce the performance characteristics of deep learning applications and the system-wide performance profiling tool, SOFA (Swarm-Oriented Function Calls Analysis), which aims at analyzing program runtime behaviors on top of a complex system consisting of a deep software stack and heterogeneous computing hardware.

A. Characterizations of Deep Learning Computations

Before a deep learning model can be used, it takes a significant amount of time to train the model. Training a deep learning model involves two major phases: forward and backward propagation. The training data are fed into the model, which is formed by many small cells (called neurons) in different layers. The computations on the input data and the parameters within a cell determine the output of the cell. The training process intends to learn the trends of the input data by altering the values of the parameters. The forward pass applies the input data to the model with the given parameter values, and the backward pass updates the parameters according to the predicted output calculated in the forward pass and the given answer, aiming to bring the output closer to the answer. The execution of one forward and one backward pass is defined as an iteration, and the training process is repeated until the model output is precise enough. Sometimes, the iteration number is given to control the length of the training process (i.e., to see if the currently trained model is accurate enough).

It is a common practice to build a computer cluster to boost the performance of the training process by taking advantage of massive computing resources consisting of multicore processors, GPUs, TPUs and FPGAs. To facilitate the hardware acceleration, analyzing the runtime performance characteristics is important. We studied the performance of deep learning applications in terms of their computation and communication parts. As for the computation part, from our analysis, the hot-spots of the deep learning computations are the forward and backward propagation, costing 99% of the time [15]. With the performance profile in mind, we can choose the best possible hardware to accelerate the computation, if the selection fits the system design requirements.

On the other hand, the parameters of deep learning models are worth investigating, since the parameters occupy memory and the computation cannot be performed efficiently without enough memory space. Furthermore, when a computer cluster is used for the computations, the updated parameter values are moved around the computing hardware within a node or across machine nodes before the computation of the next training iteration. Hence, we analyzed the structures of deep learning models and their parameters. It is interesting to note that the size of the parameters is not positively correlated with the number of layers of a given model. According to Table I, Vgg16 has fewer layers than Resnet50, but its parameter size is larger. That is because there is a big fully-connected layer in Vgg16, and this layer holds most of the parameters of Vgg16. Discovering the network structure of the given model and the size of the parameters helps characterize the communication behaviors and further facilitates the modeling of the delivered performance of the system.

TABLE I
LAYERS AND PARAMETER SIZES OF FOUR POPULAR DNN MODELS

Model              Number of layers  Number of fully-connected layers  Number of parameters
Resnet50 [4]       176               1                                 25,612,201
Inception V3 [13]  313               1                                 23,786,333
Alexnet [7]        8                 3                                 61,842,345
Vgg16 [12]         16                3                                 138,361,641

B. Profiling Deep Learning Software with SOFA

SOFA [10] is a software framework for performance profiling of deep learning programs, where specialized function-swarm methods have been developed to analyze the runtime performance data generated by existing heterogeneous profiling tools. It facilitates the performance analysis of a distributed deep learning system, where each of the machine nodes consists of different types of computing hardware. By analyzing the function swarms (i.e., the temporal and spatial function-call patterns) within the collected performance data, SOFA is able to discover the code regions that are worth investigating for performance optimization, and greatly shortens the time developers would otherwise spend analyzing the trace data manually to find the meaningful code regions, such as main loops (by searching for the repeated occurrences of invoked function calls; temporal patterns) and intensively invoked groups of functions (by discovering the access patterns of the addresses of the executed instructions; spatial patterns).

SOFA works with existing tools to collect trace data that correlates performance information with the software activities, such as perf_events, tcpdump, nvprof,
mpstat, blktrace, and pyflame. perf_events correlates the hardware and software events that occurred with the resolved function symbols, and the performance data helps pinpoint the code sites for further performance analysis. tcpdump is used for tracking inter-process or inter-processor communication activities. nvprof is the proprietary tool offered by NVIDIA to track GPU-related runtime activities and performance information. mpstat gives the processor status information, and vmstat provides the virtual memory statistics. blktrace collects detailed information about disk I/O as it goes through the block device layer. pyflame allows profiling a Python process without modifying the Python code.

III. METHODOLOGY

Our performance model takes three kinds of input data to make the predictions: 1) the neural network model description, for the model size and the connections of the trainable parameters, 2) the GPU computation time of the iteratively executed program parts, and 3) the hardware specification of the target system, given as input parameters to the performance model, such as the bandwidth of the interconnection network, the bus bandwidth between a processor and the system memory, and the number of machine nodes. The first can be obtained by calling the APIs provided by the deep learning frameworks, the second is provided by SOFA, and the third is supplied manually by the developers. A concrete example of the inputs to the model is illustrated in Figure 1.

In the following subsections, we characterize the execution flow of deep learning models with SOFA, which helps identify the key execution stages of deep learning software. Moreover, we analyze the data communications that take place among the machine nodes during the computations, as well as the communications among the processing hardware within a machine node. Based on the above information, we introduce the performance model that we build for performance predictions.

Fig. 1. The Flow of Performance Estimation

A. Discovering Deep Learning Software Execution Stages with SOFA

We give a concrete example to illustrate the performance profiling data provided by SOFA. In particular, the parameter server scheme [2] [8] is adopted in the experiment, where a parameter server (denoted as machine: 101) and a computation machine node (denoted as machine: 100) are involved. Our experimental results show that repeated patterns are revealed by SOFA. Figure 2 gives an impression of the data generated by SOFA by visualizing the performance trace of one iteration of the model training process, where the occurrences of the tracked events are represented by dots in different colors. To make it easy to understand, the interpreted logical steps of the training process are written at the bottom of the figure.

As can be seen in the figure, a training iteration starts by transferring the parameters from the server to the computation node (dark gray dots). Then, within the computation node, the data is further copied from the CPU to the GPU (red dots). During the GPU computations of the forward and backward propagation (green dots), there are data exchanges between GPUs (purple dots). Next, when the GPU computation is done, the resulting data is copied back to the CPU (brown dots). Finally, at the end of the iteration, the intermediate data is transferred back to the parameter server from the computation node (dark gray dots).

A major advantage of SOFA is that each type of activity is represented by a distinct color, so developers can see the big picture of the software execution flow. From the differently colored dots, we identify the key execution stages within a training iteration (e.g., the order of the inter- and intra-node data movements, and the forward and backward propagation, as described in the previous paragraph). Furthermore, we also learn that the backward propagation is overlapped with the GPU-to-CPU communication. This situation is illustrated in Figure 2, where the brown dots are on top of the green dots. The phenomenon is also reported in previous work [15], and the concurrent execution of the GPU computation (for backward propagation) and the data movements from the GPU to the CPU is supported by CUDA streams.

In addition to the visualized data representations, SOFA also reports summarized performance information. For example, the execution time taken by the GPU can be read from the GPU kernel events (green dots) in Figure 2, and it is about 0.25 seconds according to the horizontal (time) axis. The GPU execution time can also be obtained by dividing the total execution time of the GPU by the number of iterations. For example, Table II lists the performance data of the training processes for the four deep learning models, Resnet50, Inception3, Alexnet, and Vgg16. The image processing rate of Resnet50 is stable after 100 iterations. It takes 50.10 seconds for 100 iterations and 55.16 seconds for 120 iterations. Hence, it takes 5.06 seconds for 20 iterations, or 0.25 seconds per iteration. This matches the result from the visualized data of SOFA. More data reported by SOFA are given in Table III regarding the training process of deep learning models across three NVIDIA GPUs, GTX 1080, Titan X, and Tesla P100, with CNN benchmarks [9].

B. Characterizing Communication Behaviors

As the sizes of model parameters affect the data transferring performance, we used TensorFlow to report the parameter
Fig. 2. The runtime activities of one iteration for the model training process, where the parameter server scheme is adopted.
TABLE II
PERFORMANCE PROFILES OF FOUR DEEP LEARNING MODELS ON GPU.

Model       Warm-up     Iterations  Execution  Img/sec  Iteration time  Iterations per
            iterations              time (s)            (s) (C)         second 1/(C)
Resnet50    100         100         50.10      122.72   0.25            3.95
            100         120         55.16      123.52
Inception3  100         100         70.49      87.75    0.34            2.9
            100         120         77.38      86.31
Alexnet     600         600         40.26      910.31   0.04            28.01
            600         720         44.54      907.47
Vgg16       100         100         75.91      86.55    0.37            2.73
            100         120         83.25      86.33
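
The iteration-time column of Table II can be reproduced by differencing the two runs of each model, so that any fixed start-up cost cancels out. The following is a minimal sketch of that arithmetic; the helper names and the dictionary layout are illustrative assumptions and are not part of SOFA.

```python
# Reproduce the "iteration time" and "iterations per second" columns of Table II.
# Each model is run twice with different iteration counts; differencing the two
# runs cancels any fixed start-up cost. All names here are illustrative.

def per_iteration_time(iters_a, secs_a, iters_b, secs_b):
    """Seconds per iteration from two (iterations, execution time) runs."""
    return (secs_b - secs_a) / (iters_b - iters_a)

# (iterations, execution time in seconds) pairs copied from Table II.
TABLE_II = {
    "Resnet50":   ((100, 50.10), (120, 55.16)),
    "Inception3": ((100, 70.49), (120, 77.38)),
    "Alexnet":    ((600, 40.26), (720, 44.54)),
    "Vgg16":      ((100, 75.91), (120, 83.25)),
}

def summarize(table=TABLE_II):
    """Return {model: (seconds per iteration, iterations per second)}."""
    result = {}
    for model, ((n_a, t_a), (n_b, t_b)) in table.items():
        step = per_iteration_time(n_a, t_a, n_b, t_b)
        result[model] = (step, 1.0 / step)
    return result

if __name__ == "__main__":
    for model, (step, rate) in summarize().items():
        print(f"{model:<11} {step:.3f} s/iteration  {rate:.2f} iterations/s")
```

For Resnet50 this gives (55.16 - 50.10) / 20, which is about 0.25 seconds per iteration, in line with the value read from the SOFA trace.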
sizes for the four deep learning models, as listed in Table IV. We further analyzed the distribution of the parameters over the model layers, as illustrated in Figure 3. It is interesting to note that over 80% of the parameters are concentrated in the layers near the model output. In particular, over 90% of the parameters of Vgg16 are in the last three fully-connected layers. With the size of the model metadata in mind, we further describe the characteristics of the inter- and intra-node communications as follows.

Fig. 3. The distribution of model parameters across the model layers.

Inter-node communications. To analyze the communications within a computer cluster, we ran the model training program for the four models on the four-node cluster, where three different styles of distributing parameters were adopted to observe the data transmission patterns (i.e., round-robin and greedy load balance of the parameter server scheme, and the ring-allreduce scheme). On these machines, the network tracing module (i.e., libpcap) in SOFA was used to measure the data volumes transmitted out of and into each node, and the data averaged across multiple training iterations is reported to avoid intermittent effects.

For Alexnet and Vgg16 in Figure 4, the data traffic varies across the four nodes in the greedy and round-robin modes, since the parameter server scheme assigns the weights and parameters of each model layer to a different node (e.g., in a round-robin fashion), and hence a node carries more traffic when a big layer is assigned to it. The results also show that the greedy mode cannot alleviate the problem. On the other hand, as the parameters/weights are assigned equally to the computing engines within the cluster in the ring-allreduce mode, the four nodes have almost equal data traffic. Furthermore, it is necessary for one
TABLE III
TIME RATIO OF FORWARD PROPAGATION AND BACKWARD PROPAGATION.
Software Configurations: TensorFlow 1.4, batch size=16, single GPU

Model         GTX 1080                                 Pascal Titan X                           Tesla P100-SXM2-16GB
              forward  backward  total  speed          forward  backward  total  speed          forward  backward  total  speed
              (ms)     (ms)      (ms)   (images/s)     (ms)     (ms)      (ms)   (images/s)     (ms)     (ms)      (ms)   (images/s)
Alexnet       7        14        21     762            4        10        14     1143           4        9         13     1231
Inception-v1  16       40        56     286            12       27        39     410            14       31        45     356
Vgg16         60       123       183    87             42       87        129    124            39       81        120    133
Resnet-50     51       99        150    107            35       69        104    154            34       74        108    148
Resnet-101    77       148       225    71             77       148       225    71             54       115       169    95
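
The speed column of Table III follows from the batch size and the per-batch time: the number of images per second is the batch size (16 in this configuration) divided by the total forward plus backward time. A small sketch using the GTX 1080 columns is given below; the helper names are hypothetical and chosen only for illustration.

```python
BATCH_SIZE = 16  # from the software configuration stated in Table III

# (forward ms, backward ms) on the GTX 1080, copied from Table III.
GTX_1080 = {
    "Alexnet":      (7, 14),
    "Inception-v1": (16, 40),
    "Vgg16":        (60, 123),
    "Resnet-50":    (51, 99),
    "Resnet-101":   (77, 148),
}

def images_per_second(forward_ms, backward_ms, batch_size=BATCH_SIZE):
    """Throughput implied by one forward and one backward pass per batch."""
    total_seconds = (forward_ms + backward_ms) / 1000.0
    return batch_size / total_seconds

def backward_to_forward_ratio(forward_ms, backward_ms):
    """How many times longer the backward pass is than the forward pass."""
    return backward_ms / forward_ms

if __name__ == "__main__":
    for model, (fwd, bwd) in GTX_1080.items():
        print(f"{model:<13} {images_per_second(fwd, bwd):7.1f} images/s  "
              f"backward/forward = {backward_to_forward_ratio(fwd, bwd):.2f}")
```

On these rows the backward pass takes roughly twice as long as the forward pass, which is the time ratio the table is meant to convey.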
controller to send all parameters to the computation nodes at the beginning, which is why one node transfers slightly more data than the others. According to the data reported by SOFA, the extra traffic of the controller occurs only at initialization, and the size of the data transfer is the same as the size of the Vgg16 parameters.

TABLE IV
SIZES OF DNN MODELS' PARAMETERS.

Model         Parameter number  Parameter size (bytes)  Data type
Resnet50      25,612,201        102,448,804             Float32
Inception V3  23,853,833        95,415,332              Float32
Alexnet       61,842,345        247,369,380             Float32
Vgg16         138,361,641       553,446,564             Float32

Fig. 4. Network traffics under different parameter distribution methods.

Intra-node communications. We performed an experiment to examine whether the intra-node communication may dominate the GPU execution time. Table V lists the computation and communication time on the GPU for executing the models. It is important to note that the communication time is about twice the computation time for Alexnet. Our results show that, to accurately model deep learning computations, the intra-node communication time should be considered.

To further analyze the data communication patterns within a machine node, we performed another experiment with one parameter server (denoted as PS1; running the greedy load balance scheme and the model training for Vgg16) and one computation node (denoted as PS0). We found that at the beginning, PS1 dispatches the parameters, WeightB, to PS0, as illustrated in the upper-left image of Figure 5. Upon receiving WeightB, the computing engines at PS0 (denoted as Host, GPU1, and GPU2) exchange the parameters, WeightA and WeightB, to do the computations, as illustrated in the upper-right image of the figure. In particular, Host sends WeightA and WeightB to GPU1 and GPU2, respectively (denoted as H2D, short for host-to-device communication). Later, GPU1 and GPU2 exchange their parameters (denoted as P2P, the point-to-point communication between the GPUs across the system bus). When the computations done by the GPUs are finished, each GPU transfers the updated parameters to Host (denoted as D2H, device-to-host communication), and Host at PS0 sends them to the parameter server, PS1.

In addition, we used SOFA to compile the statistics for the data traffic within the machine node, and a traffic matrix is established to list the communications between any two computing engines within the computation node, PS0, as illustrated in the bottom-left image of Figure 5. Our observation can be summarized by the relation D2H = H2D + P2P, where the size of the data sent from a device (e.g., GPU1 or GPU2) to Host is similar to that of the data previously sent to the device. For example, 5,805MB of data is sent from GPU1 to Host, and the data transferred from Host to GPU1 and from GPU2 to GPU1 amounts to 1,215MB and 4,590MB, respectively.

C. Performance Prediction Models

Based on the performance characteristics above, we devise the formulas below to predict the performance of the model training process done by a computer cluster that operates under the ring-allreduce scheme and the parameter server scheme, respectively. Note that since the backward propagation is overlapped with the D2H operations, as mentioned in Section III-A, the time taken by the backward propagation, $T_{backward}$, is not listed in the formulas.

Modeling the ring-allreduce scheme.

\[ T_{STEP} = T_{NET} + T_{H2D} + T_{Forward} + T_{D2H} \]

where:

\[ T_{NET} = \frac{2 \times (N - 1) \times \frac{V}{N}}{B} \]

\[ T_{H2D} = \frac{2 \times (N - 1) \times \frac{V}{N}}{H2D} \]
TABLE V
COMPUTATION AND COMMUNICATION TIME FOR GPU.

Model       HtoD data  Average achieved  Data communication  DtoH data  Average achieved  Data communication  Kernel time  Memcpy time
            (MB)       bandwidth HtoD    time (A)            (MB)       bandwidth DtoH    time (B)            (seconds)    (A)+(B)
                       (GB/s)                                           (GB/s)
resnet50    2044.73    5.90              0.35                2048.97    6.10              0.33                5.46         0.68
inception3  1905.56    5.90              0.32                1908.30    5.80              0.33                8.58         0.65
alexnet     5194.76    6.00              0.87                5194.76    6.50              0.80                0.80         1.66
vgg16       11068.93   6.00              1.85                11068.93   6.50              1.70                9.60         3.55
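
The memcpy column of Table V can be approximated by dividing each data volume by its achieved bandwidth, and the ratio of memcpy time to kernel time shows when communication dominates. The sketch below does this with the table's own numbers; the function names are illustrative, and the use of decimal units (1 GB = 1000 MB) is an assumption, since the unit convention is not stated.

```python
# Per model: kernel time (s), HtoD data (MB), HtoD bandwidth (GB/s),
# DtoH data (MB), DtoH bandwidth (GB/s). Values copied from Table V.
TABLE_V = {
    "resnet50":   (5.46, 2044.73, 5.90, 2048.97, 6.10),
    "inception3": (8.58, 1905.56, 5.90, 1908.30, 5.80),
    "alexnet":    (0.80, 5194.76, 6.00, 5194.76, 6.50),
    "vgg16":      (9.60, 11068.93, 6.00, 11068.93, 6.50),
}

MB_PER_GB = 1000.0  # assumption: decimal units

def memcpy_time(htod_mb, htod_gbps, dtoh_mb, dtoh_gbps):
    """Transfer time (A)+(B): data volume over achieved bandwidth, both directions."""
    return htod_mb / (htod_gbps * MB_PER_GB) + dtoh_mb / (dtoh_gbps * MB_PER_GB)

def comm_to_compute_ratio(model):
    """Memcpy time divided by GPU kernel time for one model of Table V."""
    kernel_s, *transfers = TABLE_V[model]
    return memcpy_time(*transfers) / kernel_s

if __name__ == "__main__":
    for name in TABLE_V:
        print(f"{name:<11} communication/computation = {comm_to_compute_ratio(name):.2f}")
```

With these inputs Alexnet is the only model whose ratio exceeds 2, while the other three stay well below 1, which matches the observation that intra-node communication cannot be ignored for Alexnet.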
[Figure 5 panels. Upper left: the remote parameter server PS1 holds WeightB and PS0 holds WeightA, distributed by the GreedyLoad algorithm. Upper right: Host holds WeightA+B; GPU1 and GPU2 receive WeightA and WeightB via H2D, exchange them via P2P, run the computation, and return the results via D2H. Bottom: Total Traffic = H2D + P2P + D2H; Data Traffic: H2D + P2P = D2H; GPU1: WeightA = 115,706,276 bytes, 1,215MB = 115,706,276/1,024/1,024*11; GPU2: WeightB = 437,740,288 bytes, 4,592MB = 437,740,288/1,024/1,024*11.]

Fig. 5. The intra-node communications among a CPU and two GPUs of the computation node (PS0), which receives the parameters, WeightB, from the parameter server node (PS1). There are a total of eleven iterations performed in this experiment.
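
The traffic figures annotated in Figure 5 can be checked with a few lines of arithmetic: each weight partition is moved once per iteration, and the experiment runs eleven iterations. The sketch below uses the byte counts shown in the figure and the Vgg16 size from Table IV; the variable and function names are illustrative only.

```python
ITERATIONS = 11               # stated in the caption of Figure 5
WEIGHT_A_BYTES = 115_706_276  # partition sent from Host to GPU1 (Figure 5)
WEIGHT_B_BYTES = 437_740_288  # partition sent from Host to GPU2 (Figure 5)
VGG16_BYTES = 553_446_564     # total Vgg16 parameter size (Table IV)

def traffic_mb(param_bytes, iterations=ITERATIONS):
    """Accumulated traffic in MB (1 MB = 1,024 * 1,024 bytes, as in Figure 5)."""
    return param_bytes / 1024 / 1024 * iterations

# Traffic seen by GPU1 over the whole run.
h2d_mb = traffic_mb(WEIGHT_A_BYTES)  # Host -> GPU1 carries WeightA
p2p_mb = traffic_mb(WEIGHT_B_BYTES)  # GPU2 -> GPU1 carries WeightB
d2h_mb = h2d_mb + p2p_mb             # observed relation: D2H = H2D + P2P

if __name__ == "__main__":
    print(f"H2D = {h2d_mb:.1f} MB, P2P = {p2p_mb:.1f} MB, D2H = {d2h_mb:.1f} MB")
```

The two partitions add up exactly to the Vgg16 parameter size, and the resulting D2H volume agrees with the 5,805MB reported for GPU1. The same arithmetic gives roughly 1,214MB for WeightA, close to the 1,215MB shown in the figure.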
\[ T_{D2H} = \frac{2 \times (N - 1) \times \frac{V}{N}}{D2H} \]

Modeling the parameter-server scheme. We assume that every machine node within the cluster participates in the computations.

\[ T_{STEP} = T_{NET} + T_{H2D} + T_{P2P} + T_{Forward} + T_{D2H} \]

where:

\[ T_{NET} = \frac{V_{remote} \times (W - 1) \times 2 + (V - V_{remote})}{B} \]

\[ T_{H2D} = \frac{V_{local}}{H2D} \]

\[ T_{P2P} = \frac{V_{local}}{P2P} \]

\[ T_{D2H} = \frac{V}{D2H} \]

V: the parameter size of the model
V_remote: the parameter size held by the node that holds the most parameters among all remote parameter servers
V_local: the parameter size held by the node that holds the most parameters among all local parameter servers
N: the number of GPUs in all nodes
W: the number of computing nodes
G: the number of GPUs in each computing node
B: the uni-directional network bandwidth between two nodes
H2D: the bus bandwidth from CPU to GPU
D2H: the bus bandwidth from GPU to CPU
P2P: the bus bandwidth from GPU to GPU

IV. EVALUATION AND DISCUSSION

We built a computer cluster, and the configuration of each cluster node is listed in Table VI. We ran the experiments on a single machine node to capture the performance characteristics of the key stages (mentioned in Section III-A) in the model training process. Based on the obtained performance data, our performance models are used to estimate the performance of the same workloads running on two-node and four-node clusters, where each node is equipped with one or two GPUs.

Estimating errors of the predicted performance. Tables VII and VIII give the predicted (predicted image/sec) and measured (real image/sec) performance under the two parameter distribution schemes, ring-allreduce and parameter server, respectively. The error rates listed in the right-most columns of the tables are under 5% in most cases, and the maximum error is 9.7%.
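
The step-time formulas of Section III-C translate directly into code. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, all sizes are in bytes and all bandwidths in bytes per second, the binary unit conversions (2^20 bits per Mbit, 2^30 bytes per GB) are an assumption that happens to reproduce the data volumes in Table VII, and the throughput formula (number of GPUs times batch size over step time) is likewise an assumption inferred from that table.

```python
MBIT = 2**20 / 8  # bytes per Mbit; binary units are an assumption
GB = 2**30        # bytes per GB; binary units are an assumption

def ring_allreduce_step_time(n_gpus, v, net_bw, h2d_bw, d2h_bw, t_forward):
    """T_STEP = T_NET + T_H2D + T_Forward + T_D2H for the ring-allreduce scheme."""
    moved = 2 * (n_gpus - 1) * v / n_gpus  # bytes moved per iteration
    return moved / net_bw + moved / h2d_bw + t_forward + moved / d2h_bw

def parameter_server_step_time(v, v_remote, v_local, workers,
                               net_bw, h2d_bw, p2p_bw, d2h_bw, t_forward):
    """T_STEP = T_NET + T_H2D + T_P2P + T_Forward + T_D2H for the parameter server scheme."""
    t_net = (v_remote * (workers - 1) * 2 + (v - v_remote)) / net_bw
    return t_net + v_local / h2d_bw + v_local / p2p_bw + t_forward + v / d2h_bw

def images_per_second(n_gpus, batch_size, step_time):
    """Assumed throughput: every GPU processes one batch per step."""
    return n_gpus * batch_size / step_time

# Resnet50 on two nodes with one GPU each, using the parameter size from
# Table IV, the bandwidths from Tables VI and VII, and the forward time from Table VII.
RESNET50_BYTES = 102_448_804
step = ring_allreduce_step_time(
    n_gpus=2, v=RESNET50_BYTES,
    net_bw=940 * MBIT, h2d_bw=6.0 * GB, d2h_bw=6.5 * GB, t_forward=0.10)
rate = images_per_second(n_gpus=2, batch_size=32, step_time=step)

if __name__ == "__main__":
    print(f"predicted step time: {step:.2f} s, throughput: {rate:.1f} images/s")
```

Under these assumptions the sketch lands close to the first Resnet50 row of Table VII (about 0.96 seconds per iteration). Swapping `net_bw` for a faster interconnect shows how quickly the network term stops being the bottleneck for a given model.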
TABLE VI
THE SW/HW CONFIGURATIONS OF A CLUSTER NODE.

Software  DNN models: Resnet50, Inception3, Alexnet, Vgg16
          Batch size: 32
          DNN Framework: TensorFlow v.1.7.0
          OS: CentOS 7.3
Hardware  Intel Xeon E5-2620 v4 @ 2.1 GHz w/ 64GB memory
          NVIDIA GTX-1080 (w/ 8GB memory) * 2
          Network bandwidth: 1 Gbps
          Bus bandwidths:
            CPU to GPU: 6.0 GB/s
            GPU to CPU: 6.5 GB/s
            GPU to GPU: 6.0/10.0 GB/s (uni-direction/bi-direction)

Characterizing the system performance. Our performance models are able to reflect the performance characteristics of the model training with the two parameter distribution schemes. When the model parameters are distributed across different model layers (i.e., Resnet50 and Inception3, as depicted in Figure 3), the performance scales when adding more GPUs. It is interesting to note that the parameter server scheme achieves linear speedups as the number of GPUs grows, and it is better than the other scheme, which costs more network time and hence limits the performance.

On the contrary, for the models whose parameters are concentrated in certain layers (i.e., Alexnet and Vgg16), the ring-allreduce scheme delivers better performance than the other, since the performance of the parameter server scheme is dominated by the node that is assigned to handle the big layers.

Choosing the proper network configuration. The predicted performance helps make better decisions for the system design. For example, the network time grows as the number of GPUs increases, as shown in Tables VII and VIII. It is obvious that the current network setting, Gigabit Ethernet, is not sufficient for our workloads. While simply choosing the network with the highest bandwidth may deliver the best results, it may lead to an over-designed and over-budget system.

We use Figure 6 as an example to show how our proposed work helps determine which network bandwidth is sufficient for the workloads using the ring-allreduce scheme. Note that in this experiment, a maximum of 32 GPUs is available in the system formed by eight machine nodes; the configuration with a single GPU is the baseline, and the speedups achieved by 2 to 32 GPUs are reported, where the maximum speedup is 32. The figure plots the predicted performance of the design alternatives with different network bandwidths: 1Gbps, 10Gbps, 40Gbps, 100Gbps, and unlimited.

The results show that Inception3 can take full advantage of the underlying hardware, achieving a 32x speedup when the 10Gbps network is adopted, so a higher-bandwidth network would be a waste. This is not the case for Alexnet, which needs as much network bandwidth as possible in order to get better performance. The reason is that the computation time of Alexnet is relatively short (e.g., 0.01s in Table VII), and it requires frequent data exchanges over the network whenever the computation is done.

V. CONCLUSION

We present performance estimation models for distributed deep learning systems, which are based on the communication schemes and the profiling data. According to the evaluation results with popular communication schemes and popular DNN models, the estimated performance is very close to the results measured on real machines. The error rate is under 5% in most scenarios, and the maximum error rate is below 10%. This shows the feasibility of predicting the performance of distributed deep learning systems before any real machines are built and deployed. It saves the trial-and-error cost and helps ensure that the computation power and the network bandwidth satisfy the deep learning applications.

ACKNOWLEDGMENTS

This work was financially supported by the Ministry of Science and Technology of Taiwan under grants MOST No. 107-2218-E-002-053- and MOST No. 107-2218-E-002-003-.

REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing. High-performance distributed ml at scale through parameter server consistency models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 79–87. AAAI Press, 2015.
[3] H. Qi, E. R. Sparks, and A. Talwalkar. Paleo: A performance model for deep neural networks. ICLR, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
[6] D. Justus, J. Brennan, S. Bonner, and A. S. McGough. Predicting the computational cost of deep learning models. In 2018 IEEE International Conference on Big Data (Big Data), pages 3873–3882, Dec 2018.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[8] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, 2014. USENIX Association.
[9] C.-Y. Liu. Scout j-bench. https://github.com/jcjohnson/cnn-benchmarks, 2018.
[10] C.-Y. Liu. Sofa. https://github.com/cyliustack/sofa.git, 2018.
[11] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna. SCALE-Sim: Systolic CNN Accelerator Simulator. arXiv e-prints, page arXiv:1811.02883, Oct 2018.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[14] C. Witt, M. Bux, W. Gusew, and U. Leser. Predictive performance modeling for distributed computing using black-box monitoring and machine learning. CoRR, abs/1805.11877, 2018.
[15] H. Zhang et al. Poseidon: An efficient communication architecture for distributed deep learning on gpu clusters. arXiv preprint arXiv:1706.03292, 2017.
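TABLE VII
ERROR RATES OF PERFORMANCE ESTIMATION IN RING-ALLREDUCE SCHEME

| # of Nodes | Model | # of GPUs per Node | Forward computation time (s) (A) | Data per iteration (Mbits) | Bandwidth (Mbit/s) | Network time (s) (B) | Memcpy time (s) (C) | Predicted iteration time (s) (A)+(B)+(C) | Predicted img/sec (D) | Real img/sec (E) | Error ((D)-(E))/(E) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | resnet50 | 1 | 0.10 | 781.62 | 940 | 0.83 | 0.03 | 0.96 | 66.43 | 64.99 | 0.7% |
| 2 | resnet50 | 2 | 0.10 | 1172.43 | 940 | 1.25 | 0.05 | 1.39 | 91.79 | 90.92 | 1.0% |
| 2 | inception3 | 1 | 0.14 | 727.96 | 940 | 0.77 | 0.03 | 0.94 | 68.03 | 68.01 | 0.0% |
| 2 | inception3 | 2 | 0.14 | 1091.94 | 940 | 1.16 | 0.04 | 1.34 | 95.36 | 93.55 | 1.9% |
| 2 | alexnet | 1 | 0.01 | 1887.28 | 940 | 2.01 | 0.07 | 2.10 | 30.54 | 29.07 | 5.0% |
| 2 | alexnet | 2 | 0.01 | 2830.92 | 940 | 3.01 | 0.11 | 3.14 | 40.81 | 39.72 | 2.7% |
| 2 | vgg16 | 1 | 0.15 | 4222.46 | 940 | 4.49 | 0.17 | 4.80 | 13.32 | 12.41 | 7.4% |
| 2 | vgg16 | 2 | 0.15 | 6333.69 | 940 | 6.74 | 0.25 | 7.13 | 17.95 | 17.51 | 2.5% |
| 4 | resnet50 | 1 | 0.10 | 1172.43 | 940 | 1.25 | 0.05 | 1.39 | 91.79 | 90.39 | 1.5% |
| 4 | resnet50 | 2 | 0.10 | 1367.84 | 940 | 1.46 | 0.05 | 1.61 | 159.01 | 149.72 | 6.2% |
| 4 | inception3 | 1 | 0.14 | 1091.94 | 940 | 1.16 | 0.04 | 1.34 | 95.36 | 92.91 | 2.6% |
| 4 | inception3 | 2 | 0.14 | 1273.93 | 940 | 1.36 | 0.05 | 1.54 | 165.91 | 151.21 | 9.7% |
| 4 | alexnet | 1 | 0.01 | 2830.92 | 940 | 3.01 | 0.11 | 3.14 | 40.81 | 38.25 | 6.7% |
| 4 | alexnet | 2 | 0.01 | 3302.74 | 940 | 3.51 | 0.13 | 3.66 | 70.00 | 68.32 | 2.5% |
| 4 | vgg16 | 1 | 0.15 | 6333.69 | 940 | 6.74 | 0.25 | 7.13 | 17.95 | 16.71 | 7.4% |
| 4 | vgg16 | 2 | 0.15 | 7389.31 | 940 | 7.86 | 0.29 | 8.30 | 30.86 | 30.19 | 2.2% |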
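The column arithmetic of Table VII can be sketched in a few lines of Python. This is a minimal illustration of the table's stated relations, (B) = data / bandwidth, iteration time = (A)+(B)+(C), and error = ((D)-(E))/(E), and not the authors' SOFA-based implementation. The names `Row`, `network_time`, `iteration_time`, `relative_error`, and `ring_allreduce_factor` are hypothetical. The factor 2(N-1)/N is the standard per-GPU traffic multiplier for ring-allreduce; this section does not state it, but the resnet50 data volumes for 2, 4, and 8 GPUs appear to follow it.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One Table VII row (names are illustrative, not from the paper)."""
    forward_s: float         # (A) forward computation time in seconds
    data_mbits: float        # data exchanged per iteration in Mbits
    bandwidth_mbit_s: float  # measured network bandwidth in Mbit/s
    memcpy_s: float          # (C) memcpy time in seconds


def network_time(row: Row) -> float:
    """(B): time to move one iteration's data over the network."""
    return row.data_mbits / row.bandwidth_mbit_s


def iteration_time(row: Row) -> float:
    """Predicted iteration time, (A) + (B) + (C)."""
    return row.forward_s + network_time(row) + row.memcpy_s


def relative_error(predicted: float, real: float) -> float:
    """Error column, ((D) - (E)) / (E), as a fraction."""
    return (predicted - real) / real


def ring_allreduce_factor(num_gpus: int) -> float:
    """Standard ring-allreduce traffic multiplier 2(N-1)/N (an assumption
    here; the section does not spell out the formula)."""
    return 2 * (num_gpus - 1) / num_gpus


# resnet50, 2 nodes x 1 GPU per node, values taken from Table VII.
resnet50_2x1 = Row(forward_s=0.10, data_mbits=781.62,
                   bandwidth_mbit_s=940, memcpy_s=0.03)
print(f"network time   : {network_time(resnet50_2x1):.2f} s")
print(f"iteration time : {iteration_time(resnet50_2x1):.2f} s")

# alexnet, 2 nodes x 1 GPU per node: predicted 30.54 vs. real 29.07 img/sec.
print(f"alexnet error  : {relative_error(30.54, 29.07):.1%}")

# Scaling the 2-GPU resnet50 volume by factor(N) / factor(2).
for n in (4, 8):
    scaled = 781.62 * ring_allreduce_factor(n) / ring_allreduce_factor(2)
    print(f"{n} GPUs: {scaled:.2f} Mbits per iteration")
```

Converting the iteration time into img/sec additionally needs the number of images processed per iteration, which this section does not state, so the sketch stops at the iteration time.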
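TABLE VIII
ERROR RATES OF PERFORMANCE ESTIMATION IN PARAMETER SERVER SCHEME

| # of Nodes | Model | # of GPUs per Node | Forward computation time (s) (A) | Data per iteration (Mbits) | Bandwidth (Mbit/s) | Network time (s) (B) | Memcpy time (s) (C) | Predicted iteration time (s) (A)+(B)+(C) | Predicted img/sec (D) | Real img/sec (E) | Error ((D)-(E))/(E) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | resnet50 | 1 | 0.10 | 826.87 | 940 | 0.88 | 0.03 | 1.01 | 63.27 | 66.15 | -4.4% |
| 2 | resnet50 | 2 | 0.10 | 826.87 | 940 | 0.88 | 0.03 | 1.01 | 126.42 | 131.58 | -3.9% |
| 2 | inception3 | 1 | 0.14 | 781.58 | 940 | 0.83 | 0.03 | 1.00 | 64.14 | 67.79 | -5.4% |
| 2 | inception3 | 2 | 0.14 | 781.58 | 940 | 0.83 | 0.03 | 1.00 | 128.14 | 136.74 | -6.3% |
| 2 | alexnet | 1 | 0.01 | 3087.26 | 940 | 3.28 | 0.07 | 3.37 | 18.98 | 18.23 | 4.1% |
| 2 | alexnet | 2 | 0.01 | 3087.26 | 940 | 3.28 | 0.08 | 3.38 | 37.84 | 36.91 | 2.5% |
| 2 | vgg16 | 1 | 0.15 | 7562.16 | 940 | 8.04 | 0.17 | 8.36 | 7.66 | 7.08 | 8.2% |
| 2 | vgg16 | 2 | 0.15 | 7562.16 | 940 | 8.04 | 0.22 | 8.41 | 15.23 | 13.94 | 9.2% |
| 4 | resnet50 | 1 | 0.10 | 1303.54 | 940 | 1.39 | 0.03 | 1.52 | 84.28 | 77.58 | 8.6% |
| 4 | resnet50 | 2 | 0.10 | 1303.54 | 940 | 1.39 | 0.03 | 1.52 | 168.47 | 154.77 | 8.8% |
| 4 | inception3 | 1 | 0.14 | 1311.37 | 940 | 1.40 | 0.03 | 1.56 | 81.98 | 77.73 | 5.5% |
| 4 | inception3 | 2 | 0.14 | 1311.37 | 940 | 1.40 | 0.03 | 1.56 | 163.84 | 156.13 | 4.9% |
| 4 | alexnet | 1 | 0.01 | 7694.19 | 940 | 8.19 | 0.07 | 8.27 | 15.47 | 14.87 | 4.0% |
| 4 | alexnet | 2 | 0.01 | 7694.19 | 940 | 8.19 | 0.08 | 8.28 | 30.90 | 31.55 | -2.0% |
| 4 | vgg16 | 1 | 0.15 | 20358.28 | 940 | 21.66 | 0.17 | 21.97 | 5.83 | 5.58 | 4.4% |
| 4 | vgg16 | 2 | 0.15 | 20358.28 | 940 | 21.66 | 0.22 | 22.02 | 11.63 | 11.64 | -0.1% |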
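To summarize Table VIII in a single figure, the error column can be recomputed from the predicted and real throughputs and then aggregated. The sketch below does this for the sixteen parameter-server rows. The aggregate statistics (mean and maximum absolute error) are an illustrative choice and are not reported in the source. The names `TABLE_VIII` and `signed_errors` are hypothetical.

```python
from statistics import mean

# (predicted img/sec (D), real img/sec (E)) pairs copied from Table VIII,
# in row order: 2 nodes first, then 4 nodes.
TABLE_VIII = [
    (63.27, 66.15), (126.42, 131.58),   # resnet50, 2 nodes
    (64.14, 67.79), (128.14, 136.74),   # inception3, 2 nodes
    (18.98, 18.23), (37.84, 36.91),     # alexnet, 2 nodes
    (7.66, 7.08), (15.23, 13.94),       # vgg16, 2 nodes
    (84.28, 77.58), (168.47, 154.77),   # resnet50, 4 nodes
    (81.98, 77.73), (163.84, 156.13),   # inception3, 4 nodes
    (15.47, 14.87), (30.90, 31.55),     # alexnet, 4 nodes
    (5.83, 5.58), (11.63, 11.64),       # vgg16, 4 nodes
]


def signed_errors(rows):
    """Error column of the table: ((D) - (E)) / (E) for every row."""
    return [(pred - real) / real for pred, real in rows]


errors = signed_errors(TABLE_VIII)
abs_errors = [abs(e) for e in errors]

print(f"mean absolute error: {mean(abs_errors):.1%}")
print(f"max absolute error : {max(abs_errors):.1%}")
# Negative errors mean the model under-predicted the measured throughput.
print(f"under-predicted rows: {sum(e < 0 for e in errors)} of {len(errors)}")
```

The signed errors show that the predictor under-estimates throughput for some rows (mostly resnet50 and inception3 on two nodes) and over-estimates it for the others, so a signed average would hide part of the deviation.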
Fig. 6. Predicting the performance speedups when training the models with the ring-allreduce scheme under five different network bandwidths with 1 to 32
GPUs.
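The bandwidth-selection idea behind Fig. 6 can be illustrated with a small sketch: for each candidate bandwidth, predict the iteration time and pick the cheapest option that comes close enough to the unlimited-bandwidth case. The sketch assumes the simple additive model from the table columns (computation + data/bandwidth + memcpy) with no overlap between computation and communication; the authors' predictor may differ, so this does not reproduce the curves in Fig. 6. The function names, the `target_efficiency` threshold, and the workload numbers are all hypothetical.

```python
from typing import Iterable, Optional


def iteration_time(compute_s: float, traffic_mbits: float, memcpy_s: float,
                   bandwidth_mbit_s: Optional[float]) -> float:
    """Additive iteration-time model; bandwidth None means 'unlimited'."""
    network_s = 0.0 if bandwidth_mbit_s is None else traffic_mbits / bandwidth_mbit_s
    return compute_s + network_s + memcpy_s


def min_sufficient_bandwidth(candidates_mbit_s: Iterable[float],
                             compute_s: float, traffic_mbits: float,
                             memcpy_s: float,
                             target_efficiency: float = 0.95) -> Optional[float]:
    """Return the lowest candidate bandwidth whose predicted iteration time
    is within target_efficiency of the unlimited-bandwidth case, or None
    if no candidate qualifies."""
    best = iteration_time(compute_s, traffic_mbits, memcpy_s, None)
    for bw in sorted(candidates_mbit_s):
        achieved = best / iteration_time(compute_s, traffic_mbits, memcpy_s, bw)
        if achieved >= target_efficiency:
            return bw
    return None


# Hypothetical workload: 1.0 s of computation and 1000 Mbits exchanged per
# iteration, evaluated against the four finite bandwidths plotted in Fig. 6.
candidates = [1_000, 10_000, 40_000, 100_000]  # Mbit/s
print(min_sufficient_bandwidth(candidates, compute_s=1.0,
                               traffic_mbits=1000.0, memcpy_s=0.0,
                               target_efficiency=0.90))
```

A compute-heavy workload reaches the threshold at a low bandwidth, whereas a workload with a short computation time and large per-iteration traffic keeps benefiting from faster networks, which matches the qualitative contrast between Inception3 and Alexnet described in the text.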
978-1-7281-4484-9/19/$31.00 ©2019 IEEE 169
