SPINN (MobiCom'20)


PREPRINT: Accepted at the 26th Annual International Conference on Mobile Computing and Networking (MobiCom), 2020

SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
Stefanos Laskaridis† *, Stylianos I. Venieris† *,
Mario Almeida† *, Ilias Leontiadis† *, Nicholas D. Lane†,‡
† Samsung AI Center, Cambridge ‡ University of Cambridge
* Indicates equal contribution.
{stefanos.l,s.venieris,mario.a,i.leontiadis,nic.lane}@samsung.com
arXiv:2008.06402v2 [cs.LG] 24 Aug 2020

ABSTRACT
Despite the soaring use of convolutional neural networks (CNNs) in mobile applications, uniformly sustaining high-performance inference on mobile has been elusive due to the excessive computational demands of modern CNNs and the increasing diversity of deployed devices. A popular alternative comprises offloading CNN processing to powerful cloud-based servers. Nevertheless, by relying on the cloud to produce outputs, emerging mission-critical and high-mobility applications, such as drone obstacle avoidance or interactive applications, can suffer from the dynamic connectivity conditions and the uncertain availability of the cloud. In this paper, we propose SPINN, a distributed inference system that employs synergistic device-cloud computation together with a progressive inference method to deliver fast and robust CNN inference across diverse settings. The proposed system introduces a novel scheduler that co-optimises the early-exit policy and the CNN splitting at run time, in order to adapt to dynamic conditions and meet user-defined service-level requirements. Quantitative evaluation illustrates that SPINN outperforms its state-of-the-art collaborative inference counterparts by up to 2× in achieved throughput under varying network conditions, reduces the server cost by up to 6.8× and improves accuracy by 20.7% under latency constraints, while providing robust operation under uncertain connectivity conditions and significant energy savings compared to cloud-centric execution.

CCS CONCEPTS
• Computing methodologies → Distributed computing methodologies; • Human-centered computing → Ubiquitous and mobile computing.

KEYWORDS
Deep neural networks, distributed systems, mobile computing

1 INTRODUCTION
With the spectrum of CNN-driven applications expanding rapidly, their deployment across mobile platforms poses significant challenges. Modern CNNs [20, 68] have excessive computational demands that hinder their wide adoption in resource-constrained mobile devices. Furthermore, emerging user-facing [80] and mission-critical [37, 42, 66] CNN applications require low-latency processing to ensure high quality of experience (QoE) [4] and safety [11].

Given the recent trend of integrating powerful System-on-Chips (SoCs) in consumer devices [2, 25, 78], direct on-device CNN execution is becoming possible (Figure 1 - top left). Nevertheless, while flagship devices can support the performance requirements of CNN workloads, the current landscape is still very diverse, including previous-gen and low-end models [80]. In this context, the less powerful low-tier devices struggle to consistently meet the application-level performance needs [2].

Figure 1: Existing methods vs. SPINN. (Top panels: device-only, synergistic inference, server-only, each ending in a final exit. Bottom panel: SPINN: Synergistic Progressive Inference. Legend: mobile device, edge/cloud server, network transfer, optional transfer, early exit 1 to early exit N, final exit.)

As an alternative, service providers typically employ cloud-centric solutions (Figure 1 - top right). With this setup, inputs collected by mobile devices are transmitted to a remote server to perform CNN inference using powerful accelerators [3, 6, 12, 19, 31, 32]. However, this extra computation capability comes at a price. First, cloud execution is highly dependent on the dynamic network conditions, with performance dropping radically when the communication channel is degraded. Second, hosting resources capable of accelerating machine learning tasks comes at a significant cost [40]. Moreover, while public cloud providers offer elastic cost scaling, there are also privacy and security concerns [64].

To address these limitations, a recent line of work [22, 34, 46] has proposed the collaboration between device and cloud for CNN inference (Figure 1 - top center). Such schemes typically treat the CNN as a computation graph and partition it between device and cloud. At run time, the client executes the first part of the model and transmits the intermediate results to a remote server. The server continues the model execution and returns the final result back to the device. Overall, this approach allows tuning the fraction of the CNN that will be executed on each platform based on their capabilities.

Despite their advantages, existing device-cloud collaborative inference solutions suffer from a set of limitations. First, similar to cloud execution, the QoE is greatly affected by the network conditions as execution can fail catastrophically when the link is severely deteriorated. This lack of network fault tolerance also prevents the use of more cost-efficient cloud solutions, e.g. using ephemeral
spare cloud resources at a fraction of the price.¹ Furthermore, CNNs are increasingly deployed in applications with stringent demands across multiple dimensions (e.g. target latency, throughput and accuracy, or device and cloud costs).² Existing collaborative methods cannot sufficiently meet these requirements.

¹ AWS Spot Instances: https://aws.amazon.com/ec2/spot/.
² Typically expressed as service-level agreements (SLAs).

To this end, we present SPINN, a distributed system that enables robust CNN inference in highly dynamic environments, while meeting multi-objective application-level requirements (SLAs). This is accomplished through a novel scheduler that takes advantage of progressive inference, a mechanism that allows the system to exit early at different parts of the CNN during inference, based on the input complexity (Figure 1 - bottom). The scheduler optimises the overall execution by jointly tuning both the split point selection and the early-exit policy at run time to sustain high performance and meet the application SLAs under fluctuating resources (e.g. network speed, device/server load). The guarantee of a local early exit renders server availability non-critical and enables robust operation even under uncertain connectivity. Overall, this work makes the following key contributions:
• A progressive inference mechanism that enables the fast and reliable execution of CNN inference across device and cloud. Concretely, on top of existing early-exit designs, we propose an early-exit-aware cancellation mechanism that allows the interruption of the (local/remote) inference when having a confident early prediction, thus minimising redundant computation and transfers during inference. Simultaneously, reflecting on the uncertain connectivity of mobile devices, we design an early-exit scheme with robust execution in mind, even under severe connectivity disruption or cloud unavailability. By carefully placing the early exits in the backbone network and allowing for graceful fallback to locally available results, we guarantee the responsiveness and reliability of the system and overcome limitations of existing offloading systems.
• A CNN-specific packing mechanism that exploits the reduced-precision resilience and sparsity of CNN workloads to minimise transfer overhead. Our communication optimiser combines a lossless and an accuracy-aware lossy compression component which exposes previously unattainable designs for collaborative inference, while not sacrificing the end accuracy of the system.
• An SLA- and condition-aware scheduler that co-optimises i) the early-exit policy of progressive CNNs and ii) their partitioning between device and cloud at run time. The proposed scheduler employs a multi-objective framework to capture the user-defined importance of multiple performance metrics and translate them into SLAs. Moreover, by surveilling the volatile network conditions and resource load at run time, the scheduler dynamically selects the configuration that yields the highest performance by taking into account contextual runtime information and feedback from previous executions.

2 BACKGROUND AND RELATED WORK
To optimise the execution of CNN workloads, several solutions have been proposed, from compiler [1, 30, 65] and runtime optimisations [36, 43, 49] to custom cloud [7, 19, 34] and accelerator designs [75, 79]. While these works target a single model with device- or cloud-only execution, the increased computational capabilities of client devices [2, 25] have led to schemes that maximise performance via device-cloud synergy. Next, we outline significant work in this direction and visit approximate computing alternatives which exploit accuracy-latency trade-offs during inference.

Approximate Inference. In applications that can tolerate some accuracy drop, a line of work [9, 18, 45] exploits the accuracy-latency trade-off through various techniques. In particular, NestDNN [9] employs a multi-capacity model that incorporates multiple descendant (i.e. pruned) models to expose an accuracy-complexity trade-off mechanism. However, such models cannot be natively split between device and cloud. On the other hand, model selection systems [18, 45] employ multiple variants of a single model (e.g. quantised, pruned) with different accuracy-latency trade-offs. At run time, they choose the most appropriate variant based on the application requirements and determine where it will be executed. Similarly, classifier cascades [21, 33, 38, 39, 71] require multiple models to obtain performance gains. Despite the advantages of both, using multiple models adds substantial overhead in terms of maintenance, training and deployment.

Progressive Inference Networks. A growing body of work from both the research [23, 35, 72, 81, 84] and industry communities [55, 74] has proposed transforming a given model into a progressive inference network by introducing intermediate exits throughout its depth. By exploiting the different complexity of incoming samples, easier examples can early-exit and save on further computations. So far, existing works have mainly explored the hand-crafted design of early-exit architectures (MSDNet [23], SCAN [84]), the platform- and SLA-agnostic derivation of early-exit networks from generic models (BranchyNet [72], SDN [35]) or the hardware-aware deployment of such networks (HAPI [44]). Despite the recent progress, these techniques have not capitalised upon the unique potential of such models to yield high mobile performance through distributed execution and app-tailored early-exiting. In this context, SPINN is the first progressive inference approach equipped with a principled method of selectively splitting execution between device and server, while also tuning the early-exit policy, enabling high performance across dynamic settings.

Device-Cloud Synergy for CNN Inference. To achieve efficient CNN processing, several works have explored collaborative computation over device, edge and cloud. One of the most prominent pieces of work, Neurosurgeon [34], partitions the CNN between a device-mapped head and a cloud-mapped tail and selects a single split point based on the device and cloud load as well as the network conditions. Similarly, DADS [22] tackles CNN offloading, but from a scheduler-centric standpoint, with the aim to yield the optimal partitioning scheme in the case of high and low server load. However, both systems only optimise for single-criterion objectives (latency or energy consumption), they lack support for app-specific SLAs, and suffer catastrophically when the remote server is unavailable. With a focus on data transfer, JALAD [47] incorporates lossy compression to minimise the offload transmission overhead. Nevertheless, to yield high performance, the proposed system sacrifices substantial accuracy (i.e. >5%). JointDNN [8] modelled CNN offloading as a graph split problem, but targets only
offline scheduling and static environments instead of highly dynamic mobile settings. Contrary to these systems, SPINN introduces a novel scheduler that adapts the execution to the dynamic contextual conditions and jointly tunes the offloading point and early-exit policy to meet the application-level requirements. Moreover, by guaranteeing the presence of a local result, SPINN provides resilience to server disconnections.

Apart from offloading CNNs to a dedicated server, a number of works have focused on tangential problems. IONN [29] tackles a slightly different problem, where instead of preinstalling the CNN model to a remote machine, the client device can offload to any close-by server by transmitting both the incoming data and the model in a shared-nothing setup. Simultaneously, various works [50, 51, 85] have examined the case where the client device can offload to other devices in the local network. Last, [73] also employs cloud-device synergy and progressive inference, but with a very different focus, i.e. to perform joint classification from a multi-view, multi-camera standpoint. Its models, though, are statically allocated to devices and its fixed, statically-defined early-exit policy renders it impractical for dynamic environments.

Offloading Multi-exit Models. Closer to our approach, Edgent [46] proposes a way of merging offloading with multi-exit models. Nonetheless, this work has several limitations. First, the inference workflow disregards data locality and always starts from the cloud. Consequently, inputs are always transmitted, paying an additional transfer cost. Second, early-exit networks are not utilised with progressive inference, i.e. inputs do not early-exit based on their complexity. Instead, Edgent tunes the model’s complexity by selecting a single intermediary exit for all inputs. Therefore, the end system does not benefit from the variable complexity of inputs. Finally, the system has been evaluated solely on simple models (AlexNet) and datasets (CIFAR-10), less impacted by low-latency or unreliable network conditions. In contrast, SPINN exploits the fact that data already reside on the device to avoid wasteful input transfers, and employs a CNN-tailored technique to compress the offloaded data. Furthermore, not only does our scheduler support additional optimisation objectives, but it also takes advantage of the input’s complexity to exit early, saving resource usage with minimal impact on accuracy.

3 PROPOSED SYSTEM
To remedy the limitations of existing systems, SPINN employs a progressive inference approach to alleviate the hard requirement for reliable device-server communication. The proposed system introduces a scheme of distributing progressive early-exit models across device and server, in which one exit is always present on-device, guaranteeing the availability of a result at all times. Moreover, as early exits along the CNN provide varying levels of accuracy, SPINN casts the acceptable prediction confidence as a tunable parameter to adapt its accuracy-speed trade-off. Alongside, we propose a novel run-time scheduler that jointly tunes the split point and early-exit policy of the progressive model, yielding a deployment tailored to the application performance requirements under dynamic conditions. Next, we present SPINN’s high-level flow followed by a description of its components.

Figure 2: Overview of SPINN’s architecture. (Offline: ① progressive inference model generator, covering early classifier positioning and joint training on the training dataset; ② model splitter; ③ profiler, using the calibration dataset to produce profiler data. Online: ④ scheduler, a multi-objective optimisation that takes the target SLAs and runtime conditions and outputs the confidence threshold and split point; ⑤ CNN communication optimiser (CNN-CO), covering quantisation and compression; ⑥ execution engine across device and edge/cloud, with the profiler monitoring and updating its data.)

3.1 Overview
SPINN comprises offline components, run once before deployment, and online components, which operate at run time. Figure 2 shows a high-level view of SPINN. Before deployment, SPINN obtains a CNN model and derives a progressive inference network. This is accomplished by introducing early exits along its architecture and jointly training them using the supplied training set (Section 3.2 ①). Next, the model splitter component (Section 3.3 ②) identifies all candidate points in the model where computation can be split between device and cloud. Subsequently, the offline profiler (Section 3.4 ③) calculates the exit-rate behaviour of the generated progressive model as well as the accuracy of each classifier. Moreover, it measures its performance on the client and server, serving as initial inference latency estimates.

At run time, the scheduler (Section 3.5 ④) obtains these initial timings along with the target SLAs and run-time conditions and decides on the optimal split and early-exit policy. Given a split point, the communication optimiser (Section 3.6 ⑤) exploits the CNN’s sparsity and resilience to reduced bitwidth to compress the data transfer and increase the bandwidth utilisation. The execution engine (Section 3.7 ⑥) then orchestrates the distributed inference execution, handling all communication between partitions. Simultaneously, the online profiler monitors the execution across inferences, as well as the contextual factors (e.g. network, device/server load) and updates the initial latency estimates. This
way, the system can adapt to the rapidly-changing environment, reconfigure its execution and maintain the same QoE.

Figure 3: Accuracy of progressive networks across different confidence threshold values. (Panels: Inception-v3 and ResNet-56 on CIFAR-100, Inception-v3 and ResNet-50 on ImageNet; x-axis: threshold; y-axis: accuracy; the accuracy of the last exit is shown as a reference line.)

Figure 4: Early-exit CDF across different confidence threshold values. (Panels: Inception-v3 - CIFAR, Inception-v3 - ImageNet, ResNet-56 - CIFAR, ResNet-50 - ImageNet; x-axis: exit index, 0 to 6; y-axis: cumulative exit rate; one curve per threshold: 0.7, 0.8, 0.9.)

3.2 Progressive Inference Model Generator
Given a CNN model, SPINN derives a progressive inference network (Figure 2 ①). This process comprises a number of key design decisions: 1) the number, position and architecture of intermediate classifiers (early exits), 2) the training scheme and 3) the early-exit policy.

Early Exits. We place the intermediate classifiers along the depth of the architecture with equal distance in terms of FLOP count. With this platform-agnostic positioning strategy, we are able to obtain a progressive inference model that supports a wide range of latency budgets while being portable across devices. With respect to their number, we introduce six early exits in order to guarantee their convergence when trained jointly [35, 48], placed at 15%, 30%, ..., 90% of the network’s total FLOPs. Last, we treat the architecture of the early exits as an invariant, adopting the design of [23], so that all exits have the same expressivity [59].

Training Scheme. We jointly train all classifiers from scratch and employ the cost function introduced in [35] as follows: $L = \sum_{i=0}^{N-1} \tau_i \cdot L_i$, with $\tau_i$ starting uniformly at 0.01 and increasing linearly to a maximum value of $C_i$, which is the relative position of the classifier in the network ($C_0 = 0.15$, $C_1 = 0.3$, ..., $C_{final} = 1$). The rationale behind this is to address the problem of “overthinking” [35], where some samples can be correctly classified by early exits while being misclassified deeper in the network. This scheme requires the fixed placement of early exits prior to the training stage. Despite the inflexibility of this approach to search over different early-exit placements, it yields higher accuracy compared to the two-staged approach of training the main network and the early classifiers in isolation. In terms of training time, the end-to-end early-exit networks can take from 1.2× to 2.5× the time of the original network training, depending on the architecture and number of exits. In fact, the higher training overhead happens when the ratio $\frac{FLOPS_{early\_classifier}}{FLOPS_{original\_network}}$ is higher. However, given that this cost is paid once offline, it is quickly amortised by the runtime latency benefits of early exiting on confident samples.

Early-exit Policy. For the early-exit strategy, we estimate a classifier’s confidence for a given input using the top-1 output value of its softmax layer (Eq. (1)) [13]. An input takes the i-th exit if the prediction confidence is higher than a tunable threshold, $thr_{conf}$ (Eq. (2)). The exact value of $thr_{conf}$ provides a trade-off between the latency and accuracy of the progressive model and determines the early-exit policy. At run time, SPINN’s scheduler periodically tunes $thr_{conf}$ to customise the execution to the application’s needs. If none of the classifiers reaches the confidence threshold, the most confident among them is used as the output prediction (Eq. (3)).

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{(Softmax of i-th exit)} \tag{1}$$

$$\arg_i \left\{ \max_i \{\mathrm{softmax}_i\} > thr_{conf} \right\} \qquad \text{(Check i-th exit's top-1)} \tag{2}$$

$$\arg\max_{j \in classifiers} \left\{ \max_i \{\mathrm{softmax}_i^j\} \right\} \qquad \text{(Return most confident)} \tag{3}$$

where $z_i$ is the output of the final fully-connected layer for the i-th label, $K$ the total number of labels, $j \in [0, 6]$ the classifier index and $thr_{conf}$ the tunable confidence threshold.

Impact of Confidence Threshold. Figures 3 and 4 illustrate the impact of different early-exit policies on the accuracy and early-exit rate of progressive models, by varying the confidence threshold ($thr_{conf}$). Additionally, Figure 3 reports on the accuracy without progressive inference (i.e. last exit only, represented by the red dotted line). Note that exiting only at the last exit can lead to lower accuracy than the progressive models for some architectures, a phenomenon that can be attributed to the problem of “overthinking”³.

Based on the figures, we draw two major conclusions that guide the design of SPINN’s scheduler. First, across all networks, we observe a monotonic trend with higher thresholds leading to higher accuracies (Figure 3) while lower ones lead to more samples exiting earlier (Figure 4). This exposes the confidence threshold as a tunable parameter to control accuracy and overall processing time. Second, different networks behave differently, producing confident predictions at different exits along the architecture. For example, Inception-v3 on CIFAR-100 can obtain a confident prediction earlier on, whereas ResNet-50 on ImageNet cannot classify robustly from early features only. In this respect, we conclude that optimising the confidence threshold for each CNN explicitly is key for tailoring the deployment to the target requirements.

³ “Overthinking” [35] refers to the case where certain samples that would be misclassified by the final classifier of the network are classified correctly if they exit early. This leads to small accuracy benefits of progressive inference networks that neither the original model would have (due to early-exiting) nor a single-exit smaller variant (due to late-exiting).

3.3 Model Splitter
After deriving a progressive model from the original CNN, SPINN aims to split its execution across a client and a server in order to dynamically utilise remote resources as required by the application SLAs. The model splitter (Figure 2 ②) is responsible for 1) defining the potential split points and 2) identifying them automatically in the given CNN.
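
As an illustration of the early-exit policy of Section 3.2 (Eqs. (1)-(3)), the following is a minimal Python sketch, not SPINN's implementation. The function names `softmax` and `early_exit_predict`, and the representation of the network as an iterable yielding one logit vector per classifier in depth order, are assumptions made for this example only.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Eq. (1): softmax over one exit's logits (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    exit_logits: Iterable[Sequence[float]], thr_conf: float
) -> Tuple[int, int, float]:
    """Return (exit index, predicted label, confidence).

    exit_logits yields one logit vector per classifier, in depth order
    (hypothetical interface). Passing a generator lets the caller skip
    computing deeper exits once an earlier one is confident enough.
    """
    best = None  # (confidence, exit index, label) of the most confident exit so far
    for j, logits in enumerate(exit_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[label]  # top-1 softmax value of exit j
        if conf > thr_conf:  # Eq. (2): take the first sufficiently confident exit
            return j, label, conf
        if best is None or conf > best[0]:
            best = (conf, j, label)
    if best is None:
        raise ValueError("at least one classifier output is required")
    conf, j, label = best  # Eq. (3): fall back to the most confident exit
    return j, label, conf
```

Raising `thr_conf` pushes more inputs towards deeper exits, while lowering it lets more inputs exit early, which is the accuracy-latency knob that the scheduler tunes.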
Figure 5: ResNet-56 inference times for different network conditions (4G and 5G) and client device compute capabilities (Jetson 10W and u10W), when offloading to the cloud without early exits. (Panels: 4G (Bandw=24Mbps, Lat=40ms) and 5G (Bandw=250Mbps, Lat=1ms), each for the u10W and 10W clients; x-axis: split index (normalised); y-axis: inference time (s); series: local client execution, client compute time, client packing, network transfer time, server unpacking, server compute time.)

Split Point Decision Space. CNNs typically consist of a sequence of layers, organised in a feed-forward topology. SPINN adopts a partition scheme which allows splitting the model along its depth at layer granularity. For a CNN with $N_L$ layers, there are $N_L - 1$ candidate points, leading to $2^{N_L - 1}$ possible partitions. To reduce the search space and minimise the number of transmissions across the network, we make two key observations. First, since CNN final outputs are rather small, once execution is offloaded to the powerful remote server, there is no gain in having two or more split points as this would incur extra communication costs. Second, different layer splits have varying transmission costs and compression potential. For example, activation layers such as ReLU [54] cap negative values to zero, which means that their output becomes more compressible [56, 60, 82] and they can be more efficiently transferred by SPINN’s communication optimiser (Section 3.6). Therefore, while SPINN’s design supports an arbitrary number of split points and layers, in this work, we allow one split point per CNN and reduce the candidate split points to ReLU layers.

Automatic Split Point Identification. To automatically detect all candidate split points in the given CNN, the model splitter employs a dynamic analysis approach. This is performed by first constructing the execution graph of the model in the target framework (e.g. PyTorch), identifying all split points and the associated dependencies, and then applying SPINN’s partitioning scheme to yield the final pruned split point space. The resulting set of points defines the allowed partitions that can be selected by the scheduler (Section 3.5).

Impact of Split Point. To investigate how the split point selection affects the overall latency, we run multiple CNN splits between an Nvidia Jetson Xavier client and a server (experimental setup detailed in Section 4). Figure 5 shows the breakdown of ResNet-56’s inference times with CIFAR-100 over distinct network conditions and client compute capabilities (u10W and 10W). Packing refers to the runtime of the communication optimiser module, detailed in Section 3.6.

Based on the figure, we make two observations. First, different split points yield varying trade-offs in client, server and transfer time. For example, earlier splits execute more on the server, while later ones execute more on-device but often with smaller transfer requirements⁴. This indicates that SPINN’s scheduler can selectively choose one to minimise any given runtime (e.g. device, server or transfer), as required by the user. Second, dynamic conditions such as the connectivity and the device compute capabilities play an important role in shaping the split point latency characteristics illustrated in Figure 5. For example, a lower-end client (u10W) or a loaded server would require longer to execute its allocated split, while low bandwidth can increase the transfer time. This indicates that it is hard to statically identify the best split point and highlights the need for an informed partitioning that adapts to the environmental conditions in order to meet the application-level performance requirements.

⁴ Note that independently of the amount of transmitted data, there is always a network latency overhead that must be amortised, which in the case of 4G is quite significant.

3.4 Profiler
Given the varying trade-offs of different split points and confidence thresholds, SPINN considers the client and server load, the networking conditions and the expected accuracy in order to select the most suitable configuration. To estimate this set of parameters, the profiler (Figure 2 ③) operates in two stages: i) offline and ii) run-time.

Offline stage: In the offline phase, the profiler performs two kinds of measurements, device-independent and device-specific. The former include CNN-specific metrics, such as 1) the size of data to be transmitted for each candidate split and 2) the average accuracy of the progressive CNN for different confidence thresholds. These are measured only once prior to deployment. Next, the profiler needs to obtain latency estimates that are specific to each device. To this end, the profiler measures the average execution time per layer by passing the CNN through a calibration set, sampled from the validation set of the target task. The results serve as the initial latency and throughput estimates.

Run-time stage: At run time, the profiler refines its offline estimates by regularly monitoring the device and server load, as well as the connectivity conditions. To keep the profiler lightweight, instead of adopting a more accurate but prohibitively expensive estimator, we employ a 2-staged linear model to estimate the inference latency on-device.

In the first step, the profiler measures the actual on-device execution time up to the split point $s$, denoted by $T^{real}_{\langle s \rangle}$, during each inference. Next, it calculates a latency scaling factor $SF$ as the ratio between the actual time and the offline latency estimate up to the split $s$, i.e. $SF = \frac{T^{real}_{\langle s \rangle}}{T^{offline}_{\langle s \rangle}}$. As a second step, the profiler treats the scaling factor as an indicator of the load of the client device, and uses it to estimate the latency of all other candidate splits. Thus, the latency of a different split $s'$ is estimated as $SF \cdot T^{offline}_{\langle s' \rangle}$.

Similarly, to assess the server load, the remote endpoint’s compute latency is optionally communicated back to the device, piggybacked with the CNN response when offloading. If the server does not communicate back latencies, to preserve the privacy of the provider, these can be coarsely estimated as $T^{server}_{\langle s \rangle} = T^{response}_{\langle s,e \rangle} - \left(L + \frac{D_{response}}{B}\right)$, where $T^{response}_{\langle s,e \rangle}$ is the total time for the server to respond with the result for split point $s$ and exit $e$, $D_{response}$ is the size of transferred data and $B$, $L$ are the instantaneous network
bandwidth and latency respectively. We periodically offload to the server without stopping the local execution to reassess when the server transitions from “overloaded” to “accepting requests”.

To estimate the instantaneous network transfer latency, the profiler employs a run-time monitoring mechanism of the bandwidth $B$ and latency $L$ experienced by the device [15]. The overall transfer time is $L + \frac{D_{\langle s \rangle}}{B}$, where $D_{\langle s \rangle}$ is the amount of data to be transferred given split $s$. As the network conditions change, the monitoring module refines its estimates by means of two moving averages: a real-time estimation ($L_{rt}$, $B_{rt}$) and a historical moving average ($L_{hist}$, $B_{hist}$). The former is updated and used only when transfers have occurred within the last minutes. If no such information exists, the historical estimates for the same network type are used.

Algorithm 1: Flow of dynamic scheduler upon invocation
  Input:  Space of candidate designs Σ
          Prioritised hard constraints ⟨C_1, C_2, ..., C_n⟩
          Prioritised soft targets ⟨O_1, O_2, ..., O_|M|⟩
          Current network conditions net = ⟨L, B⟩
          Current device and server loads l_{dev, server}
          Profiler data prf
  Output: Highest performing design σ* = ⟨s*, thr_conf*⟩
   1  prf ← UpdateTimings(prf, net, l_dev, l_server)
   2  Σ_feasible ← Σ
   3  /* - - - Obtain feasible space based on hard constraints - - - */
   4  foreach C_i ∈ ⟨C_1, C_2, ..., C_n⟩ do
   5      Σ_feasible ← RemoveInfeasiblePoints(prf, C_i, Σ_feasible)
   6          ↳ VecCompare(prf, Σ_feasible(:, M_i), op_i, thr_i)  ∀i ∈ [1, n]
   7  end
   8  /* - - - Optimise user-defined metrics - Eq. (4) - - - */
   9  σ* ← OptimiseUserMetrics(prf, ⟨O_1, O_2, ..., O_|M|⟩, Σ_feasible)
  10      ↳ VecMax/Min(prf, Σ_feasible(:, M_i), op_i)  ∀i ∈ [1, |M|]

where $\sigma$ represents a design point, $M_i \in \langle M_1, M_2, ..., M_{|M|} \rangle$ is the i-th metric in the ordered tuple of soft targets, $i$ is a metric’s position in the importance sequence and $M_j(\sigma_j^*)$ represents the optimum of the j-th metric, found in the j-th iteration. Under this formulation, the user ranks the metrics in order of importance as required by the target use-case.

Algorithm 1 presents the scheduler’s processing flow. As a first step, the scheduler uses the estimated network latency and bandwidth, and device and server loads to update the profiler parameters (line 1), including the attainable latency and throughput, and device and server cost for each candidate configuration. Next, all infeasible solutions are discarded based on the supplied hard constraints (lines 4-7); given an ordered tuple of prioritised constraints $\langle C_1, C_2, ..., C_n \rangle$, the scheduler iteratively eliminates all configurations $\sigma = \langle s, thr_{conf} \rangle$ that violate them in the given order, where $s$ and $thr_{conf}$ represent the associated split point and confidence threshold respectively. In case there is no configuration to satisfy all the constraints up to $i+1$, the scheduler adopts a best-effort strategy by keeping the solutions that comply with up to the i-th constraint and treating the remaining $n-i$ constraints as soft targets. Finally, the scheduler performs a lexicographic optimisation of the user-prioritised soft targets (lines 9-10). To determine the highest performing configuration $\sigma^*$, the scheduler solves a sequence of $|M|$ single-objective optimisation problems, i.e. one for each $M \in \langle M_1, M_2, ..., M_{|M|} \rangle$ (Eq. (4)).

Deployment. Upon deployment, the scheduler is run on the client side, since most relevant information resides on-device. In a multi-client setting, this setup is further reinforced by the fact that each client device decides independently on its offloading parameters. However, to be deployable without throttling the resources of
the target mobile platform, the scheduler has to yield low resource
3.5 Dynamic Scheduler utilisation at run time. To this end, we vectorise the comparison,
Given the output of the profiler, the dynamic scheduler (Figure 2 4 ) maximisation and minimisation operations (lines 5-6 and 9-10 in Al-
is responsible for distributing the computation between device and gorithm 1) to utilise the SIMD instructions of the target mobile CPU
cloud, and deciding the early-exit policy of the progressive inference (e.g. the NEON instructions on ARM-based cores) and minimise the
network. Its goal is to yield the highest performing configuration
scheduler’s runtime.
that satisfies the app requirements. To enable the support of realis-
tic multi-criteria SLAs, the scheduler incorporates a combination At run time, although the overhead of the scheduler is relatively
of hard constraints (e.g. a strict inference latency deadline of 100 low, SPINN only re-evaluates the candidate configurations when
ms) and soft targets (e.g. minimise cost on the device side). Inter- the outputs of the profiler change by more than a predefined thresh-
nally, we capture these under a multi-objective optimisation (MOO) old. For highly transient workloads, we can switch from a moving
formulation. The current set of metrics is defined as average to an exponential back-off threshold model for mitigating
M = {l at ency, t hr ouдhput, server cost, device cost, accur acy } too many scheduler calls. The scheduler overhead and the tuning
In SPINN, we interpret cloud and device cost as the execution time of the invocation frequency is discussed in Section 4.3.2.
on the respective side. The defined metrics set M, together with the The server – or HA proxy5 [70] in multi-server architectures –
associated constraints, can cover a wide range of use-cases, based can admit and schedule requests on the remote side to balance the
on the relative importance between the metrics for the target task. workload and minimise inference latency, maximise throughput
Formally, we define a hard constraint as C = ⟨M, op, thr ⟩ where M ∈ or minimise the overall cost by dynamically scaling down unused
M is a metric, op is an operator, e.g. ≤, and thr is a threshold value. resources. We consider these optimisations cloud-specific and out of
With respect to soft optimisation targets, we define them for- the scope of this paper. As a result, in our experiments we account
mally as O = ⟨M, min/max/value⟩ where a given metric M ∈ M for a single server always spinning and having the model resident
is either maximised, minimised or as close as possible to a desirable to its memory. Nevertheless, in a typical deployment, we would
value. To enable the user to specify the importance of each met- envision a caching proxy serving the models with RDMA to the CPU
ric, we employ a multi-objective lexicographic formulation [52], or GPU of the end server, in a virtualised or serverless environment
shown in Eq. (4). so as to tackle the cold-start problem [57, 77]. Furthermore, to avoid
min Mi (σ ), s.t. M j (σ ) ≤ M j (σ j∗ ) (4) oscillations (flapping) of computation between the deployed devices
σ
j = 1, 2, ..., i − 1 , i > 1 , i = 1, 2, ..., |M| 5 High-Availability proxy for load balancing & fault tolerance in data centres.
Client Server
and the available servers, techniques used for data-center traffic
flapping are employed [5].
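As an illustration, the flow of Algorithm 1 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not SPINN's implementation: the `Design` class, the function names, the metric values and the rule that demotes an unmet `<=` constraint to a `min` target are all hypothetical, and ordinary loops stand in for the vectorised SIMD comparisons described above.

```python
import operator
from dataclasses import dataclass

# Comparison operators a hard constraint may use (subset chosen for this sketch).
OPS = {"<=": operator.le, ">=": operator.ge}


@dataclass
class Design:
    split: int        # split point s
    thr_conf: float   # confidence threshold
    metrics: dict     # metric name -> profiled estimate (hypothetical values)


def remove_infeasible(designs, constraints):
    """Apply prioritised hard constraints (metric, op, thr) in order.

    If a constraint would leave no design, stop and return the unmet
    constraints so the caller can demote them to soft targets (best effort).
    """
    feasible = list(designs)
    for idx, (metric, op, thr) in enumerate(constraints):
        kept = [d for d in feasible if OPS[op](d.metrics[metric], thr)]
        if not kept:
            return feasible, constraints[idx:]
        feasible = kept
    return feasible, []


def lexicographic_optimise(designs, targets):
    """Solve one single-objective problem per (metric, 'min'|'max') target.

    Each pass keeps only the designs that attain the optimum of the current
    metric, so lower-priority metrics can only break ties (cf. Eq. (4)).
    """
    candidates = list(designs)
    for metric, direction in targets:
        pick = min if direction == "min" else max
        best_value = pick(d.metrics[metric] for d in candidates)
        candidates = [d for d in candidates if d.metrics[metric] == best_value]
    return candidates[0]


def schedule(designs, constraints, targets):
    feasible, unmet = remove_infeasible(designs, constraints)
    # Assumption: an unmet "<=" constraint becomes a "min" target, ">=" a "max".
    demoted = [(m, "min" if op == "<=" else "max") for m, op, _ in unmet]
    return lexicographic_optimise(feasible, demoted + list(targets))


# Hypothetical profiler estimates: latency and server cost in ms, accuracy in %.
designs = [
    Design(3, 0.8, {"latency": 90, "server_cost": 40, "accuracy": 74.0}),
    Design(5, 0.8, {"latency": 95, "server_cost": 25, "accuracy": 73.0}),
    Design(5, 0.6, {"latency": 80, "server_cost": 25, "accuracy": 71.0}),
    Design(9, 0.8, {"latency": 130, "server_cost": 0, "accuracy": 74.0}),
]
hard = [("latency", "<=", 100), ("accuracy", ">=", 72.0)]
soft = [("server_cost", "min"), ("latency", "min")]
best = schedule(designs, hard, soft)
print(best.split, best.thr_conf)
```

With these numbers, the latency deadline removes the device-only design, the accuracy constraint removes the low-threshold one, and the server-cost target then selects split 5 with a threshold of 0.8. If the accuracy constraint is raised beyond what any design attains, the sketch keeps the designs that satisfy the latency deadline and maximises accuracy first.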
[Figure 6: Offloading a progressive ResNet block. The diagram shows layers 1-12 (Conv, BatchNorm, ReLU, Linear) with Early exit 1, Early exit 2 and Early exit 6, split between client and server, with the input passing through the CNN Comms Optimiser and the network to the output; tensor interception and injection points are marked.]

Table 1: Specifications of evaluated platforms.
Platform     CPU                       Clock Freq.   Memory   GPU
Server       2× Intel Xeon Gold 6130   2.10 GHz      128 GB   GTX1080Ti
Jetson AGX   Carmel ARMv8.2            2.26 GHz      16 GB    512-core Volta

3.6 CNN Communication Optimiser
CNN layers often produce large volumes of intermediate data which come with a high penalty in terms of network transfer. A key enabler in alleviating the communication bottleneck in SPINN is the communication optimiser module (CNN-CO) (Figure 2 ⑤). CNN-CO comprises two stages. In the first stage, we exploit the resilience of CNNs to low-precision representation [14, 16, 28, 38] and lower the data precision from 32-bit floating-point down to 8-bit fixed-point through linear quantisation [28, 53]. By reducing the bitwidth of only the data to be transferred, our scheme allows the transfer size to be substantially lower without significant impact on the accuracy of the subsequent classifiers (i.e. <0.65 percentage point drop across all exits of the examined CNNs). Our scheme differs from both i) weights-only reduction [17, 67, 86], which minimises the model size rather than activations’ size and ii) all-layers quantisation [14, 16, 24, 28] which requires complex techniques, such as quantisation-aware training [24, 28] or a re-training step [14, 16], to recover the accuracy drop due to the precision reduction across all layers.

The second stage exploits the observation that activation data are amenable to compression. A significant fraction of activations are zero-valued, meaning that they are sparse and highly compressible. As noted by prior works [56, 60, 82], this sparsity of activations is due to the extensive use of the ReLU layer that follows the majority of layers in modern CNNs. In CNN-CO, sparsity is further magnified due to the reduced precision. In this respect, CNN-CO leverages the sparsity of the 8-bit data by means of an LZ4 compressor with bit shuffling.

At run time, SPINN predicts whether the compression cost will outweigh its benefits by comparing the estimated CNN-CO runtime to the transfer time savings. If CNN-CO’s overhead is amortised, SPINN queues offloading requests’ data to the CNN-CO, with dedicated threads for each of the two stages, before transmission. Upon reception at the remote end, the data are decompressed and cast back to the original precision to continue inference. The overhead is shown as packing in Figure 5 for non-progressive models.

3.7 Distributed Execution Engine
In popular Deep Learning frameworks, such as TensorFlow [1] and PyTorch [58], layers are represented by modules and data in the form of multi-dimensional matrices, called tensors. To split and offload computation, SPINN modifies the CNN layers’ operations behind the scenes. To achieve this, it intercepts module and tensor operations by replacing their functions with a custom wrapper using Python’s function decorators.

Figure 6 focuses on an instance of an example ResNet block. SPINN attributes IDs to each layer in a trace-based manner, by executing the network and using the layer’s execution order as a sequence identifier.⁶ SPINN uses these IDs to build an execution graph in the form of a directed acyclic graph (DAG), with nodes representing tensor operations and edges the tensor flows among them [1, 62, 75]. This is then used to detect the dependencies across partitions. To achieve this, Python’s dynamic instance attribute registration is used to taint tensors and monitor their flow through the network. With Figure 6 as a reference, SPINN’s custom wrapper (Figure 2 ⑥) performs the following operations:

Normal execution: When a layer is to be run locally, the wrapper calls the original function it replaced. In Figure 6, layers 1 to 8 execute normally on-device, while layers from 9 until the end execute normally on the server side.

Offload execution: When a layer is selected as a partition point (layer 9), instead of executing the original computation on the client, the wrapper queues an offloading request to be transmitted. This request contains the inputs (layer 5’s output) and subsequent layer dependencies (input of skip connection). Furthermore, the inference and the data transfer are decoupled in two separate threads, to allow for pipelining of data transfer and next frame processing.

Resume execution: Upon receiving an offloading request, the inputs and dependencies are injected in the respective layers (layer 9 and add operation) and normal execution is resumed on the remote side. When the server concludes execution, the results are sent back in a parallel thread.

Early exit: When an intermediate classifier (i.e. early exit) is executed, the wrapper evaluates its prediction confidence (Eq. (1)). If it is above the provided $thr_{conf}$, execution terminates early, returning the current prediction (Eq. (2)).

Since the premise of our system is to always have at least one usable result on the client side, we continue the computation on-device even past the split layer, until the next early exit is encountered. Furthermore, to avoid wasteful data transmission and redundant computation, if a client-side early exit covers the latency SLA and satisfies the selected $thr_{conf}$, the client sends a termination signal to the remote worker to cancel the rest of the inference. If remote execution has not started yet, SPINN does not offload at all.

⁶ Despite the existence of branches, CNN execution tends to be parallelised across data rather than layers. Hence, the numbering is deterministic.

4 EVALUATION
This section presents the effectiveness of SPINN in significantly improving the performance of mobile CNN inference by examining its core components and comparing with the currently standard device- and cloud-only implementations and state-of-the-art collaborative inference systems.
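The two CNN-CO stages of Section 3.6 and its run-time amortisation check can be sketched as follows. This is a minimal sketch under stated assumptions rather than SPINN's implementation: the function names are hypothetical, activations are assumed to be non-negative (post-ReLU) so that an unsigned 8-bit range suffices, and Python's standard-library `zlib` stands in for the LZ4 compressor with bit shuffling used by the system.

```python
import zlib


def quantise(activations):
    """Stage 1: linearly map non-negative floats to unsigned 8-bit integers."""
    peak = max(activations)
    scale = peak / 255 if peak > 0 else 1.0
    codes = bytes(min(255, max(0, round(a / scale))) for a in activations)
    return codes, scale


def dequantise(codes, scale):
    """Cast the 8-bit codes back to floating point on the receiving side."""
    return [q * scale for q in codes]


def pack(activations):
    """Stage 2: compress the (sparse) 8-bit codes before transmission."""
    codes, scale = quantise(activations)
    return zlib.compress(codes), scale  # zlib is a stand-in for LZ4 here


def unpack(blob, scale):
    return dequantise(zlib.decompress(blob), scale)


def worth_packing(raw_bytes, packed_bytes, bandwidth_bytes_per_s, packing_time_s):
    """Pack only if the transfer time saved exceeds the estimated packing time."""
    saved_s = (raw_bytes - packed_bytes) / bandwidth_bytes_per_s
    return saved_s > packing_time_s


# Hypothetical activation vector: mostly zeros, as expected after ReLU.
acts = [0.0] * 960 + [i / 10 for i in range(1, 65)]
blob, scale = pack(acts)
restored = unpack(blob, scale)
raw_bytes = 4 * len(acts)  # 32-bit floats before quantisation

print(raw_bytes, len(blob))
print(worth_packing(raw_bytes, len(blob), 125_000, 0.001))       # about 1 Mbps
print(worth_packing(raw_bytes, len(blob), 125_000_000, 0.001))   # about 1 Gbps
```

The reconstruction error of each value is bounded by half a quantisation step. With the assumed 1 ms packing time, packing pays off on the slow link but not on the fast one, which is the trade-off the run-time check is meant to capture.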
[Figure 7: Achieved throughput for various ⟨model, device, dataset⟩ setups vs. network speed. Panels: mobilenet v2 - cifar (jetson-30w), resnet50 - imagenet (jetson-30w), vgg16 - cifar (jetson-u10w), inception v3 - imagenet (jetson-u10w). Axes: network speed (Mbps) vs. inference throughput (inf/s). Series: SPINN, w/o early exit, Neurosurgeon, Edgent, On-device, Server-only.]

[Figure 8: Effect of server slowdown on ResNet. Panels: resnet56 - cifar (jetson-10w), resnet50 - imagenet (jetson-10w). Axes: server slowdown factor vs. inference throughput (inf/s). Series: SPINN, Neurosurgeon, Edgent, On-device, Server-only.]

4.1 Experimental Setup
For our experiments, we used a powerful computer as the server and an Nvidia Jetson Xavier AGX as the client (Table 1). Specifically for Jetson, we tested against three different power profiles to emulate end-devices with different compute capabilities:⁷ 1) 30W (full power), 2) 10W (low power), 3) underclocked 10W (u10W). Furthermore, to study the effect of limited computation capacity (e.g. high-load server), we emulated the load by linearly scaling up the CNN computation times on the server side. We simulated the network conditions of offloading by using the average bandwidth and latency across national carriers [26, 27], for 3G, 4G and 5G mobile networks. For local-area connections (Gigabit Ethernet 802.3, WiFi-5 802.11ac), we used the nominal speeds of the protocol. We have developed SPINN on top of PyTorch (1.1.0) and experimented with four models, altered from torchvision (0.3.0) to include early exits or to reflect the CIFAR-specific architectural changes. We evaluated SPINN using: ResNet-50 and -56 [20], VGG16 [63], MobileNetV2 [61] and Inception-v3 [69]. Unless stated otherwise, each benchmark was run 50 times to obtain the average latency.

⁷ We are adjusting the TDP and clock frequency of the CPU and GPU cores, effectively emulating different tiers of devices, ranging from high-end embedded devices to mid-tier smartphones.

Datasets and Training. We evaluated SPINN on two datasets, namely CIFAR-100 [41] and ImageNet (ILSVRC2012) [10]. The former contains 50k training and 10k test images of resolution 32×32, each corresponding to one of 100 labels. The latter is significantly larger, with 1.2M training and 50k test images of 300×300 and 1000 labels. We used the preprocessing steps described in each model’s implementation, such as scaling and cropping the input image, stochastic image flip (p = 0.5) and colour channel normalisation. After converting these models to progressive early-exit networks, we trained them jointly from scratch end-to-end, with the “overthink” cost function (Section 3.2). We used the authors’ training hyperparameters, except for MobileNetV2, where we utilised SGD with learning rate of 0.05 and cosine learning rate scheduling, due to convergence issues. We trained the networks for 300 epochs on CIFAR-100 and 90 epochs on ImageNet.

4.2 Performance Comparison
This subsection presents a performance comparison of SPINN with: 1) the two state-of-the-art CNN offloading systems Neurosurgeon [34] and Edgent [46] (Section 2); 2) the status-quo cloud- and device-only baselines; and 3) a non-progressive ablated variant of SPINN.

4.2.1 Throughput Maximisation. Here, we assess SPINN’s inference throughput across varying network conditions. For these experiments, the SPINN scheduler’s objectives were set to maximise throughput with up to 1 percentage point (pp) tolerance in accuracy drop with respect to the CNN’s last exit.

Figure 7 shows the achieved inference throughput for varying network speeds. On-device execution yields the same throughput independently of the network variation, but is constrained by the processing power of the client device. Server-only execution follows the trajectory of the available bandwidth. Edgent always executes a part of the CNN (up to the exit that satisfies the 1pp accuracy tolerance) irrespective of the network conditions. As a result, it follows a similar trajectory to server-only but achieves higher throughput due to executing only part of the model. Neurosurgeon demonstrates a more polarised behaviour; under constrained connectivity it executes the whole model on-device, whereas as bandwidth increases it switches to offloading all computation as it results in higher throughput. The ablated variant of SPINN (i.e. without early exits) largely follows the behaviour of Neurosurgeon at the two extremes of the bandwidth while in the middle range, it is able to achieve higher throughput by offloading earlier due to CNN-CO compressing the transferred data.

The end-to-end performance achieved by SPINN delivers the highest throughput across all setups, achieving a speedup of up to 83% and 52% over Neurosurgeon and Edgent, respectively. This can be attributed to our bandwidth- and data-locality-aware scheduler choices on the early-exit policy and partition point. In low bandwidths, SPINN selects device-only execution, outperforming all other on-device designs due to its early-exiting mechanism, tuned by the scheduler module. In the mid-range, the CNN-CO module enables SPINN to better utilise the available bandwidth and start offloading earlier on, outperforming both Edgent and Neurosurgeon. In high-bandwidth settings, our system surpasses the performance of all other designs by exploiting its optimised early-exiting scheme. Specifically, compared to Edgent, SPINN takes advantage of the input’s classification difficulty to exit early, whereas the latter only selects an intermediate exit to uniformly classify all incoming samples. Moreover, in contrast with Edgent’s strategy to always transmit the input to the remote endpoint, we exploit the fact that data already reside on the device and avoid the wasteful data transfers.
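The polarised behaviour described above for a latency-only partitioner can be reproduced with the profiler's transfer-time model (network latency plus data size over bandwidth). The sketch below is illustrative only: the per-split timings and data sizes are hypothetical, no compression is applied, and the selection rule is a generic latency minimiser rather than the actual implementation of any of the compared systems.

```python
def transfer_ms(data_kb, latency_ms, bandwidth_mbps):
    """Transfer time L + D/B in ms; 1 Mbps moves 1 kilobit per millisecond."""
    if data_kb == 0:
        return 0.0  # nothing is offloaded, so no network cost
    return latency_ms + data_kb * 8 / bandwidth_mbps


# Hypothetical profile per candidate split: (device ms, server ms, data sent in kB).
# Index 0 is server-only (the input is sent), the last index is device-only.
PROFILE = [
    (0, 9, 150),
    (30, 6, 800),   # early activations are larger than the input
    (60, 3, 400),
    (90, 0, 0),
]


def best_split(profile, latency_ms, bandwidth_mbps):
    """Pick the split with the lowest end-to-end latency (no early exits)."""
    def total(i):
        device_ms, server_ms, data_kb = profile[i]
        return device_ms + server_ms + transfer_ms(data_kb, latency_ms, bandwidth_mbps)
    return min(range(len(profile)), key=total)


for mbps in (1, 10, 100):
    print(mbps, best_split(PROFILE, latency_ms=10, bandwidth_mbps=mbps))
```

With these assumed numbers the choice jumps from device-only at 1 and 10 Mbps to server-only at 100 Mbps, because uncompressed intermediate activations never make a middle split attractive. Shrinking the transferred data, as CNN-CO does, is what opens up the mid-range splits.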
[Figure 9: Server time (left) and accuracy (right) of SPINN vs. Neurosurgeon and Edgent for different latency SLAs and client compute power (u10W and 10W). The SLA is expressed as a percentage of the on-device latency. Panels: resnet56 - u10W, mobilenet v2 - u10W, resnet56 - 10W, mobilenet v2 - 10W. Axes: latency SLA (%) vs. server time (ms) and accuracy (%).]

4.2.2 Server-Load Variation. To investigate the performance of SPINN under various server-side loads, we measured the inference throughput of SPINN against baselines when varying the load of the remote end, with 1pp of accuracy drop tolerance. This is accomplished by linearly scaling the latency of the server execution by a slowdown factor (i.e. a factor of 2 means the server is 2× slower). Figure 8 presents the throughput achieved by each approach under various server-side loads, with the Jetson configured at the 10W profile and the network speed in the WiFi-5 range (500 Mbps).

With low server load (left of the x-axis), the examined systems demonstrate a similar trend to the high-bandwidth performance of Figure 7. As the server becomes more loaded (i.e. towards the right-hand side), performance deteriorates, except for the case of device-only execution which is invariant to the server load. On the one hand, although its attainable throughput reduces, Neurosurgeon adapts its policy based on the server utilisation and gradually executes a greater fraction of the CNN on the client side. On the other hand, Edgent’s throughput deteriorates more rapidly and even reaches below the device-only execution under high server load, since its run-time mechanism does not consider the varying server load. Instead, by adaptively optimising both the split point and the early-exit policy, SPINN’s scheduler manages to adapt the overall execution based on the server-side load, leading to throughput gains between 1.18-1.99× (1.57× geo. mean) and 1.15-3.09× (1.61× geo. mean) over Neurosurgeon and Edgent respectively.

4.2.3 Case Study: Latency-driven SLAs at minimal server cost. To assess SPINN’s performance under deadlines, we target the scenario where a service provider aims to deploy a CNN-based application that meets strict latency SLAs at maximum accuracy and minimal server-side cost. In this setting, we compare against Neurosurgeon⁸ and Edgent, targeting MobileNetV2 and ResNet-56 over 4G. Figure 9 shows the server computation time and accuracy achieved by each system for two device profiles with different compute capabilities, u10W and 10W. Latency SLAs are represented as a percentage of the device-only runtime of the original CNN (i.e. a 20% SLA means that the target latency is 5× lower than on-device execution, requiring early exiting and/or server support).

⁸ It should be noted that Neurosurgeon maintains the accuracy of the original CNN.

For the low-end device (u10W) and strict latency deadlines, SPINN offloads as much as possible to the cloud as this allows reaching a later exit in the network faster, hence increasing the accuracy. As the SLA loosens (reaching more than 40% of the on-device latency), SPINN starts to gradually execute more and more locally. In contrast, Edgent and Neurosurgeon achieve similar accuracy but with up to 4.9× and 6.8× higher server load. On average across all targets, SPINN reduces Edgent and Neurosurgeon server times by 68.64% and 82.5% (60.3% and 83.6% geo. mean), respectively, due to its flexible multi-objective scheduler. Instead, Neurosurgeon can only optimise for overall latency and cannot trade off accuracy to meet the deadline (e.g. for 20% SLA on ResNet-56) while Edgent cannot account for server-time minimisation and accuracy drop constraints.

The situation is different for the more powerful device (10W). With the device being faster, the SLA targets become much stricter. Therefore, we observe that SPINN and Edgent can still meet a latency constraint as low as 20% and 30% of the local execution time for ResNet-56 and MobileNetV2 respectively. In contrast, without progressive inference, it is impossible for Neurosurgeon to achieve inference latency below 60% of on-device execution across both CNNs. In this context, SPINN is able to trade off accuracy in order to meet stricter SLAs, but also improve its attainable accuracy as the latency constraints are relaxed.

For looser latency deadlines (target larger than 50% of the on-device latency), SPINN achieves accuracy gains of 17.3% and 20.7% over Edgent for ResNet-56 and MobileNetV2, respectively. The reason behind this is twofold. First, when offloading, Edgent starts the computation on the server side, increasing the communication latency overhead. Instead, SPINN’s client-to-server offloading strategy and compression significantly reduce the communication latency overhead. Second, due to Edgent’s unnormalised cost function (i.e. $\max\left(\frac{1}{lat} + acc\right)$), the throughput’s reward dominates the accuracy gain, leading to always selecting the first early-exit subnetwork and executing it locally. In contrast, SPINN’s scheduler’s multi-criteria design is able to capture accuracy, server time and latency constraints to yield an optimised deployment. Hence, similarly to the slower device, SPINN successfully exploits the server resources to boost accuracy under latency constraints, while it can reach up to pure on-device execution for loose deadlines.

4.3 Runtime Overhead and Efficiency
4.3.1 Deployment Overhead. By evaluating across our examined CNNs and datasets on the CPU of Jetson, the scheduler executes in max 14 ms (11 ms geo. mean). This time includes the cost of reading the profiler parameters, updating the monitored metrics, and searching for and returning the selected configuration. Moreover, SPINN’s memory consumption is in the order of a few KB (i.e. <1% of Jetson’s RAM). These costs are amortised over multiple inferences, as the scheduler is invoked only on significant context changes. We discuss the selection of such parameters in the following section.
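One way to realise "invoked only on significant context changes" is sketched below, using the averaging window of three values and the 5% difference threshold reported in Section 4.3.2. The class name, its interface and the bandwidth samples are hypothetical, and the exact comparison rule used by SPINN is not specified in the text, so this is an assumption-laden illustration rather than the system's trigger.

```python
from collections import deque


class InvocationTrigger:
    """Signal a scheduler re-run only when the smoothed bandwidth has moved by
    more than `threshold` relative to the value used at the last invocation."""

    def __init__(self, window=3, threshold=0.05):
        self.samples = deque(maxlen=window)   # moving-average window
        self.threshold = threshold
        self.last_used = None                 # average seen at the last invocation

    def update(self, bandwidth_mbps):
        self.samples.append(bandwidth_mbps)
        average = sum(self.samples) / len(self.samples)
        first_call = self.last_used is None
        if first_call or abs(average - self.last_used) > self.threshold * self.last_used:
            self.last_used = average
            return True    # context changed significantly: re-run the scheduler
        return False       # keep the current split point and threshold


# Hypothetical bandwidth samples (Mbps): small jitter, then a jump.
trigger = InvocationTrigger()
decisions = [trigger.update(b) for b in (20, 20.2, 19.9, 20.1, 35, 36, 34)]
print(decisions)
```

In this trace the jitter around 20 Mbps never triggers a re-run after the initial invocation, while the jump to roughly 35 Mbps triggers one on each of the next three samples as the moving average catches up.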
[Figure 10: SPINN scheduler’s behaviour on real network provider trace. Setup: inception v3 - ImageNet (Jetson-u10w). Panels over timestamp: bandwidth (Mbps) (top), normalised split index and confidence threshold (middle), estimated throughput (inf/s) (bottom).]

[Figure 11: Energy consumption of SPINN vs. baselines. Setup: inception v3 - imagenet (jetson-u10w). Axes: scheduler configuration (split index (norm.), conf. threshold) vs. energy per sample (mJ), broken down into inference (GPU), compression (CPU) and transfer (USB) energy. Configurations, left to right: (device-only, w/o ee), (device-only, 0.4), (89%, 0.4), (81%, 0.4), (4%, 0.8), (server-only, 0.8), annotated as good and bad configurations.]

4.3.2 Network Variation. To assess the responsiveness of SPINN in adapting to dynamic network conditions, we targeted a real bandwidth trace from a Belgian ISP. The trace contains time series of network bandwidth variability during different user activities. In this setup, SPINN executes the ImageNet-trained Inception-v3 with Jetson-u10W as the client under the varying bandwidth emulated by the Belgium 4G/LTE logs. The scheduler is configured to maximise both throughput and accuracy. Figure 10 (top) shows an example bandwidth trace from a moving bus followed by walking. Figure 10 (bottom) shows SPINN’s achieved inference throughput under the changing network quality. The associated scheduler decisions are depicted in Figure 10 (middle).

At low bandwidth (<5 Mbps), SPINN falls back to device-only execution. In these cases, the scheduler adopts a less conservative early-exit policy by lowering the confidence threshold. In this manner, it allows more samples to exit earlier, compensating for the client’s low processing power. Nonetheless, the impact on accuracy remains minimal (<1%) for the early-exit policies selected by the scheduler ($thr_{conf} \in [0.6, 1.0]$), as illustrated in Figure 3 for Inception-v3 on ImageNet. At the other end, high bandwidths result in selecting an earlier split point and thus achieving up to 7× more inf/sec over pure on-device execution. Finally, the similar trajectories of the top and bottom figure suggest that our scheduler can adapt the system to the running conditions, without having to be continuously invoked.

Overall, we observe that small bandwidth changes do not cause significant alterations to the split and early-exit strategies. By employing an averaging historical window of three values and a difference threshold of 5%, the scheduler is invoked on only 1/3 of the total bandwidth changes across traces.

4.4 Energy Consumption
Figure 11 shows the breakdown of dominant energy consumption across the client device subsystems. We measured energy consumption over 1000 inferences from the validation set and offloading over UK’s Three Broadband’s 4G network with a Huawei E3372 USB adapter. We measured the power of Jetson (CPU, GPU) from its integrated probes and the transmission energy with the Monsoon AAA10F power monitor.

Traversing the horizontal axis left-to-right, we first see device-only execution without and with early-exits, where the local processing dominates the total energy consumption. The latter shows benefits due to samples exiting early from the network. Next, we showcase the consumption breakdown of three different $\langle split, thr_{conf} \rangle$ configurations. The first two configurations demonstrate comparable energy consumption with the device-only execution without early exits. On the contrary, a bad configuration requires an excessively large transfer size, leading to large compression and transfer energy overheads. Last, the energy consumption when fully offloading is dominated by the network transfer.

We witness a 5× difference in energy consumption across the different inference setups. While device-only execution yields the lowest energy footprint per sample, it is also the slowest. Our scheduler is able to yield deployments that are significantly more energy efficient than full offloading (4.2×) and on par with on-device execution (0.76-1.12×), while delivering significantly faster end-to-end processing. Finally, with different configurations varying both in energy and performance, the decision space is amenable to energy-driven optimisation by adding energy as a scheduler optimisation metric.

4.5 Constrained Availability Robustness
Next we evaluate SPINN’s robustness under constrained availability of the remote end such as network timeouts, disconnections and server failures. More specifically, we investigate 1) the achieved accuracy across various failure rates and 2) the latency improvement over conventional systems enhanced with an error-control policy. In these experiments, we fix the confidence threshold of three models (Inception-v3, ResNet-56 and ResNet-50) to a value of 0.8 and emulate variable failure rates by sampling from a random distribution across the validation set.

Accuracy comparison: For this experiment (Figure 12), we compare SPINN at different network split points (solid colours) against a non-progressive baseline (dashed line). Under failure conditions, the baseline unavoidably misclassifies the result as there is no usable result locally on-device. However, SPINN makes it possible to acquire the most confident local result up to the split point, when the server is unavailable.

As shown in Figure 12, the baseline quickly drops in accuracy as the failure rate increases. This is not the case with SPINN, which
manages to maintain a minimal accuracy drop. Specifically, we witness drops ranging in [0, 5.75%] for CIFAR-100 and [0.46%, 33%] for ImageNet, when the equivalent drop of the baseline is [11.56%, 44.1%] and [9.25%, 44.34%], respectively. As expected, faster devices, which are able to execute a larger part of the model locally (exit later) while meeting their SLA, exhibit the smallest impact under failure, as depicted in the progressive variants of the two models.

[Figure 12: Comparison of the average accuracy under uncertain server availability. The shaded area indicates attained accuracies under a probability distribution. Panels: inception v3 (cifar100), resnet56 (cifar100), resnet50 (imagenet). Axes: failure probability vs. accuracy. Series: baseline and progressive variants with splits at 28%, 44%, 63%, 92% (inception v3), 26%, 37%, 40%, 54% (resnet56) and 25%, 38%, 53%, 82% (resnet50).]

[Figure 13: Comparison of average latency under uncertain server availability. Panels: inception v3 (cifar100), resnet56 (cifar100), resnet50 (imagenet). Axes: failure probability vs. latency. Series: as in Figure 12.]

Latency comparison: In this scenario, we compare SPINN against a single-exit offloaded variant of the same networks. This time, instead of simply failing the inference when the remote end is unavailable, we allow for retransmission with exponential back-off, a common behaviour of distributed systems to avoid channel contention. When a sample fails under the respective probability distribution, the result gets retransmitted. If the same sample fails again, the client waits double the time and retransmits, until it succeeds. Here, we assume Jetson-10W offloads to our server over 4G and varying failure probability ($P_{fail} \in \{0.1, 0.25, 0.5\}$). The initial retransmission latency is 20 ms. We ran each experiment three times and report the mean and standard deviation of the latencies.

As depicted in Figure 13, the non-progressive baseline follows a trajectory of increasing latency as the failure probability gets higher, due to the additional back-off latency each time a sample fails. While the impact on the average latency for the examined networks going from $P_{fail} = 0.1$ to $P_{fail} = 0.25$ is gradual, at 3.9%, 5.8% and 4.7% for Inception-v3, ResNet-56 and ResNet-50 respectively, the jump from $P_{fail} = 0.25$ to $P_{fail} = 0.5$ is much more dramatic, at 52.9%, 91% and 118%. The variance at $P_{fail} = 0.5$ is also noticeably higher compared to previous values, which is attributed to the higher number of retransmissions and thus higher discrepancies across different runs. We should note that despite the considerably higher latency

5 DISCUSSION
SPINN and the current ML landscape. The status-quo deployment process of CNNs encompasses the maintenance of two models: a large, highly accurate model on the cloud and a compact, lower-accuracy one on the device. However, this approach comes with significant deployment overheads. First, from a development time perspective, the two-model approach results in two time- and resource-expensive stages. In the first stage, the large model is designed and trained, requiring multiple GPU-hours. In the second stage, the large model is compressed through various techniques in order to obtain its lightweight counterpart, with the selection and tuning of the compression method being a difficult task in itself. Furthermore, typically, to gain back the accuracy loss due to compression, the lightweight model has to be fine-tuned through a number of additional training steps.

With regards to the lightweight compressed networks, SPINN is orthogonal to these techniques and hence a compressed model can be combined with SPINN to obtain further gains. Given a compressed model, our system would proceed to derive a progressive inference variant with early exits and deploy the network with a tailored implementation. For use-cases where pre-trained models are employed, SPINN can also smoothly be adopted by modifying its training scheme (Section 3.2) so that the pre-trained backbone is frozen during training and only the early exits are updated.

Nonetheless, with SPINN we also enable an alternative paradigm that alleviates the main limitations of the current practice. SPINN requires a single network design step and a single training process, which trains both the backbone network and its early exits. Upon deployment, the model’s execution is adaptively tuned based on the multiple target objectives, the environmental conditions and
of the non-progressive baseline, its accuracy can be higher, since all the device and cloud load. In this manner, SPINN enables a highly
samples – whether re-transmitted or not – exit at the final classifier. customised deployment which is dynamically and efficiently ad-
Last, we also notice a slight reduction in the average latency of justed to sustain its performance in mobile settings. This approach
SPINN’s models as Pfail increases. This is a result of more samples is further supported by the ML community’s growing number of
early-exiting in the network, as the server becomes unavailable works on progressive networks [23, 35, 48, 72, 81, 83, 84] which can
more often. be directly targeted by SPINN to yield an optimised deployment on
To sum up, these two results demonstrate that SPINN can perform mobile platforms.
sufficiently, in terms of accuracy and latency, even when the remote Limitations and future work. Despite the challenges addressed
end remains unresponsive, by falling back to results of local exits. by SPINN, our prototype system has certain limitations. First, the
Compared to other systems, as the probability of failure when scheduler does not explicitly optimise for energy or memory con-
offloading to the server increases, there is a gradual degradation of sumption of the client. The energy consumption could be integrated
the quality of service, instead of catastrophic unresponsiveness of as another objective in the MOO solver of the scheduler, while
the application. memory footprint could be minimised by only loading part of the
model in memory and always offloading the rest. Moreover, while
SPINN supports splitting at any given layer, we limit the candidate split points of each network to the outputs of ReLU layers, due to their high compressibility (Section 3.3). Although offloading could happen at sub-layer, filter-level granularity, this would impose extra overhead on the scheduler due to the significantly larger search space.
Our workflow also assumes the model to be available at both the client and server side. While cloud resources are often dedicated to specific applications, edge resources tend to present locality challenges. To handle these, we could extend SPINN to provide incremental offloading [29] and cache popular functionality [76] closer to its users. In the future, we intend to explore multi-client settings and simultaneous asynchronous inferences on a single memory copy of the model, as well as targeting regression tasks and recurrent neural networks.

6 CONCLUSION
In this paper, we present SPINN, a distributed progressive inference engine that addresses the challenge of partitioning CNN inference across device-server setups. Through a run-time scheduler that jointly tunes the early-exit policy and the partitioning scheme, the proposed system supports complex performance goals in highly dynamic environments while simultaneously guaranteeing the robust operation of the end system. By employing an efficient multi-objective optimisation approach and a CNN-specific communication optimiser, SPINN is able to deliver higher performance over the state-of-the-art systems across diverse settings, without sacrificing the overall system’s accuracy and availability.

REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). 265–283.
[2] Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. 2019. EmBench: Quantifying Performance Variations of Deep Neural Networks Across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL) (Seoul, Republic of Korea). 1–6.
[3] Amazon. 2020. Amazon Inferentia ML Chip. https://aws.amazon.com/machine-learning/inferentia/. [Retrieved: August 25, 2020].
[4] Alejandro Cartas, Martin Kocour, Aravindh Raman, Ilias Leontiadis, Jordi Luque, Nishanth Sastry, Jose Nuñez Martinez, Diego Perino, and Carlos Segura. 2019. A Reality Check on Inference at Mobile Networks Edge. In Proceedings of the 2nd International Workshop on Edge Systems, Analytics and Networking (EdgeSys). 54–59.
[5] D. Chandrasekar. 2016. AWS Flap Detector: An Efficient Way to Detect Flapping Auto Scaling Groups on AWS Cloud. University of Cincinnati.
[6] E. Chung et al. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[7] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K. Kagi, R. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang, R. Zhao, and D. Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[8] A. E. Eshratifar, M. S. Abrishami, and M. Pedram. 2019. JointDNN: An Efficient Training and Inference Engine for Intelligent Mobile Cloud Computing Services. IEEE Transactions on Mobile Computing (TMC) (2019).
[9] Biyi Fang, Xiao Zeng, and Mi Zhang. 2018. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom). 115–127.
[10] L. Fei-Fei, J. Deng, and K. Li. 2010. ImageNet: Constructing a large-scale image database. Journal of Vision 9, 8 (2010), 1037–1037.
[11] L. Fridman, D. E. Brown, M. Glazer, W. Angell, S. Dodd, B. Jenik, J. Terwilliger, A. Patsekin, J. Kindelsberger, L. Ding, S. Seaman, A. Mehler, A. Sipperley, A. Pettinato, B. D. Seppelt, L. Angell, B. Mehler, and B. Reimer. 2019. MIT Advanced Vehicle Technology Study: Large-Scale Naturalistic Driving Study of Driver Behavior and Interaction With Automation. IEEE Access 7 (2019), 102021–102038.
[12] Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, and Alexander Heinecke. 2018. Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) (SC ’18). IEEE Press, Article 66, 12 pages.
[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML). 1321–1330.
[14] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2017. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 37, 1 (2017), 35–47.
[15] Selim Gurun, Chandra Krintz, and Rich Wolski. 2004. NWSLite: A Light-Weight Prediction Utility for Mobile Devices. In Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services (MobiSys). 2–11.
[16] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi. 2018. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 29, 11 (2018), 5784–5789.
[17] Song Han, Huizi Mao, and William J Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR) (2016).
[18] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. 2016. MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys). 123–136.
[19] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 620–629.
[20] K He, X Zhang, S Ren, and J Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[21] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). USENIX Association, 269–286.
[22] C. Hu, W. Bao, D. Wang, and F. Liu. 2019. Dynamic Adaptive DNN Surgery for Inference Acceleration on the Edge. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. 1423–1431.
[23] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. 2018. Multi-Scale Dense Networks for Resource Efficient Image Classification. In International Conference on Learning Representations (ICLR).
[24] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. J. Mach. Learn. Res. 18, 1 (2017), 6869–6898.
[25] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. 2019. AI Benchmark: All About Deep Learning on Smartphones in 2019. In International Conference on Computer Vision (ICCV) Workshops.
[26] UK ISPs. 2020. 4G Mobile Network Experience Report. https://www.opensignal.com/reports/2019/04/uk/mobile-network-experience.
[27] UK ISPs. 2020. 5G Mobile Network Report. https://www.opensignal.com/2020/02/20/how-att-sprint-t-mobile-and-verizon-differ-in-their-early-5g-approach.
[28] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2704–2713.
[29] Hyuk-Jin Jeong, Hyeon-Jae Lee, Chang Hyun Shin, and Soo-Mook Moon. 2018. IONN: Incremental Offloading of Neural Network Computations from Mobile Devices to Edge Servers. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). 401–411.
[30] Yu Ji, Youhui Zhang, Wenguang Chen, and Yuan Xie. 2018. Bridge the Gap Between Neural Networks and Neuromorphic Hardware with a Neural Network
Compiler. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 448–460.
[31] Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. 2014. SDA: Software-defined accelerator for large-scale DNN systems. In 2014 IEEE Hot Chips 26 Symposium (HCS). 1–23.
[32] Norman P. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA). ACM, 1–12.
[33] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proc. VLDB Endow. 10, 11 (2017), 1586–1597.
[34] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. 2017. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 615–629.
[35] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-Deep Networks: Understanding and Mitigating Network Overthinking. In International Conference on Machine Learning (ICML). 3301–3310.
[36] Youngsok Kim, Joonsung Kim, Dongju Chae, Daehyun Kim, and Jangwoo Kim. 2019. µLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization. In Proceedings of the Fourteenth EuroSys Conference 2019. 45:1–45:15.
[37] A. Kouris and C. Bouganis. 2018. Learning to Fly by MySelf: A Self-Supervised CNN-Based Approach for Autonomous Navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 1–9.
[38] A. Kouris, S. I. Venieris, and C. Bouganis. 2018. CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 155–1557.
[39] A. Kouris, S. I. Venieris, and C. Bouganis. 2020. A Throughput-Latency Co-Optimised Cascade of Convolutional Neural Network Classifiers. In 2020 Design, Automation Test in Europe Conference Exhibition (DATE). 1656–1661.
[40] C. Kozyrakis. 2013. Resource Efficient Computing for Warehouse-scale Datacenters. In 2013 Design, Automation Test in Europe Conference Exhibition (DATE). 1351–1356.
[41] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. Technical Report.
[42] V. K. Kukkala, J. Tunnell, S. Pasricha, and T. Bradley. 2018. Advanced Driver-Assistance Systems: A Path Toward Autonomous Vehicles. IEEE Consumer Electronics Magazine 7, 5 (2018), 18–25.
[43] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar. 2016. DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). 1–12.
[44] Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. 2020. HAPI: Hardware-Aware Progressive Inference. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[45] Royson Lee, Stylianos I. Venieris, Lukasz Dudziak, Sourav Bhattacharya, and Nicholas D. Lane. 2019. MobiSR: Efficient On-Device Super-Resolution Through Heterogeneous Mobile Processors. In The 25th Annual International Conference on Mobile Computing and Networking (MobiCom).
[46] E. Li, L. Zeng, Z. Zhou, and X. Chen. 2020. Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing. IEEE Transactions on Wireless Communications (TWC) (2020), 447–457.
[47] Hongshan Li, Chenghao Hu, Jingyan Jiang, Zhi Wang, Yonggang Wen, and Wenwu Zhu. 2019. JALAD: Joint Accuracy-And Latency-Aware Deep Structure Decoupling for Edge-Cloud Execution. In Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS). 671–678.
[48] Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and Gao Huang. 2019. Improved Techniques for Training Adaptive Deep Networks. In International Conference on Computer Vision (ICCV).
[49] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN Model Inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 1025–1040.
[50] Jiachen Mao, Xiang Chen, Kent W. Nixon, Christopher Krieger, and Yiran Chen. 2017. MoDNN: Local distributed mobile computing system for Deep Neural Network. Proceedings of the 2017 Design, Automation and Test in Europe (DATE) (2017), 1396–1401.
[51] Jiachen Mao, Zhongda Yang, Wei Wen, Chunpeng Wu, Linghao Song, Kent W. Nixon, Xiang Chen, Hai Li, and Yiran Chen. 2017. MeDNN: A distributed mobile system with enhanced partition and deployment for large-scale DNNs. IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (2017), 751–756.
[52] R Timothy Marler and Jasbir S Arora. 2004. Survey of multi-objective optimization methods for engineering. Structural and multidisciplinary optimization 26, 6 (2004), 369–395.
[53] Szymon Migacz. 2017. 8-bit Inference with TensorRT. In GPU Technology Conference.
[54] Vinod Nair and Geoffrey E Hinton. 2010. Rectified Linear Units improve Restricted Boltzmann Machines. In International Conference on Machine Learning (ICML). 807–814.
[55] Intel Nervana. 2020. Nervana’s Early Exit Inference. https://nervanasystems.github.io/distiller/algo_earlyexit.html. [Retrieved: August 25, 2020].
[56] Miloš Nikolić, Mostafa Mahmoud, Andreas Moshovos, Yiren Zhao, and Robert Mullins. 2019. Characterizing Sources of Ineffectual Computations in Deep Learning Networks. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 165–176.
[57] Edward Oakes, Leon Yang, Dennis Zhou, Kevin Houck, Tyler Harter, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. 2018. SOCK: Rapid Task Provisioning with Serverless-Optimized Containers. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 57–70.
[58] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS). 8026–8037.
[59] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. On the Expressive Power of Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70. 2847–2854.
[60] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler. 2018. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). 78–91.
[61] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4510–4520.
[62] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From High-level Deep Neural Models to FPGAs. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 17:1–17:12.
[63] K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
[64] Ashish Singh and Kakali Chatterjee. 2017. Cloud security issues and challenges: A survey. Journal of Network and Computer Applications 79 (2017), 88–115.
[65] Muthian Sivathanu, Tapan Chugh, Sanjay S. Singapuram, and Lidong Zhou. 2019. Astra: Exploiting Predictability to Optimize Deep Learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 909–923.
[66] N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield. 2017. Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 4241–4247.
[67] Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. 2020. And the Bit Goes Down: Revisiting the Quantization of Neural Networks. In International Conference on Learning Representations (ICLR).
[68] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI Conference on Artificial Intelligence.
[69] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826.
[70] Willy Tarreau et al. 2012. HAProxy-the reliable, high-performance TCP/HTTP load balancer.
[71] Ben Taylor, Vicent Sanz Marco, Willy Wolff, Yehia Elkhatib, and Zheng Wang. 2018. Adaptive Deep Learning Model Selection on Embedded Systems. In Proceedings of the 19th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) (LCTES 2018). 31–43.
[72] Surat Teerapittayanon, Bradley McDanel, and HT Kung. 2016. BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks. In 2016 23rd International Conference on Pattern Recognition (ICPR). 2464–2469.
[73] S. Teerapittayanon, B. McDanel, and H. T. Kung. 2017. Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 328–339.
[74] Tenstorrent. 2020. Tenstorrent’s Grayskull AI Chip. https://www.tenstorrent.com/technology/. [Retrieved: August 25, 2020].
[75] S. I. Venieris and C. Bouganis. 2019. fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 30, 2 (2019), 326–342.
[76] Liang Wang, Mario Almeida, Jeremy Blackburn, and Jon Crowcroft. 2016. C3PO: Computation Congestion Control (PrOactive). In Proceedings of the 3rd ACM
Conference on Information-Centric Networking (ACM-ICN ’16). 231–236.
[77] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. 2018. Peeking Behind the Curtains of Serverless Platforms. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 133–146.
[78] S. Wang, A. Pathania, and T. Mitra. 2020. Neural Network Inference on Mobile SoCs. IEEE Design Test (2020).
[79] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC). 29:1–29:6.
[80] C. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang. 2019. Machine Learning at Facebook: Understanding Inference at the Edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). 331–344.
[81] Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2246–2251.
[82] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network. In CoRR.
[83] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. 2019. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In IEEE International Conference on Computer Vision (ICCV).
[84] Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. 2019. SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models. In Advances in Neural Information Processing Systems (NeurIPS).
[85] Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer. 2018. DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 37 (2018), 2348–2359.
[86] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. 2017. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. In International Conference on Learning Representations (ICLR).
