DEEP LEARNING FRAMEWORKS EVALUATION
FOR IMAGE CLASSIFICATION ON RESOURCE
CONSTRAINED DEVICE
Mathieu Febvay and Ahmed Bounekkar
Université de Lyon, Lyon 2, ERIC UR 3083, F69676 Bron Cedex, France
ABSTRACT
Each new generation of smartphone gains capabilities that increase performance and power
efficiency, allowing us to use these devices for increasingly complex computations such as Deep
Learning. In this paper we implemented four Android deep learning inference frameworks (TFLite,
MNN, NCNN and PyTorch Mobile) to evaluate the most recent generation of Systems on a Chip
(SoC): the Samsung Exynos 2100, Qualcomm Snapdragon 865+ and Snapdragon 865. Our work
focused on the image classification task using five state-of-the-art models. Inference was run on
the 50,000 images of the ImageNet 2012 validation subset. Latency and accuracy were measured
under various scenarios (CPU, OpenCL, Vulkan), with and without multi-threading. Power
efficiency and a real-world use case were evaluated from these results by running the same
experiment on the devices' camera stream until they had consumed 3% of their battery. Our results
show that low-level software optimizations, image pre-processing algorithms, the conversion
process and the cooling design have an impact on latency, accuracy and energy efficiency.
KEYWORDS
Deep Learning, On-device inference, Image classification, Mobile, Quantized Models.
1. INTRODUCTION
Nowadays, mobile devices are in every hand, slowly but surely reshaping our way of life.
Many mobile applications use artificial intelligence in diverse ways such as gaming, social
media, artistic filters or augmented reality, relying on tasks like face detection, real-time
image classification or object detection. Unfortunately, many artificial intelligence models run in
the cloud because of the computational resources required by their complexity and millions of
parameters. Today, more than ever, data privacy represents a major concern. On-device inference
is an alternative that protects data, tolerates the loss of internet connectivity, and reduces
computing costs. However, the computing power of these devices is clearly insufficient to run
such models effectively, and it is subject to energy limitations.
Recent improvements made to hardware such as Neural and Tensor Processing Units (NPU/TPU),
Digital Signal Processors (DSP) and other accelerators [1] make on-device execution of Machine
Learning and Deep Learning possible [2, 3]. Several mobile deep learning frameworks have
been developed by the open-source community or by industry leaders, with low-level software
optimizations such as General Matrix Multiplication (GeMM), GPU libraries (e.g. OpenCL™,
Vulkan® and OpenGL® ES) and, most recently, general hardware accelerator APIs like NNAPI,
making on-device inference a new opportunity [4].
But these features are implemented differently across frameworks, and the combination of model,
framework, hardware and device makes performance assessment difficult.
Two smartphones and one tablet, based on the two most popular architectures, Qualcomm
Snapdragon and Samsung Exynos, were chosen. Android devices were selected because their
framework deployment process is easier than on the Apple iPhone. We used four different
frameworks with different low-level software optimization techniques, such as the integration of
Arm assembly code portions, the integration of GeMM libraries (Eigen, OpenBLAS or custom
implementations), NPU support and different graphics libraries (OpenCL, OpenGL, Vulkan). Our
models are pre-trained on the ImageNet dataset with both TensorFlow and PyTorch, allowing us to
easily convert them to our two other frameworks.
Our approach is to evaluate frameworks and models designed and developed for mobile devices,
with the objective of providing the community with inference latency, Top-1 and Top-5 accuracy
and power efficiency results for different models. These results allow scientists to make informed
decisions and save time when choosing software libraries and hardware to run image
classification, object detection or instance segmentation on resource-constrained devices based on
the Arm Cortex-A architecture. Our work differs from others in that we developed an Android
Java application for each framework in which inference took place.
2. RELATED WORK
Bahrampour et al. [5] evaluated the performance of Deep Learning frameworks, but they focused
their work on a desktop computer with a Titan X GPU.

Lu et al. [6] ran their benchmark on different mobile frameworks with an Nvidia TK1 and TX1,
which are not the smartphones or tablets used by consumers.

Sehgal and Kehtarnavaz [7] offered a benchmark of multiple deep learning models inferring on
mobile SoCs, but they tested only TFLite and Core ML.

MLPerf [8] and AI Benchmark [9, 10] provide an Android application to test various models on
the device using different scenarios. Their limitations are the inference engine, which is based on
TFLite only, and the output, which is either approximated (MLPerf) or displayed as a weighted
score (AI Benchmark).

Bianco et al. [11] and Almeida et al. [12] proposed the most closely related works. They evaluated
multiple models on diverse architectures, among which are mobile SoCs. The main difference is
that they did not run their tests from an Android application.

Benchmarks and previous work evaluating the performance of deep learning models or
frameworks on different devices therefore exist, but we propose an alternative approach, focusing
our tests on mobile devices, either smartphones or tablets, with frameworks and models optimized
for them.
3. ALGORITHMIC APPROACH
For our experiment, we chose two smartphones separated by a SoC generation gap and one tablet
with a boosted SoC. Four frameworks were implemented, on which we executed seven models:
five 32-bit floating-point models and two quantized (8-bit integer) models used as image
segmentation backbones. To simulate the most representative use cases for real-time image
segmentation tasks, we needed a dataset with enough images. Our choice was to use the ImageNet
2012 validation dataset containing 50,000 images. We kept the results from this first benchmark to
evaluate the device power consumption and the image inference latency from the device's camera.
3.1. Devices
We selected the latest Samsung Galaxy Tab S7 containing a Qualcomm Snapdragon 865+ SoC,
the OnePlus 8 with a Qualcomm Snapdragon 865 and the newest Samsung Galaxy S21 with a
Samsung Exynos 2100. The two Snapdragons share the same architecture, letting us explore
whether the extra 260 MHz on one big core and the 87 MHz GPU boost provided by the 865+
have a significant impact on latency. The recently released Exynos 2100 represents a generation
gap with the Snapdragon 865: it is based on the new Arm Cortex-X1 which, according to Arm, is
30% faster and has twice the ML performance of the Cortex-A77 [13]. The three devices run the
Android 11 operating system. Table 1 shows their specifications in depth. Our experiment was
launched on all the hardware available on each device, namely CPU, GPU and NPU/DSP, with
different multi-threading scenarios. When running on the GPU, we inferred with the OpenCL,
OpenGL or Vulkan graphics libraries. Manufacturers consider the CPU, GPU and NPU/DSP as a
whole, named the AI engine, which can only run quantized models with specific software
frameworks.
Table 1. Device SoC specifications: amount of RAM, cluster type with the number of cores it contains,
Arm core reference and core frequencies.

SoC     RAM (Gb)   Cluster   Number   Ref   Freq (GHz)
865     8          LITTLE    4        A55   1.80
                   big       3        A77   2.42
                   big       1        A77   2.84
865+    6          LITTLE    4        A55   1.80
                   big       3        A77   2.42
                   big       1        A77   3.10
2100    8          LITTLE    4        A55   2.20
                   big       3        A78   2.80
                   big       1        X1    2.90
3.2. Frameworks
We tested four open-source frameworks, TensorFlow Lite 2.4.0, MNN 1.1.0, NCNN 20201218
and PyTorch mobile 1.7.
They all have Arm NEON optimizations and the OpenMP library integrated into their source code.
TFLite [14] is, at the time of this paper, the only framework to offer a general hardware
accelerator library, NNAPI, which allows inference on the AI engine. MNN and NCNN use a
custom GeMM implementation, whereas PyTorch does not yet offer a GPU or NPU inference
option.

We selected these frameworks for their mobile focus: all of them are compatible with Android and
iOS devices.
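As an illustration of how these execution paths are selected in practice, the sketch below shows how a TFLite 2.x Interpreter can be configured in Java for the three scenarios used in our experiments (CPU with an explicit thread count, GPU delegate, NNAPI delegate). It is a minimal sketch, not the exact code of our application; the class and helper method names are ours, while Interpreter, GpuDelegate and NnApiDelegate come from the TFLite Android libraries. MNN, NCNN and PyTorch Mobile expose equivalent options through their own APIs.

import java.io.File;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;
import org.tensorflow.lite.nnapi.NnApiDelegate;

public final class TfLiteBackends {

    // CPU execution with an explicit thread count (1 to 10 in our scenarios).
    static Interpreter cpuInterpreter(File model, int numThreads) {
        Interpreter.Options options = new Interpreter.Options();
        options.setNumThreads(numThreads);
        return new Interpreter(model, options);
    }

    // GPU execution; the TFLite GPU delegate uses OpenCL when available and
    // falls back to OpenGL ES otherwise.
    static Interpreter gpuInterpreter(File model) {
        Interpreter.Options options = new Interpreter.Options();
        options.addDelegate(new GpuDelegate());
        return new Interpreter(model, options);
    }

    // AI engine execution through NNAPI (quantized models only in our experiments).
    static Interpreter nnapiInterpreter(File model) {
        Interpreter.Options options = new Interpreter.Options();
        options.addDelegate(new NnApiDelegate());
        return new Interpreter(model, options);
    }
}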
3.3. Models and Dataset
The inference was launched on ImageNet 2012 [15] pre-trained models commonly used as image
segmentation backbones.

The main difficulty was to find models available in both PyTorch and TensorFlow, but we
managed to download five 32-bit floating-point models: SqueezeNet v1.1 (sqn11) [16],
MobileNet v2 (mob2) [17], Inception v3 (inc3) [18], ResNet50 v1 (res50) and ResNet101 v1
(res101) [19], plus two TFLite quantized models: MobileNet v2 (mob2q) and Inception v3 (inc3q)
to run on the AI engine.
Table 2 shows the Top-1 and Top-5 accuracy provided by Tensorflow and PyTorch Hub [20, 21,
22, 23].
Table 2. PyTorch and TensorFlow Top-1 and Top-5 model accuracies as provided by the sources.

Framework    Model                  Top-1 (%)   Top-5 (%)
TensorFlow   SqueezeNet v1.1        49.00       72.90
TensorFlow   MobileNet v2           71.90       91.00
TensorFlow   MobileNet v2 (quant)   70.80       89.90
TensorFlow   Inception v3           78.00       93.90
TensorFlow   Inception v3 (quant)   77.50       93.70
TensorFlow   ResNet50 v1            75.20       92.20
TensorFlow   ResNet101 v1           76.40       92.90
PyTorch      SqueezeNet v1.1        58.19       80.62
PyTorch      MobileNet v2           71.88       90.29
PyTorch      Inception v3           77.45       93.56
PyTorch      ResNet50 v1            76.15       92.87
PyTorch      ResNet101 v1           77.37       93.56
3.4. Model conversion process
The frameworks implemented for our experiment cannot use the downloaded models directly; they
need to be converted. TFLite and PyTorch Mobile models were the easiest to convert thanks to the
tools provided by their parent training framework, but MNN and NCNN do not support all of the
PyTorch and TensorFlow operations.

To be compatible, PyTorch models had to be converted to the ONNX format. We then ran
different converters to make them compatible with MNN and NCNN.

For the TensorFlow models, the MNN and NCNN tools were unable to convert the ResNet v1 and
Inception v3 architectures.
3.5. Image pre-processing
During the training phase of our models, each image was transformed to fit the input tensor. We
had to reproduce these pre-processing steps to obtain the best accuracy.

TensorFlow crops or pads the image to its smallest side, followed by a scale down; it then
normalizes each colour channel (red, green and blue) with a mean and a standard deviation equal
to 127.5 for floating-point models, and a mean of 0.0 and a standard deviation of 1.0 for quantized
models.

PyTorch does roughly the opposite: it resizes the image before cropping or padding it. Its
normalization parameters for the red, green and blue channels are, respectively, a mean of 0.485,
0.456, 0.406 and a standard deviation of 0.229, 0.224, 0.225.
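To make these two conventions concrete, the sketch below normalizes an ARGB_8888 pixel buffer the way the paragraph describes, assuming the image has already been resized and cropped to the model input size and that the PyTorch parameters are applied after scaling pixels to [0, 1] (the usual PyTorch convention). The class and method names are illustrative and not taken from any framework.

public final class InputNormalization {

    // TensorFlow floating-point models: (value - 127.5) / 127.5 on every channel.
    static float[] normalizeTensorFlow(int[] argbPixels) {
        float[] out = new float[argbPixels.length * 3];
        int i = 0;
        for (int p : argbPixels) {
            out[i++] = (((p >> 16) & 0xFF) - 127.5f) / 127.5f; // R
            out[i++] = (((p >> 8) & 0xFF) - 127.5f) / 127.5f;  // G
            out[i++] = ((p & 0xFF) - 127.5f) / 127.5f;         // B
        }
        return out;
    }

    // PyTorch models: scale to [0, 1], then per-channel mean/std normalization.
    private static final float[] MEAN = {0.485f, 0.456f, 0.406f};
    private static final float[] STD = {0.229f, 0.224f, 0.225f};

    static float[] normalizePyTorch(int[] argbPixels) {
        float[] out = new float[argbPixels.length * 3];
        int i = 0;
        for (int p : argbPixels) {
            out[i++] = ((((p >> 16) & 0xFF) / 255.0f) - MEAN[0]) / STD[0]; // R
            out[i++] = ((((p >> 8) & 0xFF) / 255.0f) - MEAN[1]) / STD[1];  // G
            out[i++] = (((p & 0xFF) / 255.0f) - MEAN[2]) / STD[2];         // B
        }
        return out;
    }
}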
3.6. Algorithm
For each framework, we developed a Java Android application which looped over all the converted
models and ran inference on each of the 50,000 ImageNet images for every hardware target
available (CPU, OpenCL, OpenGL, Vulkan or NNAPI), from one to ten threads.

For each inference, we recorded the time the device took to output the probabilities. We compared
the output to the label contained in the ground-truth file provided with the dataset to determine
whether the correct class had the highest probability (Top-1) or was among the five highest
(Top-5). Latency and accuracy were saved in a CSV file in the internal memory. When the test was
launched, the device was plugged into the power source in airplane mode and screen brightness
was at its minimum level. Energy consumption was not measured in this algorithm.

Based on the results collected with this algorithm, the same experiment parameters were then
applied to a camera stream acquired on the device. The energy efficiency of all the components as
well as the image pre-processing time were evaluated. In addition, the screen and camera power
consumption were measured separately to isolate the hardware used during the inference.
Algorithm 1. Experiment algorithm

Input:       image = 1,…,50000
Parameters:  hardware = CPU,…,NNAPI
             thread = 1,…,10
             model = sqn11,…,inc3q
Output:      latency = inference latency of the image
             isInTop1 = ground truth compared to the best probability
             isInTop5 = ground truth compared to the five best probabilities

for hardware = CPU to NNAPI do
    for thread = 1 to 10 do
        for model = sqn11 to inc3q do
            for image = 1 to 50000 do
                preProcessedImage ← preProcessImage(image);
                startTime ← getSystemTime();
                probs ← infer(preProcessedImage);
                stopTime ← getSystemTime();
                latency ← (stopTime − startTime);
                descendantOrderSort(probs);
                isInTop1 ← false;
                isInTop5 ← false;
                if ground truth == probs[0] then
                    isInTop1 ← true;
                    isInTop5 ← true;
                else if ground truth in probs[1:4] then
                    isInTop5 ← true;
                end
                appendToCSV(latency, isInTop1, isInTop5);
            end
        end
    end
end
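The sketch below is a plain-Java illustration of the timed inner loop of Algorithm 1, assuming a framework-specific Classifier that returns one probability per class; the interface and class names are ours, not part of any of the four frameworks (on Android, SystemClock.elapsedRealtimeNanos() can replace System.nanoTime()).

import java.util.Arrays;
import java.util.Comparator;

public final class TimedInference {

    interface Classifier {
        float[] infer(float[] preProcessedImage); // one probability per class
    }

    static final class Result {
        final double latencyMs;
        final boolean isInTop1;
        final boolean isInTop5;
        Result(double latencyMs, boolean isInTop1, boolean isInTop5) {
            this.latencyMs = latencyMs;
            this.isInTop1 = isInTop1;
            this.isInTop5 = isInTop5;
        }
    }

    static Result run(Classifier classifier, float[] preProcessedImage, int groundTruth) {
        long start = System.nanoTime();
        float[] probs = classifier.infer(preProcessedImage);
        long stop = System.nanoTime();
        double latencyMs = (stop - start) / 1e6;

        // Class indices sorted by descending probability.
        Integer[] order = new Integer[probs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> -probs[i]));

        boolean top1 = order[0] == groundTruth;
        boolean top5 = top1;
        for (int k = 1; k < 5 && !top5; k++) top5 = order[k] == groundTruth;
        return new Result(latencyMs, top1, top5);
    }
}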
4. EXPERIMENTAL RESULTS
We executed the seven models with the four frameworks on the two smartphones and the tablet
described in Section 3.
4.1. ImageNet dataset latency
We ran Algorithm 1 on the two smartphones and the tablet to get as close as possible to real-world
use-case results. Our algorithm looped over 50,000 images, which approximates a video feed from
the device camera and simulates a real-time image segmentation backbone. An acceptable latency
for this task is under 30 ms, which allows around 30 frames per second to be displayed while
leaving room for image pre-processing and decoding. One of the intrinsic limitations of our
devices is the thermal protection mechanism, also known as Dynamic Voltage and Frequency
Scaling (DVFS) or CPU throttling: the system downscales the CPU frequency to dissipate the
heat. Figure 1 shows two different DVFS behaviours when we ran NCNN on the Snapdragon 865
with one CPU thread. The DVFS effect on Inception v3 pre-trained with PyTorch (1a) is not
obvious, resulting in stable inference latencies within a narrow range of about 4 ms (1b). On the
contrary, ResNet50 v1 pre-trained with TensorFlow (1c) shows two inference levels, 165 ms and
280 ms (1d). From about the 30,000th image, the SoC is so hot that it stays longer at the 280 ms
level. DVFS is less pronounced on the Snapdragon 865+ because it sits in an 11-inch tablet, which
has more room to dissipate heat than the two other devices, as shown in Figure 2.
Figure 1. Inference without (a)(b) and with (c)(d) DVFS effect on the Snapdragon 865 CPU with 1 thread:
(a) raw inference latency (in ms) of Inception v3 (PyTorch), (b) its kernel density estimation,
(c) raw inference latency (in ms) of ResNet50 v1 (TensorFlow), (d) its kernel density estimation.
Figure 2. SoC boards from the Samsung Galaxy Tab S7 (a), Samsung Galaxy S21 5G (b) and OnePlus 8 (c)
(not to scale).
Figure 3a shows that the multi-threading mechanism did not affect the GPU: switching from 1 to
10 threads left the latency unchanged.

Figure 3b shows the AI engine, which uses the CPU, GPU and NPU. MobileNet v2 and Inception
v3 latencies improved when switching from the floating-point models on the GPU to their
quantized versions on the AI engine. The quantized version of Inception v3 on the Exynos 2100
improves when running with 1 to 4 threads. The NNAPI library selects the best hardware available
to improve latency; in our case, it used the NPU and the GPU, except for the Inception v3 model
on the Exynos 2100.
Figure 3. Influence of multi-threading on the GPU (a) and on the AI engine (b).
Table 3 reports the arithmetic mean (µ) and standard deviation (σ) of the inference latency in
milliseconds, together with the accuracy loss relative to the reference models of Table 2. Each row
gives the best result of our experiment for that model.

For floating-point models, the TensorFlow model latencies are best with TFLite OpenCL. The
TensorFlow SqueezeNet v1.1, ResNet50 v1 and ResNet101 v1 results should be read with caution
because of the conversion issue reported in Section 3.4. The new Exynos 2100 provides an
improvement over the Snapdragon 865, especially for PyTorch models inferred with NCNN
Vulkan. NCNN shows a non-negligible accuracy drop on several models, but on SqueezeNet v1.1
it is 18 % faster than the second-best framework, MNN, which reaches 13.91 ± 0.16 ms on the
Snapdragon 865 and 13.36 ± 0.43 ms on the Snapdragon 865+.

The Snapdragon 865+ outperforms the most recent generation thanks to its higher CPU and GPU
frequencies. This increase in frequency could be expected to worsen throttling; however, the
device demonstrates an excellent capability to dissipate heat, making the extra computational
power useful.

The inconclusive results of TFLite NNAPI on the Exynos 2100 are likely related to driver
compatibility for the Samsung NPU, which was probably not implemented yet.
Table 3. Best mean (µ) and standard deviation (σ) of inference time in milliseconds for each model trained
with TensorFlow (model-tf) and PyTorch (model-pt), across all hardware and frameworks on the three
devices, with the accuracy loss relative to the Table 2 Top-1 and Top-5 values.

SoC               Model       Framework  Hardware  µ ± σ (ms)       Top-1 (%)  Top-5 (%)
Snapdragon 865    sqn11-pt    NCNN       CPU2      11.40 ± 3.89     -13.07     -10.67
                  sqn11-tf    NCNN       CPU3      20.04 ± 4.13     -24.67     -28.25
                  mob2-pt     MNN        CPU7      15.96 ± 2.32     -3.17      -1.49
                  mob2-tf     NCNN       CPU3      13.00 ± 2.86     -3.79      -3.30
                  mobq2-tf    TFLite     NNAPI     4.44 ± 0.69      -1.79      -1.15
                  inc3-pt     MNN        CPU8      158.54 ± 33.94   -1.40      -0.68
                  inc3-tf     TFLite     OpenCL    70.07 ± 1.69     -0.44      -0.24
                  inc3q-tf    TFLite     NNAPI     52.67 ± 4.46     -0.26      -0.20
                  res50-pt    NCNN       Vulkan    86.23 ± 2.87     -7.70      -4.22
                  res50-tf    TFLite     OpenCL    26.54 ± 13.06    -47.08     -42.04
                  res101-pt   NCNN       Vulkan    133.17 ± 2.24    -5.78      -3.09
                  res101-tf   TFLite     OpenCL    25.98 ± 13.03    -48.28     -42.74
Snapdragon 865+   sqn11-pt    NCNN       CPU3      10.95 ± 2.24     -13.07     -10.67
                  sqn11-tf    TFLite     OpenCL    8.14 ± 0.59      -20.88     -22.74
                  mob2-pt     NCNN       CPU3      14.54 ± 1.24     -9.51      -5.78
                  mob2-tf     TFLite     OpenCL    5.47 ± 0.70      -1.70      -1.63
                  mobq2-tf    TFLite     NNAPI     4.07 ± 0.81      -1.79      -1.15
                  inc3-pt     MNN        CPU5      182.09 ± 23.17   -1.40      -0.68
                  inc3-tf     TFLite     OpenCL    46.85 ± 0.60     -0.44      -0.24
                  inc3q-tf    TFLite     NNAPI     12.06 ± 0.81     -0.26      -0.20
                  res50-pt    NCNN       Vulkan    66.29 ± 2.75     -7.70      -4.22
                  res50-tf    TFLite     OpenCL    8.13 ± 0.58      -47.08     -42.04
                  res101-pt   NCNN       Vulkan    101.22 ± 2.3     -5.78      -3.09
                  res101-tf   TFLite     OpenCL    8.17 ± 0.59      -48.28     -42.74
Exynos 2100       sqn11-pt    MNN        CPU4      12.88 ± 0.44     -4.17      -3.08
                  sqn11-tf    TFLite     OpenCL    18.02 ± 7.46     -20.88     -22.74
                  mob2-pt     MNN        CPU5      14.55 ± 3.43     -3.17      -1.49
                  mob2-tf     TFLite     OpenCL    11.37 ± 6.80     -1.70      -1.63
                  mobq2-tf    TFLite     NNAPI     10.77 ± 5.56     -1.79      -1.15
                  inc3-pt     MNN        CPU4      194.73 ± 43.03   -1.40      -0.68
                  inc3-tf     TFLite     OpenCL    93.05 ± 31.31    -0.44      -0.24
                  inc3q-tf    TFLite     NNAPI     40.92 ± 9.53     -0.26      -0.20
                  res50-pt    NCNN       Vulkan    81.48 ± 9.72     -7.70      -4.22
                  res50-tf    TFLite     OpenCL    17.00 ± 6.71     -47.08     -42.04
                  res101-pt   NCNN       Vulkan    117.25 ± 29.78   -5.78      -3.09
                  res101-tf   TFLite     OpenCL    17.07 ± 7.01     -48.28     -42.74
4.2. Accuracy
An accuracy loss occurs when a model is converted. For TensorFlow models there was a drop of
1-2 % on Top-1 and 2-3 % on Top-5 across all frameworks, ordered from the lowest loss to the
highest: TFLite, MNN, NCNN. The same figures appeared for PyTorch models except for NCNN,
which had a 7-13 % drop on Top-1 and a 5-11 % drop on Top-5 depending on the model.

In addition, the TensorFlow SqueezeNet v1.1, ResNet50 v1 and ResNet101 v1 models did not
behave correctly with the pre-processing parameters provided on TensorFlow Hub, capping their
accuracy at 28.12 % Top-1 and 50.16 % Top-5.

The accuracy loss from quantization is negligible compared to the latency gain. MobileNet v2 lost
1.89 % of accuracy compared to its floating-point version but reduced its latency by 5 % on the
2100 SoC, 26 % on the 865+ and 66 % on the 865. The gain is even larger with the Inception v3
model.
4.3. Camera stream latency
The camera stream from the device was integrated into the Android application using the Camera2
API. Images are acquired by the device camera in the YUV420 format and converted into
ARGB8888 to make them compatible with the models' input. Each image is 640 pixels wide and
480 pixels high. These new results include the image pre-processing latency executed by the
framework and show the behaviour of the CPU and GPU governors once the device is powered by
its battery. The results are consistent with the ImageNet ones. There is a performance drop for all
frameworks as the device has to manage its energy, and we observe that TFLite is more affected
than MNN or NCNN. Once again, quantized models outperform the others on the three devices.
This is particularly obvious for Inception v3, which is approximately 3 times faster than its
floating-point version on the Snapdragon 865, 2.5 times on the Snapdragon 865+ and 1.5 times on
the Exynos 2100. This experiment confirms the performance of the Snapdragon 865+, owing to its
reduced DVFS effect.
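As an illustration of the YUV420 to ARGB8888 conversion step, the sketch below applies the common full-range JFIF/BT.601 coefficients to one pixel. It is only a minimal sketch of the conversion described above, not the exact routine of our application; real code must also walk the three YUV_420_888 planes using their row and pixel strides before applying this formula.

public final class YuvToArgb {

    // Convert one YUV pixel (full-range, 0..255 components) to a packed ARGB_8888 int.
    static int toArgb(int y, int u, int v) {
        int d = u - 128;
        int e = v - 128;
        int r = clamp((int) (y + 1.402f * e));
        int g = clamp((int) (y - 0.344f * d - 0.714f * e));
        int b = clamp((int) (y + 1.772f * d));
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }

    private static int clamp(int value) {
        return Math.max(0, Math.min(255, value));
    }
}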
4.4. Power efficiency
Before each test, the devices were fully charged, screen brightness was set to medium, and
Bluetooth and Wi-Fi were turned on to reproduce real device usage as closely as possible. The test
was stopped once the device's battery reached 97 % to avoid the non-linear discharge of the
lithium-ion battery.

We recorded the elapsed time for the device to go from 100 to 97 % with the help of Google's
Battery Historian software [24]. We measured the screen consumption by setting the device in
airplane mode and recording the time taken to reach 97 % with the screen on at medium
brightness. We then measured the camera consumption by repeating the same process with the
camera application running, and subtracted the screen consumption from the observed value to
isolate the camera.
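The sketch below makes this bookkeeping explicit, under our reading that the camera figure is obtained by subtraction from the screen-only measurement and that the per-image figures of Table 4 are the remaining charge divided by the number of images processed. The class and method names are ours, and the exact normalisation used by the measurement tooling may differ.

public final class EnergyAccounting {

    // Camera consumption = (screen + camera run) - (screen-only run).
    static double cameraConsumption(double screenPlusCameraMah, double screenOnlyMah) {
        return screenPlusCameraMah - screenOnlyMah;
    }

    // Charge attributed to the inference hardware per image, converted from mAh to µAh.
    static double hardwarePerImage(double totalRunMah, double screenMah, double cameraMah,
                                   long imagesProcessed) {
        double hardwareMah = totalRunMah - screenMah - cameraMah;
        return hardwareMah * 1000.0 / imagesProcessed;
    }
}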
The screen energy consumption for the Snapdragon 865, 865+ and Exynos 2100 devices is
respectively 214 mAh, 619 mAh and 198 mAh; for the cameras, 438 mAh, 413 mAh and 792 mAh.
The Snapdragon 865+ device's screen is larger than the two others, and the Exynos 2100 device
has the most powerful camera module.

Once again, our results show that small models and quantized models are the most energy
efficient: the faster a model runs, the less energy it consumes. Also, the device's screen and camera
have a larger impact on energy consumption than the hardware dedicated to the inference.
Table 4. Latency and energy consumed, in µA per image, for processing one image from the device camera
stream after the device has consumed 3 % of its battery. Hardware consumption is the energy consumed by
the hardware involved in the inference (CPU, GPU, NPU, RAM); Device consumption is the total energy
consumed by the device (screen and camera included).

SoC               Model       Framework  Hardware  µ ± σ (ms)       Hardware (µA/img)  Device (µA/img)
Snapdragon 865    sqn11-pt    NCNN       CPU2      11.40 ± 3.89     2.73               6.55
                  sqn11-tf    NCNN       CPU3      20.04 ± 4.13     5.69               9.76
                  mob2-pt     MNN        CPU7      15.96 ± 2.32     0.64               4.53
                  mob2-tf     NCNN       CPU3      13.00 ± 2.86     3.24               6.49
                  mobq2-tf    TFLite     NNAPI     4.44 ± 0.69      0.92               3.69
                  inc3-pt     MNN        CPU8      158.54 ± 33.94   27.40              59.78
                  inc3-tf     TFLite     OpenCL    70.07 ± 1.69     17.06              34.22
                  inc3q-tf    TFLite     NNAPI     52.67 ± 4.46     2.47               7.40
                  res50-pt    NCNN       Vulkan    86.23 ± 2.87     13.11              31.55
                  res50-tf    TFLite     OpenCL    26.54 ± 13.06    7.14               15.58
                  res101-pt   NCNN       Vulkan    133.17 ± 2.24    20.06              48.12
                  res101-tf   TFLite     OpenCL    25.98 ± 13.03    8.50               25.48
Snapdragon 865+   sqn11-pt    NCNN       CPU3      10.95 ± 2.24     3.92               7.57
                  sqn11-tf    TFLite     OpenCL    8.14 ± 0.59      2.97               8.06
                  mob2-pt     NCNN       CPU3      14.54 ± 1.24     4.85               9.37
                  mob2-tf     TFLite     OpenCL    5.47 ± 0.70      2.16               6.49
                  mobq2-tf    TFLite     NNAPI     4.07 ± 0.81      0.94               5.09
                  inc3-pt     MNN        CPU5      182.09 ± 23.17   51.73              107.44
                  inc3-tf     TFLite     OpenCL    46.85 ± 0.60     11.44              30.98
                  inc3q-tf    TFLite     NNAPI     12.06 ± 0.81     2.69               10.35
                  res50-pt    NCNN       Vulkan    66.29 ± 2.75     18.88              39.22
                  res50-tf    TFLite     OpenCL    8.13 ± 0.58      8.01               19.73
                  res101-pt   NCNN       Vulkan    101.22 ± 2.3     34.17              70.97
                  res101-tf   TFLite     OpenCL    8.17 ± 0.59      14.28              35.13
Exynos 2100       sqn11-pt    MNN        CPU4      12.88 ± 0.44     1.15               5.39
                  sqn11-tf    TFLite     OpenCL    18.02 ± 7.46     1.89               13.18
                  mob2-pt     MNN        CPU5      14.55 ± 3.43     1.67               7.82
                  mob2-tf     TFLite     OpenCL    11.37 ± 6.80     3.21               14.99
                  mobq2-tf    TFLite     NNAPI     10.77 ± 5.56     2.96               13.70
                  inc3-pt     MNN        CPU4      194.73 ± 43.03   10.57              73.67
                  inc3-tf     TFLite     OpenCL    93.05 ± 31.31    21.73              60.38
                  inc3q-tf    TFLite     NNAPI     40.92 ± 9.53     4.30               30.48
                  res50-pt    NCNN       Vulkan    81.48 ± 9.72     14.17              39.63
                  res50-tf    TFLite     OpenCL    17.00 ± 6.71     7.18               33.19
                  res101-pt   NCNN       Vulkan    117.25 ± 29.78   15.63              54.00
                  res101-tf   TFLite     OpenCL    17.07 ± 7.01     15.18              52.90
5. CONCLUSION
In this paper we presented an on-device inference latency benchmark to help the community better
deploy image classification/segmentation models on Android devices. Our results showed that
quantized models on the AI engine should be the de facto standard, especially for complex models
like Inception v3: quantized models are more energy efficient and perform better than floating-
point ones, with a tiny loss of accuracy. If floating point is the only option, developers should go
for TFLite, which offers an easy model conversion and integration process on Android. For
PyTorch models, we saw that NCNN is a notable candidate, but it needs to improve its conversion
process to regain accuracy. We look forward to GPU and NPU/DSP support in future versions of
the PyTorch Mobile framework.

For MNN and NCNN, integrating these frameworks inside an Android application is not a
straightforward task. The conversion step is not user-friendly, as engineers need to compile or find
the appropriate converter and execute commands to transform the original model into a compatible
and optimized one. Additionally, the framework libraries must be compiled and integrated with the
Android NDK, which is an error-prone process.

To conclude, manufacturers should improve heat dissipation or cooling mechanisms on small
devices to mitigate the DVFS effect, which would result in improved latency.
REFERENCES
[1] Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. Survey and benchmarking of machine learning accelerators. 2019 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9, 2019.
[2] Sahar Voghoei, Navid Hashemi Tonekaboni, Jason G Wallace, and Hamid Reza Arabnia. Deep learning at the edge. 2018 International Conference on Computational Science and Computational Intelligence (CSCI), pages 895–901, 2018.
[3] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim M. Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. Machine learning at Facebook: Understanding inference at the edge. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 331–344, 2019.
[4] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. A first look at deep learning apps on smartphones. In WWW '19, 2019.
[5] Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. Comparative study of deep learning software frameworks. arXiv: Learning, 2016.
[6] Zongqing Lu, Swati Rallapalli, Kevin S. Chan, and Thomas F. La Porta. Modeling the resource requirements of convolutional neural networks on mobile devices. Proceedings of the 25th ACM International Conference on Multimedia, 2017.
[7] Abhishek Sehgal and Nasser Kehtarnavaz. Guidelines and benchmarks for deployment of deep learning models on smartphones as real-time apps. Machine Learning and Knowledge Extraction, 1:450–465, 2019.
[8] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, and Yuchen Zhou. MLPerf inference benchmark, 2019.
[9] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. AI Benchmark: Running deep neural networks on Android smartphones. In ECCV Workshops, 2018.
[10] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seung-soo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. AI Benchmark: All about deep learning on smartphones in 2019. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3617–3635, 2019.
[11] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.
[12] Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. EmBench: Quantifying performance variations of deep neural networks across modern commodity devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications, EMDL'19, pages 1–6, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367714. doi: 10.1145/3325413.3329793. https://doi.org/10.1145/3325413.3329793.
[13] Arm Ltd. Introducing the Arm Cortex-X custom program, 2021. https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-cortexx-custom-program.
[14] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
[15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[16] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. ArXiv, abs/1602.07360, 2016.
[17] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. ArXiv, abs/1801.04381, 2018.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] Google. TensorFlow Hub, 2020. https://tfhub.dev/.
[21] Google. TFLite hosted models, 2020. https://github.com/tensorflow/models/tree/master/research/slim.
[22] PyTorch. PyTorch model hub, 2020. https://pytorch.org/hub/.
[23] TensorFlow. TFLite hosted models, 2020. https://www.tensorflow.org/lite/guide/hosted_models.
[24] Google. Battery Historian GitHub, 2021. https://github.com/google/battery-historian.
AUTHORS
Mathieu Febvay is currently a PhD candidate at the ERIC laboratory of the University of Lyon in France
(UR 3083). His research focuses on lightweight Deep Learning, where he investigates the performance and
feasibility of neural network models running on resource-constrained devices. He also works as a Software
Engineer on mobile devices. He holds a Master in Computer Science (MIAGE) from the University of Lyon
(2017) and a degree in Software Development from the University of Montpellier (2008). He is interested in
the field of health and mobile medical devices.
Ahmed Bounekkar is an Associate Professor at the University of Lyon 1, attached to the ERIC laboratory.
Since 2009, he has been in charge of the Master MIAGE in management informatics. His research focuses on
modelling complex systems for the design of decision-support methodologies, in particular the development
of algorithms for data structuring, machine learning and multi-objective optimisation problems. The proposed
models mainly address problems in the field of health.
© 2022 By AIRCC Publishing Corporation. This article is published under the Creative Commons
Attribution (CC BY) license.