Advanced Radiation Sensors VLSI Design
Sergio Saponara
Alessandro De Gloria Editors
Applications
in Electronics
Pervading Industry,
Environment and
Society
APPLEPIES 2019
Lecture Notes in Electrical Engineering
Volume 627
Series Editors
Leopoldo Angrisani, Department of Electrical and Information Technologies Engineering, University of Napoli
Federico II, Naples, Italy
Marco Arteaga, Departament de Control y Robótica, Universidad Nacional Autónoma de México, Coyoacán,
Mexico
Bijaya Ketan Panigrahi, Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Samarjit Chakraborty, Fakultät für Elektrotechnik und Informationstechnik, TU München, Munich, Germany
Jiming Chen, Zhejiang University, Hangzhou, Zhejiang, China
Shanben Chen, Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Tan Kay Chen, Department of Electrical and Computer Engineering, National University of Singapore,
Singapore, Singapore
Rüdiger Dillmann, Humanoids and Intelligent Systems Laboratory, Karlsruhe Institute for Technology,
Karlsruhe, Germany
Haibin Duan, Beijing University of Aeronautics and Astronautics, Beijing, China
Gianluigi Ferrari, Università di Parma, Parma, Italy
Manuel Ferre, Centre for Automation and Robotics CAR (UPM-CSIC), Universidad Politécnica de Madrid,
Madrid, Spain
Sandra Hirche, Department of Electrical Engineering and Information Science, Technische Universität
München, Munich, Germany
Faryar Jabbari, Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA,
USA
Limin Jia, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Alaa Khamis, German University in Egypt El Tagamoa El Khames, New Cairo City, Egypt
Torsten Kroeger, Stanford University, Stanford, CA, USA
Qilian Liang, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA
Ferran Martín, Departament d’Enginyeria Electrònica, Universitat Autònoma de Barcelona, Bellaterra,
Barcelona, Spain
Tan Cher Ming, College of Engineering, Nanyang Technological University, Singapore, Singapore
Wolfgang Minker, Institute of Information Technology, University of Ulm, Ulm, Germany
Pradeep Misra, Department of Electrical Engineering, Wright State University, Dayton, OH, USA
Sebastian Möller, Quality and Usability Laboratory, TU Berlin, Berlin, Germany
Subhas Mukhopadhyay, School of Engineering & Advanced Technology, Massey University,
Palmerston North, Manawatu-Wanganui, New Zealand
Cun-Zheng Ning, Electrical Engineering, Arizona State University, Tempe, AZ, USA
Toyoaki Nishida, Graduate School of Informatics, Kyoto University, Kyoto, Japan
Federica Pascucci, Dipartimento di Ingegneria, Università degli Studi “Roma Tre”, Rome, Italy
Yong Qin, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Gan Woon Seng, School of Electrical & Electronic Engineering, Nanyang Technological University,
Singapore, Singapore
Joachim Speidel, Institute of Telecommunications, Universität Stuttgart, Stuttgart, Germany
Germano Veiga, Campus da FEUP, INESC Porto, Porto, Portugal
Haitao Wu, Academy of Opto-electronics, Chinese Academy of Sciences, Beijing, China
Junjie James Zhang, Charlotte, NC, USA
The book series Lecture Notes in Electrical Engineering (LNEE) publishes the
latest developments in Electrical Engineering—quickly, informally and in high
quality. While original research reported in proceedings and monographs has
traditionally formed the core of LNEE, we also encourage authors to submit books
devoted to supporting student education and professional training in the various
fields and application areas of electrical engineering. The series covers classical and
emerging topics concerning:
• Communication Engineering, Information Theory and Networks
• Electronics Engineering and Microelectronics
• Signal, Image and Speech Processing
• Wireless and Mobile Communication
• Circuits and Systems
• Energy Systems, Power Electronics and Electrical Machines
• Electro-optical Engineering
• Instrumentation Engineering
• Avionics Engineering
• Control Systems
• Internet-of-Things and Cybersecurity
• Biomedical Devices, MEMS and NEMS
For general information about this book series, comments or suggestions, please
contact [email protected].
To submit a proposal or request further information, please contact the
Publishing Editor in your country:
China
Jasmine Dou, Associate Editor ([email protected])
India, Japan, Rest of Asia
Swati Meherishi, Executive Editor ([email protected])
Southeast Asia, Australia, New Zealand
Ramesh Nath Premnath, Editor ([email protected])
USA, Canada:
Michael Luby, Senior Editor ([email protected])
All other Countries:
Leontina Di Cecco, Senior Editor ([email protected])
** Indexing: The books of this series are submitted to ISI Proceedings,
EI-Compendex, SCOPUS, MetaPress, Web of Science and Springerlink **
Editors
Sergio Saponara, DII, University of Pisa, Pisa, Italy
Alessandro De Gloria, DITEN, University of Genoa, Genoa, Italy
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Abstract In this paper we discuss some issues related to the design, implementation
and test of a CMOS Active Pixel Sensor. Two different pixel layouts, based on a
standard architecture, have been proposed to investigate the suitability of a 110 nm
standard technology for the realization of small-pixel, high-granularity detectors to
be used in High-Energy Physics, medical and space applications, such as particle
tracking or beam monitoring.
Keywords Active Pixel Sensor · CMOS · Radiation sensor · High energy physics
applications
1.1 Introduction
The adoption of standard CMOS technology has been suggested as a viable option
for the fabrication of particle detectors, integrating sensitive element and related
read-out circuitry on the same substrate. The inherently lower detection efficiency of
standard CMOS substrates can be compensated by the simultaneous integration of
small-capacitance detection nodes and signal conditioning and processing circuitry
[1]. This fosters the realization of integrated detectors without the need for hybrid
solutions, e.g. the very expensive bump-bonding between sensing nodes (pixels)
and read-out circuitry or the adoption of dedicated, ad-hoc technology flavours and
options (e.g. high-resistivity substrates, with thick epi-layers or multiple wells) [2,
3]. In this paper we discuss some design, implementation and test issues with respect
to the development of conventional Active Pixel Sensor (APS) matrices in 110 nm
LFoundry technology [4] conceived for CMOS Image Sensor (CIS) fabrication. The
aim of this study is to investigate the suitability of such a technology for the realization
To evaluate the performance of the chosen technology for high energy physics
applications, test structures based on single active pixels and on pixel arrays of limited
dimensions have been designed. The structures use the typical three-transistor pixel
architecture (Fig. 1.1), with different geometries of the sensitive area. The designed
chip also houses the interface circuits required for reading, addressing and interfacing
the sensitive component.
The particle detection principle is based on a photodiode, a reverse-biased pn
junction that detects the impinging radiation by converting the energy released in
the material into electrical charge. In high energy physics the sensor requirements are
typically very demanding, including high efficiency and good spatial localization.
Modern submicrometric VLSI processes offer a good tolerance to radiation damage,
guaranteeing the correct functionality of the sensor and a longer operating life.
To collect the maximum amount of charge inside the pixel, the chip substrate or the
epitaxial layer, if available, tends to be used as the p-type region of the photodiode,
whereas the n-type region is usually made by an n-well or an n+ implantation. In this
work we explore the possibility of using a standard CMOS technology, provided that
the layout of the sensitive element has been designed according to the technology
itself for the specific particle to be detected.
The APS embeds basic electronic signal processing inside the pixel,
directly connected to the sensitive element. In this way it is possible to increase
the reading speed and to reduce the noise, thanks to the lower impact of the parasitic
elements. The price to be paid, however, is the reduction of the fill factor (FF) due to
the “blind” area dedicated to electronic circuits. Therefore, during the pixel design,
an effort has been made to limit the area occupancy of the front-end electronics
and, at the same time, to increase the segmentation (pixel pitch) for better spatial
resolution. The reading is the most critical operation because the photodiode has to
be properly biased by connecting the cathode to the power supply through an NMOS
transistor (M1 in Fig. 1.1a), which fixes its voltage to (VDD − Vth). Figure 1.1b
highlights an additional voltage drop due to the capacitive coupling between gate and
source of M1 when the transistor is turned off. The useful signal is the voltage
variation measured at the cathode of the photodiode with respect to this voltage, and
therefore this configuration limits the useful excursion of the signal. In addition,
the source follower (M2 in Fig. 1.1a) must not leave the saturation
region, otherwise the voltage swing would be further reduced.
The scaling of CMOS technology introduces significant advantages (for example,
a reduced area occupancy), but the sensitive
element requires greater attention in the design, creating new challenges. In fact,
the relationship between pixel dimensions and minimum channel length is not
straightforward because the two scale differently. Consequently, beyond a certain level of
technological integration, pixel scaling is no longer convenient, as the improvement
in resolution no longer compensates for the new drawbacks.
Indeed, while the supply voltage tends to decrease in proportion to the scaling,
the threshold voltages do not follow the same trend, which reduces the
useful signal swing.
The chip uses two different layouts of 3T pixels, called Small Pixel and Large Pixel
(Fig. 1.2). They differ in the dimensions of the sensitive area, 0.25 and 56 µm² respectively,
while sharing the same square footprint with a 10 µm pixel pitch. Therefore,
on a total pixel area of 100 µm², the FF of the Small Pixel is 0.25%
while the FF of the Large Pixel is 56%.
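The fill factor figures follow directly from the ratio between sensitive area and pixel area. A minimal Python sketch of that arithmetic is given here for illustration; the function and variable names are hypothetical and are not taken from the authors' design flow.

```python
def fill_factor(sensitive_area_um2: float, pixel_pitch_um: float) -> float:
    """Return the fill factor (0..1) of a square pixel with the given pitch."""
    pixel_area_um2 = pixel_pitch_um ** 2
    return sensitive_area_um2 / pixel_area_um2


PIXEL_PITCH_UM = 10.0  # pixel pitch quoted in the text

# Sensitive areas quoted in the text for the two layouts.
small_ff = fill_factor(0.25, PIXEL_PITCH_UM)
large_ff = fill_factor(56.0, PIXEL_PITCH_UM)

print(f"Small Pixel FF: {small_ff:.2%}")
print(f"Large Pixel FF: {large_ff:.2%}")
```

With the quoted areas, the two results are 0.25% and 56%, matching the values stated above.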
The sensing node (photodiode) is made by an n+ doped implantation, hosted in a
deep p-well which is in turn realized on a standard p-type substrate (Fig. 1.3). The
metal interconnections have been shaped to minimize antenna effects while
allowing the integration of multiple n+ contacts. Within the design flow,
several parametric simulations have been carried out to explore the different
combinations of reset and source follower transistor geometries and photodiode node
geometries, and their impact on the pixel performance as a function of an external
stimulus compatible with the charge generated by a minimum ionizing particle (MIP).
As a general outcome, the Small Pixel
exhibits better performance for low radiation intensity, as illustrated in the following.
In particular, Table 1.1 reports the post-layout voltage drops (ΔV) on the pixel output
as a function of the sensitive node dimensions. A larger sensitive area
would in principle collect more charge, with an upper limit corresponding to the
Full Well Capacity (FWC). However, a larger area corresponds to a larger (parasitic)
capacitance, thus reducing the charge-to-voltage conversion factor. Following these
indications, we selected the Small Pixel with an active area of 0.5 × 0.5 µm², while for
the Large Pixel we selected the option with the maximum area coverage (Fig. 1.2).
Considering the Small Pixel, Tables 1.2 and 1.3 show that the voltage drop in
post-layout simulations tends to decrease with increasing transistor width (W) and length
(L). This is due to the contribution of the reset and source follower transistors to the
sensing node capacitance. According to this finding, the dimensions of
all the transistors within the pixel have been kept at the minimum values allowed by
the design rules (150/110). The transistors within
the Large Pixel have been kept at the minimum values allowed by the design rules as
well, since increasing their dimensions does not significantly affect the sensing
node capacitance, which is dominated by the large diode diffusion capacitance.
Table 1.2 Voltage drops versus width (W) and length (L) of M1 for the Small Pixel
W (nm)    ΔV (mV)      L (nm)    ΔV (mV)
150       190.63       110       190.63
300       134.17       220       153.75
450        98.35       330       149.35
Table 1.3 Voltage drops versus width (W) and length (L) of M2 for the Small Pixel
W (nm)    ΔV (mV)      L (nm)    ΔV (mV)
150       190.63       110       190.63
300       187.18       220       170.82
450       179.73       330       163.82
Finally, Table 1.4 reports the post-layout voltage drops as a function
of the radiative stimulus parameters, namely amplitude and duration of the resulting
current pulse (used as input for circuit-level simulation purposes). Data coming from
device simulations were exploited to characterize a compact model of the sensing
element: a junction diode was supplemented by a current generator describing a
radiation-induced current pulse as predicted by device simulations. The quantitative
effects of the increase of both pulse amplitude and width are reported in Fig. 1.4.
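As a rough cross-check of the trend, the incremental sensitivity of the Large Pixel output to the stimulus can be computed from the Table 1.4 values by finite differences. The sketch below is only illustrative: the data are transcribed from Table 1.4, while the helper name `incremental_slopes` is hypothetical and not part of the authors' simulation flow.

```python
# Values transcribed from Table 1.4 (Large Pixel, post-layout simulation).
amplitude_uA = [0.6, 1.2, 1.8, 2.4]
dv_vs_amplitude_mV = [10.7, 14.89, 19.11, 23.35]

duration_ns = [2.0, 4.0, 6.0]
dv_vs_duration_mV = [23.35, 37.36, 51.21]


def incremental_slopes(x, y):
    """Finite-difference slope between consecutive table rows."""
    return [(y1 - y0) / (x1 - x0)
            for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])]


slopes_amp = incremental_slopes(amplitude_uA, dv_vs_amplitude_mV)  # mV per µA
slopes_dur = incremental_slopes(duration_ns, dv_vs_duration_mV)    # mV per ns

print("mV/µA:", [round(s, 2) for s in slopes_amp])
print("mV/ns:", [round(s, 2) for s in slopes_dur])
```

Both sets of slopes stay close to 7 (mV/µA and mV/ns respectively), which suggests an almost linear response over the simulated range.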
Table 1.4 Voltage drops as a function of the amplitude and duration of the radiative stimulus for
the Large Pixel
Amplitude    ΔV (mV)      Duration (ns)    ΔV (mV)
600 nA       10.7         2                23.35
1.2 µA       14.89        4                37.36
1.8 µA       19.11        6                51.21
2.4 µA       23.35
Fig. 1.4 Voltage drops as a function of the radiative stimulus parameters for the Large Pixel
(amplitude on the left, duration on the right)
With reference to noise, it should be underlined that the pixel-reset noise (Nreset) is
determined by the thermal noise of the photodiode and is proportional to the inverse
of the capacitance seen at the photodiode node. Charge-integration noise (Nintegr) is
instead due to dark current and is approximately proportional to the inverse of the
square of the capacitance [2]. Total pixel noise (obtained from the root mean square
of reset and charge-integration noises) is expected to be in the order of a few mV.
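To give a feeling for the orders of magnitude involved, the sketch below evaluates the textbook kT/C expression for the reset noise (whose mean-square voltage is inversely proportional to the node capacitance) and combines two uncorrelated contributions in quadrature. This is an assumption-laden illustration: the chapter does not state which expressions were used, the 1 fF capacitance is a hypothetical value, and the function names are invented for this example.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23


def ktc_noise_volts(capacitance_f: float, temperature_k: float = 300.0) -> float:
    """RMS reset (kT/C) noise voltage for a given node capacitance."""
    return math.sqrt(BOLTZMANN_J_PER_K * temperature_k / capacitance_f)


def total_noise(n_reset: float, n_integr: float) -> float:
    """Quadrature sum of two uncorrelated noise contributions."""
    return math.hypot(n_reset, n_integr)


# Hypothetical sensing-node capacitance of 1 fF (not a value from the chapter).
n_reset = ktc_noise_volts(1e-15)
print(f"kT/C noise at 1 fF, 300 K: {n_reset * 1e3:.2f} mV")
```

For a capacitance in the femtofarad range the kT/C term alone is about 2 mV, which is compatible with the "few mV" total quoted above.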
A suitable test environment has been set up to validate the different features of the chip,
ranging from the stand-alone photodiode response to the test of small
matrices. This requires a dedicated sequence of test signals to be generated and
delivered to the chip, which has been devised using a standard Arduino Due board
based on a 32-bit ARM core microcontroller. A critical issue concerns the radiation
source to be used for testing purposes. To allow for optical tests, coverage of sensitive
areas with metal layers has been avoided in the chip design. A dedicated PCB has
also been designed, accounting for size constraints coming from the optical setup.
From the functional point of view, maximum flexibility has again been pursued,
accounting for both manual and automatic test procedures. All the control and I/O
signals can be generated either through on-board hardware circuitry or by means of
routines driving the test board from a PC. Test-board assembly has been
completed, and the actual tests are planned to be carried out in the coming months.
1.5 Conclusion
This work aimed at the validation of the basic performance of sensitive elements
integrated in a standard 110 nm LFoundry technology, conceived for CMOS Image Sensor
fabrication, for particle detection applications. The suitability of such an approach, in
particular the adoption of a standard CMOS substrate with optimized pixel layout, has
been verified. Results were very encouraging: a significant SNR, expressed in terms
of output voltage drop, has been obtained in post-layout simulation. A dedicated
PCB has also been designed and fabricated, and tests on the actual chip are ongoing.
2.1 Introduction
Silicon Photonics (SiPh) has become a viable technology for reducing the size, weight
and energy consumption of optical devices for short-reach optical interconnects.
All-silicon modulators are essential components for such communication links and are
currently being evaluated at the European Organization for Nuclear Research (CERN)
in order to assess their suitability for use in high energy physics (HEP) experiments.
Optical and electronic devices installed in the particle detection region have to ensure
high reliability under radiation exposure. Custom-made SiPh Mach-Zehnder modulators
(MZMs) have already been proven to tolerate radiation levels in line with those
expected for future particle physics experiments [1]. In the context of the upgrade of
CERN’s Large Hadron Collider (LHC) foreseen for 2026, the increase in beam luminosity
will cause a significant increase in data traffic, on the order of dozens of terabits
per second (Tb/s). The installation of optical transceivers with a few Gb/s read-out
capabilities will then be required [2]. This represents a data transfer speed roughly one
order of magnitude higher than the throughputs currently achievable with state-of-the-art
HEP front-end circuits, like those belonging to the RD53 project [3].
Photonic devices easily reach operational bandwidths above 10 GHz, but the
exploitation of these technologies in compact modules is possible only after a
careful design of the conditioning electronics that encodes a data stream
onto an optical carrier. The aim of this work is to design a full-custom electronic
integrated circuit (EIC) to operate with the MZM presented in [4], while withstanding
total ionizing doses (TID) up to 1 Grad and 1 MeV equivalent neutron
fluences on the order of a few 10¹⁶ cm⁻², as regards radiation damage from non-ionizing
energy losses (NIEL).
Section 2.2 introduces the MZM driver (MZMD) core structure and the main
circuit solutions which have been implemented to properly drive a traveling-wave
MZM. A purely electrical characterization of the driver performance in terms of
bandwidth, output voltage amplitude and bit error rate (BER) is reported in detail
in Sect. 2.3. The following section presents the overall system-level results and
describes the electro-optical setup implemented to perform BER measurements of a
hybrid transmitting unit made of an MZM driven by the developed MZMD.
Conclusions are drawn in Sect. 2.5, mentioning the further activities that are currently
ongoing towards the realization of a working prototype suitable for HEP
environments.
The electrical characterization of the driver was performed in terms of scattering
parameter measurements, eye diagram plots and BER tests. S-parameter measurements
were carried out by gluing the chip on a carrier board and contacting the chip pads
with RF and DC probes, as shown in Fig. 2.2.
Figure 2.3 shows the S21 and S11 parameters of the driver. The 3-dB S21 bandwidth
point is measured around 2.5 GHz, highlighting a potential application of the driver
to bit-rates up to 5 Gb/s. The blue line shows that the input matching network of the
driver works properly up to 4 GHz, above which the S11 parameter exceeds −10 dB.
Regarding eye diagrams and BER tests, the EIC was bonded on a custom-made
printed circuit board (PCB). Standard SMA coaxial cables were used to connect
the board to the instruments, while impedance-matched coplanar transmission lines
convey the signals on the PCB. A 12.5 Gb/s pulse pattern generator (PPG) was
used to generate a pseudo-random binary sequence (PRBS) following a PRBS-31
pattern, with voltage characteristics in compliance with standard CML levels. The
eye diagrams obtained by feeding the driver with this signal and measuring the output
waveforms with a 23 GHz-bandwidth oscilloscope are shown in Fig. 2.2. The two
eye diagrams present nearly the same amplitude, while higher noise and jitter appear
at 5 Gb/s.
A BER tester (BERT) was then used to understand the impact of jitter-related
penalties from a system-level viewpoint. Figure 2.3 reports the BER values for
different bit-rates. A plateau at 10⁻¹¹ is shown for data-rates up to 5 Gb/s, indicating
that no errors were registered out of 1 Tb of transmitted data. This confirms a
quasi error-free operation up to a bit-rate of 5 Gb/s.
Fig. 2.1 Circuit architecture of the MZMD. Qualitative topologies of the last two stages are sketched on the right
Fig. 2.2 Left: eye diagrams of driver output voltage at different bit-rates (1 Gb/s and 5 Gb/s). Right: picture of the
on-chip characterization setup
Fig. 2.3 MZMD circuit-level electrical characterization. Input-output S-parameters are reported
on the left, while in the middle BER performances are shown as a function of bit-rate [Gb/s]. On the right,
radiation-induced peak-to-peak voltage (Vpp) degradation is documented as a function of TID [Mrad]
The circuit radiation resistance was investigated by exposing the whole EIC to X-rays
with a dose rate of 4.3 Mrad/h at the INFN-Padova facility. The normalized
voltage amplitude degradation of the output signals with increasing dose level is
reported in Fig. 2.3. At 800 Mrad, which was the highest dose level reached during
the test because of limited testing time, the signal amplitude was reduced by 30%
with respect to the pre-irradiation value.
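The remark about limited testing time can be put in numbers with a simple division of the accumulated dose by the dose rate. The sketch below assumes a constant dose rate and uninterrupted exposure, which the text does not state explicitly; the function name is hypothetical.

```python
def exposure_hours(target_dose_mrad: float, dose_rate_mrad_per_h: float) -> float:
    """Hours of continuous irradiation needed to accumulate the target dose."""
    return target_dose_mrad / dose_rate_mrad_per_h


DOSE_RATE_MRAD_PER_H = 4.3  # dose rate quoted in the text

reached_h = exposure_hours(800.0, DOSE_RATE_MRAD_PER_H)   # dose reached in the test
target_h = exposure_hours(1000.0, DOSE_RATE_MRAD_PER_H)   # 1 Grad design target

print(f"800 Mrad: {reached_h:.0f} h ({reached_h / 24:.1f} days)")
print(f"1 Grad:   {target_h:.0f} h ({target_h / 24:.1f} days)")
```

Under these assumptions, 800 Mrad corresponds to roughly 186 hours of exposure, close to eight days, while the full 1 Grad target would need about 233 hours.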
Fig. 2.4 Electro-optical setup for system-level characterization of an OOK link. The legend distinguishes
coplanar TLs, wire bondings, single mode fibers and coaxial links. Acronyms: TBPF
(tunable band-pass filter), PC (polarization controller)
like those used within HEP experiments which cover 200 m at most. Hence, BER
performances have been evaluated as a function of bit-rate and received power on the
photo-detector (PD).
As shown in Fig. 2.4, a standard on-off keying (OOK) transmitting system has
been set up. The same PRBS-31 signal as before is applied at the driver input. A
tunable laser source (TLS) was used to provide light in the C band near 1550 nm.
The wavelength tuning allowed the MZM to be set at the quadrature point. Light is
coupled to the photonic integrated circuit (PIC) using pigtailed fiber arrays and
on-chip grating couplers. The
modulated optical signal is attenuated with a variable optical attenuator (VOA) and
then captured with a commercial PD, which is directly connected to the BERT.
Because of some issues encountered in the packaging procedure, which
was performed manually, the fiber arrays turned out to be slightly misaligned, causing
an increase in optical insertion losses compared to similar devices realized in the
same technology. Therefore, an erbium-doped fiber amplifier (EDFA) was required
to perform BER tests. Even when delivering the maximum rated optical power from the
TLS, the optical intensity at the MZM output was so low that an EDFA placed
downstream of the DUT failed to amplify the signal for photo-detection. The EDFA
was then positioned before the MZM in the optical path, resulting in an injected
power in the PIC of about 20 dBm and an OSNR of 26 dB. Nevertheless, non-linear
optical effects were not observed throughout the measurement routines.
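The optical power levels in this section are given in logarithmic units. The short sketch below converts between dBm and milliwatts and turns a dB ratio into a linear one; the helper names are hypothetical and serve only to make the quoted figures easier to interpret.

```python
import math


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (power_dbm / 10.0)


def mw_to_dbm(power_mw: float) -> float:
    """Convert a power level from milliwatts to dBm."""
    return 10.0 * math.log10(power_mw)


def db_to_linear(ratio_db: float) -> float:
    """Convert a power ratio from dB to a linear factor."""
    return 10.0 ** (ratio_db / 10.0)


print(f"Injected power: {dbm_to_mw(20.0):.0f} mW")      # 20 dBm quoted in the text
print(f"OSNR as linear ratio: {db_to_linear(26.0):.0f}")  # 26 dB quoted in the text
```

The 20 dBm injected into the PIC corresponds to 100 mW, a level at which the absence of observable non-linear effects is worth noting.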
Optical eye diagrams and measured BERs as a function of the input power Ppd on
the photo-detector are shown for different data rates in Figs. 2.5 and
2.6, respectively. The whole system works correctly up to a bit-rate of 1.5 Gb/s, while BER
floors start to appear around 1.7 Gb/s, suggesting a systematic failure of the system.
The eye diagrams at the PD output indeed show a sharp increase in jitter and
inter-symbol interference (ISI) as the bit-rate reaches the 1.7 Gb/s level. Although such
limited speed is in contrast with the previously presented BER performance
of the stand-alone driver, these unexpected results could be attributed to the
non-optimum arrangement of the wire bondings, which can be seen in Fig. 2.6.
Fig. 2.5 Eye diagrams at the PD output for different data rates: a 1.5 Gb/s, b 1.7 Gb/s, c 1.75 Gb/s.
All the plots have the same vertical scale of 20 mV/div
Fig. 2.6 BER performance (log BER versus Ppd [dBm]) of the data transmitting unit composed of the
designed driver and a MZM bonded together, for bit-rates of 500 Mb/s, 800 Mb/s, 1 Gb/s, 1.5 Gb/s,
1.7 Gb/s, 1.725 Gb/s and 1.750 Gb/s. The picture shows the MZMD, the MZM and the coplanar TLs.
Acronyms: TL (transmission line)
3.1 Introduction
the supply voltage is 1.2 V or even lower. For this reason, in the last ten years, the use
of nonstandard devices in place of BJTs or diodes has been proposed [1], but at the
cost of poor portability of the design and with the risks associated with the lack of
accurate models for nonstandard devices. The resistive subdivision technique has
been proposed to implement sub-1 V BGR circuits [2], although this technique is
not suitable for high-precision references working over a large temperature range. This
paper discusses a BGR architecture based on a commercial 65 nm CMOS technology
and capable of operating with 1.2 V supply. The proposed IP block has been designed
for operation in the harsh radiation environment of the High Luminosity LHC. The
65 nm CMOS technology chosen for this prototype has been tested up to 1 Grad with
promising results for CMOS transistors [3]. Nonetheless, other components of the
BGR, namely bipolar devices, are affected by bulk damage effects. For this reason, in
order to understand their behavior after irradiation, three different BGR versions (the
first one based on parasitic PNP bipolar transistors, the second based on pn diodes
and the third one based on enclosed-layout MOSFETs biased in the weak inversion
region) have been designed and submitted for fabrication in a prototype chip. These
circuits have been fabricated and characterized before and after irradiation up to
225 Mrad(SiO2), and the third design (the one based on MOSFETs) demonstrated
the best performance in terms of radiation hardness [4]. Based on this work, a voltage
reference circuit, designed in a commercial 65 nm CMOS technology and capable
of operating in harsh radiation environments up to 1 Grad, has been developed and
its characterization is shown in this paper.
The bandgap circuit described in this paper and shown in Fig. 3.1 is based on a
current-mode approach [1]. Two currents, one (I2b) proportional to absolute temperature
(PTAT) and one (I2a) complementary to absolute temperature (CTAT), are generated
and summed in order to obtain a voltage insensitive to temperature. As already
mentioned in the Introduction, with the purpose of increasing the radiation hardness
of the circuit, only MOSFET devices have been included in the circuit. In order to
obtain a behavior similar to that of a bipolar transistor, they have been biased in the weak
inversion region, where the I-V characteristic of the device is:
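$$I_D = I_0 \cdot \frac{W}{L} \cdot \exp\left(\frac{V_{GS} - V_{th}}{\eta V_T}\right) \cdot \left[1 - \exp\left(-\frac{V_{DS}}{V_T}\right)\right] \tag{3.1}$$
where the VDS dependence of the drain current can be neglected when VDS ≥ 4VT.
Since M1, M2 and M3 are equally sized, the BGR output value is given by:
$$V_{REF} = \frac{R_3}{R_1}\left(V_{GS1} + \frac{R_1}{R_2}\,\Delta V_{GS}\right). \tag{3.2}$$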
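The condition VDS ≥ 4VT attached to Eq. (3.1) can be checked numerically. The sketch below evaluates the weak-inversion drain current in its standard textbook form and the size of the VDS-dependent factor; the parameter values in the example call (I0, W/L, Vth, η) are hypothetical placeholders, not device data from the chapter.

```python
import math

K_BOLTZMANN = 1.380649e-23       # J/K
Q_ELECTRON = 1.602176634e-19     # C


def thermal_voltage(temperature_k: float = 300.0) -> float:
    """Thermal voltage VT = kT/q in volts."""
    return K_BOLTZMANN * temperature_k / Q_ELECTRON


def vds_factor(vds_over_vt: float) -> float:
    """The bracketed term of Eq. (3.1): 1 - exp(-VDS/VT)."""
    return 1.0 - math.exp(-vds_over_vt)


def weak_inversion_current(vgs, vds, *, i0, w_over_l, vth, eta, temperature_k=300.0):
    """Weak-inversion drain current, standard form of Eq. (3.1)."""
    vt = thermal_voltage(temperature_k)
    return i0 * w_over_l * math.exp((vgs - vth) / (eta * vt)) * vds_factor(vds / vt)


vt = thermal_voltage()
# Hypothetical device parameters, for illustration only.
params = dict(i0=1e-7, w_over_l=10.0, vth=0.4, eta=1.3)
i_at_4vt = weak_inversion_current(0.3, 4 * vt, **params)
i_saturated = weak_inversion_current(0.3, 1.0, **params)

print(f"VT at 300 K: {vt * 1e3:.2f} mV")
print(f"ID(4 VT) / ID(1 V): {i_at_4vt / i_saturated:.4f}")
```

At VDS = 4VT the bracketed factor is about 0.982, so neglecting it changes the current by less than 2%.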
Fig. 3.1 Schematic of the bandgap reference together with the startup circuit
Since the bandgap circuit has two stable operating points, it requires a start-up circuit
to prevent operation in the undesired one. Figure 3.1 shows the implemented startup
circuit [5]. It is based on a pull-down capacitor. During power-on, a current starts
to charge the capacitor C1; this current is mirrored by M11 and M12 and charges the
gate of M13, thus turning the transistor on. M13 pulls down the gate of the bandgap
current mirror, injecting current into the bandgap. The power consumption of the
startup circuit after power-on is zero because, after startup, M14 is turned on and M13
is cut off. Moreover, M12 discharges C1 when the power supply is switched off.
The proposed bandgap reference was fabricated in a commercial 65 nm CMOS
technology. The chip microphotograph is presented in Fig. 3.2 (left). Extensive ex-
Fig. 3.2 (Left) Die microphotograph (2 mm × 1 mm); (right) measured bandgap reference
voltage as a function of the temperature for different configuration bits of R2
Fig. 3.3 Measured output voltage as a function of the temperature (left); measured output voltage
of the bandgap as a function of the absorbed dose of 10 keV X-rays and after annealing (right)
output branch of the circuit, as implemented in [6]. The best measured temperature
coefficient (TC) of the bandgap reference is 16 ppm/°C in a range of −40 to 100 °C,
as shown in Fig. 3.3 (left).
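A temperature coefficient in ppm/°C is often derived with the so-called box method, dividing the total output spread by the mean value and the temperature span. The sketch below assumes that definition, which the text does not confirm, and uses hypothetical function names; it also converts the quoted 16 ppm/°C back into a total relative drift over the measured range.

```python
def box_tc_ppm_per_c(v_samples, t_min_c: float, t_max_c: float) -> float:
    """Box-method temperature coefficient in ppm/°C (assumed definition)."""
    v_mean = sum(v_samples) / len(v_samples)
    spread = max(v_samples) - min(v_samples)
    return spread / (v_mean * (t_max_c - t_min_c)) * 1e6


def total_drift_fraction(tc_ppm_per_c: float, t_min_c: float, t_max_c: float) -> float:
    """Relative output variation implied by a TC over a temperature range."""
    return tc_ppm_per_c * 1e-6 * (t_max_c - t_min_c)


# TC and temperature range quoted in the text.
drift = total_drift_fraction(16.0, -40.0, 100.0)
print(f"Total drift over the range: {drift:.3%}")
```

Under this definition, 16 ppm/°C over the 140 °C span corresponds to an overall output variation of about 0.22%.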
Irradiation tests were carried out taking into account the unprecedented radiation
tolerance requirements of demanding applications such as the HL-LHC [7]. To get
an estimate of the performance of the bandgap circuit, we irradiated one device
up to a total dose of about 1 Grad(SiO2) with 10 keV X-rays. The irradiation was done at
Laboratori Nazionali di Legnaro (Italy) with an X-ray machine at a dose rate of about
1 krad(SiO2)/s. During irradiation, the bandgaps were biased as in the real application.
Figure 3.3 (right) shows the variation of the output voltage as a function of the TID
for the BGR with N-MOSFET. Annealing after one week at room temperature shows
minor changes in the reference voltage with respect to the pre-irradiation value.
3.3 Conclusion
In this paper, a new radiation-hard bandgap voltage reference circuit has been
presented. The circuit has been characterized in a climatic chamber between −40 and
+100 °C and irradiated up to 1 Grad(SiO2), yielding a voltage change of up to 5% at
that total ionizing dose. The proposed BGR is able to withstand very high radiation
doses while keeping a reasonable output accuracy, a relatively small area, and a simple
architecture.
Acknowledgements The authors wish to thank Serena Mattiazzo and Devis Pantano (University
of Padova) for providing the source for X-ray irradiation and for their constant support during the
irradiation campaign, and Dr. Francesco De Canio for his contribution to the design and
characterization activity. The authors are also indebted to Massimo Rossella (INFN Pavia), who
kindly made the climatic chamber available for the bandgap characterization.
References
1. Banba H et al (1999) A CMOS bandgap reference circuit with sub-1-V operation. IEEE J Solid
State Circ 34:670
2. Neuteboom N, Kup BMJ, Janssens J (1997) A DSP-based hearing instrument IC. IEEE J Solid
State Circ 32:1790–1806
3. Menouni M et al (2015) 1-Grad total dose evaluation of 65 nm CMOS technology for the
HL-LHC upgrades. J Instrum 10(5), art. No. C05009
4. Traversi G et al (2016) Characterization of bandgap reference circuits designed for high energy
physics applications. Nucl Instrum Methods A 824:371–373
5. Li W, Yao R, Guo L (2009) A low power CMOS bandgap voltage reference with enhanced
power supply rejection. In: Proceedings of the 8th IEEE international conference on ASIC, pp
300–304
6. Vergine T, De Matteis M, Michelis S, Traversi G, De Canio F, Baschirotto A (2016) A 65 nm
rad-hard bandgap voltage reference for LHC environment. IEEE Trans Nucl Sci 63(3):1762–
1767
24 G. Traversi et al.
Abstract The paper presents a comparison between two VCO (Voltage Controlled
Oscillator) architectures designed in 65 nm CMOS for aerospace applications. In
particular, the two VCOs have been designed targeting the 6.25 GHz frequency required
by the SpaceFibre standard. The ring oscillator has been designed using three current
mode logic stages connected in a loop. Although its low area occupation is attractive,
process-variation simulations have demonstrated its
inability to generate the target frequency in harsh operating conditions. In contrast, the
LC-tank based oscillator, whose central frequency is fixed by the resonance of the L-C
tank, has shown an oscillation frequency that is less sensitive to
Process-Voltage-Temperature variations in simulation. Thanks to varactor-based voltage tuning control,
it is able to cover the range from 5.18 to 6.41 GHz. Both architectures are biased with
a supply voltage of 1.2 V. The complete layout of the latter solution has been designed
and its parasitics have been extracted for post-layout simulations. The achieved results are
attractive for addressing the requirements of the new SpaceFibre aerospace standard.
4.1 Introduction
Current trends in satellites show a rapid increase in data traffic and digital processing.
The throughput of next-generation digital telecom satellites will exceed terabits per
second of data, which have to be processed on board. For instance, high-resolution
cameras and synthetic aperture radars need high-speed communications between
the instruments and storage [1]. Optical technology, thanks to its high bandwidth-length
product, lightweight cabling and electromagnetic hardness, can potentially
be the solution for increasing the data rate in satellites. In this direction, the European
Space Agency (ESA) has recently released the new SpaceFibre standard for on-board
satellite communication up to 6.25 Gbps [2, 3]. The communication performance is
strongly related to the ability to synchronize the receiver and the transmitter, and a
key block for the synchronization is the Phase Locked Loop (PLL). The core block
inside the PLL that generates the required frequency is the Voltage Controlled
Oscillator (VCO). It should be able to generate a tone at 6.25 GHz and, like the whole
PLL system, be tolerant to SEE (Single Event Effects) and TID (Total Ionizing Dose) up to 300 krad [4].
In the literature, there are no examples of rad-hard VCOs
able to work at 6.25 GHz. In [5], a comparison between Ring Oscillator (RO) and
LC-Tank (LC) VCOs for PLLs was made for High Energy Physics (HEP) applications
at the Large Hadron Collider (LHC). Both were designed for a working frequency of
2.56 GHz and, after being exposed to irradiation, the LC oscillator showed a lower
frequency shift than the RO solution and a jitter value one order of magnitude
lower.
The goal of this work is to compare the performance of the widely used RO
and LC circuits in radiation environments and to contribute new approaches that
exploit the characteristics that have made these circuits the most widely implemented. For
a better comparison, both VCOs were designed using the same 65 nm commercial-grade
technology, which thanks to its thin gate oxide is considered a radiation-hard
technology [6, 7]. The design of the VCO based on the ring oscillator and that based on
the LC-Tank approach are presented in Sects. 2 and 3, respectively. Section 4 provides
the preliminary layout design and post-layout circuit performance results. Conclusions
are drawn in Sect. 5.
According to the Barkhausen oscillation criterion [8], the magnitude of the transfer function
has to be higher than one for the start-up condition and then equal to one to sustain
the oscillation, while the phase of the transfer function has to be an integer multiple of 2π.
Applying this criterion to the model in Fig. 4.1, we obtain the oscillation condition in
terms of design parameters, expressed in Eq. 4.2.
$$ g_m R \geq \frac{1}{\cos\theta} \qquad (4.2) $$
where θ is the phase shift introduced by each RC load, which for the Barkhausen
criterion has to be an integer multiple of π/N.
To limit the frequency variation due to the process technology and to
reduce area and power consumption, three stages were chosen for the
RO-VCO design. With this choice, in accordance with Eq. 4.2, the following condition
(Eq. 4.3) is obtained as the main design guideline.
$$ g_m R \geq 2 \qquad (4.3) $$
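The step from Eq. 4.2 to Eq. 4.3 can be written out explicitly. The short derivation below is only a sketch: it assumes that, for N = 3 stages, each RC load contributes the smallest admissible phase shift, θ = π/3, which is the usual small-signal picture of a three-stage ring and is not stated in this form in the text.

```latex
\theta = \frac{\pi}{N} = \frac{\pi}{3}
\quad\Longrightarrow\quad
\cos\theta = \frac{1}{2}
\quad\Longrightarrow\quad
g_m R \;\ge\; \frac{1}{\cos\theta} = 2 .
```

Under this assumption, each CML stage needs a low-frequency voltage gain of at least 2 (about 6 dB) for the loop to start oscillating.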
The designed RO-VCO is composed of three CML (Current Mode Logic) stages,
which thanks to their lower voltage swing and lower output impedance can
reach higher operating frequencies than the standard CMOS approach
[9]. Moreover, the differential structure provides higher immunity to
common-mode disturbances than a single-ended structure such as a CMOS circuit.
The single CML stage, shown in Fig. 4.2, consists of a differential pair amplifier
with a resistive load.
The oscillation frequency of the RO-VCO is expressed by the relation
$f_0 = 1/(2\pi RC)$, where R is the parallel combination of the pull-up CML resistive load and the
output MOSFET resistance, while C is the gate capacitance of the following stage.
To control the oscillation frequency, a pair of varactors was
added at the output of each stage. Accumulation-mode n-MOSFET devices were used to
implement the varactors: increasing or decreasing their gate voltage changes their
capacitance and thus shifts the oscillation frequency.
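As a rough illustration of the tuning mechanism, the sketch below evaluates the relation f0 = 1/(2πRC) quoted above for two varactor settings. All component values are hypothetical and were picked only so that the result brackets 6.25 GHz; the sketch also assumes that the varactor capacitance simply adds to the gate capacitance of the next stage.

```python
import math


def ro_frequency(r_ohm: float, c_farad: float) -> float:
    """Oscillation frequency from the relation f0 = 1 / (2*pi*R*C) in the text."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


# Hypothetical values (not taken from the paper).
R_LOAD = 500.0        # ohm, equivalent load resistance of one CML stage
C_GATE = 40e-15       # F, gate capacitance of the following stage
C_VAR_MIN = 8e-15     # F, varactor capacitance at one end of the control range
C_VAR_MAX = 14e-15    # F, varactor capacitance at the other end

# Assumption: the varactor is in parallel with the gate capacitance.
f_high = ro_frequency(R_LOAD, C_GATE + C_VAR_MIN)
f_low = ro_frequency(R_LOAD, C_GATE + C_VAR_MAX)

print(f"tuning range: {f_low / 1e9:.2f} GHz to {f_high / 1e9:.2f} GHz")
```

Because f0 scales with 1/(RC), any corner-dependent shift in R or C moves the whole tuning window, which is the sensitivity discussed next.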
The short channel length of the n-MOSFETs allows high-frequency performance to be achieved
but, on the other hand, this choice increases the deviation of the device parameters
from the typical condition. Despite the use of varactors for frequency tuning, the
frequency shift observed in the process corner simulations was so high that it cannot be
compensated using the control voltage.
Table 4.1 lists the oscillation frequency and the tuning range of the
RO-VCO for the three process corners. The reported frequency values are extracted from
schematic simulations performed with the minimum and the maximum values of
the varactor tuning voltage. The oscillation frequency in the slow-slow corner case
does not reach the 6.25 GHz frequency required by the SpaceFibre standard,
28 D. Monda et al.
Fig. 4.2 Schematic of the single stage of the ring oscillator and the couple of varactors connected
at the outputs
even using the maximum value of the control voltage. In the fast-fast corner case, the
frequency is higher than the target frequency even with the minimum value of the
control voltage. The RO-VCO frequency is therefore strongly dependent on the device
parameters, which makes this circuit unusable for this application.
To overcome the effects of device-parameter deviations on the oscillation
frequency, an LC-Tank VCO architecture was designed to be compliant with the
SpaceFibre protocol. This architecture bases its oscillation frequency on the filtering
4 Analysis and Comparison of Ring and LC-Tank Oscillators … 29
effect of an L-C tank, leaving to the active components only the role of setting the feedback
gain [10] and compensating for the loss of the inductor. Figure 4.3 shows the schematic
of the LC-VCO designed to generate the target 6.25 GHz frequency.
A poly-silicon resistor is used to shift the output common-mode level to V_DD/2,
preventing damage to, or lifetime reduction of, the low-voltage MOSFETs used for
the cross-coupled pair. This resistor is connected to the center tap of a symmetrical
inductor, chosen because it occupies less layout area than two separate inductors. To
achieve the best frequency performance available in this technology, the cross-coupled pair is
sized using minimum-length MOSFETs with a width of 3.6 μm, which guarantees a cell
gain of at least 6 dB for the start-up condition. The design guideline that satisfies the Barkhausen
oscillation criterion is $g_m > 1/R_p$, where $g_m$ is the transconductance of the
n-MOSFETs inside the cross-coupled cell and $R_p$ is the parasitic resistance of the
inductor [11]. The oscillation frequency of the LC-VCO is set by $f_0 = 1/(2\pi\sqrt{LC})$,
which makes it possible to tune the central frequency with two varactors connected
at the LC output, using a control voltage in the range 0 V to V_DD.
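The resonance relation also indicates how much the tank capacitance has to change to cover a given band. The sketch below works this out for the 5.18 to 6.41 GHz range reported for this design; the inductance value is purely hypothetical, and the function names are illustrative.

```python
import math


def lc_frequency(l_henry: float, c_farad: float) -> float:
    """Resonance frequency f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def capacitance_for(l_henry: float, f_hz: float) -> float:
    """Tank capacitance that resonates with L at frequency f (inverse of the above)."""
    return 1.0 / (l_henry * (2.0 * math.pi * f_hz) ** 2)


L_TANK = 500e-12               # H, hypothetical inductance (not from the paper)
F_MIN, F_MAX = 5.18e9, 6.41e9  # Hz, tuning range reported in the text

c_max = capacitance_for(L_TANK, F_MIN)   # largest capacitance -> lowest frequency
c_min = capacitance_for(L_TANK, F_MAX)   # smallest capacitance -> highest frequency
ratio = c_max / c_min                    # equals (F_MAX / F_MIN)**2, whatever L is

print(f"C_max = {c_max * 1e12:.2f} pF, C_min = {c_min * 1e12:.2f} pF, ratio = {ratio:.3f}")
```

The required ratio of total tank capacitance, about 1.53, does not depend on the assumed inductance; fixed parasitic capacitance in parallel with the varactors would raise the ratio the varactors themselves must provide.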
Figure 4.4 shows the frequency response of the VCO for the two extreme values
of the control voltage, highlighting a tuning range of 1.23 GHz. Moreover, Fig. 4.4
shows a minimum cell gain of about 10 dB at the minimum value of the control
voltage, which ensures a robust start-up condition for the oscillator.
Fig. 4.4 Frequency response of the LC-VCO for a control voltage equal to 0 V (red line) and to
1.2 V (yellow line); dotted lines represent the phase for the minimum and maximum values of the control
voltage, respectively
Corner simulations were performed by varying the production process,
temperature and supply voltage. The SpaceFibre standard requires the system to work
properly under harsh conditions. In particular, the system was tested for temperature
variations in the range −55 to 125 °C, for the fast, slow and typical process corners, and for ±10%
deviations of the supply voltage and of the bias current.
The layout of the VCO is shown in Fig. 4.5, where about 85% of the total area is
occupied by the inductor. All layout choices were made to
reduce the parasitic resistance and to guarantee good matching of the simple current
mirror and of the cross-coupled cell. A high parasitic resistance leads to gain degradation
and a weak start-up condition. For the simple current mirror, the two MOSFETs that
implement the diode-connected device were placed at the center of the other ten MOSFETs.
The spacing between the devices is the minimum allowed by the technology and the
Design Rule Check (DRC), which helps to minimize device mismatch.
Fig. 4.5 Layout of the LC-VCO. From left to right: the poly resistor, the inductor, the
differential pair, the varactors and the tail current mirror
Post-layout simulations show a tuning range of the LC-VCO from 5.18 to 6.41 GHz
in the worst condition, highlighting the capability of this VCO to be used in the
SpaceFibre communication protocol.
electrically tested in standard conditions and will be exposed to X-rays to reach the
300 krad TID and to heavy ions for SEE characterization.
References
1. Xie L, Wei L (2013) Research on vehicle detection in high resolution satellite images. In: IEEE
fourth global congress on intelligent systems
2. ESA Requirements and Standards Division ESTEC, P.O. Box 299, 2200 AG Noordwijk The
Netherlands. Space engineering, SpaceFibre—very high-speed serial link. European Space
Agency for the members of ECSS, 2019
3. Parkes S, Ferrer A et al (2017) SpaceFibre specification draft K1. Copyright 2017, University
of Dundee
4. Ciarpi G, Magazzù G et al (2018) Design of radiation-hard MZM drivers. In: 20th Italian
national conference on photonic technologies (Fotonica 2018), vol 26, pp 1–4
5. Prinzie J, Christiansen J et al (2017) Comparison of a 65 nm CMOS ring- and LC-oscillator
based PLL in terms of TID and SEU sensitivity. IEEE Trans Nucl Sci 64(1):245–252
6. Ciarpi G, Saponara S et al (2019) Radiation hardness by design techniques for 1 grad TID rad-
hard system in 65 nm standard CMOS technologies. In: Application in electronics pervading
industry, environment and society, pp 269–276
7. Palla F, Ciarpi G et al (2019) Design of a high radiation-hard driver for Mach-Zehnder Mod-
ulators based high-speed links for hadron collider applications. Nucl Instrum Methods Phys
Res Sect A 936:303–304
8. Voinigescu S (2013) High-frequency integrated circuits. Cambridge University Press
9. Heydari P (2003) Design and analysis of low-voltage current-mode logic buffers. In: Fourth
international symposium on quality electronic design. IEEE
10. Razavi B (1996) A study of phase noise in CMOS oscillators. IEEE J Solid State Circ 31(3):331–
343
11. Razavi B (1998) RF microelectronics, vol 1. Prentice Hall, Upper Saddle River, NJ
Chapter 5
A Compact Gated Integrator
for Conditioning Pulsed Analog Signals
Abstract An extremely compact gated integrator prototype has been realized and
preliminarily characterized. The front-end section of the circuit is based on the high
precision integrator IVC102, whereas the analog-to-digital conversion and data
acquisition, as well as the timing control, are performed by an LPC845
microcontroller. The system synchronizes signal detection with an external trigger generated
in coincidence with the source pulse, i.e. the gated integrator amplifies the signal
only when a pulse is generated, significantly increasing the signal-to-noise ratio.
As a consequence, the proposed circuitry represents an affordable, sensitive,
and cost-effective alternative to the continuous-time measurement technique
largely adopted, for example, in radiation dosimetry.
5.1 Introduction
Figure 5.1 shows the schematic of the proposed gated-integrator circuitry. The
front-end section is based on the commercially available switched integrator
transimpedance amplifier IVC102 (by Texas Instruments) and an inverting amplifier stage
5 A Compact Gated Integrator for Conditioning Pulsed … 35
useful to establish, at reset (V_O = 0 V), an ADC input voltage around 0.7 V. The
read-out section is based on the microcontroller LPC845 (by NXP) equipped with
an ARM Cortex-M0+ processor. The IVC102 chip integrates high quality metal/oxide
capacitors characterized by low leakage, excellent dielectric characteristics (typical
non-linearity of ±0.005%) and temperature stability (±25 ppm/°C) [15]. The
IVC102 output voltage, which is proportional to the integrated input charge
provided by the detector, is digitally converted by the 12-bit successive approximation
A/D converter embedded in the LPC845 microcontroller.
The measurement cycle starts by resetting the integrator output to 0 V (closing
the internal switch S2). Integration begins when S2 is opened and the charge is
transferred to the integration capacitor by closing switch S1. A dual power supply
voltage of V_CC = ±15 V was used for the IVC102, whereas the microcontroller unit,
and hence its internal ADC, is supplied at V_DD = 3.3 V. The timing control circuitry
of the system uses the State Configurable Timer (SCT) integrated in the LPC845
microcontroller to generate the timing signals for the IVC102 S1 and S2
MOS switches, synchronized to the external sync signal. Figure 5.2 shows an example
of the S1 and S2 control signals generated by the realized prototype, as well as the voltage
at the ADC input obtained by leaving the IVC102 input floating. The example reported in
Fig. 5.2 highlights the case in which an integration period T_INT (S2 open, S1 closed)
is located across the rising edge of the synchronism signal. It is worth noting that,
to null any error induced by charge transferred to the integrator input during switch
commutations, signal acquisition is performed in two phases, before (pre-hold) and
after (hold) the T_INT period. As shown in Fig. 5.2, two opposite voltage steps ΔV_Q
are found at the start and at the end of the integration period. Therefore, the net
contribution of offset charge injection becomes insignificant if the integration result
is measured as the voltage difference V_B − V_A.
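The cancellation obtained with the V_B − V_A difference can be illustrated numerically. The sketch below is a simplified model under stated assumptions: the two charge-injection steps are exactly equal and opposite, the amplifier stage between the IVC102 and the ADC is treated as unity gain, and the step size and signal amplitude are hypothetical. Only the 10 pF capacitor and the 0.7 V reset level come from the text.

```python
C_INT = 10e-12     # F, internal integration capacitor used in the text
V_RESET = 0.7      # V, ADC input level at reset, as quoted in the text
DV_Q = 0.003       # V, hypothetical charge-injection step caused by S1 switching
V_SIGNAL = 0.100   # V, hypothetical voltage change produced by the integrated charge

# Pre-hold phase: sample taken before S1 closes.
v_a = V_RESET

# Integration phase: closing S1 adds the +DV_Q step, then the signal is integrated.
v_integrating = v_a + DV_Q + V_SIGNAL

# Hold phase: opening S1 adds the opposite step, -DV_Q, and V_B is sampled.
v_b = v_integrating - DV_Q

# Differencing removes both the reset level and the switching steps.
charge = C_INT * (v_b - v_a)

print(f"V_B - V_A = {v_b - v_a:.3f} V -> Q = {charge * 1e12:.2f} pC")
```

A single sample taken during integration would carry both the 0.7 V reset level and the ΔV_Q step; the difference of the two hold-phase samples carries neither.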
The SCT is a tool that can perform advanced timing and control operations with
little or no CPU intervention. It allows comparing the timer-counter value with a
match register content, as well as storing the current timer value in capture registers
when certain conditions/events occur. Moreover, it supports distinct user-defined
Fig. 5.1 Schematic of the proposed circuitry based on IVC102 integrator and LPC845
microcontroller (signal labels in the schematic: SYNC to SCT_IN0, HOLD from SCT_OUT1, RST from SCT_OUT2, +HV, 10p)
36 S. Pettinato et al.
Fig. 5.2 Signal acquisition is performed in two phases, “pre-hold” and “hold”, before and after
the integration period, respectively, in order to null errors due to charge transfer during S1 switch
commutations. V_ADC continuous (green) and dotted (red) lines refer to absence and presence of
input current, respectively
Fig. 5.3 Timing for S1 and S2 IVC102 switches performed by the SCT embedded in the LPC845
microcontroller
Fig. 5.4 ADC output as a function of injected charge packets (see formula) achieved by a pulsed
voltage source (Agilent 33220A) coupled to the RC network reported in the inset (inset labels: V_P,
R 18M, C1 33p, C2 10p, I_OUT to IVC102). On the right, the
error over the full scale for the investigated input-charge range
is captured and the timer restarts. After the restart, when the timer counter reaches
the Match[1] (Match[2]) value, event EV1 (EV2) is generated. Match[1…4] values
are user defined (in our case 100 and 150 µs, see Fig. 5.2). Events EV1 and EV2
drive the transitions of the S1 and S2 control signals and determine the hold-phase start
and end times, respectively. During this period, the ADC acquires the V_B voltage
amplitude.
Since it is a measure of the time period T between two pulses, the timer value
captured on the EV0 event allows the match values Match[3] and
Match[4] to be calculated: the former, T − 150 µs, represents the pre-hold start time; the latter,
T − 100 µs, the pre-hold end time (see Fig. 5.2). This pre-hold period is
used to acquire V_A for the next input pulse, so that the
proper V_B − V_A amplitude can be calculated (here V_B represents the quantity acquired after the next
EV0 event).
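The match-value arithmetic described above can be summarised in a few lines. This is only a sketch of the calculation: the function and key names are hypothetical, it does not use any LPC845 SDK call, and the 2000 µs period corresponds to the 500 Hz repetition rate used in the characterization.

```python
def match_values_us(period_us: float,
                    hold_start_us: float = 100.0,
                    hold_end_us: float = 150.0) -> dict:
    """Return the four match values (in microseconds) for a measured pulse period.

    Match[1], Match[2]: hold-phase start and end, counted from the sync event.
    Match[3], Match[4]: pre-hold start and end, placed before the next pulse.
    """
    # The pre-hold window must start after the hold window has ended.
    if period_us - hold_end_us <= hold_end_us:
        raise ValueError("pulse period too short for the chosen hold window")
    return {
        "match1": hold_start_us,
        "match2": hold_end_us,
        "match3": period_us - hold_end_us,    # T - 150 us: pre-hold start
        "match4": period_us - hold_start_us,  # T - 100 us: pre-hold end
    }


values = match_values_us(2000.0)  # 500 Hz repetition rate -> T = 2000 us
print(values)
```

With these numbers the pre-hold window and the hold window are both 50 µs wide and sit symmetrically 100 µs before and after the sync event.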
A preliminary characterization was performed in the lowest measurement range,
using the 10 pF internal capacitor of the IVC102, in order to evaluate the capability of the circuit
to acquire typical charge packets generated by a detector irradiated by a pulsed source.
The data of Fig. 5.4 refer to mean values of N = 512 pulse acquisitions. Pulsed signals
were emulated with an Agilent 33220A function generator, providing voltage pulses
with amplitude in the 100 mV to 10 V range, 50 µs duration, and 500 Hz repetition
rate. The function generator output was coupled to an RC network (see the inset of
Fig. 5.4) to emulate charge packets in the 0.1 to 10 pC range generated by a detector
having an equivalent 10 pF capacitance.
As can be seen from the best fit of the experimental data shown in Fig. 5.4, the system
shows excellent linearity in the investigated range of charge
packets. The relative error, calculated with respect to the nominal expected values, is
within ±0.2%, and less than 0.04% for an input charge around 1 pC. It is also worth
mentioning a sensitivity of about 40 fC, estimated from the measured peak-to-peak output noise
amplitude, which is lower than 4 mV at the IVC102 output.
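The quoted sensitivity follows directly from the integrator relation Q = C·V. The line below is a sketch of that estimate, assuming the 10 pF internal capacitor used in this measurement range and taking the 4 mV peak-to-peak noise as the smallest resolvable output step; the symbols are introduced here for illustration.

```latex
Q_{\min} \;\approx\; C_{\mathrm{INT}} \cdot V_{n,\mathrm{pp}}
\;=\; 10\ \mathrm{pF} \times 4\ \mathrm{mV}
\;=\; 40\ \mathrm{fC}
```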
5.3 Conclusions
Acknowledgements The authors would like to thank Marco Pacilli and Fabrizio Imperiali for
fruitful discussions and technical support.
References
1. Bucciolini M et al (2003) Diamond detector versus silicon diode and ion chamber in photon
beams of different energy and field size. Med Phys 30(8):2149–2154
2. Tromson D et al (2010) Single crystal CVD diamond detector for high resolution dose
measurement for IMRT and novel radiation therapy needs. Diam Relat Mater 19:1012–1016
3. Marsolat F et al (2013) Diamond dosimeter for small beam stereotactic radiotherapy. Diam
Relat Mater 33:63–70
4. Girolami M et al (2012) Diamond detectors for UV and X-ray source imaging. IEEE Electron
Device Lett 33:224–226
5. Girolami M et al (2012) Optimization of X-ray beam profilers based on CVD diamond detectors.
J Instrum 7:C11005
6. Conte G et al (2007) X-ray diamond detectors with energy resolution. Appl Phys Lett 91:183515
7. Salvatori S et al (2017) Nano-carbon pixels array for ionizing particles monitoring. Diam
Relat Mater 73:132–136
8. Pacilli M et al (2012) Polycrystalline CVD diamond pixel array detector for nuclear particles
monitoring. J Instrum 8:C02043
9. Muraro A et al (2016) First neutron spectroscopy measurements with a pixelated diamond
detector at JET. Rev Sci Instrum 87:11D833
10. D’Antonio E et al (2018) High precision integrator for CVD-diamond detectors for dosi-
metric applications. In: 2018 IEEE international symposium on medical measurements and
applications (MeMeA), pp 1–6
11. See for example https://www.ptwdosimetry.com/en/products/unidos-webline/
12. Khan FM, Gibbons JP (2014) Khan’s the physics of radiation therapy. Lippincott Williams &
Wilkins, Philadelphia
13. Reichert J, Townsend J (1964) Gated integrator for repetitive signals. Rev Sci Instrum 35:1692–
1697
14. Betts J (1970) Signal processing, modulation and noise. English Universities Press, London
15. Collier JL et al (1996) A low-cost gated integrator boxcar averager. Meas Sci Technol 7:1204
16. See for example Clinac® iX System, Varian Medical Systems, Inc., CA (USA).
https://www.varian.com/oncology/products/treatment-delivery/clinac-ix-system
17. Precision switched integrator transimpedance amplifier, IVC102, datasheet, Texas Instruments.
www.ti.com/lit/ds/symlink/ivc102.pdf
18. Salvatori S et al (2006) Compact front-end electronics for low-level current sensor measure-
ments. Electron Lett 42:682–684
Part II
Internet of Things
Chapter 6
Multivariate Microaggregation with
Fixed Group Size Based on the Travelling
Salesman Problem
Abstract Due to the growing use of IoT and 5G technologies, data are collected at
an unprecedented pace. These data are used to improve decision-making processes.
However, they could endanger individuals' privacy, which is protected by
international regulations. In this article, we propose a privacy-preserving microaggregation
technique, inspired by the Travelling Salesman Problem (TSP), to protect individuals'
privacy through k-anonymity. We recall the basics of microaggregation and the TSP, and
we describe the algorithm behind our approach. We also report experiments with
real benchmark data sets showing that our approach outperforms current methods
for low cardinality values.
6.1 Introduction
6.2 Background
Microdata are data belonging to individuals, and they consist of several attributes
with a diversity of features. Microaggregation is a family of perturbation-based
statistical disclosure control (SDC) methods originally designed to protect continuous
numerical microdata. Formally, microaggregation can be defined as follows:
Consider a microdata set D with p continuous numerical attributes and n records
(i.e., the result of observing p attributes on n individuals). Groups (also called
subsets) of D are formed with $n_i$ records in the ith group ($n_i \geq k$ and $n = \sum_{i=1}^{g} n_i$),
where g is the number of resulting groups, and k a cardinality constraint. Optimal
microaggregation is defined as the one yielding a k-partition maximizing the
within-groups homogeneity. The sum of squares criterion is commonly used for measuring
the homogeneity in each group. In terms of sums of squares, maximising
within-groups homogeneity is equivalent to finding a k-partition minimizing the
within-groups sum of squares (SSE) [8] defined as:
6 Multivariate Microaggregation with Fixed Group Size … 45
$$ SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{i,j} - \hat{x}_i)'(x_{i,j} - \hat{x}_i), \qquad (6.1) $$
where $x_{i,j}$ is the jth record in group i, and $\hat{x}_i$ is the average record of group i. The
total sum of squares (SST), an upper bound on the partitioning information loss, is
computed as if only a single group exists, as follows:
$$ SST = \sum_{i=1}^{n} (x_i - \hat{x})'(x_i - \hat{x}), \qquad (6.2) $$
where $x_i$ is the ith record in D and $\hat{x}$ is the average record of D. Note that all the
above equations use vector notation, so $x_i$ is a vector belonging to $\mathbb{R}^p$.
The microaggregation problem consists in finding a k-partition with minimum
SSE, that is, the set of disjoint subsets of D such that $D = \bigcup_{m=1}^{g} s_m$, where $s_m$ is the
mth subset and g is the number of subsets, with minimum SSE (it is easy to see
that the cardinality of the groups in the optimal k-partition must lie between k and
2k − 1). A normalised measure of information loss, $L = SSE/SST$, is typically used (i.e.,
0 ≤ L ≤ 1). Optimal microaggregation is an NP-hard problem [2] for multivariate
data and it requires heuristic approaches, which can be divided into two main families:
– Fixed-size microaggregation: These heuristics yield k-partitions where
all subsets/groups have size k, except perhaps one group which has size between
k and 2k − 1, when the total number of records is not divisible by k.
– Variable-size microaggregation: These heuristics yield k-partitions where all
groups have sizes between k and 2k − 1. The challenge is how to enforce cardinality
constraints on groups without substantially increasing SSE.
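The definitions of SSE, SST and the normalised loss L can be checked on a toy partition. The sketch below uses plain Python lists and a hypothetical four-record data set with k = 2; the function names are illustrative and do not come from the paper.

```python
from typing import List, Sequence

Record = Sequence[float]


def centroid(records: Sequence[Record]) -> List[float]:
    """Average record (attribute-wise mean) of a group."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]


def squared_distance(a: Record, b: Record) -> float:
    """(a - b)'(a - b) for two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(groups: Sequence[Sequence[Record]]) -> float:
    """Within-groups sum of squares, Eq. 6.1."""
    total = 0.0
    for group in groups:
        c = centroid(group)
        total += sum(squared_distance(x, c) for x in group)
    return total


def sst(records: Sequence[Record]) -> float:
    """Total sum of squares, Eq. 6.2 (all records treated as one group)."""
    c = centroid(records)
    return sum(squared_distance(x, c) for x in records)


def information_loss(groups: Sequence[Sequence[Record]]) -> float:
    """Normalised information loss L = SSE / SST."""
    all_records = [x for group in groups for x in group]
    return sse(groups) / sst(all_records)


# Hypothetical 2-partition of four two-attribute records.
groups = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loss = information_loss(groups)
print(f"SSE = {sse(groups)}, L = {loss:.4f}")
```

In this example the groups are tight relative to the spread of the whole data set, so L is close to 0; a partition that mixed records from the two clusters would push L towards 1.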
In this section, we briefly recall the TSP by summarizing two of its most important
formulations [3, 5]. First, we describe the TSP as a permutation problem and, next,
we formulate it as a graph theoretic problem.
– Combinatorial optimization formulation: Given a set of cities, the goal is to find
the shortest tour that visits each city exactly once and then returns to the starting
city. Formally, the TSP can be stated as follows: The distances between n cities
are stored in a distance matrix D with elements $d_{i,j}$, where i, j = 1, . . . , n and the
diagonal elements $d_{i,i}$ are zero. A tour can be represented by a cyclic permutation
π of {1, 2, . . . , n}, where π(i) represents the city that follows city i on the tour.
Therefore, the TSP is reduced to finding a permutation π that minimizes the length
of the tour $L = \sum_{i=1}^{n} d_{i,\pi(i)}$. Following a brute-force approach, the tour lengths of
(n − 1)! permutation vectors have to be compared, and it is known to be an
NP-complete problem [5].
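The brute-force approach mentioned above can be sketched in a few lines. The distance matrix is a hypothetical four-city example, and for convenience the tour is stored as the visiting order starting from city 0 rather than as the successor permutation π used in the text; the two representations describe the same tours.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def tour_length(dist: Sequence[Sequence[float]], tour: Sequence[int]) -> float:
    """Length of the closed tour, including the edge back to the starting city."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist: Sequence[Sequence[float]]) -> Tuple[Tuple[int, ...], float]:
    """Compare all (n - 1)! tours that start at city 0 and keep the shortest."""
    n = len(dist)
    best_tour: Tuple[int, ...] = tuple(range(n))
    best_length = float("inf")
    for rest in permutations(range(1, n)):
        tour = (0,) + rest
        length = tour_length(dist, tour)
        if length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length


# Hypothetical symmetric distance matrix with a zero diagonal.
DIST: List[List[float]] = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]

best_tour, best_length = brute_force_tsp(DIST)
print(best_tour, best_length)
```

The factorial growth of the loop is what makes heuristics necessary for data sets with hundreds of records.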
46 A. Maya López and A. Solanas
1. For each starting city s, find a Hamiltonian path $H_{path}(s)$ traversing all n points
in the dataset D with the minimum possible length, starting in city s. Let $\pi_{H_{path}(s)}$
be the permutation of {1, . . . , n} expressing the order in which the points are
traversed by $H_{path}(s)$.
– At the end of this iteration, we have a set of n Hamiltonian paths, one
starting from each city (record) in the data set. That is, we have n permutations
$\pi_{H_{path}(i)}$, ∀i ∈ [1, n].
2. From the set $\pi_{H_{path}(i)}$, ∀i ∈ [1, n], build a “neighbourhood matrix” R, a
square n × n matrix whose elements $r_{ij}$ represent the number of times nodes
i and j have been found (in all permutations) k − 1 or fewer edges away from
each other.
– To build R, we slide a window of k elements over each
position of each permutation. Note that high values of $r_{ij}$ indicate higher
chances for i and j to be clustered together.
3. Given the aforementioned matrix R, generate clusters/groups of cities/records
of size k: The group generation starts by finding the maximum value $r_{ij} \in R$,
and assigning elements i and j to the first group. Next, the maximum value
$\max(r_{i,p}, r_{q,j}) \in R$, ∀p, q ∈ [1, n] | (p ≠ j, q ≠ i), is found and element p or q,
as appropriate, is added to the group. This procedure is repeated (k − 2) times to
create each group. Groups are created following the same procedure until
no unassigned elements remain in D. As a result, a k-partition of D is obtained.
4. Finally, to obtain a microaggregated data set D′ from D, compute the centroid
(i.e., the average vector) of each group in the k-partition and replace each record
$x_i$ in D by the centroid $\hat{x}_g$ of the group g to which it belongs.
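Step 2 of the procedure above can be sketched as follows. This is one possible reading rather than the authors' implementation: the sketch counts one hit for every window position that contains both records, so records that are adjacent on a path score higher than records that are k − 1 edges apart. Whether hits are counted per window or once per permutation is an assumption here, and indices are 0-based.

```python
from itertools import combinations
from typing import List, Sequence


def neighbourhood_matrix(paths: Sequence[Sequence[int]], k: int) -> List[List[int]]:
    """Build R by sliding a window of k elements over every Hamiltonian path.

    paths: visiting orders (permutations of 0..n-1), one per starting record.
    R[i][j] counts the windows, over all paths, that contain both i and j.
    """
    n = len(paths[0])
    r = [[0] * n for _ in range(n)]
    for path in paths:
        # A path is open, so the window does not wrap around its end.
        for start in range(n - k + 1):
            window = path[start:start + k]
            for a, b in combinations(window, 2):
                r[a][b] += 1
                r[b][a] += 1
    return r


# Hypothetical example: a single path over four records, window size k = 3.
R = neighbourhood_matrix([[0, 1, 2, 3]], k=3)
for row in R:
    print(row)
```

In the example, records 1 and 2 share both windows and receive the highest count, while records 0 and 3 are never within k − 1 edges of each other and stay at zero; the grouping in step 3 would start from the largest entry.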
To validate our intuition that TSP heuristics could be used to find good
microaggregation solutions, we have compared our approach with two well-known
and well-performing microaggregation algorithms (i.e., Maximum Distance to
Average Vector (MDAV) and Variable-MDAV [8]) over two real microdata sets that
are frequently used in the literature as benchmarks (i.e., Census and Tarragona [2]).
Census contains 1080 records with 13 numerical attributes and Tarragona has 834
records with 13 numerical attributes.
Our method is a fixed-size microaggregation heuristic. Therefore, to study the
information loss for several group sizes, we have varied k over the values {3, 4, 5, 10},
which are the typical values used by statistical agencies, and we compared the
results with those obtained by MDAV and V-MDAV for the same values of k. The
results are shown in Table 6.1.
Table 6.1 Information loss obtained by MDAV, V-MDAV and our method (MF-TSP)
Dataset     Method    k = 3    k = 4    k = 5    k = 10
Census      MDAV       5.66     7.51     9.01    14.07
Census      V-MDAV     5.69     7.52     8.98    14.07
Census      MF-TSP     5.30     8.47    10.01    17.01
Tarragona   MDAV      16.96    19.70    22.88    33.26
Tarragona   V-MDAV    16.96    19.70    22.88    33.26
Tarragona   MF-TSP    15.45    18.86    24.90    37.19
It can be observed that our approach performs better than MDAV and V-MDAV
for k = 3 in both Census and Tarragona, and for k = 4 in Tarragona. In a nutshell, we
have an initial indication that our method could lead to better solutions for small
values of k, while it yields worse results for larger cardinalities.
6.5 Conclusion
The deployment of IoT and 5G technologies opens the door to the collection of large
amounts of data used to obtain information and make better decisions on business,
healthcare [9], transportation, etc. Despite its utility, analysing huge amounts of data
could jeopardise individuals' privacy, and current regulations mandate companies to
put in place the right measures to guarantee it. With this aim, we
have proposed a new fixed-size multivariate microaggregation method, inspired by
the heuristic solutions of the Travelling Salesman Problem, that helps to guarantee
individuals' privacy through k-anonymity.
After introducing the basics of microaggregation and the TSP, we have described
our algorithm and we have empirically shown that it performs better than
off-the-shelf, well-known microaggregation methods for low cardinalities over benchmark
data sets frequently used in the literature. Our proposal represents the first step
towards the creation of a more solid TSP-based microaggregation algorithm that
would outperform current methods, not only for small cardinalities but for any k
as well, and it opens the door to a fruitful research line in the field of SDC. As
further work, we plan to improve our clustering algorithm over Hamiltonian path
permutations and test alternative TSP heuristics.
Acknowledgements The authors are supported by the Government of Catalonia (GC) with grant
2017-DI-002. A. Solanas is supported by the GC with project 2017-SGR-896, and by Fundació
PuntCAT with the Vinton Cerf Distinction, and by the Spanish Ministry of Science & Technology
with project RTI2018-095499-B-C32.
References
1. Batista E, Solanas A (2018) Process mining in healthcare: a systematic review. In: 9th Inter-
national conference on information, intelligence, systems and applications. IEEE, pp 1–6
2. Domingo-Ferrer J, Sebé F, Solanas A (2008) A polynomial-time approximation to optimal
multivariate microaggregation. Comput Math Appl 55(4):714–732
3. Hahsler M, Hornik K (2007) TSP—infrastructure for the traveling salesperson problem. J Stat
Softw 23(2):1–21
4. Johnson O, Liu J (2006) A traveling salesman approach for predicting protein functions. Source
Code Biol Med 1:3
5. Liao YF, Yau DH, Chen CL (2012) Evolutionary algorithm to traveling salesman problems.
Comput Math Appl 64(5):788–797
6. Samarati P (2001) Protecting respondents identities in microdata release. IEEE Trans Knowl
Data Eng 13(6):1010–1027
7. Solanas A, Casino F, Batista E, Rallo R (2017) Trends and challenges in smart healthcare
research: a journey from data to wisdom. In: 3rd IEEE international forum on research and
technologies for society and industry. Modena, Italy, pp 1–6
8. Solanas A, Martinez A (2006) VMDAV: a multivariate microaggregation with variable group
size. In: 17th COMPSTAT symposium of the IASC, Rome, pp 917–925
9. Solanas A, Patsakis C, Conti M, Vlachos IS, Ramos V, Falcone F, Postolache O, Pérez-Martínez
PA, Di Pietro R, Perrea DN, Martínez-Ballesté A (2014) Smart health: a context-aware health
paradigm within smart cities. IEEE Commun Mag 52(8):74–81
Chapter 7
Modular Design of Electronic Appliances
for Reliability Enhancement
in a Circular Economy Perspective
Abstract The design of electronic systems must consider the possibility of their
repair, reuse and recycling, in order to reduce waste. In this paper, we present a
design methodology for the modularization of electronic appliances that optimizes
their end-of-life cost. The optimization algorithm is based on the partitioning of
electronic components by means of simulated annealing, and it has been applied
to the design of a real industrial test case.
7.1 Introduction
Electronic devices become more widespread every day. The fundamental problem is that
these new electronic devices are usually not designed to last. When a phone breaks,
the consumer will probably buy a new one rather than repair the old one,
and the old phone simply becomes electronic waste, which has to be
disposed of. Therefore, more electronic products do not just mean more opportunities,
but also more waste, and this waste poses a serious environmental and economic issue
[1].
The economic problem comes from the fact that the end-of-life (EoL) treatment of
these devices is an expensive process; moreover, these electronic devices also contain
precious materials. These problems have become so important that many countries
have introduced, or are introducing, specific directives and specifications on the
recycling or disposal of waste electrical and electronic equipment (WEEE). The
European Community first tackled the problem in 2002 with the WEEE directive
2002/96/EC, and today WEEE disposal is governed by directive 2012/19/EU.
Furthermore, the European Commission has launched an EU action plan for the
Circular Economy, which aims to support the transition towards an economy in which
valuable materials, products and resources are maintained for as long as possible,
while reducing the generation of waste. Basically, the EoL industries
have three possibilities: they can try to repair a broken device; they can dismantle it
in order to reuse part of it; or they can recycle it, if the device is beyond repair,
cannot be dismantled, or is simply too expensive to repair or dismantle. These
approaches go under the name of “3Rs”: repair, reuse, recycle. The 3Rs approach is a
possible solution for reducing the waste of electronic appliances, in contrast
with the fact that some classes of products end their life while they are still working,
mainly due to fashion choices.
In recent years, many researchers have addressed the reuse of electronic components.
Many authors have tried to estimate the remaining useful life (RUL) using statistical
models fitted to real-world data (see for example [2]). In all these
works, the authors point out the importance of obtaining high-quality data on
the operating life of a device in order to get good RUL estimates. A few attempts to
collect and manage lifecycle data for electronic equipment have also been made [3].
Other cloud-based approaches are discussed in [4–6].
If the RUL estimate suggests that it is economically convenient to reuse some
parts of a device, an EoL industry can proceed to the disassembly phase.
Disassembling a piece of equipment may or may not be economically feasible; this
usually depends on the particular device assembly and layout, on the size of the
components, and on the practical difficulty of accessing a particular component or
device part. A strategy to optimize the disassembly sequence can be found in [7].
In order to make the disassembly process easier for the EoL industries, the
disassembly problem has been tackled starting from the device’s design phase. This
approach leads to the so-called “design for disassembly”. The authors of [8] propose
a selective parallel planning method, which groups parts into modules and tries to
remove the grouped parts from products simultaneously.
The idea of grouping components with similar features into modules in order to
speed up the disassembly process can be pushed further by dividing the whole appliance
into modules. The idea of modularity has been widely used to improve product
reliability, scalability, and ease of component replacement and maintenance, but not
so much to improve disposal, ease of reuse, reduction of waste and recycling. If a
modular device breaks, we can decide to replace and discard only the module that
broke, or we can decide to use the modules that still work in new products.
In summary, there are just a handful of projects which try to find an optimal
modular structure in the context of the 3Rs. In this paper, we present a design
methodology which tries to find an optimal modularization for an appliance. The goal
is to reduce the cost of the device, taking into account the cost of repairing it in
case of a fault. The cost function considers the cost of each module, the cost of the
interconnections between two modules, and a fault probability for each module. The
optimum is found using the simulated annealing optimization algorithm. In Sect. 7.2
we present the design methodology. In Sect. 7.3 we briefly describe the optimization
algorithm and the parameters required for the optimization. In Sect. 7.4 we present
the results of applying the algorithm to a real-world case.
Many parameters are used to define the probability that a system is functioning
correctly. The failure probability density $f(t)$ gives the probability $f(t)\,dt$ that a
failure occurs in the time interval $[t, t + dt]$. The cumulative probability of failure
$F(t)$ is the integral of the probability density
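$$F(t) = \int_0^t f(\tau)\,d\tau \tag{7.1}$$
$$\lambda(t) = \frac{f(t)}{1 - F(t)} \tag{7.3}$$
$$F(t) = 1 - e^{-\int_0^t \lambda(\tau)\,d\tau} \tag{7.4}$$
$$f(t) = \lambda(t)\,e^{-\int_0^t \lambda(\tau)\,d\tau} \tag{7.5}$$
$$\lambda_B(t) = \sum_{i=1}^{N} \lambda_i(t) \tag{7.6}$$
$$F_B(t) = 1 - e^{-\int_0^t \sum_{i=1}^{N} \lambda_i(\tau)\,d\tau} = 1 - \prod_{i=1}^{N} e^{-\int_0^t \lambda_i(\tau)\,d\tau} = 1 - \prod_{i=1}^{N} \left(1 - F_i(t)\right) \tag{7.7}$$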
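As a quick numerical check of (7.6) and (7.7), the sketch below assumes constant failure rates, so that each component has F_i(t) = 1 − exp(−λ_i t). This is an assumption made here only for illustration, since the chapter does not fix a specific λ(t); the function names and the rate values are hypothetical.

```python
import math

def failure_probability(rate, t):
    """F(t) = 1 - exp(-rate * t): equation (7.4) for a constant failure rate."""
    return 1.0 - math.exp(-rate * t)

def series_failure_probability(component_probs):
    """F_B = 1 - prod(1 - F_i): the last form of equation (7.7)."""
    survival = 1.0
    for f in component_probs:
        survival *= 1.0 - f
    return 1.0 - survival

# Hypothetical constant failure rates (failures per year) for three components.
rates = [0.001, 0.002, 0.0005]
t = 8.0  # years, the example horizon mentioned in the text

# Route 1: combine the individual component probabilities.
via_components = series_failure_probability([failure_probability(r, t) for r in rates])
# Route 2: sum the rates first, as in (7.6), then apply (7.4).
via_sum_of_rates = failure_probability(sum(rates), t)

print(via_components, via_sum_of_rates)
```

Both routes should agree up to floating-point error, which is exactly the chain of equalities in (7.7).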
Companies are usually interested in a reduced failure rate for the first 5–15 years
and are not interested in the longer-term behavior. Therefore, the
parameter we consider in this work is the probability of failure of the device $F_B(t)$
at a fixed time $t$ (for example 8 years).
The idea behind a modular design is to divide the whole equipment, consisting
of $N$ components, into $M$ distinct, interconnected modules, where the generic $j$-th
module consists of $N_j$ components. The following relationship holds
$$N = \sum_{j=1}^{M} N_j \tag{7.8}$$
where $C_{Bj}$ is the cost of the $j$-th module and $F_{Bj}$ its fault probability, which can be
expressed as
$$C_{Bj} = \sum_{i=1}^{N_j} C_{ij}, \qquad F_{Bj}(t) = 1 - \prod_{i=1}^{N_j} \left(1 - F_{ij}(t)\right) \tag{7.10}$$
In (7.10) $C_{ij}$ and $F_{ij}$ are the costs and the fault probabilities of the $i$-th component
of the $j$-th module, respectively. The total cost of the equipment is therefore
$$C_{TOT} = \sum_{j=1}^{M} C_{Bj}\left(1 + F_{Bj}(t)\right) + \sum_{j=1}^{M} \sum_{k=1}^{M} C_{con\,j,k} \tag{7.11}$$
where $C_{con\,j,k}$ is the additional cost term, which considers the cost of the
interconnections between module $j$ and module $k$. $C_{con\,j,k}$ takes into account the cost
of the connectors and cabling among the modules. If components that are electrically
connected are in the same module, they do not contribute to the connection cost.
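The cost model in (7.10) and (7.11) can be sketched in a few lines. The data layout below (modules as lists of cost and fault-probability pairs, and a dictionary with one entry per connected module pair) is an assumption made for this illustration, and all numbers are hypothetical.

```python
from math import prod

def module_cost(component_costs):
    """C_Bj: sum of the costs of the components in module j, as in (7.10)."""
    return sum(component_costs)

def module_fault_probability(component_probs):
    """F_Bj = 1 - prod(1 - F_ij), as in (7.10)."""
    return 1.0 - prod(1.0 - f for f in component_probs)

def total_cost(modules, connection_costs):
    """C_TOT as in (7.11).

    modules: list of modules, each a list of (cost, fault_probability) pairs.
    connection_costs: dict mapping a module pair (j, k) to its connection cost;
                      each connected pair is assumed to be listed once.
    """
    repair_term = sum(
        module_cost([c for c, _ in m])
        * (1.0 + module_fault_probability([f for _, f in m]))
        for m in modules
    )
    return repair_term + sum(connection_costs.values())

# Hypothetical example: two modules joined by one connection.
modules = [[(10.0, 0.01), (2.0, 0.2)], [(5.0, 0.05)]]
connection_costs = {(0, 1): 0.5}
print(total_cost(modules, connection_costs))
```

The factor (1 + F_Bj) reads as "the module is paid for once, plus once more with probability F_Bj if it has to be replaced".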
The number of modules $M$ and the way the components are placed in the different
modules are design parameters that depend mainly on the feasibility of the
modularization: the more modules there are, the more complex the connections among
them and the more difficult it is to actually implement the design. In addition, the
connection cost takes into account the disassembling and reassembling costs.
In this work, the optimization goal is to decide which component goes into which
module in order to minimize the cost function $C_{TOT}$. For example, we might expect
that coupling a high-cost component with a low-fault-probability component will
reduce the overall cost, but the increase in the interconnection cost might cancel out
this reduction. There is therefore an enormous number of combinations to be
searched to find the optimum component placement, and simulated annealing
is the algorithm we chose to perform this search.
serial_inputs costs 6.351 euro, and its fault probability is 0.00559. Starting from this
netlist, the software derives a connectivity matrix. An example is shown in Table 7.2.
This matrix contains a row and a column for each component. Each cell in the matrix
contains the number of connections between the corresponding pair
of components. For example, the MCU and the serial_inputs share 14 connections,
while the amplifier and the power_supply share two connections. This matrix is used
to evaluate the interconnection cost between the modules, $C_{con\,j,k}$, in the cost function
(7.11). The interconnection cost is evaluated by multiplying the cost of a single
connector by the number of nodes shared between two components placed in two
different modules.
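The rule just described (connector cost multiplied by the number of connections that cross a module boundary) can be sketched as follows. Only the two connection counts quoted in the text (14 and 2) come from the chapter; the zero entries, the component order, the connector cost of 0.1 and the function name are placeholders chosen for this illustration.

```python
def interconnection_cost(connectivity, assignment, connector_cost):
    """Connector cost for a given placement of components into modules.

    connectivity:   symmetric matrix; connectivity[a][b] is the number of
                    connections between components a and b.
    assignment:     assignment[a] is the index of the module holding component a.
    connector_cost: cost of one connection that crosses a module boundary.
    """
    crossing = 0
    n = len(connectivity)
    for a in range(n):
        for b in range(a + 1, n):  # visit each unordered pair once
            if assignment[a] != assignment[b]:
                crossing += connectivity[a][b]
    return connector_cost * crossing

# Component order: MCU, serial_inputs, amplifier, power_supply.
connectivity = [
    [0, 14, 0, 0],
    [14, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 2, 0],
]

# Connected components kept together: no connector is needed.
print(interconnection_cost(connectivity, [0, 0, 1, 1], 0.1))
# Every component in its own module: all 16 connections need a connector.
print(interconnection_cost(connectivity, [0, 1, 2, 3], 0.1))
```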
In general, the SA algorithm must perform a huge number of iterations before
reaching the “frozen” state, and its speed depends mainly on the dimensions of
the solution space. Common electronic equipment may contain a few hundred
components, and the SA would have to move all these components around. Nevertheless,
the designer can force some components to stay together in the same module. This
allows us to reason in terms of “macroblocks” rather than “components”, which speeds
up the optimization algorithm.
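A minimal simulated annealing loop over macroblock placements might look like the sketch below. It is not the authors' implementation: the geometric cooling schedule, the "move one block to a random module" step, the parameter defaults and the example data are all assumptions made here, and each connection between blocks in different modules is charged once.

```python
import math
import random

def total_cost(assignment, costs, probs, connectivity, connector_cost):
    """Cost in the spirit of (7.11): module cost times (1 + fault probability),
    plus a connector cost for every connection that crosses modules."""
    modules = {}
    for block, module in enumerate(assignment):
        modules.setdefault(module, []).append(block)
    cost = 0.0
    for blocks in modules.values():
        survival = 1.0
        for b in blocks:
            survival *= 1.0 - probs[b]
        cost += sum(costs[b] for b in blocks) * (2.0 - survival)
    n = len(assignment)
    for a in range(n):
        for b in range(a + 1, n):
            if assignment[a] != assignment[b]:
                cost += connector_cost * connectivity[a][b]
    return cost

def anneal(costs, probs, connectivity, connector_cost, max_modules,
           t_start=1.0, t_end=1e-3, cooling=0.95, moves_per_temp=50, seed=0):
    """Simulated annealing over block-to-module assignments."""
    rng = random.Random(seed)
    n = len(costs)
    current = [0] * n  # start with every block in a single module
    current_cost = total_cost(current, costs, probs, connectivity, connector_cost)
    best, best_cost = current[:], current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            candidate = current[:]
            candidate[rng.randrange(n)] = rng.randrange(max_modules)  # move one block
            cand_cost = total_cost(candidate, costs, probs, connectivity, connector_cost)
            delta = cand_cost - current_cost
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        temperature *= cooling  # geometric cooling toward the "frozen" state
    return best, best_cost

# Hypothetical macroblocks: costs, fault probabilities, no interconnections.
costs = [6.0, 1.0, 3.0, 2.0]
probs = [0.01, 0.2, 0.05, 0.1]
connectivity = [[0] * 4 for _ in range(4)]
best, best_cost = anneal(costs, probs, connectivity, 0.0, max_modules=4)
print(best, best_cost)
```

Working on macroblocks keeps `n` small, so each temperature step explores a meaningful fraction of the solution space.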
7.4 Results
• power_supply: the power supply section, which supplies the required voltages (5 and
3.3 V) to all the components, consisting of 28 components;
• amplifier: the audio amplification section, which includes the operational amplifier,
the digital rheostat that sets the gain and the final amplifier that sends the signal to
the speakers, consisting of 59 components;
• connectors: all the connectors that allow the board to interface with the
display, speaker etc., consisting of 16 components;
• serial_input: the serial communication section, including RS485, can_bus and
VEGA_serial, consisting of 27 components;
• input1 and input2: opto-isolated inputs that carry signals from the connectors to
the microcontroller through transceivers, consisting of 60 and 64 components,
respectively.
The total number of single components is 287. The costs and the fault
probabilities for these macroblocks are given in Table 7.1, while their connection matrix
is shown in Table 7.2. Costs and fault probabilities during the first 8 years of life
have been obtained by analyzing the fault history of similar electronic boards. We used
a mixture model for the cumulative probability of failure of the i-th component of
the j-th macroblock
with 2 blocks, and 8 modules with 1 block. With the above defined constraints, the
number of blocks per module in (7.10)–(7.11) is $N_j = p$, $j = 1, \dots, M$, and $M = N/p$.
Therefore (7.11) becomes:
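$$C_{TOT} = \sum_{j=1}^{N/p} \left(\sum_{i=1}^{p} C_{ij}\right) \left[2 - \prod_{i=1}^{p} \left(1 - F_{ij}\right)\right] + C_{con}\,N\,(N - p) \tag{7.13}$$
$$C_{TOT} = \sum_{i=1}^{N} C_i \left[2 - (1 - F)^{p}\right] + C_{con}\,N\,(N - p) \tag{7.14}$$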
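To see how (7.14) trades the module term against the connection term, the sketch below evaluates it for p = 1, 2, 4, 8 with 8 equal-cost blocks. The fault probability F = 0.1 and the function name are assumptions made for this illustration and are not test parameters taken from the chapter.

```python
def total_cost_equal_blocks(block_costs, fault_prob, p, connector_cost):
    """Evaluate (7.14): N blocks with the same fault probability F,
    grouped into modules of p blocks each."""
    n = len(block_costs)
    if n % p != 0:
        raise ValueError("p must divide the number of blocks")
    module_term = sum(block_costs) * (2.0 - (1.0 - fault_prob) ** p)
    return module_term + connector_cost * n * (n - p)

# Hypothetical setup: 8 blocks of equal normalized cost and F = 0.1.
block_costs = [1.0 / 8] * 8
best_p_by_connection_cost = {}
for c_con in (0.0, 0.008, 0.012):
    costs_by_p = {
        p: total_cost_equal_blocks(block_costs, 0.1, p, c_con) for p in (1, 2, 4, 8)
    }
    best_p_by_connection_cost[c_con] = min(costs_by_p, key=costs_by_p.get)

print(best_p_by_connection_cost)
```

With these illustrative numbers, a low connection cost favors one block per module (p = 1, i.e. M = 8), while a higher connection cost pushes the optimum towards a single module (p = 8).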
In test 2 the normalized cost is the same for all the blocks (chosen equal to 1/8 =
0.125); blocks 1–4 have a higher failure probability (F = 0.2) than blocks
5–8 (F = 0.05). The normalized connection cost is 0.008. In tests 3 and 4 the 8 blocks
have different normalized costs and failure probabilities. The normalized connection
cost is 0.012 and 0.008 for tests 3 and 4, respectively.
The configuration of the best solutions and the total cost defined in (7.13) are
reported in Table 7.5. Table 7.6 reports, for the different tests, the normalized total
cost including connection cost of the best solution as a function of the number of
modules M = N / p = 1, 2, 4, 8 for each module A-H identified in Table 7.5.
For the selected value of the connection cost, the best solution for tests 1 and 2 is
obtained using 8 modules. For test 2, the best solution groups together the
blocks with the highest fault probability. For example, in the case of two modules,
all the high-fault-probability blocks are placed in the same module.
In test 3, with a high connection cost, the best solution is obtained by grouping all
the blocks in a single module. In test 4, with a medium connection cost, the best
solution is obtained by grouping the blocks into 4 modules of 2 blocks each. Blocks
with high fault probability and high cost are placed together with blocks with low
fault probability and low cost to reduce the total cost.
In general, when the connection cost is high, the best solution is found by grouping
the blocks as much as possible. When the connection cost is zero, each block is placed
in a separate module, as can be seen from (7.13).
The optimum block partitioning for the real application, a board of the Vega
s.r.l. company, is one single module, which could be seen as an obvious solution. The
simplified test cases have been useful to draw some considerations. In general, we
can see that the optimizer produces modules with high cost and low fault probability,
and modules with low cost and high fault probability.
7.5 Conclusions
The idea of grouping components in modules has been used to speed up disassembly
in the recycling or reuse phase of EoL. In this work, the idea of modularity has been
considered to improve product reliability. We have presented a design methodology
to find an optimal modularization for an appliance, with the goal of reducing
the cost of the device, taking into account the cost of repairing it in case of a fault.
The methodology and the software developed have been applied to a real test case, an
electronic board for elevator control. The results show that the partition of the device
into modules should keep high-cost blocks separated from high-fault-probability blocks.
The cost of the partitioning is taken into account in the cost function through the cost
of the connectors.
Acknowledgements The authors would like to thank Andrea Vesprini and Andrea Medori of
VEGA company. The work presented is part of a regional RAEEcovery project supported by EU
funding (https://www.raeecovery.com).
References
7. Smith S, Smith G, Chen W (2012) Disassembly sequence structure graphs: an optimal approach
for multiple-target selective disassembly sequence planning. Adv Eng Inf 26:306–316
8. Smith S, Hung P (2012) A parallel disassembly method for green product design. In: Pro-
ceedings of IEEE conference 2012 electronics goes green 2012+, 9–12 Sept 2012, Berlin,
Germany
9. Kumar V, Singh L, Tripathi AK (2017) Reliability analysis of safety-critical and control
systems: a state-of-the-art review. IET Softw 12(1):1–18
10. Rausand M, Hoyland A (2004) System reliability theory: models, statistical methods and
applications. Wiley
Chapter 8
Pest Detection for Precision Agriculture
Based on IoT Machine Learning
Abstract Apple orchards are expanding widely in many countries of the world, and
one of the major threats to these fruit crops is the attack of dangerous parasites such as
the Codling Moth. IoT devices capable of executing machine learning applications in
situ now make immediate data analysis and anomaly detection possible directly in
the orchard. In this paper, we present an embedded electronic system
that automatically detects Codling Moths in pictures taken by a camera on top
of the insect trap. Image pre-processing, cropping, and classification are done on a
low-power platform that can easily be powered by a solar panel energy harvester.
8.1 Introduction
near the sensors and finally transmit a report of a few bytes, thanks to compression
methods [4]. Moreover, machine learning can improve the performance of a precision
agriculture application because this type of algorithm can quickly detect and
classify parasites, diseases, and weeds.
This paper focuses on a smart application that automatically detects a dangerous
parasite of apple orchards, the Codling Moth. This insect looks like a butterfly,
and it is a major problem for apple orchards. Using an insect glue trap, it is
possible to take a picture, determine whether any Codling Moths are present, and finally
send a notification to the farmer. The classification is done near the sensor on dedicated
low-cost, low-power hardware, and an energy-efficient solution is proposed to
sustain the system for as long as possible.
The system consists of a trap that looks like a small hive, as shown in Fig. 8.1, where
a pheromone bait and a glue layer capture the attracted insects even when they are
present at low density. The farmer usually inspects the traps periodically or mounts a
wireless camera that sends the captured pictures for remote evaluation.
This process is expensive and time-consuming for the farmer. The proposed system
detects the presence of the parasites with a machine learning approach and sends
the farmer only notifications of threats and their position.
The workflow of the proposed application is summarized in Fig. 8.2. A camera
periodically takes pictures inside the trap, and the board detects and crops the new
insects that have not yet been analyzed, so that they can be classified. Finally, a
notification about the detection of parasites is transmitted to the farmer.
For this purpose, the hardware is based on a Raspberry Pi 3 with a Pi Camera, which is
in charge of image pre-processing and cropping, whereas a Movidius Neural Compute
Stick (NCS), which features the Intel Myriad X neural accelerator, performs the
classification stage. Classification is done by a machine learning algorithm that uses
a Convolutional Neural Network (CNN) model tailored for the NCS. The uncommon
feature of this IoT application is that the classification stage is executed in situ (near
the camera). The processing results, consisting of a few bytes after the classification,
are transmitted using a long-range, low-power communication protocol such as
LoRaWAN [5–7]. Thanks to the technical features of this standard, the end nodes can
transmit data over a range of 15 km [8]; additionally, LoRaWAN protects the integrity
of the transmitted data because its protocol also defines security encryption [9].
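To give a feel for what "a few bytes" can carry, the sketch below packs one classification report into a 5-byte payload with Python's `struct` module. The field layout (trap identifier plus three counters) is entirely hypothetical; the chapter does not describe the actual message format.

```python
import struct

# Hypothetical 5-byte report layout (not taken from the chapter):
#   uint16 trap_id | uint8 codling_moths | uint8 other_insects | uint8 uncertain
REPORT_FORMAT = ">HBBB"  # big-endian, no padding

def encode_report(trap_id, codling_moths, other_insects, uncertain):
    """Pack one classification report into a compact binary payload."""
    return struct.pack(REPORT_FORMAT, trap_id, codling_moths, other_insects, uncertain)

def decode_report(payload):
    """Unpack a payload back into (trap_id, codling_moths, other_insects, uncertain)."""
    return struct.unpack(REPORT_FORMAT, payload)

payload = encode_report(trap_id=17, codling_moths=3, other_insects=5, uncertain=1)
print(len(payload), decode_report(payload))
```

Sending counts rather than images is what keeps the radio payload this small, at the price of doing the classification on the node.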
Deep learning is a class of algorithms widely used in machine learning. The network
implemented in this project is, in particular, a CNN. This type of network is widely
used in image classification and object recognition problems. Before the training
stage of the Deep Neural Network (DNN), a clean and fairly large dataset of pictures
is necessary to build the network in an optimal way. The dataset generation stage is
fundamental for supervised methods, since each image used for the training and
validation stages is known and labeled a priori. This implies that a good dataset of
training pictures is crucial for global performance. The dataset generation session
started with a small set of raw pictures (approximately 300), as shown in Fig. 8.3a,
which was extended as more insects were trapped during the experiments.
The dataset is divided into two classes: codling moth and general insects. For this
specific task a VGG16 model, developed at the University of Oxford, is used [10],
training all the layers of the network. The model is then converted to a graph model
used to perform the classification on the Vision Processing Unit (VPU).
The camera captures the floor of the insect trap, as shown in Fig. 8.3, and the pictures
may contain a large number of insects to classify. Thus, the images are processed with
OpenCV functions to extract each insect into a sub-tile of the original picture.
The task exploits easily extracted features such as the color (a dark subject on a white
background) and the shape of the insects through a Blob Extraction algorithm. The
image cropping process is developed entirely with OpenCV functions, and it consists of:
68 A. Albanese et al.
In this way, iterating the algorithm is useful for obtaining better images for the
training and evaluation sessions, and also for extending the dataset.
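The blob-extraction idea described above (dark insects on a light background, one crop per connected region) can be illustrated without OpenCV. The sketch below is a dependency-free stand-in based on a flood fill, not the OpenCV pipeline used in the chapter; the threshold value, the function name and the tiny test image are hypothetical.

```python
from collections import deque

def extract_blobs(image, threshold=128):
    """Return bounding boxes (top, left, bottom, right) of dark connected regions.

    image: 2D list of grayscale values (0 = black, 255 = white).
    Pixels darker than `threshold` are foreground; regions are grown with
    4-connectivity using a breadth-first flood fill.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] < threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

W, B = 255, 0  # white background, black "insect" pixels
image = [
    [W, W, W, W, W, W],
    [W, B, B, W, W, W],
    [W, B, W, W, B, W],
    [W, W, W, W, B, W],
]
boxes = extract_blobs(image)
print(boxes)
```

Each bounding box corresponds to one sub-tile that would be cropped and passed on to the classifier.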
For the training stage, we take advantage of the rapid development of neural networks
for image classification based on the TensorFlow library [11].
This step is an offline process that is executed on a host computer (such as a cluster),
and it aims to optimize the neural network using a large dataset of labeled images,
so that the system can learn from the category assigned to each image. The basic
element of a DNN is the neuron (or node). Each input of a neuron is multiplied by a
so-called weight value as soon as the input is ready. For example, if a neuron has four
inputs, it has four weight values which can be adjusted during training. Many
parameters involved in the process can be tuned to improve a DNN. In our case, the
most important parameters, which affect the performance in a significant way, are the
number of epochs and the image size. The first determines how many times the
entire set of training vectors is used to update the weights; at the end of each epoch, a
validation step is computed to evaluate the ongoing training process. The image size,
instead, is obtained by scaling each picture that feeds the DNN. The objective is to
find the optimal tradeoff between the two parameters to complete the training stage
while meeting the hardware constraints. In our application, the following three
different configurations were used:
• 75 epochs, image size 224 × 224;
• 10 epochs, image size 112 × 112;
• 10 epochs, image size 52 × 52.
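As a small illustration of the two notions above, the sketch below counts the values in one input image for each of the three sizes and evaluates a single neuron with four inputs. The three colour channels, the ReLU activation and the numeric inputs and weights are assumptions made for this example; the chapter does not state them.

```python
def input_values(width, height, channels=3):
    """Number of values in one input image (channels=3 is an assumption)."""
    return width * height * channels

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a ReLU activation (assumed)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)

# Input size for the three tested configurations.
for side in (224, 112, 52):
    print(side, input_values(side, side))

# A neuron with four inputs has four trainable weights.
output = neuron_output([1.0, 0.5, -1.0, 2.0], [0.2, -0.4, 0.1, 0.3])
print(output)
```

Halving the image side divides the number of input values by four, which is why the smaller configurations are much lighter on memory.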
The results obtained in the training tests are shown in Fig. 8.4.
Notice that the training and validation accuracy obtained with 75 epochs (the default
parameter) saturates. This means that the network does not provide enough accuracy
during the test stage and is not able to generalize as well as required.
Thus, the number of epochs can be decreased to achieve better results: as shown in the
graphs, 10 epochs are enough for excellent accuracy. Moreover, in order to avoid
possible overflow and to save memory on the Raspberry Pi 3, the image size is
decreased to work with a simpler model and to meet the hardware constraints. Image
sizes of 112 × 112 and 52 × 52 have been tested and used. The chosen image size
shows worse performance than that obtained using a bigger image size.
Nevertheless, the measured accuracy is 98%, which satisfies the requirements of this
class of parasite monitoring systems. After the training and validation stages, the
neural network model file is ready. It is possible to test the performance of the DNN
model on a new set of data (a subset of the original dataset) that was never
used by the DNN. This step helps to assess the performance and the generalization
of the network, and it is crucial to confirm the accuracy computed during validation.
Figure panels: (a), (d) 75 epochs, image size 224 × 224; (b), (e) 10 epochs, image size
112 × 112; (c), (f) 10 epochs, image size 52 × 52.
Fig. 8.5 Example of codling moth detection (red boxes) and general insects (blue boxes)
An example of the output from the classification stage is presented in Fig. 8.5. Our
DNN provides a measure of the accuracy, which indicates whether the detected insect
is more similar to a general insect or to a Codling Moth. The tests were carried out in
an apple orchard for 12 weeks, with the insect glue trap shown in Fig. 8.1, in which 62
insects were captured. Of these, 70% were Codling Moths, while the remaining
30% were general insects. In this case, the tested pictures are of different sizes.
Classification results are summarized as follows:
• 80.6% were classified correctly;
• 4.8% were false positives;
• 6.4% were false negatives;
• 8.1% were uncertain.
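The percentages above can be turned back into approximate insect counts, given the 62 captured insects. The counts produced below are inferred from the rounded percentages and are not figures reported in the chapter; the dictionary keys are names chosen for this illustration.

```python
TOTAL_INSECTS = 62
reported_percent = {
    "correct": 80.6,
    "false_positive": 4.8,
    "false_negative": 6.4,
    "uncertain": 8.1,
}

# Nearest whole number of insects for each reported percentage.
counts = {
    outcome: round(percent * TOTAL_INSECTS / 100)
    for outcome, percent in reported_percent.items()
}
print(counts, sum(counts.values()))
```

The inferred counts add up to the 62 captured insects, which suggests that the small gap to 100% in the listed percentages comes from rounding.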
8.3 Conclusions
This paper presents a machine learning-based smart camera tailored for precision
agriculture services. The camera automatically detects whether dangerous parasites
have been trapped by commercial pheromone boxes in apple orchards and sends an
alarm to the farmer. Future work will investigate improvements in
classification accuracy and energy consumption, by developing a custom DNN and
by extending the training dataset to additional pest types. Moreover, we will include
an energy harvester capable of sustaining the energy consumption of the smart
trap, to permit unattended operation indefinitely.
Acknowledgements This research was supported by the IoT Rapid-Proto Labs project, funded
by the Erasmus+ Knowledge Alliances program of the European Union (588386-EPP-1-2017-FI-
EPPKA2-KA).
References
1. Ding W, Taylor G (2016) Automatic moth detection from trap images for pest management.
Comput Electron Agric 123(C):17–28
2. Magno M, Tombari F, Brunelli D, Di Stefano L, Benini L (2013) Multimodal video analysis
on self-powered resource-limited wireless smart camera. IEEE J Emerg Sel Top Circ Syst
3(2):223–235
3. Magno M, Tombari F, Brunelli D, Di Stefano L, Benini L (2009) Multimodal aban-
doned/removed object detection for low power video surveillance systems. In: 2009 Sixth
IEEE international conference on advanced video and signal based surveillance, pp 188–193
4. Brunelli D, Caione C (2015) Sparse recovery optimization in wireless sensor networks with a
sub-nyquist sampling rate. Sensors 15(7):16654–16673
5. Polonelli T, Brunelli D, Benini L (2018) Slotted ALOHA overlay on LoRaWAN—a distributed
synchronization approach. In: 2018 IEEE 16th international conference on embedded and
ubiquitous computing (EUC), Oct 2018, pp 129–132
Chapter 9
Statistical Flow Classification for the IoT
Abstract The objective of this work is to analyze packet flows and classify them
as traffic that belongs to IoT devices or to traditional non-IoT communication. We
employ two methods: a clustering approach, which learns directly from the structure
of the dataset, and a classification tree, trained with the collected data and evaluated
using 10-fold cross validation. The results show that classification trees outperform
clustering on all datasets, and achieve high accuracy on both homogeneous simulated
and real deployment traffic data.
9.1 Introduction
Protocol and packet classification is at the basis of several services that can be
offered by network operators and by device manufacturers. For instance, differentiated
services require that the kind of communication be recognized, in order to provide
customized quality of service or to analyze the performance of the network. In our
specific case, we are interested in distinguishing traditional user traffic, such
as e-mail, web surfing and media streaming, from traffic originating from independent
devices, such as sensors, remote controls, fleet tracking and environmental
monitoring. This last category of devices constitutes what is known as the Internet of Things
9.3 Results
9.3.1 Clustering
The first method that we have used for classification is a semi-supervised clustering
approach, based on the SeLeCT self-learning classifier proposed by Grimaudo et
al. [6]. We proceed as follows. We select in Weka the SimpleKMeans algorithm, and
partition the dataset into a number of disjoint classes. We instruct the algorithm to
ignore the IoT/NON_IoT label given to each flow, making this step unsupervised.
In other words, the algorithm will try to determine classes irrespective of the label
that was assigned in the first place. Clustering is then run several times, progressively
increasing the number of clusters. Ideally, two clusters would be sufficient, but
naturally the unsupervised method is unable to aggregate IoT and non-IoT flows so that
they are completely separated. With several clusters, instead, we might find smaller
aggregates which are mostly IoT or mostly non-IoT. To make the determination, for
each cluster, we inspect the number of actual IoT and NON_IoT flows that belong
to the cluster. Clusters which have a majority of IoT flows are then labeled as IoT,
while the others are labeled as NON_IoT. This is the supervised step of the approach:
while clusters are identified based solely on the flow features, the class of each
cluster is determined based on the previous knowledge of the flow classification.
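The supervised labeling step can be sketched as a majority vote per cluster. The code below assumes that cluster assignments are already available from some k-means run (the chapter uses Weka's SimpleKMeans; no clustering is performed here), and the function names and the small example are hypothetical.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, true_labels):
    """Label each cluster IoT if most of its flows are IoT, otherwise NON_IoT."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {
        cluster: "IoT" if counts["IoT"] > counts["NON_IoT"] else "NON_IoT"
        for cluster, counts in votes.items()
    }

def classify(cluster_ids, cluster_labels):
    """The predicted class of a flow is the label of the cluster it fell into."""
    return [cluster_labels[c] for c in cluster_ids]

# Hypothetical example: 8 flows spread over 3 clusters.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["IoT", "IoT", "NON_IoT", "NON_IoT", "NON_IoT", "IoT", "NON_IoT", "IoT"]

cluster_labels = label_clusters(cluster_ids, true_labels)
predicted = classify(cluster_ids, cluster_labels)
print(cluster_labels, predicted)
```

Flows that sit in a cluster dominated by the other class are the ones that end up misclassified, which is why purer (smaller) clusters tend to help.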
The first set of experiments makes use of the initial dataset, comprising mostly
the simulated IoT flows. Table 9.1 shows in detail the results of clustering, obtained
through 10-fold cross validation. The first column reports the number of clusters.
The second and third columns report the confusion matrix: for each class (shown in
the last column), the table shows the number of flows that were included in a cluster
which was labeled as IoT or NON_IoT, respectively. The following four columns
give a summary of the performance: we compute the True Positive (TP) and the
False Positive (FP) rates, as well as the Precision and Recall measures for both IoT
and NON_IoT flows. As the number of clusters increases, we get a better Recall for
the IoT flows, reaching a maximum of 96.6% for the division into 50 clusters. As we
increase the number of clusters further, the overall performance slightly increases,
although we are less accurate on the IoT flows.
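The four summary measures can be computed directly from a two-class confusion matrix, as in the sketch below. The counts are hypothetical and are not values from Table 9.1; the function name is chosen for this illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Rates for the positive class from a 2x2 confusion matrix.

    tp: positives predicted positive    fn: positives predicted negative
    fp: negatives predicted positive    tn: negatives predicted negative
    """
    return {
        "tp_rate": tp / (tp + fn),    # identical to recall
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts, with IoT as the positive class.
iot = binary_metrics(tp=90, fn=10, fp=30, tn=870)
# Swapping the roles of the two classes gives the NON_IoT metrics.
non_iot = binary_metrics(tp=870, fn=30, fp=10, tn=90)
print(iot)
print(non_iot)
```

With unbalanced classes, as in this example, recall can be high while precision is noticeably lower, because a few misplaced flows of the larger class outweigh the smaller class.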
We observe that the number of IoT flows correctly categorized as IoT flows
increases up to 50 clusters. Increasing the number of clusters further gives no
improvement; in fact, the number slightly decreases. The number of NON_IoT flows
incorrectly categorized as IoT, on the other hand, steadily decreases as the number of
clusters increases. A division into 50 clusters seems to provide the best trade-off.
Figure 9.1, left, shows in dark color the four clusters labeled as IoT traffic for the
50-cluster case, in terms of acknowledge rate from client and server.
76 G. Cirillo et al.
Fig. 9.1 Acknowledge rate of the client (X axis) and of the server (Y axis). IoT labeled clusters
shown in dark color, non-IoT clusters in light green and red. Left: simulated IOT flows. Right:
captured IoT flows
We have conducted the same analysis including the 15,000 flows from the Australian
deployment. The expectation is that the results will be somewhat less satisfactory,
because of the increased diversity of the devices in use. While there are still
areas which are clearly identifiable, overall the distribution of IoT flows (blue dots)
shown in Fig. 9.1, right, is much more dispersed. The confusion matrix is therefore
far from ideal, as shown in the second part of Table 9.1. The situation slightly
improves when using 100 clusters; however, the precision is still fairly low, and the
computational complexity of determining cluster membership increases. One of the
reasons why clustering does not provide good performance is that the different
features contribute symmetrically to the Euclidean distance from the cluster centroid.
This is less of a concern with more homogeneous features, but induces confusion
when traffic has a higher degree of overlap.
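The effect of every feature counting equally in the distance can be seen in a toy example with two features: one that separates the classes well and one that is mostly noise. The centroids, the flow, the weights and the function names below are hypothetical, and the per-feature weighting is shown only for contrast; plain k-means does not use it.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with one weight per feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def nearest(flow, centroids, weights):
    """Name of the centroid closest to the flow."""
    return min(centroids, key=lambda name: weighted_distance(flow, centroids[name], weights))

# Feature 0 is informative, feature 1 is noisy (both scaled to [0, 1]).
centroids = {"IoT": (0.9, 0.2), "NON_IoT": (0.1, 0.8)}
flow = (0.75, 0.9)  # IoT-like on feature 0, but an outlier on feature 1

equal = nearest(flow, centroids, weights=(1.0, 1.0))       # plain Euclidean distance
reweighted = nearest(flow, centroids, weights=(1.0, 0.2))  # noisy feature down-weighted
print(equal, reweighted)
```

With equal weights the noisy feature pulls the flow towards the wrong centroid, whereas a method that can favor the informative feature, such as a classification tree splitting on it, is not affected in the same way.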
Classification trees have been shown to perform well in protocol recognition [2]. We
have generated several classification trees, for the initial simulated dataset and for the
complete dataset. We have also analyzed the influence of the different parameters
on both performance and tree size. As usual, the accuracy is evaluated through the
resulting confusion matrix using 10-fold cross validation. Our first experiment deals
with the simulated and the complete dataset using the full set of attributes. The
trees perform particularly well, as shown in Table 9.2, where the confusion matrix
highlights that only a few of the flows are misclassified.
In particular, the performance is superior to many other methods that we have
analyzed (including SVM, Naïve Bayes and 3-level perceptron [7]), with an average
precision and recall that exceed 99% for both datasets. Table 9.3 shows the tree
information in terms of computational complexity.
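As a reminder of how precision and recall follow from a binary confusion matrix, here is a small Python sketch. The counts used in the example are hypothetical placeholders and are not the values of Table 9.2.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (e.g. IoT) class.

    tp: positive flows classified as positive
    fp: negative flows classified as positive
    fn: positive flows classified as negative
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, for illustration only
precision, recall = precision_recall(tp=990, fp=5, fn=10)
```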
The first column reports the total size of the tree (number of nodes), while the
second column counts the number of leaves in the tree. The size of the tree gives
an estimate of the amount of memory required to store the tree information. The
following three columns provide information regarding the depth of the tree: the
minimum and the maximum depth to reach a leaf, as well as the average, where the
depth is weighted by the number of flows in the training set that are associated with
each particular leaf. The data shows that the lower variability associated with the
simulated flows results in a much smaller and shallower tree for classification.
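The depth figures described above can be computed as in the following Python sketch, where each leaf is represented by a hypothetical (depth, number of training flows) pair. The example tree is invented and does not reproduce Table 9.3.

```python
def depth_stats(leaves):
    """Return (min depth, max depth, flow-weighted average depth).

    leaves: list of (depth, n_training_flows) pairs, one per leaf.
    """
    depths = [depth for depth, _ in leaves]
    total_flows = sum(n for _, n in leaves)
    weighted_avg = sum(depth * n for depth, n in leaves) / total_flows
    return min(depths), max(depths), weighted_avg

# Hypothetical tree with three leaves: most flows end in a shallow leaf
example_leaves = [(2, 900), (4, 80), (5, 20)]
min_depth, max_depth, avg_depth = depth_stats(example_leaves)
```

Weighting by the number of training flows means the average reflects the expected number of comparisons per classified flow, rather than the shape of the tree alone.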
It is interesting to study the influence of each attribute on the classification accu-
racy. This could be useful, for instance, to select only a subset of the attributes that
provide most of the performance. To choose the most relevant attributes, we proceed
in two ways. The first is a greedy search, whereby we evaluate the classification
performance using progressively more attributes. Hence, we start by evaluating the
performance of all trees that use only one attribute, and keep the attribute that provides
the best performance. Then, we evaluate all trees with two attributes, having fixed
the first in the previous step. The results show that performance increases quickly
with the addition of more attributes. In both the simulated and the complete dataset,
three specific attributes are selected among the first four. These correspond to the
fraction of acknowledgments from the client to the server, the fraction of bytes from client
to server, and the client minimum round trip time. In the simulated case, the fraction
of packets from client to server completes the set, whereas for the complete case
the minimum server round trip time is used. The second mechanism for attribute
selection makes use of the facility provided by the Weka framework. We perform a
Wrapper Subset Evaluation, which is a scheme similar to the one employed above,
using a Greedy Step-wise incremental search. In all cases, we select the J48 algorithm
for evaluation. In both the simulated and complete case, Weka selects eight attributes
out of the available 14, including the ones that we have determined using the manual
procedure above. The results of generating the classification tree are shown in Ta-
ble 9.4. The accuracy is slightly better than that of the tree that uses all the attributes
together. This may be an indication that there is some degree of “overfitting”, i.e.,
that there are too many parameters to choose from.
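The manual greedy search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `score` callable stands in for "train a tree on this attribute subset and return its cross-validated accuracy", and the attribute names and gains in the toy example are hypothetical, not the measured values.

```python
def greedy_forward_selection(attributes, score, max_attributes=None):
    """Add one attribute at a time, keeping the one that maximizes score.

    attributes: iterable of attribute names
    score: callable taking a list of attribute names and returning a number
    Returns a list of (selected_subset, score) pairs, one per step.
    """
    remaining = list(attributes)
    limit = len(remaining) if max_attributes is None else max_attributes
    selected, history = [], []
    while remaining and len(selected) < limit:
        # Evaluate every candidate together with the attributes fixed so far
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history

# Toy scoring function with hypothetical, additive per-attribute gains
GAINS = {"ack_frac_c2s": 0.30, "bytes_frac_c2s": 0.10,
         "min_rtt_client": 0.05, "pkt_frac_c2s": 0.01}

def toy_score(subset):
    return 0.5 + sum(GAINS[a] for a in subset)

steps = greedy_forward_selection(GAINS, toy_score, max_attributes=3)
```

In practice the scoring function would wrap the actual classifier evaluation, which is what makes the search expensive: each step retrains one model per remaining attribute.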
9 Statistical Flow Classification for the IoT 79
9.4 Conclusions
References
1. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and
techniques. Morgan Kaufmann, Cambridge
2. Grimaudo L, Mellia M, Baralis E (2012) Hierarchical learning for fine grained internet traffic
classification. In: Proceedings of IWCMC, Aug 2012
3. Fontugne R et al (2010) Mawilab: Combining diverse anomaly detectors for automated anomaly
labeling and performance benchmarking. In: ACM CoNEXT10, Dec 2010
4. Sivanathan A, Habibi Gharakheili H, Loi F, Radford A, Wijenayake C, Vishwanath A, Sivara-
man V (2018) Classifying IoT devices in smart environments using network traffic character-
istics. IEEE Trans Mob Comput
5. Shafiq MZ, Ji L, Liu AX, Pang J, Wang J (2013) Large-scale measurement and characterization
of cellular machine-to-machine traffic. IEEE/ACM Trans Netw 21(6):1960–1973
6. Grimaudo L, Mellia M, Baralis E, Keralapura R (2014) SeLeCT: self-learning classifier for
internet traffic. IEEE Trans Netw Serv Manage 11(2):144–157
7. Pant V, Passerone R, Welponer M, Rizzon L, Lavagnolo R (2017) Efficient neural computation
on network processors for IoT protocol classification. In: Proceedings of the first new generation
of circuits and systems conference, NGCAS 2017, Genova, Italy, 7–9 Sept 2017
Chapter 10
Using LPWAN Connectivity for Elderly
Activity Monitoring in Smartcity
Scenarios
The Internet of Things paradigm has already affected the way healthcare services are
provided [1]. In non-urban areas, in mountain areas, in smaller islands, or in any case
characterized by a sparse population, in which the use of single clinical sites is not
conceivable, it is necessary to promote the use of telemonitoring, teleassistance and,
more generally, telehealth solutions. In this perspective, the use of ICT applications
in home care is a growing research area, with a large set of ICT solutions
that can be used to enhance accessibility to home care [2]. For instance, daily activity
and mobility are strong indicators of people’s health [3]. Additionally,
permanent digital monitoring would allow earlier diagnosis and faster response times,
providing new digital biomarkers able to anticipate and prevent hard clinical endpoints
such as falls. Hence the importance of monitoring the activities of elderly people
and chronic patients in the home environment.
As stated in the introduction, falls are one of the leading causes of injury [10]
in the geriatric population, and a sedentary lifestyle leads to a lower quality of life [11].
Figure 10.1 shows the proposed tracking system application scenario, where a self-
sufficient elderly person can carry out normal daily activities wearing the device.
The data, collected by local LoRaWAN gateway(s), are tunneled through an Internet
LoRaWAN is a network with a star-of-stars topology. The vast majority of information
is transferred with “uplink” transactions, which are started by the end nodes
and directed to the backend servers. Wireless messages are collected by gateways,
which run the “packet forwarder” software that tunnels over-the-air messages into the
wired backhaul network (and vice versa, when reverse “downlink” transactions
are needed). Regarding security aspects, messages are encrypted on a per-session basis
by means of application keys, while authentication at the network level is provided
by network keys; another backend server is generally in charge of managing the
84 D. Fernandes Carvalho et al.
Fig. 10.3 Architecture of the LoRaWAN solution used in the BSL project
In this section the capabilities of the proposed wearable device are detailed. In partic-
ular, first it is shown how the system can collect information about physical activity
and then the delays in transmitting such information are evaluated.
Fig. 10.4 Acceleration components retrieved by the system: a normal walking, b forward fall ending
face downward
Figure 10.4 reports an example of the data obtained from the analysis of
two movements. The left part (Fig. 10.4a) shows the acceleration components
measured during a walk at a normal pace. The system is able to compute an activity-level
parameter, which is periodically sent to the healthcare physician to help
decide whether the patient has a sufficiently active lifestyle. In Fig. 10.4b
we can observe a forward fall ending face downward. In this case, the device
can send an automatic help request.
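The text does not specify how the activity level or the fall condition are computed on the device. Purely as an illustration, the Python sketch below uses two common and simple choices, both of which are assumptions: activity is taken as the mean deviation of the acceleration magnitude from 1 g, and a fall is suspected when any sample exceeds an impact threshold (here a hypothetical 2.5 g).

```python
import math

G = 9.81  # standard gravity in m/s^2

def magnitude(sample):
    """Magnitude of a 3-axis acceleration sample (ax, ay, az) in m/s^2."""
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)

def activity_level(samples):
    """Assumed metric: mean absolute deviation of the magnitude from 1 g."""
    return sum(abs(magnitude(s) - G) for s in samples) / len(samples)

def fall_suspected(samples, impact_threshold=2.5 * G):
    """Assumed rule: flag a fall if any sample exceeds the impact threshold."""
    return any(magnitude(s) > impact_threshold for s in samples)

# Hypothetical windows: a person standing still, then a window with an impact
still_window = [(0.0, 0.0, G)] * 10
impact_window = still_window + [(0.0, 0.0, 30.0)]
```

A real detector would typically also look at the posture after the impact, as suggested by the "ending face downward" observation, but that logic is not described in the source.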
In order to measure the application delay [12] inside the Patavina NetSuite infrastructure,
the experimental setup of Fig. 10.5 has been built. It consists of a single node
(located in the University laboratory and based on a PC connected to the LoRaWAN
modem RN2483) sending information via uplink to several user end points
(implemented by IOT2040 platforms). EP1 is connected to the Internet via the University’s
reliable and fast access; EP2, located in Brescia, and EP3, located in Milan, rely
on ADSL links.
Fig. 10.5 Experimental setup with different end points (EP1 is connected to the Internet via the
University reliable and fast access; EP2, located in Brescia and EP3, located in Milan, leverage on
ADSL links)
Fig. 10.6 Boxplot of the overall end-to-end delay OD for the three considered endpoints
(plot axes: OD (ms) and MD (ms), from 0 to 6000 ms, for EP1, EP2 and EP3)
In this way, timestamp T1 is registered when a LoRaWAN uplink transmission
starts. Each EPn is an MQTT subscriber of the topic “event of interest” in the MQTT
broker; when a new message is received, it is tagged with timestamp T3n.
Moreover, the AS is in charge of registering the timestamp T2 when the “event of
interest” arrives. The following metrics are calculated based on these timestamps:
the LoRaWAN backbone delay is ND = T2 − T1; the MQTT broker delay is MDn =
T3n − T2; and the overall end-to-end application delay is ODn = T3n − T1. Time
dissemination is performed by means of TM1000A NTP time servers, each one
UTC-synchronized via a GPS receiver. The NetSuite is natively UTC-synchronized.
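The three delay metrics follow directly from the timestamps, as the minimal Python sketch below shows. Timestamps are assumed to be integers in milliseconds on a common UTC time base, and the example values are hypothetical.

```python
def delay_metrics(t1, t2, t3_by_endpoint):
    """Compute ND, MDn and ODn from the three timestamps.

    t1: uplink transmission start (ms)
    t2: arrival of the event at the AS (ms)
    t3_by_endpoint: dict mapping endpoint name to reception time T3n (ms)
    """
    nd = t2 - t1                                                # backbone delay
    md = {ep: t3 - t2 for ep, t3 in t3_by_endpoint.items()}     # broker delay
    od = {ep: t3 - t1 for ep, t3 in t3_by_endpoint.items()}     # end-to-end delay
    return nd, md, od

# Hypothetical timestamps for one message
nd, md, od = delay_metrics(t1=0, t2=438, t3_by_endpoint={"EP1": 470, "EP2": 500, "EP3": 560})
```

By construction ODn = ND + MDn for every endpoint, which is a useful sanity check on logged data.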
The experiments last for one day, for a total of 1440 messages, one
transmitted every 60 s. Without loss of generality, the user message length is 30 B and
includes the transmission timestamp and a sequence number for sorting, resulting in
a time on air of about 226 ms (Spreading Factor = 7 and Coding Rate = 4/5).
Regarding the network delay, the average is NDAVE = 438 ms and the standard
deviation is NDSTD = 592 ms; however, it is worth noting that some outliers
exist, leading to a maximum value NDMAX = 4738 ms. The distributions of the MD
and OD metrics are reported in Fig. 10.6a and b, respectively. The three endpoints
(EP1, EP2 and EP3) have an average OD of about 500 ms, adequate for long-term
monitoring and for possible fall detection and notification. As expected, EP3
has the worst performance, due to the poor quality of the available Internet
connection.
10.5 Conclusions
In this work a wearable system for continuously tracking the physical activity of
elderly people has been proposed and described. Patient movements are collected by means
of a MEMS accelerometer and used by the local microcontroller to compute summary
activity-related parameters. The device is complemented by a LoRaWAN modem,
which exploits the LoRaWAN infrastructure to periodically update several supervisory
centers (e.g. hospitals) or the patient’s relatives. Doctors can then estimate whether the patient
is doing enough activity. Accelerometer data are used to detect falls as well;
in such a case, a notification is promptly sent.
References
Abstract This article reports the results of fault injection on a microcontroller based
on the RISC-V (Riscy) architecture. The fault injection approach uses fault simulation
based on Modelsim and targets a set of 1000 faults injected per microcontroller block
and per benchmark. The chosen benchmarks are Dhrystone and CoreMark, which
may represent generic workloads. The results show that certain blocks are more prone
to faults than others, as also confirmed by a vulnerability analysis that correlates the
number of observed faults with the rate of access to the blocks.
D. Asciolla
LIRMM, University of Montpellier, Montpellier, France
L. Dilillo
LIRMM, CNRS, University of Montpellier, Montpellier, France
D. Santos · D. Melo
Laboratory of Embedded and Distributed Systems, University of Vale do Itajaí, Itajaí, Brazil
A. Menicucci
Department of Space Engineering, Delft University of Technology, Delft, Netherlands
M. Ottavi (B)
Department of Electronics Engineering, University of Roma Tor Vergata, Rome, Italy
11.1 Introduction
The rest of the paper is organized as follows. Sections 11.2 and 11.3 detail the
RISC-V platform and fault injection methodology, respectively. Section 11.4 intro-
duces the chosen algorithmic benchmarks. Sections 11.5 and 11.6 present the simu-
lation results as well as an analysis of the microcontroller vulnerability. Conclusions
are given in Sect. 11.7.
11.2 Platform
The chosen RISC-V platform is the Parallel Ultra Low Power (PULP) Platform.
It has been designed by the Integrated Systems Laboratory (IIS) of ETH Zürich and the
Energy-efficient Embedded Systems (EEES) group of the University of Bologna
[5]. It is an open-source platform, which is very useful for our purposes because it is
possible to access all parts of the system. From this platform, the
Pulpino microcontroller has been chosen. It is built around the RISC-V Riscy and zero-riscy
cores [6]. Pulpino offers separate memories for instructions and data. It uses an AXI
(Advanced eXtensible Interface) bus as its main interconnect and a bridge to
APB (Advanced Peripheral Bus) for simple peripherals [6]. The architectural details
of the Pulpino platform [6] are shown in Fig. 11.1.
The choice of Riscy core for this study is based on the following reasons. Firstly,
it is a four-stage RISC-V core and it can run most of the typical workloads. It is a
32-bit core and for the chosen configuration, it can manage only integer numbers. It
implements the RV32I instruction set. All software that runs over this core has been
compiled using the GNU RISC-V Toolchain [7] with an optimization level equal
to 3.
An overview of the Riscy core architecture [6] is shown in Fig. 11.2. For the characterization
campaign, we focus on all sequential modules inside the core. The Riscy
core has four stages: Instruction Fetch, Instruction Decode, Execution and
Write Back. All stages are separated by an interface register. In the Instruction
Fetch stage, sequential parts are inside the prefetch buffer, the hardware loop control
module and the instruction fetch top-level registers. In the Instruction Decode
stage, sequential modules are inside the register file and the controller unit. In the
Execution stage, the control-state registers and the multiplier are both sequential
modules. In the Write Back stage, the load-store unit is the only sequential module.
The only architectural modification that was made consists of redefining the
state machines inside sequential modules using binary encoding instead of
symbolic labels. This modification was introduced to simplify the fault injection procedure.
To replicate the typical effects of the space radiation environment on electronics [8],
a simulation-based fault injection technique was chosen. This technique allows full
access to the entire processor without any architectural modifications. One of its
main disadvantages is that, being simulation based, it requires a long
time to run compared to execution on hardware emulators such as FPGA-based
ones [9]. This simulation-based strategy relies on TCL (Tool Command Language)
scripts, which allow signals to be manipulated for fault injection and fault
effects to be observed. The HDL (Hardware Description Language) simulator used for the system
simulation and to run the TCL scripts was Modelsim [10] from Mentor Graphics.
The fault model used is based on the occurrence of Single Event Upsets (SEUs) in sequential logic blocks. In each
simulation, a single fault is injected to cause a bit flip inside the chosen sequential
11 Characterization of a RISC-V Microcontroller … 95
block. Other effects, such as Multiple Bit Upsets (MBUs), Multiple Cell Upsets and Single
Event Latchups, are out of the scope of this study.
In the first step of the procedure, a golden simulation with no injected fault is
performed to obtain the reference data used to detect mismatches caused
by fault injections. Fault injection on the Riscy core is performed for all sequential
subsystems. For each sequential subsystem, 1000 simulations are performed, with one fault
injected per simulation run. The following steps [11] summarize the tasks executed
for each simulation:
(1) Selection of a flip-flop, in a given sequential subsystem, where the fault will
be injected. The flip-flop is selected at random from a list that contains all
signals in the VHDL code that implement registers. Each signal corresponds
to one bit of a register.
(2) Selection of a random instant at which the fault will be injected. In order to avoid a
fault during the logging process, the fault is injected before the reporting process.
(3) Simulation runs until the chosen injection instant.
(4) Injection of the fault by forcing a bit flip in the target sequential element.
(5) Simulation runs until the end of the algorithmic benchmark.
(6) Making a copy of the register file content.
(7) Storing the print out of the program results.
If exceptions are generated during the execution, they are stored in a file, and if
the core does not respond within a threshold time, a corresponding log is generated.
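The flow of steps (1) to (5), together with the golden run, can be illustrated with a deliberately tiny Python model. This is not the Modelsim/TCL setup used in the study: the "core" here is a hypothetical 8-bit accumulator that adds 1 per cycle, chosen only so that the sketch is self-contained.

```python
import random

def run_benchmark(n_cycles, fault=None, width=8):
    """Toy stand-in for a simulation run: a register accumulates 1 per cycle.

    fault is None (golden run) or a (cycle, bit) pair; the given bit of the
    register is flipped at the start of that cycle.
    """
    mask = (1 << width) - 1
    reg = 0
    for cycle in range(n_cycles):
        if fault is not None and fault[0] == cycle:
            reg ^= 1 << fault[1]      # step (4): force a bit flip
        reg = (reg + 1) & mask        # normal operation of the toy core
    return reg

def campaign(n_runs, n_cycles=100, width=8, seed=0):
    """Run a golden simulation, then n_runs single-fault simulations."""
    rng = random.Random(seed)
    golden = run_benchmark(n_cycles, width=width)   # fault-free reference
    mismatches = 0
    for _ in range(n_runs):
        bit = rng.randrange(width)        # step (1): random flip-flop (bit)
        cycle = rng.randrange(n_cycles)   # step (2): random injection instant
        result = run_benchmark(n_cycles, (cycle, bit), width)  # steps (3)-(5)
        mismatches += result != golden
    return golden, mismatches

golden_value, n_mismatches = campaign(n_runs=1000)
```

In this toy model every flip reaches the final value, so all runs mismatch; in a real core many faults are overwritten or never read, which is why the outcomes need to be classified.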
Data obtained during the simulation campaign are used to classify fault effects that
can be summarized in five categories [11] that are listed below:
• No Effect—The simulation finishes obtaining the correct result from the program
and the content of the register file is equal to the reference one.
• Latent—The simulation finishes obtaining the right result, but the content of the
register file is not equal to the reference.
• Wrong result—The system has a failure and the simulation finishes obtaining the
wrong program result.
• Timed out—The simulation takes an abnormal amount of time to finish the program
execution compared to the reference.
• Exceptions—The core generates exceptions during the simulation.
96 D. Asciolla et al.
Latent errors potentially can propagate and lead to a system failure in the future, but
these errors may also be masked by the normal core functioning.
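The five categories can be expressed as a small decision function, sketched below in Python. The order in which the conditions are checked (exceptions first, then timeouts, then the program result, then the register file) is an assumption made for this sketch; the source does not state how runs showing more than one symptom are counted.

```python
def classify_run(result_matches, regfile_matches, timed_out, exceptions_raised):
    """Map the observations of one faulty run to a fault-effect category.

    result_matches: program output equals the golden output
    regfile_matches: final register file equals the golden register file
    timed_out: the run exceeded the time threshold
    exceptions_raised: the core generated at least one exception
    """
    if exceptions_raised:
        return "Exceptions"
    if timed_out:
        return "Timed out"
    if not result_matches:
        return "Wrong result"
    if not regfile_matches:
        return "Latent"
    return "No Effect"
```

For example, a run whose printed result is correct but whose register file differs from the reference is reported as "Latent".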
This section presents and discusses the results of the fault injection campaign and
the measure of the utilization of sequential modules. These simulations are useful
for vulnerability estimation.
The resource utilization has been measured using a Modelsim simulation running a
TCL script. CoreMark is configured to perform 1 cycle, while Dhrystone performs 1000 cycles.
In this study, the focus is on the sequential parts inside the modules that are the target
of the characterization through fault injection. A simulation is performed for both Dhrystone and
CoreMark, obtaining the resource utilization over the entire simulation
time. The resource utilization is measured by counting
how many times the value stored in a given register changes from the beginning to
the end of the simulation.
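The counting described above can be sketched in Python as follows, assuming the register value has been sampled once per clock cycle into a list (the trace format is an assumption of this sketch, not the format produced by the TCL script).

```python
def count_changes(trace):
    """Number of times consecutive samples of a register differ."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)

def utilization(trace):
    """Fraction of clock cycles in which the register value changed."""
    if len(trace) < 2:
        return 0.0
    return count_changes(trace) / (len(trace) - 1)

# Hypothetical per-cycle samples of one register
example_trace = [0, 0, 1, 1, 2]
```

A register that never changes during a benchmark, such as the multiplier registers mentioned below, gets a utilization of zero under this definition.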
The workload is similar for both simulations, as shown in Fig. 11.3, for Dhry-
stone and CoreMark. The most used modules are the prefetch unit, the instruction
fetch-instruction decode pipeline register and the instruction decode-execute pipeline
register. There are modules that are never used during the benchmark execution like
the hardware loop for Dhrystone and the multiplier registers and the interrupt con-
troller for both benchmarks (in the run simulations).
Figure 11.4 shows the results of the simulation campaign for the Dhrystone workload. As mentioned above,
for each microcontroller block, 1000 simulations were performed, with one fault
injected in each run.
This procedure is repeated for each block, obtaining the results shown in the
plot.
The graph shows that the most critical sequential modules are the controller and
the register file, where injected faults cause a large number of latent errors and
wrong results. Although the controller is used less frequently
than the register file, it causes a large number of failures when subjected to fault
injection.
Exceptions are generated by faults in modules inside the instruction fetch stage, in the
instruction decode stage and in the execution-write back pipeline register. In this
core, exceptions are used to report a wrong instruction operation code.
The hardware loop module, the interrupt controller and the control-state registers
do not cause any failure when faults are injected.
Figure 11.5 shows the results of the simulation campaign for the CoreMark workload,
obtained with the same procedure used above.
As in the analysis of the other benchmark, the
most critical modules are the controller and the register file, which present a large
number of latent errors and wrong results. The controller is again accessed less
frequently than the register file, but it displays high vulnerability.
Exceptions are generated by faults in modules inside the instruction fetch stage, in the
instruction decode stage and in the execution-write back pipeline register. In this
core, exceptions are used to report a wrong instruction operation code.
The interrupt controller and the control-state registers do not cause any failure
when faults are injected. In this case, CoreMark stimulates the usage of the hardware
loop module, and we noticed system failures caused by faults injected inside this
module.
The multiplier module also shows fewer latent errors than with the Dhrystone
workload.
Results obtained from the fault injection campaigns present similar trends for the
two workloads. In this section, we try to calculate the vulnerability of each block of
the RISC-V Riscy core, by introducing the following equation:
$$v = \begin{cases} \dfrac{f}{u} \times c, & \text{if } u > 0 \\ 0, & \text{otherwise} \end{cases} \tag{11.1}$$
– v resource vulnerability.
– f failure rate. It is equal to the number of wrong results normalized to the number
of simulations;
– u resource utilization. For each module, it is equal to the number of clock cycles
of activity over the total number of clock cycles.
– c normalization constant equal to 100.
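Equation (11.1) translates directly into code. The Python sketch below follows the definitions of f, u and c given above; the numbers in the example are hypothetical and are not results from the campaign.

```python
def vulnerability(wrong_results, n_simulations, utilization, c=100.0):
    """Resource vulnerability v = (f / u) * c, or 0 when the resource is unused.

    wrong_results: number of runs that ended with a wrong result
    n_simulations: number of fault injection runs for this module
    utilization: fraction of clock cycles in which the module was active (u)
    """
    if utilization <= 0:
        return 0.0
    failure_rate = wrong_results / n_simulations   # f
    return failure_rate / utilization * c

# Hypothetical module: 50 wrong results out of 1000 runs, active 25% of the time
v_example = vulnerability(wrong_results=50, n_simulations=1000, utilization=0.25)
```

Dividing by the utilization means that a rarely used module with a given failure count is rated as more vulnerable than a heavily used module with the same count.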
This approach provides a general evaluation of the block vulnerability
that is independent of the benchmark algorithm used, since it is based on the
correlation between the number of detected failures and the actual use of the blocks
of the microcontroller. Figure 11.6 shows the vulnerability associated
with each block.
The plot uses a logarithmic axis to make the results easier to visualize.
It can be noticed that the vulnerabilities calculated
from the two campaigns are correlated. The subsystems that show a high vulnerability are the
instruction fetch registers, the controller and the register file.
A lower but significant vulnerability is shown by the prefetch unit, the
instruction fetch-instruction decode pipeline register, the instruction decode-execution
pipeline register, the ex-wb pipeline register and the load-store unit.
Vulnerability is normalized to the resource utilization for the chosen benchmark.
Vulnerability information for the hardware loop is available only from the
campaign using CoreMark. For Dhrystone this module was not used and it did not
generate system failures.
11.7 Conclusion
This paper presents a detailed analysis of SEU effects in the RISC-V Riscy
core. The results are based on data obtained from fault injection campaigns that use
a simulation-based injection technique. The workload is similar for the Dhrystone
and CoreMark benchmarks, which are taken as representative of generic applications.
From the simulation results, shown in Fig. 11.4 for the Dhrystone workload and in
Fig. 11.5 for the CoreMark workload, the most critical sequential modules are the
controller and the register file. Although the controller is used less
frequently than the register file, it causes a large number of failures when subjected
to fault injection. Exceptions are caused by faults injected in modules inside the
instruction fetch stage, in the instruction decode stage and in the execution-write
back pipeline register. CoreMark stimulates the usage of the hardware loop module,
and we noticed a relevant number of system failures caused by faults injected inside this module.
The interrupt controller and the control-state registers do not cause any failure
when faults are injected.
The most used resources are the prefetch unit, the instruction fetch-instruction
decode pipeline register and the instruction decode-execute pipeline register. Some
modules are never used during the benchmark execution, such as the hardware
loop for Dhrystone, and the multiplier registers and the interrupt controller for both
benchmarks (in the simulations that were run).
From the study of vulnerability, shown in Fig. 11.6, the most critical module
turns out to be the controller.
The vulnerability can be used to estimate, for each module, the system failure
rate when executing other software. This can be done simply by taking the product
of the tabulated vulnerability values and the utilization value measured for the
given application. This is important information for the design of a fault-tolerant
RISC-V core, since it can be used to evaluate the best redundancy techniques
in terms of time usage and hardening impact for each constituent block, on the basis
of its vulnerability.
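The estimate mentioned above amounts to inverting Eq. (11.1). A minimal Python sketch, assuming the normalization constant c = 100 so that the product of vulnerability and utilization is a percentage, could look as follows; the module names and values in the table are hypothetical placeholders.

```python
def estimated_failure_rate(vulnerability, utilization, c=100.0):
    """Invert v = (f / u) * c to recover f = v * u / c for a new workload."""
    return vulnerability * utilization / c

# Hypothetical tabulated vulnerabilities and utilizations measured for a new application
vulnerability_table = {"controller": 40.0, "register_file": 20.0}
measured_utilization = {"controller": 0.10, "register_file": 0.25}

estimates = {
    module: estimated_failure_rate(v, measured_utilization[module])
    for module, v in vulnerability_table.items()
}
```

The result is the expected fraction of injected faults that lead to a wrong result in each module for the new workload, under the assumption that vulnerability is workload independent.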
References
Abstract Machine learning in embedded systems has become a reality, with the
first tools for neural network firmware development already available
to ARM microcontroller developers. This paper explores the use of one such
tool, namely STM X-Cube-AI, on mainstream ARM Cortex-M microcontrollers,
analyzing its performance and comparing it with the support for and performance of two other
common supervised ML algorithms, namely Support Vector Machines (SVM) and
k-Nearest Neighbours (k-NN). Results on three datasets show that X-Cube-AI provides
consistently good performance even with the limitations of the embedded platform.
The workflow is well integrated with mainstream desktop tools, such as Tensorflow
and Keras.
12.1 Introduction
As seen above, NNs have gained momentum also in the embedded system field.
Our analysis focuses in particular on one of the above mentioned recently released
libraries, namely the STM X-Cube-AI expansion package, which is usable within the
STM32CubeMX configuration tool. The package provides automatic conversion of
pre-trained Neural Network and integration of generated optimized library into the
user’s project. The workflow we are accustomed to consists in developing a NN on a PC in
Python, using the Tensorflow library with Keras as a wrapper. We normalize the input vectors
in order to reduce the convergence time. Once the developer finds a NN configuration
providing acceptable accuracy according to tests on the PC, its model is saved in an
HDF5 file, which is imported by CubeMX. The CubeMX “Analyze” function then
estimates the memory footprint (Flash and RAM) and suggests a list of possible
target microcontrollers accordingly. Once the target is decided (or the developer
has checked the suitability of the target at hand), a new project can be started, including
the “AI-Application” and “X-CUBE-AI” packs. CubeMX then allows performing a
validation both on the desktop, which estimates complexity through the Multiply and
Accumulate Operation (MACC) figure, and on the target. Writing the C program for
the target, exploiting the “network” library, can be done in a few lines of code that
configure the network from the recorded weights, set the input and output tensors
and then execute the prediction.
As a term of comparison, we employed also the following two algorithms:
• Support Vector Machine (SVM). We used the sklearn Python framework for training
the SVM on the PC, with a linear kernel and the model selected through cross-validation.
sklearn does not support GPU acceleration, and its SVM implementation is
not able to exploit multi-core architectures. This is a limitation of our approach,
as the long training times prevented us from fully exploring the alternatives
(e.g., more complex kernels). The implementation on the target is as simple as
executing the y = w*x + b prediction, where x and y are the input and output, w
the weight vector (derived from the support vectors) and b the bias.
• k-nearest neighbours (k-NN). In k-NN, no model is learned, and all the training set
is recorded. We implemented the algorithm in C from scratch, using the Euclidean
distance criterion and majority voting.
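The two baseline predictors are simple enough to sketch in a few lines. The authors implemented them in C on the target; the Python version below is only an illustration of the same logic, with made-up training points, and the mapping of the SVM score to a class through its sign is an assumption of this sketch.

```python
import math
from collections import Counter

def linear_svm_predict(w, b, x):
    """Linear SVM prediction: sign of w.x + b, returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def knn_predict(train_x, train_y, x, k=1):
    """k-NN with Euclidean distance and majority voting over the k nearest."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with binary labels
TRAIN_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
TRAIN_Y = [0, 0, 1, 1]
```

The sketch makes the footprint difference visible: the SVM only needs w and b on the target, whereas k-NN has to store the whole training set and scan it for every prediction.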
We conducted the experimental analysis using two well established ARM microcon-
trollers produced by STM, namely an F401RE and an F746. The former belongs to
the mainstream Cortex-M4 family, the latter to the high performance M7. Results are
generally reported in Tables 12.1, 12.2, 12.3 and 12.4 for the F4 case only, while F7
is explicitly considered in Table 12.3. In all cases, we first developed the classifiers
106 V. Falbo et al.
on a PC, and then deployed on the target, with the needed adjustments, especially in
terms of performance.
We used three binary classification datasets: Sonar (209 samples × 60 features)
[11], the UCI Heart diseases available on Kaggle (303 × 13) [12], and Viruses (24,736
× 13), a data traffic analysis dataset developed by the University of Genova. All the
datasets are cast to float32, according to the target execution platform.
For the Sonar dataset (Table 12.1), we report data for a NN with two hidden dense
layers (40 and 30 tanh neurons each, after an initial ReLU dense layer with 100
nodes, and an output sigmoid node). With a more complex network (5 wider layers,
300 ReLU input), we get a Flash footprint of 253 kB, and a lower accuracy, of 86%.
For k-NN, the best k is 1. For all classifiers, accuracy is the same as on an i7 core
PC.
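As a rough plausibility check on footprint figures of this kind, the number of parameters and multiply-accumulate operations of a dense network can be counted by hand. The Python sketch below does this for one reading of the Sonar network described above (60 inputs, dense layers of 100, 40 and 30 nodes, one output node), assuming float32 weights. It counts weights and biases only, so it is a lower bound and not the figure reported by the CubeMX tool.

```python
def dense_stats(layer_sizes, bytes_per_weight=4):
    """Count parameters, MACC operations and weight bytes of a dense network.

    layer_sizes: [n_inputs, n_hidden_1, ..., n_outputs]
    """
    params = 0
    macc = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params += n_in * n_out + n_out   # weights plus biases
        macc += n_in * n_out             # one multiply-accumulate per weight
    return params, macc, params * bytes_per_weight

# Assumed reading of the Sonar network: 60 -> 100 -> 40 -> 30 -> 1
SONAR_LAYERS = [60, 100, 40, 30, 1]
n_params, n_macc, n_bytes = dense_stats(SONAR_LAYERS)
```

Activation functions, code size and runtime buffers add to the actual Flash and RAM usage on the microcontroller.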
For the Heart disease dataset (Table 12.2), feature selection (implemented through
the Orthogonal Matching Pursuit (OMP) algorithm) was necessary to improve the
12 Analyzing Machine Learning on Mainstream Microcontrollers 107
ML in embedded systems has become a reality, with the first tools for NN firmware
development already available to developers. Analyzing three different algorithms
with three different datasets, we saw that the NNs implemented by the STM
X-Cube-AI package provide consistently good performance even with the limitations
of the embedded platform. The workflow is well integrated with mainstream
desktop tools, such as Tensorflow and Keras. SVM also performs quite well, with
a small footprint, but its development is less well supported by tools compared to
NNs. For k-NN, it is known that performance tends to worsen as the training set size
increases [13].
Research still lacks publicly available IoT datasets, which would facilitate
experimentation by scholars and practitioners in different application domains.
For future work, we are interested in a more detailed analysis (particularly of
the space-time tradeoff), with different types of NNs and more relevant datasets.
Moreover, it will be interesting to study the performance and application of unsupervised
learning algorithms, which look even better suited to field deployment, as they do
not need human data processing for the training phase. Finally, given the limited
resources of the edge, distributing embedded ML computation is likely to become a
major architectural challenge in the upcoming years.
References
1. Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge computing: vision and challenges. IEEE
Internet Things J 3(5):637–646
2. https://www.arm.com/products/silicon-ip-cpu/machine-learning/project-trillium
3. https://pages.arm.com/machine-learning-on-arm-cortex-m-microcontroller.html
4. https://www.tensorflow.org/lite/guide/build_arm64
5. https://www.st.com/en/embedded-software/x-cube-ai.html
6. Parodi A, Bellotti F, Berta R, De Gloria A (2018) Developing a machine learning library for
microcontrollers. In: Saponara S, De Gloria A (eds) Applications in electronics pervading
industry, environment and society. ApplePies 2018. Lecture Notes in Electrical Engineering,
vol 550. Springer, Berlin
7. Andrade L, Prost-Boucle A, Pétrot F (2018) Overview of the state of the art in embedded
machine learning. In: 2018 Design, automation & test in Europe conference & exhibition
(DATE), Dresden, pp 1033–1038
8. Lai L, Suda N (2018) Enabling deep learning at the IoT edge. In: 2018 IEEE/ACM international
conference on computer-aided design (ICCAD), San Diego, CA, pp 1–6
9. Cerutti G, Prasad R, Farella E (2019) Convolutional neural network on embedded platform
for people presence detection in low resolution thermal images. In: ICASSP 2019—2019
IEEE international conference on acoustics, speech and signal processing (ICASSP), Brighton,
United Kingdom, pp 7610–7614
10. http://fidoproject.github.io/
11. http://fizyka.umk.pl/kis-old/projects/datasets.html#Sonar
12. https://www.kaggle.com/ronitf/heart-disease-uci
13. Islam MJ, Wu QMJ, Ahmadi M, Sid-Ahmed MA (2007) Investigating the performance of
Naive-Bayes classifiers and K-nearest neighbor classifiers. In: International conference on
convergence information technology (ICCIT 2007), Gyeongju, pp 1541–1546
Chapter 13
Quality Aware Selective ECC
for Approximate DRAM
Approximate computing is a design paradigm for low power systems that proposes
to expand the degrees of freedom in digital system design by allowing inaccurate
or approximate operations in circuits. It rests on the observation that many
real-world applications do not require exact mathematical computations, since their
input and output data are inherently affected by noise and errors. Approximate
memories are one of the building blocks of this approach: they are memory circuits
that do not store data exactly and indefinitely, but are affected by errors during
read/write operations or tend to spontaneously lose data with the passage of time [1, 2].
Depending on the technology, circuits for approximate memory have been proposed
by scaling Vdd for SRAMs [3] and by reducing the refresh rate below the nominal
value for DRAMs [4]. These circuit-level proposals lay the groundwork for practical
implementations that can be used in programmable architectures as main approximate
memory [5]. Applications that can tolerate a certain amount of errors can then
allocate their data structures and buffers in these memories. These applications,
called ETAs (Error Tolerant Applications), will produce an output with degraded
quality as an effect of using approximate memories. The final assumption of
approximate computing is that the amount of approximation (i.e. errors) can be
tailored to the specific problem, trading off energy savings up to the limit of
acceptable output quality.
The first and most intuitive approach is to design the memory array so that MSBs
are stored in exact bit cells. Considering DRAMs, [6] proposes using two different
refresh rates, one nominal and one reduced. Cell arrays are rearranged in such a
way that the nominal refresh rate is applied to bit cells for MSBs (exact MSBs),
while the reduced refresh rate is applied to bit cells for LSBs (approximate LSBs).
The number of exact MSBs and approximate LSBs depends on the application; for
example, for 32 bit words, 1 to 8 exact MSBs and, respectively, 31 to 24
approximate LSBs have been found to be of interest [7]. Requiring exact cells in an
approximate data word has a direct impact on the following characteristics:
– it raises output quality under the same error rate in LSB cells or, conversely, allows
for higher error rate in LSBs while meeting the required output quality;
– it reduces overall energy saving, since a portion of the cells is working at nominal
conditions;
– it increases circuit complexity requiring dual refresh rate in DRAMs.
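The split between exact MSBs and approximate LSBs can be illustrated with a minimal Python sketch. The function name `inject_lsb_errors`, the per-bit flip probability `p_flip` and the seeded random generator are assumptions made for illustration; they model a generic approximate memory, not the circuit proposed in [6].

```python
import random

WORD_BITS = 32

def inject_lsb_errors(word, exact_msbs, p_flip, rng):
    """Flip each approximate LSB of a 32-bit word with probability p_flip.

    The exact_msbs most significant bits are never touched, mimicking cells
    refreshed at the nominal rate (hypothetical model, for illustration only).
    """
    approx_bits = WORD_BITS - exact_msbs
    for bit in range(approx_bits):          # bit 0 is the LSB
        if rng.random() < p_flip:
            word ^= 1 << bit
    return word & 0xFFFFFFFF

rng = random.Random(0)                      # fixed seed for repeatability
original = 0x12345678
corrupted = inject_lsb_errors(original, exact_msbs=8, p_flip=0.5, rng=rng)
# The 8 exact MSBs survive, so the error magnitude stays below 2**24.
assert corrupted >> 24 == original >> 24
assert abs(corrupted - original) < 2**24
```

With 8 exact MSBs, even an unrealistically high flip probability cannot change the value by more than 2^24 − 1, which is why exact MSBs raise output quality at a given LSB error rate.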
The second approach results from exploration of the relation between output quality
and BER on the LSBs [7]. LSBs in a data word can be dropped and set to a constant
value (i.e. 0) with a marginal impact on output quality degradation. It is a technique
that is proposed since it achieves energy savings with a simple circuit implementation
(bit cells are powered off or even omitted).
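Bit dropping can be sketched in the same spirit. The helper `drop_lsbs` below is a hypothetical name; it only shows that forcing the `dropped` least significant bits to 0 bounds the absolute error by 2^dropped − 1.

```python
def drop_lsbs(word, dropped):
    """Read a 32-bit word whose `dropped` least significant bits are forced to 0."""
    mask = (0xFFFFFFFF << dropped) & 0xFFFFFFFF
    return word & mask

value = 0x00012345
approx = drop_lsbs(value, 4)
# Dropping d bits introduces an absolute error of at most 2**d - 1.
assert 0 <= value - approx <= 2**4 - 1
```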
Previous works have proposed using selective ECC in SRAM to reduce errors in
MSBs, either (1) by enlarging memory words as in classical ECC memory systems (e.g.
32 bit memory words are expanded to 36 bit, introducing 4 ECC bits) [8] or (2) by
reusing dropped LSBs [9]. The contribution of our work is (1) to design selective
ECC specifically for approximate DRAM memory systems and (2) to allow tailoring
selective ECC to the specific application, by first analyzing how its output quality
degrades with bit error rate, looseness level and dropped bits.
The idea of quality aware selective ECC consists of a two-step process. First, an
application is analyzed in order to find the desired tradeoff between output quality
and approximate memory parameters (i.e. error rate, level of approximation [1],
dropped bits); then an error correcting code is chosen in order to reduce the error
rate in a specific portion of the data bits. In order to avoid increasing memory
requirements with additional ECC bits, bit dropping and reuse is always considered
for the additional check bits required by ECC.
112 G. Stazi et al.
In order to reduce hardware complexity, (n,k) SEC (single error correcting) Hamming
codes were considered. In this notation, k indicates the number of protected bits (data
bits), while n is the code length, including the additional check bits. We note that SEC
codes can also provide error detection (typically double error detection), but for
our purposes error detection is not used: in case of detected errors, program execution
continues as for undetected errors in approximate memory. Table 13.1 summarizes
the most common Hamming codes. We note that, as a general rule, increasing the
number of data bits k produces more efficient codes, since the rate k/n increases.
However, larger k are effective only at very small error rates (as is common in exact
memories). In approximate memories typical error rates are much larger (i.e. from
$10^{-4}$ to $10^{-2}$ errors/(bit × s) [1]) and, as a consequence, shorter codes are
desirable, since enlarging n increases the probability of multiple errors within the
same word, which cannot be corrected.
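The trade-off between code rate and the probability of uncorrectable words can be made concrete with a short Python sketch. It assumes the standard Hamming parameters n = 2^r − 1 and k = n − r, and treats the error rate as a per-bit probability `p` over the interval of interest; the function names are hypothetical.

```python
def hamming_params(r):
    """Return (n, k) for the Hamming code with r check bits."""
    n = 2**r - 1
    return n, n - r

def prob_uncorrectable(n, p):
    """Probability of two or more bit errors among n bits (beyond SEC capability)."""
    return 1.0 - (1.0 - p)**n - n * p * (1.0 - p)**(n - 1)

for r in range(2, 7):
    n, k = hamming_params(r)
    print(f"({n},{k}) rate={k / n:.3f} "
          f"P(>=2 errors)={prob_uncorrectable(n, 1e-3):.2e}")
```

The printed rate grows with r, but so does the probability of two or more errors in the same codeword, which is the effect that favors short codes at the error rates of approximate memories.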
By Looseness Level we mean the concept, introduced in [1], of having a certain
number of exact MSBs in an approximate data word. As an example, Table 13.2
reports results obtained on a 32 bit integer FIR filter, showing how the Looseness Level
(i.e. the number of exact MSBs) can impact output SNR.
Instead of using exact DRAM cells for MSBs, the idea is to use a single, and
slower, refresh rate for all cells, while using SEC ECC in order to reduce error rate
in MSBs. In this way, MSBs are still affected by errors, but their error rate is reduced
with respect to LSB cells.
Table 13.3 reports results obtained on the same 32 bit integer FIR filter, showing how
bit dropping (i.e. powering them off and reading them as ‘0’) impacts output SNR.
As already confirmed in literature, output SNR is only slightly dependent on LSBs.
Instead of powering them off, these LSBs can be effectively reused as checkbits for
the MSBs, without requiring additional bits.
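The reuse of dropped LSBs as check bits can be sketched in Python for the Hamming (7,4) case, where the 4 MSBs of a 32 bit word are protected and the 3 LSBs hold the check bits. The bit positions, the function names `store` and `load`, and the choice of reading the check-bit positions back as '0' are assumptions made for illustration; they are not taken from the formats discussed in this chapter.

```python
DATA_POS = (3, 5, 6, 7)     # codeword positions (1-based) holding data bits
PARITY_POS = (1, 2, 4)      # codeword positions holding check bits

def _parity_bits(code):
    """Compute the three check bits for a dict mapping position -> data bit."""
    return {p: sum(code[q] for q in DATA_POS if q & p) % 2 for p in PARITY_POS}

def _data_bits(word):
    """Map word bits 31..28 onto codeword positions 3, 5, 6, 7."""
    return {pos: (word >> (31 - i)) & 1 for i, pos in enumerate(DATA_POS)}

def store(word):
    """Overwrite the 3 LSBs of a 32-bit word with check bits for its 4 MSBs."""
    par = _parity_bits(_data_bits(word))
    check = par[1] | (par[2] << 1) | (par[4] << 2)
    return (word & 0xFFFFFFF8) | check

def load(stored):
    """Correct at most one error among the 4 MSBs and the 3 check bits."""
    par = _parity_bits(_data_bits(stored))
    syndrome = 0
    for j, p in enumerate(PARITY_POS):
        if par[p] != (stored >> j) & 1:   # recomputed vs stored check bit
            syndrome |= p
    if syndrome in DATA_POS:              # the error hit a protected data bit
        stored ^= 1 << (31 - DATA_POS.index(syndrome))
    return stored & 0xFFFFFFF8            # dropped LSBs are read back as 0

word = 0xA5A5A5A5
flipped = store(word) ^ (1 << 31)         # single error on the top MSB
assert load(flipped) == word & 0xFFFFFFF8
```

The 25 bits between the protected MSBs and the check bits are left unprotected, so an error there passes through unchanged, as in any approximate word.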
Given the list of Hamming codes in Table 13.1, the most suitable codes for
our application appear to be Hamming (3,1), (7,4) and (15,11). This choice depends on
two factors. First, we assume that single 32 bit words in memory are protected, in
order not to impact read/write speed; in fact, protecting a larger data size with a
single code would require multiple read/write operations to access the entire data.
Second, given the relatively high bit error rate of approximate memories, longer SEC
codes tend to fail due to the rising probability of multiple errors.
Figure 13.1 shows the formats considered for 32 bit data, where k MSBs are
protected by SEC ECC, 32 − n bits are left unprotected and n − k dropped and
reused as checkbits. Assuming a uniform error probability $p_e$ for each bit, expressed
as errors/(bit × s), the probability of having i errors in a set of n bits is:
$$P_e(n, i) = \binom{n}{i} p_e^{i} (1 - p_e)^{n-i};$$
Considering the SEC ECC code, protected bits will contain errors for i ≥ 2; hence:
$$P_{ecc_e}(n) = \sum_{i=2}^{n} P_e(n, i) = \sum_{i=2}^{n} \binom{n}{i} p_e^{i} (1 - p_e)^{n-i};$$
In order to get a measure of the improvement, we can find the equivalent error rate
$p_{eq_e}$, considered as the error rate that n bits (without ECC) should have to produce
the same $P_{ecc_e}(n)$.
$$P_{eq_e}(n) = \sum_{i=1}^{n} P_e(n, i) = 1 - \sum_{i=0}^{0} P_e(n, i) = 1 - (1 - p_{eq_e})^{n};$$
Equivalent bit error rate $p_{eq_e}$ for ECC protected bits can be obtained with
$P_{eq_e}(n) = P_{ecc_e}(n)$:
$$p_{eq_e} = 1 - \sqrt[n]{1 - P_{ecc_e}(n)};$$
Assuming 32 bit data stored in approximate memory, Fig. 13.1 summarizes how selective
ECC could be applied using Hamming (3,1), (7,4) and (15,11) codes. The most
appropriate choice depends on the application; for example, according to Tables 13.2
and 13.3, a range from 8 to 12 protected MSBs results in an output SNR between 60
and 93 dB, while dropping 4 LSBs does not significantly impact SNR. In this case
Hamming (15,11) seems the most suitable choice.
Table 13.4 reports the results that can be obtained on a typical target application
considering the previous Hamming codes. It shows that MSBs protected by SEC
codes exhibit an equivalent BER significantly lower than that of unprotected bits.
Considering the previous example, Hamming (15,11) and a BER of $10^{-3}$ on cells
produce an equivalent BER of $6.94 \times 10^{-6}$ on MSBs.
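The equivalent error rate can be evaluated numerically with a few lines of Python. This is a minimal sketch of the expressions for the probability of two or more errors in an n-bit SEC codeword and for the equivalent per-bit error rate; the function names are hypothetical, and `expm1`/`log1p` are used only to limit rounding error at small probabilities.

```python
from math import comb, expm1, log1p

def p_word_error_sec(n, pe):
    """Probability of 2 or more errors in an n-bit SEC codeword."""
    return sum(comb(n, i) * pe**i * (1 - pe)**(n - i) for i in range(2, n + 1))

def equivalent_ber(n, pe):
    """Per-bit error rate of n unprotected bits giving the same word error probability.

    Computes 1 - (1 - P)**(1/n) in a numerically careful way.
    """
    return -expm1(log1p(-p_word_error_sec(n, pe)) / n)

for n, k in ((3, 1), (7, 4), (15, 11)):
    for pe in (1e-3, 1e-4):
        print(f"Hamming({n},{k}) pe={pe:.0e} -> p_eq={equivalent_ber(n, pe):.2e}")

# Hamming (15,11) at a cell BER of 1e-3 reproduces the value quoted in the text.
assert abs(equivalent_ber(15, 1e-3) - 6.94e-6) < 5e-8
```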
13.5 Conclusion
In this paper we proposed the use of selective ECC in approximate DRAM memory
tailored to the quality requirements of applications. We started from the consideration
that many works and use cases have demonstrated the effectiveness of
limiting approximate cells to LSBs while leaving a portion of MSBs exact. However,
this approach requires higher complexity in memory circuits and in the circuits
surrounding the cell array. For DRAMs, it requires producing and distributing multiple
refresh rates in the array.
Due to the relatively high error rates in approximate memories, SEC codes reduce
but do not eliminate errors. This is completely acceptable, and we showed that
for typical error rates in the order of $10^{-3}$ to $10^{-4}$, Hamming codes (7,4) and
(15,11) can reduce the error rate on MSBs by a factor between 100 and 1000. Future
work will implement the technique in simulation models and apply it to error tolerant
applications, allowing its characterization and comparison with previous
techniques.
References
14.1 Introduction
The Hash DRBG family is based on SHA1 and SHA2 functions, but only SHA2
cryptographic primitives are considered here, since SHA1 offers low security strength
and is considered outdated. The parameters of a DRBG mechanism based
on a SHA2 hash function are reported in Table 14.1.
The CTR DRBG mechanism is based on a block cipher core used in counter mode.
The parameters of this mechanism are listed in Table 14.2.
Concerning Hash DRBG, the characteristics of the available SHA2 IP cores are listed
in Table 14.3. SHA-224 and SHA-384 are discarded from the options, since they
offer a shorter output block while keeping the same area and latency as SHA-256
and SHA-512, respectively. The two remaining functions show some differences:
• SHA-256 has lower latency per block than SHA-512, but the latter offers a higher
throughput since it provides 512 bit every 80 clock cycles;
• comparing the areas, SHA-256 is more compact, and this is also reflected in the
area footprint of the internal state registers: as can be seen in Table 14.1, the variable
seedlen is 440 for SHA-256 and 888 for SHA-512; this implies that the internal
state requires around 900 registers for the former and 1800 for the latter.
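The register counts quoted above can be checked with a small Python sketch. It assumes that the internal state consists of the two seedlen-bit values V and C plus a 20-bit reseed counter, as in the architecture described later in this chapter; applying the same 20-bit counter to SHA-512 is an assumption made here for illustration.

```python
def state_registers(seedlen, reseed_counter_bits=20):
    """Flip-flops for V and C (seedlen bits each) plus the reseed counter."""
    return 2 * seedlen + reseed_counter_bits

sha256_regs = state_registers(440)   # 440 + 440 + 20
sha512_regs = state_registers(888)   # 888 + 888 + 20
assert sha256_regs == 900
assert sha512_regs == 1796           # "around 1800" in the text
```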
Now, the expected throughput of these two hash functions during generation phase
in a Hash DRBG implementation can be calculated:
CTR DRBG proved to be best in class for both area and throughput. The characteristics
of the available AES IP core are presented in Table 14.4 for AES-128 and
AES-256.
Since our focus is on implementations with the highest security strength, only AES-256
is considered for the trade-off. As shown in the table, area is lower than
SHA-256 and throughput is higher than SHA-512:
14 Digital Random Number Generator Hardware Accelerator … 121
Despite all these considerations, CTR DRBG was not chosen for implementation.
The reason lies in doubts about the actual capability of this mechanism
to reach maximum security strength. In [8], the author claims that, while
Hash-based DRBGs satisfy security requirements, block cipher-based ones should
be avoided, since the pseudo-random permutation inside each AES round coupled
with the counter mode outputs a sequence which is indeed distinguishable from a
random source. The choice ultimately fell on Hash DRBG, implemented with a SHA-256
core. This ensures a compact implementation of the mechanism and the possibility
of extending the design to support multiple cores to increase the throughput.
Figure 14.1 reports the characteristics in terms of logic complexity and throughput
of several DRBG implementations, relying on the available IP cores (SHA and AES)
as primitives, their features when synthesized on 45 nm standard-cell technology
[9] and the methods to construct a DRBG using such primitives [2].
Fig. 14.1 Comparison between NIST approved DRBG mechanisms based on logic complexity in
kGE and throughput
The design architecture of Hash DRBG with a SHA-256 core is shown in Fig. 14.2,
and it makes use of the following blocks:
• state registers for V, C and the Reseed counter, with lengths of 440, 440
and 20 bits respectively, a 128-bit register to store an optional personalization string
for internal state randomization, and a 512-bit entropy register to store the input
entropy content;
• a SHA-256 core with 512-bit input and 256-bit output, with a latency of 64 clock
cycles;
• a serial adder with 440-bit inputs and a 440-bit output (i.e. addition modulo 2^440),
which works in parallel with the SHA-256 core and stores the result of the addition
into one of its input registers, as shown in Fig. 14.2, in order to minimize area occupation;
• a multiplexer network to route all data in the internal state and from the previous
operation to the inputs of the SHA-256 core and adder;
• a Finite State Machine (FSM), which controls the flow of operations, i.e., instantiate,
reseed and generate;
• a DRBG self-test module (not present in Fig. 14.2), in order to diagnose possible
failures inside the circuitry.
122 L. Baldanzi et al.
[Fig. 14.2 block diagram: registers V (440 bit), C (440 bit), Reseed Count (20 bit), Pers. String (128 bit) and Entropy Content Reg. (512 bit), Multiplexer Network and FSM; bus widths 440, 440, 512, 256 and 440 bit]
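How the SHA-256 core and the 440-bit adder cooperate during the generate phase can be sketched in software. The Python code below follows our reading of the Hash DRBG generate procedure in [2]; it is not the RTL of the proposed IP-core and has not been checked against the NIST test vectors. The function names, the dictionary-based state and the requirement that `n_bits` be a multiple of 8 are assumptions; instantiation and reseeding (which derive V and C from the entropy input) are omitted.

```python
import hashlib

SEEDLEN = 440                      # bits, seedlen for SHA-256
MOD = 1 << SEEDLEN                 # the adder works modulo 2**440

def _to_bytes(v: int) -> bytes:
    """Serialize a 440-bit state value as 55 big-endian bytes."""
    return v.to_bytes(SEEDLEN // 8, "big")

def hashgen(v: int, n_bits: int) -> bytes:
    """Hash successive values V, V+1, ... and concatenate the digests."""
    out = b""
    data = v
    while len(out) * 8 < n_bits:
        out += hashlib.sha256(_to_bytes(data)).digest()
        data = (data + 1) % MOD    # increment performed by the 440-bit adder
    return out[: n_bits // 8]

def generate(state: dict, n_bits: int) -> bytes:
    """Serve one generate request and update V and the reseed counter."""
    bits = hashgen(state["V"], n_bits)
    h = int.from_bytes(hashlib.sha256(b"\x03" + _to_bytes(state["V"])).digest(), "big")
    state["V"] = (state["V"] + h + state["C"] + state["reseed_counter"]) % MOD
    state["reseed_counter"] += 1
    return bits

# Placeholder state: real values of V and C come from the instantiate phase.
state = {"V": 1, "C": 2, "reseed_counter": 1}
block = generate(state, 512)       # two SHA-256 invocations
assert len(block) == 64 and state["reseed_counter"] == 2
```

In this sketch every 256 output bits cost one hash invocation, while the additions on V can proceed alongside the hashing, which is consistent with the adder working in parallel with the SHA-256 core.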
14.4 Results
For the Hash DRBG IP-core characterization, two different technologies have been
identified as representative of potential targets for implementations of such a hardware
accelerator for security applications: Intel Stratix IV FPGA and Silvaco PDK 45 nm
Open Cell Library [9] (i.e., ASIC standard-cell technology). In both cases different
implementation effort corners were tested, in order to evaluate the trade-off between
throughput and area. Concerning the Intel Stratix IV FPGA technology, the synthesis
and layout flow performed with high performance constraints gives a maximum
operating frequency of 180 MHz, meaning a throughput of 720 Mbps for
the single core instance, with an overall occupation of 5949 ALMs (Adaptive Logic
Modules). The implementation on the Silvaco ASIC standard-cell library reaches a
throughput of up to 4 Gbps, since the maximum frequency is 1 GHz,
again for the single core version of the IP-core, with a logic complexity of 118.98 kGE
corresponding to an area of approximately 0.094 mm².
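The reported throughput figures are consistent with a simple model in which each SHA-256 core delivers 256 output bits every 64 clock cycles. The Python sketch below states that model explicitly; the function name and the `cores` parameter are hypothetical, and control overhead is assumed to be negligible.

```python
def drbg_throughput_bps(out_bits, latency_cycles, f_hz, cores=1):
    """Peak generation throughput if each core yields out_bits every latency_cycles."""
    return cores * out_bits * f_hz / latency_cycles

fpga = drbg_throughput_bps(256, 64, 180e6)   # Stratix IV at 180 MHz
asic = drbg_throughput_bps(256, 64, 1e9)     # 45 nm standard cell at 1 GHz
assert fpga == 720e6                         # 720 Mbps
assert asic == 4e9                           # 4 Gbps
```

Under the same assumption, a multi-core extension of the design would scale the throughput linearly with `cores`.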
14.5 Conclusions
This paper presented the IP-core design of a digital Random Number Generator
(RNG), one of the most significant parts required to implement algorithms for
authentication, confidentiality, message integrity and security applications in general.
The proposed architecture is based on one of the Deterministic Random Bit Generators
(DRBGs) approved by NIST, selected through a trade-off analysis between throughput,
area and security strength. Hash DRBG with SHA-256 as cryptographic core proved
to be the most efficient solution in terms of throughput per logic complexity among
the solutions offering maximum security strength (i.e., 256 bits).
The resulting RNG IP-core has been tested by means of the NIST Statistical Test
Suite, which indicates that the generated bit sequences cannot be distinguished from
a true random sequence of numbers, thereby validating its use for cryptographic
applications. It has also been implemented on FPGA and ASIC standard-cell technologies
for characterization. The implementation on Intel Stratix IV FPGA reported
a throughput of 720 Mbps at 180 MHz with a maximum occupation of about 6000
ALMs, while the synthesis on Silvaco 45 nm ASIC standard-cell [9] reported a
throughput of 4 Gbps at 1 GHz with a maximum logic complexity of about 119 kGE.
References
1. Barker E, Kelsey J (2016) Recommendation for Random Bit Generator (RBG) constructions.
Special Publication 800-90C, NIST
2. Barker E, Kelsey J (2015) Recommendation for random number generation using deterministic
random bit generators. Special Publication 800-90A, NIST
3. Lo Bello L, Mariani R, Mubeen S, Saponara S (2019) Recent advances and trends in on-board
embedded and networked automotive systems. IEEE Trans Ind Inf 15:1038–1051
4. Pelzl J, Paar C (2011) Understanding cryptography. Springer, Berlin
5. Dang QH (2015) Secure hash standard. Technical report, NIST
6. Dichtl M, Golić JD (2007) High speed true random number generation with logic gates only.
In: Cryptographic hardware and embedded systems—CHES 2007. Lecture Notes in Computer
Science, vol 4727. Springer, Berlin, pp 45–62
7. Vasyltsov I, Hambardzumyan E, Kim Y-S, Karpinskyy B (2008) Fast digital TRNG based
on metastable ring oscillator. In: Cryptographic hardware and embedded systems—CHES 2008.
Lecture Notes in Computer Science, vol 5154. Springer, Berlin, pp 164–180
8. Schmid M (2015) ECDSA—Application and implementation failures
9. Silvaco PDK 45nm Open Cell Library. https://www.silvaco.com/products/nangate/FreePDK45_Open_Cell_Library/index.html
Chapter 15
An Energy Optimized JPEG Encoder
for Parallel Ultra-Low-Power
Processing-Platforms
Abstract The energy autonomy and the lifetime of battery-operated sensors are
primary concerns in industrial, healthcare and IoT applications, in particular when a
high amount of data needs to be sent wirelessly such as in Wireless Camera Sensors
(WCS). Onboard real-time image compression is the appropriate solution to decrease
the system’s energy. This paper proposes an optimized algorithm implementation
tailored for PULP (Parallel Ultra Low Power) processors, which shrinks
the image size and the data to transmit. Our optimized JPEG encoder, based on
a Fast Discrete Cosine Transform (DCT) function, is designed to achieve the best
trade-off between energy consumption and image distortion. The parallel software
implementation requires only 0.495 mJ per frame and can support up to 80 fps,
satisfying the most stringent requirements of WCS applications without requiring a
dedicated hardware accelerator.
15.1 Introduction
The energy autonomy and the lifetime of battery-operated sensors are primary con-
cerns in industrial, healthcare, and IoT applications, in particular when a high amount
of data needs to be sent wirelessly. In this scenario, Wireless Camera Sensors are
usually left in the environment to acquire and transmit visual data [1, 2]. From a
system-level viewpoint, the energy consumption is dominated by the radio subsys-
tem and is proportional to the number of bytes to transfer [3–5]. Concerning WCSs,
on-board real-time image compression is the appropriate solution to decrease the sys-
tem’s energy [6, 7]. In fact, bringing the intelligence close to the sensor enables the
reduction of transmission costs thanks to the compression of the data dimensionality
[8].
Executing computationally heavy tasks, such as an image compression pipeline,
without assuming a dedicated hardware acceleration engine (which may not be available
or affordable for cost reasons) typically requires adequate computing capabilities
and a large memory footprint. However, because of the available energy supply
resources (i.e., small batteries or inefficient energy harvesters) [9], WCSs usually
include low-power MCUs (e.g., ARM Cortex-M or RISC-V PULP), which offer
limited resources that can prevent the execution of data filtering tasks under real-time
constraints [10]. To address this challenge, we propose an optimized image compression
algorithm implementation tailored for a RISC-V multi-core processor, which
shrinks the image size and the data to transmit. We developed an optimized JPEG
(Joint Photographic Experts Group) encoder based on the Fast-DCT (FDCT) image
compression algorithm, with an adaptive trade-off between energy consumption and
image distortion. Our software solution is tailored for parallel fixed-point computing
hardware and exploits the DSP-oriented instructions included in the RISC-V
extended ISA (Instruction Set Architecture) of PULP. When compared with a JPEG
implementation on ARM Cortex-M4, our solution achieves a frame rate of 22 fps
and is eight times more energy-efficient when running on the GAP-8 processor, an
eight-core embodiment of the PULP architecture.
Table 15.1 Number of cycles required to execute the JPEG algorithm on different implementations
Functions                      Cycles 1       Cycles 2      Cycles 3     Parallel
DCT + Zig-Zag (8 × 8 block)    107,500        1947          1873         1611
Quantization (8 × 8 block)     2539           2539          2220         2368
Huffman (8 × 8 block)          984            984           984          802
Total (all image)              130,147,237    13,072,581    6,092,954    2,307,672
MSE                            53             100           100          100
PSNR (dB)                      31             29            29           29
Speedup                        –              10            21           56
The first fast DCT was proposed by Chen [14], which has an excellent regular struc-
ture, but it requires as many as 16 multiplications for each 8-point block. Hou [15]
proposed a recursive algorithm, with 12 multiplications and 29 additions. Although
the number of operations is the same as other fast algorithms, it has the advantage
of the smaller number of variables necessary for the execution. The function proposed
by Loeffler [16] involves 11 multiplications and 29 additions. Additionally, the
authors proposed a parallel solution that simultaneously executes three multiplica-
tions. Finally, the algorithm proposed by Arai [17] features a simplification of the
DCT processing. It requires only 5 multiplications and 29 additions. Moreover, it can
be easily implemented with fixed-point operations, speeding up the code execution
in the absence of a Floating Point Unit (FPU). The aforementioned works make clear
that using an optimized DCT algorithm heavily decreases the number of operations
required by the JPEG encoder and that a parallelized execution can be applied.
In this work, we based our development on Noritsuna, a JPEG encoder optimized
for Cortex-M4 [18]. This implementation supports floating-point operation with low
memory impact, but it is not tailored for real-time compression since it is based
on a non-fast DCT algorithm (Table 15.1, Cycles 1). To overcome this issue, we
replace the DCT algorithm with the Arai [17] FDCT implementation. On the other
hand, the Noritsuna implementation operates on individual 8 × 8 image blocks,
hence demanding a low L1 memory footprint and favoring a block-wise parallelization
scheme for multi-core implementation. After an in-depth study, we selected the
application described in [19] as a comparison for this paper; indeed, it needs only
10 Mcycles (220 ms @ 48 MHz) to compress a QVGA grayscale frame, about
8 Kcycles/block, one of the best performances on a low-power ARM Cortex-M4.
Similarly to our solution, this implementation exploits a fast DCT, but it is optimized
for a Cortex-M4 architecture featuring an L1 scratchpad memory of 80 kB (with QVGA
resolution), greater than the GAP-8 cluster memory. Among other solutions, the
authors of [12] describe an optimized firmware that needs 22–26 Mcycles to compress
a 752 × 480 pixel image in RGB format (~9 Kcycles/block), whereas the implementation
in [6] requires 300 Kcycles to process a single 8 × 8 block, with an average execution
time of 9207 ms on a Texas Instruments MSP430. The deployment in [6] uses up to 29 mJ
to encode a single 128 × 128 picture. These latter implementations feature higher
energy consumption than our solution.
128 T. Polonelli et al.
15.2 GAP-8
In this work, we use the GAP-8 SoC, a RISC-V ISA multi-core processor based on the
PULP open-source computing platform [20]. It integrates a state-of-the-art RISC-V
microcontroller core with a rich set of peripherals, and a powerful programmable
parallel processing engine for flexible multi-sensor (image, audio, inertial) data anal-
ysis. These two subsystems are shown in Fig. 15.2c and are respectively called Fabric
Controller (FC) and Cluster. The FC is an advanced MCU based on a RISC-V single-
core. It features an extended ISA for energy-efficient digital signal processing, and
it is equipped with a fast access-time data memory (L1). The 512 KB L2 memory
is used for storing the code and most of the volatile variables. The cluster, residing
on a dedicated frequency and voltage domain, is turned on when applications need
computation-intensive functions. It contains 8 RISC-V cores identical to the FC,
allowing the SoC to execute the same code on either the fabric controller or the clus-
ter. This 8-core cluster is served by a shared L1 data memory (64 kB). The shared L1
can serve all memory requests from the cores in the cluster with single-cycle access
latency and low average contention rate (<10% on data-intensive kernels).
Maximizing power efficiency is essential in low power devices;
hence, GAP-8 contains an internal DC/DC converter directly connected to an external
battery or energy harvester. It provides a voltage in the 1.0–1.2 V range when the
circuit is active.
The original version of the firmware [18] is composed of the following steps: (i)
generation of the header file; (ii) image decomposition into 8 × 8 pixel blocks (if
the overall dimensions are not integer multiples of 8, the missing blocks are padded
with values calculated from the average value on the edges), followed by level
shifting; (iii) application of the DCT to every block, followed by the quantization
and zigzag operations; (iv) Huffman encoding; (v) writing back the compressed data
into the L2 memory.
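The zigzag and quantization operations of step (iii) can be illustrated with a short Python sketch. The scan order is the usual JPEG one (anti-diagonals traversed in alternating directions); the function names and the use of Python's built-in `round` are assumptions for illustration, not the code of [18].

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of the zigzag scan of an n x n block."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend((r, s - r) for r in (reversed(rows) if s % 2 == 0 else rows))
    return order

def quantize_and_scan(coeffs, qtable):
    """Quantize an 8 x 8 coefficient block and emit it in zigzag order."""
    # round() is a simplification; an encoder may use a different rounding rule.
    return [round(coeffs[r][c] / qtable[r][c]) for r, c in zigzag_order(8)]

assert zigzag_order()[:6] == [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

The scan groups low-frequency coefficients first, so the many high-frequency coefficients quantized to zero end up in long runs that the entropy coder can compress efficiently.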
Since the GAP-8 architecture is not equipped with an FPU, all the operations are
implemented with a fixed-point representation. For this data type, we must select
in advance the number of bits dedicated to the integer and the fractional parts and,
depending on this choice, the JPEG encoder can achieve higher precision (increasing
the number of fractional bits) or a broader dynamic range. We identify the best
trade-off by selecting 15 bits for the fractional part and 16 bits for the integer part
(16Q15). To quantitatively evaluate the differences between the two representations, we
adopt as main metrics the Peak Signal to Noise Ratio (PSNR) and the Mean Squared
Error (MSE), since they are widely used in the scientific community as evaluation
indexes in the field of image processing [21]. The 16Q15 representation covers the
dynamic range required by the algorithm and increases the PSNR by 0.3%, while the
MSE is practically unchanged with respect to the original floating-point code.
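A fixed-point multiplication with 15 fractional bits can be sketched as follows. The helper names are hypothetical and Python integers are unbounded, so the sketch illustrates only the scaling, not the 32-bit register behavior or the built-in instructions of GAP-8.

```python
FRAC_BITS = 15

def to_fixed(x):
    """Convert a real number to fixed point with 15 fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q):
    """Convert a fixed-point value back to a real number."""
    return q / (1 << FRAC_BITS)

def fixed_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 30 fractional bits, so it is shifted right by 15
    (truncating toward minus infinity) to return to the original format.
    """
    return (a * b) >> FRAC_BITS

c = to_fixed(0.70710678)          # cos(pi/4), a typical DCT constant
x = to_fixed(100.0)
y = fixed_mul(c, x)
assert abs(to_float(y) - 70.710678) < 1e-2
```

Adding fractional bits shrinks the quantization step of constants such as `c`, while each extra fractional bit removes one bit from the integer range; this is the precision versus dynamic range trade-off discussed above.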
Table 15.1 (Cycles 1) reports the GAP-8 performance metrics for running the JPEG
implementation on a QVGA image (324 × 240). This initial version requires more
than 130 Mcycles; at 50 MHz the frame conversion time is approximately 2.5 s.
The latency breakdown identifies the DCT routine as the most onerous part from
a computational point of view, as we expected from the description of the firmware
in [18]. The two-dimensional DCT on an area of 8 × 8 pixels has been described
previously and, following the formula given in [18], it needs 3136 additions and 8192
multiplications, a considerable load on processors, especially RISC ones, where
multiplications require greater use of resources. The optimization usually focuses on
reducing the number of arithmetic operations to be performed during the DCT. Like
most fast algorithms, the one proposed by Arai, Agui, and Nakajima [17]
exploits the separability of the two-dimensional DCT and reduces it to the calculation
of a one-dimensional DCT on eight elements for all the rows and subsequently for
the columns. This algorithm is considered the fastest: it requires 29 additions and 5
multiplications for the 1D DCT, and 464 additions and 144 multiplications for the
2D DCT on the 8 × 8 block. The JPEG encoder performance with the AAN (Arai, Agui,
Nakajima) DCT is presented in Table 15.1 (Cycles 2). This change yields the major
improvement in performance, dropping the total number of cycles by
89%, mainly due to the relative reduction of 98% in the execution of the DCT. On
the other hand, since the AAN algorithm is an approximation of a standard DCT,
it has an impact on the quality of the output image, increasing the MSE by about
86%. However, as shown in Fig. 15.1, the image quality difference perceived by the
human eye is negligible despite the degradation of the MSE and PSNR indexes; hence
the AAN DCT can be considered a suitable replacement in our JPEG encoder.
The first (Cycles 1) implementation (Noritsuna [18]), written following the theoretical
definition of the 2D DCT, presents a complexity of $O(n^4)$; the AAN instead
reduces the complexity to $O(n \log_2 n)$, which explains the notable latency reduction.
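The separability exploited by fast algorithms can be verified with a small Python sketch that compares the 2D DCT computed from its definition with a row-column decomposition. The 1D transform used here is the plain orthonormal DCT-II, not the AAN factorization of [17], and all function names are hypothetical.

```python
import math

N = 8

def _scale(u):
    """Orthonormal DCT-II scaling factor."""
    return math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)

def dct_1d(v):
    """Reference 8-point DCT-II from its definition (not a fast algorithm)."""
    return [_scale(u) * sum(v[x] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            for x in range(N))
            for u in range(N)]

def dct_2d_direct(block):
    """2D DCT from the definition: four nested loops over an n x n block."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = _scale(u) * _scale(v) * acc
    return out

def dct_2d_separable(block):
    """Row-column decomposition: 8 row transforms, then 8 column transforms."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[x][v] for x in range(N)]) for v in range(N)]
    return [[cols[v][u] for v in range(N)] for u in range(N)]

block = [[((3 * x + 5 * y) % 17) - 8 for y in range(N)] for x in range(N)]
a, b = dct_2d_direct(block), dct_2d_separable(block)
assert all(abs(a[u][v] - b[u][v]) < 1e-9 for u in range(N) for v in range(N))
```

The decomposition needs only 16 one-dimensional transforms per block, so any saving in the 1D transform, such as the 5 multiplications of the AAN scheme, is multiplied across rows and columns.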
Fig. 15.1 Image quality comparison between both FAST-DCT (AAN) and DCT algorithms
A third optimization step for sequential execution is performed using the hardware
features available on the GAP-8 SoC, such as the DSP-oriented extended-ISA instructions
(built-ins) and the single-cycle access memory (L1). The built-in functions are
extensions of the RISC-V instruction set, developed to speed up some computationally
heavy operations. Among the most commonly used, we exploit the Multiply
Accumulate (MAC) instructions, which multiply two variables and accumulate the
partial sums, and FIXED_MUL, which multiplies two fixed-point variables in
a single cycle. The final number of cycles required for an in-line execution is
presented in Table 15.1 (Cycles 3).
To run the JPEG encoder on the GAP-8 cluster, the algorithm steps are executed
by making use of the available 8 RISC-V cores. The initial section of the JPEG
file header can be performed only once at the beginning of the program since it is
fixed (Fig. 15.2a—Header Writing). The rest of the JPEG algorithm workload is
distributed among the cluster by letting any core operates on different image 8 ×
8 block (Fig. 15.2a—Multi-core functions). Indeed, during the compression of the
pictures, it is sufficient writing to the output file (L2) the bytes containing only the
information concerning the actual image starting from the byte following the last of
the header (Fig. 15.2a—Footer writing). The image blocks reading function can be
easily performed in parallel, similarly to level shifting, discrete transform of cosines,
zigzag reordering, and quantization tasks. Instead, the Huffman task operates on
data produced by previous steps. Hence it is executed as a sequential task on a single
core. In addition to this, the Huffman encoding does not have a predefined number
of bits needed to encode a symbol, but the output is of variable length. For this
reason, it was considered necessary to separate this last step from parallel execution
Fig. 15.2 a JPEG sub-functions: the multi-core functions can be parallelized, whereas the
mono-core function must be executed sequentially; b energy per frame and maximum fps as a
function of the cluster frequency and voltage; c GAP-8 overview
15 An Energy Optimized JPEG Encoder for Parallel Ultra-Low-Power … 131
by executing it sequentially on a single cluster core. With the parallel execution of
the firmware, we reach 2,307,672 cycles (Table 15.1, Parallel) and a conversion time
of about 45 ms @ 50 MHz, which corresponds to 22 frames per second. The speedup
reached with parallel execution on eight cores is limited to 2.64 because the
Huffman encoding is executed sequentially.
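The figures above can be cross-checked with a short calculation. The sketch below converts the reported cycle count into time and frame rate, and uses Amdahl's law to estimate which fraction of the workload would have to be parallel to yield a speedup of 2.64 on eight cores. Applying Amdahl's law here is an assumption of this example (it ignores synchronization and memory contention), and the function names are illustrative.

```python
def cycles_to_ms(cycles, clock_hz):
    """Execution time in milliseconds for a given cycle count and clock frequency."""
    return cycles / clock_hz * 1e3


def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup when only `parallel_fraction` of the work scales with the cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


def amdahl_parallel_fraction(speedup, n_cores):
    """Invert Amdahl's law: parallel fraction implied by an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)


frame_ms = cycles_to_ms(2_307_672, 50e6)   # cycle count and clock taken from the text
fps = 1e3 / frame_ms
p = amdahl_parallel_fraction(2.64, 8)      # speedup and core count taken from the text

print(f"{frame_ms:.2f} ms per frame, {fps:.1f} fps, parallel fraction ~ {p:.2f}")
```

Under this simplified model, roughly 70% of the sequential workload is parallelizable, which is consistent with the sequential Huffman stage bounding the achievable speedup.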
15.5 Conclusions
In this paper, we present an optimized JPEG encoder based on the FDCT, which is
executed in parallel on GAP-8, a multi-core RISC-V SoC.
The encoder can reach up to 86 fps @ 200 MHz, but at 100 MHz the MCU
requires only 0.495 mJ to compress a frame, reaching the best trade-off between
the compression rate (46 fps) and the energy consumption. When compared with
a JPEG implementation on ARM Cortex-M4 (48 MHz), our solution (@ 50 MHz)
achieves a 4.8× higher frame rate and requires 8 times less energy to encode a
single image. Compared to Noritsuna [18], our solution requires 56× fewer
clock cycles. Lastly, we exploit the JPEG encoder in a real deployment,
a QVGA sensor with a Wi-Fi module. In this application, our solution can reduce
the system energy by up to 14× at 20 fps with respect to streaming raw images
through a Wi-Fi connection.
References
13. Rao K et al (2014) Discrete cosine transform: algorithms, advantages, applications. Academic
Press
14. Chen W et al (1977) A fast computational algorithm for the discrete cosine transform. IEEE
Trans Commun 25(9):1004–1009
15. Hou H (1987) A fast recursive algorithm for computing the discrete cosine transform. IEEE
Trans Acoust Speech Signal Process 35(10):1455–1461
16. Loeffler C et al (1989 May) Practical fast 1-D DCT algorithms with 11 multiplications. In:
International conference on acoustics, speech, and signal processing. IEEE, pp 988–991
17. Arai Y et al (1988) A fast DCT-SQ scheme for images. IEICE Trans (1976–1990) 71(11):1095–
1097
18. Noritsuna 2019, https://github.com/noritsuna/JPEGEncoder4Cortex-M. Available online: July
2019
19. Moodstocks 2016, https://github.com/Moodstocks/jpec. Available online: July 2019
20. Flamand E et al (2018 July) GAP-8: a RISC-V SoC for AI at the edge of the IoT. In: 2018 IEEE
29th international conference on application-specific systems, architectures and processors
(ASAP). IEEE, pp 1–4
21. Hore A et al (2010 Aug) Image quality metrics: PSNR vs. SSIM. In: 2010 20th international
conference on pattern recognition. IEEE, pp 2366–2369
22. Polonelli T et al (2018 Oct) Slotted ALOHA overlay on LoRaWAN-A distributed
synchronization approach. In: 2018 IEEE 16th international conference on embedded and
ubiquitous computing (EUC). IEEE, pp 129–132
23. Polonelli T et al (2019 Feb) Slotted ALOHA on LoRaWAN-design, analysis, and deployment.
In: Sensors (Switzerland), 19(4)
Part IV
VLSI & Signal Processing
Chapter 16
VLSI Architectures for the
Steerable-Discrete-Cosine-Transform
(SDCT)
Abstract Since the frame resolution of modern video streams is rapidly growing, the
need for more complex and efficient video compression methods arises. H.265/HEVC
represents the state of the art in video coding standards. Its architecture is, however,
not completely standardized, as many parts are only described at software level to
allow the designer to implement new compression techniques. This paper presents
an innovative hardware architecture for the Steerable Discrete Cosine Transform
(SDCT), which has been recently embedded into the HEVC standard, providing better
compression ratios. This technique exploits a directional DCT using bases with
different orientation angles, leading to a sparser representation, which translates to
improved coding efficiency. The final design is able to work at a frequency of
188 MHz, reaching a throughput of 3.00 GSample/s. In particular, this architecture
supports 8k Ultra High Definition (UHD) (7680 × 4320) with a frame rate of 60 Hz,
which is one of the highest resolutions supported by HEVC.
16.1 Introduction
In recent years, a large effort has been devoted to the field of video compression to
cope with the increasing demand for high-resolution multimedia content. The latest
standard proposed by the ITU-T and ISO/IEC groups is the H.265/HEVC compression
algorithm [8]. It extensively employs inter-frame and intra-frame prediction to exploit
the temporal and spatial redundancies present in video streams. H.265/HEVC
requires a significant computational load to detect and process the intra mode, so
many efforts have been made to lower the complexity [6] of the detection phase. The
difference between the predicted block and the actual block of pixels is called the
residual block, and it is lossily coded by means of transforms (Discrete Sine
Transform, DST, and Discrete Cosine Transform, DCT) and quantization. While the DST is
used only for the smallest block size, namely 4 × 4 pixels, the DCT is used for all the
other sizes, typically up to 32 × 32. Chen et al. [1] have shown how to reduce the
complexity of the Integer Cosine Transform, enabling solutions up to 64 × 64. Since the
DCT is increasing in complexity and computational load, faster and low-power
architectural solutions such as [7, 9] are required. Recently, Fracastoro et al. [3]
proposed a directional DCT, called Steerable DCT (SDCT), which is better suited than the
DCT to compress directional data. The SDCT is based on the work of Zeng et al. [10] and
makes it possible to divide the directional cosine transform into a traditional DCT
followed by a geometrical rotation. The kernels used for the SDCT differ from the DCT
ones as they depend on the steering angle, with the limit case of a 0 degree rotation,
for which the SDCT coincides with the DCT. This paper presents a low-power hardware
accelerator for the SDCT able to reach the throughput required by HEVC for the 8k
Ultra High Definition resolution of 7680 × 4320 pixels. The architecture is first
analysed in Sect. 16.2; Sect. 16.3 then presents the results obtained for the basic SDCT
accelerator and some implementations stemming from it.
where x are the input samples, x̂ are the results obtained by applying the transform
matrix T, R(θ) is the rotation matrix, and x̃ is the result of the SDCT. The SDCT
can thus be implemented as a DCT followed by a steering transformation. The DCT
part can be implemented as suggested in the literature, for example using a folded
architecture [5], and the rotations are applied when all the samples returned by the
2D-DCT are available. This means that the steering part of the architecture, which
handles the rotations, has to work faster than the DCT. This issue has been addressed
in this work, and one possible solution is to define two clock regimes, one for the
2D-DCT and a faster one for the steering part, in order to keep up with the throughput
offered by the 2D-DCT transform block. A FIFO memory between the two parts
acts as a buffer. The whole structure is depicted in Fig. 16.1. The 2D-DCT
block is based on the architecture proposed in [5] by Meher et al., which is very
efficient, especially in the folded fashion, and scalable to transforms of size 4, 8, 16
and 32. The steerable part is shown in Fig. 16.2. It is composed of an input memory
(IM), an output memory (OM) and the lifting blocks that perform the rotation [2].
Some multiplexers are used to bypass the lifting blocks in the case of no rotation,
directly returning the result given by the DCT. The IM is also required to reorder the
samples, as the steering process is computed on the custom zig-zag order given in
Fig. 16.3, which differs from the classic zig-zag ordering because the vectors are
rotated in pairs with respect to the diagonal elements. Rotation by lifting scheme:
$$
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
=
\begin{bmatrix} 1 & \frac{1-\cos\theta}{\sin\theta} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -\sin\theta & 1 \end{bmatrix}
\begin{bmatrix} 1 & \frac{1-\cos\theta}{\sin\theta} \\ 0 & 1 \end{bmatrix}
\qquad (16.2)
$$
In order to further simplify the architecture, the multiplications by the P and U
coefficients from Eq. 16.2,
$$
P = \frac{1-\cos\theta}{\sin\theta} \qquad (16.3)
$$
$$
U = -\sin\theta \qquad (16.4)
$$
are implemented as shift and add (Fig. 16.4), as the number of possible rotation
angles has been fixed to 8 (from 0, no rotation, to 7), as reported as optimum in [4]
by Masera et al. The steerable block thus introduces 2 × N clock cycles of latency
for the reordering stage plus 4 clock cycles due to the internal pipeline. Therefore,
in the event that all the SDCTs have a length N = 32, the latency is equal to 68 clock
cycles, which corresponds to the worst case.
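The lifting factorization and the latency figure can be checked numerically. The following Python sketch applies the three lifting (shear) steps with the P and U coefficients defined above and compares the result with a direct rotation. It works in floating point; the shift-and-add approximation of the coefficients used in hardware is not modelled, the angle values used for testing are arbitrary, and the function names are assumptions of this example.

```python
import math


def lifting_coefficients(theta):
    """Return (P, U) as in Eqs. 16.3 and 16.4; sin(theta) must be non-zero."""
    p = (1.0 - math.cos(theta)) / math.sin(theta)
    u = -math.sin(theta)
    return p, u


def rotate_by_lifting(a, b, theta):
    """Rotate the pair (a, b) using three lifting steps (right-most matrix first)."""
    if theta == 0.0:
        return a, b            # no rotation: bypass, as done by the multiplexers
    p, u = lifting_coefficients(theta)
    a = a + p * b              # first shear:  [1 P; 0 1]
    b = b + u * a              # second shear: [1 0; U 1]
    a = a + p * b              # third shear:  [1 P; 0 1]
    return a, b


def rotate_direct(a, b, theta):
    """Reference rotation with the matrix on the left-hand side of Eq. 16.2."""
    c, s = math.cos(theta), math.sin(theta)
    return c * a + s * b, -s * a + c * b


def steering_latency(n):
    """Latency in clock cycles: 2*N for reordering plus 4 pipeline stages."""
    return 2 * n + 4
```

Each lifting step updates one value from the other, so the rotation needs three multiplications instead of four.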
The unit presented so far is able to compute SDCT of lengths 4, 8, 16 and 32. This
type of structure has been designed to be implemented inside the HEVC standard.
However, this algorithm could also be used for video compression standards with
lower constraints and for image compression standards, such as JPEG. Therefore, two
reduced SDCT units have also been developed. The first, named SDCT-16, is able to
compute SDCTs of length 4, 8 and 16, while the second, named SDCT-8, computes
SDCTs of length 4 and 8. The throughput of these two units is reduced by
50% and 75% respectively, as they have a parallelism of 16 or 8 data instead of
32, which reduces the size of all the memories. In particular, the length of both rows
and columns of all memories is halved in the SDCT-16 unit, and is four times lower in
the SDCT-8 unit, with respect to SDCT-32. As a result, the area occupation of these
units is much lower than that of the SDCT-32. Moreover, just one clock domain has
been used for both the DCT and the steerable block.
16.3 Results
In order to satisfy the HEVC speed requirements for a video resolution of 7680 ×
4320 and a frame rate of 60 fps, the proposed structure needs a throughput of almost
3 GSample/s. As discussed in Sect. 16.2, the folded version presented in [5] has been
implemented, since this approach guarantees the required throughput. This structure
has a processing rate of 16 pixels per cycle, therefore the architecture needs a
frequency of at least 187 MHz (2.99 × 10^9 samples/s divided by 16 samples per cycle).
Clock gating has been enabled for the synthesis, leading to a smaller area and lower
power consumption. The synthesis targets the UMC 65 nm technology. The following
architectures have been considered and synthesized:
– two-dimensional DCT
– SDCT
– reduced SDCT-16
– reduced SDCT-8.
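The throughput and clock requirements quoted above can be reproduced with a short calculation. The sketch below assumes 1.5 samples per pixel, which is the sample count of the 4:2:0 YUV format mentioned in the conclusion of this chapter; the function names are illustrative.

```python
import math


def required_throughput(width, height, fps, samples_per_pixel=1.5):
    """Samples per second; 1.5 samples/pixel corresponds to 4:2:0 chroma subsampling."""
    return width * height * fps * samples_per_pixel


def min_clock_hz(throughput, samples_per_cycle=16):
    """Minimum clock frequency for a datapath processing `samples_per_cycle` per cycle."""
    return throughput / samples_per_cycle


throughput = required_throughput(7680, 4320, 60)   # resolution and frame rate from the text
clock = min_clock_hz(throughput)

print(f"{throughput / 1e9:.3f} GSample/s, at least {math.ceil(clock / 1e6)} MHz")
```

With these inputs the calculation gives about 2.99 GSample/s and a minimum clock of 187 MHz, matching the values stated in the text.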
For the SDCT implementation, several clock rates have been tested for the steering part,
namely 1×, 2×, 4× and 8×. By increasing the Steerable unit frequency it is possible
to decrease the parallelism and consequently the number of input/output ports of the
buffers (Table 16.1).
It can be noticed that, by reducing the data parallelism of the Steerable unit, the
size of the input memory (IM) and output memory (OM) decreases considerably,
while the size of all the other sub-blocks slightly increases (Table 16.2).
In the literature there are no other SDCT hardware architectures, so it is not possible
to make comparisons. However, Table 16.3 presents an overview of the obtained
results. As can be noticed, the area and power of the SDCT-16 are around
60% smaller than those of the complete SDCT. On the other hand, the SDCT-8 area is
around 75% smaller than that of the SDCT-16 and 90% smaller than that of the complete
SDCT, while the throughputs are reduced by 50% and 75% respectively. Finally, comparing
the DCT and the SDCT architectures, we can observe that the hardware overhead to support
up to N = 32 is very large. However, removing the hardware support for the steering
142 L. Sole et al.
part with N = 32 (SDCT-16), the area becomes comparable with that of the DCT.
As a consequence, this solution can be of interest to increase the rate-distortion
performance [4].
16.4 Conclusion
This paper provides an efficient and compact hardware accelerator architecture for
the SDCT algorithm to be used in HEVC. Many of the design choices
explained above, such as the lifting-based approach, are optimized so that
the hardware resources are reduced to a minimum. Moreover, the flexibility
shown by this architecture makes it appealing for a wide range of applications,
as it is able to work with different coding formats. The proposed SDCT framework
is able to cope with 8k Ultra High Definition (UHD) (7680 × 4320 pixels) with a
frame rate of 60 Hz for the 4:2:0 YUV format, which is one of the highest resolutions
supported by HEVC. The steerable DCT is a viable solution to improve compression
efficiency, as reported in [4]. Further work will cover the integration of the proposed
accelerator in a complete HEVC framework to validate the performance in a real
case scenario.
References
1. Chen Z, Han Q, Cham W (2018) Low-complexity order-64 integer cosine transform design
and its application in HEVC. IEEE Trans Circ Syst Video Technol 28(9):2407–2412
2. Daubechies I, Sweldens W (1998) Factoring wavelet transforms into lifting steps. J Fourier
Anal Appl 4(3):247–269. https://doi.org/10.1007/BF02476026
3. Fracastoro G, Fosson SM, Magli E (2017) Steerable discrete cosine transform. IEEE Trans
Image Process 26(1):303–314
4. Masera M, Fracastoro G, Martina M, Magli E (2019) A novel framework for designing direc-
tional linear transforms with application to video compression. In: ICASSP 2019—2019 IEEE
international conference on acoustics, speech and signal processing (ICASSP), pp 1812–1816
5. Meher PK, Park SY, Mohanty BK, Lim KS, Yeo C (2014) Efficient integer DCT architectures
for HEVC. IEEE Trans Circ Syst Video Technol 24(1):168–178
6. Ogata J, Ichige K (2018) Fast intra mode decision method based on outliers of DCT coefficients
and neighboring block information for H.265/HEVC. In: 2018 IEEE international symposium on
circuits and systems (ISCAS), pp 1–5
7. Oliveira RS, Cintra RJ, Bayer FM, da Silveira TLT, Madanayake A, Leite A (2019)
Low-complexity 8-point DCT approximation based on angle similarity for image and video coding.
Multidimension Syst Signal Process 30(3):1363–1394. https://doi.org/10.1007/s11045-018-0601-5
8. Sullivan GJ, Ohm J, Han W, Wiegand T (2012) Overview of the high efficiency video coding
(HEVC) standard. IEEE Trans Circ Syst Video Technol 22(12):1649–1668
9. Sun H, Cheng Z, Gharehbaghi AM, Kimura S, Fujita M (2019) Approximate DCT design
for video encoding based on novel truncation scheme. IEEE Trans Circ Syst I Regul Pap
66(4):1517–1530
10. Zeng B, Fu J (2008) Directional discrete cosine transforms–a new framework for image coding.
IEEE Trans Circ Syst Video Technol 18(3):305–313
Chapter 17
Hardware Architecture for a Bit-Serial
Odd-Even Transposition Sort Network
with On-The-Fly Compare and Swap
17.1 Introduction
Sorting consists of comparing and swapping array elements until a desired order is reached.
The complexity of the design depends on the algorithm itself and the data stored.
With the increased storage capacity of memory units and the emergence of high-level
computing and data analysis applications, sorting algorithms have become
one of the most frequently executed tasks in software, so optimizing them improves
the overall performance of applications [1]. For instance, search algorithms prefer
sorted data lists for maximum efficiency. Additionally, sorting is also useful in data
exchange operations employed to solve problems in graph theory, computational geometry,
deep learning, computer graphics, computer-based simulations, and image processing
in near real-time [1–5]. With this critical dependency on sorting, and the diversity
of applications that embody it, developers turned their attention to improving the
efficiency of such algorithms by targeting lower-level implementations on dedicated
processors and field programmable gate arrays (FPGAs) for parallelism and accelerated
convergence while combining speed and flexibility [1, 6–8].
The most popular architectures initially focused on implementing sorting algorithms
sequentially, until no further significant improvement could be made. Research henceforth
concentrated on parallelizing these algorithms by massive pipelining and maximum
resource consumption for maximum performance, hence a trade-off between
convergence speed and resource utilization. However, with the increase of deployable
embedded systems, wearable devices and internet-connected units, additional power
consumption and resource utilization constraints have emerged [1, 9–11]. One of
the simplest and most frequently used sorting algorithms is the odd-even transposition
sort, whose performance is of the order O(N) [1, 3, 12]. The odd-even transposition sort
algorithm provides both parallelism and flexibility, supporting larger arrays for
a reference area size, while preserving an acceptable and efficient space-time factor
[1, 3, 12, 13]. In addition, developments and improvements in electronic
components, which have reduced transistor size and gate switching delay, allow
the use of higher clock frequencies and low-complexity sequential operations. These
improvements motivated re-exploring sequential, bit-serial odd-even transposition
sorting network architectures to achieve flexibility, computational simplicity,
minimal resource utilization and higher operating frequency while preserving parallelism,
pipelining and performance [1, 3, 14–16].
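For reference, the odd-even transposition sort mentioned above can be sketched in software as follows. This is a plain sequential Python model, not the hardware architecture: in a sorting network, all compare-exchange operations of one phase act on disjoint pairs and therefore run in parallel, which is where the O(N) behaviour comes from. The function name is an assumption of this example.

```python
def odd_even_transposition_sort(values):
    """Sort a list in ascending order with N alternating even/odd phases."""
    data = list(values)
    n = len(data)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            # The pairs in one phase are disjoint, so hardware can process them together.
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

N phases are sufficient to sort any input of N elements, so with one comparator per pair the sorting time grows linearly with the array length.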
Previous work has presented numerous ways of increasing the performance of sorting
algorithms by implementing them in hardware [1, 9, 12, 13, 16, 17] and on multi-core
processing units such as graphical processing units (GPUs) [3, 4, 18, 19]. However,
such improvements focus on massive parallelism and multi-core processing for big data
analysis and are not suitable for deployable systems such as FPGAs. Moreover, recent
work in [1] proposed an optimized, shift-based hardware implementation of the
parallel-data odd-even transposition sorting algorithm, with high flexibility for
general-purpose applications, capable of sorting arrays of length larger than two times
the number of available processors. However, the design suggested in [1] increases the
sorter capacity by adding additional
storage registers to temporarily hold the data shifted back and forth during the sorting
process, thus increasing latency and the time required for convergence and making it
unsuitable for resource-limited devices and time-critical applications. In contrast to
the modification presented in [1], which operates on N data bits in parallel, the
motivation behind this work is to propose a sequential, bit-serial odd-even
transposition sorting network architecture with on-the-fly compare and swap. The
suggested sorter is capable of sorting larger arrays for the same area without the need
for additional storage components. Additionally, this work focuses on providing a higher
operating frequency, minimized resource consumption and minimal computational complexity
while preserving parallel operation and pipelining by employing bit-level operations.
The idea behind the following modification is to expand the network capabilities to
handle array sizes up to two times larger than the number of available processing
cells while reducing routing complexity and interconnections. Such a modification
minimizes the sorting cell structure by limiting its access to two elements instead of
148 G. Akkad et al.
three. The additional third element is shifted back and forth within the sorter network,
as shown in Figs. 17.2 and 17.3. In Fig. 17.2, the blue boxes represent the
working registers of processor P for a given sorting cycle. In the classical version,
processor P has access to three registers, two local registers in the holding cell and
one neighboring register in the right cell, hence an increase in routing complexity. In
contrast, in a shift-based network, processor P has access to only two registers,
i.e. the cell's local registers, while the additional element is shifted back and forth
through the sorting cell. This modification results in a major reduction in routing
complexity and lower resource utilization at the cost of a temporary storage register
and increased latency [1].
While the previously suggested modification minimizes routing complexity and
allows the network to sort larger array sizes, it suffers from increased latency and
slower convergence, and requires additional storage registers in proportion to the
number of elements to be sorted. Thus it is of great interest to design a sorting network
capable of handling larger arrays for a fixed reference area with minimal routing,
comparison and swap complexity.
The proposed architecture is a minimal-size bit-serial odd-even transposition sorting
network with on-the-fly compare and swap, capable of sorting larger array sizes in
a fixed reference area at a higher clock rate and with minimal routing requirements.
The proposed approach preserves the parallel nature of the algorithm with O(N)
performance complexity.
The proposed architecture is serial, with bit-level operations, thus greatly minimizing
resource utilization. Moreover, data is processed sequentially while being loaded into
the storage registers, most significant bit (MSB) first, allowing the processing element
to perform an on-the-fly swap in the following cycle, without the need for an
intermediate stage or additional storage elements. The sorting cell structure is shown
in Fig. 17.4. The cell is formed of three stages: data input and routing, storage, and
processing. Moreover, each sorter cell is controlled by a local state machine that
synchronizes operations and re-routes the input when needed. The
cell operation is detailed as follows:
1. The input stage is formed of two multiplexer levels of four and two 1-bit
multiplexers respectively. The input stage routes the appropriate input data to the
storage registers and performs the swapping operation when required. The data is routed
based on the decisions made by the local control unit, i.e. from the local registers,
from the right cell or from the left cell.
2. The second stage handles the storage process and is formed of two N-bit shift
registers. The input is shifted in most significant bit (MSB) first.
3. The third and final stage handles the comparison process and is formed of two
input multiplexers and a reduced-size 1-bit comparator. The comparison
is done MSB first; it starts as soon as one input bit is shifted in and lasts for N
cycles. Moreover, by considering the N-th local registers as the main comparator
inputs, the swap decision can be made at the N-th cycle, i.e. when all data bits
are processed; hence swapping can be done on-the-fly in the next cycle by
re-routing the inputs. Additionally, the comparison operation can begin comparing
the swapped data directly. This technique greatly reduces the latency of the design
and eliminates the need for additional storage elements and operation cycles.
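The MSB-first comparison described in the third stage can be modelled behaviourally as below. This is a software sketch for unsigned data only; it does not reproduce the multiplexer and register structure of the cell in Fig. 17.4, and the function name and the `decided` flag are assumptions of this example.

```python
def bit_serial_compare(a, b, n_bits):
    """Return True if a > b, consuming one bit of each operand per cycle, MSB first."""
    decided = False      # set once the first differing bit has been seen
    a_greater = False    # comparison result held once decided
    for cycle in range(n_bits):
        shift = n_bits - 1 - cycle
        bit_a = (a >> shift) & 1
        bit_b = (b >> shift) & 1
        # A 1-bit comparator is enough: the first differing bit fixes the order.
        if not decided and bit_a != bit_b:
            decided = True
            a_greater = bit_a == 1
    # After n_bits cycles the result is final and can drive the swap decision.
    return a_greater
```

Because the most significant bits arrive first, the outcome is fixed by the first differing bit, so the result is available at the N-th cycle without storing an intermediate copy of the operands.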
Thus the presented bit-serial structure preserves parallelism and operates on larger
array sizes for a reference area, given the major reduction in resource utilization
obtained by using bit-level operators. Additionally, the design can be easily expanded
to handle M-bit data, where M > N, by adding M − N additional storage registers. While
the increase in data bits requires additional processing cycles per iteration, this
penalty is made negligible by the dramatic increase in clock frequency achieved, since a
bit-level sequential structure can be considered as a fully pipelined architecture.
Similarly, the shift-based odd-even transposition sort simulation is conducted for the
array m = [8, 12, 4, 15, 2, 11, 6, 3, 5, 14, 16, 10, 1, 9, 13, 7, 12, 8, 10, 9, 13, 11,
15, 14]. As shown in Fig. 17.5, the sorting process required 26 cycles for completion,
i.e. 133.848 ns, divided into 13 sort cycles and 13 shift cycles, at an operating
frequency of 194.250 MHz. The synthesis results show the use of 1% of the logic slices,
i.e. 169 out of 14,752, and less than 1% of the 4-input look-up tables (LUTs), i.e. 200
out of 29,504, for the mentioned FPGA [1]. As 26 clock cycles were needed to sort the
16-element array and the clock period is 5.148 ns, sorting the 16 elements requires
133.848 ns, compared with 100.152 ns in the classical version. In the worst-case scenario
of 30 clock cycles, the time needed is 154.44 ns, compared with 115.56 ns in the
classical version. This
slowdown is caused by the added shift operations, which allow the design to sort larger
array sizes for a fixed number of processors. This penalty increases in proportion
to the number of added elements [1].
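The timing figures quoted above follow directly from the cycle counts and the clock period, as the short calculation below shows. The clock period of the classical version is not stated in the text; here it is inferred from the reported times and cycle counts, which is an assumption of this example.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def sort_time_ns(cycles, clock_period_ns):
    """Total sorting time for a given number of clock cycles."""
    return cycles * clock_period_ns


shift_period = round(period_ns(194.250), 3)        # 5.148 ns, as reported
typical = sort_time_ns(26, shift_period)           # 26 cycles reported for Fig. 17.5
worst = sort_time_ns(30, shift_period)             # worst case of 30 cycles
classical_period = 100.152 / 26                    # inferred, not stated in the text

print(f"{typical:.3f} ns typical, {worst:.2f} ns worst case")
print(f"inferred classical period: {classical_period:.3f} ns")
```

The same inferred period of about 3.852 ns also reproduces the 115.56 ns worst-case figure quoted for the classical version (30 cycles).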
on a minimum clock period of 1.612 ns per cycle, equivalent to a maximum frequency of
620.34 MHz. Synthesis results show the use of 80 slice flip-flops (FFs) and 51 4-input
look-up tables (LUTs). Additionally the design was synthesized for N = 16, 32-bits
unsigned data.
In order to better assess the performance and resource utilization of the mentioned
designs and to highlight the advantage of the proposed bit-serial architecture,
synthesis results are presented and compared in Table 17.1. As presented in
Table 17.1, the proposed architecture is superior to the classical and shift-based
parallel networks: it can operate at a maximum frequency of 620.34 MHz and provides
a major reduction in resource utilization. Additionally, unlike parallel
computation-based structures, the proposed design is flexible: a change in the number of
operating bits results only in a proportional change in storage elements, i.e. registers.
In this paper, an optimized bit-serial odd-even transposition sort hardware architecture
with on-the-fly compare and swap was proposed. This implementation outperforms
previous parallel structures, is minimal in size, and is easily expandable to sort
data of different lengths, i.e. bits, while preserving algorithm parallelism, complexity
and pipelined structure. Additionally, the presented structure can run at a much higher
frequency given the simplicity of the employed bit-level operations. Moreover, the
sorting process begins while the data is being loaded into its memory, which means
that the sorter does not require additional swap cycles. Further work could adopt
an optimized data loading technique and extend the design
to operate on signed data and fixed-point representations.
References
1. Ayoubi R, Istambouli S, Abbas AW, Akkad G (2019) Hardware architecture for a shift-
based parallel odd-even transposition sorting network. In: The 4th international conference on
advances in computational tools for engineering applications, IEEE. Zouk Mosbeh, Lebanon
2. Batcher KE (1968) Sorting networks and their applications. In: Proceedings of the April 30–
May 2, Spring joint computer conference, pp 307–314. AFIPS ’68 (Spring), ACM, New York,
NY, USA. https://doi.org/10.1145/1468075.1468121
3. Francis RS, Mathieson ID (1988) A benchmark parallel sort for shared memory multiproces-
sors. IEEE Trans Comput 37(12):1619–1626
4. Singh DP, Joshi I, Choudhary J (2018) Survey of GPU based sorting algorithms. Int J Parallel
Prog 46(6):1017–1034. https://doi.org/10.1007/s10766-017-0502-5
5. Vasicek Z, Sekanina L (2008) Novel hardware implementation of adaptive median filters. In:
2008 11th IEEE workshop on design and diagnostics of electronic circuits and systems, pp 1–6
6. Akkad G, Ayoubi R, Abche A (2018) Constant time hardware architecture for a Gaussian
smoothing filter. In: 2018 International conference on signal processing and information secu-
rity (ICSPIS), pp 1–4
7. Akkad G, Mansour A, ElHassan B, Roy FL, Najem M (2018) FFT radix-2 and radix-4 FPGA
acceleration techniques using HLS and HDL for digital communication systems. In: 2018 IEEE
international multidisciplinary conference on engineering technology (IMCET), pp 1–5
8. Akkad G, Mansour A, ElHassan B, Roy FL, Najem M (2018) Twiddle factor generation using
Chebyshev polynomials and HDL for frequency domain beamforming. In: Applications in
electronics pervading industry, environment and society, Springer lecture notes in
electrical engineering. Springer
9. Chen R, Prasanna VK (2017) Computer generation of high throughput and memory efficient
sorting designs on FPGA. IEEE Trans Parallel Distrib Syst 28(11):3100–3113
10. Farmahini-Farahani A, Duwe HJ III, Schulte MJ, Compton K (2013) Modular design of high-
throughput, low-latency sorting units. IEEE Trans Comput 62(7):1389–1402
11. Rjabov A (2016) Hardware-based systems for partial sorting of streaming data. In: 2016 15th
Biennial baltic electronics conference (BEC), pp 59–62
12. Hematian A, Chuprat S, Manaf AA, Parsazadeh N (2013) Zero-delay FPGA-based odd-even
sorting network. In: 2013 IEEE symposium on computers informatics (ISCI), pp 128–131
13. Korat UA, Yadav P, Shah H (2017) An efficient hardware implementation of vector-based odd-
even merge sorting. In: 2017 IEEE 8th Annual ubiquitous computing, electronics and mobile
communication conference (UEMCON), pp 654–657
14. Huang C-Y, Yu G-J, Liu B-D (2001) A hardware design approach for merge-sorting net-
work. In: ISCAS 2001. The 2001 IEEE international symposium on circuits and systems (Cat.
No.01CH37196), vol 4, pp 534–537
15. Durad MH, Akhtar MN (2014) Performance analysis of parallel sorting algorithms using MPI.
In: 2014 12th International conference on frontiers of information technology, pp 202–207
16. Olariu S, Pinotti MC, Zheng SQ (2000) An optimal hardware-algorithm for sorting using a
fixed-size parallel sorting device. IEEE Trans Comput 49(12):1310–1324
17. Lipu AR, Amin R, Mondal MNI, Mamun MA (2016) Exploiting parallelism for faster imple-
mentation of bubble sort algorithm using FPGA. In: 2016 2nd International conference on
electrical, computer telecommunication engineering (ICECTE), pp 1–4
18. Faujdar N, Ghrera SP (2017) A practical approach of GPU bubble sort with CUDA hardware. In:
2017 7th International conference on cloud computing, data science engineering—Confluence,
pp 7–12
19. Yildiz Z, Aydin M, Yilmaz G (2013) Parallelization of bitonic sort and radix sort algorithms on
many core GPUs. In: 2013 International conference on electronics, computer and computation
(ICECCO), pp 326–329
Chapter 18
Variable-Rounded LMS Filter
for Low-Power Applications
18.1 Introduction
Nowadays the reduction of power consumption is a key point in the design of digital
circuits, and significant efforts are dedicated to developing new methods and techniques.
Battery life, low self-heating and reliability are important design aspects in all
modern electronic systems, and the problem is exacerbated when high operating
frequencies are considered. In this scenario, precision-scalable approaches [1–3] have
been proposed, on the assumption that some approximation can be tolerated in exchange
for improved performance. Audio and image processing, for instance, can leverage the
limits of human senses to improve efficiency. In the area of data mining and neural
networks, data features are exploited to develop error-resilient applications [4, 5].
Also in the field of adaptive filters, some precision-scalable techniques have been
proposed for the Least Mean Square (LMS) algorithm. Widely used for applications such as
system identification, channel equalization or noise cancellation, it is composed of a
FIR section and a learning part, as shown in Fig. 18.1a. Unlike for canonical filters,
LMS does not
Fig. 18.1 a Standard LMS filter and b Variable-rounding LMS block diagram
have an a priori defined impulse response, but it changes its internal coefficients min-
imizing, in an approximate way, the mean square error (MSE) between its output and
a desired signal. For this purpose, at each iteration, sum of products is executed, and
multiplications between input samples and an error signal are performed to compute
MSE gradient estimate (responsible for coefficients updating). As consequence LMS
provides the usage of a large number of multipliers and registers, offering serious
concerns from a power consumption point of view. In [6] approximate multipliers [7,
8] are used in the FIR section to reduce dissipation, but regime performances are not
scalable. In [9] a run-time procedure observes coefficients magnitude and, following
an external threshold, decides which terms are negligible for the output computation.
Consequently, relative registers and multipliers are frozen. On the other hand, a not
negligible increase in the area is due to the presence of additional blocks for regime
detection and coefficients analysis, in addition to a relevant degradation of regime
performances when high power saving is demanded. In this paper a variable rounding
multiplication is explored in the LMS updating section for the gradient computation
(as underlined in red in Fig. 18.1b). The idea is that if error signal is very small, it
is possible to use a rounded version of input samples for gradient computation with
negligible worsening of regime performances. In this way part of the multipliers par-
tial products matrix is turned off, allowing power consumption saving. An advantage
of this approach with respect to the technique of [9] is that it allows a power reduction
in all multipliers of the learning section of the LMS filter. In addition, according to
error behavior, circuit can decide between two different kinds of rounding, and the
use of only one observation logic for the error signal is a very attractive solution.
For a major comprehension of our proposal, in Sect. 18.2 a brief summary of LMS
algorithm is offered and in Sect. 18.3 the low-power implementation is addressed.
Finally, in Sect. 18.4 results and circuit implementation in TSMC 28 nm CMOS
technology are discussed.
LMS computes its impulse response iteratively in order to minimize the difference between the output signal y(n) and the desired signal d(n) in the mean square sense [10]. Considering input samples x(n) and LMS coefficients w_n(i), and defining the filter dimension DIM, y(n) is given by the following expression:
$$y(n) = \sum_{i=0}^{DIM-1} w_n(i) \cdot x(n-i). \tag{18.1}$$
Comparing y(n) with d(n) yields the error signal e(n), which quantifies the deviation with respect to the desired behavior:
It is worth noting that a proper choice of the step size parameter μ guarantees
algorithm convergence and good regime performances [10].
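As a reference for the discussion that follows, the standard LMS recursion described in [10] can be sketched in a few lines of Python. The sketch assumes the usual textbook form e(n) = d(n) − y(n) and w_{n+1}(i) = w_n(i) + μ·e(n)·x(n − i); the function name `lms_identify`, the step size and the three-tap unknown system are illustrative assumptions, not values taken from this chapter.

```python
import random

def lms_identify(x, d, dim, mu):
    """Standard LMS (textbook form): returns final coefficients and error sequence."""
    w = [0.0] * dim            # coefficients w_n(i)
    taps = [0.0] * dim         # taps[i] holds x(n - i)
    errors = []
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]
        y = sum(wi * xi for wi, xi in zip(w, taps))        # FIR section, Eq. (18.1)
        e = dn - y                                          # error signal e(n)
        w = [wi + mu * e * xi for wi, xi in zip(w, taps)]   # learning section
        errors.append(e)
    return w, errors

# Hypothetical system-identification run: the "unknown" system is a 3-tap FIR.
random.seed(0)
h = [0.5, -0.3, 0.2]
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
     for n in range(len(x))]
w, errors = lms_identify(x, d, dim=3, mu=0.05)
```

In this noise-free toy case the coefficients converge to the taps of the unknown system; the multiplications `mu * e * xi` are the ones that the proposed variable rounding targets.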
The key idea of this paper is to approximate the gradient computation by using, in (18.4), an approximated version of x(n − i), denoted x_RND(n − i), in which some LSBs are rounded:
If we call ε_grad the absolute value of the gradient error, we can write:
Therefore, the lower the absolute value of the error e(n), the larger the error of x_RND(n − i) (ε_xRND) can be for a prescribed ε_grad value. As shown in Fig. 18.1b, the proposed implementation provides an Evaluation Block for the error signal analysis and Rounding Blocks to obtain the approximate input x_RND(n − i). The Evaluation Block, represented in Fig. 18.2a, computes the error signal magnitude (through an XOR operation between e(n) and its sign bit) and divides its most significant bits into two groups (called the MSB1 and MSB2 groups).
158 G. Di Meo et al.
Fig. 18.2 a Evaluation block and b Rounding block schemes for the (n − i)-th acquired input sample
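The relation between the error magnitude and the admissible rounding error stated above can be made explicit with a short derivation. It is only a sketch: it assumes that the gradient estimate for the i-th tap is proportional to e(n)·x(n − i), as in the standard LMS update [10], and it omits any constant factor such as the step size.

```latex
\begin{align*}
\mathrm{grad}_{n}(i) &= e(n)\, x(n-i), \qquad
\mathrm{grad}_{RND,n}(i) = e(n)\, x_{RND}(n-i), \\
\varepsilon_{grad} &= \left| \mathrm{grad}_{n}(i) - \mathrm{grad}_{RND,n}(i) \right|
  = |e(n)| \, \left| x(n-i) - x_{RND}(n-i) \right|
  = |e(n)| \, \varepsilon_{xRND}.
\end{align*}
```

Under this assumption, for a prescribed ε_grad the admissible ε_xRND scales as ε_grad / |e(n)|, which is why a coarser rounding can be tolerated when the error is small.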
Starting from the two groups MSB1 and MSB2, the proposed approach uses a two-level approximation. If all the bits of the MSB1 group are zero, the flag α is set to zero. If the bits of the MSB2 group are also all zero, the other flag β is also set to zero. The flags α and β control the Rounding Block (represented in Fig. 18.2b). When α = 0, the K least significant bits of x are nullified through an AND operation. When β = 0 as well, an additional K least significant bits of x are nullified. In order to perform a rounding operation, a variable rounding constant RC is computed according to the following conditions:
In this way, x_RND(n) enters the multiplication with K (or 2K) nullified LSBs, forcing to zero K (or 2K) rows of the partial products matrix (see Fig. 18.3a). In addition, since the gradient LSBs are zero, the corresponding coefficient LSBs are not updated and the related flip-flops can be frozen. Then, as shown in Fig. 18.3b, two clock-gating cells, enabled by α and β respectively, are introduced to manage all registers in the learning section.
Fig. 18.3 a Multiplier for gradient computation and b clock gating for the i-th feedback register. Nullified LSBs and rows are represented in gray in the figure on the left
To verify the low-power properties, the standard and the proposed LMS filters are used to identify three different unknown systems: a low-pass FIR filter, a low-pass IIR filter and a high-pass IIR filter, of order 40, 10 and 13 respectively. Convergence capability is investigated by observing the regime MSE, obtained by running 25 independent simulations and averaging the respective error signals. The length of the LMS filter (DIM) is equal to 40. For low-power assessment, the circuits are synthesized and routed in TSMC 28 nm CMOS technology and post-route results are analyzed. Inputs and coefficients are expressed in 12-bit fixed-point arithmetic, while the error signal is on 18 bits. Soft rounding is applied if the 14 MSBs of the error are zero, and hard approximation is applied if the 16 MSBs are zero. We choose K = 2, so x_RND(n) exhibits two or four nullified LSBs. All multipliers are synthesized with a carry-save tree topology and a fast vector-merging adder.
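With these parameters, the behavior of the Evaluation and Rounding Blocks can be sketched as follows. This is a software model under stated assumptions, not the synthesized circuit: the rounding constant RC is assumed to be half of the weight of the lowest retained bit (round half up), the result is saturated to the 12-bit signed range, and the function names `evaluate` and `round_input` are hypothetical.

```python
ERR_BITS = 18                    # error signal width (from the text)
SOFT_MSBS, HARD_MSBS = 14, 16    # MSBs that must be zero for soft / hard rounding
K = 2                            # nullified LSBs per approximation level
X_BITS = 12                      # input sample width

def evaluate(e):
    """Evaluation Block: return the flags (alpha, beta); 0 enables rounding."""
    sign = -1 if e < 0 else 0
    mag = e ^ sign                               # XOR with the sign bit
    alpha = 0 if (mag >> (ERR_BITS - SOFT_MSBS)) == 0 else 1
    beta = 0 if alpha == 0 and (mag >> (ERR_BITS - HARD_MSBS)) == 0 else 1
    return alpha, beta

def round_input(x, alpha, beta):
    """Rounding Block: nullify K (alpha = 0) or 2K (beta = 0) LSBs of x."""
    if alpha == 1:
        return x                                 # error not small: no rounding
    n = 2 * K if beta == 0 else K
    rc = 1 << (n - 1)                            # assumed rounding constant RC
    x_max = (1 << (X_BITS - 1)) - 1              # assumed saturation limit
    return min(x + rc, x_max) & ~((1 << n) - 1)  # AND mask clears the n LSBs
```

For example, an error of 10 LSBs selects soft rounding (two LSBs of x cleared), while an error of −3 LSBs selects hard rounding (four LSBs cleared).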
Table 18.1 reports the error performance. The regime MSE of the proposed approach is very close to that of the standard LMS implementation, showing that the additional approximation is almost negligible with respect to the other error sources. In addition, Fig. 18.4 shows the regime frequency response of the filters in the three considered cases, compared with the frequency response of the target system. Again, the standard and the proposed LMS perform very similarly, with very good in-band matching and very similar behavior in the stop-band.
Fig. 18.4 Harmonic responses of unknown systems and LMS circuits (horizontal axes: normalized frequency, ×π rad/sample, from 0 to 0.5). From the left to the right: Low-pass FIR, Low-pass IIR and High-pass IIR identification case
Post place-and-route electrical performance is compared in Table 18.2. The proposed solution results in an area increase of only 1.2% (needed for the additional control logic). In regime conditions, the proposed LMS exhibits a significant reduction of power dissipation with respect to the standard LMS. The reduction is 26–27% for the low-pass FIR and IIR target systems. A lower reduction (22%) is observed for the high-pass IIR case, where the regime MSE is higher.
18.5 Conclusions
A novel low-power implementation has been proposed for the LMS algorithm. A variable rounding of the acquired input samples limits the switching activity of the multipliers in the feedback section, and the approximation is applied only when the error signal is very small. Results reveal a negligible worsening of regime MSE and a negligible area increase, along with the possibility of reducing power consumption by up to 27% with respect to the standard LMS filter.
References
1. Xu Q, Mytkowicz T, Kim NS (2016) Approximate computing: a survey. IEEE Des Test 33(1):8–22
2. Han J, Orshansky M (2013) Approximate computing: an emerging paradigm for energy-efficient design. In: 2013 18th IEEE European test symposium (ETS), Avignon, pp 1–6
3. Chippa VK, Chakradhar ST, Roy K, Raghunathan A (2013) Analysis and characterization of inherent application resilience for approximate computing. In: 2013 50th ACM/EDAC/IEEE design automation conference (DAC), Austin, TX, pp 1–9
4. Raha A, Jayakumar H, Raghunathan V (2016) Input-based dynamic reconfiguration of approximate arithmetic units for video encoding. IEEE Trans Very Large Scale Integr (VLSI) Syst 24(3):846–857
5. Moons B, Verhelst M (2016) A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets. In: 2016 IEEE symposium on VLSI circuits (VLSI-Circuits), Honolulu, HI, pp 1–2
6. Esposito D, Di Meo G, De Caro D, Petra N, Napoli E, Strollo AGM (2018) On the use of approximate multipliers in LMS adaptive filters. In: 2018 IEEE international symposium on circuits and systems (ISCAS), Florence, pp 1–5
7. De Caro D, Petra N, Strollo AGM, Tessitore F, Napoli E (2013) Fixed-width multipliers and multipliers-accumulators with min-max approximation error. IEEE Trans Circuits Syst I Regul Pap 60(9):2375–2388
8. Petra N, De Caro D, Garofalo V, Napoli E, Strollo AGM (2011) Design of fixed-width multipliers with linear compensation function. IEEE Trans Circuits Syst I Regul Pap 58(5):947–960
9. Esposito D, Di Meo G, De Caro D, Strollo AGM, Napoli E (2019) Design of low-power approximate LMS filters with precision-scalability. In: Saponara S, De Gloria A (eds) Applications in electronics pervading industry, environment and society. ApplePies 2018. Lecture Notes in Electrical Engineering, vol 550. Springer, Cham
10. Haykin S (2002) Adaptive filter theory. Prentice-Hall
Chapter 19
A Simulink Model-Based Design
of a Floating-Point Pipelined
Accumulator with HDL Coder
Compatibility for FPGA Implementation
19.1 Introduction
of using general-purpose processors. This makes it possible to have exactly the resources needed for the task and to optimize the system for performance or physical size, depending on the use case.
The design of dedicated hardware architectures is traditionally done by using a Hardware Description Language (HDL) and, after subsequent verification, the system is implemented on the destination platform. However, as shown by different works [6–8], working with high-abstraction-level frameworks enables the designer to avoid the verbosity of strongly typed languages (such as VHDL or Verilog) and to focus on system functionality only. This is possible, for example, by using the MATLAB/Simulink software. A high-level, block-based design can be developed and the behavior of the system can be simulated in the same environment. Moreover, with the dedicated HDL Coder tool, HDL code can be automatically generated from the system block diagram and then used to program the selected platform.
This methodology is the basis of our work on the development of a Support Vector Machine (SVM) algorithm for human activity recognition (HAR) to be embedded in an FPGA-based wearable device.
Among the SVM blocks employed in our dedicated Simulink design, the accumulator is one of the most frequently used. Hence, the aim of this paper is to present a Simulink model of an accumulation circuit that is fully compatible with the HDL Coder workflow and exploits the advantages of a model-based design approach [9, 10].
The paper is organized as follows. In Sect. 19.2 related works are discussed, while in Sect. 19.3 the designed architecture is introduced. In Sect. 19.4 results are shown, and in Sect. 19.5 conclusions are drawn.
A general FPGA-based SVM architecture deals with data having a high dynamic range: thus, it is based on floating-point arithmetic, as this is the best solution for data with this requirement [11]. For this reason, we focused on floating-point accumulator architectures. The accumulation operation becomes critical when a floating-point adder with latency is used: in this case, to produce a correct result, the input data frequency must match this latency value [12]. Many solutions have been presented in the literature to face this issue. In [13], Ni and Hwang presented a version of the system in which, thanks to an articulated control logic, only one adder and a buffer are employed. In [14] a version with a better throughput was proposed. On a parallel branch, several works presented dedicated architectures for the adder part. In [15], Luo and Martonosi broke down the floating-point adder structure to embed delayed additions, at the cost of a more complex control logic. A similar approach was used in [12], and, in [16], Wang et al. presented several reduction circuits able to work with variable floating-point precision.
In [17], Zhuo et al. presented two main architectures: the Fully Compacted Binary
Tree (FCBT) and the Single Strided Adder (SSA). The FCBT is an accumulator based
on two classical floating-point adders and a number of buffers k, found to be:
$$k = \log n - 1. \tag{1}$$
19.3 Architecture
The proposed Simulink model is shown in Fig. 19.1. The model is built from basic Simulink blocks. However, since Simulink lacks a specific block modeling an adder with latency, an Adder With Latency block has been created as a cascade of an adder and a delay block. This configuration also allows the latency of the adder to be configured with a customizable value p. The rest of the architecture features three Switch blocks (R_Switch, A_Switch and B_Switch) and three main Logic blocks (External Signaling Logic, Main Control Logic and Adder Supervisor Logic).
166 M. Bassoli et al.
Fig. 19.1 The Simulink accumulator model based on the work of [19]. In this example, the system has been configured to model a pipelined accumulator with an adder latency of p clock cycles
The Switch blocks are used as routing elements and their behavior is equivalent
to the Register Transfer Level (RTL) multiplexer element. With this configuration,
the Register can be shared by both operand A and B. Moreover, as a control logic
rule, the input data can only be used as operand A while the operand B comes from
the feedback path each time the adder output is valid.
The Logic blocks are subsystems which produce the control signals for the entire architecture. In detail:
• Main Control Logic: it is the core control unit of the system. As explained in [19], it controls the data path of the input data stream, the Register and the adder to avoid data collisions and data loss. The detailed operation of the logic is reported in Table 19.2 and an execution example is shown in Table 19.3;
• External Signaling Logic: it is the logic dedicated to managing the data_last input flag and producing the result_ready output flag. The output can be considered ready when all the following conditions are verified: data_last raised by the user, internal adder pipeline empty (meaning no other operands are to be processed) and last adder result placed in the Register. The first condition is evaluated by capturing the user data_valid assertion through a Set-Reset (S-R) Flip-Flop (FF), the second is directly given by the pipeline_empty signal from the Adder Supervisor Logic and the third is evaluated by verifying whether the R_sel bus is equal to one. The S-R FF is reset by the result_ready signal delayed by one clock cycle (reset_ready’), so as to set the system ready for the next streaming accumulation. The circuit dedicated to this task is shown in Fig. 19.2a;
• Adder Supervisor Logic: by checking whether a new pair of inputs is presented to the adder, it notifies whether any data is inside the pipeline. In addition, it signals when a sum operation has been completed and the adder output is valid. The internal logic is shown in Fig. 19.2b. The new_input bit signal goes high each time a new pair of operands is presented to the adder and it is used as the input of the shift register represented by FF1, FF2, …, FFp, with p the length of the internal adder pipeline. When the sum_valid bit goes high, p clock cycles have elapsed, meaning the addition result is ready. Moreover, if the pipeline_empty bit is low, it means that no new operands have been presented in the last p clock cycles, i.e. the internal adder pipeline is empty.
Fig. 19.2 a External signaling logic function; b Adder supervisor logic function
Table 19.3 Example of 4 input elements and a pipelined adder with a latency of 2 clock cycles
Cyc. | Data | A  | B            | R  | Result
0    | X1   |    |              | X1 |
1    | X2   | X2 | X1           |    |
2    | X3   |    |              | X3 |
3    | X4   | X4 | X1 + X2      | X3 | X1 + X2
4    |      |    |              | X3 |
5    |      | X3 | X1 + X2 + X4 |    | X1 + X2 + X4
6    |      |    |              |    |
7    |      |    |              |    | X1 + X2 + X3 + X4
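The shift-register behavior of the Adder Supervisor Logic described above can be sketched in Python as follows. This is an illustrative model, not the generated HDL: the class name `AdderSupervisor` is hypothetical, the flip-flop outputs are read before the clock edge, and the occupancy flag is modeled as the OR of the flip-flop outputs (named `in_flight` here, without assuming the polarity of the pipeline_empty signal).

```python
class AdderSupervisor:
    """Shift register FF1..FFp tracking operand pairs inside the adder pipeline."""

    def __init__(self, p):
        self.ff = [0] * p                      # FF1 ... FFp, all cleared at reset

    def step(self, new_input):
        """Model one clock cycle; return (sum_valid, in_flight) seen in that cycle."""
        sum_valid = self.ff[-1]                # output of FFp: a sum is leaving the adder
        in_flight = int(any(self.ff))          # 1 while some pair is still in the pipeline
        self.ff = [int(new_input)] + self.ff[:-1]   # clock edge: shift by one stage
        return sum_valid, in_flight

# A pair presented at cycle 0 to an adder with latency p = 2
sup = AdderSupervisor(p=2)
trace = [sup.step(b) for b in (1, 0, 0, 0)]
```

With p = 2, sum_valid is asserted two cycles after new_input, which matches Table 19.3, where the operands pushed at cycle 1 produce a valid sum at cycle 3.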
Table 19.3 shows an example of the running algorithm, with the internal adder latency configured to be 2 clock cycles.
For simplicity, in this use case, four data elements are read, one every clock cycle, while the adder is a two-stage pipelined operator. At cycle 0, the first element is presented. Since the adder produces a valid output only with a pair of input operands, the element is stored in the Register. At the next cycle, a new input is ready and the two operands can be pushed into the adder pipeline. These steps are repeated until the first sum is generated by the adder, here at cycle 3. In this situation, the Register is already storing a value (X3), so the control logic pushes into the adder the new incoming input together with the sum just generated. The Register is set in a hold state. At cycle 5, when a new pair of data is available, the adder is fed with the value stored in the Register and the last generated sum. After two clock cycles (i.e. the adder pipeline latency), the final accumulation value becomes valid.
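The walk-through above can be reproduced with a small cycle-by-cycle model. The sketch below is a behavioral approximation inferred from the description and from Table 19.3, not the Simulink model itself: the function name `accumulate` is hypothetical, one input is assumed per clock cycle, and the pipelined adder is modeled by computing the sum at entry and delaying it by p cycles.

```python
from collections import deque

def accumulate(stream, p):
    """Return (result, cycle at which the result lands in the Register)."""
    pipe = deque([None] * p)       # adder pipeline; None marks an empty stage
    pending = deque(stream)        # one input element per clock cycle
    reg = None                     # the shared Register
    cycle = 0
    while True:
        sum_out = pipe.pop()       # sum leaving the adder in this cycle, if any
        x = pending.popleft() if pending else None
        pushed = None              # sum entering the adder pipeline in this cycle
        if x is not None and sum_out is not None:
            pushed = x + sum_out                   # A = input, B = feedback sum
        elif x is not None:
            if reg is None:
                reg = x                            # wait for a partner operand
            else:
                pushed, reg = x + reg, None        # A = input, B = Register
        elif sum_out is not None:
            if reg is None:
                reg = sum_out                      # park the partial or final sum
            else:
                pushed, reg = reg + sum_out, None  # A = Register, B = feedback sum
        pipe.appendleft(pushed)
        if not pending and all(s is None for s in pipe):
            return reg, cycle      # stream finished and pipeline empty
        cycle += 1

result, last_cycle = accumulate([1, 2, 3, 4], p=2)
```

With four inputs and p = 2 the model returns the full sum at cycle 7, following the same sequence of Register and adder operations listed in Table 19.3.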
19.4 Results
The presented model has been compared with the Xilinx floating-point accumulator Intellectual Property (IP) core for FPGA implementation. To obtain comparable results, both architectures have been configured to have a total accumulator latency of 30 clock cycles. For the Simulink model, this means using an adder pipeline latency p of 11 clock cycles and an input stream length n of 5 values, as found by using the DB architecture equation of Table 19.1.
Fig. 19.3 reports a Simulink example with an input stream of 5 random floating-point values in the range −100 to 100.
As shown, the input flags data_valid and data_last accompany the input stream to indicate whether each value is valid and whether it is the last one. After the data_last flag has been asserted and the whole system has finished its internal processing, the result_ready flag is raised for one clock cycle. This notifies the user that the result is ready.
To test the HDL Coder compatibility, a non-target-specific VHDL code generation has been carried out for an architecture based on the 32-bit floating-point format. The generated code has then been imported into the Vivado software and, after synthesis and implementation for a Xilinx Artix-7 XC7A100T-CSG324 FPGA target device, the results have been reported in Table 19.4. Both systems perform the same data processing: accumulation of a 32-bit floating-point input stream, with a total latency of 30 clock cycles and an input of 5 streaming values.
Fig. 19.3 Example of an execution of the accumulator model: a input data values; b external data
valid input signal (data_valid); c external input signal to notify the last value of the set (data_last);
d output value (result); e internally generated output signal to notify the output is valid (result_ready)
As shown, the presented model features lower resource usage than the Xilinx IP implementation. This result was expected, because the internal fixed-point accumulator of the IP had to be configured to match the full data range and precision of the 32-bit floating-point format.
19.5 Conclusion
References
1. Bassoli M, Bianchi V, De Munari I (2018) A plug and play IoT wi-fi smart home system for
human monitoring. Electronics 7(9):200
2. Montalto F, Guerra C, Bianchi V, De Munari I, Ciampolini P (2015) MuSA: wearable multi
sensor assistant for human activity recognition and indoor localization. Biosyst Biorobotics
11:81–92
3. Guerra C, Bianchi V, De Munari I, Ciampolini P (2015) CARDEAGate: low-cost, ZigBee-
based localization and identification for AAL purposes. In: 2015 IEEE Instrumentation and
Measurement Technology Conference (I2MTC)
4. Bianchi V, Bassoli M, Lombardo G, Fornacciari P, Mordonini M, De Munari I (2019) IoT
wearable sensor and deep learning: an integrated approach for personalized human activity
recognition in a smart home environment. IEEE Internet Things J 6(5):8553–8562
5. Gaikwad NB, Tiwari V, Keskar A, Shivaprakash NC (2019) Efficient FPGA implementation of multilayer perceptron for real-time human activity classification. IEEE Access 7(8651457):26696–26706
6. Giardino D, Matta M, Re M, Silvestri F, Spanò S (2018) IP generator tool for efficient hardware acceleration of self-organizing maps. In: International Conference on Applications in Electronics Pervading Industry, Environment and Society (APPLEPIES)
7. Hai JCT, Pun OC, Haw TW (2015) Accelerating video and image processing design for FPGA
using HDL Coder and Simulink. In: 2015 IEEE Conference on Sustainable Utilization and
Development in Engineering and Technology (CSUDET)
8. Michael T, Reynolds S, Woolford T (2018) Designing a generic, software-defined multimode
radar simulator for FPGAs using Simulink® HDL Coder and Speedgoat real-time hardware.
In: 2018 International Conference on Radar (RADAR)
9. Choe J et al (2019) Model-based design and DSP code generation using Simulink® for power
electronics applications. In: 2019 10th International Conference on Power Electronics and
ECCE Asia (ICPE 2019–ECCE Asia), pp 923–926
10. Perry S (2009) Model based design needs high level synthesis—a collection of high level
synthesis techniques to improve productivity and quality of results for model based electronic
design. In: 2009 Design, Automation and Test in Europe Conference and Exhibition (DATE
’09), pp 1202–1207
11. Flynn MJ, Oberman SF (2001) Advanced computer arithmetic design
12. Nagar KK, Bakos JD (2009) A high-performance double precision accumulator. In: 2009
International Conference on Field-Programmable Technology (FPT’09)
13. Ni LM, Hwang K (1985) Vector-reduction techniques for arithmetic pipelines. IEEE Trans Comput C-34(5):404–411
14. Sips HJ, Lin H (1991) An improved vector-reduction method. IEEE Trans Comput 40(2):214–217
15. Luo Z, Martonosi M (2000) Accelerating pipelined integer and floating-point accumulations in
configurable hardware with delayed addition techniques. IEEE Trans Comput 49(3):208–218
16. Wang X, Braganza S, Leeser M (2006) Advanced components in the variable precision
floating-point library. In: 2006 14th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM)
17. Zhuo L, Morris GR, Prasanna VK (2007) High-performance reduction circuits using deeply
pipelined operators on FPGAs. IEEE Trans Parallel Distrib Syst 18(10):1377–1392
18. Zhuo L, Morris GR, Prasanna VK (2005) Designing scalable FPGA-based reduction circuits using pipelined floating-point cores. In: 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005)
19. Tai Y-G, Lo C-TD, Psarris K (2012) Accelerating matrix operations with improved deeply
pipelined vector reduction. IEEE Trans Parallel Distrib Syst 23(2):202–210
20. Huang M, Andrews D (2013) Modular design of fully pipelined reduction circuits on FPGAs.
IEEE Trans Parallel Distrib Syst 24(9):1818–1826
Chapter 20
Bitmap Index: A Processing-in-Memory
Reconfigurable Implementation
20.1 Introduction
The Processing-in-Memory (PIM) paradigm requires that logic and storage elements are merged together. This paradigm is particularly suited to algorithms that need to perform a huge number of simple operations on stored data. To demonstrate the advantages that the PIM approach can provide, we chose to implement an architecture able to solve the bitmap indexing problem. Bitmap indexing is an important algorithm often used in database management systems.
Bitmap indexing is an algorithm used to identify, inside a database, the entries that have specific characteristics. For example, for the database of Fig. 20.1a the query consists in identifying how many men own a motorbike or a sport car. To reach this goal each feature is indexed using a binary representation. The gender column, for example, is divided into two sub-columns, one representing the male gender and one representing the female gender. Then each sub-column is represented using single bits. For example, the first entry of the database is a female, so the M column contains ‘0’, while the F column contains ‘1’ (see Fig. 20.1a). Answering a specific query on such a database means performing simple logic operations between the sub-columns, as depicted in Fig. 20.1b.
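The query described above can be illustrated with a minimal software sketch in which each bitmap is a Python integer and bit i refers to row i. The table contents, the column names and the helper `build_bitmaps` are hypothetical and serve only to illustrate the technique; they are not the data of Fig. 20.1.

```python
def build_bitmaps(rows, column):
    """Return {key_value: bitmap}, where bit i is set if row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

# Hypothetical table; as in the text, the first entry is a female.
rows = [
    {"gender": "F", "vehicle": "sport car"},
    {"gender": "M", "vehicle": "motorbike"},
    {"gender": "M", "vehicle": "city car"},
    {"gender": "M", "vehicle": "sport car"},
    {"gender": "F", "vehicle": "motorbike"},
]
gender = build_bitmaps(rows, "gender")
vehicle = build_bitmaps(rows, "vehicle")

# Query: men who own a motorbike OR a sport car -> M AND (motorbike OR sport car)
match = gender["M"] & (vehicle["motorbike"] | vehicle["sport car"])
count = bin(match).count("1")     # counting the ones gives the answer to the query
```

The whole query reduces to one OR, one AND and a count of ones over the bitmaps, which is the kind of simple bitwise workload that suits a Processing-in-Memory array.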
In our architecture, instead of storing the database in memory with the same structure shown in Fig. 20.1, we store the transpose of the bit matrix representing the database. With this solution every row of the memory contains a column representing a specific feature. As a consequence to search for
Fig. 20.1 a Given a table, bitmap indexing transforms each column into as many bitmaps as the number of possible key-values for that column. b In order to answer a query, logic bitwise operations are to be performed. c Practical scheme of the execution of the query
176 M. Andrighetti et al.
Fig. 20.2 a Overview of the complete architecture. b Structure of the Bank-Breaker duo. c Insight of the PIM cell
The second was inserted to control addresses, data flow inside the bank and select
between PIM and standard memory mode. The breaker block is used to enable the
communication among different banks. This structure is flexible and can be easily
reconfigured to implement other algorithms.
comparison of the proposed architecture with the state of the art in terms of clock cycles required for an operation. Our architecture is always faster than the other solutions proposed in the literature.
It should be taken into account that, even with multiple parallel operations, the number of clock cycles required remains constant, which yields the throughput mentioned above; this also means that the maximum degree of parallelism reachable is equal to the number of available banks. Moreover, thanks to its modular structure, the architecture is meant to be easily scaled to bigger dimensions and to as many banks as needed. It could also be possible to develop a 3D structure in order to increase performance. The architecture could be easily modified to implement other types of operations. In conclusion, this architecture demonstrates that a Processing-in-Memory approach leads to a great improvement in performance. The architecture proposed here achieves very good performance and has enough flexibility to be adapted to several different algorithms.
References
Abstract Fine-resolution selection of the sample rate is not available in digital storage oscilloscopes, which rely on offline processing to cope with this need. The paper presents an algorithm that, by exploiting online processing with a digital filter characterized by dynamically generated coefficients and a memory management strategy, allows almost arbitrary selection of the sample rate from an incoming stream of samples. The paper also proposes a digital circuit implemented on FPGA to assess the achievable performance of the method.
21.1 Introduction
Analogue oscilloscopes offer a discrete set of time base settings to select the time window that is analyzed. The use of a continuously variable control is possible, but it comes at the expense of the calibration of the signal [1, 2].
In digital storage oscilloscopes (DSOs) the time base is determined by controlling the sampling rate. Again, only a discrete set of values is available [3, 4], since DSOs provide the highest sampling rate and obtain lower rates through decimation [5–7].
Flexible sample rate selection would allow more efficient usage of memory resources by providing the exact sampling rate needed for the given application. Sample rate changes can be accomplished through digital resampling approaches, but the required processing power and the need for dedicated circuitry for each sampling rate make this choice unfeasible [8–11]. Modern DSOs host powerful CPUs able to implement in real time averaging, FFT spectral analysis, parameter measurements, and selection between different acquisition modes [12, 13]. Unfortunately, these CPUs cannot resample the input stream in real time.
M. D’Arco · E. Napoli
Department Electrical and Information Technology Engineering, University of Naples Federico II,
80125 Naples, Italy
e-mail: [email protected]
E. Napoli
e-mail: [email protected]
E. Zacharelos (B)
Department Physics, Electronics and Electronic Computers, Aristotle University of Thessaloniki,
54124 Thessaloniki, Greece
e-mail: [email protected]
This paper proposes a time base system that, thanks to the simplicity of its operating principle, allows the sample rate of the digital storage oscilloscope to be selected with very fine frequency resolution up to the maximum sample rate. The proposed solution relies on a suitable memory management strategy and a dynamic digital filter.
The proposed method involves the use of a digital circuit, deployed between the ADC and the acquisition memory, that, depending on the chosen design parameters, allows a very fine regulation of the sample rate from half the system frequency, f_ck/2, up to the highest frequency, f_ck. The method does not lose generality, since a sample rate lower than f_ck/2 is easily obtained by cascading the proposed circuit with a standard one that performs decimation by an integer value. It is important to highlight that the whole acquisition chain, made up of ADC, digital circuit and memory, operates synchronously at the system clock rate f_ck.
After processing, the samples stored in the acquisition memory represent a version of the input signal resampled at a sample rate f_s = C·f_ck, where C is an arbitrary (within limits) fractional value in the interval [1/2, 1).
The digital circuit processes in real time the signal x(n) coming from the ADC and produces the output y(n). Both are produced at the highest clock rate, f_ck. The output is an estimate of the samples of the input signal, resampled at f_s = C·f_ck. The value y(n) is determined by combining the samples x(n) and x(n − 1) returned by the ADC:
(T_s = 1.314 ns) is shown with red bullets. The resampling factor is C = 0.761, and the coefficient a(n) is updated by subtracting C^−1 − 1 = 0.3141 from the current value. b(n) = 1 − a(n) represents the point inside the sampling period where resampling must be performed. The bottom axis is the time, while the top axis shows the increment of the memory pointer. When a(n) is incremented (time: 6, 10, 14, 18 in Fig. 21.2) the memory pointer is not updated.
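The update rule described above can be sketched in floating-point Python as follows. This is a behavioral illustration rather than the fixed-point circuit: it assumes the linear combination y(n) = a(n)·x(n − 1) + b(n)·x(n) with b(n) = 1 − a(n), it models the cycles in which the memory pointer is not updated by simply skipping the write, and the function name `resample` and the initial value a = 0 are assumptions.

```python
def resample(x, C):
    """Resample x (taken at f_ck) to f_s = C * f_ck, with 1/2 <= C < 1."""
    step = 1.0 / C - 1.0          # amount subtracted from a(n) at every valid cycle
    a = 0.0                       # assumed initial coefficient value
    memory = []                   # acquisition memory; len(memory) plays the pointer
    for n in range(1, len(x)):
        if a < 0.0:
            # Exception cycle: no resampling point falls in this clock period.
            # a(n) is incremented by one and the memory pointer is not advanced.
            a += 1.0
            continue
        b = 1.0 - a
        memory.append(a * x[n - 1] + b * x[n])   # interpolate between two samples
        a -= step
    return memory

# A linear ramp makes the resampling instants directly visible in the output.
ramp = [float(n) for n in range(20)]
out = resample(ramp, C=0.75)
```

For the ramp input the stored values are spaced by 1/C = 1.333 clock periods, that is, the output corresponds to sampling the same ramp at 0.75·f_ck.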
The proposed method suffers from a performance degradation when compared with the
standard technique (zero padding, low-pass filtering, decimation). However, simulated tests
with sinusoidal signals demonstrate that when the sampling clock is at least ten times
higher than the signal bandwidth, the results are satisfactory.
The performance is reported in terms of standard parameters defined for a
pure sine wave: signal-to-noise-and-distortion (SINAD) ratio and total harmonic
distortion (THD). SINAD and THD are calculated both for the input signal, corrupted by
white Gaussian noise (rms value equal to 15% of the LSB of the ADC) and quantized
by an 8-bit ADC, and for the resampled signal.
Figure 21.3 reports the result obtained by resampling at 743 MHz a 47.1 MHz signal
converted with a 1 GS/s ADC. The original signal has SINAD = 48.49 dBc and
THD = −51.50 dB. The resampled signal has SINAD = 46.97 dBc and
THD = −50.04 dB, showing quite limited degradation. Similar results are obtained when
a 50 kHz random deviation is applied to the input frequency.
21 Digital Circuit for the Arbitrary Selection of Sample … 187
A digital circuit for the implementation of the above proposed resampling algorithm
has been designed. The schematic (without pipelining) is in Fig. 21.4.
The circuit input data are the signal to be resampled, x, and the resampling factor,
defined through the input d = 1 − C^{-1}. The output data are the resampled stream
y and the memory pointer Ptr_X. The number of bits for x, d, and y is 8, while the
memory pointer Ptr_X is represented with 32 bits.
The two complementary coefficients, a and b = 1 − a, are multiplied by the
previous value and the current value of the input signal respectively. Afterwards, the
two products are summed, in order to produce the output signal, y.
The updating of the coefficient a relies on adding either the quantity d or, in the
case of exception, a unitary value to the current value of a. In the case of exception,
a is negative and the coefficient’s MSB is high, a[9] = 1. Otherwise, a[9] = 0 and
d is added to the current value of a. This distinction is realized with a
multiplexer controlled by the MSB of a. After the correct choice between “1” and
“d”, an accumulator is used for the updating of a.
A second accumulator is implemented for the memory management. When a is
positive, a[9] = 0, g = 1 and Ptr_X is incremented by a unitary value. In the case of
exception, a is negative, a[9] = 1, g = 0 and Ptr_X remains unchanged.
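The update rule described above can be modelled in software. The following Python sketch is a floating-point behavioural model, not the HDL design: the function name `resample`, the initial value a = 0, and the choice of simply skipping the memory write during an exception cycle are assumptions made for illustration. The test signal borrows the 47.1 MHz tone, the 1 GS/s clock and the factor C = 0.761 quoted in the text.

```python
import math


def resample(x, C):
    """Behavioural model of the resampler: returns the samples stored in memory.

    x : list of ADC samples taken at the system clock rate f_ck
    C : resampling factor, f_s = C * f_ck, with 1/2 <= C < 1
    """
    if not 0.5 <= C < 1.0:
        raise ValueError("C must lie in [1/2, 1)")
    d = 1.0 - 1.0 / C          # negative step added to a in a regular cycle
    a = 0.0                    # weight of the previous sample (assumed start value)
    memory = []                # acquisition memory; len(memory) plays the role of Ptr_X
    prev = x[0]
    for cur in x:
        if a >= 0.0:
            # Regular cycle: interpolate between x(n-1) and x(n), store, advance pointer
            memory.append(a * prev + (1.0 - a) * cur)
            a += d
        else:
            # Exception cycle: a is negative, add 1 and leave the pointer unchanged
            a += 1.0
        prev = cur
    return memory


# Usage: 47.1 MHz sine sampled at 1 GS/s, resampled with C = 0.761
f_norm = 0.0471                                  # signal frequency / f_ck
x = [math.sin(2 * math.pi * f_norm * n) for n in range(1000)]
y = resample(x, 0.761)
print(len(x), "input samples ->", len(y), "stored samples")
```

For C in [1/2, 1) the magnitude of d never exceeds 1, so a single unit increment always brings a back into [0, 1); this is consistent with the lower bound of 1/2 on the resampling factor.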
In Table 21.1, some basic features of the circuit are presented. C-step refers to
the difference between two consecutive values of the resampling factor. The
limitation stems from the fact that d is represented by an 8-bit number. The resolution
obtained on the resampling factor C is about 0.19%. The HDL design is implemented
on a Stratix IV GX FPGA device. Table 21.1 reports the resources needed for the
resampler.
21.6 Conclusion
The paper presented an algorithm, and its circuit implementation, for the creation
of a time base that allows fine selection of the sample rate of a digital storage scope.
The proposed algorithm shows good performance when the sampling rate, as
usual, is about ten times higher than the bandwidth of the signal. The circuit
implementation of the algorithm serves as a proof of concept, demonstrating the feasibility
of the circuit and its performance when implemented on a Stratix IV GX FPGA.
References
1. Oya JRG, Munoz F, Torralba A, Jurado A, Marquez FJ, Lopez-Morillo E (2012) Data acqui-
sition system based on subsampling using multiple clocking techniques. IEEE Trans Instrum
Meas 61(8):2333–2335
2. D’Apuzzo M, D’Arco M (2017) Sampling and time—interleaving strategies to extend high
speed digitizers bandwidth. Measurement 111:389–396
3. Monsurrò P, Trifiletti A, Angrisani L, D’Arco M (2018) Streamline calibration modelling for
a comprehensive design of ATI-based digitizers. Measurement 125:386–393
22.1 Introduction
Informative totems are tools that can provide useful information, such as finding your
way around a building or buying a train ticket at the train station in a few steps.
With the advent of Artificial Intelligence (AI), and in particular of Machine Learning
(ML) techniques, these devices can improve their performance by providing more
effective support to the user and facilitating their purpose [1]. In addition, informative
totems are an example of what are nowadays called smart-edge devices, bringing
attention to the topic of edge-computing [2].
Let us consider a significant application of smart-totems: suppose that
a teenage kid is lost inside a shopping mall and no longer able to find his mother.
He needs to find a way to rejoin her, and he knows his mother could
be inside a specific shop of the mall. While looking around, his gaze is caught by
an informative totem that is asking him if he needs help. The kid approaches the
totem and, as he touches the screen, the totem offers him some actions tailored to
the situation, among which “search for a shop”. The kid chooses this option and the
totem shows him a map with the correct path to reach the shop from his position.
The kid memorizes the path and proceeds to the shop.
From this example scenario, it is possible to identify some functional
requirements (FR) for applications targeting smart-totems, such as: (FR1) recognizing the
age and the gender of a person who is approaching the totem and (FR2)
producing information based on them. Together with functional requirements,
non-functional requirements (NFR) can also be identified: (NFR1) a system response within a
certain time, possibly under real-time requirements, and (NFR2) the adaptation to
sudden and continuous changes of the physical entities with which the system interacts.
Given these requirements, the current trend is to address them with AI
for age and gender recognition [3] and automatic adaptation, and with edge-computing
for the real-time response [2, 4]. However, nowadays most AI applications are
implemented in the cloud: for example, considering ML [5] (one of the AI techniques
used for image processing), the ML algorithms for age and gender recognition
are developed for cloud applications, without considering the limited resources of an
edge-computing system.
We place our contribution within a growing research trend: the porting of ML
algorithms, in particular neural networks (NNs), to edge devices [6]. Our proposal is an
informative totem able to recognize the age and the gender of a person who is coming
toward it and to provide a response based on this. Our goal is to satisfy FR1, FR2, NFR1
and NFR2 described above: we developed an ML algorithm (specifically, a neural network, NN) that
works on images taken with a camera (representing the edge of our system),
and we implemented the NN on an edge-computing platform located close to the
camera. We tested the proposed application on two different edge-computing
platforms, one with a CPU and one with a GPU, and we compared the results with the
principal competitors.
The paper is organized as follows: Sect. 22.2 gives an overview of
related work on the topic, and Sect. 22.3 describes our system and the experimental results,
together with a comparison with other competitors. In Sect. 22.4, some conclusions
and future work are reported.
22 An Intelligent Informative Totem Application … 193
Age and gender classification is very important for advertising and marketing,
but other potential uses include automatic ticket offices and informative totems.
Classifying the age and gender of people based on their face image is a well-known
problem in the academic literature, and several algorithms have been proposed to this end.
An exhaustive survey of methods and approaches in age and gender estimation is
given by Atallah [5], which provides an overview of the issue from 2010
to 2017. From this work, it emerges that Deep Learning (DL), and in particular
Convolutional Neural Networks (CNNs), nowadays provide the best performance
on age and gender recognition, and we are witnessing a gradual shift from
classic ML methods to DL ones. The work in [7] presents one of the first
methods adopting DL, namely a CNN, and it showed improved performance compared
with traditional feature-based methods [8], such as Support Vector Machines (SVM).
In the context of cloud-computing, there are several private companies that
provide this service through a cloud API, such as Google [9], Amazon [10] and
Sighthound [11]. The latter also contributed to the academic literature with the paper
[12], reporting 61% accuracy in age estimation.
On the other hand, in the context of edge-computing devices, there have been
implementations of age and gender recognition algorithms with ML, especially DL.
In the commercial field it is possible to find some applications that address this issue:
Axis proposes a system called Demographic Identifier [13], and Pyramics
does the same with Pysense [14]. The latter, in particular, exploits the age and gender
recognition software developed by Fraunhofer IIS [15] for embedded platforms,
which has also been used on embedded platforms other than the one considered by
Pysense.
To the best of our knowledge, there are few implementations with CNNs in the
academic literature. Azarmehr [16] proposes one of the most significant approaches,
using an SVM algorithm implemented on a quad-core Snapdragon 600. On the other
hand, Chen [17] addressed only gender recognition, using a CNN
executed on a custom architecture implemented on an FPGA. Irick [18] also reported
an Artificial Neural Network (ANN) based system executed on an architecture
implemented on an FPGA, which achieves an accuracy of 83.3% while processing roughly 30
images per second.
A schematic summary is reported in Table 22.1.
In this section, we present our system. First we present our goal, and then we
describe the system components. Then, the performed tests are shown and a
final discussion of the results is reported.
The main idea we want to present in this paper is the transfer to edge devices of a
previously developed CNN [19], able to recognize the age and gender of an individual
from an image, in order to test the relative performance of the final system. We
set up a comparison between a CPU and a GPU edge device performing the
classification made by the CNN, by observing how much execution time each device needs
to accomplish it at edge conditions.
In this way, we want to lay the foundations for a more detailed study of the
problem of moving ML algorithms, in particular NNs, for the recognition of the
age and gender of individuals to edge devices, which notoriously have tighter
constraints on computational resources than cloud devices. The CNN
and the edge devices used for comparison are described in the following paragraphs.
The proposed neural network. The considered NN is a CNN, in particular a VGG16-like
network from an architectural point of view [19]. Our algorithm classifies an individual
image into ten different classes, binding together the information of age and gender,
in order to have one NN able to perform the age/gender prediction, limiting as
much as possible the memory occupation. Indeed, having two networks that
separately perform age and gender prediction would increase the occupied memory
space, which is a non-trivial aspect for an edge-computing device. In particular, our
solution occupies approximately 600 MB, and it is able to recognize people according
to the classification method defined in [19], with an accuracy of 40% and an off-by-1
accuracy of 70%. Nevertheless, our attention is currently focused on the inference
process of the CNN, rather than on the training phase, already addressed
in [19]. Therefore, our interest is the time taken by the CNN to perform a
prediction.
The CNN is written in Python 3.6, exploiting the ML libraries Tensorflow and
Keras. Further details on the CNN are reported in [19].
The edge device. The Edge-Computing revolution makes it necessary to seek
alternatives to the use of low-profile microcontrollers, as has traditionally been done
in wireless sensor networks. When algorithms become more computing intensive,
architectures beyond classical CPUs, such as GPUs and circuits implemented on
FPGAs, can prove beneficial when used as processing platforms. Moreover,
System-on-Programmable Chips (SoPCs), integrating FPGAs with microcontrollers on the
same device, allow combining the flexibility of software with the performance of
hardware.
In this paper, we consider the comparison between a CPU and a GPU edge
platform. In particular, the device considered is the Nvidia Jetson Nano board [20]. This
platform represents a valid solution in the prospect of transferring ML
applications to edge-computing devices. It consists of a quad-core ARM® Cortex®-A57
MPCore CPU, together with an Nvidia Maxwell™ GPU architecture with 128
Nvidia CUDA® cores, and 4 GB of RAM. The comparison is made
on the same board, performing the classification of our CNN first on the ARM®
processor and then on the GPU, with the aim of obtaining the execution times of the
same ML algorithm on both architectures.
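A measurement of this kind can be organised with a small timing harness. The Python sketch below is only an illustration of how per-prediction latency could be collected; `time_inference` and `dummy_predict` are hypothetical names, and the dummy function merely stands in for the CNN forward pass (with Keras, a callable such as `model.predict` would be passed instead). Device selection between CPU and GPU is not shown.

```python
import statistics
import time


def time_inference(predict, sample, warmup=3, runs=20):
    """Return (mean, standard deviation) of the latency of predict(sample), in seconds."""
    # Warm-up calls are discarded: the first invocations often include one-off costs
    for _ in range(warmup):
        predict(sample)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.stdev(latencies)


def dummy_predict(image):
    """Hypothetical stand-in for the CNN: maps an input to one of ten classes."""
    return sum(image) % 10


mean_s, std_s = time_inference(dummy_predict, list(range(10_000)))
print(f"mean latency: {mean_s * 1e3:.3f} ms (std {std_s * 1e3:.3f} ms)")
```

Running the same harness once per processing element gives directly comparable execution times, provided the input and the number of runs are kept identical.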
22.3.2 Results
We executed our CNN algorithm on both processing elements of the same Nvidia
Jetson Nano board, first on the ARM, then on the GPU. The results obtained are
shown in Table 22.2.
As expected, the ARM provides worse performance than the GPU, emphasizing
the importance of using hardware accelerators also at the edge.
22.3.3 Discussion
The results shown in Sect. 22.3.2 are preliminary, and further refinements are needed
in order to get better timing performance. As mentioned in Sect. 22.2, our direct
competitors are Demographic Identifier [13] and Pysense [14], which propose
commercial solutions with an application oriented to retail. However, Pysense exploits
the recognition software developed by Fraunhofer IIS [15], which provides further
results of its software on other edge-computing platforms. Finally, from the academic
literature, we consider the paper [16]. We summarized all this publicly available
information in Table 22.3, where we compare our solution with the features of the
competitors.
Looking at the table, it can be seen that not all the information is available. Despite
this, our solution remains attractive compared with the others. At
the moment, it is the only one that uses a single algorithm for the estimation
of both age and gender, which means less memory occupied on the edge
device; furthermore, it exploits a Deep Learning (DL) approach, unlike the approach
proposed by Azarmehr [16], which uses an SVM algorithm. CNN methods generally
outperform SVM methods, since deep learning is known to perform well
when large training sets are used [21]. In this regard, Chen [17] provides an
application on FPGA with fair results, but its algorithm is only able to recognize
the gender of an individual.
For the Axis solution [13], we have no information, so a consistent comparison
is not possible. For the Pyramics solution [14], and hence the Fraunhofer IIS software
[15], we have no information about the age and gender prediction algorithm used.
The timing performance is better in their case, but no accuracy figure for
age estimation is available.
The terms of comparison with competitors are still unclear, due to the limited
availability of information on the two requirements considered (age and gender
recognition on edge devices). Indeed, in some cases, the other solutions do not report details
on the ML algorithm used for age and gender recognition. Others, on the other hand,
22.4 Conclusions
In this paper we presented our idea targeting an informative totem, its possible usage
and the requirements it needs to satisfy. We presented our solution, which implements a
CNN on an edge device. Tests on timing performance have been carried out and a
comparison with the principal competitors is reported. The latter has not been easy because
not all the information is available for each competitor. Our solution still
lacks accuracy, but we expect to improve it according to the edge platform we are
going to use. A hardware/software co-design of the entire system is required, taking
into account the CNN architecture, the hardware and compression techniques for the
NN. In the future, we will also widen the comparison to FPGA-based platforms.
References
1. Di Mascio T, Gennari R, Melonio A, Tarantino L (2014) Engaging New users into design
activities: the TERENCE experience with children. In: Smart organizations and smart artifacts,
pp 241–250
2. Satyanarayanan M (2017) The emergence of edge-computing. Computer 50(1):30–39
3. Shi W, Cao J, Zhang Q, Li Y, Xu L (2008) Artificial intelligence techniques: an introduction
to their use for modelling environmental systems. Math Comput Simul 78:379–400
4. Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge-computing: vision and challenges. IEEE Internet
Things J 3:637–646
5. Atallah RR, Kamsin A, Ismail MA, Abdelrahman SA, Zerdoumi S (2018) Face recognition and
age estimation implications of changes in facial feature: a critical review study 6:28290–28304
6. Li H, Ota K, Dong M (2018) Learning IoT in edge: deep learning for the internet of things
with edge-computing. IEEE Netw 32(1):96–101
7. Levi G, Hassner T (2015) Age and gender classification using convolutional neural networks.
In: 28th IEEE conference on computer vision and pattern recognition (CVPR), pp 34–42, IEEE
Press, Boston
8. Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE
Trans Inf Forensics Secur 9:2170–2179
9. Google Vision API https://cloud.google.com/vision/?source=post_page
10. Amazon Rekognition https://aws.amazon.com/it/rekognition/?source=post_page
11. Sighthound Recognition API https://www.sighthound.com/products/cloud
12. Dehghan A, Ortiz EG, Shu G, Masood SZ (2017) DAGER: deep age, gender and emotion
recognition using convolutional neural networks. arXiv:1702.04280
198 P. Giammatteo et al.
Abstract Frame jitter occurs when the delay between a trigger and the start of a
signal acquisition or signal generation differs among subsequent data frames.
Test bench waveform signal generators feature low frame jitter (e.g. 400 ps rms),
but this performance is still insufficient for the instrument to be used in sensitive
applications like Doppler velocimetry. In this work a circuit is presented that
synchronizes on-the-fly an internal clock to every occurrence of an external trigger. It
is implemented in a Field Programmable Gate Array (FPGA) and features a frame
jitter lower than 100 ps rms.
23.1 Introduction
Frame jitter occurs in instruments or systems that acquire or produce frames of data
synchronized by a trigger [1]. This is the case, for example, of a waveform function
generator that produces sinusoidal bursts triggered by an external pulse sequence.
Small temporal differences between the trigger active edge and the actual start of the
burst generation represent the frame jitter.
Applications like interferometric radar [2] or Doppler velocimetry are quite
sensitive to this problem. For example, in ultrasound Doppler for biomedical [3] or
industrial velocimetry [4], bursts of ultrasound are transmitted every Pulse Repetition
Interval (PRI). The target produces an echo whose phase changes depending on
its position among subsequent PRIs. Target velocity is detected by reading the phase
changes that occur in subsequent data frames (PRIs). Unfortunately, frame jitter
directly alters the signal phase, affecting the accuracy of the velocity measurement.
Test bench function generators like, e.g., the 33612A from Keysight Technologies Inc.
(Santa Rosa, CA, USA) feature a frame jitter as low as 320 ps rms. However, a jitter
lower than 100 ps rms is desirable in most of the aforementioned applications, so the
use of test-bench instrumentation may not be feasible.
In this paper a resynchronization circuit is presented that produces a clock whose
phase is synchronized on-the-fly to an input trigger. The synchronization occurs
for every trigger pulse. The generated clock can be used to generate/acquire data
frames with low frame jitter. The circuit is implemented in a Field Programmable
Gate Array (FPGA) of the Cyclone III family (Altera-Intel, Santa Clara, CA, USA)
[5]. The next section describes the architecture of the proposed circuit, and in Sect. 23.3
experiments show how the proposed circuit limits the frame jitter below 100 ps rms.
System A and System B work with the independent and asynchronous clocks clkA
and clkB , respectively (see Fig. 23.1). System A generates periodic events to System
B signaled by the active edge of the Sync signal. Every time an edge on Sync is
received, the Phase Alignment Circuit (PAC), embedded in the FPGA of the System
B, tunes the phase of the clkS to the phase of the trigger.
System B exploits clkS for generating and/or acquiring data with low frame jitter
with respect to the trigger.
The proposed architecture of the PAC is sketched in Fig. 23.2. A Tapped-Delay-Line
(TDL), typically employed in Time-to-Digital converters [6], performs a fine
measurement of the temporal delay between the Sync signal and the clkTDL signal.
The delay is represented by the number N of delay elements crossed in the TDL
(see Sect. 23.2.1) by Sync before the clkTDL edge occurs. N is represented
by a thermometric code, converted into a binary value by the following encoder. The
Calibration RAM (C-RAM) stores at address N the number of phase steps necessary
for the correction. The clkS phase is tuned by accessing the Phase-Locked Loop
(PLL) through the Phase Shift Control interface [5]. This operation affects the phase
of clkS only, thus the PLL never loses the lock condition. The Calibration Unit (CU)
populates the C-RAM once at system switch-on. Details of the mentioned blocks are
given in the following sections.
A TDL consists of a set of delay elements followed by registers, as shown in Fig. 23.3.
Each delay element-register pair represents a TDL Cell and returns the status (“Cell
Status” register) of that cell at the clkTDL rate [7]. The TDL is fed by the Sync signal,
which propagates in the delay line as sketched on the left of Fig. 23.3. At time
t0, a Sync edge enters the TDL and crosses the delay elements as it propagates (t1,
…, tn), leading to a variation of the Cell Status register. At the first rising edge of
clkTDL after the Sync edge has entered the delay line, the Cell Status register represents the
number of elements crossed by the Sync edge. In particular, the phase information is
stored in the position of the 0-to-1 transition in the bits of the register, which is detected
by the following encoder. In the example shown in Fig. 23.3, the clkTDL edge stops
the register sampling at t3, after Sync has crossed 3 delay elements. The delay Tm is
quantified given the delay of each Cell, which is obtained through the calibration
process detailed in the next section.
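The role of the encoder can be illustrated with a short behavioural model. The Python sketch below is an assumption-laden illustration, not the FPGA encoder: it assumes that crossed cells read as 1, that index 0 is the first cell of the line, and that the code has no bubbles; the function names and the uniform 45 ps cell delay are chosen only for the example.

```python
def crossed_cells(cell_status):
    """Decode a thermometric Cell Status word into the number N of crossed cells.

    cell_status: sequence of 0/1 bits, index 0 = first delay element of the line.
    Assumes a clean thermometer code (a run of 1s followed by 0s).
    """
    n = 0
    for bit in cell_status:
        if bit != 1:
            break            # first cell not yet reached by the Sync edge
        n += 1
    return n


def measured_delay_ps(cell_status, cell_delays_ps):
    """Sum the calibrated delays of the cells crossed by the Sync edge."""
    n = crossed_cells(cell_status)
    return sum(cell_delays_ps[:n])


# Example matching the case of three crossed delay elements
status = [1, 1, 1, 0, 0, 0, 0, 0]
delays = [45] * 8                              # illustrative per-cell delays in ps
print(crossed_cells(status))                   # 3
print(measured_delay_ps(status, delays))       # 135
```

A practical encoder usually also has to tolerate isolated wrong bits near the transition, which this sketch deliberately ignores.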
The FPGA implementation of a TDL requires a deep knowledge of the target
device architecture and of the tools for constraining the physical placement. The
realization of “small” and uniform delay elements (order of tens of ps) is the
main issue. A typical solution consists in exploiting the “carry” logic normally used
to realize adders, counters, etc. [7]. Indeed, the carry routing paths are better matched
with each other than the general routing paths of the FPGA, and grant delays below
70–80 ps, depending on the target device. The carry logic can be used by realizing an
adder and by forcing its inputs to “0” and “1”, so that the output depends on the adder
carry-in value only. Then, an N-bit adder realizes an N-Cell TDL. Moreover, each bit
of the adder can be implemented in a Logic Element (LE), which is the basic unit of
the Cyclone III FPGA and includes a register (FF) as well. The latter must be used
as the Cell register to reduce the path between adder and register, minimizing the skew
between the outputs of the Cells. In the Cyclone III device, the LEs are arranged in
groups of 16 called Logic Array Blocks (LABs) [5]: to realize an N-Cell TDL with N >
16, more LABs are necessary. Specific constraints should be set in the “Design Partition
Planner” and “Chip Planner” tools of the Quartus II software to direct the fitter to use
consecutive LABs that have a carry delay similar to that among LEs. Constraints are
also given so that the fitter will use the LUT and FF of the same LE to implement each
single TDL Cell [5].
All these considerations make it possible to implement a reliable and reproducible structure
that cannot be obtained with the typical FPGA design flow, where the fitter is free to
place the logic according to general optimization strategies.
In order to know the delay associated with each Cell of the TDL, a calibration is required.
Moreover, the calibration compensates for the TDL delay deviations due to
temperature and power supply variations. There are two different calibration processes:
“double registration” and “statistical” calibration [8]. The first approach is the
simplest, but only the mean value of the Cell delays is estimated. Although this is not
sufficient for an accurate phase measurement, it is useful for an initial estimation of
how many delay cells are required in the TDL:
$$N_{TDL} = \frac{t_{TDL}}{t_{cell}^{mean}} \qquad (1)$$
line. This approach results in a “calibration curve”, like the one shown in Fig. 23.4.
This procedure is implemented in the Calibration Unit of Fig. 23.2, which stores the
calibration curve in the C-RAM.
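One common way to carry out a statistical calibration of a delay line is a code-density estimate: if the hits are uncorrelated with the sampling clock, the fraction of hits that stop in a given cell is proportional to the width of that cell. The Python sketch below assumes this variant and builds a cumulative calibration curve from it. The function names, the synthetic cell widths and the random hit generator are hypothetical and serve only to exercise the estimate; they are not taken from the Calibration Unit.

```python
import bisect
import itertools
import random


def estimate_cell_widths(hit_bins, n_cells, clock_period_ps):
    """Code-density estimate: width of cell i = period * (hits in cell i) / (all hits)."""
    counts = [0] * n_cells
    for b in hit_bins:
        counts[b] += 1
    total = len(hit_bins)
    return [clock_period_ps * c / total for c in counts]


def calibration_curve(widths_ps):
    """Cumulative delay after N crossed cells, for N = 0 .. len(widths_ps)."""
    return [0.0] + list(itertools.accumulate(widths_ps))


# Synthetic delay line with uneven cells covering one 400 ps period (illustrative)
true_widths = [40, 50, 45, 60, 35, 55, 45, 70]
period = sum(true_widths)
edges = list(itertools.accumulate(true_widths))      # upper edge of each cell

# Hits uniformly distributed over the period, i.e. uncorrelated with the clock
rng = random.Random(0)
hits = [min(bisect.bisect_right(edges, rng.uniform(0, period)), len(edges) - 1)
        for _ in range(100_000)]

widths = estimate_cell_widths(hits, len(true_widths), period)
curve = calibration_curve(widths)
print([round(w, 1) for w in widths])
```

The entry of such a curve at index N is the delay associated with N crossed cells, which is the quantity that then has to be translated into a number of phase steps.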
The fine phase measurements performed by the TDL and converted by the encoder are
used to tune an Altera-Intel PLL. The PLL allows the dynamic shift of the phase of its
output clocks relative to the reference. This is achieved through the “dynamic phase
shifting” interface [5]. The phase shifting is performed in steps whose resolution
depends on the voltage controlled oscillator frequency f_VCO:
$$res_{shift} = \frac{1}{8 \cdot f_{VCO}} \qquad (2)$$
The “PLL PS Control” block of Fig. 23.2 connects to the phase shift interface.
The commands for the shift steps (step up/down, output selection, step strobe) are
serialized with the interface clock, clkPSI. Each step requires 5 clock cycles of clkPSI,
which corresponds to 50 ns for clkPSI = 100 MHz. For example, a phase rotation of
20 steps takes 1 µs. This time can be reduced by raising the clkPSI frequency.
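The step resolution of Eq. (2) and the time needed for a phase rotation can be checked with a few lines of arithmetic. The Python sketch below uses exact fractions to avoid rounding; the function names are illustrative, and the 600 MHz VCO frequency is the value used in the experiments.

```python
from fractions import Fraction


def phase_step_ps(f_vco_hz):
    """Phase-shift resolution of Eq. (2), 1 / (8 * f_VCO), expressed in picoseconds."""
    return Fraction(10**12, 8 * f_vco_hz)


def shift_time_ns(steps, f_psi_hz, cycles_per_step=5):
    """Time needed to apply `steps` phase steps, with 5 clkPSI cycles per step."""
    return Fraction(steps * cycles_per_step * 10**9, f_psi_hz)


print(float(phase_step_ps(600_000_000)))        # about 208.3 ps
print(float(shift_time_ns(1, 100_000_000)))     # 50.0 ns per step at 100 MHz
print(float(shift_time_ns(20, 100_000_000)))    # 1000.0 ns, i.e. 1 µs for 20 steps
```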
204 D. Russo and S. Ricci
Fig. 23.5 High persistence display of the scope during jitter measurements. Tests were done with the
re-phasing circuit inactive (left) and active (right). Time scale is 4 ns/division
The proposed circuit was implemented in the Cyclone III FPGA of an in-house
board [9]. The circuit included a 256-Cell TDL working with a clkTDL of 100 MHz.
The VCO frequency was set to 600 MHz, corresponding to a phase step res_shift =
208 ps. The clkPSI was 100 MHz, thus in the worst case the phase was aligned within
2.4 µs after the Sync pulse. The re-phased clock, i.e. clkS, was 100 MHz as well.
The “double registration” was performed to assess the mean delay of the Cells,
which resulted in t_cell^mean = 45 ps. Since the TDL consists of 256 Cells, the total
delay was 256 · 45 ps = 11.52 ns, suitable to cover the 10 ns of the clkTDL period.
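The figures quoted for the experimental setup are mutually consistent, as the following Python check suggests. It is a back-of-the-envelope sketch: the constant names are illustrative, and the reading of the 2.4 µs worst case as a rotation over a full clkTDL period is an assumption that happens to reproduce the quoted value.

```python
from fractions import Fraction

CLK_TDL_PERIOD_PS = 10_000           # 100 MHz clkTDL
MEAN_CELL_DELAY_PS = 45              # from the "double registration" measurement
N_CELLS = 256                        # implemented TDL length
STEP_PS = Fraction(10**12, 8 * 600_000_000)   # Eq. (2) with f_VCO = 600 MHz
STEP_TIME_NS = 50                    # 5 cycles of the 100 MHz clkPSI per step

# Eq. (1): minimum number of cells needed to cover one clkTDL period (ceiling division)
min_cells = -(-CLK_TDL_PERIOD_PS // MEAN_CELL_DELAY_PS)

# Total delay of the implemented line
total_delay_ps = N_CELLS * MEAN_CELL_DELAY_PS

# Assumed worst case: the phase has to be rotated over a whole clkTDL period
worst_case_steps = CLK_TDL_PERIOD_PS / STEP_PS
worst_case_time_ns = worst_case_steps * STEP_TIME_NS

print(min_cells)                     # 223
print(total_delay_ps)                # 11520 ps = 11.52 ns
print(worst_case_steps)              # 48
print(worst_case_time_ns)            # 2400 ns = 2.4 µs
```

With 223 cells as the minimum, the 256-Cell line leaves some margin for cells that turn out faster than the mean.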
For the experiment the circuit was connected to the function generator 33612A
(Keysight Technologies Inc., Santa Rosa, CA, USA) and the oscilloscope TDS5104
(Tektronix, Inc., Beaverton, OR, USA). In particular, the function generator produced
a pulse every 1 ms, connected to the Sync input of the proposed circuit and to the
trigger input of the scope. A pulse generated by the proposed circuit from the
re-phased clkS was visualized and acquired by the scope, triggered by the original pulse.
Figure 23.5 shows on the left the output pulse when the resynchronization circuit
was not enabled, i.e. clkS = clkTDL. As expected, the positive edge position varies
with respect to the trigger in a 10 ns range, i.e. the period of the sampling clock. Once
the resynchronization is enabled, the range of variation of the pulse edge reduces
significantly, as shown on the right of Fig. 23.5. In this last case, the jitter measured
was less than 90 ps rms.
23.4 Conclusion
The proposed circuit is able to dynamically adjust the phase of an internally generated
clock to the rising edge of an input trigger. The tuning occurs at every trigger pulse.
The re-phased clock features a jitter lower than 100 ps rms, making the circuit suitable for
sensitive applications like, for instance, Doppler velocimetry [10] or Time of Flight
(ToF) measurements [11].
23 FPGA-Based Clock Phase Alignment Circuit … 205
References
1. Kalashnikov AN, Challis RE, Unwin ME, Holmes AK (2005) Effects of frame jitter in data
acquisition systems. IEEE Trans Instrum Meas 54(6):2177–2183. https://doi.org/10.1109/TIM.2005.858570
2. Pieraccini M, Miccinesi L (2019) Ground-based radar interferometry: a bibliographic review.
Remote Sens 11(9):1029. https://doi.org/10.3390/rs11091029
3. Ricci S, Ramalli A, Bassi L, Boni E, Tortoli P (2018) Real-time blood velocity vector
measurement over a 2D region. IEEE Trans Ultrason Ferroelect Freq Control 65(2):201–209.
https://doi.org/10.1109/TUFFC.2017.2781715
4. Birkhofer B, Debacker A, Russo S, Ricci S, Lootens D (2012) In-line rheometry based on
ultrasonic velocity profiles: comparison of data processing methods. Appl Rheol 22(4):44701.
https://doi.org/10.3933/ApplRheol-22-44701
5. Cyclone III Device Handbook, CIII 5V1-4.2, Altera Corp (2012)
6. Roberts GW, Ali-Bakhshian M (2010) A brief introduction to time-to-digital and
digital-to-time converters. IEEE Trans Circ Syst II-Express Briefs 57(3):153–157.
https://doi.org/10.1109/TCSII.2010.2043382
7. Dadouche F, Turko T, Uhring W, Malass I, Dumas N, Le Normand JP (2015) New design-
methodology of high-performance TDC on a low cost FPGA targets. Sens Transducers J
193(10):123–134
8. Wu J (2010) Several key issues on implementing delay line based TDCs using FPGAs. IEEE
Trans Nucl Sci 57(3):1543–1548. https://doi.org/10.1109/TNS.2010.2045901
9. Ricci S, Meacci V, Birkhofer B, Wiklund J (2017) FPGA-based system for in-line measurement
of velocity profiles of fluids in industrial pipe flow. IEEE Trans Ind Electron 64(5):3997–4005.
https://doi.org/10.1109/TIE.2016.2645503
10. Ricci S, Vilkomerson D, Matera R, Tortoli P (2015) Accurate blood peak velocity estima-
tion using spectral models and vector doppler. IEEE Trans Ultrason Ferroelect Freq Control
62(4):686–696. https://doi.org/10.1109/TUFFC.2015.006982
11. Marino-Merlo E, Bulletti A, Giannelli P, Calzolai M, Capineri L (2018) Analysis of errors in
the estimation of impact positions in plate-like structure through the triangulation formula by
piezoelectric sensors monitoring. Sensors 18(10):E3426. https://doi.org/10.3390/s18103426
Chapter 24
Real-Time Embedded System
for Event-Driven sEMG Acquisition
and Functional Electrical
Stimulation Control
24.1 Introduction
(FES) [1], with the aim of physiologically controlling the muscle functional restoration
as much as possible [2]. In particular, FES employs low energy current pulses to
modulate the muscle contraction [3] following this approach: a complex stimulation
pattern, useful to activate the group of muscles involved in a movement, is regulated
by sEMG envelope evaluation or by muscle force indicators (e.g., RMS, ARV) [4].
In a practical application, sEMG processing and FES control are a fundamental
task to be carried out in real time [5]. Since the run-time performance bottleneck
could easily be related to the use of a general purpose computer for the FES control
(often concurrently running, or loaded with, many other unrelated applications or
functionalities, leading to unpredictable performance), here the idea is to replace it
with a dedicated embedded system. In this regard, the major concerns are the
effectiveness and safety of the stimulation and the resulting performance, i.e., a latency
short enough to fulfill the real-time constraints and the quality of the stimulated
movement.
We propose an embedded bio-mimetic FES system based on the Average Threshold
Crossing (ATC) event-driven technique applied to the sEMG signal. The ATC,
which essentially compares the sEMG signal with a threshold, enables the
implementation of a low-complexity on-board feature extraction process directly in
hardware [6, 7], able to support, e.g., the recognition of different gestures [8]. The
minimal data size of the ATC information [7] and its sparsity (due to its event-driven
nature) match well the limited computational capabilities of an embedded system.
Evolving from the architecture presented in the previous work [7], and with the aim of
making the system portable and improving the run-time performance, we replaced
the personal laptop, and the software based on the Matlab® and Simulink® environment,
with a Raspberry Pi 3 B+ as the processing and control core of the system,
running multi-platform software. Its main tasks are the management of the sEMG
multi-channel wireless acquisition, the computation and update of the FES parameters
from the ATC data, and the safe control of the stimulator. The software also features
a Graphical User Interface (GUI) to monitor and control every aspect of the
system and to guide the user in setting up different stimulation sessions.
24.2.1 Hardware
The developed system represented in Fig. 24.1 can be conceptually divided into three
main parts: the sEMG acquisition modules and the articular electro-goniometers as
inputs, the Raspberry Pi acting as central control and processing unit, and the FES
stimulator. The sEMG acquisition can be performed using two different types of device,
depending on the application: we provide a complete four-channel board
(a), suitable for multiple-muscle monitoring on the same limb, or four single sEMG
modules (b), that can be employed individually or as a group on different body regions.
In both cases the ATC is implemented in hardware, using the standard window of 130
ms [6], and the data are wirelessly transmitted via Bluetooth Low Energy (BLE) to
the Raspberry Pi. Moreover, we developed digital articular electro-goniometers (c)
that can be employed as an optional input when the user needs visual feedback
on the angular limb motion to help evaluate the running stimulation.
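The windowed threshold-crossing count at the heart of the ATC can be sketched in a few lines of Python. This is a software illustration only: in the system described here the ATC is computed in hardware. The sampling rate, the threshold, the choice of counting upward crossings, and the sine test signal are all assumptions made for the example; only the 130 ms window comes from the text.

```python
import math

WINDOW_S = 0.130  # ATC window length taken from the text


def count_crossings(samples, threshold):
    """Count upward crossings of `threshold` in a list of samples."""
    return sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev <= threshold < cur
    )


def atc_per_window(samples, fs_hz, threshold, window_s=WINDOW_S):
    """Split the signal into consecutive windows and count the events in each."""
    n = int(round(window_s * fs_hz))
    return [
        count_crossings(samples[i:i + n], threshold)
        for i in range(0, len(samples) - n + 1, n)
    ]


# Hypothetical test signal: a 100 Hz sine sampled at 10 kHz (not sEMG data).
FS_HZ = 10_000
signal = [math.sin(2 * math.pi * 100 * i / FS_HZ) for i in range(3 * 1300)]
atc = atc_per_window(signal, FS_HZ, threshold=0.5)
```

Because each window yields a single event count, the information to transmit per channel is one small integer every 130 ms, which is in line with the minimal data size mentioned for the ATC.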
On the output side, we employ the commercial, medically certified RehaStim2 stimulator
provided by HASOMED GmbH, which is able to generate
biphasic rectangular current pulses on up to eight channels simultaneously [9]. The
stimulator is interfaced with an external device by means of the ScienceMode2 bidi-
rectional communication protocol [10], which supports the control of complex stim-
ulation patterns and training scenarios since intensity, pulse-width, and frequency are
user-selectable pulse-by-pulse. The Raspberry Pi runs the main software, including
the GUI, controls and acquires data from the input devices, processes the data and
generates the stimulation patterns in real-time, and controls the FES stimulator.
24.2.2 Software
The software is based on an object-oriented design in order to promote flexibility
and modularity [11] (e.g., leveraging encapsulation, inheritance, and composition
features), both to enable a seamless integration of different devices (e.g., input
devices, see Sect. 24.2.1) and to enable the future development of new processing
algorithms. A multi-threaded architecture has been developed in order to map the
functional tasks onto different running threads [12], so as to optimize the use of
computational resources and to avoid complex (run-time) code interdependencies. From
the development standpoint, we based the software on the Python language, because
of its cross-platform nature, its widespread adoption, and the large availability of
third-party multi-platform libraries (in particular, we used the standard library for
implementing the multi-threading features, and the Kivy library [13] for the GUI).
Referring to Fig. 24.1, the GUI is organized in four full-screen views, through
which the user is able to properly configure and perform the system actions. After
the login, the Initialization view allows the user to set the acquisition and stimulation
parameters directly or by retrieving them from a database or through a calibration
procedure. In the last case a dedicated Calibration view guides the user through
the specific steps. Once the parameters have been set, the user can modify or save
them using the Parameters view. The Main Stimulation view is the core of the GUI,
allowing the user to start/stop the stimulation session, and providing visual feedback
by showing both the FES intensity and the angular information.
A calibration procedure, divided into four sub-steps, is essential to define the
ATC-FES control parameters on a per-user basis. First, the ATC threshold is set just
above the sEMG baseline in order to maximize the threshold crossing events with
minimal muscle effort. Then, the maximum ATC value and the maximal current
intensity are evaluated in order to establish the proper relationship between acquisition
and stimulation data. Finally, the Angular Range Of Motion (AROM) is evaluated.
In this way, we obtain a calibrated set of parameters enabling the implementation
of a simple, yet effective, ATC-FES control algorithm based on lookup
tables.
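A lookup-table control of this kind can be sketched as follows. The text does not specify the shape of the ATC-to-current relationship, so the linear mapping, the function names, and the numeric calibration values below are assumptions used purely for illustration.

```python
def build_atc_to_current_lut(atc_max, i_max_ma):
    """Entry k holds the stimulation current (mA) for an ATC value of k.

    A linear map from [0, atc_max] to [0, i_max_ma] is assumed here; the
    calibrated relationship used by the real system is not given in the text.
    """
    if atc_max <= 0 or i_max_ma < 0:
        raise ValueError("calibration values must be positive")
    # Integer arithmetic with rounding to the nearest mA.
    return [(k * i_max_ma + atc_max // 2) // atc_max for k in range(atc_max + 1)]


def current_for_atc(lut, atc_value):
    """Clamp the ATC value into the calibrated range before the lookup."""
    k = max(0, min(atc_value, len(lut) - 1))
    return lut[k]


# Hypothetical calibration result: ATC_max = 20 events, I_max = 30 mA.
lut = build_atc_to_current_lut(atc_max=20, i_max_ma=30)
```

Clamping the index means that an ATC value above the calibrated maximum can never request more than the calibrated maximal current, which fits the safety concern raised in the introduction.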
The multi-threading structure of the system and the running state of the involved
threads during a typical stimulation session are reported in Fig. 24.2. The Main Thread
runs throughout the session, waiting for user input and creating child threads: the
FES Control thread manages the communication with the RehaStim2 and also provides the
watchdog timer function; ATC_th, ATC_max, AROM_max and I_max represent the four
calibration steps, which trigger the acq (data acquisition) threads.
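The division of work between a producer of ATC-derived updates and an FES control thread with a watchdog can be sketched with the Python standard library. The behavior on timeout (forcing the intensity to zero), the timeout value, and all names below are assumptions for illustration; the text only states that the FES Control thread provides a watchdog timer function.

```python
import queue
import threading

WATCHDOG_S = 0.2  # hypothetical timeout; the real value is not given in the text


def fes_control(updates, applied):
    """Consume intensity updates; force zero output if none arrives in time."""
    while True:
        try:
            intensity = updates.get(timeout=WATCHDOG_S)
        except queue.Empty:
            applied.append(0)  # watchdog expired: fall back to a safe state
            return
        if intensity is None:  # explicit stop request from the main thread
            applied.append(0)
            return
        applied.append(intensity)


updates = queue.Queue()
applied = []  # stands in for the commands sent to the stimulator
for value in (5, 8, 12):  # stands in for data produced by an acquisition thread
    updates.put(value)

worker = threading.Thread(target=fes_control, args=(updates, applied))
worker.start()
worker.join()  # returns once the watchdog has fired
```

Using the blocking `get` with a timeout keeps the watchdog inside the control loop itself, so no separate timer thread is needed in this sketch.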
Fig. 24.3 a Distribution of the elapsed time between ATC data reception and FES control
update. b Recorded angular signals (blue: therapist, red: patient) associated with a single
movement repetition, along with the applied stimulation current (black dashed line)
212 F. Rossi et al.
24.4 Conclusion
References
Abstract Deep Neural Networks (DNNs) are being used in more and more fields.
Among others, automotive is one of the fields where deep neural networks are being
exploited the most. An important aspect to be considered is the real-time constraint
that this kind of application puts on neural network architectures. This poses the need
for fast and hardware-friendly information representation. The recently proposed
Posit format has proved to be extremely efficient as a low-bit replacement of
traditional floats. This format has already allowed the construction of a fast approximation
of the sigmoid function, an activation function frequently used in DNNs. In this
paper we present a fast approximation of another activation function widely used in
DNNs: the hyperbolic tangent. In the experiments, we show how the approximated
hyperbolic tangent outperforms the approximated sigmoid counterpart. The implication
is clear: the Posit format once again proves to be DNN friendly, with important
outcomes.
25.1 Introduction
The use of deep neural networks (DNNs) as a general tool for signal and data processing
is increasing both in industry and academia. One of the key challenges is the
cost-effective computation of DNNs in order to ensure that these techniques can be
implemented at low cost, at low power and in real time for embedded applications in
IoT devices, robots, autonomous cars and so on. To this aim, an open research field
is devoted to the cost-effective implementation of the main operators used in DNNs,
among them the activation function. The basic node of a DNN computes the sum
of products of the inputs (X) and their corresponding weights (W) and then applies an
activation function f(·) to it to get the output of that layer and feed it as an input
to the next layer. If we do not apply an activation function, the output signal
is simply a linear function of the inputs, which has a low complexity but is not
powerful enough to learn complex (typically non-linear) mappings from data. This is
why the most used activation functions, like Sigmoid, Tanh (hyperbolic tangent) and
ReLu (rectified linear unit), introduce non-linear properties into the DNN [1, 2]. Choosing
the activation function for a DNN model must take into account various aspects of
both the considered data distribution and the underlying information representation.
Moreover, for decision-critical applications like machine perception for robots and
autonomous cars, the implementation accuracy is also important.
Indeed, one of the main trends in industry to keep the complexity of DNN
computation low is to avoid complex arithmetic like double-precision floating point (64-bit)
and to rely instead on much more compact formats like BFLOAT or Flexpoint [3, 4] (i.e.
a revised version of the 16-bit IEEE-754 floating point format adopted by Google
Tensor Processing Units and Intel AI processors) or on transprecision computing [5,
6] (e.g. the latest Turing GPU from NVIDIA supports INT32, INT8, INT4, fp32
and fp16 computation [5]). To this aim, this paper presents a fast approximation of
the hyperbolic tangent activation function combined with a new hardware-friendly
information representation based on the Posit numerical format.
Hereafter, Sect. 25.2 introduces the Posit format and the CppPosit library imple-
mented at University of Pisa for the computation of the new numerical format.
Section 25.3 introduces the hyperbolic tangent and its approximation. Implemen-
tation results when the proposed technique is applied to DNN with known bench-
mark dataset are reported in Sect. 25.4, where also a comparison with other known
activation functions, like sigmoid, is discussed. Conclusions are drawn in Sect. 25.5.
Fig. 25.2 Two examples of 16-bit Posit with 3 bits for exponent (es = 3). In the upper the numerical
value is: (221/256 is the value of the fraction, 1 + 221/256 is the
mantissa). The final value is therefore 1.907348 × 10⁻⁶ · (1 + 221/256) = 3.55393 × 10⁻⁶. In
the lower the numerical value is: (40/512 is the value of the fraction, 1
+ 40/512 is the value of mantissa). The final value is therefore 2048 · (1 + 40/512) = 2208
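The two values in the caption can be reproduced from the general Posit decoding rule, value = (−1)^s · useed^k · 2^e · (1 + f), with useed = 2^(2^es). The sketch below works on already-extracted fields rather than on bit strings. The regime values k and exponent values e are not stated in the caption; they are inferred here from the quoted scale factors (1.907348 × 10⁻⁶ = 2⁻¹⁹ and 2048 = 2¹¹) and should be read as assumptions, as should the function name.

```python
from fractions import Fraction


def posit_value(k, e, frac_num, frac_bits, es=3, sign=0):
    """Value of a posit from its decoded fields (exact rational arithmetic)."""
    useed = 2 ** (2 ** es)  # 256 when es = 3
    scale = Fraction(useed) ** k * Fraction(2) ** e
    mantissa = 1 + Fraction(frac_num, 2 ** frac_bits)
    value = scale * mantissa
    return -value if sign else value


# Upper example: 256**-3 * 2**5 = 2**-19, fraction 221/256 (8 fraction bits).
upper = posit_value(k=-3, e=5, frac_num=221, frac_bits=8)
# Lower example: 256**1 * 2**3 = 2048, fraction 40/512 (9 fraction bits).
lower = posit_value(k=1, e=3, frac_num=40, frac_bits=9)
```

With es = 3 the inferred field widths add up to 16 bits in both cases (sign, regime, three exponent bits, and 8 or 9 fraction bits), which is consistent with the 16-bit format named in the caption.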
In this work we are going to use the cppPosit library, a modern C++14 implementation
of the original Posit number system. The library identifies four different
operational levels (L1–L4):
– L1 operations are the ones involving bit-manipulation of the posit, without decod-
ing it, considering it as an integer. L1 operations are thus performed on ALU and
are fast.
– L2 operations involve unpacking the Posit into its four different fields, with no
exponent computation.
– L3 operations instead involve full exponent unpacking, but without the need to
perform arithmetic operations on the unpacked fields (examples are converting
to/from float, posit or fixed point).
– L4 operations require the unpacked version to perform software/hardware floating
point computation using unpacked fields.
L1 operations are the most interesting, since they are the most efficient ones. L1
operations include inversion, negation, comparisons and absolute value. Moreover,
when esbits = 0, L1 operations also include doubling/halving, the 1's complement (when
the specific Posit value falls within the range [0, 1]) and an approximation
of the sigmoid function, called here FastSigmoid and described in [9]. Table 25.1
reports some implemented L1 operations, stating whether the formula is exact or an
approximation and the operation requirements in terms of Posit configuration and
value. It is important to underline that every effort put into finding an L1 expression
for some function or operation has two advantages: faster execution when using
a software-emulated PPU (Posit Processing Unit), and a smaller area (i.e.
fewer transistors) when the PPU is implemented in hardware.
216 M. Cococcioni et al.
has a fast and efficient L1 approximation when using Posits with 0 exponent bits
[9] (FastSigmoid). In order to exploit a similar trick for the hyperbolic tangent, we
first introduced the scaled sigmoid function:
Particularly interesting is the case k = 2, when the scaled sigmoid coincides with
the hyperbolic tangent:
\[ \mathrm{sSigmoid}_2(x) = \frac{e^{2x} - 1}{e^{2x} + 1} = \tanh(x) \tag{25.2} \]
Now that we can express the hyperbolic tangent as a linear function of the sig-
moid one, we must rework the expression in order to provide a fast and efficient
approximation to be used with Posits.
25 A Fast Approximation of the Hyperbolic Tangent … 217
Fig. 25.3 The posit circle when the total number of bits is 5. The hyperbolic tangent uses all the
numbers in [−1, 1], while the sigmoid function only the ones in [0, 1]
We know that Posit properties guarantee that, when using the 0 exponent bits format,
doubling the Posit value and computing its sigmoid approximation are just a matter of
bit manipulations, so they can be obtained efficiently. The subtraction in Eq. (25.1)
does not come with an efficient bit manipulation implementation as-is. In order to
transform it into an L1 operation we have to rewrite it as:
\[ \tanh(x) \approx -\left(1 - 2 \cdot \mathrm{FastSigmoid}(2 \cdot x)\right) \]
Then let us focus on negative values for x only. For these values, the expression 2 ·
FastSigmoid(2·x) is inside the unitary region [0, 1]. Therefore, the L1 1’s complement
can be applied. Finally, the negation is always an L1 operation, thus for all negative
values of x the hyperbolic tangent approximation can be computed as an L1 operation.
Moreover, thanks to the anti-symmetry of the hyperbolic tangent, this approach
can also be extended to positive values. The following is a possible pseudo-code
implementation:
FastTanh(x) → y
  x_n = x > 0 ? -x : x
  s = x > 0
  y_n = neg(compl1(twice(FastSigmoid(twice(x_n)))))
  y = s ? -y_n : y_n
where twice is an L1 operation which computes 2 · x and compl1 is the L1
operation that computes the 1's complement.
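The pseudo-code above can be emulated at the bit level in Python for an 8-bit Posit with 0 exponent bits. This is a sketch under stated assumptions, not the cppPosit implementation: the FastSigmoid step follows the bit trick described in [9] (flip the sign bit, then shift right by two), while the `twice` and `compl1` routines are derived for this example from the es = 0 encoding (with truncation instead of rounding), and the `decode` helper exists only to check the result against `math.tanh`.

```python
import math

N = 8                      # total bits; es = 0 throughout
MASK = (1 << N) - 1
SIGN = 1 << (N - 1)        # sign bit; on its own it is the NaR pattern
ONE = 1 << (N - 2)         # bit pattern of +1.0


def decode(bits):
    """Reference decoder (not an L1 operation): bit pattern -> float."""
    if bits == 0:
        return 0.0
    if bits == SIGN:
        return math.nan
    negative = bool(bits & SIGN)
    mag = (-bits) & MASK if negative else bits
    first = (mag >> (N - 2)) & 1
    i, run = N - 2, 0
    while i >= 0 and ((mag >> i) & 1) == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run
    nfrac = max(i, 0)      # bits left after the regime terminator
    frac = mag & ((1 << nfrac) - 1)
    value = 2.0 ** k * (1 + frac / (1 << nfrac))
    return -value if negative else value


def neg(bits):
    return (-bits) & MASK  # two's complement negation


def twice(bits):
    """Doubling by bit manipulation (assumed derivation, truncating)."""
    if bits in (0, SIGN):
        return bits
    negative = bool(bits & SIGN)
    mag = neg(bits) if negative else bits
    if mag <= ONE >> 1:            # |x| <= 0.5: fixed-point region, plain shift
        out = mag << 1
    elif mag < ONE:                # 0.5 < |x| < 1: regime changes from 01 to 10
        out = mag + (ONE >> 1)
    else:                          # |x| >= 1: regime grows by one bit
        out = (mag >> 1) | ONE
    return neg(out) if negative else out


def fast_sigmoid(bits):
    """Flip the sign bit, then logical shift right by two."""
    return ((bits ^ SIGN) & MASK) >> 2


def compl1(bits):
    """1 - x, valid only for x in [0, 1]."""
    return ONE - bits


def fast_tanh(bits):
    if bits == SIGN:
        return SIGN                        # propagate NaR
    positive = 0 < bits < SIGN
    x_n = neg(bits) if positive else bits  # fold onto x <= 0
    y_n = neg(compl1(twice(fast_sigmoid(twice(x_n)))))
    return neg(y_n) if positive else y_n


# Worst-case deviation from the true tanh over all 8-bit patterns.
worst = max(
    abs(decode(fast_tanh(b)) - math.tanh(decode(b)))
    for b in range(1 << N) if b != SIGN
)
```

Every step of `fast_tanh` uses only integer shifts, additions, and masks, which is the property that makes the function an L1 operation. Inspecting `worst` gives a direct feel for the accuracy cost of the approximation at this bit width.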
Since we are also interested in training neural networks, we also need an efficient
implementation of the hyperbolic tangent derivative:
\[ \frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \]
Table 25.2 Accuracy (%) and inference time (ms) comparison between different activation
functions and different Posit configurations (MNIST and Fashion-MNIST data sets)
Activation      FastTanh (this paper)   True Tanh       FastSigmoid [9]   ReLu
                %       ms              %       ms      %       ms        %       ms
MNIST
Posit16,0       98.5    3.2             98.8    5.28    97.1    3.31      89.0    2
Posit14,0       98.5    2.9             98.8    4.64    97.1    3.09      89.0    1.9
Posit12,0       98.5    2.9             98.8    4.66    97.1    3.04      89.0    1.9
Posit10,0       98.6    2.9             98.7    4.62    96.9    3.08      89.0    1.9
Posit8,0        98.6    3.01            98.4    4.84    94.2    3.01      88.0    1.9
FASHION-MNIST
Posit16,0       89.6    3.4             90.0    5.5     85.2    3.4       85.0    2.1
Posit14,0       89.6    2.9             90.0    5.0     85.2    3.2       85.0    1.9
Posit12,0       89.7    2.9             90.0    5.1     85.2    3.1       85.0    1.9
Posit10,0       89.7    2.9             89.7    5.1     85.1    3.2       85.0    1.9
Posit8,0        89.6    3.1             89.3    5.2     84.3    3.0       84.0    1.9
25.5 Conclusions
Acknowledgements Work partially supported by H2020 European Project EPI (European Proces-
sor Initiative) and by the Italian Ministry of Education and Research (MIUR) in the framework of the
CrossLab project (Departments of Excellence program), granted to the Department of Information
Engineering of the University of Pisa.
References
1. Pedamonti D (2018) Comparison of non-linear activation functions for deep neural networks
on MNIST classification task. arXiv:1804.02763
2. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In:
27th International conference on machine learning (ICML) 2010,
pp 807–814
3. Köster U et al (2017) Flexpoint: an adaptive numerical format for efficient training of deep
neural networks. In: NIPS 2017, pp 1740–1750
4. Popescu V et al (2018) Flexpoint: predictive numerics for deep learning. In: IEEE symposium
on computer arithmetics, 2018
5. “NVIDIA TURING GPU Architecture, graphics reinvented”, White paper n. WP-09183-
001_v01, pp 1–80, 2018
6. Malossi A et al (2018) The transprecision computing paradigm: concept, design, and
applications. In: IEEE DATE 2018, pp 1105–1110
7. Cococcioni M, Rossi F, Ruffaldi E, Saponara S (2019) Novel arithmetics to accelerate machine
learning classifiers in autonomous driving applications. In: IEEE ICECS 2019, Genoa, Italy,
27–29 Nov 2019
8. Cococcioni M, Ruffaldi E, Saponara S (2018) Exploiting posit arithmetic for deep neural
networks in autonomous driving applications. IEEE automotive 2018, pp 1–6
9. Gustafson JL, Yonemoto IT (2017) Beating floating point at its own game: posit arithmetic.
Supercomput Front Innov 4(2):71–86
10. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86(11):2278–2324
11. LeCun Y, Jackel L, Bottou L, Brunot A, Cortes C, Denker J, Drucker H, Guyon I, Muller U,
Sackinger E, Simard P, Vapnik V (1995) Comparison of learning algorithms for handwritten
digit recognition. In: Fogelman F, Gallinari P (eds) International conference on artificial neural
networks, Paris. EC2 and Cie, pp 53–60.
12. Xiao H, Rasul K, Vollgraf R (2017) Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv:1708.07747
Part VI
Sensors and Sensing Electronic Systems
Chapter 26
2-D Acoustic Particle Velocity Sensors
Based on a Commercial Post-CMOS
MEMS Technology
26.1 Introduction
on suspended bridges. Since then, the original device has been developed leading
to commercial compact probes allowing detection of multiple APV components.
Other devices based on multiple micro-wires integrated on the same chip have been
proposed to perform 2-D [7, 8] and 3-D [9] measurements. These sensors have been
fabricated with a dedicated micromachining technology that was not compatible with
the standard IC fabrication processes. The possibility of fabricating APV sensors
with a CMOS process followed by a simple post-processing applied in a research
laboratory has been demonstrated [10, 11]. This technique enabled the integration of
two APV sensors with orthogonal sensitivity axis on the same chip with the possibility
of obtaining a 2-D APV sensor with a programmable directivity [12]. Recently,
an APV sensor fabricated by means of a commercial post-CMOS technology [13]
available to small-medium enterprises has been proposed [14].
This work expands the experiments presented in [14] by combining the sig-
nals of two orthogonal APV sensing structures to obtain a single-chip sensor with
programmable directivity. Comparison of the response of two independent struc-
tures also provides preliminary indication of the matching properties of this novel
fabrication flow.
The basic structure that forms the APV sensors is shown in Fig. 26.1. The device is
formed by two parallel polysilicon wires placed at a micrometric distance (L_gap in
the figure). Each wire is split into three identical segments, supported by U-shaped
silicon dioxide cantilevers, which are suspended into a single deep cavity etched into
the silicon substrate. The wires are self-heated by an electrical current (bias current)
and the heat exchange between them takes place through both conduction and forced
convection. The latter mechanism depends on the local APV, which induces oscillating
temperature variations between the wires. Due to the temperature coefficient
of the wire resistance (TCR ≅ 1 × 10⁻³ K⁻¹), the bias current converts the temperature
variations into an electrical signal (voltage), which is proportional to the APV.
Separation of the wires into smaller segments allows keeping an optimal wire length
and resistance, while increasing stiffness and reducing etching times.
The sensors have been designed using a standard 0.35 μm CMOS process provided
by Austria Micro System, followed by a post-CMOS front-side micromachining step
aimed at etching the cavities. The whole fabrication flow was provided by the CMP
consortium [13]. The sensor dimensions were optimized using an original simulation
approach [11, 14] based on the COMSOL™ environment. In practice, parametric
simulations performed by varying the main dimensions (e.g. the L_gap distance) have
been used to find the configuration that provides the maximum sensitivity.
The designed test chip, 2.88 mm × 2.88 mm in size, included several different
APV sensors. In this work, we have used the two identical APV sensors shown in the
micrograph of Fig. 26.2, indicated by SX and SY. Each of these sensors is formed
by two elemental structures like that in Fig. 26.1, connected to form a Wheatstone
bridge as schematically shown for SY on the right of Fig. 26.2. The connection is made
in such a way that the resistance variations of all wires induced by the APV give
in-phase contributions to the output voltage. SX and SY differ only in their orientation,
which is such that SX and SY are sensitive to the APV component along the X- and
Y-axis, respectively.
Fig. 26.2 Micrograph of the test chip portion where the two APV sensors used in this work (SX
and SY) are located. Sensors SX and SY are sensitive to APV components located along the X and
Y-axis, respectively, indicated below the micrograph. The way the four wires W1–4 that form SY
are connected and the equivalence with a Wheatstone bridge are shown on the right
228 A. Ria et al.
The chips are packaged into 44-pin cases (JLCC44), resulting in overall dimensions
of 24 mm × 24 mm × 8 mm for each sample under test. The readout configuration
is shown in Fig. 26.3. A dc supply voltage (V_H) is applied to one diagonal of both
SX and SY bridges, while the corresponding output signals (V_SX and V_SY) of the
sensors are taken on the other diagonal. Voltage V_H, provided by an LP2951 regulator,
can be varied in the 1.25–12 V range by changing the variable resistor R_V. For
conventional acoustic intensity, the output signals are very small (tens of microvolts),
thus amplification with ultra-low noise instrumentation amplifiers (Analog Devices
AD8421, set to gain = 200) is required. The dc component of the bridge output
voltages is removed by high-pass filters C_F–R_F, with a cut-off frequency of 10 Hz.
The amplified output signals V_OX and V_OY are low-pass filtered (roll-off frequency
15 kHz) by R_H, C_H. A 16-bit digitizer (Picotech PicoScope 4262) is used to acquire
the output voltages.
Signal processing is performed using programs running on a personal computer.
Frequency response measurements are obtained using the stationary wave tube
approach. A rotating sample-holder allows varying the orientation of the sensors
with respect to the direction of the APV, which is parallel to the tube axis.
Fig. 26.4 Overheating temperature versus supply voltage (left) and frequency response, measured
for VH = 7 V (right)
overheating, the higher the temperature oscillations induced by the APV, and hence
the higher the sensitivity. A mismatch between the two sensors is clearly visible,
probably due to a difference in the thermal insulation. The frequency response of the
sensor sensitivity is shown in Fig. 26.4 (right) for V_H = 7 V. The strongly low-pass
characteristic of the response is practically the same for SX and SY, while the former
presents a higher sensitivity, which is consistent with the above-mentioned higher
temperature reached by the wires.
Fig. 26.5 Left: polar plots of the output voltages (V_OX, V_OY) as a function of the sensor orientation,
measured for V_H = 7 V. Right: polar plot of the composite output voltage V_Oθ defined by Eq. (26.1).
Coefficients a and b that appear in Eq. (26.1) are chosen to obtain an angle of maximum sensitivity
of 30°
The directivity of the SX and SY sensors is shown in Fig. 26.5 (left). The typical
figure of eight, deriving from a cosine-like response, is visible. We have synthesized
a response with maximum sensitivity along an arbitrary axis by simply calculating
a linear combination of the output signals V_OX and V_OY, according to:
\[ V_{O\theta} = a\,V_{OX} + b\,V_{OY} \tag{26.1} \]
where θ is the desired direction of maximum sensitivity. The result, for a preferential
sensitivity direction of 30°, is shown in Fig. 26.5 (right).
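The effect of the linear combination can be illustrated with an idealized model. The sketch below assumes two perfectly matched sensors with pure cosine and sine responses and picks a = cos θ and b = sin θ; these coefficient choices and the function names are assumptions made for the example, not values taken from the measurements.

```python
import math


def composite_output(v_ox, v_oy, theta_deg):
    """Linear combination a*V_OX + b*V_OY with a = cos(theta), b = sin(theta)."""
    theta = math.radians(theta_deg)
    return math.cos(theta) * v_ox + math.sin(theta) * v_oy


def ideal_outputs(angle_deg, sensitivity=1.0):
    """Idealized matched sensors: cosine response along X, sine along Y."""
    phi = math.radians(angle_deg)
    return sensitivity * math.cos(phi), sensitivity * math.sin(phi)


# Sweep the source direction and look for the strongest composite response.
angles = range(0, 360, 5)
response = {
    a: abs(composite_output(*ideal_outputs(a), theta_deg=30)) for a in angles
}
best = max(response, key=response.get)
```

Under these assumptions the composite response is |cos(φ − θ)|, i.e. the same figure of eight rotated by θ, with nulls at θ ± 90°. With mismatched sensors, as observed for SX and SY, the coefficients would presumably also have to compensate for the different sensitivities.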
In conclusion, the experimental results confirm that it is possible to program
the axis of maximum sensitivity by simply changing the coefficients of the linear
combination. This goal has been achieved with sensors that did not require completing
the chip fabrication in a research lab, making the development of this kind of device
attractive even for small enterprises. A current limitation of the proposed sensors is the
frequency response, which is considerably worse than that of previous devices developed
with a research-grade post-CMOS procedure [10, 11]. This drawback is tied to the
larger thermal mass of the wires, due to the lower resolution of the commercial
micromachining technology with respect to the research process. Nevertheless, the
sensitivity at frequencies up to 1 kHz is sufficient to encourage experimentation
in sound source selection and localization scenarios. Furthermore, application to
acoustic impedance measurements for low-frequency material characterization can
also be envisioned.
Acknowledgements This research has been supported by DELTATECH Italy and the Italian munic-
ipality of Sogliano al Rubicone (Italy) in the framework of the SIHT (Sogliano Industrial High
Technology) project.
References
1. Ramamohan KN, Comesaña DF, Leus G (2018) Uniaxial acoustic vector sensors for direction-
of-arrival estimation. J Sound Vib 437:276–291
2. Song Y, Li YL, Wong KT (2015) Acoustic direction finding using a pressure sensor and a
uniaxial particle velocity sensor. IEEE Trans Aerosp Electron Syst 51(4):2560–2569
3. Song Y, Wong KT, Li Y (2015) Direction finding using a biaxial particle-velocity sensor. J
Sound Vib 340:354–367
4. Garcia-Bonito J, Elliott SJ (1999) Active cancellation of acoustic pressure and particle velocity
in the near field of a source. J Sound Vib 221(1):85–116
5. Felisberto P, Santos P, Jesus SM (2018) Acoustic pressure and particle velocity for spatial
filtering of bottom arrivals. IEEE J Oceanic Eng 99:1–14
6. De Bree HE, Leussink P, Korthorst T, Jansen H, Lammerink TS, Elwenspoek M (1996) The
μ-flown: a novel device for measuring acoustic flows. Sens Actuators A 54(1–3):552–557
7. Pjetri O, Wiegerink RJ, Lammerink TS, Krijnen GJ (2013) A crossed-wire 2-dimensional
acoustic particle velocity sensor. In: Sensors. IEEE, pp 1–4
8. Pjetri O, Wiegerink RJ, Krijnen GJ (2016) A 2D particle velocity sensor with minimal flow
disturbance. IEEE Sens J 16(24):8706–8714
9. Yntema DR, Van Honschoten JW, Wiegerink RJ, Elwenspoek M (2008) A complete three-
dimensional sound intensity sensor integrated on a single chip. J Micromech Microeng
18:115004
10. Bruschi P, Butti F, Piotto M (2011) CMOS compatible acoustic particle velocity sensors. In:
IEEE Sensors 2011 conference proceedings, Limerick, 28–31 Oct 2011, pp 1405–1408
27.1 Introduction
an array of Ultra-weak Fiber Bragg Gratings (UWFBGs), which are gratings with a
very small reflectivity, in the order of −40 dB or below. They allow multiplexing of
hundreds or even thousands of gratings in a single-mode fiber without cumulative losses
detrimental to the measurement SNR [6, 7]. Schemes using time-division multiplexing
[8] or using laser sweeping and phase unwrapping (a mechanism used to
retrieve the values of the phase beyond the range of the arctangent function) have also
been proposed and demonstrated [9].
Other techniques akin to the ones used for interrogation of standard single-mode
fibers include ones based on distributed phase demodulation employing a table lookup
operation [10] and a 3 × 3 coupler [11], which necessitates duplicate receivers. Others
employing coherent detection or I-Q demodulation can be used, but most of them
involve phase unwrapping operations, which are typically computationally heavy
[12]. Using such algorithms in a distributed sensor would significantly lower the
dynamic performance. In the differentiate-square-multiply (PGC-DMS) algorithm
[13], demodulation computations involve division operations, which are susceptible
to division-by-zero errors. In this work, we propose and experimentally demonstrate
a DAS based on UWFBGs utilizing a homodyne demodulation scheme with delayed
interferometry and direct detection using phase-generated carrier (PGC) demodulation
with the differentiate-cross-multiply (PGC-DCM) [7] algorithm. The method
offers a high SNR, does not require phase unwrapping, is less prone to the
division-by-zero errors that affect PGC-DMS, involves computations readily
implementable with analogue or digital processing and uses a simple direct detection
receiver.
of the modulation depth C = 2.63. Other modulation points introduce errors and,
most importantly, the arctan function requires computationally costly phase unwrap-
ping. The proposed method in this contribution is the PGC-DCM whose schematic
is shown in Fig. 27.1. The computations in the diagram yield:
\[ \frac{d}{dt} s_{DCM}(t) \propto \left[ G H J_2(C) J_1(C) \right] \times \left[ \sin^2\phi(t) + \cos^2\phi(t) \right] \frac{d}{dt}\phi(t). \tag{27.2} \]
After simplification using the trigonometric identity sin²φ + cos²φ = 1, the demodulated phase can be
extracted by integration of both sides of (27.2), yielding:
\[ s_{DCM}(t) \propto G H J_2(C) J_1(C)\, \phi(t). \tag{27.3} \]
Hence, after normalization of (27.3) to handle the scaling factors, the proposed
algorithm can be used to obtain the demodulated phase without the use of costly
two-dimensional phase unwrapping in a distributed system and with reduced errors due
to division-by-zero operations. Note also that PGC-DCM is known to exhibit lower
harmonic distortions than the PGC-arctan algorithm [13].
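The differentiate-cross-multiply step can be illustrated numerically. The sketch below starts from the two quadrature components (proportional to sin φ and cos φ) that remain after carrier mixing and low-pass filtering, so those earlier stages are omitted. The sampling rate, the amplitudes K1 and K2, and the 5 rad test phase are assumptions chosen for the example; only the 2.5 kHz vibration frequency echoes the experiment.

```python
import math


def pgc_dcm(s1, s2, dt, gain=1.0):
    """Differentiate-cross-multiply on two quadrature components.

    s1 ~ K1*sin(phi) and s2 ~ K2*cos(phi); `gain` is the product K1*K2 used
    for normalization. Returns phi(t) - phi(t[1]) for samples 1 .. n-2.
    """
    rate = []
    for i in range(1, len(s1) - 1):
        d1 = (s1[i + 1] - s1[i - 1]) / (2 * dt)   # central differences
        d2 = (s2[i + 1] - s2[i - 1]) / (2 * dt)
        # Cross-multiply and subtract: K1*K2*(cos^2 + sin^2)*dphi/dt
        rate.append((d1 * s2[i] - s1[i] * d2) / gain)
    phase = [0.0]
    for a, b in zip(rate, rate[1:]):              # trapezoidal integration
        phase.append(phase[-1] + 0.5 * (a + b) * dt)
    return phase


FS_HZ = 1_000_000          # assumed sampling rate for this illustration
DT = 1.0 / FS_HZ
F_VIB_HZ = 2500.0          # vibration frequency, as in the experiment
AMPLITUDE_RAD = 5.0        # assumed; deliberately larger than pi
K1, K2 = 0.7, 0.4          # assumed amplitudes of the two components

t = [i * DT for i in range(2000)]
phi = [AMPLITUDE_RAD * math.sin(2 * math.pi * F_VIB_HZ * ti) for ti in t]
s1 = [K1 * math.sin(p) for p in phi]
s2 = [K2 * math.cos(p) for p in phi]

recovered = pgc_dcm(s1, s2, DT, gain=K1 * K2)
reference = [p - phi[1] for p in phi[1:-1]]
max_err = max(abs(r - q) for r, q in zip(recovered, reference))
```

The test phase swings over about 10 rad, well beyond the range of the arctangent, yet no unwrapping step appears anywhere in the routine and no division by a signal-dependent quantity is needed.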
The experimental setup used to validate the proposed technique is shown in
Fig. 27.2. As shown, light from a narrowband laser of 200-kHz linewidth is amplified
using an Erbium-Doped Fiber Amplifier (EDFA) and filtered using an Optical Band-
pass Filter (OBPF) before being gated with an Acousto-optic Modulator (AOM) to
generate the interrogating pulses. After another round of amplification and filtering,
the pulses are sent through a three-port optical circulator into the fiber under test
Fig. 27.2 Experimental setup of proposed φ-OTDR: Erbium-doped fiber amplifier (EDFA); opti-
cal band-pass filter (OBPF); acousto-optic modulator (AOM), digital acquisition (DAQ); phase
modulator (PM); piezoelectric actuator (PZT)
236 Y. Muanenda et al.
(FUT), which is composed of 200 UWFBGs, each with a reflectivity of ~−43 dB and an
FWHM of ~3.4 nm, spaced 5 m apart along a 1-km standard single-mode fiber.
The fiber between the two gratings at the end of the FUT is wound around a
piezoelectric (PZT) actuator, to which controlled vibrations are applied using
a voltage amplifier driven by a waveform generator. The backscattering from the
FUT is collected at the return port of the circulator, which feeds the unbalanced
interferometer, where a delay is applied in one arm and the phase of the light in
the other one is modulated using the Phase Modulator (PM). The beating is then
detected using a direct detection receiver with a simple pin photodiode of 125 MHz
bandwidth and acquired using a real-time digital acquisition system for processing
using the PGC-DCM technique.
The first set of measurements involved the observation of raw back-reflection traces
from the UWFBG array when pulses of 20 ns were sent into the sensing fiber.
Since the aim is to observe the visibility of back-reflections, this has been done by
disconnecting the arm of the interferometer having the PM. A sample set of 500 raw
traces is depicted in Fig. 27.3a and b which show that the raw traces from the gratings
have consistent and enhanced visibility suitable for measurement with a high SNR.
This is also confirmed with a comparison of the traces with a “singlemode” oper-
ation of the FUT, which was done by shifting the emission wavelength of the laser
away from the passband of the UWFBGs. The traces given in Fig. 27.4 show the
backscattering from 20-ns pulses for both cases and 120-ns ones only for standard
“singlemode” operation. (Note that using a 120-ns pulse with the UWFBG array
would have resulted in the interference of the backscattering from adjacent grat-
ings.) As shown, both at the near- and far-end, the intensity of the response from the
singlemode operation is close to the noise floor and does not enable distributed
measurements at the end of the fiber, while the UWFBGs exhibit higher visibility even
for narrow pulses.
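The remark on the 120-ns pulse can be checked with a short calculation. The sketch below assumes the usual OTDR relation between pulse width and the fiber length it occupies, c·τ/(2n), and a group index of 1.468, which is a typical value for standard singlemode fiber and not a figure taken from this chapter; the function name is illustrative.

```python
# Length of fiber occupied by a probe pulse in a reflectometric measurement.
# Assumption: group index 1.468 (typical for standard singlemode fiber).
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s


def pulse_spatial_extent(pulse_width_s, group_index=1.468):
    """Return c * tau / (2 * n), the round-trip-corrected pulse length in metres."""
    return C_VACUUM * pulse_width_s / (2.0 * group_index)


GRATING_SPACING_M = 5.0  # UWFBG spacing stated in the text

for width_ns in (20, 120):
    extent = pulse_spatial_extent(width_ns * 1e-9)
    overlaps = extent > GRATING_SPACING_M
    print(f"{width_ns} ns pulse covers about {extent:.2f} m; "
          f"spans adjacent gratings: {overlaps}")
```

Under these assumptions a 20-ns pulse covers roughly 2 m, less than the 5-m grating spacing, whereas a 120-ns pulse covers roughly 12 m and would therefore mix the reflections of neighbouring gratings.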
Subsequently, the full setup was used by connecting the PM and 10,000 traces
were acquired for a segment of the fiber at the far-end, for a total duration of 200 ms,
to observe the evolution of the traces in the presence of the overlap and beating
from the delayed interferometer. In this case, upon photodetection, a clear beating
of the interfering fields is observed on the oscilloscope, which is also verified in the
acquired traces.
A sample measurement done when a phase modulation of 10 kHz and vibration
of 2.5 kHz is applied to the PZT is depicted in Fig. 27.5, both in time and frequency
domain. As shown, the two intermediate components centered at the modulating
angular frequency ω0 and its double 2ω0 are seen with a spacing of 2.5 kHz, consistent
with the spectrum of a typical phase modulated signal.
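The spectral structure described above can be reproduced with a small numerical sketch. It assumes the standard phase-generated-carrier interference model I(t) = cos(C cos ω0t + φ(t)) with φ(t) = φ0 + D cos ωvt. The modulation depth C, the vibration-induced amplitude D, the bias φ0 and the sample rate are illustrative values, not parameters of the experiment; only the 10 kHz carrier and the 2.5 kHz vibration come from the text.

```python
import cmath
import math

FS = 100_000.0   # sample rate in Hz (assumed)
N = 4000         # 40 ms window, i.e. 25 Hz frequency bins (assumed)
F0 = 10_000.0    # phase-modulation frequency from the text
FV = 2_500.0     # PZT vibration frequency from the text


def pgc_signal(c_depth=2.37, d_amp=0.5, phi0=math.pi / 4):
    """Sampled interference term cos(C cos(w0 t) + D cos(wv t) + phi0)."""
    samples = []
    for k in range(N):
        t = k / FS
        phase = (c_depth * math.cos(2 * math.pi * F0 * t)
                 + d_amp * math.cos(2 * math.pi * FV * t) + phi0)
        samples.append(math.cos(phase))
    return samples


def tone_magnitude(x, freq_hz):
    """Magnitude of the single DFT component of x at freq_hz."""
    acc = sum(xk * cmath.exp(-2j * math.pi * freq_hz * k / FS)
              for k, xk in enumerate(x))
    return abs(acc) / len(x)


signal = pgc_signal()
for f_khz in (7.5, 10.0, 12.5, 17.5, 20.0, 22.5, 9.0):
    print(f"{f_khz:5.1f} kHz: {tone_magnitude(signal, f_khz * 1e3):.4f}")
```

With these assumed values, sidebands appear at 2.5 kHz on either side of 10 kHz and 20 kHz, while a frequency that is not a combination of the two tones (9 kHz here) carries essentially no power.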
Subsequently, the PGC-DCM algorithm was used to obtain the demodulated phase
from the intermediate components as shown in Fig. 27.6. It can be observed in
27 A High-SNR Distributed Acoustic Sensor Based … 237
Fig. 27.3 a Overlapped raw backscattering traces showing high visibility and consistent reflection
from ultra-weak gratings b traces at the far end showing capacity to measure with high SNR
(both panels: amplitude (a.u.) versus distance (m))
Fig. 27.6a that there is consistent demodulation with the proposed technique even
at the zero crossing points of the intermediate signals along the whole 200-ms dura-
tion of the 2.5-kHz signal (500 cycles). In addition, before high-pass filtering, the
demodulated phase exhibits slow variations due to environmental drifts, which are
perturbations of interest in many structural health monitoring applications.
The demodulated and high-pass filtered phase response depicted in Fig. 27.6b has
an SNR of ~34 dB, thanks to the use of UWFBGs. Note that the trace for singlemode
operation using 20 ns probing pulses does not enable vibration measurement as the
intensity at the far-end is equal to the noise floor as shown in the plot in Fig. 27.4b.
Fig. 27.4 Comparison of the backscattering traces from the UWFBG array with a singlemode
operation at the a near-end, and b far-end of the responses, confirming significantly higher visibility
due to gratings (legend: SMF 20 ns, SMF 120 ns, UWFBG 20 ns; vertical axis: amplitude (a.u.))
27.4 Conclusions
Fig. 27.5 Observed beating from the delayed interferometer at the point of the PZT: a time domain
evolution (amplitude, a.u.), and intermediate components b centered at ω0 and c centered at 2ω0
(power (dB) versus frequency (kHz))
phase retrieval for quantitative measurements are involved. It is also robust against
errors due to division-by-zero operations. In addition, the method is scalable as it
involves integration and differentiation operations which, thanks to the ubiquity of
systems for performing fractional order calculus, can be realized with digital signal
processing schemes based on FPGAs or analog ones using operational amplifiers,
both of which are also candidates for small-scale integration. Hence, the proposed
Fig. 27.6 a Intermediate components and demodulated phase before and after high pass filtering
showing consistent phase retrieval at zero-crossing points (phase (a.u.) versus time (ms)). b Spectrum
of the demodulated phase change induced by 2.5 kHz vibration of a PZT actuator
scheme offers a high-SNR DAS based on a scalable and consistent homodyne phase
demodulation technique suitable for distributed dynamic measurements.
References
1. Muanenda Y, Oton CJ, Di Pasquale F (2019) Application of Raman and Brillouin scattering
phenomena in distributed optical fiber sensing. Front Phys 7:155
2. Muanenda Y (2018) Recent advances in distributed acoustic sensing based on phase-sensitive
optical time domain reflectometry. J Sens 2018:3897873
3. Muanenda Y, Oton CJ, Faralli S, Di Pasquale F (2015) High performance distributed acoustic
sensor using cyclic pulse coding in a direct detection coherent-OTDR. In: Proceedings of 5th
Asia-Pacific Optical Sensors Conference, pp 965547 (SPIE)
28.1 Introduction
Ionic currents across membranes are crucial in both excitable and non-excitable
cells; their accurate measurement requires efficient coupling between the cell membrane
and the measuring electrodes. An elective (though somewhat ‘classic’) approach
P. Piedimonte
Basic and Applied Sciences for Engineering Department, Sapienza University of Rome, Rome,
Italy
D. A. M. Feyen · M. Mercola
Medicine and Cardiovascular Institute Department, Stanford University School of Medicine,
Palo Alto, California, USA
E. Messina
Policlinico Umberto I, Sapienza, Rome, Italy
M. Renzi
Physiology and Pharmacology Department, Sapienza University of Rome, 00184 Rome, Italy
F. Palma (B)
Electronics and Telecommunications Engineering Department, Sapienza University of Rome,
Rome, Italy
e-mail: [email protected]
Our preliminary results [13–15] show that different cell types have unaltered morphology
and functional properties when seeded on SiNWs compared with the control
condition.
Current-clamp recordings from NG18CC15 neural-like cells on SiNWs showed
passive properties and an excitability profile typical of cells on control substrates.
Likewise, voltage-clamp experiments on BV-2 cells revealed the same membrane current
profile and density across different substrates.
We also tested BV-2 cells grown on SiNWs (both silicon and silicon oxide sub-
strates) using Ca2+ epifluorescence imaging and found that both basal intracellular
[Ca2+ ] and ATP-elicited [Ca2+ ]i rise were typical of BV-2 cells in physiological
conditions.
Primary hippocampal neurons and microglial cells from mice also showed bio-
compatibility with SiNWs as shown by immunofluorescence.
To further broaden our characterization of SiNW biocompatibility we tested iPS-
derived human cardiomyocytes (at Stanford University School of Medicine, CA).
iPS-derived cardiomyocytes developed normally on SiNW substrates and showed
good adhesion to the seeding surface. Notably, we could record typical calcium
fluxes (not shown) and contractile activity from the iPSC-CMs grown on the
nanowires (Fig. 28.3), indicating that SiNWs are compatible with regular growth
and physiological behavior of these human-derived cells.
Besides the functional studies on biocompatibility, the quality of the cell
membrane/nanostructure interface is crucial for the design of bio-devices. Recent findings in the
fabrication of artificial bio-interfaces have added much to our understanding of this
interface; however, the question of how the cell membrane accommodates
the presence of sharp objects at the nano-scale is still open.
To address this, we investigated the morphology of the cell/SiNWs interface at
high-resolution using SEM on fixed cultures grown for 48 h on SiNWs in standard
Fig. 28.3 Beating cardiomyocytes before (left) and after (right) contraction (Scale bars: 20 µm)
248 P. Piedimonte et al.
Fig. 28.4 SEM images of GH4C1 neuron-like cell cultures on silicon nanostructures. (Scale bars:
a 1 µm, b 200 nm, c 50 nm)
conditions. Figure 28.4 depicts SEM images of GH4C1 neuron-like cells and Fig. 28.5 those of BV-2
microglial cells on silicon nanowires.
First, we noticed that the standard culturing and fixation procedures did not
affect the SiNWs, indicating that intact nanostructures were
present and preserved also during our functional recordings. In fact, both individual
cells and SiNWs could be readily and clearly identified using SEM and appeared
unaffected. Furthermore, the overall cell morphology appeared unaltered and the
cellular membrane is shown to interact very closely and tightly with the engineered
substrates. Our SEM images show that, independently of the nanostructure size and
orientation, the cellular membrane tends to grip and engulf the substrate along its
full profile, including the sharp edge at the nanowire base (Fig. 28.5).
Altogether, silicon nanowires interact tightly with cell membranes and do
not appear to alter the normal survival, morphology and functional properties of several
cell types in vitro, which makes them suitable for non-interfering biological measurements
and conditioning.
Fig. 28.5 SEM images of BV-2 microglial cell cultures on silicon nanowires. (Scale bars: a 1 µm,
b 200 nm, c 50 nm)
28 Silicon Nanowires as Contact Between the Cell Membrane … 249
28.4 Conclusions
Acknowledgements Authors wish to thank LFoundry for the information on the technology
LF11iS-BSI.
References
1. Sakmann B, Neher E (2009) Single-Channel Recording, 2nd edn. Springer, New York
2. Pine J (1980) Recording action potentials from cultured neurons with extracellular microcircuit
electrodes. J Neurosci Methods 2:19–31 [PubMed: 7329089]
3. Lambacher A et al (2004) Electrical imaging of neuronal activity by multi-transistor-array.
Appl Phys Mater Sci Process 79:1607–1611
4. Zheng W, Spencer RH, Kiss L (2004) High throughput assay technologies for ion channel drug
discovery. Assay Drug Dev Technol 2:543–552 [PubMed: 15671652]
5. Herron TJ, Lee P, Jalife J (2012) Optical imaging of voltage and calcium in cardiac cells &
tissues. J Circ Res 110:609–623
6. Cheng H, Lederer WJ, Cannell MB (1993) Calcium sparks: elementary events underlying
excitation-contraction coupling in heart muscle. Science 262:740–744
7. Matiukas A et al (2007) Near-infrared voltage-sensitive fluorescent dyes optimized for optical
mapping in blood-perfused myocardium. Heart Rhythm 4:1441–1451
8. Fromherz P (2002) Electrical interfacing of nerve cells and semiconductor chips.
ChemPhysChem 3:276–284
9. Timko BP et al (2009) Electrical recording from hearts with flexible nanowire device arrays.
Nano Lett 9:914–918 [PubMed: 19170614]
10. Duan X et al (2011) Intracellular recordings of action potentials by an extracellular nanoscale
field-effect transistor. Nat Nanotechnol
11. Xie C, Lin Z, Hanson L, Cui Y, Cui B (2012) Intracellular recording of action potentials by
nanopillar electroporation. Nat Nanotechnol 7(3):185
12. Patent filed on the 22/09/2017, Ref code: IT0549-17- UNIVERSITÀ DEGLI STUDI DI
ROMA-CB)
13. Piedimonte P et al (2019) Silicon nanowires to detect electric signals from living cells. Mater
Res Express 6(8)
29.1 Introduction
A. Bertacchini (B)
DISMI—Department of Sciences and Methods for Engineering,
University of Modena and Reggio Emilia, 42122 Reggio Emilia, Italy
e-mail: [email protected]
M. Lasagni · G. Sereni
IndioTECH srl, via Roma 4, 42014 Castellarano, Reggio Emilia, Italy
e-mail: [email protected]
G. Sereni
e-mail: [email protected]
is necessary and they are sensitive to the properties of dielectrics placed in the mea-
suring gap. Conversely, Eddy-Current Sensors are contactless devices operating on
the principle of magnetic induction, and they can precisely measure the position (dis-
placement or proximity), x, of a metallic target in contaminated environments (e.g.
dust, oil particles, etc.) and also through non-metallic materials such as plastics, dirt,
etc. The main drawback of these sensors is the thermal drift that under uncontrolled
conditions can affect the measurement.
Moreover, as mentioned previously, Eddy-Current Sensors are widely used in
industrial applications not only for direct displacement/proximity measurements but
also to estimate material properties and to inspect for fatigue cracks [1–4], bearing
wear [5], thicknesses [6, 7], etc.
Different methods have been developed to enhance the resolution and reduce the
thermal drift, such as measuring the working frequency, precise amplitude demodu-
lation, etc. However, none of these methods can ensure ultra-low power performance.
In fact, many works [8–11] show that the continuous power consumption is not
lower than 5 mW. This amount of power can usually be considered low in many
applications, but in many others where the sensing device is battery-powered (e.g.
wireless sensor nodes) it leads to an unacceptably frequent battery replacement. The
purpose of this work is to demonstrate that it is possible to obtain a low-cost ECDS with
10 µm resolution and a power consumption in the order of tens of microwatts. In
order to eliminate any thermal drift issue, in this first implementation all the
measurements have been carried out at constant temperature. In future work,
once a complete smart sensor device is realized by adding an ultra-low power
wireless microcontroller, a temperature compensation algorithm can be included.
The operating principle of a typical ECS is based on magnetic induction. The main
components of the sensor are a conductive target, a sensing coil and an electronic
circuit interface, as sketched in Fig. 29.1 left. The sensor coil, driven by an ad hoc
AC current, generates an alternating magnetic field, which links with the
nearby conductive target, inducing Eddy Currents. In turn, the Eddy Currents generate
a magnetic field, which is opposite to the one generated by the coil. This causes a
magnetic flux reduction and energy dissipation in the sensor coil.
With reference to Fig. 29.1 center, the coil-target air coupling can be considered
as an equivalent transformer. The primary of the transformer is the sensing coil,
which comprises the inductor Lx and the series resistor Rx. The secondary of the
29 Ultra-Low Power Displacement Sensor 253
Fig. 29.1 Eddy current sensing system: simplified operating principle of the sensor (left),
transformer model and equivalent circuit (center) and implemented electronic circuit interface
(right)
All the tests have been carried out using the setup sketched in Fig. 29.2 left.
The implemented Electronic Interface Circuit (EIC) has been fixed by means of a
mounting bracket to the fixed end of a commercial manual outside micrometer, while
the target has been positioned on its own mounting bracket fixed to the moving end
of the micrometer. In this way, the distance between target and sensing coil can
be varied in micrometer steps in a repeatable way. The target has been realized using
common FR4 for PCB production with a 35 µm copper layer and an area of 15 ×
15 mm, larger than the diameter of the inductor used as sensing coil. The EIC’s output
voltage, VOUT, has been measured by means of an Agilent DSO9254A oscilloscope,
while the EIC’s supply voltage has been provided by an Agilent N6715B DC Power
Analyzer, which also allows the power consumption of the sensor to be measured precisely.
In order to limit any thermal drift issues, all the tests have been carried out at the
same temperature (T room = 23 °C).
Figure 29.3 shows an example of VOUT over time for a given dynamic micrometric
displacement profile of the target with respect to the coil, for an EIC supply
voltage of 100 mV. The same behavior over time under the same dynamic displacement
profile has been obtained for different EIC supply voltages in the range
75–200 mV. In particular, Fig. 29.3 shows how VOUT changes in response to different
displacements in the range 0–5000 µm. As discussed previously, when the target
is close to the sensing coil, the oscillation amplitude decreases due to the larger power
losses, with a consequent decrease of VOUT. Vice versa, moving the target away
from the coil increases VOUT. Results show good linearity in the range 0–3000 µm
(Fig. 29.3 right).
The sensor’s sensitivity S, expressed in [mV/µm], can be defined as S = ΔV_n/Δx_n
by discretizing the whole measurement range Δx into n sub-ranges. In the same way,
the resolution R, expressed in [µm], can be defined as R = N_n/S, where N_n is the
Fig. 29.2 Measurement setup. Simplified sketch with mounting brackets omitted (left) and real
setup (right). By rotating the micrometer’s knob it is possible to change the distance between
sensing coil and target in a controlled and repeatable way
Fig. 29.3 Example of Sensor Output Voltage, V OUT , over time during micrometric target’s dis-
placements x with respect to the coil (left) and V OUT versus. x (right). In the example the EIC’s is
supplied with V IN = 100 mV, corresponding to an overall power consumption of PIN = 28 µW
RMS voltage noise of the signal in the n-th sub-range, ΔV_n is the output voltage
variation in the n-th sub-range and Δx_n is the width of the considered sub-range.
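The two definitions above translate directly into a per-sub-range computation. The following sketch is only an illustration: the function name is hypothetical and the numbers in the usage example are placeholders chosen for readability, not measured data from this work.

```python
def sensitivity_and_resolution(x_um, v_mv, noise_rms_mv):
    """Per-sub-range sensitivity S (mV/um) and resolution R (um).

    x_um:         sub-range boundaries, n + 1 values
    v_mv:         output voltage at each boundary, n + 1 values
    noise_rms_mv: RMS output noise in each sub-range, n values
    """
    results = []
    for n in range(len(x_um) - 1):
        delta_x = x_um[n + 1] - x_um[n]      # sub-range width
        delta_v = v_mv[n + 1] - v_mv[n]      # output variation over it
        s = delta_v / delta_x                # S = dV_n / dx_n
        r = noise_rms_mv[n] / abs(s)         # R = N_n / S
        results.append((s, r))
    return results


# Placeholder values for two sub-ranges (not measurements).
example = sensitivity_and_resolution(
    x_um=[0.0, 500.0, 1000.0],
    v_mv=[10.0, 35.0, 55.0],
    noise_rms_mv=[0.3, 0.3],
)
for index, (s, r) in enumerate(example):
    print(f"sub-range {index}: S = {s:.3f} mV/um, R = {r:.1f} um")
```

For the same noise level, the sub-range with the larger S yields the smaller R, which is the trend discussed for Fig. 29.4.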
As shown in Fig. 29.4, the larger S is, the better (i.e. the smaller) the resolution R of the
sensor. Over the considered measurement range Δx [0–5000 µm], R is better than 10 µm for
distances from 0 to 3 mm. In particular, in the displacement sub-range Δx_n =
0–500 µm, the resolution improves to 6 µm.
Figure 29.5 shows the relationship between R and the power consumption of the
sensor, PIN, for different supply voltages, VIN, in the sub-range Δx_n = 0–500 µm,
which is the most interesting one for many industrial applications. As can be
noted, an increase of VIN, and hence of PIN, improves the achievable resolution. In
particular, with VIN = 200 mV, PIN rises to 140 µW, but the resolution improves
to 3 µm. For the considered sub-range, an acceptably linear relationship between R and PIN
is obtained for VIN in the range 100–200 mV. Similar results can be obtained for the
other considered Δx_n sub-ranges.
Fig. 29.4 Voltage-referred Sensitivity (left) and Resolution (right) with respect to the Δx n in case
of EIC’s fed by a 100 mV supply voltage corresponding to a power consumption of only 28 µW
256 A. Bertacchini et al.
Fig. 29.5 Power consumption PIN (left) and resolution R (right) of the realized electronic interface
circuit for different supply voltages V IN in the Δx n = 0–500 µm sub-range
29.4 Conclusions
References
1. Johnston DP, Buck JA, Underhill PR, Morelli JE, Krause TW (2018) Pulsed eddy-current
detection of loose parts in steam generators. IEEE Sens J 18(6):2506–2512
2. Alatawneh N, Underhill PR, Krause TW (2018) Low-frequency eddy current testing for
detection of subsurface cracks in CF-188 stub flange. IEEE Sens J 18(4):1568–1575
3. Stott CA, Underhill PR, Babbar VK, Krause TW (2018) Pulsed eddy current detection of cracks
in multilayer aluminum lap joints. IEEE Sens J 15(2):956–962
4. Pereira D, Clarke TGR (2015) Modeling and design optimization of an eddy current sensor for
superficial and subsuperficial crack detection in inconel claddings. IEEE Sens J 15(2):1287–
1292
5. Yamaguchi T, Ueda M (2007) An active sensor for monitoring bearing wear by means of an
eddy current displacement sensor. Meas Sci Technol 18(1):311–317
6. Cheng W (2017) Thickness measurement of metal plates using swept frequency eddy current
testing and impedance normalization. IEEE Sens J 17(14):4558–4569
7. Li W, Ye Y, Zhang K, Feng ZH (2017) A thickness measurement system for metal films based
on eddy-current method with phase detection. IEEE Trans Ind Electron 64(5):3940–3949
8. Nabavi MR, Nihtianov S (2011) Eddy-current sensor interface for advanced industrial
applications. IEEE Trans Ind Electron 58(9):4414–4423
9. Nabavi MR, Pertijs MAP, Nihtianov S (2013) An interface for eddy-current displacement
sensors with 15-bit resolution and 20 MHz excitation. IEEE J Solid-State Circ 48(11)
10. Wang H, Liu H, Li W, Feng Z (2014) Design of ultrastable and high resolution eddy-current
displacement sensor system. In: IECON 2014 annual conference of the IEEE, pp 2333–2339
11. Welsby SD, Hitz T (1997) True position measurement with eddy current technology. Sens J
Appl Sens Tech 14(11):30–41
12. Nabavi MR, Nihtianov SN (2012) Design strategies for eddy-current displacement sensor
systems: review and recommendations. IEEE Sens J 12(12):3346–3355
Chapter 30
Simulation of an Optical-to-Digital
Converter for High Frequency FBG
Interrogator
Abstract This paper presents the design and simulation of an optoelectronic circuit that
converts the optical signal coming from an interrogation system for FBG sensors
into a digital signal. The discussion is divided into an optical introduction to
the interrogation system, an analog section and, finally, digital considerations.
The analog processing part is mainly based on a double-stage transimpedance
amplifier designed to obtain, in the working conditions, the required high gain and
wide bandwidth. The output voltage of the analog section is then converted to
digital by a 12-bit ADC and sent to an FPGA, which runs the defined algorithm in
order to obtain the needed linear optical-to-electrical conversion. Circuit
simulations are performed, together with considerations on digital stability and on
the stability to optical power variability obtained from the numerically simulated
interrogation system, highlighting the peculiarities of this new type of
high-frequency FBG interrogator.
30.1 Introduction
In the last decades, Fiber Bragg Grating (FBG) sensors have been studied and
employed in several environments owing to the many advantages of this
kind of optical sensor, the most important being low cost, small size, immunity to
electromagnetic interference, biocompatibility and the absence of toxic materials [1].
Combined with an interrogator, an FBG measurement system becomes a reliable and
accurate sensing method for temperature and mechanical strain in a huge variety of
fields: from minimally invasive microsurgery, in which FBG sensors are employed
as force sensors for the surgeon [2], to the automotive field, where FBG sensors are
glued inside a tyre to monitor circumferential and longitudinal strain [3, 4]. Despite all
these advantages, the high cost of interrogation systems limits the usage of FBG
sensors: the higher the required accuracy, the more expensive
the interrogator. Furthermore, the cost also depends on the type of variation (Fig. 30.1):
very fast variations, as in impact damage detection or hydrophones, can be captured
with an interrogation system that is optically and electrically more complex, and
definitely more expensive, than a normal interrogator employed to monitor the temperature
or vibration of tall buildings or bridges.
Many schemes for wavelength interrogation are reported in the literature, with
different detection algorithms based on Fabry-Pérot or Mach-Zehnder interferometers [5,
6], spectroscopic charge-coupled devices (CCD), or the power ratio between
optical filters [7, 8]. Nevertheless, none of them is satisfactory enough to be appealing on
the market.
In this paper a high-frequency wavelength interrogation concept is presented
and analyzed, focusing on the electronic circuit and its characteristics. The system
is briefly described from the analytical point of view to explain its working
principle and optical characteristics; circuit simulations then follow, focusing
on the device choice and on stability considerations.
The FBG works as an optical filter with a very narrow band whose central wavelength,
called the Bragg wavelength, depends on the modulation period of the
refractive index created inside the fiber core and on the refractive index of the mode
propagating inside the core. The Bragg wavelength is very sensitive to temperature and
strain variations, which produce a wavelength shift that has to be detected by the
interrogator. The proposed interrogation concept is depicted in Fig. 30.2: a broadband
source generates the light sent towards an FBG sensor through an optical
circulator. The FBG reflects part of the signal, which becomes, again through the
optical circulator, the input signal of the Arrayed Waveguide Grating (AWG), whose
working principle is to separate a polychromatic spectrum into many output channels
depending on the wavelength, like an integrated prism. Thanks to the AWG, when the
Bragg wavelength shifts, the reflected power moves across the channels and becomes very
easy to detect. Every AWG channel, four in this case study, is connected to a photodiode
that transduces the signal from optical to electrical. The signal is then converted from
current to voltage by a transimpedance amplifier (TIA) and digitized by an
Analog-to-Digital Converter (ADC), ready to be read by an FPGA that performs the
interrogation algorithm and detects the wavelength deviation.
Assuming an apodized FBG, the side lobes are eliminated and the
reflectance spectrum can be approximated with a Gaussian shape [9]. The transmittance
of every AWG channel can be approximated as Gaussian as well. The output
signal of the generic m-th AWG channel can be calculated by integrating, over the whole
wavelength spectrum, the product of the FBG reflectance B(λ), the AWG transmittance A_m(λ)
and a term that accounts for the light source spectrum S(λ):
$$I_m(\lambda_{FBG}) = \int A_m(\lambda)\, B(\lambda)\, S(\lambda)\, d\lambda \qquad (1)$$
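The overlap integral can be evaluated numerically once the three spectra are specified. The sketch below is a minimal illustration under stated assumptions: Gaussian shapes for A_m(λ) and B(λ) as in the text, a flat source S(λ) = 1, and hypothetical channel centers and FWHM values that are not taken from this chapter.

```python
import math

# Hypothetical AWG channel centers in nm (four channels, as in the case study).
CHANNEL_CENTERS_NM = (1552.0, 1552.8, 1553.6, 1554.4)


def gaussian(lam, center, fwhm):
    """Unit-peak Gaussian with the given full width at half maximum."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((lam - center) / sigma) ** 2)


def channel_output(lam_fbg, channel_center, fbg_fwhm=0.2, awg_fwhm=0.8,
                   lam_min=1548.0, lam_max=1558.0, steps=4000):
    """Trapezoidal estimate of the integral of A_m * B * S over wavelength."""
    d_lam = (lam_max - lam_min) / steps
    total = 0.0
    for k in range(steps + 1):
        lam = lam_min + k * d_lam
        weight = 0.5 if k in (0, steps) else 1.0
        source = 1.0  # flat source spectrum (assumption)
        total += (weight * gaussian(lam, channel_center, awg_fwhm)
                  * gaussian(lam, lam_fbg, fbg_fwhm) * source)
    return total * d_lam


def all_channels(lam_fbg):
    """Output of every channel for a given Bragg wavelength."""
    return [channel_output(lam_fbg, c) for c in CHANNEL_CENTERS_NM]


for lam_b in (1552.8, 1553.2):
    outputs = ", ".join(f"{value:.4f}" for value in all_channels(lam_b))
    print(f"Bragg wavelength {lam_b} nm -> channel outputs: {outputs}")
```

When the Bragg wavelength coincides with a channel center that channel dominates, and when it lies midway between two centers the two neighbouring channels receive equal power, which is what makes the shift easy to detect from the channel ratios.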
[Fig. 30.2: block diagram with labelled elements BBS, CIRC, FBG, AWG, PD, TIA, ADC and FPGA (output λB), split into optical and electrical sections]
30.3.1 Photodiode
Fig. 30.3 Proposed ODC circuit, whose aim is to convert the optical signal into a digital one
(labelled elements: PD, Rs, Cs, TIA 1, TIA 2, Rf, Cf, R1, R2, R3, C1, C, ADC, FPGA)
is lower, increasing the speed, but the signal-to-noise ratio (SNR) might be
unacceptable. The solution is to feed the photodiode’s output current into the summing
point of a transimpedance amplifier. The response time then no longer depends on the
photodiode parasitic capacitance, which allows a large resistor to be used for high gain,
improving the SNR too.
The proposed ODC is shown in Fig. 30.3: a two-stage design employing low-noise, low
input current and low input capacitance operational amplifiers is used, since high gain
and wide bandwidth are required. The gain can be calculated as:
$$\text{Gain (V/W)} = R_f \left(1 + \frac{R_2}{R_1}\right) R_\lambda = 121\,\text{k} \qquad (30.3)$$
where Rλ represents the responsivity of the photodiode. This value comes from the consideration
that the AWG output optical power has to be converted into a voltage that matches
the ADC input characteristics. The two stages consist of a first transimpedance
pre-amplifier in inverting configuration for current-to-voltage conversion, followed by a
non-inverting amplifier for the remaining voltage amplification. The first-stage gain is directly
determined by Rf; the second-stage gain is determined by R1 and R2. An issue is
instability, which can affect even a simple operational amplifier configuration
when the delay created by the amplifier’s input capacitance interacts with the feedback
resistance. This can be avoided by moving the resulting pole to a higher frequency or
by cancelling it with a zero. The best solution is to connect a feedback capacitor Cf in
parallel with Rf, limiting the frequency response and avoiding gain peaking that can
lead to overshoot.
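Equation (30.3) can be checked numerically. The component values in this sketch are hypothetical: they are merely one combination that reproduces the stated overall gain of 121 k and are not the values used in the simulated circuit, which are not listed in this section.

```python
def odc_gain_v_per_w(r_f, r_1, r_2, responsivity):
    """Overall gain Rf * (1 + R2/R1) * R_lambda, in volts per watt."""
    first_stage = r_f                    # transimpedance stage, V/A
    second_stage = 1.0 + r_2 / r_1       # non-inverting stage, V/V
    return first_stage * second_stage * responsivity


# Hypothetical values, chosen only to reproduce the stated 121 k overall gain.
R_F = 11e3            # ohm
R_1 = 1e3             # ohm
R_2 = 10e3            # ohm
RESPONSIVITY = 1.0    # A/W (assumed)

gain = odc_gain_v_per_w(R_F, R_1, R_2, RESPONSIVITY)
print(f"overall gain: {gain / 1e3:.0f} k V/W")
```

Splitting the gain between the two stages in this way keeps Rf moderate, which eases the stability compensation discussed above; the actual split adopted by the authors may differ.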
The parallel combination of Rf and C on the non-inverting input is needed to compensate the
thermal DC drift due to the temperature coefficient of the amplifier input current.
As depicted in Fig. 30.4a, b, which show the impulse response and the Bode diagram,
the optimal feedback capacitance value is 0.5 pF, which gives a 10 MHz
bandwidth and a fast response without overshoot.
264 V. R. Marrazzo et al.
Fig. 30.4 a Impulse response (Vout [V] versus time [s]) for feedback capacitance values of 0.1, 0.2,
0.3, 0.5 and 0.7 pF; b Bode diagram (Vout [dB] versus frequency [Hz]) for the same values of
feedback capacitance
Fig. 30.5 a Channel output voltage from four TIA (Vout1–Vout4, Vout [V]); b output optical power
from four AWG channels (Ch1–Ch4, output power [W]) during FBG interrogation
30.4 Conclusions
An Optical-to-Digital Converter has been simulated. The results shown here indicate
that the proposed approach is able to transduce the optical signal coming
from the output channels of the AWG employed in this kind of
interrogation system into a voltage with a linear dependence. The amplifier gain has
been chosen as a function of the ADC: with a 12-bit resolution and a maximum allowed input
of 1 V, a TIA gain of 121 k is needed. With these characteristics, the ADC quantum is
about 0.244 mV, which corresponds to 2 nA in terms of current and about 2 nW in terms of
optical power, and determines a wavelength resolution below one picometer. The
circuit works up to 10 MHz, making the proposed interrogator a valid
competitor among measurement systems of this kind.
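The figures quoted for the ADC quantum follow from a short calculation, sketched below. The variable names are illustrative, and the responsivity of 1 A/W is an assumption made so that the same 121 k factor converts the voltage step into both a current and an optical power, consistent with the 2 nA and 2 nW values stated above.

```python
ADC_BITS = 12
ADC_FULL_SCALE_V = 1.0
GAIN = 121e3               # overall gain from Eq. (30.3)
RESPONSIVITY_A_PER_W = 1.0  # assumed photodiode responsivity

# Smallest voltage step the ADC can resolve.
lsb_v = ADC_FULL_SCALE_V / 2 ** ADC_BITS

# Equivalent optical power step at the photodiode, then the photocurrent step.
power_step_w = lsb_v / GAIN
current_step_a = power_step_w * RESPONSIVITY_A_PER_W

print(f"ADC quantum:   {lsb_v * 1e3:.3f} mV")
print(f"optical power: {power_step_w * 1e9:.2f} nW")
print(f"photocurrent:  {current_step_a * 1e9:.2f} nA")
```

The resulting values are about 0.244 mV, 2 nW and 2 nA, in line with the text.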
References
31.1 Introduction
A key characteristic of a sensor for monitoring intraoral forces is, clearly, its
size [4]. In fact, it must be positioned either inside the mouth or in contact
with a very limited surface, such as that of a tooth.
Another characteristic is the resolution of the sensor, which must detect forces of
a few grams [5]. Sensors should be compatible with standard CMOS technology or
industrialized processes [4, 6–8].
The last requirement for this kind of device is to overcome the prestrain problem that
affects every strain gauge sensor, which is caused by its mechanical placement and
leads to an altered rest condition.
With the aim of creating a completely wireless and size-constrained system, a custom
circuit was conceived, designed and prototyped with a form factor of 2 × 1 cm, as
shown in Fig. 31.1. The system will be embedded in EPO-TEK® MED-301
biocompatible epoxy from Epoxy Technology Inc. so that it can be used inside the human mouth.
The circuit consists of four main blocks (Fig. 31.3):
• Sensors: an analysis of the state-of-the-art literature and a search for the
best-performing electronic components available on the market [4] led us to select
the model 015LW by VPG Inc. (Fig. 31.2), a 120 Ω strain gauge measuring
1.90 × 1.37 mm. The sensor block also includes a MEMS accelerometer for
future use (Fig. 31.3).
• Power Conditioning: it contains the supply source regulation block and a DC/DC
converter able to boost a 1.5 V coin battery source.
Due to the high amplification of the PGA (from 2 to 760 V/V), prestrain [9]
could saturate the output, corrupting the measurements of the forces
coming from the strain gauges, which are embedded in resin as shown in Fig. 31.5. A
dedicated software routine adjusts the DAC output to dynamically
compensate for prestrain, as shown in Fig. 31.6.
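One possible shape for such a compensation routine is sketched below. It is not the authors' firmware: the signal-chain model, the supply voltage, the DAC resolution and span, and all function names are hypothetical, and the search strategy (a binary search over the DAC code that brings the amplified output back to mid-scale) is simply one reasonable choice when the output grows monotonically with the DAC code.

```python
def conditioned_output(v_sensor, dac_code, gain=760.0, vdd=3.0,
                       dac_bits=12, dac_span=0.02):
    """Hypothetical signal chain: sensor voltage plus DAC offset, amplified
    around mid-supply and clipped to the supply rails."""
    full_scale = 2 ** dac_bits - 1
    v_offset = (dac_code / full_scale - 0.5) * dac_span  # about +/-10 mV
    v_out = vdd / 2 + gain * (v_sensor + v_offset)
    return min(max(v_out, 0.0), vdd)


def find_compensation_code(read_output, target, dac_bits=12):
    """Binary search for the lowest DAC code whose output reaches the target.

    read_output(code) must be non-decreasing in code.
    """
    low, high = 0, 2 ** dac_bits - 1
    while low < high:
        mid = (low + high) // 2
        if read_output(mid) < target:
            low = mid + 1
        else:
            high = mid
    return low


PRESTRAIN_V = 0.004  # illustrative prestrain-induced offset at the PGA input

before = conditioned_output(PRESTRAIN_V, 2048)
code = find_compensation_code(
    lambda c: conditioned_output(PRESTRAIN_V, c), target=1.5)
after = conditioned_output(PRESTRAIN_V, code)
print(f"mid-scale DAC code: output {before:.3f} V (saturated)")
print(f"compensated code {code}: output {after:.3f} V")
```

In this toy model the uncompensated output sits at the supply rail, while the selected DAC code brings it back to within one DAC step of mid-scale, leaving the full output swing available for the force signal.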
The measurements are sent to a smartphone over a Bluetooth connection and
shown in a custom application, where the user can select, for each single sensor,
whether to see raw or processed data, as shown in Fig. 31.4.
31.4 Conclusions
In this work, a wireless intraoral sensor device has been proposed. It is well suited
to extracting information about intraoral forces because of its reduced size, the
use of the BLE protocol instead of wired communication and the ability to dynamically
compensate for prestrain issues.
272 M. Merenda et al.
Fig. 31.6 Saturation of the output (a) and prestrain overcome (b). Voltage output of the signal
conditioning block after DAC offset addition
Acknowledgements The research results presented in this paper are based on the activities carried
out in the framework of the project “MoSSY—Cyber Physical System Technology for the Monitor-
ing of Stomatognathic System” (00008– ALTRI_DR_408_2017_Ric.di_Aten-DADDONA) funded
by the University of Naples Federico II within the “Programma per il finanziamento della ricerca
di Ateneo” (2016–2019).
References
1. Koc D, Dogan A, Bek B (2010) Bite force and influential factors on bite measurement: a literature
review. Eur J Dent 4:223–232
31 Wireless Sensors for Intraoral Force Monitoring 273
2. Pereira LJ, Pastore MG, Bonjardim LR, Castelo PM, Gavião MBD (2007) Molar bite force and
its correlation with signs of temporomandibular dysfunction in mixed and permanent dentition.
J Oral Rehabil 34:759–766
3. Lantada AD, Bris CG, Morgado PL, Maudes JS (2012) Novel system for bite-force sensing and
monitoring based on magnetic near field communication. Sensors 12(9):11544–11558
4. D’Addona DM, Merenda M, Della Corte FG (2019) Electronic sensors for intraoral force
monitoring: state-of-the-art and comparison. Procedia CIRP 79:730–733
5. D’Addona DM, Rongo R, Teti R, Martina R (2018) Bio-compatible cyber-physical system for
cloud-based customizable sensor monitoring of pressure conditions. Procedia CIRP 67:150–155
6. Aquilino F, Della Corte FG, Fragomeni L, Merenda M, Zito F (2009) CMOS fully-integrated
wireless temperature sensors with on-chip antenna. In: European microwave week 2009, 39th
European microwave conference, art. no. 5296138, pp 1117–1120
7. Merenda M, Felini C, Della Corte FG (2018) A monolithic multisensor microchip with complete
on-chip RF front-end. Sensors (Switzerland), 18(1). Article no 110
8. Aquilino F, Della Corte FG, Merenda M, Zito F (2008) Fully-integrated wireless temperature
sensor with on-chip antenna. In: Proceedings of IEEE sensors, Article no 4716552, pp 760–763
9. Rees DWA (1986) The sensitivity of strain gauges when used in the plastic range. Int J Plast
2(3):295–309
Part VII
Power and High Voltage Electronics
Chapter 32
Reinforced Galvanic Isolation:
Integrated Approaches to Go Beyond
20-kV Surge Voltage (invited)
32.1 Introduction
Reliability and safety issues require galvanic isolation in several application fields,
such as automotive (e.g., electric and hybrid vehicles), industrial (e.g., motor
control, automation), medical (e.g., implanted devices, defibrillators, patient
monitoring), consumer (e.g., home appliances, induction cooking) and
communication (e.g., sensors, wireline networks). A general block diagram
of a galvanically isolated system is depicted in Fig. 32.1a. Data signals are
transferred across the galvanic isolation barrier to enable bidirectional communication
between the two interfaces A and B, while an isolated power supply for interface B
is provided from interface A by a power transfer technique. Recent standardization
for semiconductor isolators defines accurate testing for the maximum transient
isolation voltage, V_IOTM, and the maximum repetitive voltage, V_IORM, which measure
the capacity to withstand high voltages for very short periods of time and throughout
the device lifetime, respectively [1]. Another important specification is the
maximum surge isolation voltage, V_SURGE, which quantifies the capability of the isolator to
withstand very high voltage impulses of a certain transient profile, which can arise
from indirect lightning strikes or faults, as shown in Fig. 32.1b. The highest level
of isolation, namely reinforced isolation, is achieved at the component level only if
the component passes the surge test with a V_SURGE greater than 10 kV. At present, both
industrial and automotive applications are moving towards 10 kV, and some applications
(e.g., patient monitoring systems) already require a V_SURGE higher than 15 kV, while
in the near future galvanic isolation up to 20 kV will be required.
Fig. 32.1 a Simplified block-diagram of a galvanically isolated system. b Simplified surge test
profile according to [1]
State-of-the-art galvanic isolators are based on electromagnetic (EM) coupling
(i.e., capacitive or inductive) across a dielectric layer (i.e., the galvanic barrier).
An integrated galvanic barrier can be implemented by using silicon dioxide (SiO2),
which exhibits a breakdown voltage (BV) of about 1000 V/µm [2], sometimes in
combination with silicon nitride (Si3N4) and oxynitride (SiON) to further improve its
isolation rating [3]. In recent years, oxide isolation has been successfully exploited
for highly integrated isolated data [4–6] and isolated power transfer [7–11] by means
of on-chip capacitors or stacked transformers. However, oxide insulation can reliably
provide only a limited surge capability (typically 5–6 kV), since increasing the oxide
thickness produces wafer mechanical stress and second-order BV effects. The use of two
series-connected galvanic isolation barriers, namely double isolation, is exploited
to improve the overall isolation rating. It can be a viable solution for digital
isolators (i.e., data transfer) [12], with a maximum V_SURGE of 12.8 kV obtained by using a pair
of isolation capacitors [13]. However, in isolated dc-dc conversion this approach
suffers from a power efficiency degradation, which can be slightly mitigated by
adopting integrated LC resonant barriers [14].
The galvanic barrier can also be implemented with other dielectric layers, such as
polyimide, traditionally used in the semiconductor industry for stress relief. In this
case, the isolation device (typically a stacked transformer) is built as a stand-alone
chip by using post-processing fabrication steps, at the cost of reducing the integration
level (i.e., from two to three chips per isolated channel). This approach
guarantees high data rates with a high isolation rating (up to 20 kV) and common-mode
transient immunity (CMTI) performance (better than 200 kV/µs) [15], while also being
suited to power transfer up to several hundreds of mW with maximum power
efficiencies higher than 30% [16]. Since polyimide BV is about 250 V/µm, the
isolation layer is typically about 3x thicker than an oxide barrier sustaining the
same isolation voltage. On the other hand, very thick polyimide layers can be manufactured, with a
record thickness of 32.5 µm able to withstand a 20-kV surge voltage [15], which is not
practical with silicon dioxide layers. In any case, isolation approaches based either
on integrated SiO2 barriers or on post-processed polyimide transformers have inherent
limitations in terms of isolation that can be overcome only by means of expensive
and time-consuming technological advances.
Sections 32.2 and 32.3 describe two alternative isolation techniques based on radio
frequency (RF) coupling between two isolated interfaces, which are suited for
isolated data and isolated power transfer, respectively [17, 18]. In these approaches, the
galvanic isolation is provided by packaging/assembling techniques, which guarantee
design flexibility. Indeed, the distance through insulation (DTI), which is
responsible for the isolation rating, can be properly increased to guarantee the required
V_SURGE.
The key parameter for galvanically isolated power transfer systems is power
efficiency. Indeed, it must be traded off against the isolation performance, since both are related
to the dielectric thickness of the isolation transformer, whether an integrated
barrier or a post-processed device is used. When a very high isolation rating is required, a
different isolation approach can be applied to the traditional power transfer
architecture [18]. It takes advantage of a well-established technology, known as
“wafer-to-wafer bonding”, already used in different contexts (e.g., integration of MEMS).
The idea is to exploit face-to-face coupling between metal spirals by opposing two
silicon wafers, interposing a dielectric layer between them and using proper vias,
called “through-silicon vias” (TSVs), to implement external connections, as depicted
in Fig. 32.4a. For better understanding, the 3D views and the cross-section of the
isolator are reported in Fig. 32.4b, c, respectively. A face-to-face isolation
transformer is implemented by using the wafer-to-wafer bonding technique. The top chip
contains the primary coil of the transformer (to be connected to the oscillator), while
the bottom chip includes the secondary winding that drives the rectifier, according
to the simplified architecture in Fig. 32.4d. Both transformer windings are therefore
fabricated using the top metal layer of the adopted technology. The overall
vertical structure can be mounted into a proper package by using bonding wires and
solder bumps for the top and bottom wafer connections, respectively. This approach
allows the interposed dielectric layers to be properly chosen in order to guarantee the
required isolation rating without degrading the transformer efficiency. To this aim,
materials with high dielectric strength can be used. Moreover, different isolation ratings
can be obtained by controlling the distance and/or the type of materials used when
attaching the first and second silicon chips at wafer level. Compared with traditional
isolated power transfer systems based on integrated or post-processed power
transformers, this approach is more flexible and can provide both a higher isolation rating
and a smaller isolator size.
Fig. 32.4 a Package structure. b Simplified block diagram of the RF galvanic isolator. Face-to-face
3D assembly: 3D top and bottom views c, d cross-section
282 E. Ragonese et al.
References
1. DIN VDE Semiconductor Devices-Magnetic and Capacitive Coupler for Basic and Reinforced
Isolation, VDE Verlag VDE V 0884–11, Jan 2017
2. Palumbo V, Ghidini G, Carollo E, Toia F (2015) Integrated transformer. US Patent App
14733009, filed June 8 2015
3. Mahalingam P, Guiling D, Lee S (2007) Manufacturing challenges and method of fabrica-
tion of on-chip capacitive digital isolators. In: Proceedings of international symposium on
semiconductor manufacturing, Oct 2007, pp 1–4
4. Krone A et al. (2001) A CMOS direct access arrangement using digital capacitive isolation.
In: Proceedings IEEE international solid-state circuits conference digital technology papers,
Feb 2001, pp 300–301
5. Moghe Y, Terry A, Luzon D (2012) Monolithic 2.5 kV RMS, 1.8 V–3.3 V dual-channel
640 Mbps digital isolator in 0.5 µm SOS. In: Proceedings of IEEE international SOI conference,
Oct 2012, pp 1–2
6. Kaeriyama S et al (2012) A 2.5 kV isolation 35 kV/us CMR 250 Mbps digital isolator in standard
CMOS with a small transformer driving technique. IEEE J Solid-State Circ 47:435–443
7. Spina N, Fiore V, Lombardo P, Ragonese E, Palmisano G (2015) Current-reuse transformer cou-
pled oscillators with output power combining for galvanically isolated power transfer systems.
IEEE Trans Circ Syst I: Reg Papers 62:2940–2948
8. Lombardo P, Fiore V, Ragonese E, Palmisano G (2016) A fully-integrated half-duplex
data/power transfer system with up to 40 Mbps data rate, 23 mW output power and on-chip
5 kV galvanic isolation. In: IEEE international solid-state circuits conference digital technology
papers, Feb 2016, pp 300–301
9. Greco N, Spina N, Fiore V, Ragonese E, Palmisano G (2017) A galvanically isolated dc–dc
converter based on current-reuse hybrid coupled oscillators. IEEE Trans Circuits Syst II: Exp
Brief 64:56–60
10. Fiore V, Ragonese E, Palmisano G (2017) A fully-integrated watt-level power transfer system
with on-chip galvanic isolation in silicon technology. IEEE Trans Power Electron 32:1984–
1995
11. Ragonese E et al (2018) A fully integrated galvanically isolated DC-DC converter with data
communication. IEEE Trans Circ Syst I: Reg Pap 65:1432–1441
12. Javid M, Ptacek K, Burton R, Kitchen J (2018) CMOS bi-directional ultra-wideband galvan-
ically isolated die-to-die communication utilizing a double-isolated transformer. In: Proceed-
ings of IEEE international symposium on power semiconductor devices and ICs, May 2018,
pp 88–91
33.1 Introduction
Sodium-nickel chloride batteries, usually called ZEBRA, are a very interesting
alternative to Lithium-Ion (Li-Ion) ones [1]. ZEBRA technology yields an energy density
comparable to that of Li-Ion, but shows higher coulombic efficiency and is
inherently safer [2]. Moreover, this technology may lead to rather inexpensive
batteries because it employs chemical substances that, unlike lithium, are abundant in nature
[3]. The drawback is that a ZEBRA battery must work with an internal temperature
in the 250–350 °C range. Thus, it must be equipped with a heater, which increases the
energy losses and makes these batteries unsuitable for some applications.
However, the above-mentioned advantages make ZEBRA very appealing as an alternative to
lithium in many applications, such as some automotive cases [4, 5], energy
storage systems for renewable energy [6, 7] and energy sources in telecommunication
applications [8, 9].
Unfortunately, the production process of the ZEBRA battery and the full
exploitation of its capabilities present some open challenges, which limit its penetration in
the battery market. Many studies have been presented in recent years to address the
production problems [10–12]. Instead, the literature is scarce on how improved and
accurate battery modeling and battery state estimation can enable the full
exploitation of this technology [13]. A large amount of experimental data, still not available,
would be needed to address the problem.
The aim of this work is to carry out an experimental characterization of a commer-
cial ZEBRA battery, the FZSonick® 48TL200, in order to collect and make available
data useful for a better modeling of a sodium-nickel chloride battery. As the 48TL200
battery does not provide access to all its internal data, additional sensors have been
applied to the battery and an experimental setup has been developed to carry out the
test campaign.
The basic cell of a ZEBRA battery is composed of a liquid sodium negative
electrode and a positive electrode, which consists of nickel surrounded by a mixture of
nickel chloride (NiCl2), nickel (Ni), iron sulphide (FeS) and liquid electrolyte. The
electrodes are separated by a solid electrolyte made of beta alumina, which allows the
conduction of the sodium ions [5]. The main chemical reactions occurring in the cell
are:
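$$2\,\mathrm{Na} + \mathrm{NiCl_2} \;\underset{\text{Charge}}{\overset{\text{Discharge}}{\rightleftharpoons}}\; 2\,\mathrm{NaCl} + \mathrm{Ni}, \qquad E = 2.58\ \mathrm{V} \tag{33.1}$$
$$\mathrm{FeCl_2} + 2\,\mathrm{Na} \;\underset{\text{Charge}}{\overset{\text{Discharge}}{\rightleftharpoons}}\; \mathrm{Fe} + 2\,\mathrm{NaCl}, \qquad E = 2.35\ \mathrm{V} \tag{33.2}$$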
Only a few internal battery quantities are available on the 48TL200 user interface. Therefore,
we first developed an experimental setup, shown in Fig. 33.1, to monitor and save
as many internal quantities of the battery as possible without opening the battery,
and thus with no risk of damaging it or altering its behavior. In particular, we
added further sensors to measure the battery current and voltage, and the voltage and
current of each of the 5 strings.
The setup employs a National Instruments cDAQ-9178 chassis equipped with one
NI9219 and three NI9215 modules. The NI9219 provides 4 general-purpose channels
288 F. Baronti et al.
that we used to measure the battery voltage, the battery current through a shunt resistor
of 0.5 mΩ, and the voltage of the first string. The three NI9215 modules provide 4 analog
input voltage channels each, in the range of ±10 V. We used them to acquire the
voltage difference between the first string and the other ones, according to the block
diagram shown in Fig. 33.2. The individual current of each string is measured by
means of 5 open-loop Hall effect sensors HCT-0010-050.
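As an illustration of how the raw readings described above could be turned into physical quantities, the following minimal sketch converts the shunt voltage into the battery current and rebuilds the individual string voltages from the first string voltage and the measured differences. The function names and the sign convention of the differences (first string minus the other string) are assumptions made here, not details given in the chapter.

```python
SHUNT_OHM = 0.5e-3  # shunt resistance quoted in the text (0.5 milliohm)


def battery_current(v_shunt):
    """Battery current in A from the voltage drop (in V) across the shunt."""
    return v_shunt / SHUNT_OHM


def string_voltages(v_s1, diffs):
    """Return the voltage of every string.

    v_s1  : voltage of the first string in V
    diffs : differences V_S1 - V_Sk for the other strings
            (assumed sign convention)
    """
    return [v_s1] + [v_s1 - d for d in diffs]
```

For example, a drop of 11 mV across the shunt corresponds to a battery current of 22 A.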
The measurement system is controlled by a custom interface developed in
LabVIEW, which runs on a laptop. The interface allows us to control the power supply
(QPX1200SP) used to charge the battery and the load, which is composed of a power resistor
bank and a relay. In particular, 7 resistors of 2.2 Ω and 1.5 kW can be dynamically
connected in parallel, to obtain a load resistance from about 315 mΩ up to 2.2 Ω and
thus a discharge current from about 135 A down to 22 A, respectively. The PC is also
connected via USB to the battery BMS and stores the few quantities measured by
the BMS, such as the internal temperature.
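The load range quoted above follows from the parallel combination of identical 2.2 Ω resistors. The sketch below reproduces the arithmetic; the function names and the 48 V terminal voltage used in the test values are assumptions for illustration only, since the actual battery voltage changes with the load.

```python
R_SINGLE_OHM = 2.2   # resistance of each power resistor
N_RESISTORS = 7      # resistors available in the bank


def load_resistance(n_connected):
    """Equivalent resistance of n identical resistors connected in parallel."""
    if not 1 <= n_connected <= N_RESISTORS:
        raise ValueError("between 1 and 7 resistors can be connected")
    return R_SINGLE_OHM / n_connected


def discharge_current(v_terminal, n_connected):
    """Ohm's law estimate of the discharge current for a given terminal voltage."""
    return v_terminal / load_resistance(n_connected)
```

With all 7 resistors connected the equivalent resistance is about 0.314 Ω, in line with the 315 mΩ quoted in the text. The quoted 135 A at that resistance would correspond to a terminal voltage of roughly 42 V, which suggests that the battery voltage drops under heavy load.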
Several tests were carried out to extract the main features of the ZEBRA battery
and to understand the BMS behavior more deeply. These experiments provide data
from which very important information can be extracted for better battery modeling
and for the development of a BMS with more accurate algorithms to estimate the internal
state of the battery, such as the State of Charge (SoC), as is done for Li-ion batteries
[15]. The availability of separate data for each string enables a deeper view
of the inner behavior of the battery.
Figure 33.3 shows an example of an experimental result. It reports the voltage
and current of the first string and of the entire battery during a test in which the battery
was completely discharged with a constant load of 2.2 Ω and then recharged with a
power supply setting of 55 V. We note that the battery and first string voltages
overlap in the discharge phase. The same happens for the other strings, which are
connected in parallel, possibly through ideal diodes. Instead, the two voltages differ in
the charge phase. The battery voltage is 55 V, as set by the external supply, whereas
the first string voltage and current resemble a classic Constant Current/Constant
Voltage (CC/CV) profile [16]. A similar behavior is found on the other strings. This
33 Experimental Characterization of a Commercial … 289
Fig. 33.3 Current and voltage of the battery and the first string during a test consisting of a full
discharge with a constant load of 2.2 Ω followed by a recharge phase. Axes: voltage (V) and
current (A) versus time (h); traces: V_S1, V_batt, I_S1, I_bus
observation suggests that the parallel connection is removed and that the BMS charges
each string individually through a dedicated DC/DC converter.
This test was repeated for all the load values and the available results are
summarized in Table 33.1. These results suggest a couple of interesting conclusions. First,
the extracted charge and energy are maximum for an average current of about 50 A.
This happens because the energy spent in the battery heaters to keep the battery
temperature above its minimum value decreases when the battery current increases. The
heat generated in the parasitic series resistance of the battery also helps to
maintain the minimum temperature level, and the heaters may eventually
be turned off. Second, the heat generated in the battery series resistance at high
discharge currents is larger than that dissipated through the case. The net effect is
that the battery temperature rises until it exceeds its maximum value,
eventually causing the battery disconnection. We end up with the very disappointing
result that the BMS interrupts the discharge phase before the battery is fully
discharged, so the stored energy cannot be fully exploited by the application.
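The extracted charge and energy discussed above can be obtained from the logged voltage and current by numerical integration. The following is a minimal sketch using the trapezoidal rule; the function name and the input format (time in seconds, voltage in V, current in A) are assumptions, and the chapter does not state how the values of Table 33.1 were actually computed.

```python
def extracted_charge_energy(t_s, v, i):
    """Integrate current and power over time with the trapezoidal rule.

    t_s : sample times in seconds (increasing)
    v   : voltages in V, one per sample
    i   : discharge currents in A, one per sample
    Returns (charge in Ah, energy in Wh).
    """
    charge_ah = 0.0
    energy_wh = 0.0
    for k in range(1, len(t_s)):
        dt_h = (t_s[k] - t_s[k - 1]) / 3600.0
        charge_ah += 0.5 * (i[k] + i[k - 1]) * dt_h
        energy_wh += 0.5 * (v[k] * i[k] + v[k - 1] * i[k - 1]) * dt_h
    return charge_ah, energy_wh
```

As a sanity check, a constant 22 A at 48 V for two hours gives 44 Ah and 2112 Wh.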
Fig. 33.4 Current and voltage of the first string in a pulsed current test. Axes: voltage (V) and
current (A) versus time (h); traces: V_S1, I_S1
Finally, let us show the results of a pulsed current test (PCT), i.e. a test in which
the battery load is switched on and off at controlled time instants, useful to
investigate the Open Circuit Voltage (OCV) response of the battery and its behavior at different
Depths of Discharge (DOD). Figure 33.4 shows the first string voltage and current
during a PCT with SoC steps of about 6%. It is worth noting that the battery
behavior is straightforward down to about 70% DOD, with an almost constant OCV
and a gradually increasing internal resistance (larger voltage steps for the same
current pulse). The behavior changes significantly at deeper DOD. This indicates
that a different chemical reaction, involving the iron component of the cell, is taking
place [14]. The availability of these data enables the investigation of more
accurate electrical models of the battery [17].
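The remark on larger voltage steps for the same current pulse can be turned into a simple estimate of the internal resistance, computed as the ratio between the voltage step and the current step at each load transition. The sketch below is only an illustration: the function name and the sample values are hypothetical, and a real analysis would also have to account for the relaxation dynamics of the cell.

```python
def pulse_resistance(v_rest, v_load, i_load, i_rest=0.0):
    """Internal resistance in ohm from one load-on transition.

    v_rest : voltage just before the load is applied (V)
    v_load : voltage just after the load is applied (V)
    i_load : discharge current with the load connected (A)
    i_rest : current before the transition (A), zero for an open circuit
    """
    delta_i = i_load - i_rest
    if delta_i == 0:
        raise ValueError("the current step must be non-zero")
    return (v_rest - v_load) / delta_i


# Hypothetical pulses as (v_rest, v_load, i_load); not measured data.
pulses = [(51.6, 50.1, 10.0), (51.5, 49.5, 10.0)]
resistances = [pulse_resistance(*p) for p in pulses]
```

In this made-up example the second pulse gives a larger resistance than the first, which is the kind of trend the text describes as DOD increases.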
33.4 Conclusions
This paper shows the preliminary results of the experimental
characterization of a commercial sodium-nickel chloride battery for telecom applications.
This kind of battery seems an appealing candidate to replace Li-ion technology,
particularly in stationary applications, due to the intrinsic safety of the chemistry
involved. However, research efforts are still needed in battery modeling and BMS
implementation. The battery was instrumented to collect as much data as possible on the
temperature, voltage and current of the parallel strings. The experimental
setup is described and the results of charge/discharge cycles with different
loads are discussed. The results show that, at low load currents, the entire battery
energy can be fully exploited only when the heater dissipation is negligible. At high loads, the
battery does not deliver its whole energy because the internal dissipation leads to a
dangerous increase of the internal temperature, which forces the BMS to disconnect
the battery before the end of the test. Finally, experimental results from pulsed
current tests, useful for the validation of improved battery models, are shown.
References
1. Bhatt DK, El Darieby M (2018) An assessment of batteries form battery electric vehicle per-
spectives. In: 2018 6th IEEE international conference on smart energy grid engineering, SEGE
2018, pp 255–259
2. Hueso KB, Palomares V, Armand M, Rojo T (2017) Challenges and perspectives on high and
intermediate-temperature sodium batteries. Nano Res 10(12):4082–4114
3. Gruber PW, Medina PA, Keoleian GA, Kesler SE, Everson MP, Wallington TJ (2011) Global
lithium availability: a constraint for electric vehicles? J Ind Ecol 15(5):760–775
4. Capasso C, Veneri O (2014) Experimental analysis of a zebra battery based propulsion system
for urban bus under dynamic conditions. Energy Procedia 61:1138–1141
5. O’Sullivan TM, Bingham CM, Clark RE (2006) Zebra battery technologies for the all electric
smart car. In: International symposium on power electronics, electrical drives, automation and
motion, 2006. SPEEDAM 2006, Nov 2006, pp 244–248
6. Lu X, Yang Z (2014) Molten salt batteries for medium- and large-scale energy storage
7. Bignucolo F, Coppo M, Crugnola G, Savio A (2017) Application of a simplified thermal-electric
model of a sodium-nickel chloride battery energy storage system to a real case residential
prosumer. Energies 10(10):1497
8. Restello S, Lodi G, Miraldi AK (2012) Sodium nickel chloride batteries for telecom application:
a solution to critical high energy density deployment in telecom facilities. In: INTELEC,
international telecommunications energy conference (proceedings), pp 1–6
9. Restello S, Spa F, Maggiore M, Zanon N, Spa F, Maggiore M (2013) Sodium nickel batteries
for telecom hybrid power systems. 2. Sodium nickel chloride technology, vol 2, pp 324–328
10. Lu X, Coffey G, Meinhardt K, Sprenkle V, Yang Z, Lemmon JP (2010) High power planar
sodium-nickel chloride battery, pp 7–13
11. Lu X, Li G, Kim JY, Lemmon JP, Sprenkle VL, Yang Z (2013) A novel low-cost sodium-zinc
chloride battery. Energy Environ Sci 6(6):1837–1843
12. Chang HJ, Lu X, Bonnett JF, Canfield NL, Son S, Park YC, Jung K, Sprenkle VL, Li G (2018)
“Ni-less” cathodes for high energy density, intermediate temperature Na-NiCl2 batteries. Adv
Mater Interfaces 5(10):1–8
13. Benato R, Dambone Sessa S, Necci A, Palone F (2016) A general electric model of sodium-
nickel chloride battery. In: AEIT 2016—international annual conference: sustainable develop-
ment in the Mediterranean area, energy and ICT networks of the future
14. Sudworth JL (2001) The sodium/nickel chloride (ZEBRA) battery. J Power Sources 100(1–
2):149–163
15. Morello R, Di Rienzo R, Roncella R, Saletti R, Baronti F (2018) Hardware-in-the-loop platform
for assessing battery state estimators in electric vehicles. IEEE Access
16. Dung LR, Chen CE, Yuan HF (2016) A robust, intelligent CC-CV fast charger for aging lithium
batteries. In: IEEE international symposium on industrial electronics
17. Musio M, Damiano A (2015) A non-linear dynamic electrical model of sodium-nickel chloride
batteries. In: 2015 international conference on renewable energy research and applications,
ICRERA 2015
Chapter 34
Design and Development of a Prototype
of Flash Charge Systems for Public
Transportation
A. Alessandrini
DICEA, University of Florence, Florence, Italy
R. Barbieri · L. Berzi · M. Pierini · L. Pugi (B)
DIEF, University of Florence, Via di Santa Marta 3, 50139 Florence, Italy
e-mail: [email protected]
F. Cignini · A. Genovese · F. Ortenzi
ENEA Centro di Ricerca di Casaccia, Rome, Italy
According to Relation (34.1), raising the gain K_p makes the capacitors provide most
of the power; otherwise, the batteries do. By directly specifying I_ref as well, the user
can choose a specific amount of power flowing from or to the batteries,
forcing their recharge or discharge according to the system logic.
The final prototype of the system, visible in Fig. 34.4, was assembled and
tested first on a simulation test bench and then by driving on an internal road
circuit available at the ENEA research center in Casaccia (Rome, Italy).
In Fig. 34.5, some experimental results are shown: the vehicle performs a mission
(starts and stops, acceleration, braking, etc.), and the total current required by the various
connected loads is measured and compared with the contributions of the supercapacitors
(DC-DC current) and of the batteries. The proposed approach yields a smooth behavior
of the system, which is able to largely exploit the energy stored in the supercapacitors
while keeping the power demand on the batteries small and smooth.
Fig. 34.5 Preliminary results, showing the capability of the proposed system to smoothly reduce
the applied load on batteries exploiting energy provided by super-capacitors
It is interesting to notice that the behavior of the system can be easily tuned by acting
on the gain of the proposed battery current controller. More complicated or
nonlinear calibrations can be performed simply by introducing a tabulated gain
scheduling. The proposed controller runs on ATMEL 16-bit
microcontrollers with limited performance, and most of the computational effort is
taken up by communication and diagnostic tasks. This demonstrates the
simplicity of the proposed approach, which leaves a wide margin for future implementations
on higher-performance hardware.
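A tabulated gain scheduling of the kind mentioned above can be sketched as a lookup table with linear interpolation between breakpoints. The scheduling variable, the breakpoints and the gain values below are hypothetical; the chapter does not specify them, and an implementation on a 16-bit microcontroller would more likely use fixed-point arithmetic in C.

```python
from bisect import bisect_right


def scheduled_gain(x, breakpoints, gains):
    """Gain for scheduling variable x, by linear interpolation in a table.

    breakpoints : strictly increasing values of the scheduling variable
    gains       : gain at each breakpoint (same length as breakpoints)
    Outside the table the gain is held at the nearest end value.
    """
    if x <= breakpoints[0]:
        return gains[0]
    if x >= breakpoints[-1]:
        return gains[-1]
    j = bisect_right(breakpoints, x)      # breakpoints[j-1] <= x < breakpoints[j]
    x0, x1 = breakpoints[j - 1], breakpoints[j]
    g0, g1 = gains[j - 1], gains[j]
    return g0 + (g1 - g0) * (x - x0) / (x1 - x0)


# Hypothetical table: gain K_p as a function of a normalized scheduling variable.
KP_BREAKPOINTS = [0.0, 0.5, 1.0]
KP_GAINS = [1.0, 3.0, 4.0]
```

Holding the end values outside the table keeps the gain bounded if the scheduling variable leaves its expected range.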
The authors have successfully assembled a prototype of the proposed flash charge
system. Preliminary results are encouraging. The authors are currently completing the
experimental activities, and the complete results will probably be the subject of a future
publication. The authors are also planning some improvements to the current system.
First, the current lead-acid batteries will be replaced with higher-performance lithium
ones. In this way, the stored energy can be further increased. Lithium batteries
should also support a further increase of the energy transferred during the flash charge process,
since their specific power is considerably higher than that of lead-acid ones. It is interesting to notice
that, according to previous research experiences [6, 7], the durability of this kind of battery
should also be greatly improved by the adoption of the proposed system. The
static recharge station should also be further improved by introducing an active system
able to further increase the transferred power. Finally, the authors plan to
test wireless recharge systems on the bus, which in their previous experience
have also proven to be a valid solution for the recharge of road vehicles [8, 9].
Acknowledgements The authors wish to thank all the people who have cooperated in the success of this
activity. In particular, the authors wish to mention all the people of ENEA and the University of Florence
who provided fundamental support for the preliminary testing activities and for the
final assembly of the vehicle.
References
Abstract The online monitoring of high voltage apparatus is a crucial aspect of
a predictive maintenance program. The insulation system of an electrical machine
is affected by partial discharge (PD) phenomena that, in the long term, can lead
to breakdown. This in turn may bring about a significant economic loss; wind
turbines provide an excellent example. Thus, it is necessary to adopt embedded
solutions for monitoring the insulation status. This paper introduces an online system
that exploits fully unsupervised methodologies to assess in real time the condition of
the monitored machine. Accordingly, the monitoring process does not rely on any
prior knowledge about the apparatus. Nonetheless, the proposed system can identify
the relevant drifts in the machine status. Notably, the system is designed to run on
low-cost embedded devices.
35.1 Introduction
that is responsible for identifying the category of the defect affecting the apparatus. Usually,
machine learning methodologies support the classification of the PD sources.
In [2] an ultra high frequency (UHF) antenna was used to sample the discharges;
then, a set of features was extracted from the raw data. The K-means algorithm
was used to cluster the defects in the feature space. In [7] the
defects were classified using a neural network on features extracted from signals
picked up by a high frequency current transformer (HFCT). In [4] DBSCAN
clustering is applied to features extracted from the wavelet decomposition of signals
coming from an HFCT sensor. Recently, in [1] a gas insulated substation (GIS) was
monitored with both antenna and HFCT sensors; PD source recognition was
performed by exploiting clustering techniques. The major weakness of those
approaches is that the classification system requires a training procedure. This in turn
means that a large training set should be available.
Other approaches are based on the statistical analysis of the PD signals. In [8] the
supervised classification approach utilizes histogram similarity. Accordingly, a PD
pattern is suitably represented as a histogram. The system relies on a database of
histograms, where each defect is represented. As a result, online monitoring applies
a hypothesis test to compare the input histogram with each histogram included in
the database. This approach again requires a large dataset of labeled histograms.
Moreover, multi-defect PD patterns remain an issue.
In this paper, online monitoring relies on a fully unsupervised approach that does
not require a pre-built training set. The model is designed to track in real time the drift
affecting the insulation system. To this purpose, aging phenomena are detected by
using the approaches commonly adopted in anomaly detection. Hence, in principle,
no prior knowledge of the apparatus under investigation is needed. Moreover, the
whole system is designed to target low-resource devices, such as low-cost embedded
systems.
The proposed monitoring model assesses in real time the aging of the
insulation system. Since a fully unsupervised approach is targeted, the model is designed
to detect significant changes in the status of the apparatus without a knowledge base.
Hence, the monitoring process is approached as an anomaly detection problem. At
any instant T the goal is to check whether the apparatus shows an anomalous behavior
with respect to the past. The status of the insulation system at time T is characterized
by sampling, at a given frequency, the amplitude of the signal sensed by the HFCT
in a time window δ. Such a signal is converted into a vector x by using the same process
that leads to a PD pattern; in this case, though, the phase information is discarded to
obtain a vector. As a result, the size of x depends on the occurrences of PDs in the
time window δ.
35 Unsupervised Monitoring System for Predictive Maintenance … 303
where K1 and K2 are scaling constants that are used to adjust for unequal sample sizes. Given χ² and the degrees of freedom DF, which corresponds to the number of non-empty bins, the p-value can be computed from the Chi2 distribution. The null hypothesis is rejected if p-value < α, where α is the significance level set a priori (usually α = 0.05).
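As a hedged illustration of the binned comparison described above, the sketch below computes a two-sample chi-square statistic. It assumes the common form χ² = Σ (K1·R_i − K2·S_i)² / (R_i + S_i), with K1 = sqrt(ΣS/ΣR) and K2 = sqrt(ΣR/ΣS); this form, the function name `chi2_two_sample` and the histogram arguments are assumptions made for illustration and may differ from the expression used by the authors. The p-value step is left out.

```python
import math


def chi2_two_sample(hist_r, hist_s):
    """Return (chi2, dof) for two histograms defined on the same bins."""
    if len(hist_r) != len(hist_s):
        raise ValueError("histograms must share the same bins")
    total_r, total_s = sum(hist_r), sum(hist_s)
    if total_r == 0 or total_s == 0:
        raise ValueError("both histograms must contain at least one count")
    # Scaling constants that compensate for unequal sample sizes (assumed form).
    k1 = math.sqrt(total_s / total_r)
    k2 = math.sqrt(total_r / total_s)
    chi2, dof = 0.0, 0
    for r, s in zip(hist_r, hist_s):
        if r + s == 0:
            continue  # bins that are empty in both histograms are skipped
        chi2 += (k1 * r - k2 * s) ** 2 / (r + s)
        dof += 1  # DF counts the non-empty bins, as stated in the text
    return chi2, dof
```

Two histograms with the same shape but different sizes give a statistic close to zero, which is the purpose of the scaling constants.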
The KS test processes the empirical cumulative distribution functions (ECDF) of x_T and x̃, respectively. Given N points p_n ordered from smallest to largest value, the ECDF is defined as
$$F_{O,n} = \frac{i_n}{N} \tag{35.2}$$
$$D = \sup_n \left| F_{O,n} - F_{E,n} \right| \tag{35.3}$$
Given D, the p-value can be calculated as in [5]. Eventually, the null hypothesis is rejected if the p-value is lower than the significance level α.
In the following, FindAnomaly(x_T, x̃, α) will denote a function that returns 1 when the null hypothesis is rejected, and 0 otherwise. Such a function may exploit either the Chi2 test or the KS test.
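A minimal sketch of a KS-based version of such a function is given below. It is not the authors' implementation: the p-value uses the asymptotic Kolmogorov series rather than the procedure of [5], it follows the usual convention that the null hypothesis is rejected when the p-value falls below α, and the names `ks_statistic`, `ks_p_value`, `find_anomaly` and `x_ref` (standing for x̃) are hypothetical.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        # ECDF value at v: fraction of points less than or equal to v.
        f_a = bisect.bisect_right(a, v) / len(a)
        f_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(f_a - f_b))
    return d


def ks_p_value(d, n_a, n_b, terms=100):
    """Asymptotic two-sample p-value (assumption: large-sample approximation)."""
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the p-value is ~1
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


def find_anomaly(x_t, x_ref, alpha=0.05):
    """Return 1 when the null hypothesis (same distribution) is rejected."""
    d = ks_statistic(x_t, x_ref)
    return 1 if ks_p_value(d, len(x_t), len(x_ref)) < alpha else 0
```

The two vectors may have different lengths, which matches the remark that the size of x depends on the number of PDs observed in the window.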
304 C. Gianoglio et al.
The proposed monitoring system is designed to identify in real time the abrupt discontinuities in the status of the apparatus, which are reported as alerts. Accordingly, the monitoring procedure is organized as follows. In standard mode, the monitoring system continuously gets the vector x_t measured at time t and verifies the occurrence of an anomaly (as per Sect. 35.2). If an anomaly is revealed, a second procedure starts; its goal is to verify whether an alert should be activated. Let T* denote the instant in which an anomaly has occurred. The procedure sets an alert only if, in the time window between T* and T* + Δ, the anomaly occurs again at least thr times (counted by numberOfAnomalies), where thr and Δ have been fixed empirically. Algorithm 1 gives the pseudo-code of this procedure. In practice, the monitoring system is designed to set an alert only when a sequence of anomalies is detected. Thus, it detects only the significant discontinuities in the status of the apparatus. It is worth noting that the approach is fully unsupervised and does not require any previous knowledge about the apparatus. Moreover, the computational complexity of the whole monitoring procedure is negligible, which in turn means that the monitoring system can be hosted by low-cost, resource-limited embedded systems.
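The alert logic described above can be sketched as follows. This is one possible reading of the procedure, not a transcription of Algorithm 1: time is assumed to be indexed by measurement step, the window Δ is expressed in steps, and monitoring is assumed to resume in standard mode after the window; the names `detect_alerts`, `anomaly_flags`, `delta` and `thr` are illustrative.

```python
def detect_alerts(anomaly_flags, delta, thr):
    """Return the indices T* at which an alert is set.

    anomaly_flags: sequence of 0/1 values, one per measurement step
    delta: length of the confirmation window, in steps (assumption)
    thr: minimum number of repeated anomalies inside the window
    """
    alerts = []
    t = 0
    while t < len(anomaly_flags):
        if anomaly_flags[t] == 1:
            # Count how many times the anomaly occurs again after T* = t.
            window = anomaly_flags[t + 1 : t + 1 + delta]
            number_of_anomalies = sum(window)
            if number_of_anomalies >= thr:
                alerts.append(t)
            t += delta + 1  # resume standard mode after the window (assumption)
        else:
            t += 1
    return alerts
```

An isolated anomaly is therefore ignored, while a burst of anomalies inside the window produces a single alert.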
The proposed monitoring system can become part of an IoT-based predictive maintenance framework, as per Fig. 35.1. First, the alerts allow one to categorize all the data associated with an apparatus according to well-defined landmarks. Hence, a remote database can collect structured information provided by all the monitored apparatuses. In practice, the landmarks generated via an unsupervised process can be exploited to eventually label the data provided by online measurements. As a result, a remote warehouse can exploit such a database and machine learning methodologies to make inferences, given a measure x_t for an apparatus, on the exact type of defect affecting the insulation system under investigation. Under the paradigm of edge computing, the local embedded system can be designed to implement the inference function, which receives its parameters from the cloud. In this regard, the literature offers a few examples of hardware-friendly implementations of inference functions [3, 6].
The experimental session involved two twisted pair specimens that underwent aging tests according to standard IEC 60851-5. An HFCT with a bandpass behavior, placed around the ground cable, served as the sensor. Signals were sampled by a Picoscope with a bandwidth in the range 0–200 MHz and a maximum sample frequency of 1 GSamples/s. The monitoring system was deployed on a Raspberry Pi. In the first experimental session, the Picoscope was configured with a full scale of 20 V and 12 bit resolution. The status of the specimen was monitored every minute, with a sampled time window δ = 0.5 s. Figure 35.2 shows, on a time scale, the alerts generated by the monitoring system before the specimen breakdown (a total of 20 h). The red marks refer to a monitoring system relying on the KS test; the blue marks refer to the Chi2 test. After about four hours, both setups started to produce alerts. In the following four hours the effects of aging phenomena assumed an almost periodic pattern. Then, the gap between successive alerts progressively increased. The Chi2 test produced two more alerts than the KS test. A possible explanation is that the sensitivity of the Chi2 test also depends on nbins; in this experiment it was chosen empirically as nbins = 25. The second test, made on another specimen, aimed at evaluating the impact of the analog to digital converter (ADC) resolution on the monitoring system. In this test, the status of the specimen was monitored every 2 min (δ = 0.5 s); the breakdown occurred after 22.5 h. Figure 35.3 gives the alerts produced with an 8 bit resolution (in blue) and with a 12 bit resolution (in red). Overall, the system generated more alerts when adopting an 8 bit resolution. In fact, it is reasonable to assume that a coarser quantization makes anomaly detection more prone to errors. Under such an assumption, the alerts generated with an 8 bit resolution 1 hour after the start and half an hour before the breakdown, respectively, might represent false alarms.
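To give a feeling for the difference between the two resolutions, the sketch below computes the quantization step of an ideal uniform ADC. It assumes that the 20 V full scale reported for the first session also applies to the second test and that it spans the whole input range; both points, and the function names, are assumptions.

```python
def lsb_volts(full_scale_v, bits):
    """Quantization step of an ideal uniform ADC."""
    return full_scale_v / (2 ** bits)


def quantize(value_v, full_scale_v, bits):
    """Round a voltage to the nearest quantization level."""
    step = lsb_volts(full_scale_v, bits)
    return round(value_v / step) * step


step_8 = lsb_volts(20.0, 8)    # 0.078125 V per level
step_12 = lsb_volts(20.0, 12)  # 0.0048828125 V per level
```

Under these assumptions the 8 bit step is 16 times coarser than the 12 bit one, so small PD amplitudes fall into far fewer distinct levels.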
Fig. 35.2 Alerts produced by Chi2 and KS tests on the first specimen
Fig. 35.3 Alerts produced by Chi2 test with resolutions of 8 and 12 bit
35.5 Conclusions
This paper shows that anomaly detection paradigms can support a fully unsupervised monitoring system for predictive maintenance of high voltage apparatus. The proposed method is computationally light and fits IoT solutions that rely on edge computing. The monitoring system can identify in real time the significant changes in the status of the apparatus, thus revealing aging effects. At the same time, the system enables the automated labeling of acquired data, which become structured information to be stored and processed by deep analytics.
References
1. Álvarez F, Garnacho F, Ortego J, Sánchez-Urán M (2015) Application of HFCT and UHF sensors
in on-line partial discharge measurements for insulation diagnosis of high voltage equipment.
Sensors 15(4):7360–7387
2. Gao W, Ding D, Liu W (2011) Research on the typical partial discharge using the UHF detection
method for GIS. IEEE Trans Power Deliv 26(4):2621–2629
3. Gianoglio C, Guastavino F, Ragusa E, Bruzzone A, Torello E (2018) Hardware friendly neural
network for the PD classification. In: 2018 IEEE conference on electrical insulation and dielectric
phenomena (CEIDP). IEEE, pp 538–541
4. Hao L, Lewin P, Hunter J, Swaffield D, Contin A, Walton C, Michel M (2011) Discrimination
of multiple PD sources using wavelet decomposition and principal component analysis. IEEE
Trans Dielectr Electr Insul 18(5):1702–1711
5. Massey FJ Jr (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc
46(253):68–78
6. Ragusa E, Gianoglio C, Gastaldo P, Zunino R (2018) A digital implementation of extreme
learning machines for resource-constrained devices. IEEE Trans Circuits Syst II: Express Briefs
65(8):1104–1108
7. Wang MH (2005) Partial discharge pattern recognition of current transformers using an ENN.
IEEE Trans Power Deliv 20(3):1984–1990
8. Yazici B (2004) Statistical pattern analysis of partial discharge measurements for quality assessment of insulation systems in high-voltage electrical machinery. IEEE Trans Ind Appl 40(6):1579–1594
Chapter 36
Control System Design for Cogging
Torque Reduction Based on Sensor-Less
Architecture
36.1 Introduction
Servo drives based on brushless motors are increasingly used. This type of electric motor has a high efficiency, a good capacity to deliver relatively high torques and excellent dynamic characteristics. These features make brushless motors the most suitable choice in applications such as operating machines and industrial robots in assembly lines. However, the presence of permanent magnets creates some limitations in the use of this type of motor. One of the problems with brushless motors is the presence of an intrinsic phenomenon called Cogging.
This phenomenon is caused by the magnetic interaction between the two main parts of the machine, which from an operational point of view can be interpreted as a torque oscillation. It is therefore a problem in those applications where a great deal of precision is required. Moreover, it is the cause of unwanted noise for the entire drive system. The result of this interaction is an additive torque which causes an undesired oscillation in the rotation of the rotor axis, even in the absence of power supply to the electric machine. This problem is partially solved by physical modifications in the production phase of the electric motor, which provide for different shapes of the stator slots and of the magnets [1, 2] and therefore require the replacement of the motor itself. However, these solutions are often expensive as they require customized procedures. Therefore, it is simpler from an operational and functional point of view to design an electronic control system that rejects this torque disturbance. The interesting part of the following work is the usage of a sensor-less architecture in the context of the Cogging Torque, exploiting the non-linear control technique designed in our previous work, which is based on the mathematical model of the Cogging Torque as a function of the rotor position. The challenge is to verify whether our control algorithm also works with a sensor-less architecture in which the measure of the encoder/resolver is not available. There are many works in the literature that propose
a control system for synchronous motors with permanent magnets based on a sensor-less architecture. In [3–5] the results of implementing sensor-less architectures based on a discrete-time EKF (Extended Kalman Filter) are reported for brushless motors with trapezoidal air-gap induction, i.e. Brushless DC motors (BLDC). In [6] a solution is presented which makes use of a modern variant of the Kalman Filter called the Unscented Kalman Filter (UKF). In [7, 8] the sensor-less architecture involves a sliding mode state observer; in [8] the controller is also developed with a sliding mode technique. In [9] the design of a sensor-less architecture is described exploiting the criteria of the H-infinity control theory, while in [10] the design of a state observer based on the theory of neural networks is proposed. In [11] the possibility of using a sensor-less architecture to reduce torque ripple is presented. Compared to the other works referenced here regarding sensor-less architectures based on a state observer, in this article a non-linear control system based on an EKF has been developed in continuous time instead of discrete time.
Furthermore, compared to the previous works, the proposed EKF model refers to the dynamics of the motor expressed in three-phase axes instead of direct-quadrature axes. This is because, from an operational point of view, the current and voltage measurements available as observer inputs are actually derived from the three-phase dynamics. In addition, the reduction of an intrinsic disturbance of the machine, which depends directly on the variables estimated by the EKF, is included.
There are two main configurations: the first, identified with the abbreviation SPM (surface permanent magnets), in which the magnets are arranged on the outer surface of the rotor; the second, indicated with IPM (internal permanent magnets), in which the permanent magnets are set in the rotor iron. The problem of Cogging Torque is present for both SPM and IPM brushless motors. In this work, as in [12], reference is made to a brushless motor with SPM architecture. The Cogging Torque arises from the magnetic interaction between the permanent magnets placed on the rotor surface and the teeth of the stator slots. Further, since it is completely due to the field produced by the permanent magnets, it is also present when the electric motor is not powered. This magnetic interaction gives rise to magnetic forces that can be represented as force vectors applied at the centre of the magnets themselves, which create a mechanical torque that induces the rotation of the rotor. The direction of the magnetic force vectors depends on how the flux lines of the magnetic field produced by the magnets close in the stator iron through the teeth of the slots. It is intuitive that the way in which the magnetic flux lines generated by the magnets pass through the stator teeth depends on the relative position between the rotor and the stator.
Each magnetic flux line can be interpreted as a line passing between the centre of
a magnet and the centre of a stator tooth, closed in the rotor and stator iron, as shown
schematically in Fig. 36.1.
Since the internal structure of a brushless motor is symmetrical, it is also intuitive that the same tooth-magnet configuration is repeated several times throughout a full revolution. Further, the total Cogging Torque can be thought of as the superposition of the effects of “cogging torque elements” associated with each tooth-magnet pair.
Fig. 36.1 Schematic representation of the path of the magnetic flux lines
312 D. Pierpaolo and S. Saponara
Fig. 36.2 Schematic representation of the contribution to the Cogging Torque of a single tooth-magnet pair
As schematically shown in Fig. 36.2, during the rotation each magnet (due to the direction in which the flux closes) experiences a force that attracts it towards the tooth and, once the centre of the magnet has passed the centre of the tooth, a force that repels it. Locally there is a symmetrical situation in terms of attraction and repulsion forces, which suggests that the Cogging Torque can be represented as a periodic function of the rotation angle, with zero mean value, whose period depends on the number of magnets and stator teeth. The Cogging Torque depends only on the magnetic interaction between magnets and teeth, due to the field produced by the magnets themselves, so this torque can be interpreted as an additive disturbance with respect to the electromagnetic torque, which instead depends only on the currents supplied to the motor when it is powered.
As in [12], this article also refers to a closed-form description of the cogging torque. In particular, reference is made to [13], which describes the mathematical model reported in Eq. (36.1).
$$T_{cog} = \sum_{k=1}^{m} T_k \sin(k Z \theta + \alpha_k) \tag{36.1}$$
where Tk and αk represent the coefficients relative to the kth harmonic of the Fourier
series, Z is the number of stator teeth and m is the number of harmonics necessary
to approximate the behaviour of the Cogging Torque.
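The Fourier model of Eq. (36.1) can be evaluated with a few lines of code. The sketch below is only illustrative: the function name `cogging_torque` and the coefficient values in the usage lines are hypothetical and are not taken from the motor considered in this chapter.

```python
import math


def cogging_torque(theta, amplitudes, phases, n_teeth):
    """Evaluate T_cog = sum_k T_k * sin(k * Z * theta + alpha_k), k = 1..m."""
    return sum(
        t_k * math.sin(k * n_teeth * theta + a_k)
        for k, (t_k, a_k) in enumerate(zip(amplitudes, phases), start=1)
    )


# Hypothetical two-harmonic model with Z = 12 stator teeth.
example_amplitudes = [0.10, 0.05]
example_phases = [0.0, 0.3]
torque_at_angle = cogging_torque(0.2, example_amplitudes, example_phases, 12)
```

With this form the model repeats every 2π/Z radians of rotor angle, since every harmonic is a multiple of Zθ.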
In this article we refer to our previous work [12] on the design of a control system that solves the problem of Cogging Torque, extending it to the case in which a sensor-less architecture is used. In general, a sensor-less architecture involves the design of a system for estimating the angular position and angular velocity of the rotor that exploits the measurement of the three-phase voltages that supply the motor and of the three-phase currents supplied. In this work a continuous-time EKF is used as the state observer (Fig. 36.3).
With reference to the unified theory of electrical machines [14], for the design of a control system for three-phase motors it is advisable to describe the dynamic equilibrium of the machine in terms of two-phase equivalent circuits. As shown in Fig. 36.3, the direct transformations of Blondel (block B) and Park [block A(θ̂)] are applied to the current vector, while the inverse transformations (A^T(θ̂) and B^T) are applied to the voltage vector output from the FLC block. In this work the design of the EKF refers to the dynamic model expressed in the three-phase coordinate reference, while the control laws are expressed in the direct and quadrature axis system (obtained after the application of both direct coordinate transformations A(θ) and B).
For completeness, Eqs. (36.2) and (36.3) report the dynamic equations of the electro-mechanical equilibrium in the three-phase and direct-quadrature frames, respectively.
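$$\begin{pmatrix} u_a \\ u_b \\ u_c \end{pmatrix} = R_s \begin{pmatrix} i_a \\ i_b \\ i_c \end{pmatrix} + L_{eq} \frac{d}{dt} \begin{pmatrix} i_a \\ i_b \\ i_c \end{pmatrix} - p \omega k_\varphi \begin{pmatrix} \sin(p\theta) \\ \sin\left(p\theta - \frac{2\pi}{3}\right) \\ \sin\left(p\theta - \frac{4\pi}{3}\right) \end{pmatrix}$$
$$J_m \frac{d\omega}{dt} + \beta\omega = -p k_\varphi \sum_{k=1}^{3} i_k \sin\left(p\theta - \frac{2(k-1)\pi}{3}\right) + T_{cog} \tag{36.2}$$
$$\begin{pmatrix} u_d \\ u_q \end{pmatrix} = R_s \begin{pmatrix} i_d \\ i_q \end{pmatrix} + L_{eq} \frac{d}{dt} \begin{pmatrix} i_d \\ i_q \end{pmatrix} + p\omega \begin{pmatrix} -L_{eq} i_q \\ L_{eq} i_d + i_d \end{pmatrix}$$
$$J_m \frac{d\omega}{dt} + \beta\omega = \frac{3}{2} p k_\varphi i_q + T_{cog} \tag{36.3}$$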
where u_a, u_b, u_c and i_a, i_b, i_c are the three-phase voltage and current components, while u_d, u_q and i_d, i_q are the voltage and current components in the direct-quadrature reference frame. R_s and L_eq indicate the stator electrical resistance and equivalent inductance, respectively; p is the number of pole pairs, J_m represents the moment of inertia of the rotor and β is the friction coefficient; θ and ω represent the rotor angular position and speed, respectively.
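As a small numerical check of how the two torque expressions relate, the sketch below evaluates the current-dependent torque term of Eq. (36.2) for a balanced set of phase currents. The current shape in `aligned_currents` is a hypothetical choice made only for illustration, and identifying the current amplitude with i_q assumes an amplitude-preserving scaling of the coordinate transformations, which the text does not state.

```python
import math


def electromagnetic_torque(i_abc, theta, p, k_phi):
    """Current-dependent torque term of Eq. (36.2), without T_cog."""
    return -p * k_phi * sum(
        i_k * math.sin(p * theta - 2.0 * k * math.pi / 3.0)
        for k, i_k in enumerate(i_abc)  # k = 0, 1, 2 plays the role of (k - 1)
    )


def aligned_currents(amplitude, theta, p):
    """Hypothetical balanced currents aligned with the back-EMF shape."""
    return [
        -amplitude * math.sin(p * theta - 2.0 * k * math.pi / 3.0)
        for k in range(3)
    ]


# Hypothetical values: 4 pole pairs, k_phi = 0.1, current amplitude 2 A.
torque = electromagnetic_torque(aligned_currents(2.0, 0.3, 4), 0.3, 4, 0.1)
```

Because the three squared sines always sum to 3/2, the result equals (3/2)·p·k_φ·I at every rotor angle, which mirrors the 3/2 factor appearing in Eq. (36.3).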
From an operational point of view, using the three-phase axis model of the motor for the EKF design is advantageous compared to using the motor model in direct-quadrature axes, as it avoids feeding back the estimated angle to compute the Park transformation of the current measurements, which is instead present in [3–5]. As in our previous work [12], in the dynamics of the motor we refer to a mathematical model of the Cogging Torque with seven harmonics in the Fourier expansion of Eq. (36.1), while the model used for the development of the FLC control is the four-harmonic model.
Equations (36.4) and (36.5) summarize the control laws (the vector of the voltages that supply the motor, expressed in dq axes) and the update dynamics of the state observer.
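$$\begin{pmatrix} u_d \\ u_q \end{pmatrix} = \frac{1}{L_{g_1}L_f^2 h_1 \, L_{g_2}h_2 - L_{g_1}h_2 \, L_{g_2}L_f^2 h_1} \begin{pmatrix} L_{g_2}h_2 & -L_{g_2}L_f^2 h_1 \\ -L_{g_1}h_2 & L_{g_1}L_f^2 h_1 \end{pmatrix} \left[ \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} - \begin{pmatrix} L_f^3 h_1 \\ L_f h_2 \end{pmatrix} \right] \tag{36.4}$$
$$\frac{d\hat{x}}{dt} = F_{\hat{x}}\,\hat{x} + G_{\hat{x}}\,u + K_{EKF}\left(y - C\hat{x}\right)$$
$$\frac{dP}{dt} = F_{\hat{x}} P + P F_{\hat{x}}^{T} + Q - K_{EKF} R K_{EKF}^{T}$$
$$K_{EKF} = P C^{T} R^{-1} \tag{36.5}$$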
In Eq. (36.4), L_f^k h_j is the kth Lie derivative of the generic output function h_j along the direction of the vector field f, while the vector (v_1, v_2)^T represents the auxiliary control in the new base identified through the operating procedure of the technique used. Reference is made to a proportional auxiliary control, a simple static feedback of the state variables defined in the new base [12].
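Once the Lie derivatives have been evaluated numerically, the control law of Eq. (36.4) reduces to the solution of a 2×2 linear system. The sketch below shows only this algebraic step; the argument names are hypothetical, the numeric values in the usage lines are arbitrary, and the computation of the Lie derivatives themselves (which depends on the motor model of [12]) is not shown.

```python
def flc_control(lg1_lf2_h1, lg2_lf2_h1, lg1_h2, lg2_h2, lf3_h1, lf_h2, v1, v2):
    """Return (u_d, u_q) from the decoupling matrix, drift terms and auxiliary inputs."""
    det = lg1_lf2_h1 * lg2_h2 - lg1_h2 * lg2_lf2_h1
    if abs(det) < 1e-12:
        raise ZeroDivisionError("decoupling matrix is singular")
    # Auxiliary control minus the drift terms L_f^3 h_1 and L_f h_2.
    r1 = v1 - lf3_h1
    r2 = v2 - lf_h2
    # Multiply by the inverse of the 2x2 decoupling matrix (adjugate / det).
    u_d = (lg2_h2 * r1 - lg2_lf2_h1 * r2) / det
    u_q = (-lg1_h2 * r1 + lg1_lf2_h1 * r2) / det
    return u_d, u_q


# Arbitrary illustrative values, not taken from the motor model.
u_d, u_q = flc_control(2.0, 1.0, 0.5, 3.0, 0.4, -0.2, 1.0, 2.0)
```

The guard on the determinant reflects the fact that the law is defined only where the decoupling matrix is invertible.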
In Eq. (36.5), x̂ indicates the vector of the estimated state, referring to the state representation derived from the dynamics expressed in three-phase axes; F_x̂ and G_x̂ are the Jacobian matrices evaluated at the current estimated value; y is the vector of the measurable state variables (current vector); C is the matrix that maps the state vector to the output vector; K_EKF is the gain of the observer; Q and R are the covariance matrices of the noises that act on the state vector x and on the input vector u; and P is the covariance matrix of the estimated state, which solves the Riccati matrix differential equation.
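To illustrate the structure of Eq. (36.5), the sketch below integrates a scalar version of the observer and Riccati equations with an explicit Euler step. It is a deliberately simplified assumption: the motor model has a multi-dimensional state with state-dependent Jacobians, whereas here F, G, C, Q and R are constant scalars, and the integration scheme and step size are illustrative choices.

```python
def riccati_step(p_cov, f, c, q, r, dt):
    """One explicit-Euler step of the scalar form of dP/dt in Eq. (36.5)."""
    k_gain = p_cov * c / r  # K_EKF = P C^T R^-1
    # F P + P F^T + Q - K R K^T, written for scalars.
    dp = f * p_cov + p_cov * f + q - k_gain * r * k_gain
    return p_cov + dt * dp, k_gain


def observer_step(x_hat, u, y, f, g, c, k_gain, dt):
    """One explicit-Euler step of the state estimate update."""
    dx = f * x_hat + g * u + k_gain * (y - c * x_hat)
    return x_hat + dt * dx


# Scalar example: F = -1, C = 1, Q = R = 1, starting from P(t0) = 1.
p_cov = 1.0
for _ in range(20000):
    p_cov, k_gain = riccati_step(p_cov, -1.0, 1.0, 1.0, 1.0, 1e-3)
```

In this scalar case P settles to the positive root of the algebraic Riccati equation; in the full problem the same update has to be carried out on matrices at every control step, which is where the computational cost comes from.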
Compared to our previous work, the expression reported in Eq. (36.4) depends on the estimated state (relative to the state representation in dq axes). This means that the model of the Cogging Torque that appears in the dynamics of the motor is a function of the true position, while the part of the controller that uses this model is a function of the estimated position.
To verify the validity of the proposed architecture, the results of the trajectory tracking problem in terms of desired position and quadrature-axis current are presented below.
The estimates of the currents expressed in the three-phase reference and of the angular velocity are also reported in the following.
For all the simulations presented, reference is made to the parameters for the motor and EKF models given in Tables 36.1 and 36.2.
For the model of the measurement noises we assume additive Gaussian signals for the current and voltage components. In particular, we set a null mean value for both current and voltage noise, with standard deviations of 1 A and 5 V, respectively.
The first type of desired trajectory for rotor positioning is a general shape that includes some changes in the direction of rotation, without steady-state phases.
The covariance matrices P(t0), Q and R are set to identity matrices of appropriate dimension.
Figure 36.4 shows the estimation result for the three-phase current components, which is the first basic check since these are the available measures.
One of the issues linked with the usage of the EKF as an estimation system is the setting of the initial conditions for the update dynamics of the filter itself.
Clearly, if the initial conditions of the estimated state vector are too different from the initial conditions of the reference process dynamics model, the EKF cannot produce a good estimate.
It is therefore mandatory to verify the EKF performance with different initial conditions, in addition to different desired trajectories in terms of rotor position.
In particular, we have verified that the estimated position converges to the real position with initial conditions θ(t0) = 0.5, θ(t0) = 1.0, θ(t0) = 1.5 and θ(t0) = 2.0, while fixing the initial condition for the other state variables (current components and speed).
Figure 36.5 shows the result of the trajectory tracking of the rotor axis position towards the desired behaviour, with different initial conditions in terms of rotor start position.
Figure 36.6 shows that the estimated position converges to the real position; in particular, it reports the behaviour of the position error related to the results shown in Fig. 36.5.
Figures 36.7 and 36.8 show the result in terms of rotor angular speed estimation with different initial conditions, in order to verify the robustness of the EKF as a function of the motor speed variation. Figure 36.9 shows the behaviour of the estimation error of the angular speed of the rotor for different initial speed conditions, with the initial condition for the other variables fixed. Figure 36.10 shows the result of the trajectory tracking control problem for the direct current component. From the theory of brushless control [14], to emulate a FOC (Field Oriented Control) architecture, a null reference signal is imposed for this current component.
In this work a sensor-less architecture based on the EKF observer and on a feedback linearization control system is proposed. We have verified that our previous control solution [12] can also be adopted with a sensor-less architecture to reduce the intrinsic problem of the Cogging Torque. Results in terms of the trajectory tracking control problem on the direct current component and on the rotor axis position are presented. An analysis of the initial condition of the rotor axis position reveals that the architecture is robust with respect to variations in the start condition of the global system.
In summary, the contribution of this work is to demonstrate that, through a continuous-time EKF observer that refers to the dynamics of the motor in three-phase axes, it is possible to realize a sensor-less architecture that exploits the design of an FLC controller for Cogging Torque reduction, which is instead derived by referring to the dynamics expressed in direct-quadrature axes.
As future work, it would be interesting to verify whether the new architecture can be implemented on a real embedded system, like the sensor-based version proposed in [12]. Clearly, the introduction of the EKF increases the complexity of the global control system. With respect to the sensor-based FLC version, in which only algebraic operations were required, in this case the EKF requires iteratively solving a matrix differential equation (the Riccati equation), which reasonably requires higher computational capability than that of a low-cost embedded system like the Arduino Uno used in [12].
Another extension of this work could be the comparison between different types of state observers (UKF, sliding mode or neural networks), to verify that the reduction of the cogging torque based on FLC can be implemented in any type of sensor-less architecture, and to identify the best configuration.
References
2. Hwang Myeong-Hwan, Lee Hae-Sol, Cha Hyun-Rok (2018) Analysis of torque ripple and
cogging torque reduction in electric vehicle traction platform applying rotor notched design.
Energies 11(11):3053
3. Zhang Z, Feng J (2008) Sensorless control of salient PMSM with EKF of speed and rotor
position. In: IEEE international conference on electrical machines and systems
4. Termizi MS et al (2017) Sensorless PMSM drives using Extended Kalman Filter (EKF). In:
2017 IEEE conference on energy conversion (CENCON). IEEE
5. Tian G et al (2018) Rotor position estimation of sensorless PMSM based on Extented Kalman
Filter. In: 2018 IEEE international conference on mechatronics, robotics and automation
6. Lv H, Wei G, Ding Z (2014) UKF—based for sensorless brushless DC motor control. In: IEEE
2014 international conference on mechatronics and control (ICMC)
7. Tingna S, Na L, Li W (2010) Sensorless control for brushless DC motors using adaptive sliding
mode observer. In: 29th Chinese control conference
8. Zheng C, Li Y (2016) Sensorless speed control for brushless DC motors system using sliding-
mode controller and observers. In: 2016 8th International conference on intelligent human-
machine systems and cybernetics (IHMSC), vol 1. IEEE
9. Vinida K, Chacko M (2016) A novel strategy using H infinity theory with optimum weight
selection for the robust control of sensorless brushless DC motor. In: 2016 IEEE symposium
on sensorless control for electrical drives (SLED). IEEE
10. Jinquan Z, Deying G (2014) Control method of sensorless brushless DC motor based on neural
network. In: IEEE 26th Chinese control and decision conference
11. Sun X et al (2016) A sensorless method with torque ripple suppression for brushless DC motors.
In: 19th International conference on electrical machines and systems (ICEMS). IEEE
12. Dini P, Saponara S (2019) Cogging torque reduction in brushless motors by a nonlinear control
technique. Energies 12(11):2224
13. Tudorache T et al (2012) Improved mathematical model of PMSM taking into account cogging
torque oscillations. Adv Electr Comput Eng 12(3):59–64
14. Krause P (2017) Introduction to electric power and drive systems. Wiley, New York
Part VIII
Signal and Data Processing
Chapter 37
Acoustic Emissions Detection
and Ranging of Cracks in Metal Tanks
Using Deep Learning
Abstract This work proposes a new method for the estimation of the distance of cracks in pressurized metal tanks. The method is obtained by coupling acoustic emission analysis with deep learning techniques. Using a 2D CNN we are able to estimate the distance between a crack and an acoustic emission piezoelectric sensor. The CNN is trained on images representing the spectrograms of acoustic emissions located at distances of 2, 20, 40, 60, 80, 100, 120 and 140 cm. We obtained an RMSE of 2.54 cm.
37.1 Introduction
Fig. 37.2 Architecture of a convolutional neural network for image classification or regression
with obstacles and junctions. In fact, in these cases, echo signals generated by rebounds interfere with the analysis.
In recent times, the development of Artificial Intelligence and Machine Learning (ML) algorithms has enabled the use of such techniques in several applications, including signal processing and the analysis of sensor network data.
The main task categories that are addressed with an ML approach can be divided into three groups: classification of data, pattern recognition and data regression. Generally, an ML design flow requires the developer to select a suitable feature set that characterizes the data, to train the algorithm and, finally, to deploy it in the inference stage. The features (Fig. 37.1) that are usually employed in AE applications are the maximum of the absolute value of the amplitude, the signal duration, the signal energy and the number of crossings of a given threshold [7].
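The four hand-crafted features listed above can be sketched as follows. The exact definitions are assumptions: duration is taken as the time between the first and last sample whose magnitude reaches the threshold, energy as the sum of squared samples divided by the sampling rate, and the count as the number of upward threshold crossings; the definitions adopted in [7] may differ, and the function name is hypothetical.

```python
def ae_features(samples, threshold, sample_rate_hz):
    """Return peak amplitude, duration, energy and threshold-crossing count."""
    peak = max(abs(s) for s in samples)
    # Discrete approximation of the integral of the squared signal.
    energy = sum(s * s for s in samples) / sample_rate_hz
    above = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    duration = (above[-1] - above[0]) / sample_rate_hz if above else 0.0
    # Count the upward crossings of the threshold.
    counts = sum(1 for a, b in zip(samples, samples[1:]) if a < threshold <= b)
    return {"peak": peak, "duration": duration, "energy": energy, "counts": counts}


# Hypothetical burst sampled at 10 Hz, threshold at 1.0.
features = ae_features([0.0, 0.5, 1.2, -0.8, 1.5, 0.2, 0.0], 1.0, 10.0)
```

Features of this kind have to be chosen by the designer in advance, unlike the feature maps that a convolutional network learns from the spectrogram.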
In recent years, among the numerous AI model categories, Deep Learning (DL) has become a trending topic, both in research and industry [11–16]. The main advantage of a DL approach with respect to a traditional ML one is the capability to automatically learn and extract the feature maps that are needed to solve the task. Deep Learning gained popularity after the achievements of Convolutional Neural Networks (CNN), shown in Fig. 37.2, which are typically used in computer vision applications for classification and regression of image data [17].
In this paper we propose a method based on Deep Learning and CNNs to process the image of an Acoustic Emission in a pressurized metal tank in the form of its spectrogram. By designing a regression model, the algorithm is able to detect a rupture event and to estimate its distance (ranging) from the piezoelectric sensor. By implementing this technique in multiple sensor nodes of the network, it is also possible to estimate the position of the event by triangulation (localization). Other research works apply similar CNN approaches in other fields of application, such as seismic event detection [18] and biomedicine [19].
328 G. C. Cardarilli et al.
Fig. 37.3 Experimental setup framework: generation of acoustic events on the metal plate at
different distances
the second one, and 32 in the others. The final regression layer outputs the estimated
distance of the AE waveform received by the sensor.
We trained the convolutional network using our dataset of 800 spectrograms. We applied the cross-validation technique, using 75% of the instances for the training set and the other 25% for the validation set. As shown in Fig. 37.6, we obtained an optimal convergence after about 50 iterations.
Fig. 37.6 RMSE (cm) and loss after the training process
The results are presented in terms of Root Mean Square Error (RMSE). We obtained an RMSE of 2.54 cm. Considering the distance steps between the dataset instances, we can state that the network generalizes well and locates the rupture accurately.
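For reference, the data split and the error metric mentioned above can be written in a few lines. This is a generic sketch: the shuffling, the seed and the function names are assumptions, and the training of the CNN itself is not shown.

```python
import math
import random


def split_dataset(items, train_fraction=0.75, seed=0):
    """Shuffle the instances and split them into training and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def rmse(predicted, actual):
    """Root Mean Square Error between predicted and true distances."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# 800 instances split into 600 for training and 200 for validation.
train_ids, val_ids = split_dataset(range(800))
```

The RMSE is expressed in the same unit as the distances, so a value in centimetres can be compared directly with the spacing between the distances used to build the dataset.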
37.4 Conclusions
References
1. Attuazione della direttiva 97/23/CE in materia di attrezzature a pressione. dlgs 93/2000 Italian
Legislation
2. Cardarilli GC, Di Nunzio L, Massimi F, Fazzolari R, De Petris C, Augugliaro G, Mennuti C
(2018) A wireless sensor node for acoustic emission non-destructive testing. Lect Notes Electr
Eng
3. Bechhoefer E, Qu Y, Zhu J, He D (2013) Signal processing techniques to improve an acoustic emissions sensor. In: Proceedings of the annual conference of the prognostics and health management society. pp 581–58
4. Grosse Christian U, Ohtsu M (eds) (2008) Acoustic emission testing. Springer Science &
Business Media, Berlin
5. Akyildiz Ian F et al (2002) Wireless sensor networks: a survey. Comput Netw 38(4):393–422
6. Perumalla V, Ramanjaneyulu BS, Kolli A (2017) Simulation study of topological structures
and node coordinations for deterministic WSN with TSCH. Int J Inform Vis 1(4)
7. Giardino D, Matta M, Spanò S (2019) A feature extractor IC for acoustic emission non-
destructive testing. Int J Adv Sci Eng Inf Technol 9(2):538–543
8. Giuliano R, Mazzenga F, Neri A, Vegni AM (2017) Security access protocols in IoT capillary
networks. IEEE Internet Things J 4(3):645–657
9. Riqualificazione serbatoi GPL con metodo EA, Istituto nazionale per l’assicurazione contro
gli infortuni sul lavoro (INAIL) (2019). https://www.inail.it/cs/internet/attivita/ricerca-e-
tecnologia/certificazione-verifica-e-innovazione/certificazione/riqualificazione-serbatoi-gpl-
con-metodo-ea.html
10. Ni Q-Q, Iwamoto M (2002) Wavelet transform of acoustic emission signals in failure of model
composites. Eng Fract Mech 69(6):717–728
11. Lu Y (2017) Industry 4.0: A survey on technologies, applications and open research issues. J
Ind Inf Integr 6:1–10
12. Matta M, Cardarilli GC, Di Nunzio L, Fazzolari R, Giardino D, Nannarelli A, Re M, Spanò S
(2019) A reinforcement learning based QAM/PSK symbol synchronizer. IEEE Access
13. Cardarilli GC, Di Nunzio L, Fazzolari R, Nannarelli A, Re M, Spano S (2019) N-dimensional
approximation of euclidean distance. IEEE Trans Circuits Syst II Express Briefs
14. Cardarilli GC, Di Nunzio L, Fazzolari R, Re M, Spano S (2019) AW-SOM, an algorithm for
high-speed learning in hardware self-organizing maps. IEEE Trans Circuits Syst II: Express
Briefs
15. Cardarilli GC, Di Nunzio L, Fazzolari R, Giardino D, Matta M, Re M, Silvestri F, Spanò S (2019)
Efficient ensemble machine learning implementation on FPGA using partial reconfiguration.
Lect Notes Electr Eng 550:253–259
16. Hordri NF, Yuhaniz SS, Shamsuddin SM (2016) Deep learning and its applications: a review.
In: Postgraduate annual research on informatics seminar 2016, Universiti Teknologi Malaysia
17. Russakovsky O et al (2015) Imagenet large scale visual recognition challenge. Int J Comput
Vis 115(3):211–252
18. Zhu L, Peng Z, McClellan J (2018) Deep learning for seismic event detection of earthquake
aftershocks. In: 2018 52nd asilomar conference on signals, systems, and computers, IEEE
19. Zhang J et al (2019) Fine-grained ECG classification based on deep CNN and online decision
fusion. Preprint at arXiv:1901.06469
20. ASTM E-976, Standard Guide for Determining the Reproducibility of Acoustic Emission
Sensor Response, ASTM International
21. Burrascano P, Laureti S, Senni L, Ricci M (2018) Pulse compression in nondestructive test-
ing applications: reduction of near sidelobes exploiting reactance transformation. IEEE Trans
Circuits Syst I Regul Pap (99):1–11. https://doi.org/10.1109/tcsi.2018.2862868
Chapter 38
Recognizing Breathing Rate
and Movement While Sleeping in Home
Environment
Abstract The recovery of our body and brain from fatigue directly depends on the
quality of sleep, which can be determined from the results of a sleep study. The
classification of sleep stages is the first step of this study and includes the mea-
surement of vital data and their further processing. The non-invasive sleep analysis
system is based on a hardware sensor network of 24 pressure sensors providing sleep
phase detection. The pressure sensors are connected to an energy-efficient microcon-
troller via a system-wide bus. A significant difference between this system and other
approaches is the innovative way in which the sensors are placed under the mattress.
This feature facilitates the continuous use of the system without any noticeable influ-
ence on the sleeping person. The system was tested by conducting experiments that
recorded the sleep of various healthy young people. Results indicate the potential to
capture respiratory rate and body movement.
38.1 Introduction
Sleep is necessary for everybody, and sleeping for an adequate time with good quality ensures that people feel good and have more energy for their daily tasks. The National Sleep Foundation (NSF) recommends that adults sleep 7–8 h a day [1, 2].
The sleep phases can be divided into two main categories: Non-Rapid Eye Move-
ment (NREM) and Rapid Eye Movement (REM) phase. The REM phase occurs
after an initial stage of deep sleep. It is a phase in which dreams occur while the eyes
move rapidly in different directions, with the heart and respiratory rate becoming
irregular. The REM phase alternates with light and deep stages of the NREM phase
and becomes longer with each sleep cycle. Adults with healthy sleeping habits spend
about 20% of their time in the REM phase, while this percentage decreases with age
[3].
Typically, sleep is analyzed in a sleep laboratory using polysomnography (PSG).
Here the electrophysiological signals are recorded and interpreted. However, sleep-
ing in this environment differs from a “normal” sleep at home. For a person to be
monitored over a longer period of time, home installation is the only possible solu-
tion. An additional aspect is that the costs of a “home” system should be significantly
lower, while the most important relevant sleep parameters can still be collected [4].
For example, monitoring movement and breathing will support the detection of apnea
[5].
The main objective of the method presented in this work is to track and analyze
a person’s movement, breathing, and heart rate during sleep. The main difference
of this system compared to other approaches is the innovative way of placing the
sensors under the mattress to ensure the familiar sleeping comfort.
38.2 Methodology
There are several types of pressure sensors that could be used for this project [6]. A Force Sensing Resistor (FSR) is a sensor whose resistance changes under pressure. Several studies have been carried out with this type of sensor, and it has proven reliable and accurate when used in a sensor grid [7–9]. Therefore, the FSR sensor was selected for use in this method.
Up to 16 FSR sensors can be individually connected to each sensor node, while the sensor nodes are connected to each other via a system-wide bus (I2C) with address arbitration. This makes it simple to expand the system by connecting additional sensor nodes with sensors. All nodes automatically receive an address in the system one after another, so no manual adjustment is necessary.
A node is implemented as a small and simple PCB. It features an ATMEL SAM D21 microcontroller based on a 32-bit ARM architecture. The advantages of this microcontroller are its large number of AD pins with 12-bit resolution and its compatibility with many widely used frameworks and tools. The firmware is based on the ARM mbed framework and is written in C++. At regular intervals, the node measures the voltage on the sensor pins and saves the data in a local dynamic buffer. When a request arrives via the system bus, the microcontroller processes the request and returns the latest measurements.
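The sample-buffer-respond behaviour of a node can be sketched as follows. This is a hypothetical Python model for illustration only, not the C++ mbed firmware: the class `SensorNode`, the buffer length and the 3.3 V reference voltage are assumptions that do not appear in the text.

```python
from collections import deque

ADC_BITS = 12          # resolution stated in the text
V_REF = 3.3            # assumed reference voltage in volts (not given in the text)
SENSORS_PER_NODE = 16  # sensors per node, as stated in the text


def adc_to_volts(code, bits=ADC_BITS, v_ref=V_REF):
    """Convert a raw ADC code to volts, assuming a linear converter."""
    full_scale = (1 << bits) - 1
    if not 0 <= code <= full_scale:
        raise ValueError("ADC code out of range")
    return code * v_ref / full_scale


class SensorNode:
    """Hypothetical model of one node: sample, buffer, answer bus requests."""

    def __init__(self, buffer_len=8):
        # Bounded buffer: the oldest measurement set is dropped when full.
        self.buffer = deque(maxlen=buffer_len)

    def sample(self, raw_codes):
        """Store one set of readings (one raw ADC code per connected sensor)."""
        if len(raw_codes) > SENSORS_PER_NODE:
            raise ValueError("too many sensors for one node")
        self.buffer.append([adc_to_volts(c) for c in raw_codes])

    def handle_request(self):
        """Return the latest measurement set, or None if nothing was sampled."""
        return self.buffer[-1] if self.buffer else None


node = SensorNode(buffer_len=2)
node.sample([0, 2048, 4095])
print(node.handle_request())
```

A bounded buffer is used here so that memory stays constant on a small microcontroller; whether the real firmware bounds its dynamic buffer in the same way is not stated.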
The “Endpoint” is a device that acts as an interface between the network of sensor nodes and external clients and services. In the context of the presented project, an Intel Edison was used as the “Endpoint”. Periodically, the endpoint queries the network for the latest data. The received sensor values are stored together with their timestamps and node locations in a local database, which allows easy access from multiple devices connected wirelessly to the “Endpoint”.
To achieve a flexible sensor mesh, automatic address arbitration is implemented. At every system start, all nodes reset their addresses and wait for a high signal on their input pin. When this happens, the node takes the offered address and responds on the bus. After that, the endpoint instructs that node to raise its sense output pin so that the next node can take the next address. This arbitration algorithm allows users to place boards in any order in their sensor mesh network.
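The arbitration sequence described above can be simulated in a few lines. The sketch below is a simplified model under stated assumptions: the names `Node` and `arbitrate`, the starting address 0x10, and the idea that the endpoint enables the input pin of the first node are all hypothetical and not taken from the text.

```python
class Node:
    """Hypothetical model of a sensor node in the daisy chain."""

    def __init__(self, name):
        self.name = name
        self.address = None     # reset at every system start
        self.sense_in = False   # input pin, driven by the previous node
        self.sense_out = False  # output pin, wired to the next node's input


def arbitrate(chain, first_address=0x10):
    """Assign consecutive addresses to nodes in their physical order."""
    for node in chain:  # system start: every node forgets its address
        node.address, node.sense_in, node.sense_out = None, False, False
    if chain:
        chain[0].sense_in = True  # assumption: the endpoint enables the first node
    assigned = {}
    address = first_address
    while True:
        # Only a node with a high input pin and no address accepts the offer.
        ready = [i for i, n in enumerate(chain)
                 if n.sense_in and n.address is None]
        if not ready:
            break
        i = ready[0]
        chain[i].address = address          # the node takes the offered address
        assigned[chain[i].name] = address
        chain[i].sense_out = True           # endpoint tells it to raise its output
        if i + 1 < len(chain):
            chain[i + 1].sense_in = True    # the next node may now take an address
        address += 1
    return assigned


boards = [Node("a"), Node("b"), Node("c")]
print(arbitrate(boards))
```

Reordering the list passed to `arbitrate` changes which board receives which address, which mirrors the statement that boards can be placed in any order.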
Figure 38.1 shows the system structure. Its estimated cost is about 150 € based on the costs of the single components.
To conduct the study, 24 FSR sensors (FSR 406) were connected to sensor nodes (see Fig. 38.2). The positions of the sensors can be changed depending on the aims of the experiment. In the first step, experiments were conducted with three subjects in different age groups (18–25, 26–30 and 31–35). Two male subjects and one female subject participated in the test. The Body Mass Index of the test persons was 22 ± 2.5 kg/m². None of the test subjects had significant health disorders. In total, approximately two hours were spent in bed, simulating sleep in different positions to collect the movement and breathing data.
38.3 Results
For a simple and clear representation, Fig. 38.3 displays the signal from only one sensor (its position is marked in green in Fig. 38.2). Some notable events and the corresponding positions are also presented in this figure.
All movements are recognizable, and the periodic signal is also easy to recognize. Figure 38.4 shows an enlarged representation of this periodic signal, with blue dots marking the peaks from the Zephyr reference device. It is necessary to mention that the signal was recorded at a sampling rate of 1 Hz due to the chosen architecture. This is the main reason why the periodic signal does not appear very clear and why it is not possible to detect exact peaks.
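To illustrate what a 1 Hz recording can and cannot resolve, the sketch below estimates the dominant periodicity of a synthetic signal with a plain discrete Fourier transform. The signal is artificial (a 0.25 Hz sine, i.e. 15 cycles per minute) and the function name is hypothetical; this is not the processing used in the study.

```python
import cmath
import math

FS_HZ = 1.0  # sampling rate reported in the text


def dominant_rate_per_min(samples, fs_hz=FS_HZ):
    """Return the strongest non-DC frequency of `samples` in cycles per minute."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]  # remove the static pressure offset
    best_k, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):  # frequency bins up to the Nyquist limit
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs_hz / n * 60.0


# Synthetic breathing-like signal: 0.25 Hz, 120 s recorded at 1 Hz.
breathing = [math.sin(2 * math.pi * 0.25 * t / FS_HZ) for t in range(120)]
print(dominant_rate_per_min(breathing))
```

At a sampling rate of 1 Hz the highest representable frequency is 0.5 Hz, that is 30 cycles per minute. Typical resting breathing rates lie below this limit, whereas heart rates do not, so a periodicity estimate of this kind can only address breathing at this sampling rate.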
38.4 Discussion
The developed system prototype provides the measurement of different vital param-
eters relevant for sleep stages analysis. The system architecture consists of a network
of pressure sensors that are installed in a bed. Using the system does not cause incon-
venience during sleep as the sensors are placed under the mattress. The sensors could
detect body movements and respiratory signals. In this respect, the system seems to
be suitable for sleep analysis.
One of the main novel points of the proposed system is its possible application in the home environment, under the bed mattress, for measurements in a non-obtrusive way. Another important point is the use of automatic address arbitration, which allows the number and positions of sensors to be changed quickly and easily.
338 M. Gaiduk et al.
The use of FSR sensors for the measurements should also be mentioned. The estimated price of about 150 € for the components at retail prices, which implies a much lower price in the case of mass production, is also an important aspect in comparison with other similar devices.
The evaluation of the system operation against the reference device has confirmed that breathing can be detected even at a sampling rate of 1 Hz. The next step is to perform a long-term test with night monitoring and evaluate the results. At the same time, work on increasing the sampling rate of the system has already begun. This can open the possibility to improve the results and to enable the recognition of heart rate.
A further step will be to connect the hardware system to a sleep stage classification algorithm [11] and to experiment in a sleep laboratory, where the results can be evaluated in collaboration with sleep medicine experts.
References
1. Kendall S (2015) National sleep foundation recommends new sleep times (Online). Available:
http://www.sleephealthjournal.org
2. Pilcher J, Huffcutt A (1996) Effects of sleep deprivation on performance: a meta-analysis
(Online). Available: www.watermark.silverchair.com
3. Cherry K The 4 stages of the sleep (NREM and REM sleep cycles) (Online). Available: www.
verywell.com
4. Muzet A (1988) Dynamics of body movements in normal sleep. Sleep’86, pp 232–234
5. Zeidler MR et al (2015) Predictors of obstructive sleep apnea on polysomnography after a
technically inadequate or normal home sleep test. J Clin Sleep 11:1313
6. Bicking RE Fundamentals of pressure sensor technology (Online). Available: www.
sensorsmag.com/components/fundamentals-pressure-sensor-technology
7. Lokavee S, Suwansathit W, Tantrakul V, Kerdcharoen T (2014) Unconstrained detection of res-
piration rate and efficiency of sleep with pillow-based sensor array. In: 2014 11th international
conference on electrical engineering/electronics, computer, telecommunications and informa-
tion technology (ECTI-CON), p 1–6 (Online). Available: https://doi.org/10.1109/ecticon.2014.
6839779
8. Lokavee S, Watthanawisuth N, Mensing JP, Kerdcharoen T (2011) Sensor pillow system: mon-
itoring cardio-respiratory and posture movements during sleep. In: The 4th 2011 biomedical
engineering international conference, pp 71–75 (Online). Available: https://doi.org/10.1109/
bmeicon.2012.6172021
9. Sundar A, Das C (2015) Low cost, high precision system for diagnosis of central sleep apnea
disorder. In: 2015 international conference on industrial instrumentation and control (ICIC),
pp 354–359 (Online). Available: https://doi.org/10.1109/iic.2015.7150767
10. Kim J-H, Roberge R, Powell JB, Shafer AB, Williams WJ (2013) Measurement accuracy
of heart rate and respiratory rate during graded exercise and sustained exercise in the heat
using the Zephyr BioHarnessTM. Int J Sport Med 34(6):497–501. http://doi.org/10.1055/s-
0032–1327661
11. Gaiduk M, Penzel T, Ortega JA, Seepold R (2018) Automatic sleep stages classification using
respiratory, heart rate and movement signals. Physiol Meas 39(12). https://doi.org/10.1088/
1361-6579/aaf5d4
Chapter 39
A Fast Face Recognition CNN Obtained
by Distillation
Abstract The latest research on face recognition models shows that the “the more complex, the better” paradigm can be directly applied to these systems, whose accuracy effectively depends on both a large number of well-trained parameters and a complex functional structure. While this approach is sustainable for offline processing on a consumer PC, it is far less appealing in the mobile environment, where processing power and a large amount of onboard RAM may not be available. The distillation technique, applied to the cumbersome dlib-resnet-v1 face recognition model, results in a lighter version that, while maintaining a comparable accuracy, achieves a faster processing rate (>10×) and a lower memory occupation (1/6). The final model has been implemented on a single board PC, also using a neural hardware accelerator.
39.1 Introduction
The face recognition problem has driven research on Convolutional Neural Networks (CNNs) more than any other topic, because the impact of human-like performance in this type of Artificial Intelligence is huge. Nowadays CNNs represent the key technology for reaching this objective, and their apparent simplicity has brought this kind of functionality to many mobile systems. Unfortunately, outside the lab, high reliability is achieved only at the cost of a low processing rate, as a result of implementing a complex model; a viable option could be the use of an online cloud service, but the implications for privacy and reliability are evident. In a previous work [1], after casting the problem as a “multi-class recognition in an open-set scenario”, an open-source framework (Dlib) for face recognition was identified and exhaustively tested. The resulting classification procedure can be carried out either with a shallow multi-layer perceptron (MLP) neural network (highest accuracy, short mandatory training phase), or with a simple distance metric (lower accuracy, insertion of identities in the database at runtime). All the classification algorithms we tested process the features provided by an open-source pretrained face-feature extractor CNN (dlib-resnet-v1) [2] that, in conjunction with the subsequent classifiers, proved to be sufficiently discriminative. However, while on a PC the presence of a CUDA-compatible GPU permits a reasonable processing rate of 5–10 fps, on mobile hardware with an ARM CPU the average speed is roughly 0.5 fps (with Dlib compiled using ARM NEON [3] instructions), making mobile use impractical. Another major problem of this pre-trained model is that it has been created within the Dlib framework. As a consequence, further modifications, fine-tuning and research, as well as a simple conversion of the model, represent an unnecessarily difficult burden.
In face recognition, extracting general characteristics from the provided samples (during the supervised learning process) requires a more complex structure than the one needed for their actual representation (used during inference). In this work we experimentally demonstrate knowledge transfer via distillation in a metric framework, and its actual implementation. The novelty of our contribution is threefold: (a) the entire feature vector is used, theoretically allowing for a blind swap of the oracle; (b) the entire framework is minimal, since it only requires the regression of the output target; (c) the input image width is reduced by half, permitting the recognition of small faces by design. When the distilled model is used, face detection at HD resolution becomes the most time-consuming phase of the entire frame processing procedure (roughly 1 s on mobile HW); frame preprocessing, face alignment and resizing are negligible in terms of computation time.
The reduction of time and memory complexity is a process that involves both structure simplification and parameter reduction; the sweet spot is given by a reduced set of parameters and a smart choice of the data processing flow that maintain the same level of accuracy as the original network. The form of compression used in this work [4–7] decouples the accuracy that a model achieves when performing a task from its learned weights: what is really important to transfer (to distill) into a new model is the I/O relationship of the model itself, i.e. the capacity to reveal the latent conditional distribution p(T|X) that relates the inputs X and the outputs T. This capacity is called “dark knowledge” [6], and the act of transferring it from a slow but well-trained model (the teacher) to a student model is called “knowledge distillation” [6].
The training set for the distillation process, carried out as supervised learning, is composed of tuples (X, T) of an input and its corresponding target. The distillation is carried out as a regression process, forcing the student network to provide the same descriptor generated by the teacher; in the case of an embedding network, this can be directly described in a distance metric framework, where a distance larger than the hypersphere radius of each cluster automatically flags bad learning. This motivates the choice of the Euclidean distance L_d [4] as the loss metric, calculated between the target feature vector T and the corresponding predicted descriptor vector Y.
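The loss can be written out directly. The following is a minimal framework-free sketch, assuming plain Python lists as descriptors; averaging over a batch is an assumption of this sketch, and the function names are hypothetical rather than part of the Keras training code used by the authors.

```python
import math


def euclidean_loss(target, predicted):
    """Euclidean distance L_d between a teacher descriptor T and a student output Y."""
    if len(target) != len(predicted):
        raise ValueError("descriptors must have the same dimension")
    return math.sqrt(sum((t - y) ** 2 for t, y in zip(target, predicted)))


def batch_loss(targets, predictions):
    """Mean L_d over a batch (the averaging is an assumption of this sketch)."""
    losses = [euclidean_loss(t, y) for t, y in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Toy 2-D descriptors for illustration; the real descriptors are 128-D.
print(euclidean_loss([0.0, 0.0], [3.0, 4.0]))
```

Since identities are later separated by distances between descriptors, a loss expressed in the same distance has a direct interpretation: a student output is acceptable as long as it stays within the cluster radius around the teacher descriptor.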
The Dlib reference network (dlib-resnet-v1) is based on the ResNet-34 [8] model
which was modified by removing some layers and reducing the size of the filters by
half [2]: it presents a 150 × 150 pixel RGB input, 29 convolutional layers and one
fully-connected output layer with a 128D output, for a total of 5.58 M parameters.
As the base architecture for our distilled CNN, a newer architecture called “DenseNet” [9] has been selected, because it requires fewer parameters than a ResNet with the same accuracy. The core of this architecture consists of the so-called “dense blocks”, which consist of a sequence of bottleneck and compression layers (DenseNet-BC). From the original DenseNet-121, four lighter versions (“cuts”) have been obtained by gradually halving both the number of dense blocks and the number of inner layers [4]. The networks are denoted ‘Net2.5’, ‘Net2.0’, ‘Net1.0’ and ‘Net0.5’, with 3.94, 1.48, 0.38 and 0.12 M parameters, respectively. As in Dlib, the final layer of all these networks is 128-D, while the input size has been reduced to 80 × 80 pixels (RGB). This resolution has been chosen by observing that, in our setup, the smallest meaningful faces detected in an FHD video stream do not exceed 100 pixels at a couple of meters.
The training dataset of our distillation experiment is composed of a mixture of 250 k images taken from the Casia [10] and VGG [11] datasets: each image is preprocessed by finding the face Region of Interest (ROI), aligning the detected face, and resizing it to a resolution of 80 × 80. The face detector and the face alignment procedures use the Dlib API [1]. The set of generated images is then fed to dlib-resnet-v1 and the corresponding feature vectors are saved, forming the tuples (X, T) that are consumed during the supervised learning of the student model. The best convergence has been reached by subtracting the resulting average vector from both the targets and the images, and by using Adam [12] as the optimizer. The training is relatively fast: 30 epochs on the training dataset (organized in batches of 128 images) suffice for good convergence.
The test has been carried out on a completely different dataset, FaceScrub [13], which has been cleaned of mislabeled identities. Following the approach in [1], we designed a Multi Layer Perceptron (MLP) classifier composed of three fully-connected layers, of which the hidden layer has 100 neurons. In order to maximise the classification reliability, we trained an ad hoc MLP classifier for each CNN.
Each distillation experiment has been carried out within Keras [14], in order to
simplify the deployment of our final model on Tensorflow-compatible hardware.
344 L. De Bortoli et al.
The evaluation of the new CNNs has been accomplished comparing the resulting
ROC curves with that of the original Dlib model. The system has to correctly identify
faces of people in an ID database, without misclassifying the known identities and
rejecting any other face (unknown ID). For this purpose one MLP for each model
has been trained, using only samples belonging to the ‘known’ database; during the
test phase, the dataset is composed of other images of the same ID group, plus an
identical number of images of completely ‘unknown’ people taken at random from
the remaining faces in the FaceScrub dataset.
The key to rejecting the unknown identities is a confidence index (a form of normalized distance, defined in Eq. 39.1) that is used to decide on the reliability of the classifier decision: with an unknown ID a low confidence value is expected, whereas the opposite should happen for a known face.
$$C = \frac{d_1 - d_2}{d_1 - d_n} \qquad (39.1)$$
where $d_1$, $d_2$ and $d_n$ are logits, i.e. outputs of the last layer of the MLP (before the SoftMax operator): the largest, the second-largest and the smallest value, respectively. The value of C is bounded between 0 and 1; by imposing a threshold on C it is possible to discriminate between known and unknown identities.
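The confidence index of Eq. 39.1 can be computed from the logits in a few lines. This is a minimal sketch: the function names are hypothetical, and returning 0 when all logits are equal is a convention chosen here, since the text does not cover that degenerate case.

```python
def confidence_index(logits):
    """Confidence C = (d1 - d2) / (d1 - dn) computed from the MLP logits."""
    if len(logits) < 2:
        raise ValueError("at least two classes are required")
    ordered = sorted(logits, reverse=True)
    d1, d2, dn = ordered[0], ordered[1], ordered[-1]
    if d1 == dn:
        return 0.0  # all logits equal: no preference (convention of this sketch)
    return (d1 - d2) / (d1 - dn)


def is_known(logits, threshold):
    """Accept the classifier decision only if C lies above the threshold."""
    return confidence_index(logits) > threshold


# Toy logits for three classes; the winning class is clearly separated.
print(confidence_index([1.0, 5.0, 0.0]))
```

C approaches 1 when the best class stands far above all others and drops to 0 when the two best classes tie, which is why a low value is taken as a hint that the face belongs to none of the known identities.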
ROC curves are plotted calculating the True Positive Rate (TPR) and the False
Positive Rate (FPR) as a function of the confidence index C, according to Eq. 39.2.
$$TPR = \frac{N_P}{N}; \qquad FPR = \frac{N_F}{N + F} \qquad (39.2)$$
where $N_P$ is the number of correctly classified samples (with C above the selected confidence threshold) and N is the number of all known samples provided during the test; $N_F$ is the number of misclassified samples (the number of known people that have been misclassified plus the number of unknown people classified with a confidence index above the threshold, i.e. faces that have been erroneously classified as a known person) and F is the number of all the unknown samples.
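One point of the ROC curve can be computed as sketched below. The data layout and the function name are hypothetical. In addition, the sketch assumes that a misclassified known sample counts towards $N_F$ only when its confidence is above the threshold; the text does not state this explicitly.

```python
def roc_point(known, unknown, threshold):
    """Return (TPR, FPR) for one confidence threshold.

    known:   list of (confidence, is_correct) pairs for known identities
    unknown: list of confidence values for unknown identities
    """
    n, f = len(known), len(unknown)
    # N_P: known samples classified correctly and accepted.
    n_p = sum(1 for c, ok in known if ok and c > threshold)
    # N_F: accepted but wrong known samples plus accepted unknown samples.
    n_f = (sum(1 for c, ok in known if not ok and c > threshold)
           + sum(1 for c in unknown if c > threshold))
    return n_p / n, n_f / (n + f)


# Toy data for illustration only.
known_samples = [(0.9, True), (0.8, True), (0.7, False), (0.2, True)]
unknown_samples = [0.6, 0.1, 0.3, 0.05]
print(roc_point(known_samples, unknown_samples, 0.5))
```

Sweeping the threshold from 0 to 1 and collecting the resulting pairs yields the full curve.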
During the training of the MLP, an increasing number of images (1, 2, 4, 10, 20, 40) has been used for each class, with the number of classes $N_c$ set to 10, 20 or 50. During the test, 70 samples of each known face have been used, while $N_c$ × 70 images of unknown people balance the test set. The entire procedure has been repeated 10 times in order to observe the average and standard deviation for each ROC curve.
Figure 39.1 shows the comparison of the four distilled CNNs with the Dlib teacher, in the case of 10, 20, or 50 classes of known identities; a fixed number of 40 samples is used for the training. It is clear from the graph that the accuracy of the distilled models is very close to that of dlib-resnet-v1. An expected degradation is observable when the complexity of the network is reduced (less compressed networks perform better). Figure 39.1 also exposes a counterintuitive behavior, showing that for 10 classes the performance is slightly lower than for 50: we suppose this is due to the combined action of how the network populates the embedding output space and of the threshold applied to the evaluated confidence. Further testing is needed.
Fig. 39.1 ROC curve: comparison of the distilled networks with the Dlib teacher network, in the case of 10, 20 and 50 classes. Each graph is highly zoomed on the top left corner of the ROC space
Fig. 39.2 ROC curve: performance variation observed using different training set sizes for the two best distilled networks and the Dlib teacher
Figure 39.2 compares the effect of the number of samples used to train the ad hoc classifier for Net2.0 and Net1.0; in this case the number of target classes is set to 50. It can be noted that, if necessary, an MLP classifier can be trained effectively on these new CNNs even with fewer than 10 samples.
39.5 Implementation
The single board PC Odroid XU-4 has been selected as the reference mobile platform. On this hardware Dlib can exploit the CPU only. Of the four variants of the distilled network, Net1.0 and Net2.0 have again been selected and ported to this mobile hardware; after converting them to the TensorFlow Protocol Buffers format, two implementation strategies have been examined: the first one uses TensorFlow Lite [15], while the second one exploits the Intel Movidius Neural Stick accelerator [16] as the target device. The latest incarnation of the Intel API for the hardware accelerator is called OpenVino [17] and allows for an easy deployment of trained models on many Intel heterogeneous devices (CPUs, GPUs, FPGAs, VPUs). Software development has become easier than in the past because OpenCV now encapsulates the Deep Learning module of the OpenVino toolkit. While TFLite models running on the CPU typically fall back to the FP32 datatype, Intel Movidius supports the FP16 datatype only, thus making a quantization necessary. Despite this step, the generated features remain well contained in the per-ID hypersphere. The inference time has been measured over 1000 inference cycles, also taking into account the required alignment process. Table 39.1 shows the measured time for each configuration. Using Intel Movidius, and considering ‘Net2.0’ as the reference model (the network with the best ROC), the required time is about 8% of the original Dlib processing time.
39.6 Conclusion
This paper presents a workflow that can be used to distill the knowledge of an expert oracle into a lighter CNN structure that can be targeted to embedded devices. Through the teacher-student approach, the training of the new models can be reduced to a regression problem in which convergence is reached in a relatively short time using a limited and unlabeled dataset of faces. The distilled TensorFlow model can run on an embedded CPU or with a HW accelerator. For our example facial recognition application, we highlight a strategy to obtain a new CNN with an inference time reduced by an order of magnitude, an accuracy comparable to that of the initial CNN, and a memory consumption reduced by a factor of 6.
References
1. Marsi S, et al (2018) A face recognition system using off-the-shelf feature extractors and an ad-
hoc classifier. In: Saponara S, De Gloria A (eds) Applications in electronics pervading industry,
environment and society, Lecture notes in electrical engineering, vol 550
2. Dlib API, http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html. Seen 6,
2019
3. ARM Neon, https://developer.arm.com/architectures/instruction-sets/simd-isas/neon. Seen 6,
2019
4. Guzzi F et al (2019) Distillation of a CNN for a high accuracy mobile face recognition system.
In: Electronics and microelectronics (MIPRO) conference, Opatija, Croatia, 20–24 May 2019
5. Hinton G et al (2014) Distilling the knowledge in a neural network. In: Conference on neural
information processing systems (NIPS), Montreal, Canada, 8–13 December 2014
6. Ba J et al (2014) Do deep nets really need to be deep? In: Advances in neural information
processing systems 27, pp 2654–2662
7. Luo P et al (2016) Face model compression by distilling knowledge from neurons. In:
Proceedings of AAAI-16, Hyatt Regency, Phoenix, Arizona (USA), 12–17 February 2016
8. He K et al (2016) Deep residual learning for image recognition. In: IEEE conference on
computer vision and pattern recognition (CVPR), pp 770–778
9. Huang G et al (2017) Densely connected convolutional networks. In: IEEE conference on
computer vision and pattern recognition (CVPR), pp 4700–4708
10. Yi D et al (2014) Learning face representation from scratch. https://arxiv.org/abs/1411.7923
11. Parkhi OM et al (2015) Deep face recognition. In: British machine vision conference, Swansea,
UK, 7–10 September 2015
12. Kingma DP (2014) Adam: a method for stochastic optimization. https://arxiv.org/abs/1412.
6980
13. Ng H-W et al (2014) A data-driven approach to cleaning large face datasets. In: Proceedings
of IEEE international conference on image processing (ICIP), Paris, France, 27–30 Oct 2014
14. Chollet F et al. https://keras.io
15. TensorFlow Lite. https://www.tensorflow.org/lite. Seen 6 2019
16. Intel NCS. https://software.intel.com/en-us/neural-compute-stick. Seen 6, 2019
17. Intel OpenVINO Toolkit. https://software.intel.com/en-us/openvino-toolkit. Seen 6, 2019
Chapter 40
Fine-Grain Traffic Control for Smart
Intersections
Jessica Bellitto, Valentina Schenone, Francesco Bellotti, Riccardo Berta
and Alessandro De Gloria
Abstract As connected and, even more, autonomous vehicles are expected to bring significant novelties to future road traffic patterns, we have investigated the control of a specific, yet very common topology: the intersection between two 2-lane roads. We have addressed the issue with a novel, fine-grain control approach, and proposed an adaptive prioritization algorithm which weights the length of the queue and the arrival order for each lane. From an Uppaal simulation, we deduce that the second factor appears more important at higher arrival rates. Compared to a fixed round-robin schedule, our algorithm achieves considerably better performance, especially at high traffic volumes, including cases with inhomogeneous traffic flows. In order to guarantee the robustness of our design, we performed a model checking analysis, considering safety and liveness requirements.
40.1 Introduction
The intersection between two 2-lane roads is a very common topology, increasingly handled, especially in Europe, with roundabouts. However, this solution requires significant land and is cognitively demanding for drivers. Other solutions based on traffic lights typically limit the turn possibilities, inevitably increasing trip times.
Considering fully connected and autonomous vehicles opens a completely new
context, because of the high control and responsiveness of the vehicles. Literature has
Figure 40.1 shows the topology of the crossroad targeted by our work, with 4 incoming
and 4 leaving lanes. Our model schedules at most one vehicle at a time per lane,
thus implementing a fine-grain control. Intersection traversal time depends on the
actual origin and destination of the vehicle(s) involved. As a baseline, we have
implemented a trivial round-robin controller, which schedules the lanes according to a
fixed clockwise iteration. We have designed an adaptive algorithm which considers
as priority parameters the queue length, the order of arrival and the expected traversal
time, according to (1).
where P is the priority of the lane, Q is its queue length, O is the arrival order, and T
is the intersection traversal time of the first vehicle in the lane. α, β and γ are weights
to be tuned.
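Equation (1) itself is not reproduced in this text, so its exact form cannot be shown here. The sketch below assumes, purely for illustration, that the priority is a linear combination of Q, O and T with three tunable weights, and that the controller serves the non-empty lane with the highest priority. The class `Lane`, the functions `priority` and `pick_lane`, the convention that a larger arrival-order score means an earlier arrival, and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    name: str
    queue_length: int      # Q: vehicles waiting in the lane
    arrival_order: float   # O: score from arrival order (assumed: larger = arrived earlier)
    traversal_time: float  # T: traversal time of the first vehicle in the lane


def priority(lane, alpha=1.0, beta=1.0, gamma=1.0):
    # ASSUMPTION: linear weighted sum; the chapter's Eq. (1) may differ.
    return (alpha * lane.queue_length
            + beta * lane.arrival_order
            + gamma * lane.traversal_time)


def pick_lane(lanes, **weights):
    """Return the non-empty lane with the highest priority, or None."""
    candidates = [lane for lane in lanes if lane.queue_length > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda lane: priority(lane, **weights))


lanes = [
    Lane("north", queue_length=3, arrival_order=1.0, traversal_time=2.0),
    Lane("east", queue_length=1, arrival_order=4.0, traversal_time=2.0),
    Lane("south", queue_length=0, arrival_order=0.0, traversal_time=0.0),
]
chosen_all_weights = pick_lane(lanes)
chosen_queue_only = pick_lane(lanes, alpha=1.0, beta=0.0, gamma=0.0)
```

Setting a weight to 0 switches the corresponding factor off, which mirrors the 0/1 weight combinations explored in the tuning experiment.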
Given the safety-related nature of the application, we modeled it using Uppaal.
We defined four main templates. The Timer (Fig. 40.2a) implements the time basis,
defines the random variable for each cycle, updates the simulated arrival rates according
to the test plot and wakes the Controller (Fig. 40.2b). The ArrivalManager
(Fig. 40.3a) generates the vehicles, whose arrivals and departures are managed by
the Lane (Fig. 40.3b). Direction conflicts are taken into consideration by the Controllers,
so that up to 4 vehicles may pass simultaneously during an iteration of
vehicle passages.
Fig. 40.2 The Timer (a) and Controller (b) Uppaal templates
352 J. Bellitto et al.
The first step for the adaptive algorithm consists in tuning the α, β and γ parameters.
To this end, we ran a set of simulations with 8 different value combinations and
different (homogeneous) arrival rates, in a 10-min time window. Results in Table 40.1
show that the best performance is achieved with α = 1, β = 1, γ = 1. Differences
are small for low arrival rates and grow as the arrival rate increases.
We set up an experiment simulating a 100-min period with a traffic peak (0.5 vehicles/s
per lane from min 40 to 75, 0.25 vehicles/s per lane from min 20 to 40, 0.1 vehicles/s in the
other cases). Results are reported in Figs. 40.4 and 40.5, considering the two cases
of homogeneous and inhomogeneous arrival rates (one main road and one secondary
road with halved rates). We can see that the fully adaptive algorithm performs
considerably better, especially at high traffic volumes (up to 40% delay reduction).
Table 40.1 Tuning of the algorithm performance (delay in seconds), with different arrival rates
Parameters      Arrival rate (per lane) [vehicles/s]
α   β   γ       0.1    0.25   0.5
0   0   1       0.9    3.6    5.5
0   1   0       1.5    2.4    6.2
0   1   1       1.4    3.3    5.4
1   0   0       1.4    3.7    5.4
1   0   1       1.0    3.0    5.1
1   1   0       1.1    3.4    6.0
1   1   1       0.8    2.2    4.8
40 Fine-Grain Traffic Control for Smart Intersections 353
Performance improvements are even higher in the inhomogeneous case (67%). While
the adaptive case considers only the priority of a single lane, the fully adaptive case
considers the priority of all the compatible concurrent lane combinations.
As connected and, even more, autonomous vehicles are expected to bring significant
changes to future road traffic patterns, we have investigated the control of a
specific, yet very common, intersection topology. We have addressed the issue with a
novel, fine-grain control approach, and proposed an adaptive prioritization algorithm
which weights the queue length, the arrival order, and the intersection traversal time of the
first vehicle in each lane. From an Uppaal simulation, we deduce that the third factor
looks more important. Compared to a fixed round-robin schedule, our algorithm
performs considerably better, especially at high traffic volumes (up to 40%
delay reduction in the homogeneous traffic flow case, 67% in the inhomogeneous
case). To verify the robustness of our design, we performed a successful model
checking analysis, considering safety and liveness requirements. We believe that, in
automated driving scenarios, this solution could reduce the need for roundabouts.
In modeling, we made some important limiting assumptions. Intersection traversal
times are fixed, although the impact of this approximation should be limited,
as our algorithm considers a single vehicle per lane, not platoons (e.g., as in [3]).
Moreover, all vehicles are modeled as having the same length and dynamic response.
More importantly, we have considered a scenario with autonomous vehicles only, and
completely ignored human factors and the possible presence of pedestrians or other
vulnerable road users, which will need to be very carefully considered. In any case,
more accurate simulations are needed to better characterize the dynamic behavior of
the system and assess its improvement with respect to other optimizations.
References
Abstract The modal analysis of large structures, because of spatial and electrical
constraints, generally requires cluster-based networks of sensors. In such solutions,
dedicated procedures are required to reconstruct the global mode shapes of vibration
starting from the local mode shapes computed on individual groups of sensors.
Commonly adopted strategies are based on overlapped schemes, in which at least one
sensing position is shared among neighbouring clusters. In this paper, a non-overlapping
monitoring approach is proposed. It relies on the intrinsic capability of graph signal
processing to encode structural connectivity in edge weights and exploits the
maximization of the global graph signal smoothness to define the best set of scaling
factors between adjacent networks. Experiments on a pinned-pinned steel beam under
free vibration showed that the proposed method is consistent with
numerical predictions, showing great potential for distributed monitoring of
complex structures.
41.1 Introduction
The analysis of signals defined on graphs has been gaining increasing attention due to
its capability of modeling inherent patterns coded in the acquired data as similarities
between adjacent vertices on a graph [5, 6]. Several application fields have recently
benefited from this emerging signal representation, including smart cities, traffic
networks and environmental processes [7]. Furthermore, a number of mathematical
techniques have been developed, including the Graph Fourier Transform (GFT) and
the Graph Laplacian (GL) operators, which can be used to transpose classical spectral
characterization methods into equivalent tools for the vertex-frequency domain [8].
A graph is a mathematical entity described by a set of vertices connected by edges,
whose algebraic representation is expressed through the Adjacency and Degree
matrices [5]. The weighted Adjacency matrix W expresses the vertex connectivity
between two generic nodes n and m by means of a corresponding edge weight
w_nm. Conversely, each diagonal entry of the Degree matrix D is given by the sum of all the
weights incident on a specific vertex. The eigendecomposition of the graph Laplacian
weights incident on a specific vertex. The eigendecomposition of the graph Laplacian
operator L = D − W is an extremely useful tool to extract meaningful information
from graph signals. In particular, it can be seen as the graph counterpart of the second-
order derivative operator. Besides, a Fourier-like transform has been developed for
graph spectral characterization, which consists of projecting graph signals on the
Laplacian eigenvectors. The eigenvalues of the Laplacian matrix are also inherently
related to the global graph signal smoothness of a generic function f sampled on the
graph vertices:
\[
\lambda = \frac{1}{2} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} w_{nm} \, \bigl( f(n) - f(m) \bigr)^2 = f^T L f \tag{41.1}
\]
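The identity in (41.1) can be checked numerically. The following is a minimal NumPy sketch, assuming a symmetric weight matrix and a small unweighted path graph chosen only for illustration; the function names are hypothetical and are not taken from GSPBOX.

```python
import numpy as np


def laplacian(W):
    """Graph Laplacian L = D - W for a symmetric weighted adjacency matrix W."""
    D = np.diag(W.sum(axis=1))
    return D - W


def smoothness_quadratic(W, f):
    """Right-hand side of (41.1): f^T L f."""
    return float(f @ laplacian(W) @ f)


def smoothness_pairwise(W, f):
    """Left-hand side of (41.1): half the weighted sum of squared differences."""
    diff = f[:, None] - f[None, :]
    return float(0.5 * np.sum(W * diff ** 2))


# Path graph with 4 vertices and unit edge weights (illustrative choice).
W = np.zeros((4, 4))
for n in range(3):
    W[n, n + 1] = W[n + 1, n] = 1.0

f_smooth = np.array([1.0, 1.1, 1.2, 1.3])    # varies slowly along the edges
f_rough = np.array([1.0, -1.0, 1.0, -1.0])   # flips sign at every edge

lam_smooth = smoothness_quadratic(W, f_smooth)
lam_rough = smoothness_quadratic(W, f_rough)
```

A small value of λ corresponds to a signal that varies little across strongly connected vertices, so the slowly varying signal yields a much smaller λ than the alternating one.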
algorithm comprises the following steps. During the starting phase, (i) a vector consisting
of unitary scaling values is considered as the initial guess. Then, after the
currently assembled mode shapes have been normalized (ii), the fitness function λ is
computed (iii) according to (41.1). In particular, some of the graph data processing
procedures from GSPBOX [11] were exploited. Finally, a prediction phase (iv) updates
the scaling coefficients. More specifically, the values α_{k+1} predicted at iteration
k are computed as α_{k+1} = α_k − r_k ∇f(α_k), in which r_k and ∇f respectively represent
the updating ratio and the gradient operator. Steps (ii–iv) are repeated until a convergence
criterion is met, which in the current approach means a smoothness
variation between subsequent iterations below a predefined threshold.
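Steps (i) to (iv) can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: it assumes a constant updating ratio, a central-difference gradient, an unweighted path graph as the structural graph, and two disjoint clusters observing the first mode of a pinned-pinned beam, with the second cluster deliberately mis-scaled by 0.5. All function names and numerical settings are hypothetical.

```python
import numpy as np


def path_laplacian(n):
    """Laplacian L = D - W of an unweighted path graph with n vertices."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W


def assemble(alphas, local_shapes):
    """Scale each local mode shape, concatenate, and normalize (step ii)."""
    f = np.concatenate([a * s for a, s in zip(alphas, local_shapes)])
    return f / np.linalg.norm(f)


def fitness(alphas, local_shapes, L):
    """Smoothness f^T L f of the assembled shape (step iii)."""
    f = assemble(alphas, local_shapes)
    return float(f @ L @ f)


def numerical_gradient(fun, x, h=1e-6):
    """Central-difference gradient (an assumption; any gradient estimate works)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (fun(x + step) - fun(x - step)) / (2 * h)
    return grad


def fit_scaling(local_shapes, L, rate=0.5, tol=1e-12, max_iter=1000):
    alphas = np.ones(len(local_shapes))          # step (i): unitary initial guess
    cost = lambda a: fitness(a, local_shapes, L)
    previous = cost(alphas)
    for _ in range(max_iter):
        alphas = alphas - rate * numerical_gradient(cost, alphas)  # step (iv)
        current = cost(alphas)
        if abs(previous - current) < tol:        # smoothness variation below threshold
            break
        previous = current
    return alphas


# First mode of a pinned-pinned beam sampled at 8 interior points.
x = np.arange(1, 9) / 9.0
true_shape = np.sin(np.pi * x)
local_shapes = [true_shape[:4], 0.5 * true_shape[4:]]  # second cluster mis-scaled
L = path_laplacian(8)
alphas = fit_scaling(local_shapes, L)
ratio = alphas[1] / alphas[0]
```

In this symmetric toy case the smoothest assembled shape is obtained when the second cluster is scaled by twice the factor of the first, which undoes the artificial 0.5 mismatch. Only the ratio between the coefficients is meaningful, because the normalization in step (ii) removes any common factor.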
Table 41.1 MAC percentages between experimental and graph-assembled mode shapes from overlapped
and disjoint cluster networks
        Case 1                Case 2                Case 3                Case 4
        φ1     φ2     φ3      φ1     φ2     φ3      φ1     φ2     φ3      φ1     φ2     φ3
FDD     95.87  99.62  99.22   99.61  99.87  99.36   97.03  99.73  99.74   99.70  99.07  98.93
TDD     99.87  99.41  99.62   99.81  99.77  99.70   96.73  99.87  99.46   99.85  98.82  99.66
SOBI    95.29  99.77  99.34   99.79  99.94  99.43   97.47  99.86  99.55   99.83  99.24  99.09
Fig. 41.2 Graph-assembled mode shapes at sensing locations chosen for case 4 exploiting SOBI
modal reconstruction technique
360 F. Zonzini et al.
41.4 Conclusions
This paper proposes a new approach for mode shape assembly of vibrating structures
based on clustered sensor networks. Exploiting the advantages of graph signal domain
to account for the underlying connectivity, the described method appears to be a
powerful strategy to overcome the current limitation of state-of-the-art overlapped
solutions. Different sampling grids were tested on the array of 9 sensors installed
on a vibrating steel beam, assessing the robustness of the developed processing
scheme in different spatial configurations. The consistency of the obtained results
corroborates the possibility to deploy accelerometer sensor networks in large and
complex civil structures. Future developments will address the validation of the
proposed data fusion method in setups including damaged scenarios, to verify that the
proposed approach does not affect the damage detection performance. Concurrently,
denser sensor networks will be considered, allowing for a computational evaluation
(e.g. convergence time, required processing resources) of the method under more
complicated situations.
Acknowledgements This work has been partially funded by INAIL within the framework BRIC/
2016 ID = 15, SMARTBENCH project.
References
12. Girolami A, Zonzini F, De Marchi L, Brunelli D, Benini L (2018) Modal analysis of structures
with low-cost embedded systems. In: 2018 IEEE international symposium on circuits and
systems (ISCAS), May. IEEE, pp 1–4
13. Testoni N, Aguzzi C, Arditi V, Zonzini F, De Marchi L, Marzani A, Cinotti TS (2018) A sensor
network with embedded data processing and data-to-cloud capabilities for vibration-based
real-time SHM. J Sens 2018
Chapter 42
Guided Waves Direction of Arrival
Estimation Based on Calibrated
Multiresolution Wavelet Analysis
42.1 Introduction
In the last few decades, Guided Waves (GWs) inspection of plate-like structures
(Lamb waves) emerged as a promising Non-Destructive Evaluation (NDE) method-
ology. The possible applications of such kind of analysis range from aerospace, to
marine, and civil industries, to foster the life-cycle cost reduction and the safety
improvement of sensorized structures [1]. Two methods, active-passive and passive-
only, are usually adopted. In the former, active piezoelectric transducers are used
to generate the GWs, whereas passive piezoelectric transducers are employed for
wave detection. In the latter approach, a passive sensor network, which continuously
acquires data samples, is exploited [2]. Conventional passive GWs inspections usu-
ally allow for the detection and localization of damages such as crack formation or
external impacts [3]. In particular, localization algorithms based on hyperbolic po-
sitioning are the most exploited methodologies due to their low computational cost
and simple implementation. Unfortunately, the resolution and robustness of these
approaches are limited by the dispersive behaviour of waves propagating within the
structure. This inhibits a precise estimation of the Difference in Time of Arrival
(DToA) of the incident wave among the active areas of the sensors. To tackle this
problem, a signal processing procedure for impact localization entirely performed
in the wavelet domain is proposed. By means of multiband signal filtering in the
time-scale plane, the Continuous Wavelet Transform (CWT) coefficients of the acquired
signals are extracted. Afterwards, the DToA is obtained by cross-correlating the
coefficients. Owing to the particular arrangement of the PZT transducers on the structure, the
Direction of Arrival (DoA) is finally extracted by geometric calculations. The
procedure exploits the theoretical background described in [4, 5], taking into
account the different piezoelectric topology of the new PZT cluster. Moreover, the
final DoA estimate is enhanced by a novel calibration method.
An innovative ad hoc PZT transducer designed with three active areas has been used
in this work. The transducer is made of a cluster of three closely located PZT elements
placed to form an equilateral triangle: we therefore call this arrangement equilateral.
Three signals are provided by the cluster, one for each active area S1, S2 and S3.
The distance between the PZT centroids is defined as d. P is the point of impact,
which occurs at (x_p, y_p), and R0 is the distance between P and the centroid of S1.
D1 and D2 are defined as the distances the waves travel between S1 and S2, and between
S1 and S3, respectively, as shown in Fig. 42.1. If the far field approximation R0 ≫ d is valid,
following the derivation presented in [6], appropriately adapted to the equilateral
configuration, the angle of arrival of the waveform generated by the impact can be
written as:
\[
\theta \approx \operatorname{atan}\left( \sqrt{3} \, \frac{1 - D_1 / D_2}{1 + D_1 / D_2} \right) \tag{42.1}
\]
42 Guided Waves Direction of Arrival Estimation … 365
Defining as Δt1,2 and Δt1,3 the time intervals in which the wave travels the D1 and
D2 paths respectively, i.e. the so-called DToAs, the following equations hold:
\[
\Delta t_{1,2}(\nu) = \frac{D_1}{v_g(\theta, \nu)}, \qquad \Delta t_{1,3}(\nu) = \frac{D_2}{v_g(\theta, \nu)} \tag{42.2}
\]
where vg(θ, ν) is the group velocity of the Lamb wave which impinges on the transducer;
the dependence on the frequency ν, which mathematically explains the dispersion
of the considered wave mode, is highlighted. Combining Eq. (42.1) with (42.2) yields
the following result:
\[
\theta \approx \operatorname{atan}\left( \sqrt{3} \, \frac{1 - \Delta t_{1,2} / \Delta t_{1,3}}{1 + \Delta t_{1,2} / \Delta t_{1,3}} \right) \tag{42.3}
\]
which, rotated by 30° for a more convenient coordinate reference system,
becomes:
\[
\theta \approx \operatorname{atan}\left( \frac{1}{\sqrt{3}} - \frac{2}{\sqrt{3}} \, \frac{\Delta t_{1,2}}{\Delta t_{1,3}} \right) \tag{42.4}
\]
It is also noteworthy that Eq. (42.3) can be easily reformulated for the specific case
presented in [6]: in this case a rotation of the coordinate reference system of 45° and
an angle θx between active areas of 90° must be considered.
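The relation between Eqs. (42.3) and (42.4) can be verified numerically: applying the tangent subtraction formula to (42.3) with a 30° offset gives exactly the argument of (42.4). The sketch below assumes that reading of the two equations; the function names and the sample DToA values are hypothetical.

```python
import math


def doa_unrotated(dt12, dt13):
    """Angle of arrival from Eq. (42.3), in radians."""
    r = dt12 / dt13
    return math.atan(math.sqrt(3) * (1 - r) / (1 + r))


def doa_rotated(dt12, dt13):
    """Angle of arrival from Eq. (42.4), i.e. (42.3) rotated by 30 degrees."""
    r = dt12 / dt13
    return math.atan(1 / math.sqrt(3) - 2 / math.sqrt(3) * r)


# Hypothetical DToA pairs in seconds; only their ratio matters.
pairs = [(5e-6, 10e-6), (2e-6, 10e-6)]
offsets_deg = [
    math.degrees(doa_unrotated(a, b) - doa_rotated(a, b)) for a, b in pairs
]
```

Because only the ratio of the two DToAs enters the formulas, the frequency-dependent group velocity in Eq. (42.2) cancels out, provided both intervals are measured at the same frequency.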
366 M. M. Malatesta et al.
The DoA estimation is strictly related to the DToA evaluation, according to Eq. (42.4).
Thus, a precise calculation of the DToA is fundamental to achieve the
final goal. Because of the dispersive and multimodal behaviour of waves in plates,
Δt_{j,k}(ν) depends on the group velocity. Thus, the cross-correlation
method cannot, strictly speaking, be applied to the raw acquired signals. The technique
proposed here exploits a time-frequency decomposition in the CWT domain followed by
an isofrequential analysis to localize the signal both in the time and in the frequency
domain. Let Si and Sj be the ith and jth active areas of the transducer, and si(t)
and sj(t) the signals acquired by Si and Sj, respectively. The CWT coefficients are
defined as:
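\[
W_i(\psi; a, b) = \int_{-\infty}^{+\infty} s_i(t) \, \Psi^{*}_{a,b}(t) \, dt \tag{42.5}
\]
\[
W_j(\psi; a, b) = \int_{-\infty}^{+\infty} s_j(t) \, \Psi^{*}_{a,b}(t) \, dt \tag{42.6}
\]
where Ψ_{a,b}(t), the mother wavelet ψ scaled by a and translated by b,
is the decomposition wavelet and a ∈ R is the scale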
parameter. In this domain, the wave components travelling at different velocities can
be easily separated and filtered by means of the energy distribution analysis of the
signal in the scale-frequency plane. Cross-correlation is then computed in the CWT
domain for each scale (or the corresponding frequency range), i.e. by varying the
scale parameter a, as shown in the following equation.
\[
C_{i,j}(a, t) = (W_i \ast W_j)(a, t) = \int_{-\infty}^{+\infty} W_i^{*}(\Psi, a, b) \, W_j(\Psi, a, t + b) \, db \tag{42.7}
\]
Then, by applying Eq. (42.8) for each a parameter value, the DoA of the guided
mode can be accurately estimated by means of an averaging procedure [4, 5].
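The scale-by-scale cross-correlation of Eq. (42.7) can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a hand-rolled complex Morlet wavelet (the wavelet actually used is not stated here), a synthetic 50 kHz tone burst, a second signal that is an exact 25-sample delayed copy of the first, and a plain average over three scales. All names and numerical values are hypothetical.

```python
import numpy as np


def morlet(scale_s, fs, w0=6.0):
    """Complex Morlet wavelet at a given scale (seconds), sampled at fs."""
    half = int(4 * scale_s * fs)
    t = np.arange(-half, half + 1) / fs
    envelope = np.exp(-0.5 * (t / scale_s) ** 2)
    return np.exp(1j * w0 * t / scale_s) * envelope / np.sqrt(scale_s)


def cwt_at_scale(signal, scale_s, fs):
    """CWT coefficients W(a, b) for a single scale a, as in Eqs. (42.5)-(42.6)."""
    psi = morlet(scale_s, fs)
    # Correlating with psi* equals convolving with its time-reversed conjugate.
    return np.convolve(signal, np.conj(psi)[::-1], mode="same") / fs


def dtoa_at_scale(s_i, s_j, scale_s, fs):
    """Delay of s_j with respect to s_i from the peak of |C_ij(a, t)|, Eq. (42.7)."""
    w_i = cwt_at_scale(s_i, scale_s, fs)
    w_j = cwt_at_scale(s_j, scale_s, fs)
    c = np.correlate(w_j, w_i, mode="full")   # sum_b conj(w_i[b]) * w_j[b + t]
    lags = np.arange(-(len(w_i) - 1), len(w_j))
    return lags[np.argmax(np.abs(c))] / fs


fs = 1e6                                       # sampling frequency [Hz]
n = np.arange(2000)
s_i = np.exp(-0.5 * ((n - 600) / 40.0) ** 2) * np.sin(2 * np.pi * 50e3 * n / fs)
delay_samples = 25
s_j = np.concatenate([np.zeros(delay_samples), s_i[:-delay_samples]])

w0 = 6.0
frequencies = [40e3, 50e3, 60e3]               # one scale per analysis frequency
scales = [w0 / (2 * np.pi * f) for f in frequencies]
estimates = [dtoa_at_scale(s_i, s_j, a, fs) for a in scales]
mean_dtoa = float(np.mean(estimates))
```

With a dispersive signal the per-scale delays would differ, which is why the estimate is formed scale by scale before averaging rather than from the raw waveforms.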
42.5 Conclusion
In this work, a novel method to extract the DoA of Lamb waves generated by impacts
in plate-like structures is proposed. The algorithm exploits a procedure carried
out in the CWT domain in order to completely localize the signals in the time-scale
domain. By cross-correlating the signals related to the same event, the DToA can be
determined and used to locate the wave source according to the piezoelectric transducer
configuration. Accurate and reliable results are shown through experimental
tests, further enhanced by an appropriate calibration procedure. The errors achieved
by the system are significantly less than 1°, revealing that the discussed method is
especially suitable when high precision SHM localization is required.
Acknowledgements This research has been funded by INAIL within the framework BRIC/2016
ID = 15, SMARTBENCH project.
References
Abstract Standard scanned Color Flow Imaging (CFI) is a common blood flow visualization
modality. Despite being introduced more than 30 years ago, this technique
is still hampered by the conflicting requirements for either a good image quality or
a high frame rate. In fact, good image quality can only be obtained for frame rates
between 10 and 20 Hz, which are unsuitable to show dynamically evolving events.
This paper presents a high frame rate imaging modality that, once integrated with
CFI, makes it possible to overcome the above limitation. Results are shown that combine
improved quality with the capability of properly tracking dynamically evolving flow
rates.
43.1 Introduction
Color flow imaging (CFI) is an ultrasound (US) technique [1] capable of producing
B-Mode images in which the areas where blood flows are colored according
to the instantaneous velocity and direction of red blood cells. Flow directed
toward the probe is shown in red, flow directed away from it in blue.
Coloring of US images is done line-by-line [2]. A beam focused on the current line
is transmitted NP times (with NP, “packet size”, typically between 6 and 16) from
the active elements of a linear array probe. During the reception interval, the echoes
received by each probe element are “beamformed” (i.e. properly delayed before
being summed together). The mean Doppler frequency of the NP echo-samples
beamformed for each depth is estimated through the autocorrelation approach, and the
corresponding pixel is accordingly colored. Of course, the higher NP, the more robust
the frequency estimate, but also the lower the CFI frame rate (FR). For example, the
time needed to form one CFI image of NL = 64 lines at a pulse repetition frequency
(PRF) of 10 kHz and NP = 10 is 64 × 10 × 100 µs = 64 ms, which yields ≈ 15 frames per
second (fps), neglecting the time needed to generate the B-mode background. Such
an FR may not be sufficient to track rapid blood flow changes in the heart or in main
human arteries such as the carotid.
The transmission of plane waves [3] may solve this problem, since the full region
of interest (ROI) is insonified with a single transmission and the FR can thus be
automatically increased by a factor NL [4]. In addition, such an increase is so high that a
larger packet size (up to 64 and more) becomes acceptable, with a corresponding
image quality improvement. Such results are achieved provided the echoes received
by the probe elements can be simultaneously beamformed along all the lines of interest.
The term “parallel beamforming” indicates the capability of creating multiple
image lines after a single transmission (TX) event. Of course, the larger NL, the
higher the needed parallel beamforming speed.
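The frame-rate arithmetic of the two paragraphs above can be reproduced with a short calculation. This is only a back-of-the-envelope sketch under the same simplification used in the text (the B-mode background and any processing time are neglected); the function names are hypothetical.

```python
def scanned_cfi_frame_rate(n_lines, packet_size, prf_hz):
    """Line-by-line CFI: every line needs packet_size transmissions."""
    frame_time_s = n_lines * packet_size / prf_hz
    return 1.0 / frame_time_s


def plane_wave_cfi_frame_rate(packet_size, prf_hz):
    """Plane-wave CFI: one transmission insonifies all lines at once."""
    return prf_hz / packet_size


scanned = scanned_cfi_frame_rate(n_lines=64, packet_size=10, prf_hz=10e3)
plane_wave = plane_wave_cfi_frame_rate(packet_size=10, prf_hz=10e3)
gain = plane_wave / scanned
```

For the example in the text, the scanned frame time is 64 ms (15.625 fps), and the plane-wave bound is NL = 64 times higher, which is the margin that can be traded for a larger packet size.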
Furthermore, the NL beamformed lines must be, at the same time, processed
according to the color Doppler strategy. Such processing includes, besides the auto-
correlation, high-pass filtering for clutter removal, the evaluation of proper criteria
to avoid that residual small tissue movements are detected as blood movements, as
well as temporal and spatial filtering of the final colored images. Performing all such
processing in real time for multiple lines at high frame rates is quite challenging.
Parallel beamforming for high FR CFI is currently used in only one clinical system
[5] and in commercial open scanners [6]. However, in both cases, the acquired
raw data are beamformed and processed off-line. The goal of this paper is to present
a real-time high FR CFI system obtained by suitably using the programmable
resources available on the ULA-OP 256 open scanner [7, 8]. It is shown that parallel
beamforming is achieved by a special organization of the front-end FPGAs, while
color Doppler processing is performed by one on-board multi-core DSP. With respect
to the work presented in [9], better exploitation of the ULA-OP 256 hardware yielded
improved performance in terms of PRF, packet size range and global image
quality.
ULA-OP 256 is an open scanner developed by the MSD Laboratory of the University
of Florence [7]. This research instrument has been designed with a modular approach,
according to which eight front-end (FE) boards manage up to 256 channels connected
to an equal number of transducer elements. The active FE boards are interconnected
in a ring by a Serial RapidIO (SRIO) link with a bandwidth of 80 Gbit/s—full duplex.
Every FE board transmits, receives and processes 32 ultrasound signals. Each
channel is equipped with an independent arbitrary waveform generator, capable of
producing high voltage (up to 200 Vpp) signals at up to 20 MHz frequency. On
the receiving path, 4 Analog Front Ends (AFEs) embed 32 low noise amplifiers
followed by 32 analog to digital converters working at about 80 MSPS. These feed
an FPGA which is in charge of beamforming (see Sect. 43.2.2). Beamformed data
are demodulated and low-pass filtered by an on-board DSP and finally sent to the
Master Control (MC) board, which is here in charge of performing the high frame
rate CFI algorithm.
43 High-Frame-Rate Ultrasound Color Flow Imaging … 373
The 128 central elements of the LA533 linear array probe (Esaote SpA, Italy) were
simultaneously excited at a programmable PRF to alternate the transmission of NP CFI
pulses and of 7 B-mode pulses. CFI pulses were 5-cycle Hanning-tapered sinusoidal
bursts at 6 MHz center frequency. B-mode pulses were 3-cycle sinusoidal pulses
at 9 MHz. The 7 B-Mode plane waves (PWs) were transmitted at steering angles of −7.5°,
−5°, −2.5°, 0°, 2.5°, 5°, 7.5° before being coherently compounded [10]. The
ROI was thus fully insonified during each Pulse Repetition Interval (PRI =
1/PRF).
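Since each displayed frame requires NP CFI transmissions plus 7 B-mode transmissions, a rough upper bound on the frame rate of this interleaved sequence is PRF / (NP + 7). The sketch below is a simple transmission count, not the authors' code; it assumes that processing time is negligible, and the function name is hypothetical. For PRF = 3 kHz it returns values that round to the frame rates quoted in the caption of Fig. 43.3.

```python
def interleaved_frame_rate(prf_hz, packet_size, n_bmode_pulses=7):
    """Frames per second when each frame needs packet_size CFI pulses
    plus n_bmode_pulses B-mode pulses, one transmission per PRI."""
    return prf_hz / (packet_size + n_bmode_pulses)


packet_sizes = [8, 16, 32, 64]
rates_3khz = [round(interleaved_frame_rate(3000.0, np_)) for np_ in packet_sizes]
```

The agreement with the reported figures suggests that, at this PRF, the system keeps up with the transmission sequence for all four packet sizes.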
After each TX event, the echo data were beamformed over a programmable depth
range of the ROI, thanks to the serial-parallel beamformer (BF) implemented in the
FE FPGA [11, 12]. This includes a Dual Port Memory (DPM) capable of storing up
to 16384 words, each 384 bits wide (12 bits per sample for 32 channels), and 4 parallel
BFs, see Fig. 43.1. The serial-parallel BF is implemented on FPGAs of the ARRIA V
GX family (Altera, San Jose, CA, USA), which are embedded on the FE boards.
The DPM stores the echoes digitized by 4 AFEs. The DPM is divided into two
buffers (8192 words each); while one buffer is storing the echo data of the current
PRI, the second one is read to process the data acquired in the previous PRI.
The BFs process the same data multiple times to focus the received echoes along
the corresponding multiple directions. Each BF works at 200 MHz and is composed
of delay blocks, one per channel, which align the received data before they are
coherently summed. A memory controller embedded in the FPGAs manages an
external memory in which the delays are stored during the system initialization.
Overall, the serial-parallel beamformer can produce up to about 470
MSPS.
It exploits about 31% of the adaptive logic modules, and 24% of the logic registers
present in the FE FPGA. The memory is allocated in the 10 kbit memory blocks
(M10K), which are used at 71%.
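As a rough software analogue of the scheme described above, the buffered channel data of one PRI can be re-read several times, each time with a different set of delays, to form several image lines. The NumPy sketch below is a simplification, not the FPGA design: it assumes one fixed integer delay per channel and per line (no dynamic focusing, no apodization, no sub-sample interpolation), and the function names are hypothetical.

```python
import numpy as np


def delay_and_sum(channel_data, delays):
    """Align each channel by its integer sample delay, then sum coherently.

    channel_data: array of shape (n_channels, n_samples)
    delays: one non-negative integer delay per channel
    """
    n_channels, n_samples = channel_data.shape
    out = np.zeros(n_samples)
    for ch in range(n_channels):
        d = int(delays[ch])
        # Advance channel ch by d samples so that its echo lines up with the others.
        out[: n_samples - d] += channel_data[ch, d:]
    return out


def parallel_lines(channel_data, delay_table):
    """Reuse the same buffered data once per line, with that line's delays."""
    return np.stack([delay_and_sum(channel_data, d) for d in delay_table])


# Toy data: one echo that reaches channel ch at sample 10 + 2 * ch.
data = np.zeros((4, 32))
for ch in range(4):
    data[ch, 10 + 2 * ch] = 1.0

focused = [0, 2, 4, 6]      # delays matching the echo geometry
unfocused = [0, 0, 0, 0]    # delays for a direction the echo does not come from
lines = parallel_lines(data, [focused, unfocused])
```

With matching delays the four contributions add up coherently at one sample, whereas with mismatched delays they remain spread over time, which is the basis for separating echoes from different directions.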
374 F. Guidi et al.
Fig. 43.1 Serial-parallel beamformer implemented on the research ultrasound scanner ULAOP-256
The data produced by the serial-parallel BF are processed by the 8-core DSP
TMS320C6678 (Texas Instruments, TX, USA) mounted on the MC board. The CFI
is here implemented using 1 core to manage the beamforming operation, 1 core as
the master processing unit and the remaining 6 cores as the slave processing units.
The beamforming core is in charge of performing the last beamforming stage of
the system. It sums together the samples from all the FE boards to produce a single
data stream and stores it in external DDR memories, which act as temporary circular
buffers.
The master core mainly acts as a supervisor and schedules tasks to the other
cores. Every time a block of fresh samples is ready from the beamforming core, the
master core initiates a transfer of data from the DDR memories to the slave cores
that are ready to process new tasks, and provides them with the appropriate processing
parameters of the CFI algorithm. Once a slave core terminates its task, it notifies
the master core that a new column of processed pixels is ready. The master core
first transfers the pixels to its internal memory, freeing resources in the slave
core, which thus becomes available for subsequent scheduling. It then processes the pixels of
multiple columns through a 3 × 3 median filter, and sends the result to the PC, where
the CFI frames are displayed on a screen, superimposed on the B-Mode layer. All data
transfers from and to any core are operated by DMA channels (Fig. 43.2).
Each slave core takes care of processing a vertical line of points. It receives a block
of complex demodulated samples collected over 8 to 64 PRIs, depending on NP. The
core calculates the incoming signal power and processes the samples through the
wall filter, a 4th-order IIR high-pass filter. It afterwards calculates the
autocorrelation and variance of the signal. The autocorrelation output is fed into a
spatial filter made up of a 3 × 3 matrix that combines adjacent depths and lines to
obtain smoothed images. Since different lines are processed in distinct cores, a
dedicated state machine enables each core to directly collect, from other slave cores,
the samples processed by the latter up to this stage. The line processed in the
current core is thus combined with its two adjacent lines, and injected, first, into
the spatial filter and then into the persistence filter, which is an IIR filter working
across slow time. Finally, the slave core calculates the power and phase of the filtered
autocorrelation, which are proportional to the intensity and the flow velocity
respectively.
The computed phase is directly used to form a color map, while the power is non-
linearly combined with other available parameters, to generate a pixel transparency
mask, used to combine the final CFM and B-mode maps.
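The core of the autocorrelation approach can be illustrated with the classic lag-one estimator, in which the phase of the slow-time autocorrelation gives the mean Doppler frequency. The sketch below is a minimal NumPy illustration and not the DSP code: it assumes ideal, noise-free complex demodulated samples at a single depth and omits the wall filter, the spatial filter and the persistence filter. Function names and test values are hypothetical.

```python
import numpy as np


def lag_one_autocorrelation(iq):
    """Lag-one autocorrelation of the slow-time samples at one depth."""
    return np.sum(iq[1:] * np.conj(iq[:-1]))


def mean_doppler_frequency(iq, prf_hz):
    """Mean Doppler frequency from the phase of the lag-one autocorrelation."""
    r1 = lag_one_autocorrelation(iq)
    return np.angle(r1) * prf_hz / (2 * np.pi)


prf_hz = 4000.0
packet_size = 16
true_doppler_hz = 500.0
n = np.arange(packet_size)
iq = np.exp(2j * np.pi * true_doppler_hz * n / prf_hz)  # ideal complex samples

estimated_hz = mean_doppler_frequency(iq, prf_hz)
magnitude = abs(lag_one_autocorrelation(iq))
```

The estimate is unambiguous only for Doppler frequencies between −PRF/2 and +PRF/2, because the phase wraps outside that interval. A larger packet size averages more sample pairs, which is why it makes the estimate more robust to noise.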
The CFI approach has been tested with a standard flow-phantom (ATS 524) connected
to a peristaltic pump and a reservoir, in a closed loop containing a blood mimicking
fluid (ATS 707).
The first experiment was conducted in stationary flow conditions. No tempo-
ral filters were used to maintain maximum temporal resolution. Figure 43.3 shows
Fig. 43.3 Screenshots of ULAOP SW showing the output of real-time CFM for a steady flow
and different packet lengths. From left to right: NP = 8, 16, 32, 64. The system PRF was 3 kHz,
producing an output frame rate = 200, 130, 77, 42 [frames/s] respectively
Fig. 43.4 Screenshots showing 4 frames of the real-time conventional scanned CFI, 2 before and
2 after the velocity peak. These were obtained with NP = 8
Fig. 43.5 Screenshots showing 4 frames of the real-time CFM PW, 2 before and 2 after the velocity
peak. NP = 16, PRF = 2.5 kHz, no temporal filter was introduced to maintain high temporal
resolution
The system was finally tested at PRF = 4 kHz to evaluate the FRs obtainable
at different packet sizes. The results are reported in Table 43.1, which shows the FR
values obtained by mixing CFI and B-Mode (column 2) and by only operating in PW
CFI Mode (column 3). As reported in the right column, up to 403 frames/s could be
processed (NP = 8).
43.4 Conclusion
In this paper, a CFI method based on the transmission of PWs has been presented.
The experimental results highlight that, thanks to the increased availability of frames,
it is possible to use longer packet sizes to perform more reliable velocity estimation,
while always maintaining a very high final frame rate. Moreover, thanks to the intrinsic
frame coherence due to the PW imaging scheme, the output frames do not show
the artifacts seen during fast events in standard CFM methods.
References
3. Tanter M, Fink M (2014) Ultrafast imaging in biomedical ultrasound. IEEE Trans Ultrason
Ferroelectr Freq Control 61(1):102–119
4. Bercoff J et al (2011) Ultrafast compound doppler imaging: providing full blood flow
characterization. IEEE Trans Ultrason Ferroelectr Freq Control 58(1):134–147
5. Supersonic Imagine, “UltraFastTM Imaging.” Online www.supersonicimagine.com/Aixplorer-
R. 06 Dec 2016
6. Ekroll IK, Swillens A, Segers P, Dahl T, Torp H, Lovstakken L (2013) Simultaneous quan-
tification of flow and tissue velocities based on multi-angle plane wave imaging. IEEE Trans
Ultrason Ferroelectr Freq Control 60(4):727–738
7. Boni E et al (2016) ULA-OP 256: A 256-channel open scanner for development and real-time
implementation of new ultrasound methods. IEEE Trans Ultrason Ferroelectr Freq Control
63:1488–1495
8. Boni E, Yu ACH, Freear S, Jensen JA, Tortoli P (2018) Ultrasound open platforms for next-
generation imaging technique development. IEEE Trans Ultrason Ferroelectr Freq Control
65(7):1078–1092
9. Guidi F, Dallai A, Boni E, Ramalli A, Tortoli P (2016) Implementation of color-flow plane-wave
imaging in real-time. IEEE Int Ultrason Symp (IUS) 2016:1–4
10. Berson M, Roncin A, Pourcelot L (1981) Compound scanning with an electrically steered
beam. Ultrason Imaging 3(3):303–308
11. Boni E et al (2017) Architecture of an ultrasound system for continuous real-time high frame
rate imaging. IEEE Trans Ultrason Ferroelectr Freq Control 64:1276–1284
12. Meacci V et al (2019) FPGA-based multi cycle parallel architecture for real-time processing in
ultrasound applications. In: International conference on applications in electronics pervading
industry, environment and society, pp 295–301
Part IX
Vehicular, Robotic and Energy Electronic
Systems
Chapter 44
Empowering Deafblind Communication
Capabilities by Means of AI-Based Body
Parts Tracking and Remotely Controlled
Robotic Arm for Sign Language Speakers
images with two cameras, the signer’s body is tracked with a deep neural network.
The extracted coordinates of the body parts (chest, shoulders, elbows, wrists, palms
and fingers) are used to move one or more robotic arms. The deafblind person can
put his hands on the robots to understand the message delivered by the person on the
other side. The entire system is based on a cloud architecture.
44.1 Introduction
One of the main characteristics of human beings is the ability and the will to communicate
with other people, thanks to a shared language. Typically, the language is a
spoken language, with acoustic-vocal modalities. Deaf people, however, cannot
hear sounds. For this reason, they use another kind of language, a visual-gestural
one, the Sign Language (SL). Deaf people can thus communicate by making gestures,
moving the hands and making facial expressions. Since they can neither hear words nor see
gestures, deafblind people cannot use either voice or sign language. Their language
is therefore tactile, the so-called Tactile Sign Language (TSL). It is based on the SL,
but the receiver’s hands are placed on those of the signer, in order to follow the
signs made. It works fine when the deafblind person is in the same place as the signer, but
deafblind people cannot communicate remotely, like other people do with phone or
video calls, because they need to touch the other person’s hands. This limitation due
to their severe disability can lead to isolation and depression [1]. Nowadays, from
0.2 to 3.3% of the world population is deafblind [2].
The idea of the PARLOMA project is to reduce the gap between deafblind people
and others by giving them the possibility to communicate remotely, in order to
decrease the cases of isolation and depression. This is done by tracking the
movements of a person in front of two cameras and reproducing these gestures on a
robotic arm, which the deafblind person can touch to understand the message
delivered remotely.
To let deafblind people communicate remotely without external tools, two cameras
and a robotic arm are needed. The first camera captures the frames used to identify
the pose of the upper part of the body (chest, shoulders, elbows and wrists), the
second camera detects the position of the hands, and the robotic arm reproduces the
gestures made in front of the cameras. Because of the complexity of this task, the
whole system is based on a client-server cloud architecture, which allows several
clients on the camera side and several clients on the robotic arm side (multi-client
system). In the simplest configuration, the system is composed of three entities: a
client which manages the two cameras (Camera Client, CC); the server, which
receives the images from the CC and computes the complete pose of the person
(Elaboration Unit, EU); and a client which moves the robotic arm (Robot Client, RC).
When several CCs and RCs are connected to the elaboration unit, each CC can choose
with whom to communicate, selecting one RC, as in a phone call, or several RCs,
simulating a radio scenario. The multi-client scenario is managed with a simple
database composed of three tables: the CCs table and the RCs table, which contain
the same fields (name, IP, port and a boolean indicating whether the client is
online), and the communication table, which stores the relationships between the CCs
and the RCs. Given their tasks, both the CCs and the RCs can be common laptops,
while the EU must have at least one GPU.
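The three-table database described above can be sketched as follows. This is a minimal illustration with Python's built-in sqlite3 module, not the project's actual schema: the table names, column names and the `targets_for` helper are assumptions made for the example.

```python
import sqlite3


def create_registry(conn):
    """Create the CC, RC and communication tables (illustrative schema)."""
    for table in ("camera_clients", "robot_clients"):
        # Same fields for both client tables: name, IP, port, online flag.
        conn.execute(
            f"CREATE TABLE {table} ("
            "name TEXT PRIMARY KEY, ip TEXT, port INTEGER, online INTEGER)"
        )
    # One row per CC -> RC relationship (one row = phone call, many = radio).
    conn.execute(
        "CREATE TABLE communication ("
        "cc_name TEXT REFERENCES camera_clients(name), "
        "rc_name TEXT REFERENCES robot_clients(name), "
        "PRIMARY KEY (cc_name, rc_name))"
    )
    conn.commit()


def targets_for(conn, cc_name):
    """Return (ip, port) of every online RC selected by the given CC."""
    rows = conn.execute(
        "SELECT r.ip, r.port FROM communication c "
        "JOIN robot_clients r ON r.name = c.rc_name "
        "WHERE c.cc_name = ? AND r.online = 1 ORDER BY r.name",
        (cc_name,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_registry(conn)
conn.execute("INSERT INTO camera_clients VALUES ('cc1', '10.0.0.1', 5000, 1)")
conn.executemany(
    "INSERT INTO robot_clients VALUES (?, ?, ?, ?)",
    [("rc1", "10.0.0.2", 6000, 1), ("rc2", "10.0.0.3", 6000, 0)],
)
conn.executemany(
    "INSERT INTO communication VALUES (?, ?)",
    [("cc1", "rc1"), ("cc1", "rc2")],
)
print(targets_for(conn, "cc1"))  # only rc1 is online
```

In this sketch a "phone call" is a single row in the communication table, while the radio scenario is simply several rows sharing the same CC.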
The Camera Client is the part of the system that collects the frames from the two
cameras and sends them to the EU. The cameras are placed as shown in Fig. 44.1. The
camera in front of the signer is an Orbbec Astra Pro [3]. It collects the frames that
the EU uses to identify the coordinates of the joints of the upper part of the body.
Each frame is composed of two 640 × 480 images acquired at 30 fps, an RGB image
and a depth image. Both are needed to obtain a 3D view of the scene and to let the
robotic arm move correctly in space after all the processing steps. Since the depth
image is noisy, two filters are applied. First, the depth image is dilated with an
elliptic kernel, which increases the size of the objects and removes some shadows
[4]. Then, the image is convolved with a low-pass filter (Gaussian blur) to reduce
the noise and smooth the image [5]. The two images are sent over the Internet, so
their size must be reduced: the RGB image is compressed as jpg and the filtered depth
image is compressed as gz.
The camera placed under the signer’s hands is a LeapMotion [6]. Since it already
provides a hand tracking algorithm, the coordinates (x, y, z) of the hands with
respect to the LeapMotion are extracted directly in the CC and sent to the EU.
Fig. 44.2 Structure of the packet sent by the camera client to the elaboration unit
The packets sent to the EU are composed of the size of the compressed
RGB image, the size of the compressed depth image, the compressed RGB image,
the compressed depth image, the size of the coordinates of the hands, the presence
of the two hands and the coordinates of the hands (x, y, z of the wrists, the palms
and the 5 fingers of each hand, for a total of 42 integers for each detected hand).
The structure of the packet is shown in Fig. 44.2. For simplicity, the figure splits
the single packet into two parts: the upper one covers the images coming from the
Orbbec camera and the lower one the data from the LeapMotion. Furthermore, to
prevent the packets from being intercepted by a man-in-the-middle attack, an
encrypted socket is used to communicate with the EU.
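The field order listed above can be illustrated with a small serialization sketch. The field widths (32-bit big-endian sizes, one byte per hand-presence flag, 32-bit signed coordinates) and the function names are assumptions, since the text does not specify them, and the encrypted socket is left out.

```python
import gzip
import struct


def pack_frame(rgb_jpeg, depth_raw, left=None, right=None):
    """Serialize one CC -> EU packet in the field order given in the text.

    rgb_jpeg: already-compressed RGB bytes; depth_raw: filtered depth bytes;
    left/right: lists of integer hand coordinates, or None if not detected.
    """
    depth_gz = gzip.compress(depth_raw)
    coords = list(left or []) + list(right or [])
    coord_blob = struct.pack(f">{len(coords)}i", *coords)
    sizes = struct.pack(">II", len(rgb_jpeg), len(depth_gz))
    hands = struct.pack(">I??", len(coord_blob), left is not None, right is not None)
    return sizes + rgb_jpeg + depth_gz + hands + coord_blob


def unpack_frame(packet):
    """Inverse of pack_frame, as the EU side would apply it."""
    rgb_len, depth_len = struct.unpack_from(">II", packet, 0)
    offset = struct.calcsize(">II")
    rgb = packet[offset:offset + rgb_len]
    offset += rgb_len
    depth = gzip.decompress(packet[offset:offset + depth_len])
    offset += depth_len
    coord_len, has_left, has_right = struct.unpack_from(">I??", packet, offset)
    offset += struct.calcsize(">I??")
    coords = list(struct.unpack_from(f">{coord_len // 4}i", packet, offset))
    return rgb, depth, has_left, has_right, coords
```

Sending the sizes before the payloads lets the receiver know how many bytes to read for each image before decompressing it.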
The Elaboration Unit is the core of the entire application. Here the packets
are received and processed in order to obtain the key-points of the body. It is
divided into two parts: the connection management and the GPU server.
The connection management takes all the packets coming from a CC and associates
them with the selected RC. The association relies on the communication table, which
links the IP addresses and ports of the camera and robot clients.
The GPU server receives the packets, extracts the data (RGB, depth and hand
points) and decompresses the images (RGB and depth) to recover the original
frames. When the RGB and depth images are available, the RGB image is fed to
a Deep Neural Network (DNN) called Open Pose (explained in Sect. 44.3), which
analyses it and outputs 7 key-points. These points are the coordinates (x, y) of each
joint in the camera frame, where (0, 0) is the upper-left corner of the frame. They
are not enough to control the robot, because the robotic arm needs the three spatial
coordinates (x, y, z). Since each pixel of the depth image represents the distance
between the camera and the objects in the scene, z can be extracted by looking up
the coordinates (x, y) given by the DNN in the depth image.
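The depth lookup described above can be sketched in a few lines. The sketch assumes that the key-point coordinates are normalized to the range [0, 1] and that the depth image is available as a list of rows; the function name is hypothetical, and converting pixel indices to metric x and y (which requires the camera intrinsics) is not covered.

```python
def keypoint_to_3d(x_norm, y_norm, depth_image):
    """Map a normalized (x, y) key-point to pixel indices and read its depth.

    depth_image is a list of rows; depth_image[row][col] is the distance
    between the camera and the scene at that pixel.
    """
    height = len(depth_image)
    width = len(depth_image[0])
    # (0, 0) is the upper-left corner; clamp so that 1.0 stays inside the image.
    col = min(int(x_norm * width), width - 1)
    row = min(int(y_norm * height), height - 1)
    return col, row, depth_image[row][col]


# Tiny 2 x 2 depth map used only to show the indexing convention.
depth = [[100, 200],
         [300, 400]]
print(keypoint_to_3d(0.75, 0.25, depth))
```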
Fig. 44.3 Reference systems: on the left, the body parts which refer to the chest; on the right, the
hand points which refer to the palms
Before sending the key points to the RC, both the body and the hand coordinates
must be converted to a common set of reference systems. We chose the chest as the
origin for the body points (shoulders, elbows, wrists and palms) and each palm as
the origin for the finger points of the corresponding hand. The three reference
systems are shown in Fig. 44.3. Since the DNN does not identify the palms, which
are instead located by the LeapMotion, the palms are linked to the rest of the body
by combining the wrist coordinates given by the network with those given by the
LeapMotion. At the end of this step, all the points are in the right reference system
and the new set of coordinates (in millimetres) is sent to the RC. The packet is
composed of 54 integers, 3 (x, y, z) for each point, in the following order: left arm,
right arm, left hand and right hand, where each arm is composed of shoulder, elbow,
wrist and palm w.r.t. the chest and each hand is composed of thumb, index, middle,
ring and pinky w.r.t. its palm.
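A minimal sketch of this last step is given below. It assumes that all points have already been brought into one common frame in millimetres (that is, the wrist-based linking of the LeapMotion data has been done); the dictionary layout and function names are illustrative and not taken from the project.

```python
ARM_JOINTS = ("shoulder", "elbow", "wrist", "palm")
FINGERS = ("thumb", "index", "middle", "ring", "pinky")
SIDES = ("left", "right")


def relative(point, origin):
    """Express a 3D point with respect to a new origin, rounded to integers."""
    return [int(round(p - o)) for p, o in zip(point, origin)]


def build_robot_packet(chest, arms, hands):
    """Build the 54-integer EU -> RC packet in the order given in the text.

    arms[side][joint] and hands[side][finger] are (x, y, z) tuples in mm,
    all expressed in the same frame as the chest point.
    """
    values = []
    for side in SIDES:  # left arm, then right arm, w.r.t. the chest
        for joint in ARM_JOINTS:
            values.extend(relative(arms[side][joint], chest))
    for side in SIDES:  # left hand, then right hand, w.r.t. its palm
        palm = arms[side]["palm"]
        for finger in FINGERS:
            values.extend(relative(hands[side][finger], palm))
    return values
```

The count follows from the layout: 2 arms × 4 points × 3 coordinates plus 2 hands × 5 fingers × 3 coordinates gives 24 + 30 = 54 integers.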
The Robot Client converts the set of coordinates received from the EU into joint
variables and gives them as input to the robotic arms. Since each joint is identified
by three coordinates, it is easy to control the robots in space. If the right and the
left robotic arms are not both available, the user can select which arm should move.
The robotic arm, shaped to resemble the human arm, is divided into three main
modules: upper, representing the shoulder and composed of two revolute joints;
middle, representing the bicep and composed of two revolute joints; lower,
representing the forearm and the hand and composed of three revolute joints. Thus,
the robotic arm has seven Degrees of Motion (DoM) arranged in a humanoid kinematic
tree, with spherical joints for shoulder and wrists, as shown in Fig. 44.4. To
enable a smooth and safe interaction between human and robot, we adopted design
choices such as the integration of back-drivable servomotors coupled with series
elastic transmission, the development of a low mechanical inertia structure and the
realization and integration of a soft sensorized artificial skin. Through this skin,
we were able to retrieve critical information about the contact between user and
robotic arm. We designed and developed large-area skins embedding optical sensors
to resolve both the magnitude and the location of a load applied to the skin surface
using a customized neural network. The movement of the arm is based on seven
motors: two linear actuators to handle two out of three DoM of the wrist, and five
rotary actuators for shoulder and elbow joints and for the last DoM of the wrist. The
structural shell of the robotic arm was printed with a 3D printer and hard plastic
filament.
The pose estimation is a re-elaboration of the Open Pose DNN [7]. Several
implementations of the Open Pose network exist (Python, OpenCL, Unity or CUDA),
and all of them use GPU acceleration to decrease the inference time and preserve
real-time operation. The network detects up to 16 points in a 2D image and returns
their (x, y) coordinates. The coordinates are computed starting from the upper-left
corner of the image and are normalized in order to support images with different
resolutions. For our purpose, only 7 body parts are needed to communicate (chest,
shoulders, elbows and wrists). Thus, the network was slightly modified to obtain a
smoother and faster algorithm, since every image must be processed before the next
packet arrives in order to reduce lag. In particular, we simplified the network by
removing all the parts of the lower body (e.g. legs and hips) and of the face (e.g.
ears and nose) and by adding a sorting step on the output. In this way, it is possible
to tell whether some parts have not been recognized well and which ones. This
allows the EU to send a message to the CC to inform the user of the wrong position.
In addition, if some parts are not recognized, the protocol specifies that the packet
is not sent to the RC: the loss of a frame at 30 fps does not lead to a wrong
movement of the robotic arms.
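The frame-dropping rule can be sketched as a simple gate on the sorted network output. The part names, the dictionary representation and the returned action labels are assumptions made for the example; the text only states that incomplete frames are reported to the CC and not forwarded to the RC.

```python
REQUIRED_PARTS = (
    "chest",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
)


def route_frame(keypoints):
    """Decide what the EU does with one frame.

    keypoints maps a part name to its (x, y) tuple, or to None when the
    network did not recognize that part.
    """
    missing = [part for part in REQUIRED_PARTS if keypoints.get(part) is None]
    if missing:
        # Incomplete pose: warn the camera client and drop the frame.
        return "warn_cc", missing
    # Complete pose: forward the points in a fixed, sorted order.
    return "send_rc", [keypoints[part] for part in REQUIRED_PARTS]
```

Because the output order is fixed, the position of a missing entry identifies which body part the signer should bring back into view.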
The results of the DNN are very promising, both for the accuracy and for the
inference time: it always recognizes all the joints if they are in the RGB image and
an inference lasts about 30 ms, serving each client without communication delay.
44.4 Conclusion
Deafblind people are affected by a severe disability that restricts them to
communicating, through the tactile sign language, only with people in the same room
and does not allow them to communicate remotely. This can lead to social gaps and
isolation, which can cause depression. The combination of an AI-based body parts
tracking algorithm and remotely controlled robotic arms gives them the possibility
to communicate remotely, strengthening their social relationships. The AI-based
system tracks the sign language movements of a person in front of two low-cost
cameras. Thanks to the precision and the speed of the inference, these movements
are reproduced correctly and in real time by one or more robotic arms located in a
different place from the signer. The deafblind person can thus lay his hands on the
robots and understand the message delivered remotely by the signer, in the same way
as when communicating with another person in the same room with the tactile sign
language.
Acknowledgements This study was funded in part by the Italian Ministry of Education, Univer-
sities and Research within the “Smart Cities and Social Innovation Under 30” program through the
PARLOMA Project (SIN_00132).
References
Abstract Autonomous sail drones are an interesting research topic and also a
potential solution for different applications, mainly related to the monitoring of
large marine areas or freshwater basins. In this work, authors from the University
of Florence present the current evolution of the UNIFI sail drone. Vehicle
development currently involves many activities; here the focus is on the design of
the energy management system, which plays a key role in maintaining vehicle
efficiency and reliability.
E. Boni · M. Montagni
DINFO, Department of Information Engineering, University of Florence, Via di Santa Marta 3,
50139 Florence, Italy
L. Pugi (B)
DIEF, Department of Industrial Engineering, University of Florence, Via di Santa Marta 3,
50139 Florence, Italy
The UNIFI sail drone has been gradually developed from a preliminary design produced
during the master’s thesis of one of the authors [1]. The proposed layout is
aligned with innovative solutions currently proposed in the literature [2]. A
simplified scheme and a brief description of the current evolution of the system are
given in Fig. 45.1 and Table 45.1: the vehicle is composed of a composite hull
(fiberglass and carbon) propelled by a sail whose design, as visible in Fig. 45.1,
has been gradually improved over the last two years.
Fig. 45.1 Brief schematic and evolution of the UNIFI sail drone
The electric energy needed to power the on-board electronics and payload and to
actuate the sail and rudder surfaces is provided by solar panels, in order to ensure
a near-zero environmental impact and an almost unlimited autonomy of the vehicle.
Since the vehicle must be able to operate even at night, when solar energy is not
available, an on-board storage system charged by the solar panels provides a stable
power source. The vehicle is controlled by the simple navigation logic described in
Fig. 45.2:
• A high-level trajectory planner (details in Fig. 45.2a, b) generates a mission
profile consistent with the navigation between imposed waypoints. The mission can
be modified by the intervention of vision-based obstacle identification and
avoidance algorithms (described in Fig. 45.2c).
• Once the trajectory is planned, an inner navigation loop (described in Fig. 45.2d)
regulates the orientation of the sails in order to ensure an optimal incidence of the
incoming wind, while the rudder is used to further correct the vehicle trajectory.
Wind direction
Fig. 45.2 a–d Upper and lower control layer of the UNIFI sail drone
The original solar panel (Fig. 45.3) was positioned towards the rear of the boat and
divided into two parts. Standard solar cells were employed, and the total installed
power was 48 W nominal (24 W each). The new solar panel design (Fig. 45.3)
uses high-efficiency next-generation SunPower flexible solar cells.
Three solar panels are placed on the top surface of the boat. Two panels are installed
on the rear side, one on the left and the other on the right. The third panel is
on the front side. Each panel has a nominal power of 33 W and is composed of a
series of 20 half cells, each providing 0.55 V and 3 A at the maximum power point.
The nominal panel voltage is 11 V. Dividing the solar surface into three regions
maximizes the available solar energy, since at least one full panel is guaranteed
not to be covered by the sail shadow in all conditions.
Fig. 45.3 Position on the first prototype (old) and in the new one (new)
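The panel figures quoted above are mutually consistent, as the following arithmetic check shows. The total installed power of the new design is derived here from the per-panel value and is not stated explicitly in the text.

```python
import math

CELLS_PER_PANEL = 20        # half cells in series
CELL_VOLTAGE_MPP = 0.55     # V per half cell at the maximum power point
CELL_CURRENT_MPP = 3.0      # A at the maximum power point
NEW_PANEL_COUNT = 3
OLD_TOTAL_POWER = 48.0      # W, two 24 W panels on the first prototype

# Series connection: voltages add up, the current is the same in every cell.
panel_voltage = CELLS_PER_PANEL * CELL_VOLTAGE_MPP
panel_power = panel_voltage * CELL_CURRENT_MPP
new_total_power = NEW_PANEL_COUNT * panel_power  # derived, not from the text

print(f"panel voltage: {panel_voltage:.1f} V")
print(f"panel power:   {panel_power:.1f} W")
print(f"installed power: {new_total_power:.1f} W (old design: {OLD_TOTAL_POWER:.1f} W)")
```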
The original dual battery pack was composed of two Lead-Gel 12 V batteries, with a
capacity of 12 Ah each. The new dual battery pack is composed of two independent
series strings of 4 LiFePO4 cells, each with a capacity of 20 Ah. The nominal voltage
of the battery pack is 13.2 V, with a minimum of 10 V (2.5 V per cell) and a maximum
of 14.4 V (3.6 V per cell). The minimum and maximum working voltages of the cells
were chosen within the extreme values allowable for this chemistry (2.45–3.65 V)
to preserve the battery life. Since cells in series can slowly drift to different
states of charge because of small cell-to-cell differences, we designed a battery
balancing circuit to avoid damaging the battery. The circuit is based on the
BQ76920 chip from Texas Instruments. The chip automatically measures the cell
voltages and, when one reaches the maximum programmed value (3.6 V), activates a
discharging circuit in parallel with that cell. A digital interface allows an external
microcontroller to monitor the cell status and manually override the protection
algorithm. Moreover, the microcontroller can communicate with the central unit of
the autonomous sailboat, thus providing vital information about the battery pack to
the main controller. The two battery strings are OR-ed with two Schottky diodes.
This improves the reliability of the power supply in case one half of the battery
pack fails.
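The balancing rule can be illustrated with a purely behavioural sketch. It does not use or model the BQ76920 register interface: the function names and the status dictionary are hypothetical, and only the 3.6 V and 2.5 V thresholds come from the text.

```python
V_BALANCE = 3.6   # V, programmed maximum cell voltage
V_MIN = 2.5       # V, minimum working cell voltage


def balance_step(cell_voltages, v_balance=V_BALANCE):
    """Return one flag per cell: True means the bleed path in parallel
    with that cell is switched on."""
    return [v >= v_balance for v in cell_voltages]


def string_status(cell_voltages):
    """Summary a supervising microcontroller could report to the main unit."""
    return {
        "string_voltage": sum(cell_voltages),
        "undervoltage": any(v < V_MIN for v in cell_voltages),
        "bleeding": balance_step(cell_voltages),
    }


# Four LiFePO4 cells in series near the end of charge (example values).
print(string_status([3.45, 3.60, 3.58, 3.61]))
```

Bleeding only the cells that have reached the threshold lets the remaining cells keep charging until the whole string converges.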
Each panel is equipped with a separate MPPT battery charger, to better cover
different shading conditions and to increase the overall reliability of the system.
Panels #1 and #3 recharge one half of the battery pack, while panel #2 is connected
to the second half. The old MPPT charger was based on a simple constant
panel-voltage circuit. Although very simple, this circuit does not fit the use case.
In fact, different shading conditions, due to the sail position and boat orientation,
easily lead to different (lower) MPP voltages for the panel. Moreover, the high
temperatures that the panels can reach on a sunny day forced the use of a low MPP
voltage, thus lowering the maximum power extraction in all other conditions. The new
MPPT charger is based on the LT8490 chip from Linear Technology. The chip is a
buck-boost switching regulator battery charger. Its internal logic provides
automatic, continuous maximum power point tracking with a perturb-and-observe
algorithm. The panel is also scanned periodically to avoid settling on a local
maximum power point for long periods of time in the case of non-uniform panel
illumination. The circuit design involved careful optimization of all the components
in order to achieve the maximum efficiency. The panel input specifications, based on
the new solar panels, were 7–11 V input voltage and 3.5 A maximum input current.
The battery output was configured as 9–14.8 V with 4.3 A maximum output current.
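The perturb-and-observe principle mentioned above can be sketched generically as follows. This is a textbook version of the algorithm, not the internal logic of the LT8490: the function names, the step size and the toy power curve are assumptions, and the periodic full-range scan is not included.

```python
def perturb_and_observe(measure_power, v_start, v_min, v_max,
                        step=0.1, iterations=100):
    """Track the maximum power point by perturbing the panel voltage.

    measure_power(v) returns the panel power at operating voltage v.
    The voltage keeps moving in the same direction while the power does
    not decrease, and reverses as soon as the power drops.
    """
    v = v_start
    p_prev = measure_power(v)
    direction = 1
    for _ in range(iterations):
        v_new = min(max(v + direction * step, v_min), v_max)
        p_new = measure_power(v_new)
        if p_new < p_prev:
            direction = -direction  # last perturbation made things worse
        v, p_prev = v_new, p_new
    return v


def toy_panel(v):
    """Hypothetical single-peak power curve with its maximum at 9 V."""
    return 30.0 - (v - 9.0) ** 2


v_mpp = perturb_and_observe(toy_panel, v_start=7.0, v_min=7.0, v_max=11.0)
print(f"operating voltage after tracking: {v_mpp:.1f} V")
```

In steady state the operating point oscillates within one step of the maximum. With partial shading the power curve can have several peaks, which is why a periodic scan of the whole voltage range, as described in the text, is needed in addition to this local search.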
Based on the input/output specifications, the buck-boost switching cell (Fig. 45.4)
was optimized. We started by choosing a 170 kHz switching frequency, a compromise
between reducing the switching losses and keeping the inductor small. Then, an
inductor was selected, considering the lower bound of 7 µH that results from the
combination of switching frequency, min/max input/output voltage and current values:
\[
L_{(MIN)} = \frac{V_{IN(MIN)} \cdot \dfrac{DC_{(MAX,M3,BOOST)}}{100\%}}{2 \cdot f \cdot \left( \dfrac{V_{RSENSE(MAX,BOOST,MAX)}}{R_{SENSE}} - \dfrac{I_{OUT(MAX)} \cdot V_{OUT(MAX)}}{V_{IN(MIN)}} \right)} \ \mathrm{H}
\]
the MPPT system was supplied with a current limited power supply, and the output
was connected to a constant voltage load. The power supply was swept between 7
and 11 V and for each voltage the current limit was swept between 0.2 and 4 A. The
output voltage was set to 13.2 V. Input and output voltages, input and output currents
were measured with four Peaktech 3440 multimeters connected to a data collecting
PC (Fig. 45.5). The corresponding input and output power and thus efficiency was
calculated for each setpoint.
The maximum input power of the designed MPPT charger is 38.5 W (11 V, 3.5 A),
while the maximum estimated switching losses are 1.35 W (0.13 + 1.22 W). The
predicted maximum efficiency is thus 96.5%. A total of 1160 efficiency measurement
points were acquired with the measurement setup described in Sect. 45.2. Figure 45.6
shows the measured efficiency. The graphs show that the circuit reaches a maximum
efficiency of 96.2% at input values of 10.2 V, 3.5 A, which matches the predicted
value well.
Fig. 45.6 Measurement results, plotted against input voltage (left) and input current (right). The
shaded surface is the least-squares interpolant over the measured points (blue dots); colors refer
to efficiency values
Regarding the battery pack, for the old lead-acid cells we estimated an 82%
charging/discharging efficiency (total energy extracted from the pack vs total energy
provided to the pack), while for the new LiFePO4 pack we estimate a 98%
charge/discharge efficiency. Table 45.2 reports the incremental reduction with
respect to the nominal panel power due to three main reduction factors: sun incidence
angle (and average shading), MPPT circuit efficiency and battery charge/discharge
efficiency. The last column shows how much energy can realistically be stored on a
sunny day considering those factors. The new system can store 378 Wh, compared with
105 Wh for the old one. The battery balancing chip included in the battery packs has
a negligible impact on the overall system efficiency, since its current consumption,
when fully active, is less than 200 µA. This performance is superior to that of the
power management systems of sail robots in the literature [5]. This result is of
interest not only for the proposed application but more generally for the management
and balancing of batteries in other applications, such as electric road vehicles [6]
or autonomous underwater drones [7].
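The predicted MPPT efficiency quoted above follows directly from the input power and the estimated losses, as this short check shows. The variable names are illustrative; all numbers are taken from the text.

```python
V_IN_MAX = 11.0               # V
I_IN_MAX = 3.5                # A
LOSS_TERMS = (0.13, 1.22)     # W, the two estimated loss contributions

p_in = V_IN_MAX * I_IN_MAX
p_loss = sum(LOSS_TERMS)
predicted_efficiency = (p_in - p_loss) / p_in

MEASURED_EFFICIENCY = 0.962   # best measured point (10.2 V, 3.5 A)
gap = predicted_efficiency - MEASURED_EFFICIENCY

print(f"input power:          {p_in:.1f} W")
print(f"estimated losses:     {p_loss:.2f} W")
print(f"predicted efficiency: {predicted_efficiency:.1%}")
print(f"prediction minus measurement: {gap * 100:.1f} percentage points")
```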
Acknowledgements The authors wish to thank Fondazione Cassa di Risparmio di Firenze for
funding the VELA Project.
References
1. Allotta B, Pugi L, Massai T, Boni E, Guidi F, Montagni M (2017) Design and calibration of
an innovative ultrasonic, arduino based anemometer. In: Conference proceedings—2017 17th
IEEE international conference on environment and electrical engineering and 2017 1st IEEE
industrial and commercial power systems Europe, EEEIC. I and CPS Europe 2017, art. no.
7977450. https://doi.org/10.1109/eeeic.2017.7977450
2. Alves JC, Cruz NA (2008) FAST—an autonomous sailing platform for oceanographic missions.
In: OCEANS 2008, Quebec City, QC, p 1–7. https://doi.org/10.1109/oceans.2008.5152114
3. Pugi L, Allotta B, Boni E, Guidi F, Montagni M, Massai T (2018) Integrated design and testing
of an anemometer for autonomous sail drones. J Dyn Syst Meas Control Trans ASME 140
(5):055001. https://doi.org/10.1115/1.4037840
4. Sauzé C, Neal M (2011) Long term power management in sailing robots. In: OCEANS 2011,
IEEE, Santander, p 1–8. https://doi.org/10.1109/oceans-spain.2011.6003406
5. Boni E, Montagni M, Pugi L (2019) Autonomous sail surface boats, design and testing results of
the MOUNTAINS prototype. In: Lecture Notes in Electrical Engineering, vol 550. pp 453–459
(9783030119720). https://doi.org/10.1007/978-3-030-11973-7_54
6. Pugi L, Grasso F, Pratesi M, Cipriani M, Bartolomei A (2017) Design and preliminary per-
formance evaluation of a four wheeled vehicle with degraded adhesion conditions. Int J Electr
Hybrid Veh 9(1):1–32. https://doi.org/10.1504/ijehv.2017.08281
7. Pugi L, Pagliai M, Allotta B (2018) A robust propulsion layout for underwater vehicles with
enhanced manoeuvrability and reliability features. Proc Inst Mech Eng Part M: J Eng Marit
Environ 232(3):358–376. https://doi.org/10.1177/1475090217696569
Chapter 46
DC-Link Capacitor Sizing Method
for a Wireless Power Transfer Circuit
to Be Used in Drone Opportunity
Charging
46.1 Introduction
issue. It mainly consists of a bridge rectifier that supplies the battery to be
recharged [3]. Depending on the application type and the quality of the WPT system,
a DC-DC converter and/or filters are present on the secondary side to rectify the
transmitted AC power and to control the charge of the battery. In [4], the authors
reach and track the maximum charging efficiency using a DC-DC converter with
variable duty cycle. However, this architecture might not be suitable for drones,
where the overall dimensions and weight are important constraints. Therefore, a
simpler solution based on an LC-filter may be preferable. The DC-link LC-filter
architecture used in [5] is conservatively sized by choosing the capacitor value C
large enough to obtain a constant output DC voltage across it. However, this
assumption may lead to overestimating the capacitor value and to adding unnecessary
size and weight to the drone. The capacitor and inductor sizes are related to the
maximum current and voltage values that they can withstand, so these factors can be
critical when the power requirement of the opportunity charger becomes large.
Moreover, as WPT systems usually work with resonant frequencies from tens of kHz to
MHz [3], the capacitance values available in commercial capacitors are much lower
than those available for lower frequencies. The aim of this paper is to investigate
how the sizing of the filtering capacitor influences the power transfer and the
efficiency of a generic WPT system, by means of LTSpice time-domain simulations. In
addition, as a generic Li-ion battery shows an intrinsic inductive component in the
WPT frequency range, as shown in [6], our idea is to eliminate the external passive
inductor and to use the battery itself as the inductive component of the LC-filter,
to achieve a further size and weight saving. The final goal of the paper is to find
the best trade-off between the size of the LC-filter capacitor and the power
delivered to the battery.
46.2 Methodology
Figure 46.1 shows the equivalent circuit of the series-series (SS) WPT architecture
investigated in this paper. The secondary circuit consists of the diode rectifier
bridge and only one passive component, C0, whereas RB and LB are the intrinsic
parameters of the Randles model [6] of a generic Li-ion battery, valid at the
frequencies of interest. Let us determine the frequency response of the LC-filter
consisting of C0 and the parasitic inductance of the battery.
Fig. 46.1 Equivalent circuit of the SS WPT architecture proposed in this paper
[Figure: primary circuit (power supply, D-class amplifier, C1, L1) coupled through the mutual
inductance M to the secondary circuit (L2, C2, rectifier diodes D1–D4, C0) and the battery model
(LB, RB, VB); labelled signals iSS(t), vSS(t), iIN(t), iOUT(t)]
As described in [7], the SS architecture fixes the current in the secondary circuit.
Therefore, the frequency response of the LC-filter shown in Fig. 46.1 can be
evaluated by considering the current iIN(t) coming from the rectifier bridge as input
and the current iOUT(t) that flows in the Li-ion battery as output. The circuit
behaves like a second-order low-pass filter, whose Laplace-domain response is well
known and shown in (46.1), together with the discriminant Δ of the denominator
polynomial [8]. The value of Δ determines the position of the poles of the filter.
The value of C0 that makes Δ = 0, i.e. C* reported in (46.2), sets the limit between
real and complex conjugate poles. If C0 is lower than C*, the filter has complex
conjugate poles.
\[
\frac{i_{OUT}(s)}{i_{IN}(s)} = \frac{1}{s^{2} L_B C_0 + s R_B C_0 + 1}, \qquad \Delta = C_0 \left( R_B^{2} C_0 - 4 L_B \right) \tag{46.1}
\]
\[
C^{*} = \frac{4 L_B}{R_B^{2}} \tag{46.2}
\]
By defining f0 as the resonant frequency and ξ as the damping factor of the filter,
as expressed in (46.3),
\[
f_0 = \frac{1}{2\pi\sqrt{L_B C_0}}, \qquad \xi = \frac{R_B}{2}\sqrt{\frac{C_0}{L_B}}, \tag{46.3}
\]
the response (46.1) can be written as
\[
\frac{i_{OUT}(s)}{i_{IN}(s)} = \frac{1}{\dfrac{s^{2}}{4\pi^{2} f_0^{2}} + \dfrac{\xi}{\pi f_0}\, s + 1}. \tag{46.4}
\]
As the damping factor approaches 0, the time-domain response becomes increasingly
oscillatory.
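Equations (46.2) and (46.3) are easy to evaluate numerically. The sketch below uses the battery parameters reported later in the chapter (RB = 179.4 mΩ, LB = 1.304 µH) and three capacitor values discussed in the results; the function names are illustrative.

```python
import math

L_B = 1.304e-6   # H, total battery inductance reported in the chapter
R_B = 0.1794     # ohm, total battery resistance reported in the chapter


def resonant_frequency(c0, l_b=L_B):
    """f0 from (46.3), in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_b * c0))


def damping_factor(c0, l_b=L_B, r_b=R_B):
    """xi from (46.3), dimensionless."""
    return 0.5 * r_b * math.sqrt(c0 / l_b)


def critical_capacitance(l_b=L_B, r_b=R_B):
    """C* from (46.2): the value of C0 for which the discriminant is zero."""
    return 4.0 * l_b / r_b ** 2


for c0 in (500e-9, 160e-9, 50e-9):
    f0_khz = resonant_frequency(c0) / 1e3
    print(f"C0 = {c0 * 1e9:6.0f} nF  f0 = {f0_khz:6.1f} kHz  xi = {damping_factor(c0):.4f}")
print(f"C* = {critical_capacitance() * 1e6:.1f} uF")
```

By construction the damping factor equals 1 at C0 = C*, which is the boundary between real and complex conjugate poles; all the capacitor values considered in the results are far below C*, so the filter is always underdamped there.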
The time-domain response of the secondary circuit in Fig. 46.1 was evaluated as
a function of the circuit parameters by means of the LTSpice electrical simulator.
Since the SS WPT architecture behaves like a current generator [7], we represent
the circuit up to the bridge rectifier as a sinusoidal current generator iSS(t), with
25 A amplitude and 150 kHz frequency. These values resemble those commonly used in
medium-power applications such as drones [2, 9]. The diodes are modelled as ideal
switches with a voltage drop Vγ of 640 mV. The battery voltage is fixed at 22 V,
whereas LB and RB were extracted from a real Li-ion battery, as described in the
following subsection. The step directive in LTSpice allows us to perform a
parametric simulation, in which the value of C0 is logarithmically swept between
50 nF and 500 µF. Finally, the total power PB transferred to the battery and the
input-output efficiency η are evaluated. Note that PB is the active power transferred
to the battery, defined as the average of the product of the battery electromotive
force VB and the battery current iOUT(t) in Fig. 46.1, while η is the ratio between
PB and the power at the bridge rectifier input.
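The sweep and the two figures of merit defined above can be sketched as follows. This only mirrors the definitions in post-processing form and does not reproduce the LTSpice simulation: the number of points per decade, the function names and the sample values are assumptions, and the current samples are taken to be uniformly spaced over an integer number of periods.

```python
import math


def log_sweep(start, stop, points_per_decade):
    """Logarithmically spaced values from start to stop (both included)."""
    decades = math.log10(stop / start)
    n = int(round(decades * points_per_decade))
    return [start * 10 ** (k / points_per_decade) for k in range(n + 1)]


def battery_power(v_b, i_out_samples):
    """P_B: average of V_B * i_OUT(t) over uniformly spaced samples."""
    return v_b * sum(i_out_samples) / len(i_out_samples)


def efficiency(p_b, p_in):
    """eta: ratio between P_B and the power at the rectifier input."""
    return p_b / p_in


c0_values = log_sweep(50e-9, 500e-6, points_per_decade=10)
print(len(c0_values), "capacitor values from",
      f"{c0_values[0] * 1e9:.0f} nF to {c0_values[-1] * 1e6:.0f} uF")
```

Since VB is constant, only the mean value of iOUT(t) contributes to PB; the ripple of the battery current affects the losses on RB and therefore the efficiency, not the transferred power itself.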
The intrinsic resistance and inductance of a real Li-ion battery for drone
applications were measured by performing Electrochemical Impedance Spectroscopy
(EIS) on a TA-15C-16000-6S1P-EC5 battery. It consists of 6 cells in series,
with 22.2 V nominal voltage and 16 Ah capacity. The spectroscopy test was performed
by means of a Gamry Reference 3000 [10] set in galvanostatic mode. The instrument
applies a 0.1 A AC current to the single cell and measures the cell voltage
between 1 Hz and 500 kHz at ten points per decade. The resulting Bode diagrams
of the impedance were fitted with the Gamry Echem Analyst software [11], from which
the intrinsic resistance and inductance of the six cells of the battery were derived.
The resistance values show an average of 29.957 mΩ with a standard deviation of
76.085 µΩ, whereas the average inductance is 217.3 nH, with a standard deviation of
10.721 nH. Thus, the total battery resistance RB and inductance LB are 179.4 mΩ and
1.304 µH, respectively. These values were used by the simulator to perform the
analysis described in Sect. 46.2.2. Furthermore, the value C* defined in (46.2) is
160.5 µF. The power PB transferred to the battery and the efficiency η obtained from
the circuit simulations as a function of the capacitance C0 are shown in Fig. 46.2.
The results show that the power delivered to the battery and the circuit efficiency
depend significantly on the value of C0 and thus on the resonant frequency f0 of the
LC-filter. Considering the power graph in Fig. 46.2a, three intervals can be
identified, in which the circuit shows three different behaviors. Figure 46.3 shows
the current and voltage waveforms at the rectifier input for three C0 values, in (a),
(b) and (c), respectively, each one representing a particular behavior. Figure 46.3
also shows the current waveforms at the rectifier input (red), at the rectifier output
(blue) and in the battery (yellow), for the same three C0 values, in (d), (e) and (f),
respectively.
Fig. 46.3 Voltage vSS(t) and current iSS(t) waveforms at the input of the bridge rectifier, with C0
= 5 µF (a), C0 = 126 nF (b) and C0 = 50 nF (c). Bridge rectifier input current iSS(t) (red line),
bridge rectifier output current iIN(t) (blue line) and battery current iOUT(t) (yellow line) for C0 =
5 µF (d), C0 = 126 nF (e) and C0 = 50 nF (f)
For capacitance values higher than 500 nF, and thus for resonant frequencies lower
than 197 kHz, the power is almost constant at 350 W. In this capacitance interval the
bridge rectifier works in continuous mode. Its input voltage vSS(t) resembles a
square wave, as shown in Fig. 46.3a; the diodes D3, D4 and D1, D2 work as rectifying
pairs, and the filter provides a good low-pass effect on the battery current (see
Fig. 46.3d).
Instead, the power delivered to the battery grows for C0 values between 500 nF
and about 160 nF, i.e. for f0 between 197 and 349 kHz, as can be seen in Fig. 46.2a.
The bridge rectifier now works in discontinuous mode. There are time intervals in
each half-period of Fig. 46.3b where vSS(t) is fixed at −2 Vγ, because all the diodes
of the bridge conduct simultaneously. However, a beneficial effect is that vSS(t) is
closer to a sine wave than before, and the active power delivered to the load is
higher than in the previous case. As C0 approaches 160 nF, the phase angle between
the fundamental components of vSS(t) and iSS(t) decreases, and the power delivered
increases. This is a very appealing behavior, particularly for opportunity charging,
where the goal is to deliver the maximum possible power in a limited amount of time.
The drawback is the reduced filtering action, as the battery current becomes more
oscillatory (see Fig. 46.3e). However, this does not affect battery ageing, as
demonstrated in [12].
The power then starts to decrease when the capacitance becomes lower than 160 nF.
Finally, for capacitance lower than 50 nF and resonant frequency above 624 kHz, the
waveforms in Fig. 46.3c and the battery current in Fig. 46.3f exhibit substantial
overshoots and undershoots. This is a region to avoid, as the damping coefficient
drops below 0.02. The efficiency profile in Fig. 46.2b can also be divided into
three sections, with capacitance intervals similar to the previous ones. The efficiency
is nearly constant at 80% for C0 higher than 500 nF. Here, the iOUT(t) ripple is very low
and produces a negligible loss on the resistance of the battery. For capacitance lower
than 500 nF, the efficiency starts to decrease, because of the increased power losses
on the battery resistance, due to the higher iOUT(t) values, and on the diodes, due to
the discontinuous conduction.
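The capacitance and frequency thresholds quoted above are tied together by the resonance relation of an ideal LC filter, f0 = 1/(2π√(L·C0)). The following is a minimal sketch of that mapping, not the authors' sizing tool. The inductance value is not stated in this section, so it is inferred here from the quoted pair (500 nF, 197 kHz), which is an assumption; the function names and the wording of the region labels are illustrative.

```python
import math


def inductance_from_resonance(c_farad, f0_hz):
    """Inductance that resonates with c_farad at f0_hz, from f0 = 1/(2*pi*sqrt(L*C))."""
    return 1.0 / ((2.0 * math.pi * f0_hz) ** 2 * c_farad)


def resonant_frequency(l_henry, c_farad):
    """Resonant frequency of an ideal LC filter."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


# Assumption: L is inferred from the (500 nF, 197 kHz) pair quoted in the text.
L_EQ = inductance_from_resonance(500e-9, 197e3)


def describe(c_farad):
    """Return (f0 in Hz, operating region) for a given C0, using the quoted thresholds."""
    f0 = resonant_frequency(L_EQ, c_farad)
    if c_farad > 500e-9:
        region = "continuous conduction, nearly constant power"
    elif c_farad >= 160e-9:
        region = "discontinuous conduction, power rising as C0 decreases"
    elif c_farad >= 50e-9:
        region = "power decreasing"
    else:
        region = "poorly damped, to be avoided"
    return f0, region


if __name__ == "__main__":
    for c0 in (5e-6, 126e-9, 50e-9):  # the three C0 values of Fig. 46.3
        f0, region = describe(c0)
        print(f"C0 = {c0 * 1e9:7.1f} nF  f0 = {f0 / 1e3:6.1f} kHz  {region}")
```

With the inferred inductance (about 1.3 µH), the sketch reproduces the quoted 349 kHz and 624 kHz boundaries to within about 1%, which suggests the thresholds in the text follow from this single relation.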
46.4 Conclusions
Sizing C0 of the output filter of a WPT system for a stable output leads the
designer to oversize it, adding unnecessary volume and weight to the drone. The paper
shows that the filter can be reduced to a single capacitor, as the inductance can be
provided by the battery itself, reducing the on-board circuitry. The power delivered
to the battery and the process efficiency were evaluated as a function of the C0 value.
Since the aim of opportunity charging is to maximize the power delivered to the battery,
it has been shown that choosing a value of C0 that sets the resonant frequency of the
LC filter near twice the excitation frequency of the WPT system leads to the
maximum power transfer. The drawback is a reduced filtering effect on the battery
current and a non-optimal power transfer efficiency. If the goal is instead to
maximize the efficiency, a value of C0 that sets the resonant frequency of
the LC filter close to the WPT excitation frequency leads to the maximum efficiency
in power transfer.
References
1. Lu M, Bagheri M, James AP, Phung T (2018) Wireless charging techniques for UAVs: a review,
reconceptualization, and extension. IEEE Access 6:29865–29884
2. Campi T, Cruciani S, Feliziani M (2018) Wireless power transfer technology applied to an
autonomous electric UAV with a small secondary coil. Energies 11(2)
3. Zhang Z, Pang H, Georgiadis A, Cecati C (2019) Wireless power transfer—an overview. IEEE
Trans Ind Electron 66(2):1044–1058
4. Zhong WX, Hui SYR (2015) Maximum energy efficiency tracking for wireless power transfer
systems. IEEE Trans Power Electron 30(7):4025–4034
5. Liu X, Wang T, Yang X, Jin N, Tang H (2017) Analysis and design of a wireless power transfer
system with dual active bridges. Energies 10(10):1–20
6. Amanor-Boadu JM, Abouzied MA, Sanchez-Sinencio E (2018) An efficient and fast Li-ion
battery charging system using energy harvesting or conventional sources. IEEE Trans Ind
Electron 65(9):7383–7394
7. Zhang W, Mi CC (2016) Compensation topologies of high-power wireless power transfer
systems. IEEE Trans Veh Technol 65(6):4768–4778
8. Seborg DE, Mellichamp DA, Edgar TF, Doyle FJ (2010) Process dynamics and control. Wiley,
p 81
9. Campi T, Dionisi F, Cruciani S, De Santis V, Feliziani M, Maradei F (2016) Magnetic field levels
in drones equipped with wireless power transfer technology. Asia-Pac Int Symp Electromagn
Compat APEMC, 01:544–547
10. Gamry Reference 3000 (Online). Available: https://www.gamry.com/potentiostats/reference-3000/
11. Gamry Echem Analyst (Online). Available: https://www.gamry.com/application-notes/software-scripting/
12. De Breucker S, Engelen K, D’hulst R, Driesen J (2013) Impact of current ripple on Li-ion
battery ageing. World Electr Veh J 6:0532
Chapter 47
Distributed Video Antifire Surveillance
System Based on IoT Embedded
Computing Nodes
Abstract This paper presents the design and implementation of a distributed video
antifire surveillance system based on the Raspberry Pi embedded computing board and
the RPi Camera, able to detect smoke and autonomously trigger a fire alarm. These
smart cameras are placed in different areas under surveillance, connected together
according to an IoT scheme via wired (e.g. Ethernet) or wireless (e.g. Wi-Fi) links, and
accessible to several users via web browser. A centralized web interface node shows
the video stream of each camera in real time, while a video processing algorithm
is responsible for smoke identification and for the fire alarm decision.
Furthermore, the system automatically records the video in case of fire alarm.
Target applications are distributed smoke/fire alarms in smart cities, smart transport
systems or smart factories.
47.1 Introduction
Fire is an undesirable event that every year causes billions of dollars in damage to
property and the environment. State-of-the-art fire and smoke detection relies on
smoke/fire detector nodes that typically exploit ionization and photometry. Through
these mechanisms, they identify the presence of certain particles
and trigger a fire alarm. Although the technologies are becoming affordable, these
state-of-the-art systems react slowly in large areas and cannot
be installed in open spaces. Closed-circuit television (CCTV) systems and cameras,
instead, are already installed in factory buildings, city streets and public transportation
for surveillance purposes. Exploiting an already existing video infrastructure allows
The nodes of the network in Fig. 47.1 are Raspberry Pi 3 Model B
units, a low-cost, low-power single-board embedded computer. The board is equipped
with a Broadcom BCM2837, a System on Chip (SoC) including a 1.2 GHz 64-bit
quad-core ARM Cortex-A53 processor, 512 KB of L2 cache, 1 GB of LPDDR2 RAM,
VideoCore IV GPU, 4 USB 2.0 ports, on-board 2.4 GHz 802.11n Wi-Fi, Bluetooth
4.1 Low Energy, 40 GPIO pins and many other features [7]. The camera module used
in the implementation is a Pi Camera Board v1.3, able to deliver 5 MP still
images or 1080p HD video at 30 fps [8]. The Pi Camera plugs directly
into the CSI (Camera Serial Interface) connector of the Raspberry Pi board.
We used a common Netgear DGN2200v3 as network router.
components. In fact, the web application is written in Python using Flask [11], a
web application framework based on the Werkzeug WSGI (Web Server Gateway Interface)
toolkit and the Jinja2 template engine. WSGI is a specification for a universal
interface between web servers and web applications, and the Werkzeug toolkit
implements request and response objects and utility functions. Jinja2 is a template
engine that, combined with a data source (HTML template, relational database, XML
files, etc.), renders dynamic web pages. The monitoring system makes use
of three main templates: login.html, home.html and index.html. Each template is
bound to a specific Python function, and the three functions serve three main
URLs respectively: http://central-hub-IP/login, http://central-hub-IP/home and
http://node-IP/, where “central-hub-IP” is the IP address of the Central Node and
“node-IP” is the IP address of the smart node. The login page, like the other
web pages, is based on the Semantic UI [12] framework, which allows fast
and concise HTML to be written, along with a fully mobile-responsive user experience.
The system is built for multi-user usage, so a database layer is added to the application.
Flask does not provide any database support out of the box, so we
used the SQLAlchemy library [13] to manage and query the relational
SQLite [14] database hosted on the central hub Raspberry Pi. The structure of the
database is very simple: a single table called “users” with three columns, namely
id, which acts as primary key, username and password.
Each tuple of the table represents a user who is allowed to access
the Antifire System. OpenCV 3.4 [15] is responsible for the management of the Pi
camera.
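The single-table user store described above can be sketched as follows. This is only an illustration of the schema, not the authors' implementation: it uses Python's built-in sqlite3 module instead of SQLAlchemy to stay dependency-free. The UNIQUE constraint on username and the storage of a salted PBKDF2 hash in the password column are assumptions not stated in the text, and all function names are hypothetical.

```python
import hashlib
import hmac
import os
import sqlite3


def open_user_db(path=":memory:"):
    """Open (or create) the database with the 'users' table: id, username, password."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, "
        "username TEXT UNIQUE NOT NULL, "  # UNIQUE is an assumption
        "password TEXT NOT NULL)"
    )
    return conn


def _hash_password(password, salt):
    # Assumption: passwords are stored as salted PBKDF2-SHA256 hashes, not plain text.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000).hex()


def add_user(conn, username, password):
    """Insert a new user; the password column holds '<salt hex>$<hash hex>'."""
    salt = os.urandom(16)
    stored = salt.hex() + "$" + _hash_password(password, salt)
    conn.execute(
        "INSERT INTO users (username, password) VALUES (?, ?)", (username, stored)
    )
    conn.commit()


def check_login(conn, username, password):
    """Return True only if the user exists and the password matches."""
    row = conn.execute(
        "SELECT password FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False
    salt_hex, digest = row[0].split("$", 1)
    candidate = _hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, digest)
```

In a Flask application, a function like `check_login` would typically be called from the view bound to the login URL before the session is marked as authenticated.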
in our laboratory, the system was tested on a video playback displaying smoke on
a PC monitor and captured in real time by one of the RPi camera nodes, as shown
in Fig. 47.4. The web interface shows the bounding box around the correctly
detected smoke, and a fire alarm is triggered within a few seconds, as expected.
This paper has proposed a low-cost implementation of a distributed Smart Antifire
Surveillance System based on the Raspberry Pi embedded platform. The system takes
advantage of an existing Video Smoke Detection algorithm, proposed by the authors, to
detect smoke in real time from several cameras distributed in different areas. Every
smart node can feed a central hub that collects the live video streams,
while a web interface allows users to monitor the cameras, save video and
modify camera parameters. A simple test carried out with four Raspberry Pi 3 boards
proved the feasibility of the whole system. The communication takes place over HTTP,
so it is not encrypted and is vulnerable to man-in-the-middle and eavesdropping
attacks. A future improvement would be to implement HTTPS communication
to protect the authenticity of the web interface, to secure accounts and to
keep user communication and identity private. Another possible improvement would
be an automatic email notification to the users when a fire alarm is raised.
References
Abstract With reference to SW-controlled mechatronic units for the new generation
of electrified and assisted vehicles, this work proposes and validates a methodology
to simulate together their three main subsystems: electronic HW components (passive
and active, both integrated circuits and board-level components), the algorithms and
their SW implementation running on a microcontroller unit, and the mechanical part.
With the support of MAGNA, worldwide leader in the production of automotive
components, particularly door systems, we have considered the Smart Latch as
case study. The Smart Latch is a new, SW-controlled, mechatronic door latch. The
proposed methodology allows the creation of a digital virtual design and verification
environment suited both to the design phase, for multi-domain component specification
(HW, SW, mechanics), and to diagnostics/verification in case of faults.
48.1 Introduction
Due to the increasing demands of the modern economy, the development of new
products must reach new levels of complexity and implemented
intelligence, while saving resources and reducing the time needed for design and
production. Such products are often interdisciplinary systems, also called
mechatronic systems, referring to the synergistic integration of SW, electronics and
mechanics. Mechatronics was defined by Harashima
F. as “the synergistic combination of mechanical and electrical engineering,
computer science, and information technology, which includes control systems as well
as numerical methods used to design products with built-in intelligence”.
To save time and cost in design and production,
model-based design techniques allow the behavior of the designed structure to be
simulated and optimized in conditions similar to the real ones. After the
virtual development of the product, the system must be implemented in the real world
and tested in real-life conditions in order to validate the simulation results. The testing
procedure implies the validation of different parts of the product, such as control
algorithms, sensor systems, actuation systems and electronic boards. Without adequate
tools this step can be very challenging and time consuming, hence the need for methods
and tools that assist in the various processes involved in realizing such products.
The application of computer-aided engineering (CAE) capabilities, together with
custom engineering strategies within the company, is an approach that many
mechatronic companies are adopting. The goal is to make all tools and models usable
throughout the whole design process, from the first conceptual ideas to the
troubleshooting of systems that are already in revenue service. This article presents
and tests a methodology to simulate in an integrated environment all three subsystems of
a mechatronic system: electronic HW components (passive and active, integrated
circuits and board-level components), the algorithms and their SW implementation
running on a microcontroller unit, and the mechanical part.
With the support of MAGNA, leader in automotive door systems, we have
considered the “Smart Latch” as case study [1–3]. It is a new SW-defined mechatronic
product and a good candidate for applying the methodology presented below,
since it is based on the interaction among mechanics, electronics and control
algorithms (SW running on a microcontroller). Hereafter, Sect. 48.2 reviews the Smart
Latch architecture and the limits of state-of-the-art, commercially available design and
verification methodologies for SW-defined mechatronics. Section 48.3 presents the
integrated simulation/verification environment. Simulation results and their comparison
with experimental measurements are discussed in Sect. 48.4. Section 48.5 presents the
conclusions and a comparison with the state of the art.
The “Smart Latch” developed by MAGNA creates a new paradigm for side-door
latches as “the first fully electronic side door latch in the market”. The industry-first
application of the latch is on the BMW i8, and it has been selected by other global
automotive manufacturers for future vehicle programs. The features of the “Smart Latch”
are: removal of all mechanical latch system components, eliminating the need for
cables, rods and moving handles in the door; significant weight savings compared
to mechanical latches; reduced number of components; flexibility to be used in any
type of car or truck; improved safety and sound quality.
The “Smart Latch” system includes an on-board ECU (Electronic Control Unit)
that has power backup capabilities and generates the signals driving the motors for a
soft-close function for automatic door cinching. The “Smart Latch” provides additional
functions such as connection to the car network, diagnostics, passive entry, crash
detection and post-crash safety. Figure 48.1 shows a block diagram of the whole “Smart
Latch” system, highlighting the subsystems that make up the model of the
“Smart Latch”. The “Smart Latch” has HW blocks (indicated by blue arrows in
Fig. 48.1) that interface the microcontroller core to the power supply, motors
and sensors. The microcontroller of the Smart Latch runs a state machine that
implements the control algorithm (orange arrow in Fig. 48.1). Sensors and motors interact
with the mechanical part (red arrow in Fig. 48.1) to accomplish the door latch functions.
Today many SW houses develop tools to aid engineers in designing
new mechatronic systems. In the following, we consider the three main tools on the
market.
Simulink Simscape: It provides an environment for modeling and simulating
physical systems spanning mechanical, electrical, hydraulic, and other physical
domains. It provides fundamental building blocks from these domains that you
can assemble into models of physical components, such as electric motors, invert-
ing op-amps, hydraulic valves, and ratchet mechanisms. Because Simscape com-
ponents use physical connections, the models match the structure of the system
under development. Simscape models can be used to develop control systems and
test system-level performance. The libraries can be extended using the MATLAB
simulator application for simulation and verification of analog and mixed-signal
circuits. PSpice is an acronym for Personal Simulation Program with Integrated Circuit
Emphasis. OrCAD EE typically runs simulations for circuits defined in OrCAD Capture,
and can optionally integrate with MATLAB/Simulink using the Simulink to
PSpice Interface (SLPS, recently renamed OrCAD PSpice Systems Option). OrCAD
Capture and PSpice Designer together provide a complete circuit simulation and
verification solution with schematic entry and native analog, mixed-signal and
analysis engines. The OrCAD PSpice-Simulink integration, OrCAD PSpice Systems
Option, provides co-simulation and helps verify system-level behavior. A circuit to
be analyzed is described by a circuit description file, which is processed by PSpice
and executed as a simulation.
ADAMS MSC: ADAMS is the acronym of Automated Dynamic Analysis of Mechanical
Systems and is a multibody dynamics simulation SW [6] equipped with
Fortran/C++ numerical solvers. Adams has proved valuable for VPD (Virtual
Prototype Development), reducing product time to market and product development
costs. ADAMS provides some basic modules: Adams/View; Adams/Solver;
Adams/Postprocessor. Several additional modules, sold separately, are available for
extended functionality, for example: vibration analysis, including mode shape analysis,
through ADAMS/Vibration; SISO and MIMO closed-loop control system modeling
and simulation through ADAMS/Controls; simulation of flexible links
via Adams/ViewFlex and/or Adams/Flex. Its approach to flexible body modelling
is that of modal analysis, which uses a modal neutral file. Adams/Controls is very well
integrated into MathWorks Simulink through S-functions. A closed loop between
Simulink and Adams/Controls makes the simulation of non-LTI systems very simple.
A non-linear, time-varying model of the plant runs within ADAMS/Controls and
reports its behavior to Simulink as feedback via named pipe or TCP/IP communication;
a controller within Simulink processes this feedback and, through
actuators, acts on the ADAMS/Controls plant over the same communication scheme.
Through the control export mechanism, ADAMS/Controls can also provide MATLAB’s
Control System Toolbox with a state-space model of the system under study, to be used
for controller design. Adams also supports importing a compiled DLL
version of Simulink models built using Simulink Coder. The Functional Mock-up
Interface (FMI), an open standard interface for coupling tools
from different vendors for Model Exchange and Co-simulation, is supported as well.
Adams is highly integrated with the Actran frequency-domain solver for chained
simulation analyses of moving mechanisms, such as gearbox run-up, and impact noise
studies, such as door latch mechanisms.
Both OrCAD and Adams allow their models to be imported into Simulink. This
means that with the “PSpice Systems Option” tool we can import the OrCAD
model of the electronic system into “Simulink”, and with “Adams/Controls” we can do
the same for the ADAMS model of the mechanical system. In addition, the availability
of “Stateflow”, which is already a “Simulink” tool, suggests choosing Simulink
as the main software to develop the model of a mechatronic system. The advantage
of this approach is that we can exploit the models already developed by MAGNA
for electronics, mechanics and control SW. In this way we avoid the model copying
errors that can occur when a model is translated between two SW tools based on
different languages.
To exploit the OrCAD model developed by MAGNA in co-simulation with
Simulink, we have added to the OrCAD model voltage generators that emulate the output
pins of the microcontroller, and equivalent circuits of the DC motors to take into account
the effect of the counter EMF. The equivalent circuit of the DC motors, corresponding
to the block marked by the dotted shape in Fig. 48.1, is
necessary to obtain a system that takes into account the interaction between
electronics and mechanics. In fact, the torque developed by a motor is proportional
to the current that flows through the inductance of the electrical model of the DC motor,
and the counter EMF is proportional to the rotor angular velocity.
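The electro-mechanical coupling just described (torque proportional to armature current, counter EMF proportional to rotor speed) can be illustrated with a textbook lumped DC motor model. This is a generic sketch, not the MAGNA OrCAD model: the series R-L armature circuit, the viscous friction term and all parameter values are assumptions chosen only for illustration.

```python
def simulate_dc_motor(v_supply, r, l, kt, ke, j, b, t_end, dt=1e-5):
    """Forward-Euler integration of a lumped DC motor model.

    Electrical side:  l * di/dt = v_supply - r * i - ke * omega   (ke * omega = counter EMF)
    Mechanical side:  j * domega/dt = kt * i - b * omega          (kt * i = motor torque)
    Returns (current in A, angular velocity in rad/s) at t_end.
    """
    i = 0.0
    omega = 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        back_emf = ke * omega
        torque = kt * i
        di = (v_supply - r * i - back_emf) / l
        domega = (torque - b * omega) / j
        i += di * dt
        omega += domega * dt
    return i, omega


def steady_state_speed(v_supply, r, kt, ke, b):
    """Analytic steady-state speed obtained by setting both derivatives to zero."""
    return v_supply / (ke + r * b / kt)


# Hypothetical parameters (SI units), not taken from the Smart Latch.
PARAMS = dict(r=1.0, l=1e-3, kt=0.01, ke=0.01, j=1e-5, b=1e-6)

if __name__ == "__main__":
    current, speed = simulate_dc_motor(12.0, t_end=1.0, **PARAMS)
    print(f"current = {current:.4f} A, speed = {speed:.1f} rad/s")
```

The counter EMF term is what couples the two domains: as the rotor accelerates, it reduces the armature current and therefore the torque, which is why the equivalent circuit cannot be omitted in an electronics/mechanics co-simulation.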
To take into account the sensors mounted on the mechanics, some new library parts
have been developed and added to the OrCAD model. By connecting these sensor parts
to other voltage generators, we obtain inputs for the electronics model that depend
on the position of the sensors in the mechanics model. To exploit the ADAMS model,
instead, we added some state variables before exporting the model to MATLAB.
After these modifications of the OrCAD and ADAMS models, and by applying the tools
presented before, we obtain a single integrated environment that incorporates all
the subsystems used to design a mechatronic system. Figure 48.3 shows the results of the
model.
In the first test type the boost converter is not running because the battery is fully
charged. To replicate this condition in the proposed model, we have set the power
supply to 12 V and disabled the boost converter. To compare the model with the
real “Smart Latch”, we have measured the voltage of the signals used to drive the power
release DC motor (pin A and pin B) and the current drawn from the battery by the Smart
Latch (Battery current). Real system measurements are reported in Fig. 48.3 with blue
lines, while model results are the red lines in the same figure. The plots in
Fig. 48.3 show that the driving signals are constant during the power release action
and PWM-modulated during the power reset action.
The first part corresponds to the power release action; in the first two plots of
Fig. 48.3 we observe some difference between the real and the modeled command used to
drive the DC motors. This translates into the difference between the two data sets
visible in the first part of the “Battery current” plot in Fig. 48.3. The same
plot highlights the difference in stalling time between the two data sets,
probably due to the unmodeled friction torque and unmodeled command. The second
part corresponds to the power reset action: as can be seen in the first two plots
of Fig. 48.3, the signals used to drive the motor are PWM, but
the real Smart Latch uses a variable duty cycle, whereas in the model we have
assumed a constant duty cycle.
In the second test type the battery is fully discharged and the boost converter is running. To
replicate this condition in the proposed model, we have set the power supply to 0 V
and enabled the boost converter. To compare the model with the real “Smart Latch”,
we have measured the voltage of the signals used to drive the power release DC motor
(pin A and pin B), the current drawn by the release DC motor (motor current) and
the voltage generated by the boost converter of the Smart Latch (V PROT). Real system
measurements and model results are shown in Figs. 48.4 and 48.5 with blue lines and
red lines, respectively.
From the comparison between the signals obtained with the proposed model and the
measurements made on a real Smart Latch, we can conclude that:
– The difference in motor stalling time is probably due to the friction present in the real
mechanics.
– The current plots show that the signals extracted from the simulations are accurate,
since the average error versus real measurements is limited; hence the model
obtained is a good approximation of reality.
– Simulating 250 ms of operation of a real SW-defined mechatronic component
takes about 7 h (CPU i7-4500U @ 1.80 GHz, 2.40 GHz, with 4 GB RAM), i.e. roughly
10^5 times slower than real time, but a single simulation provides information on the
behavior of all three subsystems.
– The developed model is a starting point for an even more
accurate model that also explores in detail the integration of the real control software
running on the Smart Latch.
– For a more thorough study, comparisons between measurements
and simulations of other parts of the “Smart Latch” could be added.
References
Abstract The aim of this work is to propose a Spice model of a photovoltaic panel for
electronic system design. The model is based on the Rp-model of the PV cell and implements
the variations of open-circuit voltage and short-circuit current with temperature and
solar irradiation. The model was implemented in the LTSpice software and characterized
by comparing it with the System Advisor Model (SAM) software and MATLAB models
for a commercial panel. The resulting I-V and P-V curves are reported here.
49.1 Introduction
Simulation models of photovoltaic modules make it possible to characterize their
behavior and to find the maximum power point as solar irradiation and
temperature vary. This kind of simulation is also important for the analysis and design
of the electronic circuits that exploit them as a power source [1–7] or for Smart
modules [8]. Moreover, with the recent development of research on energy
harvesting systems, MPPT (maximum power point tracking) techniques are being
integrated into circuits with an energy consumption of a few mW. In
these cases the electronic interface circuit with the panel must be carefully designed
to increase efficiency [9]. In general, reproducing the current-voltage (I-V)
characteristic of a commercial panel in a circuit simulator is not straightforward
for photovoltaic generator models. Several models for OrCAD PSpice or LTSpice
are implemented in [10–16].
In this paper a Spice model of a photovoltaic panel for electronic system design
is presented. The model, based on the Rp-model of the PV cell with five input parameters,
implements the variation of open-circuit voltage and short-circuit current with solar
irradiation and temperature. A commercial panel was chosen from the SAM software
database and the results were compared with the I-V and P-V curves of the SAM and
MATLAB models.
In Eq. (49.1), ID is the diode conduction current, a is the ideality factor of the
diode and I0 is the saturation current. Furthermore, k is the Boltzmann
constant (1.3806503 × 10^−23 J/K), q is the magnitude of the electron charge
(1.60217646 × 10^−19 C) and T is the junction temperature (K).
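For reference, the symbols defined above match the standard Shockley diode relation. The form below is stated as an assumption consistent with these definitions and with the diode term of Eq. (49.2), with V the voltage across the diode:

```latex
I_D = I_0 \left[ \exp\!\left( \frac{q V}{a k T} \right) - 1 \right]
```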
The model of the real behaviour of the cell is described by Eq. (49.2), which
includes the series resistance Rs and the parallel resistance Rp (also called shunt
resistance). The former represents the internal losses of the cell, while the latter
describes the leakage currents [16]. Equation (49.2) has five parameters: Ipv, a, I0, Rs and
Rp.
$$ I = I_{pv} - I_0 \left[ \exp\!\left( \frac{q \left( V + R_s I \right)}{a k T} \right) - 1 \right] - \frac{V + R_s I}{R_p} \tag{49.2} $$
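Because the current I appears on both sides of Eq. (49.2), the I-V curve has to be obtained numerically. The following is a minimal sketch that solves the equation by bisection for a given terminal voltage. It is not the LTSpice implementation of the paper: the function name and all parameter values are hypothetical, and the ideality factor is treated as a lumped panel-level value.

```python
import math

K_BOLTZMANN = 1.3806503e-23  # J/K, value quoted in the text
Q_ELECTRON = 1.60217646e-19  # C, value quoted in the text


def panel_current(v, i_pv, i_0, a, r_s, r_p, temp_k=298.15, iters=80):
    """Solve the implicit Eq. (49.2) for I at terminal voltage v by bisection.

    The residual is strictly decreasing in I, so the root is unique. The bracket
    [-i_pv, i_pv] is adequate for voltages between 0 and roughly the open-circuit
    voltage with parameters of the order used below.
    """
    v_t = K_BOLTZMANN * temp_k / Q_ELECTRON  # thermal voltage kT/q

    def residual(i):
        v_d = v + r_s * i  # voltage across the diode and Rp
        return i_pv - i_0 * (math.exp(v_d / (a * v_t)) - 1.0) - v_d / r_p - i

    lo, hi = -i_pv, i_pv
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameters, NOT the values of Table 49.1.
DEMO = dict(i_pv=1.35, i_0=1e-9, a=36.0, r_s=0.5, r_p=500.0)

if __name__ == "__main__":
    for volts in (0.0, 5.0, 10.0, 15.0, 19.0):
        amps = panel_current(volts, **DEMO)
        print(f"V = {volts:5.1f} V  I = {amps:6.3f} A  P = {volts * amps:6.2f} W")
```

Bisection is used instead of Newton's method because it cannot diverge on the steep exponential part of the curve near the open-circuit voltage.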
Photovoltaic panels have voltage and current variations that depend on temperature
and solar irradiation. In the datasheets of the panels, two coefficients, Ki and
Kv, expressed in %/°C, account for these variations. The first parameter Ki is the
temperature coefficient of ISC (short-circuit current), while Kv is the temperature
coefficient of VOC (open-circuit voltage).
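$$ V_{OC} = V_{OC,STC} \left[ 1 + K_v \left( T - T_{STC} \right) \right] + a \, \frac{kT}{q} \ln\!\left( \frac{G}{G_{STC}} \right) \tag{49.3} $$

$$ I_{SC} = I_{SC,STC} \left[ 1 + K_i \left( T - T_{STC} \right) \right] \frac{G}{G_{STC}} \tag{49.4} $$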
In Eqs. (49.3) and (49.4) the open-circuit voltage VOC,STC and the short-circuit
current ISC,STC are the values reported by the manufacturer, measured in standard test
conditions (STC) with ambient temperature TSTC = 25 °C and solar irradiation GSTC
= 1000 W/m².
The model was implemented in the LTSpice software [18] using the scheme shown
in Fig. 49.1, with a current generator, a diode and two resistors. The diode model
must be scaled as shown in [19], because voltage and current vary with
temperature and solar irradiation. From Eq. (49.1) we can derive the new value
of the ideality factor a of the solar cell using the open-circuit condition with V = VOC,
ID = ISC and T = 300 K [17]. The new value of a is placed in the Spice
model of the diode as parameter N. Nevertheless, this is not enough to make the diode
Spice model dependent on temperature and solar irradiation. Indeed, the
only parameters of the level 1 diode Spice model that determine the variation of the
current ID with temperature are XTI (saturation-current temperature
exponent) and EG (energy gap), in addition to the Spice temperature parameter T
[20]. These parameters must be set to N times their default
values. Equations (49.5) and (49.6) give the new parameters to be included
in the diode model in LTSpice.
$$ N = a = \frac{\dfrac{q \, V_{OC}}{k T}}{\ln\!\left( \dfrac{I_{SC}}{I_0} \right)} \tag{49.5} $$

$$ EG = 1.11 \, N, \qquad XTI = 3 \, N \tag{49.6} $$
Unlike in the expressions of [19], the values returned by Eqs. (49.3) and (49.4) can be
used for VOC and ISC in (49.5). Thanks to these equations,
the diode current varies with both temperature and
solar irradiation.
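The chain from datasheet values to Spice diode parameters, Eqs. (49.3) to (49.6), can be sketched numerically as follows. This is only an illustration under stated assumptions: the temperature coefficients are taken as fractions per kelvin (the datasheet values in %/°C divided by 100), the function names are hypothetical, and the coefficient, saturation current and ideality factor values in the usage example are placeholders rather than the entries of Table 49.1.

```python
import math

K_BOLTZMANN = 1.3806503e-23  # J/K, value quoted in the text
Q_ELECTRON = 1.60217646e-19  # C, value quoted in the text
T_STC = 298.15               # K (25 degrees C)
G_STC = 1000.0               # W/m^2


def open_circuit_voltage(v_oc_stc, k_v, a, temp_k, g):
    """Eq. (49.3): V_OC at temperature temp_k (K) and irradiation g (W/m^2)."""
    v_t = K_BOLTZMANN * temp_k / Q_ELECTRON
    return v_oc_stc * (1.0 + k_v * (temp_k - T_STC)) + a * v_t * math.log(g / G_STC)


def short_circuit_current(i_sc_stc, k_i, temp_k, g):
    """Eq. (49.4): I_SC at temperature temp_k (K) and irradiation g (W/m^2)."""
    return i_sc_stc * (1.0 + k_i * (temp_k - T_STC)) * g / G_STC


def diode_spice_params(v_oc, i_sc, i_0, temp_k=300.0):
    """Eqs. (49.5) and (49.6): N, EG and XTI for the level 1 diode model."""
    v_t = K_BOLTZMANN * temp_k / Q_ELECTRON
    n = (v_oc / v_t) / math.log(i_sc / i_0)
    return {"N": n, "EG": 1.11 * n, "XTI": 3.0 * n}


if __name__ == "__main__":
    # Placeholder coefficients; V_OC and I_SC at STC are the SAM values of Table 49.2.
    v_oc = open_circuit_voltage(19.4, k_v=-0.003, a=36.0, temp_k=318.15, g=500.0)
    i_sc = short_circuit_current(1.35, k_i=0.0005, temp_k=318.15, g=500.0)
    print(v_oc, i_sc, diode_spice_params(v_oc, i_sc, i_0=1e-9))
```

At STC both correction terms vanish, so the functions return the datasheet values unchanged, which is a quick sanity check on any implementation of these equations.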
To compare the results of the proposed Spice model, a panel present in the database
of the SAM (System Advisor Model) software by NREL [21] is used. The panel
chosen is named “Pythagoras Solar Midi PVGU Windows”, a Mono-c-Si panel with
parameters shown in Table 49.1.
Figure 49.2 shows the LTSpice sub-circuit instance of the proposed model, in
which the simulation parameters can be set. For the
simulation of the I-V characteristic, a variable resistive load was used.
The LTSpice simulation returns the I-V and P-V characteristics of the analyzed
panel, shown in Fig. 49.3. The simulation results have been compared with the SAM
software data and with a commonly used MATLAB model [22]. All the
models have been simulated in STC (1000 W/m², 25 °C).
Further results of the simulations are shown in Table 49.2. The relative error with
respect to SAM of the proposed Spice model is lower than that of the MATLAB
model for maximum power and short-circuit current. The MATLAB model shows a
lower relative error than the proposed Spice model for the open-circuit voltage.
Fig. 49.3 Spice simulation results of the proposed Spice model: a I-V characteristic; b P-V
characteristic
Table 49.2 Relative error of the proposed Spice model and of the MATLAB model with respect to
the SAM model
Model    Pmax [W]  Voc [V]  Isc [A]  Pmax,ε [%]  Voc,ε [%]  Isc,ε [%]
SAM      20.286    19.400   1.350    –           –          –
SPICE    20.265    19.356   1.350    0.103       0.228      0.002
MATLAB   20.808    19.399   1.361    −2.571      0.008      −0.807
Fig. 49.4 Spice simulation results with a variation of the solar irradiance (a) and temperature (b)
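The error columns of Table 49.2 can be checked with a few lines of code. The sign convention used below, (SAM − model)/SAM × 100, is not stated in the text; it is inferred from the tabulated values, so it should be read as an assumption. Small differences in the last digit are expected because the table entries are rounded.

```python
def relative_error_percent(reference, value):
    """Relative error of value with respect to reference, in percent."""
    return (reference - value) / reference * 100.0


# Values copied from Table 49.2.
SAM = {"Pmax": 20.286, "Voc": 19.400, "Isc": 1.350}
SPICE = {"Pmax": 20.265, "Voc": 19.356, "Isc": 1.350}
MATLAB = {"Pmax": 20.808, "Voc": 19.399, "Isc": 1.361}


def error_row(model):
    """Relative errors of one model with respect to SAM for all three quantities."""
    return {key: relative_error_percent(SAM[key], model[key]) for key in SAM}


if __name__ == "__main__":
    for name, model in (("SPICE", SPICE), ("MATLAB", MATLAB)):
        row = error_row(model)
        print(name, {key: round(val, 3) for key, val in row.items()})
```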
Furthermore, Fig. 49.4 shows Spice simulations carried out while varying the solar
irradiation and the temperature, made possible by Eqs. (49.3) and (49.4). The values
chosen were 1000, 500 and 100 W/m² for the solar irradiation and 25, 45 and
65 °C for the temperature.
49.5 Conclusion
In this work a Spice model of photovoltaic panel for electronic system design was
presented. The model is based on Rp -model of PV cell with five input parameters.
The model implements the equations for the variation of the open-circuit voltage
References
17. Chin V, Salam Z, Ishaque K (2015) Cell modelling and model parameters estimation techniques
for photovoltaic simulator application: a review. Appl Energy 154:500–519
18. LTspice|Design Center|Analog Devices. https://www.analog.com/en/design-center/design-tools-and-calculators/ltspice-simulator.html
19. Intusoft Newsletter (2005). http://intusoft.com/nlhtm/nl78.htm#The_Solar_Cell_SPICE_Model. Available online on 6 Sept 2019
20. Diode Model (PN-Junction Diode Model). http://literature.cdn.keysight.com/litweb/pdf/ads2008/ccnld/ads2008/Diode_Model_(PN-Junction_Diode_Model).html
21. Home - System Advisor Model (SAM). https://sam.nrel.gov/
22. Implement PV array modules - Simulink - MathWorks Benelux. https://nl.mathworks.com/help/physmod/sps/powersys/ref/pvarray.html
Chapter 50
Exhaustive Modeling of Electric Vehicle
Dynamics, Powertrain and Energy
Storage/Conversion for Electrical
Component Sizing and Diagnostic
Abstract Electric Vehicles (EVs) will play a major role in meeting Europe’s need
for clean and efficient mobility. The development of new simulation tools, functional-
ities and methods integrated with the controlled development of a vehicle-centralized
controller will also be part of the future solutions for the next generation of EVs.
To improve the safety analysis and reduce costs, the solutions will be based on
flexible user-friendly interfaces and specialized software tools. This work presents
an exhaustive modeling of EV dynamics, powertrain and energy storage/conversion.
The simulation model is useful for both electrical component sizing at design time
and on-board diagnostics to check component aging. The aim is to model the transient
response of the system while preserving the simplicity and feasibility of simulation.
The design of an EV requires, among others, the development and optimization of
a complete electric powertrain system, including the longitudinal car, battery sys-
tem components, power electronics, electric machine and control system. The paper
presents the modelling and implementation of an entire powertrain system of EVs to
describe the EV dynamics with respect to mechanical and electrical system compo-
nents. Mathematical models based on equations and equivalent circuits are developed
and implemented in MATLAB-Simulink and further study for predicting the final
vehicle driving performance is performed.
50.1 Introduction
described by Eq. (50.1), where Ft is the traction force, α is the angle of the driving
surface, M is the mass of the vehicle, V is the velocity of the vehicle, a is the
acceleration of the vehicle, g is the free fall acceleration, ρ is the density of dry air, Crr
is the tire rolling resistance coefficient, Cd is the aerodynamic drag coefficient and
Af is the frontal area. Table 50.1 shows the specification of the vehicle dynamic model
considered for simulation results in Sect. 50.3. The values in Table 50.1 refer to a
light-duty vehicle (e.g. a 3-wheel electric scooter) but the model is parametric and
can be applied to any vehicle. Indeed, for the simulations in Sect. 50.4 the values in
Table 50.1 have been rescaled for a 450 kg electric vehicle, like the Renault Twizy.
$$F_t = M a + M g \sin(\alpha) + M g \cos(\alpha)\, C_{rr} + \frac{1}{2} \rho\, C_d\, A_f\, V^2 \qquad (50.1)$$
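A minimal numerical sketch of the longitudinal force balance may help when sizing components. It is written in Python (not the authors' Simulink implementation) and takes the aerodynamic term in its standard form, quadratic in speed. The 450 kg mass and the 60 km/h speed come from the text; the rolling resistance, drag coefficient, frontal area and air density are placeholder assumptions, since Table 50.1 is not reproduced here.

```python
import math


def traction_force(mass, accel, slope_rad, speed, c_rr, c_d, frontal_area,
                   rho=1.225, g=9.81):
    """Traction force [N] as the sum of inertial, grade, rolling and drag terms."""
    inertial = mass * accel
    grade = mass * g * math.sin(slope_rad)
    rolling = mass * g * math.cos(slope_rad) * c_rr
    drag = 0.5 * rho * c_d * frontal_area * speed ** 2
    return inertial + grade + rolling + drag


# Placeholder parameters (assumptions): c_rr, c_d, frontal_area, rho.
# Constant 60 km/h on a flat road, so only rolling resistance and drag remain.
force = traction_force(mass=450.0, accel=0.0, slope_rad=0.0, speed=60.0 / 3.6,
                       c_rr=0.01, c_d=0.5, frontal_area=1.5)
power_at_wheels = force * 60.0 / 3.6  # mechanical power [W] at constant speed
print(f"traction force: {force:.1f} N, power at wheels: {power_at_wheels:.0f} W")
```

Multiplying the force by the speed gives the mechanical power at the wheels, which is the quantity that the motor and battery sizing ultimately has to cover.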
Due to its high power density and high efficiency, a PMSM is
selected as the propulsion system for the vehicle. The electric machine model is divided into
an electrical part and a mechanical part. In the dq reference frame [7], the
d-axis voltage Vd and the q-axis voltage Vq of the PMSM are expressed as:
$$V_d = R_s i_d + L_d \frac{d i_d}{dt} - \omega_r L_q i_q \qquad (50.2)$$
$$V_q = R_s i_q + L_q \frac{d i_q}{dt} + \omega_r L_d i_d + \omega_r \lambda_d \qquad (50.3)$$
where Rs is the stator winding resistance, ωr is the electrical angular speed of
the rotor, Ld and Lq denote the dq-axes inductance components, and λd is the flux
linkage. The electromagnetic torque of the PMSM Tm and the mechanical dynamics
are given by Eqs. (50.4) and (50.5), respectively:
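$$T_m = \frac{3}{4}\, p \left[ \lambda_d i_q + (L_d - L_q)\, i_d i_q \right] \qquad (50.4)$$
$$J_m \frac{d\omega_m}{dt} = T_m - T_L - B_m \omega_m, \qquad \omega_m = \frac{2}{p}\, \omega_r \qquad (50.5)$$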
where ωm is the mechanical angular speed of the rotor, and Jm, Bm, and TL are the
moment of inertia, viscous friction coefficient, and load torque, respectively.
Regenerative braking is used in this EV [8]. The control strategy best
suited to the PMSM is Indirect Field Oriented Control (FOC). The torque
produced by the PMSM is controlled indirectly, by monitoring the stator current is. The
reference currents, isd* and isq*, are obtained from the Maximum Torque Per Ampere
(MTPA) control strategy [9]. Proportional-Integral (PI) regulators are chosen for
the control algorithm. A generic Lithium-ion battery model based on Shepherd’s
model [10] has been taken from Simulink, the MATLAB graphical editor,
and used in this work. The battery has a maximum rated capacity
of 20 Ah and a nominal voltage of 215 V. A MATLAB function is also included to
keep the battery operating within the correct ranges. Powering the vehicle
with batteries has the benefit of zero emissions and allows the braking energy,
otherwise lost, to be recaptured. Recapturing the braking energy requires bidirectional DC/DC converters
[11]. The parameters of the Simulink model built for the PMSM drive are reported in
Table 50.2. To demonstrate the model scalability, two sets of values for two different
PMSMs, i.e. PMSM 1 [12] and PMSM 2 [13], are reported and used in Sect. 50.3 and
Sect. 50.4, respectively.
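The PMSM equations can be exercised outside Simulink with a few lines of code. The Python sketch below integrates the standard dq-frame voltage equations, the torque expression and the mechanical equation with a forward-Euler step, taking p as the number of poles (so that ωr = (p/2) ωm). All names (`PmsmParams`, `torque`, `euler_step`) and every numerical value are illustrative assumptions, not the parameters of Table 50.2, and no FOC or MTPA loop is included.

```python
from dataclasses import dataclass


@dataclass
class PmsmParams:
    rs: float      # stator resistance [ohm]
    ld: float      # d-axis inductance [H]
    lq: float      # q-axis inductance [H]
    flux: float    # flux linkage lambda_d [Wb]
    poles: int     # number of poles p
    jm: float      # moment of inertia [kg m^2]
    bm: float      # viscous friction coefficient [N m s]


def torque(p, i_d, i_q):
    """Electromagnetic torque: (3/4) p [lambda_d i_q + (L_d - L_q) i_d i_q]."""
    return 0.75 * p.poles * (p.flux * i_q + (p.ld - p.lq) * i_d * i_q)


def euler_step(p, state, v_d, v_q, t_load, dt):
    """Advance (i_d, i_q, w_m) by one forward-Euler step of length dt."""
    i_d, i_q, w_m = state
    w_r = 0.5 * p.poles * w_m  # electrical speed from w_m = (2/p) w_r
    did = (v_d - p.rs * i_d + w_r * p.lq * i_q) / p.ld
    diq = (v_q - p.rs * i_q - w_r * p.ld * i_d - w_r * p.flux) / p.lq
    dwm = (torque(p, i_d, i_q) - t_load - p.bm * w_m) / p.jm
    return (i_d + dt * did, i_q + dt * diq, w_m + dt * dwm)


# Hypothetical machine, for illustration only.
motor = PmsmParams(rs=0.5, ld=1e-3, lq=1e-3, flux=0.1, poles=4, jm=0.01, bm=1e-3)

state = (0.0, 0.0, 0.0)
for _ in range(20000):  # 0.2 s with a 10 us step and a constant q-axis voltage
    state = euler_step(motor, state, v_d=0.0, v_q=20.0, t_load=0.0, dt=1e-5)
print("i_d, i_q, w_m after 0.2 s:", state)
```

Forward Euler is used only for brevity; a stiff or variable-step solver, as available in Simulink, is the safer choice for widely separated electrical and mechanical time constants.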
Fig. 50.2 a Drive cycle, the vehicle speed increases from 0 to 60 km/h and remains constant
until the vehicle stops. The vehicle goes up a slope for the first half of time and then a descent
for the remaining time. b Electromechanical torque. c State of charge, current and voltage of the
lithium-battery subsystem
The model described allows the vehicle to be tested in different scenarios and with different
components. Each element of the model (motor, battery storage, dc-bus, converters,
auxiliary load and so on) can be configured for specific scenarios by changing its
parameters. Another test is shown to verify the validity of the model and to determine
correct configurations of the electrical components according to the characteristics of
the vehicle, as well as its possible use for diagnostics. The same scenario is used for
the speed profile. The road angle α is set to zero. The ultralight Renault Twizy is
considered in this simulation, with a mass of 450 kg and a 15 kW PMSM whose
parameters are shown in Table 50.2. Figure 50.3a reports the electromechanical
torque behavior. Figure 50.3b illustrates the traces of the battery characteristics: after
the vehicle has driven for 60 s, the state of charge (SOC) has decreased by 1.5% of its
initial value. The EV dynamics, including tractive force and vehicle
speed/acceleration, can be easily monitored.
50.5 Conclusion
Fig. 50.3 a EV electromechanical torque, b State of charge, current and voltage of the battery
and recharging capacity, electric motors PMSM and DC motor, and the electric vehi-
cle dynamics. The model is useful in the diagnostic phase as well as to validate
the correct sizing of the electrical/electronic architecture. The model is parametric
and can be scaled to different vehicle configurations, battery pack, motor, covering
different scenarios. For this purpose, two different configurations have been con-
sidered with different PMSM motors and different light-duty vehicles. Our results
suggest that the proposed model is a complete dynamic model for an electric vehicle
powertrain, including all main subsystems.
References
10. Shepherd CM (1965) Design of primary and secondary cells II. An equation describing battery
discharge. J Electrochem Soc 112(7):657–664
11. Mihet Popa L, Saponara S (2018) Toward green vehicles digitalization for the next generation
of connected and electrified transport systems. Energies 11(11):3124
12. Captain C (2009) Torque control in field weakening mode. Aalborg University M.Sc. thesis,
PED4-1038
13. Dini P, Saponara S (2019) Cogging torque reduction in brushless motors by a nonlinear control
technique. Energies 12(11)
Part X
IoT and Integrated Circuits
Chapter 51
Analysis of 3-D MPPT for RF Harvesting
Abstract We discuss the issues arising in the design of RF harvesters for ultra low-
power environments. The 3-D MPPT approach in [1] is the only one taking into
account the presence of variable output load. Its architecture and performance are
compared with other state-of-the-art MPPT implementations.
51.1 Introduction
Energy harvesting is a physical process aimed at collecting energy from the
environment to power or recharge an accumulator whenever possible. This technique plays
a fundamental role in the development and full exploitation of the emerging Internet of
Things (IoT) world, which is characterized by a huge number of connected devices
that must be fully autonomous. Among the possible energy sources, radiofrequency
electromagnetic field (RF field) is an attractive option since it does not require any
movement or friction nor a thermal gradient, and it is available in both indoor and
outdoor environments [1]. On the other hand, the RF source is ambient dependent,
uncontrollable, and unpredictable. The a priori study of the availability of the power
spectral density (PSD), carried out in the final device location, would be the base
for the design of a dedicated RF harvester. However, the huge number of devices
potentially involved in IoT environments requires a more flexible approach. There-
fore, several techniques for searching and tracking the point of maximum transferred
power have been proposed in literature (MPPT techniques). In this paper, we dis-
cuss some of the aspects characterizing MPPT for RF harvesters. Moreover, the 3-D
MPPT approach reported in [1] is analyzed and compared with other reported MPPT
implementations.
$$Q_{LC} = \frac{1}{[R_{ANT} + R_{IN}]} \cdot \frac{1}{2\pi f_{LC}\, C_{IN}} \qquad (51.1)$$
where CIN and RIN are the input capacitance and resistance of the rectifier. QLC
provides voltage amplification and is closely related to the sensitivity SIN of the
harvester circuit. The non-linear behavior of the RF rectifier sets a lower bound
for the peak input voltage, leading to the minimum input power able to activate the
system:
$$S_{IN} = \frac{(V_{ID})^2 \cdot R_{IN}}{2 \cdot Q_{LC}^2 \cdot [R_{ANT} + R_{IN}]^2} \qquad (51.2)$$
where VID is the internal voltage drop due to the rectifying devices.
In case of harvesting in environments without any dedicated source or of far field
energy scavenging, it is necessary to achieve a high Q factor to keep SIN as low as
possible, despite the reduction of the bandwidth of the resonant filter (Δf = fLC/QLC).
On the contrary, in case of dedicated RF sources in near field a large amount of energy
can be normally collected and the constraint on the Q factor can be relaxed.
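The trade-off between sensitivity and bandwidth can be made concrete with a short calculation based on Eqs. (51.1) and (51.2). The Python sketch below uses hypothetical component values (50 Ω antenna, 10 Ω rectifier input resistance, 100 fF input capacitance, 900 MHz resonance, 0.2 V internal drop); none of them come from the chapter, and the function names are illustrative.

```python
import math


def q_factor(r_ant, r_in, f_lc, c_in):
    """Quality factor of the input network, Eq. (51.1)."""
    return 1.0 / ((r_ant + r_in) * 2.0 * math.pi * f_lc * c_in)


def sensitivity(v_id, r_in, r_ant, q_lc):
    """Minimum input power able to activate the rectifier, Eq. (51.2) [W]."""
    return v_id ** 2 * r_in / (2.0 * q_lc ** 2 * (r_ant + r_in) ** 2)


def bandwidth(f_lc, q_lc):
    """Bandwidth of the resonant filter, f_LC / Q_LC [Hz]."""
    return f_lc / q_lc


# Hypothetical values, for illustration only.
R_ANT, R_IN, C_IN, F_LC, V_ID = 50.0, 10.0, 100e-15, 900e6, 0.2

q_lc = q_factor(R_ANT, R_IN, F_LC, C_IN)
s_in = sensitivity(V_ID, R_IN, R_ANT, q_lc)
s_in_dbm = 10.0 * math.log10(s_in / 1e-3)

print(f"Q_LC = {q_lc:.1f}, S_IN = {s_in_dbm:.1f} dBm, "
      f"bandwidth = {bandwidth(F_LC, q_lc) / 1e6:.1f} MHz")
```

Since SIN scales as 1/QLC², doubling the quality factor lowers the minimum input power by a factor of four (about 6 dB) while halving the bandwidth.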
The overall transfer efficiency ηTOT is given by the product of the cascade of the
efficiencies of the circuits involved in the harvesting process. In this perspective, the
efficiency of every stage has to be maximized for the best energy collection.
The rectification efficiency ηR largely depends on the rectifier architecture, which
can realize both voltage conversion and multiplication. Fully passive rectifier
architectures, such as those adopted in [1–3], do not require any energy from the battery,
at the cost of either power losses in the rectifying devices or the need for
self-polarization techniques. These circuits are the most suitable for energy scavenging, where
the available RF energy is limited and the battery must be preserved. Active architectures
for harvesters with dedicated sources and large energy availability have also been
proposed to maximize ηR at the cost of power from the battery [4].
operating on CIN. All these circuits monitor the output voltage for a specific value
of the output resistance. The maximum transferred power is obtained
by maximizing VH and therefore PH = VH²/RLH. For this purpose, VH is stored on a
capacitor and then compared with the VH of the previous algorithm step.
Alternative approaches are proposed in [9, 10], where the rectifier is coupled with
a DC-DC converter. The former aims at maximizing the input current of the DC-DC
converter, considering its output voltage constant and acting on the switching
frequency fS of the converter. The delivered power is hence calculated as PH = VH · IH.
Despite its originality, this system does not consider at all the effect of the output load
on any parameter of the rectifier, and it does not seem effective for ultra low-power
RF harvesters. Martins and Serdijn [10] still operate on fS, but they consider the effects
of IH, in particular on the RIN of the rectifier, and maximize the converted power
by maximizing VH. In all the mentioned methods only one parameter is considered,
thereby simplifying the control strategy and speeding up the MPPT algorithm.
Differently from these approaches, the harvester in [1] computes the maximum power
considering both the output voltage and the output load of the rectifier.
The system in [1] adopts an alternative approach to MPPT for ultra low-power RF
harvesters and it is designed for adaptation to mutable RF environments in space
and time. The search space of the point of maximum transferred power is three-
dimensional since it considers the input capacitance CIN for the shifting of f LC ,
the output voltage VH , and the output resistance RLH . By means of the last two
parameters, the real maximum power for RF harvesters with non-constant output
load can be computed.
In [1] a Full-Wave Mirror Stacked rectifier with threshold voltage cancellation has
been chosen for its minimal input capacitance, leading to high Q-factors
and very low sensitivity SIN [11]. A mathematical model of the rectifier developed
in MATLAB and post-layout simulations with back-annotated parasitic extraction
demonstrate that the point of maximum transferred power moves in a 3-D space,
as a function of the additional capacitance CTUN, VH, and RLH. Figure 51.2 shows the
maximum values of PH(RLH) versus VH and PAV, with the best CTUN, obtained from a
MATLAB simulation that includes the model of the rectifier and the computation of
the reflection coefficient ΓL. In the algorithm CTUN, VH, RLH, and PAV are varied to
find the best matching capacitance value CTUN and the maximum PH for given fS,
RANT, and LM. It is worth noting that the point of maximum power transfer is not
always at the maximum VH.
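Why the maximum power does not coincide with the maximum VH can be seen even with a deliberately crude model. The Python sketch below replaces the rectifier with a Thevenin source (open-circuit voltage `V_OC`, output resistance `R_OUT`); this is an assumption made purely for illustration and is not the rectifier model of [1]. Sweeping the load shows that VH keeps growing with RLH while PH = VH²/RLH peaks at an intermediate load.

```python
def output_voltage(v_oc, r_out, r_load):
    """Voltage across the load for a Thevenin source (toy rectifier model)."""
    return v_oc * r_load / (r_out + r_load)


def output_power(v_oc, r_out, r_load):
    """Power delivered to the load, P_H = V_H^2 / R_LH."""
    v_h = output_voltage(v_oc, r_out, r_load)
    return v_h * v_h / r_load


# Hypothetical source and load sweep (1 kohm to 100 kohm in 1 kohm steps).
V_OC, R_OUT = 1.0, 10e3
loads = [1e3 * k for k in range(1, 101)]

best_power_load = max(loads, key=lambda r: output_power(V_OC, R_OUT, r))
best_voltage_load = max(loads, key=lambda r: output_voltage(V_OC, R_OUT, r))

print("load maximizing P_H:", best_power_load, "ohm")
print("load maximizing V_H:", best_voltage_load, "ohm")
```

In this toy case the two optima are a decade apart, so a tracker that only maximizes VH would settle on the wrong load whenever RLH is free to vary.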
The limitation of the maximum VH is a matter of fact in real harvesters, and the
result shown in Fig. 51.2 suggests splitting the graph into two sections: one for high
PAV and one for low PAV. If the VH and the operative PAV ranges are defined a priori,
for high values of RF available power the maximum delivered power is located at the
maximum VH, whereas for low values of available power, where the MPPT is most
necessary, RLH must also be evaluated in order to find the maximum PH.
Fig. 51.3 Left: schematic of the MPPT system for RF harvesting. Right: power meter with variable
load and voltage threshold [1]
448 M. Caselli and A. Boni
VH is compared with the threshold voltages VTH_N of a bank of inverters with different
aspect ratios. Only one inverter at a time is selected for the comparison by means
of the nV -bit word BV .
The dedicated FSM advances the MPPT algorithm by means of the feedback
signal POK , obtained from the comparison of VTH_N and VH . The value of the power
delivered by the rectifier is computed by means of a look-up table. The correct
computation of the delivered output power assumes the linearity of the threshold
voltage steps. The FSM runs a perturb-and-observe algorithm operating on the
input capacitance and monitoring the different configurations of the VH –RLH pair.
The evaluation starts from the upper limit of the chosen band with CT0 = 0.
The load resistance is progressively decreased sweeping BR until VH falls below
VTH and POK goes to zero. Here the resonance frequency is shifted down by means
of the bits nC until VH rises again over VTH , POK goes to logic one, and the sweep
of the load resistance resumes. Finally, if the maximum computable power has not
been achieved, VTH is modified and the nested loops are repeated.
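The nested loops described above can be sketched in software. The Python code below is a behavioural approximation under stated assumptions, not the FSM of [1]: `measure_vh` stands for the comparison hardware, the codes `b_c` and `b_r` play the role of the words BC and BR, the delivered power is estimated as VTH²/RLH instead of being read from a look-up table, and the toy harvester (`toy_r_load`, `toy_vh`) is entirely hypothetical.

```python
def search_max_power(measure_vh, thresholds, r_load, n_c_codes, n_r_codes):
    """Return (power, v_th, b_c, b_r) for the best configuration found, or None."""
    best = None
    for v_th in thresholds:                      # outer loop: threshold voltage
        b_c, b_r = 0, 0                          # no added capacitance, largest load
        while b_c < n_c_codes and b_r < n_r_codes:
            p_ok = measure_vh(b_c, b_r) >= v_th  # feedback signal P_OK
            if p_ok:
                power = v_th ** 2 / r_load(b_r)  # lower bound on delivered power
                if best is None or power > best[0]:
                    best = (power, v_th, b_c, b_r)
                b_r += 1                         # decrease the load resistance
            else:
                b_c += 1                         # shift the resonance frequency down
    return best


def toy_r_load(b_r):
    """Hypothetical binary-weighted load: 100 kohm halved at each code."""
    return 100e3 / 2 ** b_r


def toy_vh(b_c, b_r):
    """Hypothetical harvester: resonance peak at code 5, 10 kohm source resistance."""
    v_oc = 1.0 / (1.0 + (b_c - 5) ** 2 / 4.0)
    r = toy_r_load(b_r)
    return v_oc * r / (r + 10e3)


best = search_max_power(toy_vh, [0.2, 0.4, 0.6], toy_r_load, n_c_codes=8, n_r_codes=8)
print("best (power, v_th, b_c, b_r):", best)
```

Each threshold produces a staircase walk through the (capacitance, load) plane, so the number of comparisons per threshold is bounded by the sum of the two code ranges rather than by their product.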
Simulation results [1] demonstrate the capability of the proposed system to deal with
multiple tones at different frequencies and to choose the correct point of maximum
delivered power at given quantization levels for the controlling bit words BR , BC ,
BV .
From the comparison of the designed system with several state-of-the-art RF
harvesters equipped with MPPT reported in Table 51.1, some remarks can be offered.
Harvester implementations for both far field without any dedicated RF source [1–
3, 7, 10] and near field with dedicated sources operating on a specific frequency
[5] can exploit the impedance matching variation to enlarge the system bandwidth,
although the latter category obtains a limited improvement of performance. The
3-D MPPT system in [1] is the only one capable of computing the delivered power
taking into account both VH and RLH. The MPPT implementations in [2, 3, 5, 7, 8]
search for the maximum power point by monitoring only the output voltage of the rectifier
VH. However, this approach is accurate only for RF harvesters with constant and
defined RLH . The information about the best output configuration (VH -RLH ) can be
particularly useful for RF harvesters cascading an input control DC-DC converter at
the output of the rectifier [12, 13]. The control strategy of the DC-DC converter can
be tuned to operate at the desired average VH , whereas the average load resistance of
the rectifier is defined by the incoming power PIN [14]. Most approaches in literature
propose the periodic verification of the point of maximum transferred power due
to the variability of the RF field. In order to limit power consumption, the control
section is normally powered down at the end of the algorithm and turned on again
with a small duty cycle. Little information is provided in the literature about the power
consumption of the MPPT systems, even though this is crucial given the low RF power
possibly available [14]. Currently, the best reported values are in the order of a few
tens of nanoamperes [1, 3, 10] thanks to the low MPPT duty cycle and the minimal
architecture. Moreover, to cope with low-power environments, high-sensitivity rectifier
architectures are required [10, 11].
References
1. Caselli M, Boni A (2019) 3-D Maximum power point searching and tracking for ultra low
power RF energy harvesters. In: IEEE SMACD
2. Zeng Z et al (2016) A WLAN 2.4-GHz RF energy harvesting system with reconfigurable
rectifier for wireless sensor network. In: IEEE ISCAS
3. Stoopman M et al (2013) Self-calibrating RF energy harvester generating 1 V at −26.3 dBm.
In: IEEE symposium on VLSI
4. Wang SH et al (2018) The design of CMOS 13.56 MHz high efficiency 1x/3x 1.99 V/6.29 V
active rectifier for implantable neuromodulation systems. In: IEEE ISCAS
5. Gosselin A et al (2017) A CMOS automatic tuning system to maximize remote powering
efficiency. In: IEEE ISCAS
6. Abouzied MA et al (2017) A fully integrated reconfigurable self-startup RF energy-harvesting
system with storage capability. IEEE J Solid-State Circ 52(3)
7. Bakhtiar AS et al (2010) An RF power harvesting system with input-tuning for long-range
RFID tags. In: IEEE ISCAS
8. Xia L et al (2014) 0.56 V, −20 dBm RF-powered, multi-node wireless body area network
system-on-a-chip with harvesting-efficiency tracking loop. IEEE J Solid-State Circ 49(6)
9. Hua X, Harjani R (2018) A 5 μW–5mW input power range, 0–3.5 V output voltage range
RF energy harvester with power-estimator-enhanced MPPT controller. In: 2018 IEEE custom
integrated circuits conference (CICC)
10. Martins GC, Serdijn WA (2018) An RF energy harvester with MPPT operating across a wide
range of available input power. In: 2018 IEEE international symposium on circuits and systems
(ISCAS)
11. Nakamoto H et al A passive UHF RF identification CMOS tag IC using ferroelectric RAM in
0.35–μm technology. IEEE J Solid-State Circ 42(1)
12. Wang J et al (2017) 900 MHz RF energy harvesting system in 40 nm CMOS technology with
efficiency peaking at 47% and higher than 30% over a 22 dB wide input power range. In:
ESSCIRC 2017-43rd IEEE European solid state circuits conference
Abstract This paper presents the modeling and design activity of a PLL (Phase-
Locked Loop) architecture to generate the clock reference for the new ESA Spacefibre
standard for on-board satellite communications up to 6.25 Gbps. Starting from a
6.25 GHz VCO rad-hard design, integrated in 65 nm technology within an IMEC-
University of Pisa collaboration, this work presents a PLL architecture including
configurable integer divider, down to a reference signal of 156.25 MHz, phase-
frequency detector, charge pump and passive loop filter. Modeling and simulation
analysis, carried out in Keysight ADS environment, show that a fully integrated
solution can be achieved with a 6 MHz low-pass PLL loop filter whose passive
devices can be integrated on chip with an area of about 4600 μm². The PLL phase
noise performance is in line with that of the original VCO, and for stability,
gain and phase margins of 86 dB and 50° are achieved. PLL lock time is about
555 ns. A preliminary circuit for the charge pump implementation is also proposed.
52.1 Introduction
The new Spacefibre standard for on-board satellite communication up to 6.25 Gbps
has been recently released by ESA [1]. A key block for its implementation is the
clock reference generator, which should be tolerant to SEE (Single event effects)
and TID (Total ionization dose), and able to sustain up to 6.25 GHz, as well as its
The PLL’s architecture is shown in Fig. 52.1. It is composed of a PFD (Phase Fre-
quency Detector), that provides two digital signals (UP and DOWN) depending on
the phase and frequency difference between the input signals, a CP (Charge Pump),
that converts UP and DOWN signals into a current (positive or negative), a passive
Loop Filter, a VCO, that generates the output signal of the PLL, and an integer N
frequency divider that provides one of the inputs of the PFD.
The main target frequency that has been chosen for this work (6.25 GHz) is
obtained from a reference frequency of 156.25 MHz thanks to an integer divider
with N = 40. The other frequencies (3.125 and 1.5625 GHz) could be obtained by
divisions by 2 and by 4 of the main frequency without adding other components to
the architecture.
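The frequency plan reduces to simple arithmetic, sketched below in Python with the values given in the text; the constant and variable names are illustrative.

```python
F_REF_HZ = 156.25e6   # reference frequency from the text
N_DIV = 40            # integer division ratio in the feedback path

# In lock the divided VCO output equals the reference, so f_vco = N * f_ref.
f_vco_hz = F_REF_HZ * N_DIV

# Lower rates obtained by dividing the main frequency by 2 and by 4.
sub_rates_hz = [f_vco_hz / d for d in (2, 4)]

print(f"VCO: {f_vco_hz / 1e9} GHz, sub-rates: {[f / 1e9 for f in sub_rates_hz]} GHz")
```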
52 Analysis and Simulation of a PLL Architecture Towards a Fully … 453
The VCO has a tuning range of 0–1.2 V and presents a phase noise <−100 dBc/Hz
at 1 MHz; the supply voltage for the PLL is 1.2 V. The target phase noise
for this work has been chosen to be <−80 dBc/Hz at 1 MHz, in line with that of the
VCO and better than what is required in the Spacefibre standard.
The target architecture of the PFD, shown in Fig. 52.2, consists of two D-FF with
a logic 1 at the inputs. The edges of the reference clock and of the divided clock from
the VCO force the signal UP and DOWN respectively to a logic 1. When both UP
and DOWN are active, the internal feedback chain resets the D-FF, forcing the two
signals to a logic 0. The delay of the reset chain has to be carefully chosen to avoid
the dead zone problem.
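The behaviour of the two-flip-flop PFD can be illustrated with a small event-driven model. The Python sketch below is an idealized assumption, not the circuit of Fig. 52.2: edges are given as time stamps, each flip-flop is set by its own rising edge, and both are cleared a fixed `reset_delay` after the second one is set. Edges arriving while the reset is propagating are not modelled.

```python
def pfd_pulses(ref_edges, div_edges, reset_delay=0.0):
    """Return lists of (start, stop) intervals for the UP and DOWN outputs."""
    events = sorted([(t, "ref") for t in ref_edges] + [(t, "div") for t in div_edges])
    up_start = dn_start = None
    up_pulses, dn_pulses = [], []
    for t, source in events:
        if source == "ref" and up_start is None:
            up_start = t                      # reference edge sets UP
        elif source == "div" and dn_start is None:
            dn_start = t                      # divided-clock edge sets DOWN
        if up_start is not None and dn_start is not None:
            t_reset = t + reset_delay         # both high: reset after the delay
            up_pulses.append((up_start, t_reset))
            dn_pulses.append((dn_start, t_reset))
            up_start = dn_start = None
    return up_pulses, dn_pulses


# Hypothetical example (times in ns): the reference leads the divided clock by 1 ns.
up, dn = pfd_pulses([0.0, 10.0], [1.0, 11.0], reset_delay=0.1)
net_width = (up[0][1] - up[0][0]) - (dn[0][1] - dn[0][0])
print("UP pulses:", up)
print("DOWN pulses:", dn)
print("net UP minus DOWN width:", net_width, "ns")
```

The net width equals the phase error regardless of the reset delay, while the delay guarantees that both outputs produce a minimum-width pulse even at zero phase error, which is how the dead zone is avoided.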
The loop filter is the component that determines the bandwidth and stability of the
PLL. It consists of two capacitors (C1 and C2) and one resistor (R1), see Fig. 52.1.
In its simplest form, with only one capacitor (C1), it would make the loop unstable and, for
this reason, a resistor (R1) is added. The second capacitor (C2) is added to reduce
spurious tones due to the current mismatch caused by the non-ideal charge pump,
and has to be at most C1/5. C1 and R1 are chosen to reach a loop bandwidth of
6 MHz. This bandwidth provides a good tradeoff between low-noise performance
and integrability of the filter. A completely integrated filter is preferred since all the
problems deriving from going off-chip are avoided. A higher loop bandwidth
provides stronger filtering of the VCO and loop filter noise with smaller capacitors,
while a lower bandwidth provides stronger filtering of the charge pump and reference
noise, but requires larger capacitors. Furthermore, a higher bandwidth provides a faster lock time.
For these reasons, an Icp of 40 μA has been chosen and, given the selected loop filter
bandwidth of 6 MHz and the need to have both resistor and capacitors integrable
on-chip, the following values for the passive devices have been defined: 8 pF for C1,
12 kΩ for R1, 1 pF for C2.
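The chosen component values fix the zero and the high-frequency pole of the loop filter. The Python sketch below assumes the usual second-order passive topology (R1 in series with C1, and C2 in parallel with that branch), since Fig. 52.1 is not reproduced here, and reads the resistor value as 12 kΩ.

```python
import math

R1 = 12e3     # ohm (assumed reading of the resistor value)
C1 = 8e-12    # farad
C2 = 1e-12    # farad

# Zero introduced by R1 in series with C1.
f_zero_hz = 1.0 / (2.0 * math.pi * R1 * C1)

# Pole set by R1 and the series combination of C1 and C2.
c_series = C1 * C2 / (C1 + C2)
f_pole_hz = 1.0 / (2.0 * math.pi * R1 * c_series)

# Design rule quoted in the text: C2 at most C1/5.
c2_within_limit = C2 <= C1 / 5.0

print(f"zero: {f_zero_hz / 1e6:.2f} MHz, pole: {f_pole_hz / 1e6:.2f} MHz, "
      f"C2 <= C1/5: {c2_within_limit}")
```

The zero falls below and the pole above the megahertz-range crossover, which is the placement needed for the zero to add phase lead before the pole removes it.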
A preliminary schematic design of the charge pump has been done in Cadence for two
different architectures, shown in Figs. 52.3 and 52.4.
The first one, in Fig. 52.3, is a simpler architecture but presents some disadvantages:
first, it behaves as a current source with low output impedance; second, the switches
are directly connected to the output node, influencing it with charge injection and
clock feedthrough effects; third, M1 and M5 spend some time in the linear region when
SW1 and SW2 are enabled [6].
The second charge-pump circuit in Fig. 52.4 uses as UP/DOWN signals a differ-
ential pair (UP, UPB and DN, DNB). This charge-pump circuit when compared to
the one in Fig. 52.3 shows higher output impedance, less effects on the output node
due to switching activity but it is a more complex architecture. This circuit solution
has been derived from [7]. In this work, with respect to [7] different bias signals to
MN2/MN3 and MN8/MN10 have been provided.
The first charge-pump architecture has been sized with non-minimum length
for all mirror transistors to enhance the output impedance, while the switches are
low Vt transistors with minimum length and quite large width to reduce the voltage
drop on them. To evaluate the output impedance of the circuits in Figs. 52.3 and 52.4
voltage sources have been applied at the output of the circuits and the relevant Iup, Idn
currents have been measured. The results are shown in Fig. 52.5 for the charge-pump
of Fig. 52.3 and in Fig. 52.6 for the charge-pump of Fig. 52.4. In Fig. 52.5 on the
left the UP (red) and DOWN (black) currents are shown as a function of the output
voltage, while in Fig. 52.5 on the right their difference (black) and the derivative of
the difference (red) are shown. As expected, this solution presents a quite low output
impedance. The second architecture has been sized with non-minimum length for
the mirror transistors, while the differential pairs are quite small. Results are shown
456 M. Mestice et al.
Fig. 52.5 Currents as function of Vout of the charge-pump architecture of Fig. 52.3
Fig. 52.6 Currents as function of Vout of the charge-pump architecture of Fig. 52.4
in Fig. 52.6. It shows a higher output impedance for the charge-pump circuit of
Fig. 52.4 versus that of Fig. 52.3, with nearly the same output range.
Firstly, the PLL has been modeled in phase domain to simulate and analyze the
behavior in terms of stability and bandwidth. Closed and open loop PLL models are
shown in Fig. 52.7. The PFD plus charge-pump and the divider blocks are linearized
models with constant gains of Icp /2π and 1/N, while the VCO behaves like an inte-
grator. The total transfer functions are those in Eq. (52.1) in open loop and Eq. (52.2)
in closed loop:
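$$H_{ol} = \frac{I_{cp}}{2\pi}\, Z(s)\, \frac{K_{vco}}{s}\, \frac{1}{N} \qquad (52.1)$$
$$H_{cl} = \frac{\dfrac{I_{cp}}{2\pi}\, Z(s)\, \dfrac{K_{vco}}{s}}{1 + \dfrac{I_{cp}}{2\pi}\, Z(s)\, \dfrac{K_{vco}}{s}\, \dfrac{1}{N}} \qquad (52.2)$$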
An AC simulation has been done and the results are shown in Figs. 52.8 and 52.9
for the open and closed loop transfer functions. In Fig. 52.10 the step response is
shown for an input phase step of 1°. The zero introduced by R1 in the loop filter
of Fig. 52.1 stabilizes the loop, while C2 tends to reduce the stability and, for this
reason, its value is chosen to maximize the phase margin. The unity gain frequency is
3.31 MHz and the phase margin is 50.9°. From the closed loop analysis the bandwidth
is 5.37 MHz.
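The reported phase margin can be cross-checked from the open-loop transfer function alone, because the positive constants Icp/2π, Kvco and 1/N do not affect the phase. The Python sketch below assumes the usual second-order passive filter impedance (R1 in series with C1, in parallel with C2) and the component values given for the loop filter, with R1 read as 12 kΩ; the function names are illustrative.

```python
import cmath
import math


def filter_impedance(s, r1, c1, c2):
    """Z(s) of the assumed filter: (R1 + 1/(s C1)) in parallel with 1/(s C2)."""
    c_series = c1 * c2 / (c1 + c2)
    return (1.0 + s * r1 * c1) / (s * (c1 + c2) * (1.0 + s * r1 * c_series))


def open_loop_phase_deg(f_hz, r1, c1, c2):
    """Phase of Z(s)/s at s = j 2 pi f; positive gain constants are omitted."""
    s = 1j * 2.0 * math.pi * f_hz
    return math.degrees(cmath.phase(filter_impedance(s, r1, c1, c2) / s))


F_UNITY_HZ = 3.31e6   # unity-gain frequency reported in the text
phase_margin_deg = 180.0 + open_loop_phase_deg(F_UNITY_HZ, 12e3, 8e-12, 1e-12)
print(f"phase margin at {F_UNITY_HZ / 1e6} MHz: {phase_margin_deg:.1f} degrees")
```

Under these assumptions the computed value is close to the 50.9° obtained from the AC simulation, which suggests that the assumed topology matches the simulated filter.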
Secondly, a model of the PLL in the time and frequency domains has been built to analyze
the lock time of the system and its noise performance, as shown in Fig. 52.11. To
achieve this goal, an envelope simulation has been done with both open loop and
closed loop models to compare the two systems. The models of the VCO_DivideByN
and the PhaseFreqDetCP are noise-free, as well as the reference source SRC1. There-
fore, the block NoiseVCO has been added to insert the VCO phase noise in the anal-
ysis. This block, starting from a piecewise linear curve approximation in frequency
domain of the VCO’s phase noise, provides the equivalent noise on the control voltage
of the oscillator. Instead, ADS’ noise models have been used for the loop filter.
In Fig. 52.12 the lock transient is shown in terms of frequency. From this analysis
a lock time of 555.6 ns results considering a locking error below 0.01%. As expected,
during the transient, peak frequencies are present due to the resistor in the loop filter.
These peaks are not present when the PLL is in the locked state because in this model
the charge pump is ideal, but non-ideality has to be considered and, therefore, the
second capacitor has been added to the loop filter. This capacitor has been sized so
that the previous results in terms of loop bandwidth, unity gain and phase margins
are kept roughly the same. In Fig. 52.13 the noise analysis’ results are shown in
dBc/Hz. In Fig. 52.13, the target phase noise (<−80 dBc/Hz) is achieved with a
good margin, but reference phase noise and charge pump noise were not considered.
These contributions will be added in the on-going activities. The loop filter’s noise
contribution is predominant at mid frequencies (bandpass response), while VCO’s
contribution prevails at higher frequencies (highpass response).
The same analysis has been done in Simulink using the models provided by the
Mixed Signal Blockset. The achieved results are similar to those of ADS’ model and
the VCO’s Simulink model follows the real VCO behaviour in terms of noise. In
Fig. 52.14 the VCO phase noise in orange, target, and the PLL resulting phase noise
in blue are shown.
The achieved results from the analysis that has been done are summarized in Table
52.1.
An estimation of the occupied area has been done for the loop filter considering
the 65 nm TSMC process design kit, resulting in a total area of about 4600 μm².
C1 should occupy, as a MIM capacitor, about 4000 μm²; C2, a MIM capacitor as well,
500 μm²; and R1, an N-well under OD resistor, 102.5 μm².
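The area budget can be verified by summing the three contributions quoted above; the short Python sketch below uses only those numbers, and the dictionary keys are illustrative labels.

```python
# Area estimates quoted in the text, in square micrometres.
AREAS_UM2 = {"C1 (MIM)": 4000.0, "C2 (MIM)": 500.0, "R1 (N-well)": 102.5}

total_area_um2 = sum(AREAS_UM2.values())
c1_share = AREAS_UM2["C1 (MIM)"] / total_area_um2

print(f"total: {total_area_um2} um^2, C1 share: {100.0 * c1_share:.0f}%")
```

C1 accounts for most of the total, so any further area reduction would have to come from a smaller C1, that is, from a different bandwidth or charge-pump current trade-off.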
The paper has presented the modeling and design activity in both Simulink and ADS
CAD environments of a PLL architecture to generate the clock reference for the new
ESA Spacefibre standard. This standard allows on-board satellite communications up
to 6.25 Gbps. Starting from a 6.25 GHz VCO rad-hard design, integrated in 65 nm
technology, this work presents a PLL architecture including configurable integer
References
Abstract The reliable analysis of DC operating point in circuits with positive feed-
back topology is often challenging, and frequently performed with ad hoc methods.
These techniques are often error prone and lead to the frequent use of sub-optimal
or unnecessary additional circuits for the stabilization or determination of the op-
erating point (startup circuits). We present a simple and reliable technique for the
determination of “stable” circuit solutions, that is based on the use of available circuit
simulators and hence takes advantage of accurate device models. The method has
been experimentally validated on a self-biasing current generator fabricated with a
standard 0.18 µm CMOS process.
53.1 Introduction
In the realm of electronic circuits containing active devices, the determination of the
operating point is a basic step of the design process. It is one of the few engineering
techniques requiring the solution of an inherently non-linear physical system. Since
non-linear systems cannot generally be solved in closed form, the electronic designer
has to resort to approximate solutions, numerical analysis tools or, sometimes, clever
ad hoc tricks. In fact, this intrinsic non-linearity is seldom a problem, since most
circuits are designed to have an operating point that can be easily determined.
However, some applications demand the use of circuits for which the computation
of the operating point is nontrivial. The typical case is a circuit with positive
feedback, such as the well-known Eccles-Jordan flip-flop. These circuits can have
several operating points, some of which are “unstable”. Due to the mentioned
non-linearity, the analysis of these circuits can be challenging; furthermore, in this
case commonly used circuit simulators, such as SPICE, often provide unreliable
information, since they can converge to the “unstable” solution.
General methods have been developed for the non-linear analysis of active
circuits [1–3], but they are generally too abstract, provide poor physical insight into
circuit operation, and are of little help to the circuit designer. As a consequence,
non-linear circuits are usually analysed with simple pencil-and-paper methods [4].
These calculations are constrained to the use of crude first-level device models, which
can lead to grossly approximated solutions, missed solutions and also spurious
solutions. Another common way to investigate the stability properties of circuits is the
use of (time-consuming) transient simulations, but these can also provide unreliable
information in the case of circuits with widely separated time constants
(ill-conditioned systems).
In order to overcome these shortcomings, we propose a method that is able to find
the operating points and the stability properties of many commonly used non-linear
feedback circuits.
Fig. 53.1 Self-biased current generator: simplified proof-of-concept circuit (a) and complete circuit
(b); M6, M7 and M8 are needed to set the bias point of M1 (a native transistor with negative threshold
voltage); the operational amplifier imposes $V_{DS,M3} = V_{DS,M4}$, improving the accuracy of the upper
current mirror; since in the complete circuit $V_2 = 0$, no generator is connected in series with M2
where $k_{um}$ depends on the geometry of M3 and M4. On the other hand, the lower
mirror (M1, M2, V1, and V2) provides a nonlinear relationship between the input
current (the drain current of M1, $I_{in\_lm}$) and the output current (the drain current
of M2, $I_{out\_lm}$):
$$I_{out\_lm} = f(I_{in\_lm}). \qquad (53.2)$$
The ratio of the input to the output current, $k_{lm}$, depends on the input current. At
equilibrium we must have
$$k_{um} = 1/k_{lm}. \qquad (53.3)$$
If $k_{lm}$ is a monotonic function of the input, then (53.3) can be satisfied for a single
set of circuit currents. However, as [4] points out, both mirrors of the circuit provide
zero current when fed with a zero input and hence another equilibrium point exists,
with all currents null (where $k_{lm}$ is undefined). For this reason most designers of
self-biased current generators include a startup circuit which forces the circuit to the
desired solution, avoiding the zero-current one [6–8].
However, the above discussion is oversimplified. Simulating the circuit (with a
UMC 0.18 µm CMOS technology, and with identically sized M3 and M4) we find that
if $\beta_1 > \beta_2$ and $V_1 > V_2$, where $\beta_i = \mu C_{ox} W_i / L_i$ refers to transistor Mi
($W_i$ and $L_i$ are the transistor width and length, $\mu$ is the carrier mobility and $C_{ox}$ is
the gate oxide capacitance per unit area), the circuit undergoes a transient ending at
the equilibrium point with non-zero currents. Hence, no startup circuit seems to be
required. Instead, if $\beta_1 < \beta_2$ and $V_1 < V_2$ the circuit never settles in the equilibrium
point suggested by Eq. (53.3), and no startup circuit can help. For the other possible
configurations ($\beta_1 < \beta_2$ and $V_1 > V_2$; $\beta_1 > \beta_2$ and $V_1 < V_2$) Eq. (53.3) is never
satisfied and no equilibrium point is possible.
466 F. Cucchi et al.
$$R_d = \frac{\Delta V_p}{\Delta I_{tot}} = \frac{R_t\,\Delta I_t}{\Delta I_t - \lambda\,\Delta I_t} = \frac{R_t}{1-\lambda} \qquad (53.4)$$
where $\Delta I_{tot}$ is indicated in Fig. 53.2c. From (53.4) we can conclude that if $\lambda > 1$ this
solution is unstable. Let us underline that we assumed $R_t > 0$, which is the typical
situation in practical circuits, but the method can in theory be easily generalized to
any initial sign of $R_t$. Furthermore, $\lambda$ is the small-signal DC loop gain, and hence
the fact that values in excess of 1 lead to instability is well known.
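As a quick numerical check of (53.4), the following Python sketch evaluates the differential resistance for the loop-gain values annotated in Fig. 53.3. The function names and the 1 kΩ value chosen for $R_t$ are illustrative assumptions and do not come from the paper; the stability test simply restates the condition $\lambda < 1$ for $R_t > 0$.

```python
def differential_resistance(r_t: float, loop_gain: float) -> float:
    """Return R_d = R_t / (1 - lambda), as in Eq. (53.4)."""
    if loop_gain == 1.0:
        raise ValueError("R_d is unbounded for a loop gain of exactly 1")
    return r_t / (1.0 - loop_gain)


def is_stable(r_t: float, loop_gain: float) -> bool:
    """For R_t > 0, a positive R_d (lambda < 1) marks a stable solution."""
    return differential_resistance(r_t, loop_gain) > 0.0


if __name__ == "__main__":
    R_T = 1e3  # hypothetical value in ohms, chosen only for illustration
    for lam in (0.23, 0.56, 1.65, 3.0):
        r_d = differential_resistance(R_T, lam)
        print(f"lambda={lam}: R_d={r_d:.1f} ohm, stable={is_stable(R_T, lam)}")
```

A negative result corresponds to the unstable case discussed above.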
Fig. 53.3 SPECTRE DC sweep of the circuit of Fig. 53.1, cut at the drain of M3: current generator
to the gate-drain of M1, voltage generator to the M3 drain. $\beta_1 > \beta_2$ and $V_1 > V_2$ (a); detail of the
low-current region (b); $\beta_1 < \beta_2$ and $V_1 < V_2$ (c); detail of the low-current region (d). Panel
annotations: $\lambda$ = 0.56, 3, 1.65 and 0.23 ($\lambda$ is the derivative of the current at the intersection; the
black straight lines are $I_v = I_t$, while the red lines show the simulation results)
Hence, the practical application of the method consists of cutting open a loop,
inserting the proper generators and performing a DC simulation of the circuit with
an input current sweep. The analysis of the circuit of Fig. 53.1a (for which $R_t > 0$)
leads to the results of Fig. 53.3a and b, which show that a single, stable operating
point is obtained only for $\beta_1 > \beta_2$ and $V_1 > V_2$. It is worth noticing that in this case no
equilibrium point exists at $I_t = 0$ and hence no startup circuitry is needed. Figure 53.3c
and d shows instead that for $\beta_1 < \beta_2$ and $V_1 < V_2$ the solution is unstable, and another
stable solution is present for very small currents. Therefore, with the use of a circuit
simulator equipped with accurate device models we learn that some
pencil-and-paper results, such as the zero-current stable solution, can indeed be
artifacts due to the use of overly simplistic device models.
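The post-processing of such a sweep can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' tooling: it assumes the simulator output has been exported as two lists (the swept current and the returned current), and it uses a purely synthetic saturating curve, which is not a model of the circuit of Fig. 53.1, to exercise the function. Crossings of the curve with the line $I_v = I_t$ are located by linear interpolation, and the local slope is taken as $\lambda$.

```python
from typing import List, Tuple


def operating_points(i_t: List[float],
                     i_v: List[float]) -> List[Tuple[float, float, bool]]:
    """Locate crossings of I_v(I_t) with the line I_v = I_t.

    Returns (current, loop_gain, stable) per crossing; the loop gain is the
    slope dI_v/dI_t of the segment and stable means loop_gain < 1.
    A crossing that falls exactly on the last sample is not reported.
    """
    points = []
    for k in range(len(i_t) - 1):
        d0 = i_v[k] - i_t[k]
        d1 = i_v[k + 1] - i_t[k + 1]
        if d0 == 0.0 or d0 * d1 < 0.0:
            slope = (i_v[k + 1] - i_v[k]) / (i_t[k + 1] - i_t[k])
            # Linear interpolation of the crossing inside the segment.
            frac = 0.0 if d0 == 0.0 else d0 / (d0 - d1)
            current = i_t[k] + frac * (i_t[k + 1] - i_t[k])
            points.append((current, slope, slope < 1.0))
    return points


def synthetic_curve(i_t: float, gain: float = 2.0, i_sat: float = 5.0) -> float:
    """Hypothetical saturating transfer curve used only as test data."""
    return gain * i_t / (1.0 + i_t / i_sat)


if __name__ == "__main__":
    sweep = [0.4 * k for k in range(26)]
    for point in operating_points(sweep, [synthetic_curve(x) for x in sweep]):
        print(point)
```

With a real simulator export, only the two input lists change; the resolution of the sweep limits how accurately $\lambda$ is estimated near each intersection.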
Furthermore, this approach provides valuable physical insight into the circuit.
Since the $I_v(I_t)$ relationship provided by the simulations can be interpreted as the
input-output characteristic of an amplifier, a designer can usually devise
modifications to the circuit that change it in a foreseeable manner. Hence, the above
analysis not only provides evidence of bias or stability problems, but is also a tool
for their solution.
The circuit of Fig. 53.1b has been designed and fabricated, using native transistors
(with threshold voltage < 0) for M1 and M2. In this version of the circuit M1 was
not diode-connected and a proper bias circuit was added in order to bias M1 in
saturation. $V_1$ and $V_2$ were set to 335 mV and 0, respectively. Using the proposed
method, we obtained the results of Fig. 53.4 (left). The current in M1 is about 7 nA,
and the operating point is stable. This is confirmed by measurements on 15 samples
realized in a 0.18 µm UMC CMOS technology. Figure 53.4 (right) shows the current
distribution in 14 working samples; the mean current is 5.85 nA (σ = 0.24 nA) and
no startup problems were observed.
Fig. 53.4 $I_v$ versus $I_t$ for the complete circuit of Fig. 53.1b (left) and $I_{M1}$ distribution in 14 samples
of the Fig. 53.1b circuit (right; horizontal axis from 5.2 nA to 6.6 nA)
Acknowledgements This work has been partially supported by the Electronic Components and
Systems for European Leadership Joint Undertaking and by the Italian Ministry of Education,
University and Research (MIUR) under grant agreement No. 737434 (CONNECT).
References
1. Chua L, Green D (1976) A qualitative analysis of the behavior of dynamic nonlinear networks:
stability of autonomous networks. IEEE Trans Circuits Syst 23:355–379
2. Green M, Willson A (1995) An algorithm for identifying unstable operating points using SPICE.
IEEE Trans Comput-Aided Des Integr Circuits Syst 14:360–370
3. Gajani GS, Brambilla A, Premoli A (2008) Numerical determination of possible multiple DC
solutions of nonlinear circuits. IEEE Trans Circuits Syst 55:1074–1083
4. Sedra A, Smith K (1997) Microelectronic circuits. Oxford University Press, New York
5. Green M, Willson A (1992) How to identify unstable dc operating points. IEEE Trans Circuits
Syst 39:820–832
6. Guo J, Leung KN (2012) A CMOS voltage regulator for passive RFID tag ICs. Int J Circuit
Theory Appl 40:329–340
7. Liang CJ, Chung CC, Lin H (2011) A low-voltage band-gap reference circuit with second-order
analyses. Int J Circuit Theory Appl 39:1247–1256
8. Tsitouras A, Plessas F, Birbas M, Kikidis J, Kalivas G (2012) A sub-1V supply CMOS voltage
reference generator. Int J Circuit Theory Appl 40:745–758
Chapter 54
IoT Ubiquitous Edge Engine
Implementation on the Raspberry PI
Abstract In the Internet of Things (IoT) ecosystem, sensors and actuators represent
the edge that is the source of data. The amount of data being generated by edge
devices is exploding. Storage and processing of all the data in the cloud has become
too slow and costly to meet the requirements of the end user. Edge computing presents
a substantial solution through facilitating the processing of device data closer to the
source. However, computing data from various and different sources is a formidable
challenge for edge programming. This paper presents lab experiments for testing
versions of a multi-purpose generic edge engine on open-hardware edge devices,
specifically the Raspberry Pi 3 as a test bed with a standard Operating System (OS)
and the STMxx as an MCU with a Real-Time Operating System (RTOS).
54.1 Introduction
The edge denotes the layer closest to the physical world that we are interested in
sensing. Apart from mobile devices, edge devices are considered low-end computing
systems due to their limited computational power. The edge engine that we
propose in this paper is designed to provide feasible edge computing capabilities
on low-end IoT edge devices. It is ubiquitous in the sense that it can be adopted for
different use case implementations. In this paper, part 2 discusses related work. Part
3 presents the architecture and implementation of the edge engine. Then, in part 4,
we present the test methodology and discuss the results. Finally, we conclude with
future work.
54.2 State-Of-The-Art
The main requirement of edge processors is real-time computing on continual input
within short time periods. Computations such as aggregation, filtering, processing and
other forms of data manipulation must keep up with the input flow of raw data. Another
requirement of edge processors is the backup and storage of important data in the
cloud. These requirements suit the limited bandwidth of the communication
channels available for IoT node connections (I2C, BLE, Wi-Fi) as well as of the internet
connection to the cloud. Furthermore, for better data management and structural
organization of edge devices, a common edge hub for multiple sensor nodes works
better than connecting each node directly to the cloud.
We developed a ubiquitous engine that runs on the Express.js framework
(programmed in Node.js [5]) and enables the edge device to act as an IoT hub
between sensors/actuators and the cloud. Figure 54.2 shows the block diagram of the
framework, where the edge hub plays a central role.
When run, the edge engine goes through two main stages: initialization and the run
loop. The complete flowchart is shown in Fig. 54.3.
In the initialization stage, the engine is set up in two steps. First, it logs in to the cloud
with user credentials. The login is an HTTP POST request to the URL of the
cloud server: https://api.atmosphere.tools. The request body contains the user
credentials (username/password), which are provided to the operator by the database
administrators. If the request succeeds, the server returns (in an HTTP response)
a JSON Web Token (JWT) for use in further HTTP requests to the cloud; the token is
valid for 30 min, after which it must be renewed. Second, the engine downloads the
edge script identified by its script Id. The script contains all the information the
edge engine needs to run according to the edge device operator’s intent, since the
operator is responsible for writing the script and storing it in the database beforehand.
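The 30 min validity window suggests a small token cache around the login call. The sketch below is written in Python rather than Node.js and is only an illustration of the renewal logic: the `login` callable stands in for the HTTP POST described above (its endpoint path and payload format are not given in the text, so they are deliberately left out), and the one-minute safety margin is an assumption.

```python
import time
from typing import Callable, Optional

TOKEN_LIFETIME_S = 30 * 60  # the text states that a JWT is valid for 30 min


class TokenManager:
    """Caches a JWT and renews it shortly before the validity window ends."""

    def __init__(self, login: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic,
                 margin_s: float = 60.0) -> None:
        self._login = login        # hypothetical callable performing the HTTP POST
        self._clock = clock        # injectable clock, which eases testing
        self._margin_s = margin_s  # assumed safety margin, not from the paper
        self._token: Optional[str] = None
        self._issued_at = 0.0

    def token(self) -> str:
        """Return a valid token, logging in again only when needed."""
        now = self._clock()
        expired = now - self._issued_at >= TOKEN_LIFETIME_S - self._margin_s
        if self._token is None or expired:
            self._token = self._login()
            self._issued_at = now
        return self._token
```

Every upload would then call `token()` instead of logging in directly, so that a fresh login happens at most once per validity window.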
Fig. 54.2 Block diagram showing the edge engine within the IoT ecosystem
472 A. Kobeissi et al.
Fig. 54.3 Flowchart representing the coding of the edge engine in Node.js
The information consists of edge descriptor information (tags, HTTP method,
features, device properties, etc.), delay intervals for looping processes, and computation
(operation) parameters. A sample edge script is shown in Fig. 54.4, with labels
indicating the different information contained in the script.
After a successful initialization, the edge engine enters a recursive stage called the run
loop. The run loop exploits the event loop mechanism within Node.js to run three
different processes consecutively and concurrently. The first to execute is reading
from sensors, which checks the connected ports for input data and saves it in an
input buffer. The next process depends on the first, and thus executes
immediately afterwards. This process applies computation operations to the input buffer,
replacing its contents with aggregated data row by row.
The computation operations are specified within the edge script, which gives
their type and the required condition/parameters. In the example of Fig. 54.4, the script
specifies two operations in order: a filter keeping the values that exceed 30, and a
map function that applies a transformation to the initial value.
The example demonstrates a temperature monitoring application where
only high values are recorded and then converted from Celsius to Fahrenheit. These
operations are supported by the JavaScript ‘Array.prototype’ object, which
contains up to 30 operational methods. The last process in the run loop is a login just
like the one performed in the initialization. A new login is required every 30 min to
re-acquire a valid JWT so that the measurements can be uploaded to the cloud. The
intervals for each of the three processes in the run loop can be set within the script;
if the operator fails to do so, the engine loads default values for each interval.
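The filter-then-map example can be mimicked outside JavaScript as well. The following Python sketch is an illustration only: the dictionary layout of `OPERATIONS` is a hypothetical stand-in for the edge script of Fig. 54.4, whose exact format is not reproduced in the text, and the map step is hard-wired to the Celsius-to-Fahrenheit conversion mentioned above.

```python
from typing import Dict, List

# Hypothetical operation list; the real script format is the one of Fig. 54.4.
OPERATIONS: List[Dict] = [
    {"type": "filter", "threshold": 30.0},  # keep values that exceed 30
    {"type": "map"},                        # convert Celsius to Fahrenheit
]


def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9.0 / 5.0 + 32.0


def apply_operations(buffer: List[float], operations: List[Dict]) -> List[float]:
    """Apply the operations in order and return the transformed buffer."""
    for op in operations:
        if op["type"] == "filter":
            buffer = [v for v in buffer if v > op["threshold"]]
        elif op["type"] == "map":
            buffer = [celsius_to_fahrenheit(v) for v in buffer]
        else:
            raise ValueError(f"unsupported operation: {op['type']}")
    return buffer


if __name__ == "__main__":
    print(apply_operations([25.0, 30.0, 35.0, 40.0], OPERATIONS))
```

Because the filter uses a strict comparison, a reading of exactly 30 is dropped, which matches the wording "exceed 30".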
We implemented backup scenarios for certain common events that occur at the
edge. One scenario is offline mode (disconnection from the Internet) or
intermittent connectivity. In this case, we activate local storage of the aggregated
buffer until the connection is re-established or the memory is full, in which case the
engine starts replacing the oldest records. Another scenario is the case of an incomplete
script or no script at all. The engine has default values for the parameters essential
to run, in which case raw data are uploaded to the cloud as is. The final scenario is
corrupt data handling. Data can get corrupted by certain types of operations; the
engine detects this and deactivates computing, recording raw data directly to the
cloud.
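The offline behaviour, storing records locally and overwriting the oldest ones once the memory is full, corresponds to a bounded first-in first-out buffer. A minimal Python sketch is given below; the class name and the capacity expressed as a record count are assumptions made for illustration, since the text does not say how the engine measures available memory.

```python
from collections import deque
from typing import Any, List


class OfflineStore:
    """Bounded local store that drops the oldest record when it is full."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards items from the opposite end on append.
        self._records: deque = deque(maxlen=capacity)

    def save(self, record: Any) -> None:
        self._records.append(record)

    def flush(self) -> List[Any]:
        """Return all pending records, oldest first, and empty the store."""
        pending = list(self._records)
        self._records.clear()
        return pending


if __name__ == "__main__":
    store = OfflineStore(capacity=3)
    for reading in range(5):
        store.save(reading)
    print(store.flush())
```

Calling `flush()` once the connection is re-established yields the surviving records in their original order.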
We designed a lab experiment to test the performance and parametric limits of the
edge engine deployed on a Raspberry Pi 3 B [6]. The experiment was designed to
simulate a smart home IoT environment. It included up to 16 sensors, wired
to the GPIO port of the Raspberry Pi: 4 dual temperature and
pressure sensors, 4 switch sensors, 3 photodetectors, 3 passive infrared (PIR) sensors,
1 humidity sensor, and 1 moisture sensor. These sensors have different polling rates,
the fastest being the 100 Hz of the PIR sensor. This means that
the minimum delay that still captures every change in the sensor measurements
is 10 ms. In the experiments, we ran 7 different edge scripts, each specifying a
different delay parameter for input reading from the sensors and for output writing to
the cloud, while keeping the ratio between the input and output streams
at 10x. All scripts ran the same number of consecutive operations, four, since
changing this parameter had little to no effect on the Raspberry Pi’s load. The
experiment resulted in two main observations, presented in Fig. 54.5. First, the CPU
usage reached its maximum of 90%, with 4 threads running on the 4-core CPU, at the
minimum possible input stream delay of 1 ms. The typical input stream delay of 10 ms
corresponded to 27% CPU usage with 3 running threads. Such
usage is acceptable considering the number of input streams (16) and computations
(4) running 100 times per second. The second observation, which concerned
memory usage, was unexpected: the test recorded a decline in memory usage at
higher output stream delays. One explanation for this observation is the
cache management mechanism within the Raspbian OS, which keeps the buffers
cleared by the engine for a while. Thus, the more buffers the engine clears
in a given time frame, the more buffers the OS caches.
Fig. 54.5 Test observations of CPU and memory usages with respect to delay changes
The presented results prove the feasibility and usability of the edge engine
architecture on low-end computing devices. As a test case, the Raspberry Pi performed
rather well under extreme parametric conditions.
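The workload figures quoted for the experiment follow from simple arithmetic, reproduced here as a short Python sketch. The helper names are illustrative; the inputs (100 Hz polling, 16 sensors, 4 operations) are the values stated in the text.

```python
def polling_delay_ms(rate_hz: float) -> float:
    """Delay between two readings for a given polling rate."""
    return 1000.0 / rate_hz


def operations_per_second(reads_per_second: int, sensors: int,
                          operations: int) -> int:
    """Number of elementary computations performed each second."""
    return reads_per_second * sensors * operations


if __name__ == "__main__":
    print(polling_delay_ms(100))               # delay for the 100 Hz PIR sensor
    print(operations_per_second(100, 16, 4))   # typical 10 ms input delay
```

At the 10 ms input delay this gives 6400 computations per second, the load that corresponded to 27% CPU usage in the experiment.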
In this paper, we presented a ubiquitous IoT edge engine implementation for low-end
computing devices such as microcontrollers. We performed a lab experiment to test
the engine’s performance on a Raspberry Pi unit connected to 16 sensors. The results
support the ability of such devices to perform a remarkable number of computations
(100 × 16 × 4 per second) within acceptable hardware usage. In future work, we look
forward to performing similar experiments on RTOS-based microcontrollers such as the
STM32 and Arduino. A possible addition to the operations supported at the edge is
lightweight machine learning algorithms in both the supervised and unsupervised
categories.
References
55.1 Introduction
H. Wöhrl
Department of Electronics and Computer Science, Technical University of Berlin, Straße Des 17.
Juni 135, 10623 Berlin, Germany
e-mail: [email protected]
D. Brunelli (B)
Department of Industrial Engineering, University of Trento, Via Sommarive 9, 38123 Trento, Italy
e-mail: [email protected]
Every record-hot summer shows the need for action against the ongoing waste of
resources, the associated release of climate-affecting gases and the resulting heating
of the atmosphere. To effectively combat wasteful power consumption, it is essential
to get a clearer picture of how energy is consumed in everyday life; as [1] pointed out,
households play a crucial role here. A very promising concept for understanding electrical
as well as thermal power consumption or water usage is Non-Intrusive Load
Monitoring, where consumption is measured at a single point and appliance-level power
consumption is extracted from the aggregated signal. Compared to equipping
individual consumers, this practice cuts installation cost dramatically and therefore makes
scaling to a whole city possible. As a key component of smart cities, Non-Intrusive
Load Monitoring offers the chance to efficiently measure and visualize the power
consumption not only of private households but also of commercial buildings
such as offices, malls or factory sites. In-depth power analysis also plays a major
role in Industry 4.0 and holds many possibilities [2]. Through pattern detection,
anomalies in power consumption can be detected in real time [3, 4] and machines
can be maintained before a major fault occurs, thus avoiding further consequences.
Non-intrusive Load Monitoring (NILM) describes the task of disaggregating the power
consumption of single appliances from an aggregated mains power measurement.
From the machine learning point of view, this is considered a single-channel blind
source separation problem, where multiple sources need to be extracted from one
combined measurement. George W. Hart founded the field of energy disaggregation
in the 1980s and in 1992 published the seminal paper on Non-intrusive Load
Monitoring [5], in which he introduced different NILM scenarios and implemented the first
disaggregation algorithms based on low-frequency features at a sampling rate of
1 Hz. In 2015, along with an overall rising interest in machine learning,
NILM gained a boost in popularity, resulting in various publications
combining different classification methods and features. These can generally be
divided into two approaches. The first uses low-frequency data together with,
e.g., machine learning methods, as Kelly did in 2015 with the first application of neural
networks to NILM [6]. The second exploits the richer features of measurements sampled
at higher rates, as for example S. Gupta [7] did using high-frequency electromagnetic
interference features. The most significant advantage of the low-frequency approach
is its applicability to commercially available smart meters; however, it still has
shortcomings, such as requiring large amounts of labeled data while still facing accuracy
challenges [8]. The higher-frequency approach is generally less investigated because of
its increased hardware cost, but as, e.g., Bernard [9] and Gupta [7] have shown, a richer
feature set can significantly improve existing NILM algorithms, allowing them to
classify more complex as well as similar loads. It furthermore permits new usage
scenarios such as anomaly detection.
In this paper, we present a smart measurement node that uses sampling rates up
to 10 kHz and outperforms previous prototypes [10, 11], even with harvesting capa-
bility [12]. Therefore, it can be flexibly deployed in various environments allowing
a combination of low and high-frequency features.
To measure the power intake of the respective building, the measurement node is
connected to the in-house mains power supply, which also powers
the board. For the current measurement, different analog interfaces can be selected
via a multiplexer. The analog interfaces are described in more detail in Sect. 55.2.1.
Two microcontrollers are deployed to handle the data stream and classification. This
allows a fast implementation of the training phase on one microcontroller and an
optimized real-time classification phase on the other; Sect. 55.2.2 explains this in
detail. The measurement data are then streamed via the onboard Wi-Fi module
(Sect. 55.2.3) to a TCP/IP server, which stores the data and trains the classification
model. After training, the model is transferred back to the microcontroller, which is
then ready to perform online classification (Fig. 55.1).
55 Non-intrusive Load Monitoring on the Edge of the Network … 479
To acquire the voltage and current measurements, we deploy two interfaces, each
driven by a 3 MSps ADC capable of simultaneously sampling two differential
channels at 1.5 MSps. Their low current intake makes power-efficient operation possible.
The SPI buses of the ADCs are connected to a multiplexer which routes one SPI
connection to the microcontrollers. The first interface measures the grid voltage via
a voltage divider on one channel and the consumed mains current via a shunt on the
other channel. The second interface offers another option for the current
measurement, with a Hall-effect current sensor connected to one channel and a Rogowski
coil to the other. While the second interface handles only isolated signals,
the first interface is directly connected to mains voltage and therefore has a digital
isolator between the ADC and the multiplexer to keep AC voltages off the logic
level in case of failure. The installed STM32 microcontroller would theoretically
allow operation at the maximum sampling rate, making the PCB potentially
usable for further applications such as non-intrusive load monitoring based on
electromagnetic interference [7] or power quality analysis.
In our implementation, the analog frontend is sampled by generating a steady
pulse with a frequency of 10 kHz to trigger the conversion. After the trigger, the data
are transferred into the microcontroller via DMA, where they are stored in two
ping-pong buffers.
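The ping-pong scheme can be illustrated with a small software model. The Python sketch below is not firmware and does not use any STM32 API: it only mimics the behaviour in which one buffer is filled sample by sample while the other, already full, is handed to the consumer. The class name and the buffer size are illustrative assumptions.

```python
from typing import List, Optional


class PingPongBuffer:
    """Two fixed-size buffers: one is filled while the other is drained."""

    def __init__(self, size: int) -> None:
        self._buffers = [[0] * size, [0] * size]
        self._size = size
        self._active = 0  # index of the buffer currently being filled
        self._index = 0   # write position inside the active buffer

    def push(self, sample: int) -> Optional[List[int]]:
        """Store one sample; return the buffer that just became full, if any.

        The returned buffer must be consumed before the other one fills up,
        otherwise its contents are overwritten.
        """
        buf = self._buffers[self._active]
        buf[self._index] = sample
        self._index += 1
        if self._index < self._size:
            return None
        self._active ^= 1  # swap roles of the two buffers
        self._index = 0
        return buf


if __name__ == "__main__":
    pp = PingPongBuffer(size=3)
    for s in range(6):
        full = pp.push(s)
        if full is not None:
            print("ready to stream:", list(full))
```

In the real device the role of `push` is played by the DMA controller, and the consumer is the transfer towards the Wi-Fi module.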
The heart of the system consists of two microcontrollers, of which one is active and
one is idle at any time. One is the GAP8, an ultra-low-power RISC-V processor
developed by GreenWaves Technologies, while the other is an
ultra-low-power Cortex-M4 microcontroller from the STM32-F410 family (Fig. 55.2).
The first contains a multicore processor with a cluster of 8 cores, which offers enough
computing power to do near real-time classification on the chip. This allows the
classification to be moved to the edge of the network and thus cuts power consumption
drastically, since recorded data can be processed locally instead of in the cloud.
Furthermore, an application on the network edge cuts operating cost by making the
maintenance of a server unnecessary and reduces privacy concerns by storing most data
locally.
The STM32 Cortex-M4 microcontroller has the necessary interfaces to communicate
between the different submodules of the device and is therefore used in the training
phase, where data need to be fetched from the ADCs and passed to the Wi-Fi module.
It also has an ultra-low power intake, making power-efficient operation
possible. As a processor of the ARM architecture, it is highly flexible and can also be
programmed to fit different applications or compression algorithms [13].
In the training phase, the STM32 microcontroller sends the measured data to
a server, which learns the classification model in the next step. Afterwards, the
extracted model is transferred to the GAP8 microcontroller, which then performs the
classification of newly recorded data.
To establish the connection to the server, we use a 2.4 GHz 802.11 b/g/n Wi-Fi
module with an integrated MCU. Other standards, such as Bluetooth, do not offer enough
bandwidth [14], whereas the Wi-Fi module considerably eases the implementation
and reduces programming effort while increasing flexibility. The
Wi-Fi module is connected via UART to the STM32 microcontroller at a rate
of 1 Mbit/s, which in optimal circumstances allows conversions at up to 35 kSps. The
UART communication is implemented with a DMA that streams the buffers out
as soon as they are full.
55.3 Evaluation
To evaluate the operation of the sub-modules, the throughput of the Wi-Fi
connection was tested first. A program was written for the STM32 microcontroller that
opens a connection to a TCP/IP server and then sends dummy data to the server. This
resulted in a net bandwidth of approximately 100 kB/s, which would allow sampling
rates up to 20 kHz.
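The relation between net bandwidth and achievable sampling rate can be made explicit with a short Python sketch. The payload of 5 bytes per sampling instant is not stated in the text; it is the value implied by the two figures quoted above (100 kB/s and 20 kHz), with 1 kB taken as 1000 bytes, and should be read as an assumption.

```python
def max_sample_rate_hz(throughput_bytes_per_s: float,
                       bytes_per_sample: float) -> float:
    """Highest sampling rate the link can sustain for a given payload size."""
    return throughput_bytes_per_s / bytes_per_sample


def link_utilisation(sample_rate_hz: float, bytes_per_sample: float,
                     throughput_bytes_per_s: float) -> float:
    """Fraction of the net bandwidth used at a given sampling rate."""
    return sample_rate_hz * bytes_per_sample / throughput_bytes_per_s


if __name__ == "__main__":
    NET_BANDWIDTH = 100_000.0  # bytes per second, from the Wi-Fi test
    BYTES_PER_SAMPLE = 5.0     # assumed payload implied by the quoted figures
    print(max_sample_rate_hz(NET_BANDWIDTH, BYTES_PER_SAMPLE))
    print(link_utilisation(10_000.0, BYTES_PER_SAMPLE, NET_BANDWIDTH))
```

Under this assumption, the 10 kHz sampling rate used in the implementation would occupy about half of the measured bandwidth.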
In a second test, the STM32 was used to sample the mains power at 10 kHz
with different appliances connected. This test showed promising results with respect to
NILM operation, since differences in the spectra, even for very similar devices, can be
clearly seen by analyzing their harmonics (Fig. 55.3).
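A harmonic analysis of this kind can be sketched with a single-bin discrete Fourier transform evaluated at multiples of the mains frequency. The Python example below is an illustration under stated assumptions: a 50 Hz fundamental, a record containing a whole number of mains cycles, and a synthetic test signal instead of measured data. It is not the processing chain used by the authors.

```python
import cmath
import math
from typing import List


def harmonic_amplitudes(samples: List[float], sample_rate_hz: float,
                        fundamental_hz: float, n_harmonics: int) -> List[float]:
    """Amplitude of each harmonic via a single-bin DFT.

    The result is exact only when the record spans whole fundamental cycles.
    """
    n = len(samples)
    amplitudes = []
    for h in range(1, n_harmonics + 1):
        w = -2j * math.pi * h * fundamental_hz / sample_rate_hz
        acc = sum(x * cmath.exp(w * k) for k, x in enumerate(samples))
        amplitudes.append(2.0 * abs(acc) / n)
    return amplitudes


if __name__ == "__main__":
    FS = 10_000.0  # sampling rate used in the second test
    F0 = 50.0      # assumed mains frequency
    # Synthetic current: fundamental plus a third harmonic at 30 % amplitude.
    signal = [math.sin(2 * math.pi * F0 * k / FS)
              + 0.3 * math.sin(2 * math.pi * 3 * F0 * k / FS)
              for k in range(2000)]
    print([round(a, 3) for a in harmonic_amplitudes(signal, FS, F0, 5)])
```

Comparing such amplitude vectors for two appliances gives a compact numerical counterpart of the visual comparison of spectra.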
Fig. 55.3 FFT spectra of current intake of two notebook power supplies in idle operation
References
1. Armel KC, Gupta A, Shrimali G, Albert A (2013) Is disaggregation the holy grail of energy
efficiency? The case of electricity. Energy Policy 52:213–234
2. Rossi M, Rizzon L, Fait M, Passerone R, Brunelli D (2014) Energy neutral wireless sensing
for server farms monitoring. IEEE J Emerg Sel Top Circuits Syst 4(3):324–334
3. Nardello M, Rossi M, Brunelli D (2017) A low-cost smart sensor for non intrusive load mon-
itoring applications. In: 2017 IEEE 26th international symposium on industrial electronics
(ISIE), Edinburgh, pp 1362–1368
4. Nardello M, Rossi M, Brunelli D (2017) An innovative cost-effective smart meter with embed-
ded non intrusive load monitoring. In: 2017 IEEE PES innovative smart grid technologies
conference Europe (ISGT-Europe), Torino, pp 1–6
5. Hart GW (1992) Nonintrusive appliance load monitoring. Proc IEEE 80(12):1870–1891
6. Kelly J, Knottenbelt W (2015) Neural NILM: deep neural networks applied to energy disag-
gregation. In: Proceedings of the 2nd ACM international conference on embedded systems for
energy-efficient built environments, pp 55–64
7. Gupta S, Reynolds MS, Patel SN (2010) ElectriSense: single-point sensing using EMI for
electrical event detection and classification in the home. In: Proceedings of the 12th ACM
international conference on ubiquitous computing, pp 139–148
8. Kelly D (2016) Disaggregation of domestic smart meter energy data
9. Bernard T Non-intrusive load monitoring (NILM): combining multiple distinct electrical
features and unsupervised machine learning techniques
10. Porcarelli D, Brunelli D, Benini L (2014) Clamp-and-Forget: a self-sustainable non-invasive
wireless sensor node for smart metering applications. Microelectron J 45(12):1671–1678
11. Balsamo D, Porcarelli D, Benini L, Brunelli D (2013) A new non-invasive voltage measurement
method for wireless analysis of electrical parameters and power quality. In: SENSORS, IEEE,
Baltimore, MD, pp 1–4
12. Porcarelli D, Brunelli D, Benini L (2012) Characterization of lithium-ion capacitors for low-
power energy neutral wireless sensor networks. In: 2012 ninth international conference on
networked sensing (INSS), Antwerp, pp 1–4
13. Brunelli D, Caione C (2015) Sparse recovery optimization in wireless sensor networks with a
sub-Nyquist sampling rate. Sensors (Switzerland) 15(7):16654–16673
14. Negri L, Sami M, Macii D, Terranegra A (2004) FSM-based power modeling of wireless
protocols: the case of Bluetooth. In: Proceedings of the 2004 international symposium on
low power electronics and design (IEEE Cat. No.04TH8758), Newport Beach, CA, USA, pp
369–374
Chapter 56
Design of a SpaceFibre High-Speed
Satellite Interface ASIC
Abstract In the last few years, the data rate requirement of on-board data handling
for space missions has grown continuously, due to the presence of high-resolution
instruments. This led the European Space Agency to start working on a new
communication standard named SpaceFibre. It supports a data rate of 6.25 Gbit/s per
communication lane (up to 16 communication lanes). This work proposes the design
of a SpaceFibre interface Application Specific Integrated Circuit. The block diagram
of the system is presented, together with results in terms of area occupation and
power consumption (excluding serialiser-deserialiser circuitry) after synthesis
on a 65 nm CMOS technology.
56.1 Introduction
SpaceFibre is a novel very high-speed serial link, designed specifically for space
applications. It has recently been standardized by the European Cooperation for Space
Standardisation (ECSS) [1]. It can thus be adopted as the on-board data handling protocol
in next-generation space missions, from Earth observation to telecom and science
satellites, where the data rate requirement is particularly demanding (e.g. FLEX [2]
and BIOMASS [3] will mount Synthetic Aperture Radars (SARs) and hyper-spectral
imagers, which will require high-speed on-board communication). Different missions
obviously have different on-board communication architectures. However, in the
following we present a scheme which describes a generic high-speed communication
architecture for space applications.
Figure 56.1 shows a schematic satellite on-board data handling
topology. Generally, several instruments are hosted on the same spacecraft; each one
produces a significant amount of data, which are then processed by front-end
electronics and sent to the Main Control Unit (MCU) of the experiment, where data
are stored, processed or sent to the downlink. Redundancy is usually required (i.e.
each link should be doubled). Moreover, each instrument may have a separate bus
for communication with the MCU. It is known that constraints in space applications are
particularly harsh, especially in terms of radiation tolerance, fault tolerance, low
power consumption, harness and data rate. Such stringent requirements have led to the
development of highly optimised solutions rather than the adaptation of existing
commercial products.
Fig. 56.2 SpaceFibre protocol stack (figure labels: Management Interface, Packet Interface,
Broadcast Interface, Network layer, Multi-lane layer, Lane layer, Physical layer, Physical interface)
requirements. It is able to operate on both copper cables and optical fibre and supports
a data rate of 6.25 Gbit/s per communication lane. SpaceFibre includes built-in
quality of service (QoS) and Fault Detection, Isolation and Recovery (FDIR) techniques,
which provide system-level benefits without requiring a complex, limiting
software implementation. SpaceFibre is currently being integrated onto various FPGA
technologies by IngeniArs [5], STAR-Dundee [4] and Cobham Gaisler [6]. The
SpaceFibre protocol stack is shown in Fig. 56.2.
– Network Layer is responsible for packet transfer over the link or the network.
This is an optional layer, see [7].
– Data Link Layer is responsible for the QoS, flow control and for resending in-
formation in case a temporary fault occurs over the link. It is also responsible for
packaging data to be sent over the link and for broadcasting (and receiving) short
messages across a SpaceFibre network.
– Multi-lane Layer is responsible for running several SpaceFibre lanes in parallel
to provide higher data throughput. This is an optional layer.
– Lane Layer is responsible for lane initialisation, error detection and re-initialisation.
Symbols are encoded with 8b/10b encoding [8], with AC coupling of data
signals.
– Physical Layer is responsible for serializing and de-serializing encoded symbols
and for transmitting them over the physical link. It also recovers clock from the
received data.
– Management Layer is responsible for the control and configuration of each layer.
A SpaceFibre link is composed of two differential lines, one for serial data trans-
mission and one for serial data reception. The clock signal is transmitted together
486 P. Nannipieri et al.
with the data as symbols are processed with 8b10b encoding, providing a number
of bit transitions sufficient to recover the clock from the incoming bit stream with a
PLL. Attempts have been made in the literature to design and build a SpaceFibre interface
ASIC [9]. Unfortunately, no details are shared about the technology node chosen,
the circuit complexity or the power consumption. Moreover, the work reported in [9]
refers to a very early draft version of the standard (F3). Therefore, we identified the
need to document an ASIC implementation of a SpaceFibre codec fully compliant with
the recently released final version of the standard, including an indication of area and
power consumption. In this work, the IngeniArs SpaceFibre IP has been used for the
design of a SpaceFibre ASIC. In Sect. 56.3, the architecture of the synthesized circuit
is shown, and in Sect. 56.4 synthesis results in terms of area occupation and power
consumption for the single SpaceFibre interface are presented. Finally, in Sect. 56.5
conclusions are drawn.
In Fig. 56.3, the block diagram of the system is shown. The blocks displayed are the
following:
– SPFI Codec: SpaceFibre single lane CODEC, implemented by IngeniArs. To have
two independent communication lanes, two separate single lane codecs have been
included.
– VC Switch: as the SpaceFibre interface is wide due to the presence of several
separate Virtual Channels, a multiplexer is necessary to reduce the number of
I/O pins.
– SPI Slave: this block, which enables the device to be configured as an SPI slave
peripheral, is provided as external IP.
– SERialiser DESerialiser (SERDES) Interface: the lower part of the SpaceFibre
Lane layer. It comprises 8b10b encoding/decoding, symbol and word
synchronisation and elastic buffering.
– HSSL (High Speed Serial Link): a technology-dependent IP that serialises and
de-serialises the input and output data streams. This block is the SERDES itself,
which is usually technology dependent; it is therefore not taken into account in
the presented results.
The HDL of the CODEC itself has been tested and validated in previous work [5].
rad-hardened silicon technology. The system has been synthesised in order to reach
the operating clock frequency of 312.5 MHz in the lower sections of the CODEC
(8b10b encoder/decoder, symbol synchronizer), where the data path is 2 symbols
wide, and 156.25 MHz in the rest of the CODEC, where the data path is 4 symbols
wide. Both correspond to a serial data rate of 6.25 Gbit/s (the fastest data rate
reachable according to the standard). Area occupation is presented in Table 56.1 and
the estimated power consumption is presented in Table 56.2. A NAND cell area of
3.12 µm² is considered to compute the gate equivalent area.
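As a quick consistency check on the clock figures above, the following sketch multiplies each clock frequency by the width of its data path. It assumes that every symbol occupies 10 bits on the serial line after 8b/10b encoding; the function name is illustrative only and does not come from the IP.

```python
# Consistency check: parallel clock x symbols per cycle x bits per encoded symbol.
BITS_PER_8B10B_SYMBOL = 10  # each 8-bit value is transmitted as a 10-bit code group


def serial_rate_bps(clock_hz, symbols_per_cycle):
    """Serial line rate sustained by a parallel data path of the given width."""
    return clock_hz * symbols_per_cycle * BITS_PER_8B10B_SYMBOL


lower = serial_rate_bps(312.5e6, 2)    # 8b10b encoder/decoder, symbol synchroniser
upper = serial_rate_bps(156.25e6, 4)   # rest of the CODEC
payload = lower * 8 / 10               # bits left once the 8b/10b overhead is removed

print(lower / 1e9, upper / 1e9, payload / 1e9)  # rates in Gbit/s
```

Both data paths sustain the same 6.25 Gbit/s line rate, of which 5 Gbit/s remains for data before any protocol overhead is counted.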
56.5 Conclusions
References
Abstract This paper proposes a method for depth estimation in video sequences
acquired by a monocular camera mounted on a mobile platform. The proposed algo-
rithm is able to estimate in real time the relative distances of the objects in the field of
view exploiting the parallax effect, provided the platform movement complies with
a few constraints. The developed system is designed to operate at the input pixel
cadence and is thus applicable to any video resolution. The final architecture, using
operators no more complex than an adder and a memory that is just a fraction of a
frame memory, can be realized in a low-cost FPGA.
57.1 Introduction
Recent climatic upheavals lead to environmental catastrophes like the one that
happened in northern Italy in autumn 2018, when many thousands of centuries-old trees
were uprooted during a storm in a wide and sparse mountain area. To help cope
with these phenomena, continuous monitoring of the territory is required, which
in turn implies the availability of low-cost systems able to operate autonomously
and capable of reaching areas that are poorly accessible by normal vehicles or even on
foot. Drones are a very interesting option [1] for this task. To increase their capability
of autonomous flight, they should be equipped with an economical and effective system
able to estimate the distance of obstacles in a 2-D view, in order both to analyze
the underlying territory and to independently establish the most appropriate route,
the exploration area and possibly the landing area.
Systems capable of detecting the distance of a target can adopt several types of
sensors [2], often combined with dedicated illuminators. They can be based e.g. on
laser beams, on infrared systems, or on ultrasonic sources; they differ in range, accu-
racy, sensitivity, resolution, and so on. Time of Flight cameras are devices equipped
with an illuminator [3] and a special camera able to evaluate for each pixel the time
taken by the emitted beam to be reflected back to the camera. These systems are
generally complex and show important limitations. They are affected by other light
sources and present a rather limited range.
Systems based on laser scanning can be easily used to deal with large distances [4],
but are quite expensive, operate with difficulty if mounted on a moving acquisition
system, and typically require some time to perform a complete 2-D acquisition.
Passive multi-camera systems are cheaper and more robust, but have a range
limited to a few tens of meters; moreover, they are obviously more complex, expen-
sive and heavier than a possible single-camera system. The latter, however, typically
adopts quite complex algorithms [5–7], often based on neural networks, to provide
the distance information; these algorithms may be unsuitable for real-time appli-
cations since they require massive computing resources and actually provide fairly
approximate results.
To overcome these limitations, we have designed a simple and effective method
that can operate in real time using just a single camera. The developed system,
mounted on a moving platform (drone, aircraft, operating machine, . . .) uses the
principles of stereoscopic vision but takes advantage of the different acquisitions
made by a single camera during the platform motion.
The depth estimation algorithm that we propose in this paper can be realized using
a low-cost FPGA. The proposed architecture works at the cadence of the input pixels
and is therefore independent of the resolution of the input images. Furthermore, the
proposed design uses only operators no more complex than an adder and a memory
as large as only a fraction of a frame memory.
The method we developed to extract the distance information from the image se-
quence acquired by the camera relies on the following constraints:
– the optical axis of the camera should be orthogonal to the movement direction;
– the camera should move at an approximately constant speed;
– the direction of the motion should be close to the horizontal axis of the acquired
frame.
Roughly speaking, the drone should follow a straight path at a steady pace, orthogonal
to the axis of the camera. It should be noted that the third constraint is not
fundamental for the method but, when satisfied, permits a major simplification in
the implementation of the algorithm. Moreover, this constraint is not particularly
binding and can be met simply by rotating the camera around its optical axis.
Thanks to this particular configuration, the distance information can be inferred
from the relative motion of the various objects in the scene. Taking advantage of the
57 An FPGA Realization for Real-Time Depth Estimation … 491
parallax, in fact, the objects closest to the camera appear to move faster than those
placed in the background; objects placed at infinite distance will appear practically
still. Moreover, thanks to the third constraint, this motion is purely horizontal. The
devised method estimates the apparent horizontal speed of the various parts of the
scene and, from these estimates, their distance from the camera.
To evaluate the movements within a sequence of images, a technique commonly
adopted in the literature is block matching. For our applications, however, we have
some information about the motion which can be exploited to simplify the search. Indeed,
we already know that the motion will proceed mostly along the horizontal direction,
with a known orientation and a limited magnitude.
The main idea is therefore to switch from a standard 2-D block matching to a 1-D
matching of the vertical projections of the blocks. Two blocks match if the vectors
of the averages of their columns are similar.
Considering two video frames having size m × n, the first phase of the algorithm
consists in sectioning the input images into horizontal slices of size b × n, where b
is the size of the chosen block. These slices can partially overlap. If o_v is the overlap
amount in pixels, the total number of slices of the image will be
\[ u = \frac{m}{b - o_v} \qquad (57.1) \]
For each slice we then proceed to calculate all the vertical averages of the pixels:
\[ p_i(j) = \frac{1}{b} \sum_{k=0}^{b-1} x_i(k, j) \qquad (57.2) \]
where p_i(j) are the projections of the pixels of slice i and x_i(k, j) represents the
pixel in position (k, j) in slice i. The projection is further segmented into a suitable
number of blocks which may also partially overlap by o_h positions; then, the number
of blocks available from each projection is
\[ v = \frac{n}{b - o_h} \qquad (57.3) \]
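For each block l belonging to the slice i, the algorithm searches for the best matching
block on the same slice in the second image; let d_{i,l} be its shift with respect to the
position of the reference block. To determine the best match we minimize the SAD
(sum of absolute differences), i.e. d_{i,l} is the shift d at which the sum reaches its
minimum, the first projection being taken from the reference image and the second
from the second image:
\[ d_{i,l} = \arg\min_{d} \sum_{k=0}^{b-1} \left| p_i(s + k) - p_i(s + k + d) \right| \qquad (57.4) \]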
492 S. Marsi et al.
where
\[ 0 \le l < v, \qquad 0 \le i < u, \qquad s = l\,(b - o_h) \qquad (57.5) \]
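The matching procedure of Eqs. (57.2), (57.4) and (57.5) can be sketched in software as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_shift` search range, the restriction to non-negative shifts (the motion orientation is taken as known) and the handling of blocks near the frame border are all choices made here for the example.

```python
def vertical_projections(frame, b, ov):
    """Column averages of each horizontal slice of height b (Eq. 57.2).

    frame is a list of m rows, each a list of n grey levels.
    Consecutive slices start (b - ov) rows apart.
    """
    m, n = len(frame), len(frame[0])
    projections = []
    for top in range(0, m - b + 1, b - ov):
        rows = frame[top:top + b]
        projections.append([sum(row[j] for row in rows) / b for j in range(n)])
    return projections


def best_shift(p_ref, p_cur, s, b, max_shift):
    """Shift d in [0, max_shift] with the smallest SAD (Eq. 57.4)."""
    best_d, best_sad = 0, float("inf")
    for d in range(max_shift + 1):
        if s + d + b > len(p_cur):      # candidate block would leave the frame
            break
        sad = sum(abs(p_ref[s + k] - p_cur[s + k + d]) for k in range(b))
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d


def displacement_map(frame_a, frame_b, b, ov, oh, max_shift):
    """One estimated shift per (slice, block); block l starts at s = l * (b - oh)."""
    proj_a = vertical_projections(frame_a, b, ov)
    proj_b = vertical_projections(frame_b, b, ov)
    n = len(frame_a[0])
    starts = range(0, n - b + 1, b - oh)
    return [[best_shift(pa, pb, s, b, max_shift) for s in starts]
            for pa, pb in zip(proj_a, proj_b)]


# Synthetic pattern: the second frame is the first one moved 3 pixels to the right.
row = [(37 * j) % 251 for j in range(32)]
frame_a = [row[:] for _ in range(8)]
frame_b = [[0, 0, 0] + row[:-3] for _ in range(8)]
shifts = displacement_map(frame_a, frame_b, b=4, ov=0, oh=0, max_shift=5)
```

On this synthetic pair every block recovers a shift of 3 except the last one in each slice, whose search window runs off the frame edge.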
57.3 Architecture
The proposed method presents several features which can be fruitfully exploited in
a real-time realization:
– it requires a very small memory;
– it can be realized using operators not more complex than an adder;
– it can be highly parallelized by organizing the operations into a pipeline; in this
way the system can work at the input pixels rate, and therefore the implementation
is suitable for any desired video resolution.
The system takes as input a video stream which sequentially provides, line by line,
all the pixels coded with an 8-bit gray-level representation. The processing, organized
in a pipeline, can be subdivided into the following steps:
– Projection processing: Using port A of a true dual port memory M p , all vertical
projections of the pixels blocks are calculated pixel by pixel loading the previous
data from the memory, adding to them the present pixel, and re-storing the results
in real time. When all the pixels constituting the slice have arrived, the memory
address pointer moves to the next row and proceeds to evaluate the projections for
the next slice.
Fig. 57.1 Simplified block scheme (without the control system) for the first two algorithm steps:
the projection processing and the matching evaluation
– Matching: Using the second port of the memory M p (port B), the system accesses
the data belonging to the area of interest and stores the projections values into
two local buffers. From there, through appropriate pointers the system accesses
the values of pi (s + k) and pi (s + k + d) (see Eq. 57.4) to determine the absolute
value of the difference, and accumulates the results in a temporary register by
varying k. When k reaches the value b − 1, the data in the register is stored to the
location (d, s) of a memory Mm .
An approximate block scheme of these two steps is depicted in Fig. 57.1.
– Minimum estimation: By sequentially analyzing the data stored in the memory
Mm , a dedicated circuit searches for the minimum in each row and determines
its location. The location, which represents the estimated displacement of the
block, is stored in a memory M D composed of a suitable number of shift registers
of length v that feed the following median filter. The number of shift registers
corresponds to the size of the median filter.
– Median Filter: To reduce local errors, the data in the memory M D are
filtered through a suitable (typically 3 × 3) median filter, realized as a systolic
array of comparators [8, 9], before being supplied as output data.
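The first and last steps of the pipeline described above can be modelled in software as follows. This is a behavioural sketch only, with hypothetical names. It accumulates column sums rather than averages, which is an assumption made here: the two differ only by the constant factor 1/b, so the location of the SAD minimum is the same. The median is computed by sorting, not with the systolic comparator array of [8, 9], and border cells are simply left unfiltered.

```python
def stream_projections(pixels, n, b):
    """Consume pixels in raster order; yield n column sums for every slice of b rows.

    Models the read-add-write loop on one memory row: each incoming pixel is
    added to the running sum stored for its column.
    """
    acc = [0] * n
    for idx, px in enumerate(pixels):
        row, col = divmod(idx, n)
        acc[col] += px
        if col == n - 1 and row % b == b - 1:   # last pixel of the slice
            yield acc
            acc = [0] * n                        # start the next slice


def median3x3(disp):
    """3 x 3 median of a displacement map; border cells are copied unchanged."""
    rows, cols = len(disp), len(disp[0])
    out = [r[:] for r in disp]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            window = [disp[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = sorted(window)[4]        # middle of the nine values
    return out


sums = list(stream_projections(range(24), n=4, b=3))
cleaned = median3x3([[2, 2, 2], [2, 9, 2], [2, 2, 2]])
```

In the example, the isolated outlier 9 in the centre of the map is replaced by the value of its neighbours.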
Some overlap among the blocks is desirable, since it improves the resolution
by increasing u and v. Indeed, both the slices of pixels on which the projections are
calculated and the blocks on which the matching is performed can overlap. This solution
obviously requires more resources, but their organization can be easily planned
by maintaining the same pipeline operation and simply replicating the structure
described above.
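To give a feeling for how overlap affects the output resolution, the sketch below evaluates Eqs. (57.1) and (57.3) for an HD 1080 frame. The block size b = 16 and the overlap values are hypothetical, chosen only for illustration, and rounding the quotients down is an assumption made here.

```python
def block_counts(m, n, b, ov, oh):
    """Slices u (Eq. 57.1) and blocks per slice v (Eq. 57.3), rounded down."""
    return m // (b - ov), n // (b - oh)


no_overlap = block_counts(1080, 1920, b=16, ov=0, oh=0)
half_overlap = block_counts(1080, 1920, b=16, ov=8, oh=8)
print(no_overlap, half_overlap)
```

With a 50% overlap in both directions the number of estimated displacements roughly quadruples, at the cost of the replicated hardware mentioned above.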
The method has been tested using several sequences compliant with the constraints
reported in Sect. 57.2. We show here two relevant results.
In Fig. 57.2a a drone flies above a forest and captures a view of the underlying
elements. The estimated distances are reported in false colours superimposed on the
original frame (Fig. 57.2b).
By increasing the time delay between the pair of frames used, the distance can also be
estimated for faraway objects: in Fig. 57.2c the drone acquires a lateral view of
the “Vertical Forest” skyscrapers in Milan. It can be noted from Fig. 57.2d (shown
in gray levels for a better visualization) that the system is able to discriminate the
distance not only of the elements in the foreground, but also of the most distant
ones, such as the skyscrapers in the background. Both videos have been taken with
HD 1080 resolution at 30 frames/s; in the first sample we have used all the frames, while
in the second one the time delay between the considered frames has been increased
by a factor of 10.
It should be noticed that in these first experiments only the relative distances of
the objects have been assessed. However, using further information always available
in practical cases (speed of the drone, data of the optics and from other sensors of
the on-board camera), the distances can be determined in a quantitative way.
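One way such a quantitative estimate could be obtained is the standard pinhole stereo relation, in which the two frames act as a stereo pair whose baseline is the distance travelled by the camera between them. The sketch below is an assumption-laden illustration and is not taken from the paper: it presumes that the motion constraints of the method hold, that the focal length is known in pixel units, and all names and numbers are hypothetical.

```python
def depth_m(shift_px, focal_px, speed_m_s, frame_gap, fps):
    """Distance along the optical axis from the apparent horizontal shift.

    Z = f * B / d, with baseline B = speed * (frame_gap / fps).
    """
    if shift_px <= 0:
        return float("inf")                      # no apparent motion: object at infinity
    baseline_m = speed_m_s * frame_gap / fps     # camera travel between the two frames
    return focal_px * baseline_m / shift_px


# Hypothetical numbers: 1500 px focal length, 6 m/s, frames 10 apart at 30 frames/s.
near = depth_m(12, focal_px=1500, speed_m_s=6.0, frame_gap=10, fps=30)
far = depth_m(6, focal_px=1500, speed_m_s=6.0, frame_gap=10, fps=30)
```

Halving the apparent shift doubles the estimated distance, which is why a longer delay between frames (a longer baseline) helps with distant objects.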
Fig. 57.2 Frames extracted from sequences acquired by a drone flying horizontally and taking up
an above (a, b) or side (c, d) view
References
1. Saracin CG, Dragos I, Chirila AI (2017) Powering aerial surveillance drones. In: 2017 10th
international symposium on advanced topics in electrical engineering (ATEE), Mar 2017, pp
237–240
2. Schöning J, Heidemann G (2016) Taxonomy of 3D sensors - a survey of state-of-the-art
consumer 3D-reconstruction sensors and their field of applications. In: Proceedings of the 11th
joint conference on computer vision, imaging and computer graphics theory and applications,
vol 3: VISAPP (VISIGRAPP 2016). INSTICC, SciTePress, pp 192–197
3. Foix S, Alenya G, Torras C (2011) Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sens
J 11:1917–1926
4. Xharde R, Long B, Forbes D (2006) Accuracy and limitations of airborne LiDAR surveys in
coastal environments. In: 2006 IEEE international symposium on geoscience and remote sensing,
Jul 2006, pp 2412–2415
5. Duan X, Ye X, Li Y, Li H (2018) High quality depth estimation from monocular images based
on depth prediction and enhancement sub-networks. In: 2018 IEEE international conference on
multimedia and expo (ICME), Jul 2018, pp 1–6
6. Wang J, Liu H, Cong L, Xiahou Z, Wang L (2018) CNN-monofusion: online monocular dense
reconstruction using learned depth from single view. In: 2018 IEEE international symposium
on mixed and augmented reality adjunct (ISMAR-Adjunct), Oct 2018, pp 57–62
7. Zhang Z, Xu C, Yang J, Gao J, Cui Z (2018) Progressive hard-mining network for monocular
depth estimation. IEEE Trans Image Process 27:3691–3702
8. Hu Y, Ji H (2009) Research on image median filtering algorithm and its FPGA implementation.
In: 2009 WRI global congress on intelligent systems, vol 3, May 2009, pp 226–230
9. Cadenas J (2015) Pipelined median architecture. Electron Lett 51(24):1999–2001
Part XI
Digital Circuits and Systems
Chapter 58
Integration of a SpaceFibre IP Core
with the LEON3 Microprocessor
Through an AMBA AHB Bus
Abstract Nowadays, requirements for satellite electronics are becoming more stringent
due to the increasing complexity of space missions. In particular, data rate
requirements are growing due to the adoption of high-speed payloads, such as
Synthetic Aperture Radars and hyper-spectral imagers, that exceed the capability
of state-of-the-art on-board data handling systems. The European Space Agency
responded to this need by introducing a new high-speed communication protocol,
SpaceFibre. At the same time, data collected by high-speed interfaces may be
processed on board with specific hardware or with a general-purpose microprocessor
such as the LEON3. The aim of this paper is to describe the integration of a SpaceFibre IP
core in the Cobham Gaisler GRLIB library, combining the functionalities offered
by the SpaceFibre CODEC with the potential of the LEON3 microprocessor.
Implementation results on a Xilinx Virtex-6 and an analysis of the performance of the
SpaceFibre interface on an AMBA 2.0 AHB bus are presented.
58.1 Introduction
In recent years, data rate requirements for satellite on-board data handling systems
have grown continuously: for example, payloads such as Synthetic Aperture Radars (SARs)
and hyper-spectral imagers require high-speed communication interfaces, able to
transfer several Gb/s. The European Space Agency (ESA) responded to this need by
introducing a new protocol, SpaceFibre (SpFi) [1], a multi-Gb/s high-speed serial
link that supports data rates up to 6.25 Gb/s per lane. SpaceFibre is a candidate to
become the successor of the SpaceWire (SpW) protocol [2], which is the state of
the art for spacecraft on-board communication and supports a maximum data rate of
200 Mb/s. SpFi is backward compatible with SpW at packet level and can operate on
both copper cable and optical fibre. The protocol stack described in the SpFi standard
[3] is composed of:
• Network layer: it is responsible for transferring data packets over a SpFi network.
The Network layer is optional and can be omitted for a point-to-point communication
link.
• Data Link layer: it is responsible for the Fault Detection Isolation and Recovery
(FDIR) mechanism, which resends a data packet in case a communication error
occurs. It handles independent flows of information through Virtual Channels
(VCs) and the broadcast service.
• Multi-lane layer: it allows the communication to be parallelized over up to 16 lanes.
The Multi-lane layer is optional.
• Lane layer: it is responsible for establishing the communication between the two
ends of the communication link. Data words are encoded/decoded using 8b/10b
encoding/decoding.
• Physical layer: it is responsible for sending/receiving data over the physical link.
For more details about the architecture of the protocol stack layers, please refer to the
SpaceFibre standard [3].
The GRLIB Intellectual Property (IP) library by Cobham Gaisler is a collection
of IPs (e.g. Ethernet interface, memory controller, etc.) interconnected through
an Advanced Microcontroller Bus Architecture (AMBA) 2.0 Advanced High-Performance
Bus (AHB). It also includes the LEON3, a 32-bit Reduced-Instruction-Set-Computing
(RISC) processor compliant with the Scalable Processor ARChitecture
(SPARC) V8, available under the GNU GPL license [4]. The LEON3 microprocessor
exploits the SPARC V8 instruction set, and it has a 7-stage pipeline and a
Floating-Point Unit (FPU) compliant with the IEEE-754 floating-point standard. The LEON3
processor is compatible with the AMBA 2.0 AHB bus interface and runs up to
125 MHz on FPGAs, guaranteeing 1.4 DMIPS/MHz. A fault-tolerant and Single
Event Upset (SEU) proof version of the LEON3, the LEON3FT, is available, and its
Single Event Effects (SEEs) performance has been evaluated [5].
The LEON3 processor was exploited in several ESA missions such as CHEOPS
[6] and PROBA-3 [7] and National Aeronautics and Space Administration (NASA)
missions such as Lunar Flashlight, INSPIRE and MarCO [8]. The LEON3 found
different applications inside the avionic system. Indeed, in many space systems it is
adopted as the On-Board Computer (OBC) [9], whose functionalities and architecture
are specified by the Space AVionics Open Interface aRchitecture (SAVOIR) initiative
[10]. In this case, the use of high-throughput data links is necessary for data
transmission from the payload to the platform (e.g. science data that are necessary for the
control of a platform) and for transmitting platform telemetry data using payload
telemetry hardware, as described in [10].
Even when it is used for different aims, applications may require a medium/high-
speed data link for communicating with the LEON3 processor. For instance, in
CHEOPS [6], the LEON3 is used to process sensory data and it is interfaced through
the SpW protocol. In the CCSDS File Delivery Protocol (CFDP) IP Core present in
ESA portfolio, a LEON3 processor is exploited to implement various IP function-
alities, and a high-throughput data link (e.g. Ethernet, SpW) is used to realize the
UnitData Transfer layer [11].
In view of that, the aim of this paper is to describe the integration of the IngeniArs
SpaceFibre IP core [12] in the GRLIB IP library, in order to combine the high data
rate capabilities offered by the SpFi standard with the features of the state-of-the-art
space-qualified LEON3 microprocessor. In Sect. 58.2, a description of the system
architecture is presented. In Sect. 58.3, implementation results for a Xilinx Virtex-6
are presented and discussed. Finally, in Sect. 58.5 conclusions are drawn.
(Figure labels: JTAG, Bitstream, AHB/APB bridge, APB slave, SpaceFibre register file, MIG DDR3
memory controller, AHB master, Rx DMA, Tx DMA, SpaceFibre IP core, Rx SERDES, Tx SERDES)
Fig. 58.1 Architecture of the GRLIB-based system with the SpaceFibre IP core
The system described in Sect. 58.2 has been implemented on a Xilinx Virtex-6 ML605
Evaluation kit, mounting a XC6VLX240T-1FFG1156 FPGA. Implementation results
are presented in terms of Look-Up-Tables (LUTs) (Virtex-6 family combinatorial
logic is based on 6-input LUTs), Registers (Reg), Block RAM18 (18 Kb block RAM)
and Block RAM36 (36 Kb block RAM) necessary for the implementation of the
proposed design. The percentage of used resources out of the total is also indicated.
The LEON3 maximum frequency for the target FPGA is 80 MHz. The SpFi CODEC
target frequency is 62.5 MHz, which guarantees a data rate of 2.0 Gb/s (the protocol
transmits/receives a 32-bit word per clock cycle to/from the bus). The implementation
of the LEON3 also requires four Digital Signal Processors (DSPs) (not included in
Table 58.1).
The integration of the SpaceFibre IP core in the Cobham Gaisler GRLIB is intended
to exploit the potential of the AMBA 2.0 AHB bus, which supports high-bandwidth
Table 58.1 Resource occupation for the GRLIB-based system and for the SpaceFibre IP core on
a Xilinx Virtex-6 FPGA
                           LUT    LUT (%)  Reg.   Reg. (%)  Block RAM18  Block RAM18 (%)  Block RAM36  Block RAM36 (%)
SpFi interface             5245   3        3629   1         12           1                0            0
Leon3/GRLIB peripherals    27842  18       16579  5         8            1                26           3
Total                      33087  21       20208  6         20           2                26           3
\[ \delta = \beta \cdot f_{bus} \cdot n \qquad (1) \]
The SpaceFibre IP core master interface has a data rate δspfi of 2 Gb/s. To transfer
data at full speed, its bus occupation βspfi shall be as shown in Eq. (2):
\[ \beta_{spfi} = \frac{\delta_{spfi}}{f_{bus} \cdot n} \qquad (2) \]
This means that, for lower values of βspfi , the actual data rate of the SpaceFibre IP is
reduced owing to the limited availability of the AHB bus. On the other hand, if
the IngeniArs SpaceWire IP core is considered [13], a maximum data rate δspw of
200 Mb/s is available. For this reason, to transfer data at full speed it requires a bus
occupation βspw , as shown in Eq. (3):
\[ \beta_{spw} = \frac{\delta_{spw}}{f_{bus} \cdot n} \qquad (3) \]
These results suggest that SpaceWire becomes the bottleneck in the data transfer
whenever a bus occupation higher than 7.81% can be guaranteed. In these cases
SpaceFibre is the better choice, since it provides a higher communication
throughput by exploiting the available capacity of the AMBA 2.0 AHB
bus.
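The bus-occupation figures can be checked numerically by solving Eq. (1) for β. The sketch below assumes a bus clock of 80 MHz (the LEON3 maximum frequency reported for the target FPGA) and a 32-bit bus word; these two values are inferred here, being the ones that reproduce the 7.81% threshold quoted in the text. Names are illustrative.

```python
def bus_occupation(data_rate_bps, f_bus_hz, width_bits):
    """Fraction of bus cycles needed to sustain data_rate_bps (Eq. (1) solved for beta)."""
    return data_rate_bps / (f_bus_hz * width_bits)


F_BUS_HZ = 80e6   # assumed: LEON3 maximum frequency on the target FPGA
WIDTH_BITS = 32   # assumed: one 32-bit word per bus transfer

beta_spfi = bus_occupation(2e9, F_BUS_HZ, WIDTH_BITS)    # SpaceFibre at 2 Gb/s
beta_spw = bus_occupation(200e6, F_BUS_HZ, WIDTH_BITS)   # SpaceWire at 200 Mb/s
codec_rate = 62.5e6 * 32                                 # CODEC clock x word width

print(f"{beta_spfi:.2%} {beta_spw:.2%} {codec_rate / 1e9} Gb/s")
```

Under these assumptions SpaceFibre needs about 78% of the bus to run at full speed, whereas SpaceWire saturates at about 7.81%; the 2 Gb/s CODEC data rate follows from 62.5 MHz times 32 bits.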
504 G. Dinelli et al.
58.5 Conclusions
In this paper, the integration of the IngeniArs SpaceFibre IP core in the Cobham
Gaisler GRLIB library is presented, in order to combine the SpFi high data rate capability
with the state-of-the-art space-qualified LEON3 microprocessor. This platform can
represent an enabling technology for future high-speed processing systems involving
the newest high-bandwidth spacecraft payloads. In particular, a system architecture
interconnected through an AHB bus, including the LEON3 and the IngeniArs
SpaceFibre IP core, is described. Furthermore, implementation results of the
proposed architecture on a Xilinx Virtex-6 are presented and an analysis of the SpFi
achievable data rate on an AMBA 2.0 AHB bus is discussed.
References
Abstract Electronic devices that operate in the extreme space environment
require a high grade of reliability in order to mitigate the effect of ionizing
particles. For COTS components this can be achieved using fault-tolerant design
techniques, which allow such designs to fulfil the space mission requirements. This paper
presents the design and the implementation of one member of the Klessydra F03x
microcontroller soft core family, called the F03_mini, which is a RISC-V RV32I compatible
fault-tolerant architecture enhanced by a full/partial Hardware Thread (HART)
protection and a thread-controlled Watch-Dog Timer module. The core architecture has
been synthesized and implemented on an ARTIX-7 A35 FPGA, and fault injection
by means of functional RTL simulation has been performed in order to evaluate
the robustness to Single Event Effects (SEE). Experimental results are provided,
illustrating the impact and the benefits obtained by the usage of the proposed TMR
protection techniques as well as a thread-controlled Watch-Dog Timer.
59.1 Introduction
Electronic devices that operate in the extreme space environment require a high
grade of reliability in order to mitigate several effects of ionizing particles [1]. In our
design, we considered only soft errors (SE), such as Single Event Effects (SEE), as
we focus on low clock speed (25 MHz) applications.
The usage of Commercial Off-the-Shelf (COTS) components, as well as of an
open-source Instruction Set Architecture (ISA), allows a reduction in cost in view of the low
volume demand for aerospace applications. From this point of view, the growing
interest in an extendable microprocessor ISA has
led many companies to support the RISC-V open standard [2, 3].
Since components of this kind are not intrinsically protected at hardware level,
a fault-tolerant architecture design is required in order to comply with the severe
environment requirements as well as with resource availability [4, 5].
This paper describes the design and the implementation of a compact variant of the
Klessydra F03x microcontroller soft core family (named F03d or F03_mini) which
is a RISC-V RV32I compliant, fault-tolerant architecture enhanced by a TMR-based
full/partial Hardware Thread (HART) protection and a Thread Controlled Watch-Dog
Timer (TC-WDT) module.
In the following, Sect. 59.2 provides an overview of the core microarchitecture and
its compatibility with the RV32I instruction set. Section 59.3 describes the proposed
HART full/partial protection techniques, as well as the utilization of the dedicated
TC-WDT. Section 59.4 reports experimental results about the FPGA implementa-
tions and the HDL fault-injection simulation, and finally Sect. 59.5 provides the
conclusions.
The Klessydra processing core family is a set of cores featuring full compliance
with the RISC-V instruction set and intended to be integrated within the PULPino
microcontroller platform [6]. To date, the Klessydra family [7] includes:
• a minimal gate count single-thread core, Klessydra S0;
• a set of multi-threaded low-end IoT-oriented cores, Klessydra T0x;
• a set of multi-threaded fault-tolerant cores, Klessydra F03x, featuring different
fault-tolerance techniques;
• a set of multi-threaded cores, Klessydra T1x, supporting vector processing
acceleration for high-speed controllers in high-end IoT nodes [8].
The Klessydra core family features can be summarized as follows:
• Full compliance with the RISC-V architecture specification (instruction set,
control and status registers, interrupt handling mechanism and calling convention);
59 A RISC-V Fault-Tolerant Microcontroller Core Architecture … 507
In this section we discuss the architectural choices included in the F03_mini core in
order to minimize area overhead required by fault-tolerance features. The core shares
the same baseline architecture as the T03x [9, 10], on which classic Triple Modular
Redundancy (TMR) has been applied (Fig. 59.1). As opposed to the Klessydra F03a
core, featuring full TMR protection on all the HARTs supported by the hardware
microarchitecture, the F03_mini introduces the following general characteristics in
order to save hardware resources:
• Different degrees of error protection among the HARTs.
• Reduced set of Counter and Performance Registers.
All the CSRs and PCs are protected using TMR technique, while the non-critical
Counter and Performance Registers (CPRs) are not protected at all, in order to reduce
the usage of hardware resources.
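The principle behind TMR protection of a register such as a PC or a CSR can be illustrated with a bitwise 2-out-of-3 majority vote over three copies. The sketch below shows the generic technique only; it is not the Klessydra voter, whose implementation is not described here, and the names and values are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three copies of a register."""
    return (a & b) | (a & c) | (b & c)


pc = 0x0000_1F40               # hypothetical program counter value
upset = pc ^ (1 << 7)          # one copy with a single flipped bit (SEU model)
voted = tmr_vote(pc, upset, pc)
```

A single upset in any one copy is masked by the other two; the vote fails only if the same bit is corrupted in two copies.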
Klessydra F03_mini supports the execution of 3 HARTs. The hardware microarchitecture
features a fully protected datapath for HART0 and a weakly protected datapath
architecture for HART1 and HART2. Specifically, the Processing Pipeline (PP) is
completely TMR-protected, while only HART0 has a TMR-protected register file. In
this way it is possible to reduce the use of resources by reducing the reliability of two
HARTs. A limited degree of protection is guaranteed on HART1 and HART2 by the
introduction of the TC-WDT. Moreover, the user can implement software protection
techniques to prevent processing failures on weakly protected HARTs, by exploiting
the thread-communication features offered by the Klessydra architecture. From the
508 L. Blasi et al.
application software point of view, HART0 will handle the mission critical tasks of
the satellite, while HART1 and HART2 will handle non-critical tasks (Fig. 59.2).
The TC-WDT is a critical component for the correct behaviour of the microcontroller
core in the space environment, as it provides a limited degree of
fault tolerance for the weakly protected HARTs whenever a loss of control due to an SEE
occurs within the application program flow.
The TC-WDT can be controlled only by HART0, i.e. the only HART with full
TMR protection. In normal operation, i.e. in the absence of critical SEEs on HART1
and HART2, all threads send their reset request (RST_REQ) to the WDT before
its timeout has elapsed. The reset command (RST_CMD) to the WDT can be sent
only by HART0, once it has verified the correctness of the results of the weakly protected
HARTs (HART1 and HART2). All requests and commands are performed through
write/read accesses to memory-mapped registers, reached via the AMBA Advanced Peripheral Bus
(APB) interface. The complete reset request sequence is described by the following
points:
1. HART1 and HART2 send their reset requests (WDT_RST command).
2. The WDT enables the flags for HART1 and HART2 (in the WDT_CSR register).
3. HART0 periodically checks both flags in the WDT to verify the correct behaviour
of HART1 and HART2.
4. HART0 requests the WDT reset by the WDT_RST command (Fig. 59.3).
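The four steps above can be sketched as a small behavioural model. This is only an illustration under assumptions: the class, method and flag names are hypothetical, the flag dictionary merely stands in for the WDT_CSR bits, and the check of result correctness by HART0 is reduced to a flag check.

```python
class TcWdt:
    """Toy model of the watchdog flags; register and field names are assumed."""

    def __init__(self, timeout: int) -> None:
        self.timeout = timeout
        self.counter = timeout
        self.flags = {1: False, 2: False}   # stands in for WDT_CSR flag bits

    def reset_request(self, hart: int) -> None:
        # Steps 1-2: a weakly protected HART asks for a reset, the WDT sets its flag.
        self.flags[hart] = True

    def reset_command(self, hart: int) -> bool:
        # Step 4: only HART0 may reload the counter, and only if both flags are set.
        if hart != 0 or not all(self.flags.values()):
            return False
        self.counter = self.timeout
        self.flags = {1: False, 2: False}
        return True

    def tick(self) -> bool:
        # Returns True when the timeout elapses without a reset command.
        self.counter -= 1
        return self.counter <= 0


def hart0_check(wdt: TcWdt) -> str:
    # Step 3: HART0 polls both flags before issuing the reset command.
    missing = [h for h, ok in wdt.flags.items() if not ok]
    if missing:
        return f"mismatch on HART{missing[0]}"
    wdt.reset_command(0)
    return "ok"


wdt = TcWdt(timeout=100)
wdt.reset_request(1)
print(hart0_check(wdt))   # HART2 has not reported yet
wdt.reset_request(2)
print(hart0_check(wdt))
```

In a real system the mismatch branch would call the application-dependent recovery routine rather than return a string.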
According to the above description, whenever an SEU causes a loss of control
within the program flow of HART1 or HART2, HART0 will detect a mismatch
when checking the WDT flags. In this case, a dedicated software routine will handle
the error detection in the proper way, which is application dependent. Software
support by means of dedicated error recovery routines is therefore required in order
to ensure reliable behaviour.
Klessydra F03_mini has been coded in VHDL-2008 and implemented
on a Xilinx Artix-7 xc7a35tftg256 device. Here we report the essential results
related to area, speed and fault-tolerance tests.
Table 59.1 provides results for the area usage, compared with the fully TMR-protected
F03a version.
Table 59.2 reports a group of tests that have been executed to compare the performance
of the fault-tolerant F03_mini core and the non-fault-tolerant T03x core.
We can see that the application of the TMR technique in the F03_mini core does
not reduce performance in terms of cycle count.
To verify the proposed fault-tolerance features, we performed several HDL fault-injection
simulations. The tests are based on Tcl scripts which force random bit flips
in each flip-flop inside the core at a rate of up to 18 upset bits/µs. Table 59.3 provides
the results of the fault-injection campaign.
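The idea behind such a campaign can be sketched in software. The Python model below is only an analogy of the Tcl-driven HDL simulation: it injects one random single-bit upset per run (a simplification of the upset rate quoted above) and compares an unprotected register with a TMR-protected one. All names are hypothetical.

```python
import random


def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def inject_campaign(n_runs: int, width: int = 32, seed: int = 0) -> dict:
    """Flip one random bit per run and count corrupted read-backs."""
    rng = random.Random(seed)
    errors = {"unprotected": 0, "tmr": 0}
    for _ in range(n_runs):
        value = rng.getrandbits(width)
        flip = 1 << rng.randrange(width)      # single-bit upset mask
        replica = rng.randrange(3)            # which TMR copy is hit

        # Unprotected register: the upset is read back directly.
        if (value ^ flip) != value:
            errors["unprotected"] += 1

        # TMR register: only one of the three copies is corrupted.
        copies = [value, value, value]
        copies[replica] ^= flip
        if vote(*copies) != value:
            errors["tmr"] += 1
    return errors


print(inject_campaign(1000))
```

With a single upset per run every unprotected read-back is corrupted while the voted value never is; multiple simultaneous upsets, as in the actual campaign, are not modelled here.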
59.5 Conclusions
References
Abstract The Consultative Committee for Space Data Systems File Delivery
Protocol (CFDP) is a protocol designed for the transmission of files in the space environment,
which is characterized by frequent link disconnections and high transmission delays.
This work presents the characterization of the CFDP IP Core included in the European
Space Agency (ESA) IP portfolio in terms of downlink data-rate through a custom
methodology. For the characterization of the acknowledged and unacknowledged transmission
modes, the CFDP IP Core was implemented on Virtex-5 and Virtex-6
Field Programmable Gate Arrays and tested by using the ESA Ground Segment
CFDP reference software, acting as a secondary CFDP entity. The delivered CFDP
packets were encapsulated in User Datagram Protocol (UDP) packets and transmitted
over Ethernet. Wireshark was used to measure the time for a file
transmission. The presented methodology provides a way to estimate the IP Core
maximum transmission throughput and to identify the architectural bottlenecks.
60.1 Introduction
In 2007 the Consultative Committee for Space Data Systems (CCSDS) issued the
CCSDS File Delivery Protocol (CFDP) in response to the need for a single file delivery
protocol able to transmit files in the space environment, which is characterized by frequent
link disconnections, limited bandwidth availability and high transmission delays
[2]. To address the needs of a broad variety of missions, the CFDP protocol allows
different communication protocols to be used as UnitData Transfer (UT) layers
and supports Acknowledged and Unacknowledged single-hop (class 1–class 2) and
multi-hop (class 3–class 4) transmission modes [2].
Meanwhile, the improvements in the resolution of onboard instruments, the increase
in data storage capacity and the availability of high-throughput communication links
have increased the quantity of onboard data that space missions must
transmit to ground [6]. For this reason, research has focused on optimizing the CFDP
protocol to increase the transmission throughput and to reduce the time required for
a file delivery [10]. In view of that, this work presents the characterization, in terms
of downlink data-rate, of the CFDP VHDL Intellectual Property (IP) Core included
in the European Space Agency (ESA) portfolio, through a dedicated methodology.
This IP Core was designed by Braunschweig University [1], which also realized
the first complete prototype implementing the CFDP IP Core on the ML509 Virtex-5
Field Programmable Gate Array (FPGA) development board [7]. Moreover, through
system-level simulations, Braunschweig University provided an estimate of the
maximum transmission throughput of that implementation. A new prototype was
realized on the Virtex-6 ML605 board [8] to verify the design portability and to
measure the performance gain due to the use of a next-generation FPGA.
The presented methodology made it possible to measure the prototypes' downlink
data-rate in class 1 and class 2 modes and to identify the bottlenecks of the
architecture, confirming the results of the simulations performed by Braunschweig
University.
The architecture of the CFDP IP Core, which supports class 1 and class 2 transmission
and reception, is shown in Fig. 60.1.
This IP is realized according to a hardware/software codesign approach. In particular, the
CFDP hardware realizes an accelerator to control and carry out the different CFDP
transactions. The CFDP software is integrated in the Real-Time Executive for Multiprocessor
Systems (RTEMS) [5], and it is organized in a layered structure composed
of three parts: the CFDP Drivers and the CFDP Entity API, which together realize a CFDP entity, and the CFDP
client, which allows different test scenarios to be implemented. Furthermore, the outgoing
and incoming CFDP Protocol Data Units (PDUs) are encapsulated in User Datagram
Protocol (UDP) packets and transmitted over Ethernet. The communication between
60 Estimating the Downlink Data-Rate of a CCSDS File … 515
The set-up shown in Fig. 60.2 was used for the characterization of the CFDP implementations
in terms of the transmission data-rate.
To test the Virtex-5 and Virtex-6 prototypes, the ESA Ground Segment CFDP
software, provided by the European Space Operations Centre (ESOC), was used as
a CFDP receiving entity [3]. The ESA Ground Segment
CFDP software was run on a Personal Computer (PC) and linked to the prototypes
on the FPGAs by using a UDP over Ethernet approach, as specified in Sect. 60.2. To
run the various tests, different CFDP client layers were executed on the LEON3FT
processor, as described in Sect. 60.2, through the GRMON2 [4] software. Finally, Wireshark [9]
was used to observe the outgoing packets exchanged between the two CFDP entities.
Moreover, Wireshark provides the timestamps corresponding to the
516 G. Meoni et al.
Fig. 60.2 Set-up for performance characterization of prototypes implementing the CFDP IP Core
arrivals of the different PDUs, which make it possible to estimate the transmission
data-rate by measuring the total time necessary to relay a file of fixed size.
The time required to deliver a file as a function of the system clock frequency in
class 1 and class 2 modes was measured for both prototypes. To this end,
multiple syntheses of the CFDP IP Core with different system clock frequencies
were carried out for both FPGA families. This approach makes it possible to measure the
dependency of the transmission data-rate on the system clock frequency and to identify
the clock frequency that leads to the maximum data-rate. Furthermore, the
delivery tests were carried out by transmitting files of different sizes,
such as 5 and 50 kB, to measure possible dependencies of the performance on the file
size. As explained in Sect. 60.3.1, the timestamps provided by Wireshark, taken
at the arrival of the PDU packets, were used to estimate the
transmission data-rate. In particular, for each test case, identified by
the parameters (FPGA used, file size, system clock frequency, transmission
class), 10 files were transmitted to obtain a better estimate of the data-rate. For
each file transmitted, the timestamps of the arrival of the first ($T_{F\_F_i}$) and the
last ($T_{L\_F_i}$) FileData PDUs were acquired. The difference ($T_{L\_F_i} - T_{F\_F_i}$) between
these two values corresponds to the time necessary to deliver a file, excluding the
service PDUs, i.e., Metadata PDU, End of File, etc. This information was used
to estimate the prototype throughput $T_{test}$ for a particular test case, together with the
a priori knowledge of the file size $F_{size}$, by using Eq. 60.1:
$$T_{test} = \frac{F_{size}}{\frac{1}{10} \cdot \sum_{i=0}^{9} \left(T_{L\_F_i} - T_{F\_F_i}\right)} \qquad (60.1)$$
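Eq. 60.1 can be evaluated with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, the timestamps are placeholders rather than measurements, and 1 kB is assumed to be 1000 bytes.

```python
def throughput(file_size_bits: float, first_ts: list[float], last_ts: list[float]) -> float:
    """Eq. 60.1: file size divided by the mean FileData delivery time."""
    if len(first_ts) != len(last_ts) or not first_ts:
        raise ValueError("need one (first, last) timestamp pair per file")
    durations = [t_last - t_first for t_first, t_last in zip(first_ts, last_ts)]
    mean_duration = sum(durations) / len(durations)
    return file_size_bits / mean_duration


# Placeholder timestamps in seconds (not measurements): 10 transfers of a 50 kB file.
first = [0.0] * 10
last = [0.02] * 10
print(throughput(50_000 * 8, first, last) / 1e6, "Mb/s")
```

Averaging the ten delivery times before dividing, as in Eq. 60.1, gives a single estimate per test case instead of ten separate ratios.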
60.4 Results
Table 60.1 shows the implementation results of the CFDP IP prototype (including
the LEON3FT processor) on Virtex-5 and Virtex-6 FPGAs in terms of resource utilization
and maximum system clock frequency $f_{CLK_{MAX}}$. The results of the characterization
of the transmission data-rate for the Virtex-6 FPGA are shown in Fig. 60.3. For both
prototypes, in all cases the dependency of the transmission data-rate on the system clock
frequency is well approximated by a linear trend, described by Eq. 60.2:
$$T_{test}(f_{CLK}) = \theta_0 + \theta_1 \cdot f_{CLK} \qquad (60.2)$$
Fig. 60.3 Transmission data-rate/system clock frequency trends for the different test cases on the
Virtex-6 ML605 board
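The text does not state how the coefficients of Eq. 60.2 were obtained. As one possible approach, the sketch below estimates θ0 and θ1 by ordinary least squares; this choice is an assumption, and the data points are synthetic values chosen to lie on a known line, not measurements from the paper.

```python
def fit_line(freqs: list[float], rates: list[float]) -> tuple[float, float]:
    """Ordinary least-squares estimate of (theta0, theta1) in Eq. 60.2."""
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_t = sum(rates) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs)
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(freqs, rates))
    theta1 = sxy / sxx
    theta0 = mean_t - theta1 * mean_f
    return theta0, theta1


# Synthetic points lying exactly on T = 1.0 + 0.25 * f (illustrative, not measured).
freqs_mhz = [20.0, 40.0, 60.0, 80.0]
rates_mbps = [6.0, 11.0, 16.0, 21.0]
theta0, theta1 = fit_line(freqs_mhz, rates_mbps)
print(theta0, theta1)
```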
possible explanation is that the larger the file size, the higher the overhead time
due to the numerous accesses and collisions on the AHB bus, owing to the single AHB
bus topology and to the use of the Mass Memory as temporary storage for
data that must be processed by different hardware units. This hypothesis is
supported by the experiments shown in Fig. 60.4. In particular, Fig. 60.4a shows
that the throughput decreases linearly with growing file size for the same prototype
and system clock frequency. In addition, Fig. 60.4b shows that, when a 50 kB file is
relayed using two different PDU data sizes, i.e., 1024 and 4096 B, the
throughput/system clock trend is lower at all clock frequencies for
the smaller PDU data size. Indeed, a smaller PDU data size requires a higher number of
bus accesses, which increases the overhead time. With a PDU data size of 4096 B,
the maximum transmission throughput for Virtex-5 was 26.66 Mb/s.
60.5 Conclusions
References
61.1 Introduction
echoes, it is possible to investigate moving tissue (e.g. heart walls) or flowing
particles (e.g. blood cells), and obtain significant information about organ dynamics
(echo-Doppler image) [1, 2].
Blood flow investigation of the carotid artery is a common ultrasound exam suitable
for assessing general hemodynamic conditions and monitoring the presence of
dangerous atherosclerotic plaques. It can be a life-saving exam [3]. An expert
sonographer searches for the correct probe position (e.g. transverse probe position)
on the patient's neck by checking the B-mode image. In this condition the carotid
looks like a dark, almost circular structure. Then the sonographer selects the region
of interest (ROI) in the middle of the carotid lumen and switches the echograph to
Echo-Doppler modality to save the flow data.
Recently, several very inexpensive and simplified small scanners have been introduced.
Some of these are simple smartphone add-ons [4]. Despite the evident limitations
of such instrumentation with respect to hospital scanners, these tools can
be precious in developing countries, in points of care located in remote areas, or in
emergency situations. Unfortunately, in addition to the echograph, the presence of an
expert sonographer is necessary for exams like the carotid blood flow investigation.
In this paper we present a method for the automatic detection of the carotid artery
lumen in the live B-mode image. This method can be used for the automated setting
of the ROI in the carotid lumen for the acquisition of flow data. The user (not necessarily
an expert professional) is only required to place the probe on the patient's neck so
that it crosses the artery transversally and to start the automatic procedure.
The method is based on the automatic detection of the position of the carotid artery in a
transverse B-mode image. The image is obtained by positioning the ultrasound linear
probe on the patient’s neck, approximately midway along its length. The probe is rotated roughly
90° with respect to the neck axis. In this way the image plane cuts transversally
the common carotid artery (CCA), which, since blood is almost transparent to
ultrasound, is represented in the image by an anechoic (dark) region with
a roughly circular shape. The surrounding tissue has variable echogenicity and, in
general, appears in lighter gray tones of variable intensity (see Fig. 61.1 for an example of
a carotid B-mode image). Other dark portions are present in the image, in particular
in the deeper region where the echoes are weaker, but their contours and dimensions
are clearly distinguishable from those of the artery.
On this premise, an image-processing algorithm has a good chance of
autonomously detecting the artery. In the proposed method we employed the Hough
transform, a well-established method for detecting curves and shapes in images [5].
In particular, the Circular Hough Transform (CHT) algorithm is adapted for finding
circular shapes. The potential of the method was first investigated in Matlab (The
Mathworks, Natick, MA, USA). A sequence of 25 B-Mode frames is used for each
detection. The image sequence is pre-processed to reduce noise and adjust contrast
61 Automatic Detection of the Carotid Artery Position … 523
Fig. 61.1 B-mode image of the common carotid artery in a healthy volunteer. The image plane
cuts the vessel transversally. The vessel is represented by a dark region of circular shape
and brightness. Figure 61.2 shows the processing steps. For each frame, CHT detects
all the dark circles whose radii are within a physiological range. Their center coordinates
$(x_i, y_i)$ and radii $(r_i)$ are collected in the so-called Dark Circles Matrix (DCM).
For each of the detected dark circles, the brightness values of its
pixels are collected in the RGB Matrix (RGBM). The circle showing the darkest
value, i.e. the minimum brightness value, is selected as the “candidate circle”, and its
center coordinates $(x_{ci}, y_{ci})$ and radius $(r_{ci})$ are saved in the Candidates Matrix (CM).
Once the sequence of 25 B-Mode frames has been processed, the most frequent
$(x_c, y_c, r_c)$ triad occurring in the CM is selected as the final choice.
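The selection logic described above (darkest circle per frame, then most frequent triad over the sequence) can be sketched as follows. The circle detection itself is not reproduced: the circles and their mean brightness values are assumed to come from a CHT routine, and all names and numbers are illustrative.

```python
from collections import Counter

Circle = tuple[int, int, int]  # (x, y, r) in pixels


def pick_candidate(circles: list[Circle], brightness: list[float]) -> Circle:
    """Return the circle with the minimum brightness in one frame."""
    darkest = min(range(len(circles)), key=lambda k: brightness[k])
    return circles[darkest]


def final_choice(candidates: list[Circle]) -> Circle:
    """Return the most frequent (x_c, y_c, r_c) triad in the Candidates Matrix."""
    return Counter(candidates).most_common(1)[0][0]


# Three illustrative frames: (detected circles, mean brightness of each circle).
frames = [
    ([(120, 80, 30), (40, 200, 25)], [12.0, 35.0]),
    ([(120, 80, 30), (42, 198, 26)], [14.0, 30.0]),
    ([(118, 82, 29), (40, 200, 25)], [40.0, 20.0]),
]
cm = [pick_candidate(c, b) for c, b in frames]
print(final_choice(cm))
```

Voting on exact triads is sensitive to small frame-to-frame jitter; whether the original implementation quantizes the coordinates before voting is not stated in the text.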
61.3 Experiments
The method was tested with the experimental scanner ULA-OP [6], managed in
real time by Matlab® and connected to an LA533 linear probe (Esaote S.p.A.). A script
running in Matlab® configured the scanner (see Table 61.1) and started the acquisition
of the B-Mode frame sequence. The sequence was immediately available in Matlab®.
The carotid was detected by the described procedure and the lumen center was sent
back to ULA-OP to set the sample volume [1] position suitable for a possible Doppler
investigation.
An untrained user placed the probe on the volunteer’s neck roughly in the
position where the CCA was expected to be found, without the help of the B-mode
display. The automatic procedure was started and completed about 4 s later. For
reference, Matlab® saved the B-mode image and the estimated CCA position. Figure 61.3
reports an example of automatic segmentation of the carotid. The image shows the
524 R. Matera and S. Ricci
Fig. 61.2 Procedure for locating the carotid position. Candidate circles are located on a sequence
of 25 B-mode images. The most frequent one is then selected as output decision
Fig. 61.3 Carotid position (yellow circle) detected in the pre-processed image sequence. ‘X’
represents the circle center, i.e. the region where the flow analysis should be carried out
result of the pre-processing; although it appears less clear than
the original B-mode image (see Fig. 61.1 as an example), it is more effective when
processed by the detection algorithm. The yellow circle represents the automatically
located position of the carotid. The yellow ‘X’ represents the center of the carotid
lumen (located as the center of the circle) and is passed back to the scanner as the
point on which to focus the flow investigation. In the experiments the carotid position
was located correctly in about 90% of the tests. When it was not located, it was sufficient
for the user to slightly move the probe and retry to obtain a correct result.
Acknowledgements The authors thank Prof. Piero Tortoli for the valuable advice that helped to
improve the paper. This work is part of the AMICO project funded by the National Programs
(PON) of the Italian Ministry of Education, Universities and Research (MIUR): code ARS01_00900
(Decree n. 1989, 26 July 2018).
References
1. Evans DH, McDicken WN (2000) Doppler ultrasound physics, instrumentation and signal
processing. Wiley, Chichester. ISBN: 978-0471970019
2. Urban G, Vergani P, Ghidini A, Tortoli P, Ricci S, Patrizio P, Paidas MJ (2007) State of
the art: non-invasive ultrasound assessment of the uteroplacental circulation. Semin Perinatol
31(4):232–239. https://doi.org/10.1053/j.semperi.2007.06.002
3. Grant E, Benson C, Moneta G et al (2003) Carotid artery stenosis: gray-scale and doppler US
diagnosis—society of radiologists in ultrasound consensus conference. Radiology 229(2):340–
346. https://doi.org/10.1148/radiol.2292030516
4. Huang CC, Lee PY, Chen PY, Liu TY (2012) Design and implementation of a smartphone-
based portable ultrasound pulsed-wave doppler device for blood flow measurement. IEEE Trans
Ultrason Ferroelectr Freq Control 59(1):182–188. https://doi.org/10.1109/tuffc.2012.2171
5. Duda RO, Hart PE (1972) Use of the Hough transformation to detect lines and curves in
pictures. Commun ACM 15(1):11–15. https://doi.org/10.1145/361237.361242
6. Tortoli P, Bassi L, Boni E, Dallai A, Guidi F, Ricci S (2009) ULA-OP: an advanced open
platform for ultrasound research. IEEE Trans Ultrason Ferroelectr Freq Control 56(10):2207–
2216. https://doi.org/10.1109/tuffc.2009.1303
7. Tortoli P, Guidi G, Pignoli P (1993) Transverse Doppler spectral analysis for a correct interpretation
of flow sonograms. Ultrasound Med Biol 19(2):115–121. https://doi.org/10.1016/0301-5629(93)90003-7
62.1 Introduction
The RISC-V open instruction set architecture paved the way for computer architects
to design innovative and capable cores able to execute complex instruction extensions
[1]. The instruction set was designed to support 32-, 64- and 128-bit architectures in
machine (bare-metal), supervisor, or user modes [2], and has available encoding space that allows
processor designers to add their own custom instructions, for educational,
research or industrial application purposes.
The Klessydra family of RISC-V processor cores has been designed to support
domain-specific features, while fitting in the Pulpino microcontroller platform
[3]. Here we present the Klessydra T13 architecture, which is an extension of the
Klessydra T03 version [4, 5, 8] in the domain of computation-intensive embedded
microcontrollers.
This study presents the features of an efficient accelerator named Special Purpose
Mathematical Unit (SPMU), which facilitates vector execution on an instruction-level basis.
It then shows how this SPMU can be scaled to run fast convolutions on embedded
systems, and identifies the most convenient data-level parallelism setting, i.e. the one that
brings out the most acceleration while still maintaining a relatively small area occupation.
Section 62.2 explains the architecture of the accelerator. Section 62.3 shows the
synthesis results on the FPGA, and the maximum speed of each layout generated.
Section 62.4 shows the performance at the instruction level and the acceleration
contributed by the different implementations of the SPMU, and then shows
how the SPMU fared when executing a set of convolutions for which it was configured
with different data-level parallelism settings. The last section determines which
configuration gives a good performance boost while still maintaining a relatively
small area occupation.
The architecture of the SPMU is divided into two main sub-systems: the Special
Purpose Engines (SPE) and the Scratchpad Memory Interface (SPI), as shown in
the block diagram in Fig. 62.2. The SPMU can be configured with many different
parameters.
62 Efficient Mathematical Accelerator Design Coupled … 531
[Figure labels: core pipeline (Prg Mem, PC Updater, Fetch, Decode, Regfile, Exec, CSR, WB, Debug, Data Mem) with SPE and SPI; SPE Control/Mapping, Bank0 … BankN, functional units Add/Sub, Shft, Mul, Accum, Relu]
The T13x core comes with multiple configuration parameters to generate a set of
different designs:
• The first sets the number of scratchpad memories (SPMs); this cannot be set
lower than two, since a two-source-operand instruction requires reading from
two different SPMs simultaneously.
• The second sets the SPM size: it can be set to any number that is a power of
two. The total size of the SPM is divided among the banks in the SPM.
532 A. Cheikh et al.
• The third parameter changes the SIMD capability of the engine. For example, if
this value is set to 4, the number of functional units is multiplied by 4, and each
SPM has four banks of 32-bit words.
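The constraints on the three parameters can be summarized in a small validation sketch. The class and field names below are hypothetical (they are not the generics of the actual core), and expressing the SPM size in bytes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpmuConfig:
    """Illustrative container; field names are not the core's generics."""
    num_spm: int        # number of scratchpad memories
    spm_size: int       # size of each SPM in bytes (unit assumed)
    simd: int           # SIMD factor (banks of 32-bit words per SPM)

    def __post_init__(self) -> None:
        if self.num_spm < 2:
            raise ValueError("two-source instructions need at least two SPMs")
        if self.spm_size <= 0 or self.spm_size & (self.spm_size - 1):
            raise ValueError("SPM size must be a power of two")
        if self.simd < 1:
            raise ValueError("SIMD factor must be at least 1")

    @property
    def bank_size(self) -> int:
        # The SPM capacity is split evenly among its banks.
        return self.spm_size // self.simd

    @property
    def line_bytes(self) -> int:
        # One line holds one 32-bit word per bank.
        return 4 * self.simd


cfg = SpmuConfig(num_spm=3, spm_size=1024, simd=4)
print(cfg.bank_size, cfg.line_bytes)
```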
This study does not explore the best number of SPMs per core, since
this setting is application dependent, and every user might use the SPM space
differently. However, note that the more the user increases the number of SPMs, the
more complex the crossbar connecting the SPMs to the SPI becomes.
The SPE is the engine that executes the special-purpose instructions of the “K”
extension. It is divided into multiple subsystems.
The exception handler is a controller, part of the initialization phase,
that checks for possible exceptions in the very first cycle of the execution of
a custom instruction from the “K” extension. The reason for checking in the first
cycle is that, if an exception is encountered, the core can recover the state
of the processor precisely as it was before the exception occurred. After the first
cycle, another instruction might be issued to execute in parallel with the accelerator,
and the program counter would change its value.
The following is a list of what might trigger an exception:
1. Out-of-bound SPM access: one of the pointers to a data element
points to an address not belonging to any of the SPMs.
2. Dual SPM read access: an SPM has one read port, so if the two operands point
to the same SPM, an exception is raised.
3. Overflow on data read or write: this happens when the SPM pointer plus the vector
size exceeds the address range of the SPM being indexed. This overflow exception
only traps when the operand being indexed is used as a vector, and not as a scalar.
4. Misaligned access: SPMs are 32-bit word aligned, and any misaligned access
triggers this exception.
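The four conditions above can be expressed as a simple checker. This is a sketch under assumptions: the SPMs are taken to be mapped back to back from address 0 (the real address map is not given in the text), only the two source operands are checked, and the function and exception names are hypothetical.

```python
WORD_BYTES = 4


def check_exceptions(spm_size: int, num_spm: int, src_a: int, src_b: int,
                     vec_bytes: int, b_is_vector: bool = True) -> list[str]:
    """Return the exceptions a two-source vector instruction would raise.

    SPMs are assumed to be mapped back to back from address 0, each
    `spm_size` bytes long; the real address map is not given in the text.
    """
    found = []
    total = spm_size * num_spm
    # 1. Out-of-bound access: a pointer outside every SPM.
    if not (0 <= src_a < total and 0 <= src_b < total):
        return ["out_of_bound"]
    # 2. Dual read: both operands live in the same single-read-port SPM.
    if src_a // spm_size == src_b // spm_size:
        found.append("dual_read")
    # 3. Overflow: pointer plus vector size runs past the end of its SPM.
    operands = [(src_a, True), (src_b, b_is_vector)]
    for ptr, is_vector in operands:
        if is_vector and (ptr % spm_size) + vec_bytes > spm_size:
            found.append("overflow")
            break
    # 4. Misaligned access: pointers must be 32-bit word aligned.
    if src_a % WORD_BYTES or src_b % WORD_BYTES:
        found.append("misaligned")
    return found


print(check_exceptions(256, 2, 224, 256, 64))
```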
The SPE initialization block configures the functional units correctly in order to
execute the current instruction in flight. Examples of such configurations
are: setting the FU controls for the data type to be computed on, such as chars,
shorts or ints; transforming the input operands
into their two’s complement; or configuring the outputs to be either
sign extended or zero extended.
In the execute state of the SPE, the hardware loops start incrementing the vector
addresses to fetch the next elements, and decrementing the vector length. When the
vector length becomes zero, the hardware loops stop, and the instruction is considered done.
A masking vector is created depending on the number of elements left, such that if
the elements left occupy fewer bytes than are processed in one cycle, the
mask disables the upper bytes of the fetched elements. This is essential when
elements fetched get accumulated. In this case, we need to avoid accumulating data
not belonging to the instruction in order to get a correct accumulation result.
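The interaction between the hardware loop and the tail mask can be illustrated with a software model of an 8-bit accumulation. It is a sketch only: one 32-bit word per iteration is assumed (SIMD set to 1), the loop iterations stand in for cycles without modelling pipeline latency, and the names are hypothetical.

```python
BYTES_PER_CYCLE = 4  # one 32-bit word per cycle, i.e. SIMD set to 1 (assumed)


def accumulate_bytes(memory: bytes, start: int, length: int) -> tuple[int, int]:
    """Sum `length` unsigned 8-bit elements; return (sum, loop iterations)."""
    total = 0
    cycles = 0
    addr = start
    remaining = length
    while remaining > 0:                       # hardware loop: stops at zero
        word = memory[addr:addr + BYTES_PER_CYCLE]
        valid = min(remaining, BYTES_PER_CYCLE)
        # Mask: only the lowest `valid` bytes of the fetched word are enabled.
        mask = [k < valid for k in range(len(word))]
        total += sum(b for b, en in zip(word, mask) if en)
        addr += BYTES_PER_CYCLE                # increment the vector address
        remaining -= valid                     # decrement the vector length
        cycles += 1
    return total, cycles


buf = bytes([1, 2, 3, 4, 5, 6, 7, 8])
print(accumulate_bytes(buf, 0, 6))
```

Without the mask, the second iteration would also add bytes 7 and 8, which do not belong to the six-element vector.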
The fetched input operands go into a mapping unit, which maps the inputs to
their corresponding functional units. The operands can be scalars or vectors, and they
can be fetched from the SPMs or the register file. The outputs of the functional units
are connected back to the mapping unit, through which they are written back to the
SPMs.
A control unit in the SPE controls the fetching of the inputs and the writing of
the results, as well as the halting of the vector processor in case the source SPMs are being
accessed by the load-store unit simultaneously. When the SPE gets a halt signal,
all the data in the pipes maintain their state, and the hardware loops stop
counting until the halt signal returns to zero; the SPE then resumes from its
previous state.
The SPE has five different functional units (FUs). All the units work with different
data widths (8-bit, 16-bit, 32-bit). Three of the FUs work in partial mode: the
adder, the shifter, and the multiplier. The partial FUs increase the parallelism for smaller
data-width elements while maintaining a small area occupation. Table 62.1 shows
how many operations are performed in one cycle by every FU and for each data type when
the SIMD parameter is configured to be 1. Each step to a larger SIMD configuration doubles
the number of parallel operations on all the functional units.
The partial adder, as seen in Fig. 62.3, is a set of four 8-bit adders cascaded
together. To produce 8-bit sums, no carries are propagated between the partial sums.
For 16-bit additions, only the first and the third adders are allowed to propagate their
carries, while for the 32-bit sum all carries are allowed to propagate. However,
the adders are split into two pipeline stages, so the carry from the lower 16 bits goes to
the upper 16 bits through a register and not a wire.
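The carry-gating scheme of the partial adder can be modelled in a few lines. The sketch below is a behavioural illustration with a hypothetical function name; it does not model the pipeline register between the lower and upper 16 bits.

```python
def partial_add(a: int, b: int, width: int) -> int:
    """Add two 32-bit words as 4x8-bit, 2x16-bit or 1x32-bit lanes."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    result = 0
    carry = 0
    for k in range(4):                          # four cascaded 8-bit adders
        byte_a = (a >> (8 * k)) & 0xFF
        byte_b = (b >> (8 * k)) & 0xFF
        s = byte_a + byte_b + carry
        result |= (s & 0xFF) << (8 * k)
        # Carry gating: adder k forwards its carry only inside a lane.
        last_in_lane = ((k + 1) * 8) % width == 0
        carry = 0 if last_in_lane else s >> 8
    return result


for w in (8, 16, 32):
    print(w, hex(partial_add(0xFFFFFFFF, 0x00000001, w)))
```

For 16-bit lanes only adders 0 and 2 forward their carries, which matches the "first and third adders" rule stated above.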
For the 32-bit multiplier, the partial multiplication structure is based on four 16-bit
multipliers, according to the following implementation:
$$A_{31-0} * B_{31-0} = \left((A_{31-16} \ll 16) + A_{15-0}\right) * \left((B_{31-16} \ll 16) + B_{15-0}\right)$$
This method can generate two 8-bit or two 16-bit MULs per cycle, or one 32-bit
MUL per cycle. The multiplier was not divided further to perform 8-bit partial multiplications
because one DSP slice is utilized in the FPGA whether an 8-bit or 16-bit
multiplication is performed. So, for vector operations using multipliers, we only get
twice the speed-up for 8-bit data, and not four times as in the case of the partial
adders.
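The decomposition of a 32-bit product into four 16-bit products can be checked with a short model. The sketch handles unsigned operands only and uses hypothetical function names; how the hardware treats signed values is not covered here.

```python
MASK16 = 0xFFFF


def mul32_from_16(a: int, b: int) -> int:
    """Unsigned 32x32 multiply built from four 16x16 partial products."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    # Four 16-bit multiplications, one per 16-bit multiplier.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo
    # Recombine: ((a_hi << 16) + a_lo) * ((b_hi << 16) + b_lo).
    return (hh << 32) + ((hl + lh) << 16) + ll


def mul16_pair(a: int, b: int) -> tuple[int, int]:
    """Two independent 16-bit products from the packed halves of a and b."""
    return (a & MASK16) * (b & MASK16), (a >> 16) * (b >> 16)


a, b = 0x12345678, 0x9ABCDEF0
print(mul32_from_16(a, b) == a * b)
print(mul16_pair(0x00030002, 0x00050004))
```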
The partial shifter in the SPE works in the opposite manner (Fig. 62.4). One
32-bit logical right shifter shifts the input operands and computes one 32-bit shifted
output. If the data width is 16 bits, it executes as follows: the two 16-bit data items
go into the right shifter; the output of the shifter is sent to the next stage,
where the lower bits of the upper 16-bit input that were shifted into the upper bits
of the lower 16-bit input are masked with zeros if the shift is logical, or replaced
by the sign extension if the shift is arithmetic. A similar approach is applied for 8-bit data
types.
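The two-stage behaviour of the partial shifter for 16-bit data can be sketched as follows. This is a behavioural illustration with a hypothetical function name: stage 1 is a single 32-bit logical right shift, stage 2 masks or sign-extends each lane.

```python
def partial_shift_right16(word: int, shamt: int, arithmetic: bool = False) -> int:
    """Right-shift two packed 16-bit lanes with one 32-bit logical shifter."""
    if not 0 <= shamt < 16:
        raise ValueError("shift amount must be in 0..15")
    shifted = (word & 0xFFFFFFFF) >> shamt          # stage 1: single 32-bit shift
    keep = 0xFFFF >> shamt                          # bits that belong to each lane
    fill = 0xFFFF ^ keep                            # top `shamt` bits of each lane
    out = 0
    for lane in range(2):                           # stage 2: per-lane fix-up
        bits = (shifted >> (16 * lane)) & keep      # mask out bits slid in from above
        sign = (word >> (16 * lane + 15)) & 1
        if arithmetic and sign:
            bits |= fill                            # sign-extend instead of zero-fill
        out |= bits << (16 * lane)
    return out


print(hex(partial_shift_right16(0x80010004, 2)))
print(hex(partial_shift_right16(0x80010004, 2, arithmetic=True)))
```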
The remaining two functional units are a 2-stage accumulator, which accumulates
an input vector source into a scalar output, and a ReLU unit that rectifies all
negative vector elements to zero.
The engine is interfaced with a set of SPMs. Each SPM has a read port and a write port,
and every SPM line has a set of banks that each hold a 32-bit word. An SPM read or write
access fetches or writes an entire line in one cycle. If the fetch pointer does not
point to the beginning of a line, the data are fetched from the line currently
being indexed and from the next line, thus still fetching one complete
line per cycle.
Misaligned fetches go into a read-rotator circuit, which makes it appear as if the fetch
started from the beginning of the line. In this manner operand_a[i] is always aligned
with operand_b[i] and goes to the same functional unit. Without rotation, misaligned
accesses might send operand_a[i] and operand_b[i+2] to the same functional
unit, producing erroneous outputs. During the result write, the result
is rotated back to point to the correct bank indicated by the write address.
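The line fetch with read rotation can be modelled as below. The sketch assumes that the SPM is a list of lines with one word per bank and that a word index addresses it; the function name is hypothetical and a fetch that would run past the last line is not handled.

```python
def fetch_line(spm: list[list[int]], word_ptr: int) -> list[int]:
    """Fetch one full line of words starting at any word index.

    `spm` is a list of lines, each holding one 32-bit word per bank.
    """
    banks = len(spm[0])
    line, offset = divmod(word_ptr, banks)
    raw = []
    for bank in range(banks):
        # Banks below the offset are read from the next line, the rest from this one.
        src_line = line + 1 if bank < offset else line
        raw.append(spm[src_line][bank])
    # Read rotator: rotate left by `offset` so element 0 sits in lane 0.
    return raw[offset:] + raw[:offset]


spm = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(fetch_line(spm, 2))
```

Each bank is still read exactly once per fetch, which is what allows a complete line to be delivered in one cycle even when the pointer is not at the start of a line.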
The SPI has a serialized access grant unit, in which the instruction that comes first
in program order locks either the read or the write access of a certain scratchpad.
Since T13 is an in-order processor, data hazards never occur with the
serialized access grant.
The T13 core was synthesized with different values of the SIMD
parameter. Synthesis results were generated by Vivado 2018.2. A clock constraint of
1 ns was set in order to compel Vivado to generate the fastest netlist possible
for each configuration. The Genesys 2 was our target board for implementation
[7]. Table 62.2 shows the element utilization for each SIMD configuration, while
Fig. 62.5 shows the maximum frequency of each configuration. The LUT
utilization almost doubles when going from SIMD 1 to SIMD 8, and the number of
DSP and BRAM slices grows by a factor of four or more. Looking at the maximum operating
frequency of each generated layout, we see that the maximum clock frequency of
each configuration lies in the same range, from 140 to 150 MHz; the
overhead of the extra SIMD lanes therefore did not affect the net delay by much.
In order to benchmark the SPMU, a set of simple arithmetic tests was run to see
which approaches bring the biggest performance boosts. Figure 62.6 shows the
number of cycles taken to run an arithmetic operation without using the SPMU, which
is used as a reference for comparisons. All data types take the same time to execute
when the accelerator is not used. Figure 62.7 shows the speed when using the SPMU
without any SIMD capabilities, and with the fetching of the next element done by
software loops instead of hardware loops. In this configuration, the speed-up only comes
from having burst loads and stores to the SPM, and from the low latency of the SPM, so
the superscalar execution can present significant boosts for big vectors, while for
very small vectors it can be slower than the non-accelerated approach.
Figure 62.8 shows the impact of hardware loops, with a speed boost of over 200% for
all vector sizes. Going from SIMD 1 to SIMD 4, as shown in Fig. 62.9, boosts the
speed by a small margin for big vectors and by a barely detectable margin for small
vectors. The SIMD boost is better seen when running more complex tests.
We ran a set of matrix convolutions, as shown in Table 62.3, with matrix sizes ranging
from 4 × 4 to 32 × 32. The SPMU, no matter what configuration it is set to, cannot
deliver good performance when the matrix is 4 × 4. That is probably due to two
reasons. First, every custom instruction takes from six to ten cycles of latency to
process one scratchpad data line, and every additional line of the vector fetched
from the SPM adds one extra cycle of latency, whereas in the normal execute stage
T13 takes from one to three cycles to execute an instruction. Second, the SPMU
instruction operands reference their data values indirectly, while the normal RISC-V
instructions reference their data directly. This indirect referencing requires one to
two cycles to create each pointer, plus the interleaving of two threads in every
pointer-creating instruction. So, to create the three pointers of the source and
destination operands, the SPMU adds a
significant time overhead if the vectors are small.
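The latency figures quoted above can be combined into a rough lower and upper bound for one SPMU instruction. The following is only a sketch under stated assumptions: the function name is hypothetical, the thread-interleaving penalty is ignored, and the three pointers are assumed to be created once per instruction.

```python
def spmu_latency_bounds(num_lines, num_pointers=3):
    """Return (best, worst) cycle counts for one SPMU instruction.

    Assumptions (taken from the figures quoted in the text, not measured):
    - 6 to 10 cycles to process the first scratchpad line,
    - 1 extra cycle for every additional line,
    - 1 to 2 cycles to create each operand pointer.
    The interleaving of the two threads is NOT modeled.
    """
    if num_lines < 1:
        raise ValueError("a vector occupies at least one scratchpad line")
    extra_lines = num_lines - 1
    best = num_pointers * 1 + 6 + extra_lines
    worst = num_pointers * 2 + 10 + extra_lines
    return best, worst


if __name__ == "__main__":
    for lines in (1, 4, 32):
        print(lines, spmu_latency_bounds(lines))
```

With a fixed cost of roughly 9 to 16 cycles before the first result, a vector that fits in a single line has little work over which to amortize the setup, which is consistent with the poor 4 × 4 results reported above.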
As the matrix size grows, the acceleration becomes more apparent: for 32 × 32
convolutions we get more than double the speed just by using hardware loops and the
low-latency SPMs, while higher-order SIMD configurations, which add data-level
parallelism, reach a speed boost of more than four times.
62.5 Conclusions
Our tests show that data-level parallelism was the smallest contributor to the speed
boost of the accelerator in the basic tests; however, its significance was more
apparent for the relatively big applications. The hardware loops in fact showed the
greatest speed improvements while adding only very little area. The cycle counts of
the convolution test show that the SIMD 8 configuration saturates in performance
boost while its area utilization keeps growing linearly. So, for our current design,
if the matrix sizes do not exceed 32 × 32, we recommend setting the core to SIMD 2 or
SIMD 4 in order to get a good boost and maintain a small layout. Finally, we suggest
executing small vector calculations of fewer than four elements without the SPMU,
while vectors bigger than four elements should be processed by the SPMU.
62 Efficient Mathematical Accelerator Design Coupled … 539
References
1. Waterman A, Asanovic K (eds) (2017) The RISC-V instruction set manual—volume I: User-
Level ISA—Document Version 2.2, May 2017. https://riscv.org/specifications/
2. Waterman A, Asanovic K (eds) (2017) The RISC-V instruction set manual—volume II:
Privileged ISA—Document Version 1.10, May 2017. https://riscv.org/specifications/
3. Traber A, Zaruba F, Stucki S, Pullini A, Haugou G, Flamand E, Gurkaynak FK, Benini L (2016)
PULPino: a small single-core RISC-V SoC. In: 3rd RISCV workshop
4. Olivieri M, Cheikh A, Cerutti G, Mastrandrea A, Menichelli F (2017) Investigation on the
optimal pipeline organization in RISC-V multi-threaded soft processor cores. In: Proceedings
of 2017 new generation of CAS (NGCAS). IEEE, pp 45–48
5. Cheikh A, Cerutti G, Mastrandrea A, Menichelli F, Olivieri M (2017) The microarchitecture of
a multi-threaded RISC-V compliant processing core family for IoT end-nodes. In: International
conference on applications in electronics pervading industry, environment and society. Springer,
Cham, pp 89–97
6. Bechara C, Berhault A, Ventroux N, Chevobbe S, Lhuillier Y, David R, Etiemble D (2011) A
small footprint interleaved multithreaded processor for embedded systems. In: 2011 18th IEEE
international conference on electronics, circuits, and systems. IEEE, pp 685–690
7. Genesys 2 Reference Manual by Digilent. https://reference.digilentinc.com/reference/
programmable-logic/genesys-2/reference-manual
8. Blasi L, Vigli F, Cheikh A, Mastrandrea A, Menichelli F, Olivieri M (2019) A RISC-V Fault-
Tolerant microcontroller core architecture based on a hardware thread full-weak protection
and a thread-controlled watch-dog timer. In: Applications in electronics pervading industry,
environment and society. ApplePies
Chapter 63
AXI4LV: Design and Implementation
of a Full-Speed AMBA AXI4-Burst DMA
Interface for LabVIEW FPGA
Luca Dello Sterpaio, Antonino Marino, Pietro Nannipieri,
Gianmarco Dinelli and Luca Fanucci
63.1 Introduction
This work aims to design an IP core bridge module to interface any LabVIEW FPGA
(LVFPGA) Virtual Instrument (VI) directly with an AMBA 4 AXI interconnect
fabric. This will allow hardware designers to include complex architectures based on
this kind of bus into LVFPGA projects or, vice versa, LVFPGA VIs as part of a
L. Dello Sterpaio · A. Marino · P. Nannipieri · G. Dinelli · L. Fanucci
Department of Information Engineering, University of Pisa, Pisa, Italy
© Springer Nature Switzerland AG 2020 541
S. Saponara and A. De Gloria (eds.), Applications in Electronics Pervading
Industry, Environment and Society, Lecture Notes in Electrical Engineering 627,
https://doi.org/10.1007/978-3-030-37277-4_63
542 L. Dello Sterpaio et al.
63.2 Architecture
Fig. 63.2 A detailed block diagram of the AXI4LV IP design, illustrating internal hierarchy and
functional blocks. Blocks labeled in the figure: LV-to-AXI DMA Bridge (H2T/W) with H2T FIFO
and LV 3W to H2T IF; AXI-to-LV DMA Bridge (T2H/R) with T2H FIFO and LV 3W to T2H IF;
AXI-Burst Master IF (data bus); H2T/W CTRL and T2H/R CTRL; CTRL Register File with
AXI-Lite Slave IF (CFG bus)
Fig. 63.3 State diagram of the control finite state machine for both H2T/W and T2H/R transactions.
States: CTRL_IDLE, CTRL_WORK, CTRL_DONE, CTRL_ERROR. Transition labels:
(REG_ENABLE == 1) && (SIZE_AVAILABLE >= SIZE_TODO); (M_USR_ACK == 1); (REG_ENABLE == 0); ERROR
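The control FSM of Fig. 63.3 could be modeled as below. This is a sketch only: the state names and condition labels come from the figure, but the exact arcs are not recoverable from the text, so the mapping of each condition to a transition (including the return to CTRL_IDLE when REG_ENABLE is cleared) is an assumption, and `next_state` is a hypothetical helper.

```python
from enum import Enum


class State(Enum):
    CTRL_IDLE = "CTRL_IDLE"
    CTRL_WORK = "CTRL_WORK"
    CTRL_DONE = "CTRL_DONE"
    CTRL_ERROR = "CTRL_ERROR"


def next_state(state, reg_enable, size_available, size_todo, m_usr_ack, error):
    """One step of the control FSM (arc placement is assumed, see lead-in)."""
    if state is State.CTRL_IDLE:
        # Start only when enabled and enough data/space is available.
        if reg_enable == 1 and size_available >= size_todo:
            return State.CTRL_WORK
        return state
    if state is State.CTRL_WORK:
        if error:
            return State.CTRL_ERROR
        # Assumed: M_USR_ACK is the end-of-operation feedback from the DMA.
        if m_usr_ack == 1:
            return State.CTRL_DONE
        return state
    # CTRL_DONE and CTRL_ERROR: assumed to wait until the enable bit is cleared.
    if reg_enable == 0:
        return State.CTRL_IDLE
    return state


if __name__ == "__main__":
    s = State.CTRL_IDLE
    s = next_state(s, 1, 64, 32, 0, False)
    print(s)
    s = next_state(s, 1, 64, 32, 1, False)
    print(s)
    s = next_state(s, 0, 64, 32, 0, False)
    print(s)
```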
This submodule implements an AMBA AXI4-Lite slave interface to read and write
32-bit values of the control register file. Table 63.1 shows the structure of the register
file.
The address space is 32 bits wide and follows the hexadecimal pattern
0xYYYYzRRR. The four most significant hexadecimal digits (YYYY) are the base address
of the AXI4LV instance. The H2T and T2H register subsets are separated by a
0x00001000 offset (digit z). The least significant digits (RRR) address the register
of interest. Since AXI4 is byte addressed and data words are 32 bits wide, consecutive
registers are located at an offset of 4 from one another.
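As an illustration of the 0xYYYYzRRR pattern, the sketch below splits an address into its fields. The function name and the example base address 0x43C0 are hypothetical, and which value of the z digit selects the H2T or the T2H subset is not stated in the text, so the code only returns the raw digit.

```python
def decode_address(addr):
    """Split a 32-bit AXI4-Lite address following the 0xYYYYzRRR pattern."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    if addr % 4 != 0:
        raise ValueError("registers are 32-bit words, so addresses are 4-byte aligned")
    offset = addr & 0xFFF
    return {
        "base": addr >> 16,              # YYYY: base address of the instance
        "subset": (addr >> 12) & 0xF,    # z: selects the H2T or T2H register subset
        "offset": offset,                # RRR: byte offset inside the subset
        "index": offset // 4,            # register number (4 bytes per register)
    }


if __name__ == "__main__":
    # 0x43C0 is a made-up base address used only for this example.
    print(decode_address(0x43C01008))
```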
DMA modules are the blocks that carry out data transactions and perform stream flow
control. The H2T/Write DMA can issue requests independently on the AW, W and B
channels; the T2H/Read DMA can issue requests on the AR and R channels.
When enabled by the control FSM, this module internally stores the configuration
register values of interest and then issues AXI request(s) accordingly. Once
operations end, a feedback signal is provided to its control FSM unit and, if enabled
by the user, an external interrupt pulse is generated.
AMBA AXI4-Burst transactions are carried out by requesting read or write access
to the target slave on the address channel, starting from the provided base address
and for the requested length. The actual burst length may differ: it is the smallest
among (1) the requested amount of data, (2) the maximum allowed burst length and
(3) the distance from the next 4k-boundary address [5]. If needed, the AXI4LV IP
automatically splits the whole requested amount of data into multiple AXI
transactions, taking all of the above into account, until the requested data packet
is completely transferred. Moreover, the module supports overlapping (an optional
AXI4 feature) for maximum parallelization and pipelining (i.e. AXI4LV instances can
immediately issue another transaction on the address channel before the previous
request is completed on the data channel).
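The burst-length rule above (the minimum of the remaining data, the maximum burst length and the distance to the next 4k boundary) can be sketched as follows. This is a behavioral model, not the IP's HDL: the function name is hypothetical, and a 32-bit data bus with a 16-beat maximum burst is assumed, matching the parameters used in the synthesis table.

```python
def split_into_bursts(base_addr, total_bytes, bytes_per_beat=4, max_burst_beats=16):
    """Split a transfer into AXI bursts that never cross a 4 KiB boundary.

    Returns a list of (start_address, length_in_bytes) tuples.
    Assumes base_addr and total_bytes are multiples of bytes_per_beat.
    """
    if base_addr % bytes_per_beat or total_bytes % bytes_per_beat:
        raise ValueError("address and size must be aligned to the beat size")
    bursts = []
    addr = base_addr
    remaining = total_bytes
    while remaining > 0:
        to_boundary = 4096 - (addr % 4096)          # (3) distance to next 4k boundary
        length = min(remaining,                     # (1) data still requested
                     max_burst_beats * bytes_per_beat,  # (2) maximum burst length
                     to_boundary)
        bursts.append((addr, length))
        addr += length
        remaining -= length
    return bursts


if __name__ == "__main__":
    for start, length in split_into_bursts(0x0FE0, 128):
        print(hex(start), length)
```

In this example the first burst is shortened to 32 bytes so that it ends exactly at address 0x1000, and the rest of the packet is served by further bursts.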
Data is simply forwarded between the two interfaces, and flow control can be
applied either on the AXI bus signals or on the LabVIEW 3-Wire protocol, in
order to keep the streams on both sides stable and thus not lose any data being
transferred. On the AXI data channels (W and R), flow control is performed by driving
the ready or valid signals high or low, exactly as for the dready and dvalid signals
on the LV 3-Wire IF.
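The valid/ready style of flow control can be illustrated with a toy cycle-based model: a beat moves only in a cycle where the source asserts valid and the sink asserts ready, so back-pressure delays data but never drops it. The function and variable names are hypothetical and the model is not tied to either the AXI or the LV 3-Wire signal timing details.

```python
from collections import deque


def simulate_handshake(data, ready_pattern):
    """Transfer beats only in cycles where valid and ready are both high.

    data          : beats the source wants to send, in order
    ready_pattern : one boolean per clock cycle, the sink's ready signal
    Returns (received, pending) where received is a list of (cycle, beat).
    """
    pending = deque(data)
    received = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)          # source holds valid while it has data
        if valid and ready:
            received.append((cycle, pending.popleft()))
        # otherwise the source keeps the same beat for the next cycle
    return received, list(pending)


if __name__ == "__main__":
    print(simulate_handshake([1, 2, 3], [True, False, False, True, True]))
```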
In this section, synthesis results are presented. The AXI4LV sources are
technology-independent code, yet the results refer to the Xilinx 6-Series and
7-Series devices on which National Instruments PXI FPGA modules are based.
Frequency analysis returns 156 MHz as the highest possible clock. Already at a
100 MHz clock, transfers exhibit no data-rate degradation. The introduced latency
is just 6 clock ticks between the request and the actual start of the AXI-Burst
transaction: considering that this bridge will be used to move large amounts of data,
it is negligible. Table 63.2 reports the IP synthesis results, obtained with ISE (for
6-Series devices) and Vivado (for 7-Series devices), the reference tools for each
family.
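To see why a 6-cycle start-up latency is negligible for large transfers, the figures above can be put into numbers. The sketch assumes one 32-bit word is moved per clock cycle once the burst has started (the "full-speed" case); the function names are hypothetical.

```python
def latency_ns(clock_hz, latency_cycles=6):
    """Start-up latency in nanoseconds for a given clock frequency."""
    return latency_cycles / clock_hz * 1e9


def latency_overhead(num_words, latency_cycles=6):
    """Fraction of the total transfer time spent in start-up latency,
    assuming one word per clock cycle after the burst starts."""
    return latency_cycles / (latency_cycles + num_words)


if __name__ == "__main__":
    print(latency_ns(100e6))                   # about 60 ns at 100 MHz
    print(latency_overhead(16))                # a single 16-beat burst
    print(latency_overhead(1024 * 1024 // 4))  # a 1 MiB transfer of 32-bit words
```

For a single 16-beat burst the start-up cost is a visible share of the total, while for a megabyte-sized transfer it falls well below 0.01%.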
Table 63.2 Synthesis results on target technologies of interest
Device family | Addr width | Data width | Max burst | Slices | LUTs (TOT) | LUT2 | LUT3 | LUT4 | LUT5 | LUT6 | Registers
Virtex-6 | 32 | 32 | 16 | 522 (0.44%) | 1540 (0.32%) | 282 | 175 | 382 | 267 | 434 | 725 (0.08%)
Kintex-7 | 32 | 32 | 16 | 497 (0.67%) | 1295 (0.43%) | 237 | 146 | 319 | 229 | 364 | 737 (0.12%)
Addr width, Data width and Max burst are the AXI parameters; the remaining columns report resource utilization.
Percent values refer to the largest device model in the product family (i.e., XC6VLX760 and XC7K480T chips)
63.4 Conclusions
The presented novel IP efficiently exchanges large amounts of data between any
LabVIEW application on a host PXI controller PC and any memory-mapped AXI target
implemented on a PXI FPGA peripheral module. This IP is essential for implementing
AXI-based architectures on PXI FPGAs. It supports the most advanced features
introduced in the latest version of the AMBA AXI4 standard, such as overlapping
transactions, wider bus widths and 256-element bursts. The provided synthesis results
target the main reference technologies and highlight the overall very small footprint
in terms of absolute FPGA resource cost.
References
1. AMBA AXI and ACE Protocol Specifications. Issue D, ARM Holdings, Cambridge, Oct 2011,
IHI 0022D
2. National Instruments, “Importing External IP Into LabVIEW FPGA,” LabVIEW FPGA 2014
Support. Accessed on 05/2019. Available at www.ni.com/tutorial/7444/en/
3. National Instruments, “Using FPGA FIFOs (FPGA Module),” LabVIEW FPGA 2010 Support.
Accessed on 05/2019. Available at
http://zone.ni.com/reference/en-XX/help/371599F-01/lvfpgahelp/creating_fpga_fifos/
4. National Instruments, “How DMA Transfers Work (FPGA Module),” LabVIEW FPGA 2012
Support. Accessed on 05/2019. Available at
http://zone.ni.com/reference/en-XX/help/371599H-01/lvfpgaconcepts/fpga_dma_how_it_works/
5. Dello Sterpaio L et al (2019) Exploiting LabVIEW FPGA socketed CLIP to design and
implement soft-core based complex digital architectures on PXI FPGA target boards. In: 2019
International conference on synthesis, modeling, analysis and simulation methods and
applications to circuit design (SMACD), Lausanne (CH), July 2019
Chapter 64
3D-HEVC Neighboring Block Based
Disparity Vector (NBDV) Derivation
Architecture: Complexity
and Implementation Analysis
Abstract HEVC (High Efficiency Video Coding), the state-of-the-art video coding
standard, has a 3D extension known as 3D-HEVC, which was established by JCT-3V.
The current design of 3D-HEVC integrates various tools to exploit the redundancies
of the 3D video signal. In 3D-HEVC, the neighboring block disparity vector
(NBDV) mode is used to replace the original predicted depth map (PDM) for
inter-view motion prediction. A newly estimated disparity vector, the depth oriented
neighboring block disparity vector (DoNBDV), enhances the accuracy of the NBDV by
utilizing the coded depth map. In this paper, the NBDV and DoNBDV architectures are
analyzed in terms of performance, complexity, and other design considerations. It is
concluded that NBDV and DoNBDV for 3D-HEVC video signals provide attractive coding
gains with complexity comparable to that of traditional motion/disparity compensation.
64.1 Introduction
The rest of this paper is organized as follows. Standard contributions and previous
academic research associated with disparity vector generation and estimation are
presented in Sect. 64.2. NBDV has been integrated as a necessary technique in the
3D-AVC and 3D-HEVC standards. The technical specifics of the NBDV and of a more
refined form of it, the depth-oriented NBDV (DoNBDV), are presented in detail. The
complexity of the NBDV is presented in Sect. 64.3, where it is analyzed through
extensive discussion and experimental results. The high-level hardware design of the
NBDV is presented in Sect. 64.4 to confirm the benefits of the NBDV method in
real-world codecs. Section 64.5 presents the conclusion of this article.
The solution presented in [17] became the basis for the development of the NBDV
derivation method. After its effectiveness was validated on 3D platforms, the NBDV
was integrated as an important method in the 3D-HEVC and 3D-AVC standards. Already
coded motion fields are used to generate the NBDV-based disparity vectors with no
additional signaling, and no conversion between depth and disparity takes place
during this generation process. The NBDV performs the same process at both the
decoder and the encoder, with lower complexity than any main component of the video
codec. This section presents the motivation for the NBDV together with a broad
picture of the NBDV process, and then discusses the NBDV optimization, i.e. DoNBDV.
The actual NBDV method was presented in [18, 19] for 3D-HEVC. The basic
concept of this method is to use the temporal and spatial neighboring blocks, as
shown in Fig. 64.1. The effectiveness of an NBDV method depends on the following
features: (i) the likelihood of finding a disparity motion vector among the
neighboring blocks, so that a disparity vector can be obtained; (ii) if one is
obtained, the accuracy of the disparity vector in terms of rate-distortion
optimization; (iii) the increase in memory accesses to reference frames; (iv) the
number of additional blocks to be checked and of additional calculations required;
and (v) the extra temporary memory needed to complete NBDV [20].
552 W. Ahmad et al.
An important aspect of the NBDV method is which blocks must be accessed to decode
the current block. If the required blocks are already used by HEVC decoding, then
retrieving them again is not counted as an extra load, particularly from a memory
bandwidth viewpoint.
A major milestone among the NBDV optimization methods is the reduction of
extra memory accesses. The decoded depth of the main view (also called the base
view) already exists while the dependent view texture is being coded. The
availability of the base view depth map can improve the disparity derivation used
for dependent view texture coding. A better disparity vector can be obtained by the
following steps:
1. An NBDV-based disparity vector is derived.
2. This NBDV vector is then used to locate the matching block in the coded depth of
the reference view. This reference view must have the same view order index as
the NBDV-based disparity vector. If the located depth block lies on the boundary
of, or outside, the depth picture, the pixels located outside the depth picture
are clipped to the boundary of the picture. The samples located within the depth
picture are left untouched.
3. The depth of the matching block in the base view is taken as the “virtual depth
block” of the current block of the dependent view.
4. The four corner pixels of the virtual depth block are used to retrieve the
maximum depth value.
5. The disparity is obtained by converting the maximum depth value found in
step 4.
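Steps 2 to 5 can be sketched in a few lines. This is an illustrative model, not the HTM implementation: the function names are hypothetical, the disparity is assumed to be horizontal and expressed in whole pixels, and the depth-to-disparity conversion is replaced by a made-up linear mapping because the real one depends on camera parameters that are not given here.

```python
def clip(value, low, high):
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def linear_depth_to_disparity(depth_value, scale=1, offset=0, shift=4):
    # Hypothetical stand-in for the camera-parameter based conversion.
    return (scale * depth_value + offset) >> shift


def refine_disparity(depth_map, x, y, width, height, nbdv_dx,
                     convert=linear_depth_to_disparity):
    """Derive a refined disparity from the base-view depth map.

    depth_map : list of rows of depth samples of the base view
    (x, y)    : top-left position of the current block
    nbdv_dx   : horizontal NBDV disparity in pixels (assumed integer)
    """
    pic_h = len(depth_map)
    pic_w = len(depth_map[0])
    left, top = x + nbdv_dx, y                     # step 2: locate the depth block
    right, bottom = left + width - 1, top + height - 1
    corners = [(left, top), (right, top), (left, bottom), (right, bottom)]
    # Positions outside the picture are clipped to its boundary (step 2),
    # then the maximum of the four corner samples is taken (step 4).
    max_depth = max(
        depth_map[clip(cy, 0, pic_h - 1)][clip(cx, 0, pic_w - 1)]
        for cx, cy in corners
    )
    return convert(max_depth)                      # step 5


if __name__ == "__main__":
    depth = [[10, 20, 30, 40],
             [50, 60, 70, 80],
             [90, 100, 110, 120],
             [130, 140, 150, 160]]
    print(refine_disparity(depth, 2, 2, 2, 2, 1))
```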
D0 denotes the coded depth map of view 0, as shown in Fig. 64.2, and T1 is the
texture to be coded. Using the disparity vector estimated by NBDV, the depth block
for the current block (CB) is derived from the coded depth D0. In NBDV the estimated
disparity vector is obtained as stated in step 1; this method is already integrated
in the Test Model of HEVC [22]. The candidates for disparity vectors can come from
temporal/spatial motion compensated predicted (DV-MCP) neighbouring blocks or from
temporal/spatial disparity compensation prediction (DCP) neighbouring blocks. The
main purpose of the NBDV optimization is to use the extracted virtual depth to
retrieve a more precise disparity vector for prediction. In the existing
implementation, the maximum depth value of the virtual depth block is converted into
the new disparity vector. The camera parameters and view position are used for the
conversion of depth values to disparity values. The new, improved disparity vector
is termed the “depth oriented neighbouring block disparity vector” (DoNBDV). The
only overhead of the DoNBDV is the access to the reference depth buffer. Other
coding tools can use the virtual depth obtained during the estimation of the
DoNBDV. The merge mode and Advanced Motion Vector Prediction (AMVP) use the DoNBDV
to obtain the inter-view motion prediction.
According to the HTM reference software and the algorithm description, the steps
listed in Sect. 64.2 are used for the disparity vector derivation. One depth value
is maintained for each 4 × 4 block of the depth map. During this process, holes may
occur in the depth map and a hole-filling process may be carried out. The
hole-filling process fills a depth hole with the available foremost depth value of
the same line.
One of the most important issues in hardware design is the worst-case memory
bandwidth requirement, because the system on chip (SoC) has a fixed total data
transfer rate and the video hardware must share it with other applications. Since
3D-HEVC supports all HEVC tools, the extra memory bandwidth required by the 3D-HEVC
tools should be well controlled to limit the overall memory bandwidth requirement.
Owing to the motion prediction of the HEVC design, no extra memory access is needed
for the spatial neighboring blocks of NBDV because those blocks are already
accessible.
Since half of the temporal neighboring blocks belong to the Temporal Motion
Vector Prediction (TMVP) co-located picture, extra memory access is needed only for
the two blocks of the additional candidate picture. In general, the additional
memory bandwidth required for the NBDV process is the same as that required for
the HEVC TMVP process. The additional candidate picture is accessed in the same
way as for TMVP.
The overall memory access analysis is presented by evaluating the number of samples
to be retrieved to decode the smallest (i.e. 8 × 8) coding unit in the worst-case
scenario. The worst case occurs for merge mode and a Prediction Unit (PU) size of
8 × 8 with bi-prediction. We assume 4 bytes for each motion vector and 1 byte for
each reference picture index, although this is implementation dependent.
The complexity of dependent view decoding in 3D-HEVC can be analyzed by comparing
it with single-view HEVC coding; complex modes such as Advanced Residual Prediction
(ARP) and inter-view motion prediction are excluded from this analysis. Hence, only
major modules such as merge list construction and motion compensation are considered
as the anchor for memory bandwidth.
Motion Compensation. In HEVC, 8-tap interpolation filters [23, 24] are used to
interpolate one PU of size 8 × 8. For a bi-predicted 8 × 8 block, 450 (i.e., (8 +
7) × (8 + 7) × 2) pixels need to be accessed. Depending on the memory access
pattern, this figure can be even higher for motion compensation. Thus, the NBDV
process may account for an even smaller percentage of the total memory bandwidth.
Construction process of Merge list. For the construction of the merge list, the
motion information, consisting of two motion vectors and the indices of two reference
pictures, of up to 2 temporal neighboring blocks in the co-located picture and 5
spatial neighboring blocks is retrieved. For creating the two combined lists of the
said block, the total number of bytes to retrieve is (2 + 8) × 2 × (5 + 2) = 140.
NBDV. As described earlier, the motion data, including the reference indices and the
two motion vectors associated with the two additional blocks of the extra candidate
picture, needs to be accessed for a block of size 8 × 8, totaling up to (2 + 8) × 2
= 20 bytes. Because of the inherent characteristics of NBDV, the blocks at the
temporal positions can first be examined through the four reference indices of the
two blocks, and the motion vectors are retrieved only if a reference index identifies
an inter-view prediction reference picture. The additional bytes to be
retrieved in this situation are 4 + 4 = 8 bytes.
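The three figures above follow directly from the stated assumptions (4 bytes per motion vector, 1 byte per reference index, 8-tap filters, bi-prediction). The short sketch below reproduces them; the function and constant names are hypothetical and the factors are copied from the expressions in the text rather than derived independently.

```python
MV_BYTES = 4        # assumed size of one motion vector
REF_IDX_BYTES = 1   # assumed size of one reference picture index


def mc_pixels(block_size=8, filter_taps=8, predictions=2):
    """Reference pixels fetched for motion compensation of one block."""
    span = block_size + filter_taps - 1     # 8 + 7 samples per dimension
    return span * span * predictions


def motion_info_bytes(num_mvs=2, num_ref_indices=2):
    """Bytes of motion information per block: (2 + 8) in the text."""
    return num_ref_indices * REF_IDX_BYTES + num_mvs * MV_BYTES


# Merge list: (2 + 8) x 2 x (5 spatial + 2 temporal) neighboring blocks.
merge_list_bytes = motion_info_bytes() * 2 * (5 + 2)

# NBDV: (2 + 8) x 2 for the two blocks of the extra candidate picture.
nbdv_bytes = motion_info_bytes() * 2

if __name__ == "__main__":
    print(mc_pixels(), merge_list_bytes, nbdv_bytes)
```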
Table 64.1 shows that the increase in memory access for the NBDV is about 1%. For
3D-HEVC, additional tools such as ARP and inter-view motion prediction may need
more memory accesses than NBDV, because they access pixels from extra blocks and
the motion vectors of the inter-view reference picture. Since the major tools of
3D-HEVC are based on NBDV, this estimated 1% memory bandwidth increment is
acceptable.
In NBDV, the motion data of the two candidate pictures in the Decoded Picture Buffer
(DPB) is used. Each picture in the DPB already holds its motion vectors, so no extra
information needs to be stored in the DPB and the memory storage does not increase.
Even though the Derived Disparity Vector (DDV) may require storing one vector per
slice, the temporary memory required for decoding the current picture is negligible.
The NBDV process only needs to check for a Disparity Motion Vector (DMV), so its
complexity is negligible in comparison with motion compensation. The NBDV accesses
the reference indices of up to nine blocks for both reference picture lists. Hence,
only 18 extra conditions per CU are inserted. In each access, the reference index
must be checked, and this check is approximately as expensive (in terms of hardware)
as an addition.
For a bi-predicted 8 × 8 CU block, the interpolation of 64 × 2 pixels for the luma
part may be needed. Eight multiplications and seven additions are required for the
interpolation of each luma pixel. If we suppose that one multiplication is about 5
times as complex as one addition, then the complexity of motion compensation for the
luma component is roughly (5 × 8 + 7) × 128 = 6016 additions. Thus, the complexity of
luma motion compensation is about 334 (6016/18) times the complexity of the NBDV
process. Therefore, in comparison with the overall decoder’s computation, the NBDV
process has negligible computational complexity, as shown in Table 64.2.
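The complexity estimate above is a short piece of arithmetic and can be checked directly. The sketch below uses hypothetical function names and the same assumption as the text, namely that one multiplication costs about five additions.

```python
def mc_luma_additions(pixels=64 * 2, mults_per_pixel=8, adds_per_pixel=7,
                      mult_cost_in_adds=5):
    """Equivalent additions for bi-predicted 8x8 luma interpolation."""
    return (mult_cost_in_adds * mults_per_pixel + adds_per_pixel) * pixels


def nbdv_checks(blocks=9, reference_lists=2):
    """Reference-index checks per CU, each counted as one addition."""
    return blocks * reference_lists


if __name__ == "__main__":
    mc = mc_luma_additions()
    nbdv = nbdv_checks()
    print(mc, nbdv, mc / nbdv)
```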
Table 64.2 NBDV complexity and complexity of bi-prediction of luma block of motion
compensation
CU size 8 × 8 | Required approximate hardware | Complexity
NBDV | 18 additions | Negligible as compared to bi-prediction
Bi-predicted luma component motion compensation (HEVC) | 6016 additions | 334 × NBDV
the availability of the basic requirements of the NBDV process i.e. spatial/temporal
neighboring blocks of the CU.
In Fig. 64.4, the block-level hardware implementation of the DoNBDV is presented. It
includes the already coded depth map of the base view for a more accurate
derivation of the disparity vector of the current coding block.
64.5 Conclusion
In this paper, the NBDV and DoNBDV designs are examined in terms of complexity,
performance and other design considerations. It is determined that NBDV and DoNBDV
for 3D-HEVC video signals offer attractive coding gains with complexity comparable
to that of traditional motion/disparity compensation. NBDV and DoNBDV are important
tools of the 3D extensions of HEVC and H.264/AVC. As discussed above, these tools
have appropriate coding gains, and their hardware implementation can further
help to improve the coding gain and performance of video coding standards.
References
1. Tanimoto M (2006) Overview of free viewpoint television. Signal Process Image Commun
21(6):454–461
2. Vetro A, Matusik W, Pfister H, Xin J (2016) Coding approaches for end-to-end 3D TV systems.
In: Proceedings of the 23rd picture coding symposium (PCS’04), San Francisco, California,
USA, Dec 2004, pp 319–324
3. Urey H, Chellappan KV, Erden E, Surman P (2011) State of the art in stereoscopic and
autostereoscopic displays. Proc IEEE 99(4). https://doi.org/10.1109/JPROC.2010.2098351
4. Dodgson NA (2005) Autostereoscopic 3D displays. IEEE Comput 38(8):31–36
5. Fehn C (2004) Depth-image-based rendering (DIBR), compression, and transmission for a new
approach on 3D-TV. In: Proceedings of SPIE, stereoscopic displays and virtual reality systems
XI, vol 5291, May 2004, p 93
6. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int J Comput Vis 47(1):7–42
7. Foix S, Alenya G, Torras C (2011) Lock-in time-of-flight (ToF) cameras: a survey. IEEE Sensors
J 11(9):1917–1926
8. Salvi J, Pagès J, Batlle J (2004) Pattern codification strategies in structured light systems.
Pattern Recognit 37(4):827–849
9. Call for proposals on 3D video coding technology, N12036, MPEG of ISO/IEC
JTC1/SC29/WG11, Geneva, Switzerland, Mar 2011
10. Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R (2007) Depth map
creation and image based rendering for advanced 3DTV services providing interoperability
and scalability. Signal Processing: Image Communication. Special Issue on 3DTV, Feb 2007
11. Advanced video coding for generic audiovisual services, Rec. ITU-T H.264 and ISO/IEC
14496-10 (MPEG-4 AVC), 2012
12. High efficiency video coding, Rec. ITU-T H.265 and ISO/IEC 23008-2, Jan 2013
13. Sullivan GJ, Ohm J-R, Han W-J, Wiegand T (2012) Overview of the high efficiency video
coding (HEVC) standard. IEEE Trans Circuits Syst Video Technol 22(12):1649–1668
14. Hannuksela MM, Chen Y, Suzuki T, Ohm J-R, Sullivan G (2013) 3D-AVC draft text 8. Presented
at the 6th meeting joint collaborative team on 3D video coding extension development, Geneva,
Switzerland, 25 Oct–1 Nov, 2013, Doc. JCT3V-F1002
15. Tech G, Wegner K, Chen Y, Yea S (2013) 3D-HEVC Draft Text 2. Presented at the 6th meeting
joint collaborative team on 3D video coding extension development, Geneva, Switzerland, 25
Oct–1 Nov 2013, Doc. JCT3V-F1001
16. Applications and requirements on 3D video coding, N12035, MPEG of ISO/IEC
JTC1/SC29/WG11, Geneva, Switzerland, Mar 2011
17. Schwarz H, Wiegand T (2012) Inter-view prediction of motion data in multiview video coding.
In: Proceedings of picture coding symposium, May 2012, pp 101–104
18. Zhang L, Chen Y, Karczewicz M (2012) CE5.h: disparity vector generation results. Presented at
the 1st meeting joint collaborative team on 3D video coding extension development, Stockholm,
Sweden, 16–20 July 2012, Doc. JCT3V-A0097
19. Zhang L, Chen Y, Karczewicz M (2013) Disparity vector based advanced inter-view prediction
in 3D-HEVC. In: Proceedings of IEEE international symposium circuits system, May 2013,
pp 1632–1635
20. Tech G, Müller K, Ohm J-R, Vetro A, Overview of the multiview and 3D extensions of high
efficiency video coding. IEEE Trans Circuits Syst Video Technol 26(1)
21. Chen Y et al (2014) Test model 10 of 3D-HEVC and MV-HEVC. Joint collaborative team
on 3D video coding extensions of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11,
Document JCT3V-J1003. 10th meeting, Strasbourg
22. Gerhard Tech et al (2012) 3D-HEVC Test Model 1. Joint collaborative team on 3D video
coding extension development of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11,
JCT3V-A1005, July 2012, Stockholm
23. Ahmad W et al (2016) High level synthesis based FPGA implementation of H. 264/AVC
sub-pixel luma interpolation filters. In: Modelling symposium (EMS), 2016, European. IEEE
24. Ahmad W, Martina M, Masera G (2015) Complexity and implementation analysis of synthe-
sized view distortion estimation architecture in 3D High Efficiency Video Coding. In: 2015
International conference on 3D imaging (IC3D). IEEE
Author Index
E
ElHassan, Bachar, 145

F
Falaschi, Francesco, 117
Falbo, Vincenzo, 103
Fanucci, Luca, 117, 381, 483, 499, 513, 541
Faralli, Stefano, 11, 233
Fazzolari, Rocco, 325
Fernandes Carvalho, D., 81
Ferrari, P., 81
Ferri, Giuseppe, 425
Feyen, D. A. M., 243
Fienga, Francesco, 259
Filosa, Mariangela, 381
Fiore, Gaia, 433

G
Gagliardi, Alessio, 405
Gaiduk, Maksym, 333
Gaioni, L., 19
Gallina, Paolo, 489
Gastaldo, Paolo, 301
Genovese, Antonino, 293
Giammatteo, Paolo, 191
Giardino, Daniele, 325
Gianoglio, Christian, 301
Girolami, Alberto, 355
Girolami, Marco, 33
Giuffrida, Gianluca, 381
Gloria De, Alessandro, 103, 349, 469
Graziano, M., 173
Guidi, Francesco, 371
Guzzi, Francesco, 341, 489

H
Hussain, Fawad, 549

L
Lasagni, Marco, 251
Laurendi, D., 267
Lopomo, N. F., 81

M
Magazzù, G., 11
Magistrati, Giorgio, 513
Malatesta, Michelangelo Maria, 363
Manghisoni, M., 19
Mangraviti, G., 25
Mansour, Ali, 145
Maresca, Luca, 259
Marino, Antonino, 483, 499, 541
Marrazzo, Vincenzo Romano, 259
Marsi, Stefano, 341, 489
Martina, Maurizio, 137, 549
Martínez Madrid, Natividad, 333
Marzani, Alessandro, 355, 363
Masera, Guido, 137
Massari, Luca, 381
Mastrandrea, Antonio, 109, 505, 529
Matera, Riccardo, 521
Matta, Marco, 325
Maya López, Armando, 43
Meacci, Valentino, 371
Melo, Douglas, 91
Menichelli, Francesco, 109, 505, 529
Menicucci, Alessandra, 91
Meoni, Gabriele, 499, 513
Mercola, M., 243
Merenda, M., 267
Messina, E., 243
Mestice, Marco, 451
Mihet-Popa, Lucian, 433
Monda, D., 25
Montagni, Marco, 389
Morozzi, Arianna, 3
Motto Ros, Paolo, 207
Muanenda, Yonas, 233
Muttillo, Mirco, 425