Article info

Article history:
Received 1 February 2021
Received in revised form 27 March 2021
Accepted 13 April 2021
Available online 16 April 2021

Keywords:
Data fusion
Artificial intelligence
Spectroscopy
Chemistry

Abstract

In the domain of chemometrics, multiblock data analysis is widely performed for exploring or fusing data from multiple sources. Commonly used methods for multiblock predictive analysis are extensions of latent space modelling approaches. However, recently, deep learning (DL) approaches such as convolutional neural networks (CNNs) have outperformed the traditional single block latent space modelling chemometric approaches such as partial least-squares (PLS) regression. CNN-based DL modelling can also be performed to deal simultaneously with multiblock data, but this was never explored until this study. Hence, this study presents for the first time the concept of parallel input CNN-based DL modelling for multiblock predictive chemometric analysis. The parallel input CNN-based DL modelling uses individual convolutional layers for each data block to extract key features that are later combined and passed to a regression module composed of fully connected layers. The method was tested on a large real visible and near-infrared (Vis-NIR) data set related to dry matter prediction in mango fruit. To obtain multiblock data, the visible (Vis) and near-infrared (NIR) parts were treated as two separate blocks. The performance of the parallel input CNN was compared with traditional single block CNN-based DL modelling, as well as with a commonly used multiblock chemometric approach called sequentially orthogonalized partial least-squares (SO-PLS) regression. The results showed that the proposed parallel input CNN-based deep multiblock analysis outperformed both the single block CNN-based DL modelling and the SO-PLS regression analysis. The root mean squared error of prediction obtained with the deep multiblock analysis was 0.818%, relatively lower by 4% and 20% than single block CNNs and SO-PLS regression, respectively. Furthermore, the deep multiblock approach attained a ~3% lower RMSE compared to the best known result on the mango data set used for this study. The deep multiblock analysis approach based on parallel input CNNs could be considered a useful tool for fusing data from multiple sources.

© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
* Corresponding author.
E-mail address: [email protected] (Puneet Mishra).
https://doi.org/10.1016/j.aca.2021.338520
0003-2670/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Puneet Mishra and D. Passos Analytica Chimica Acta 1163 (2021) 338520
2.2. Parallel CNNs based deep learning

The DL model architecture used in this study was an extension of the 1-dimensional convolutional neural network (1D-CNN) architecture presented in [27] and implemented in [35-37]. The 1D-CNN architecture used in [27] was introduced to deal with a single block of data. For multiblock analysis, the 1D-CNN architecture must be modified to accept multiple sources of data. An intuitive solution to this problem is to implement an architecture with parallel input layers that can simultaneously extract features from multiple sources of data. Hence, the solution proposed in this study is to use parallel convolution layers to process the different data types, whose outputs are later concatenated and flattened before being fed to a block of dense layers. A summary of the architecture proposed in this study for deep multiblock analysis is shown in Fig. 1. For an easier explanation of the concept, we opted to implement a simple CNN architecture with only two receptive fields, composed of two convolution layers with one filter each and stride = 1, followed by a flatten layer connected to three fully connected (FC) layers with 36, 18 and 12 neurons, respectively, and a final output layer with one neuron. The number of units (or neurons) in the FC layers follows the prescription of the original architecture in [25]. After each layer, the data flow through an exponential linear unit (ELU) activation function, except for the last output layer, where a linear activation function was used. The mean squared error (MSE) was used as the loss function, and layer regularization was implemented by adding an L2 penalty (β) on the model weights to the loss function. We relied on the adaptive moment estimation (Adam [32]) optimizer with the back-propagation algorithm to train the model weights. Adam was initiated with an initial learning rate (LR) given by 0.01 × (batch size)/256 and, to increase the chances of convergence toward a global minimum, the LR was iteratively decreased by a factor of 2 when the validation loss had not improved by 10^-5 after 25 epochs (using the tf.keras.ReduceLROnPlateau() function). The maximum number of epochs allocated for the training was 700, but that value was almost never reached due to the use of the early stopping technique (tf.keras.EarlyStopping() function). This technique helps avoid overfitting by stopping the training process if the validation metrics do not improve after a certain consecutive number of epochs.

Fig. 1. A summary of the parallel CNNs architecture used for deep multiblock predictive modelling.

Since the main objective of this work is the introduction of a proof-of-concept architecture, for the sake of simplicity we chose to optimize only a limited number of model hyperparameters, i.e., the filter sizes for conv. blocks 1 and 2, the size of the training mini-batch and the strength of the L2 regularization, β. A grid search was implemented that probed 1260 models, with filter sizes 1 and 2 in [5, 10, 15, 20, 25, 30], batch size in [32, 64, 128, 256, 512] and β in [0.001, 0.003, 0.008, 0.01, 0.015, 0.02, 0.03]. The training set was further partitioned into calibration (66.66%) and validation (33.33%) sets using the 'train_test_split' function from sklearn (https://scikit-learn.org/stable/).

The optimisation strategy first checks the effect of different β on the minimum root mean squared error (RMSE) of the calibration and validation sets, and identifies an "optimal" β as the one achieving the lowest difference between the calibration and validation set RMSEs. A low difference in these RMSEs is a signal of less overfitting of the model on the training set. The optimal batch size was chosen based on the same criterion. Once the optimal β and batch size were set, the kernel sizes for each block were identified by searching for common minima in the calibration and validation filter-1-size vs filter-2-size RMSE contour plots. The models with optimal hyperparameters were used for predicting the test set. The parallel input CNN was implemented using the Python (3.6) language and the deep learning framework TensorFlow (2.4) with the tf.keras API, on a workstation equipped with an NVIDIA GPU (GeForce RTX 2080 Ti), an Intel Core i7-4770K @ 3.5 GHz and 64 GB RAM, running the Microsoft Windows 10 OS. The chemometric analysis related to outlier removal was performed in MATLAB 2018b (MathWorks, Natick, USA) using the freely available MBA-GUI [1].

2.3. Benchmark analysis

The performance of the deep multiblock analysis based on parallel input CNN modelling was compared with two benchmark models. The first was the single block CNN presented in Ref. [27], where the two data blocks were concatenated (in their original form) in the variable domain to make a single data block [27]. This single block CNN was also optimized with a grid search over the filter size of the convolutional layer, the batch size and β, using the same hyperparameter intervals previously defined in Section 2.2. Optimal values were chosen based on the same criteria as presented in Section 2.2, with the difference that the final step involved a search for minima in the "batch size" vs "filter size" contour plot.

The multiblock CNN analysis method was also compared in terms of accuracy with a popular multiblock predictive modelling technique called sequentially orthogonalized partial least-squares (SO-PLS) regression [17]. SO-PLS first builds a PLS regression on the first block of data to extract the scores related to the property of interest. The scores are then used to orthogonalize the data matrix of the second block and the response variable, to remove the already explained part of the property of interest. The orthogonalized second block data are later used to build a new PLS model. At last, all the scores from the two different blocks are concatenated and used to build the final model. The SO-PLS regression was implemented with the freely available codes from MBA-GUI [1]. A key parameter to optimize in SO-PLS is the number of latent variables (LVs) for each data block. The usual approach is to try all possible combinations of LVs from all blocks and later choose the one carrying the lowest error [3,8]. However, in this work, to achieve a faster optimisation of the number of LVs, a sequential optimisation was performed. In the sequential optimisation, at first the number of LVs for the first block was identified by increasing the LVs from 1 to 40 and monitoring the performance of the model on the validation set. The optimal number of LVs for the
first block was selected as the elbow point in the error plot. The scores of the first block were then used to orthogonalize the second block and the property of interest. Later, the optimal number of LVs for the 2nd block was found by varying the LVs from 1 to 40 and monitoring the performance of the model on the orthogonalized validation set. Once again, the optimal number of LVs for the 2nd block was selected as the elbow point of the error plot. Finally, the multiblock SO-PLS model with the optimal LVs was built and tested on the independent test set. In all cases, the performance of the models was judged based on the RMSE.

3. Results and discussion

3.1. Spectra and reference data

The mean spectra for the two blocks, i.e., Vis and NIR, for mango fruit are shown in Fig. 2. Further, the reference dry matter distributions for the calibration (red), validation (blue) and test (green) sets are shown in Fig. 2C. In the Vis spectra (Fig. 2A), some key peaks at 500 nm and 670 nm can be noted. These peaks are related to the colour of the outer peel, which can range from green to yellow to red depending on the fruit cultivar and the maturity stage. During ripening, the green colour of the outer peel changes toward red tones due to chlorophyll degradation. Hence, indirectly, the colour of the outer peel correlates with the maturity stage of the fruit, and thus also with the DM (%) in the fruit. In the NIR spectra (Fig. 2B), the main peak at 960 nm, related to the 3rd overtone of the O-H bond of H2O, can be noted. This O-H overtone is due to the high moisture content of the fruit and is inversely related to the dry matter in the fruit (dry matter = 1 - moisture). From the distribution of reference DM (Fig. 2C), it can be noted that the DM range for the test set was higher compared to the training and validation sets. The data in the Vis range contain peaks of different widths; for example, the peak near 670 nm (Fig. 2A) is much narrower than the broad peak at 960 nm (Fig. 2B). Hence, in practical terms for this study case, using the same convolutional filter size for the Vis and NIR data may not be an optimal solution, and an exploration of the optimal convolutional filter size was required for the different data blocks.

... closer to each other. Around β = 0.01 the difference between the RMSE of the calibration and validation sets was minimal, and it was considered a good compromise point between model performance (low validation RMSE) and low overfitting (the smaller difference between RMSEs). For β < 0.01, the validation RMSE was lower, but the higher difference to the calibration RMSE indicates that the model was overfitting more, hence losing its capacity to generalize well when applied to the test set.

After choosing β = 0.01, the optimal filter and batch sizes were identified by locating common minima of the RMSE on the calibration (Fig. 4A) and validation (Fig. 4B) sets. A kernel filter size of 25 and a batch size of 64 were identified as optimal, as highlighted in Fig. 4B. The model based on these optimal parameters was tested on the independent test set and an RMSE of 0.855% was obtained (Fig. 5).

The results of the benchmark SO-PLS modelling are shown in Fig. 6. It can be noted that the SO-PLS identified 15 (Fig. 6A) and 8 (Fig. 6B) LVs in the Vis and NIR data blocks, respectively. Finally, the model based on the optimal LVs was tested on the independent test set and an RMSE of 1.03% was reached. The performance of the SO-PLS was poorer compared to the single block CNN modelling performed on the concatenated data. However, the performance of the SO-PLS was better compared to the single block PLS analysis (on NIR data) presented on the same data set in earlier studies [33,34,36]. Hence, the SO-PLS analysis demonstrates that combining the Vis information with the NIR could improve the model performance.
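The parallel input idea described in Section 2.2 can be sketched as a plain forward pass. The following is a minimal NumPy illustration, not the TensorFlow implementation used in the study: the layer sizes (36, 18 and 12 FC neurons plus one output neuron), single-filter convolutions with stride 1, ELU activations and the linear output follow the text, while the block lengths, kernel sizes and weight values are hypothetical.

```python
import numpy as np

def elu(x, alpha=1.0):
    # exponential linear unit, used after every layer except the output
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def conv1d(x, kernel):
    # single-filter 1D convolution, stride 1, 'valid' padding
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def parallel_cnn_forward(vis, nir, k_vis, k_nir, fc_weights, fc_biases):
    # one convolutional branch per data block, each with its own kernel size;
    # branch outputs are concatenated ("flattened") into one feature vector
    feat = np.concatenate([elu(conv1d(vis, k_vis)), elu(conv1d(nir, k_nir))])
    # fully connected layers (36, 18, 12 neurons) with ELU activations
    for W, b in zip(fc_weights[:-1], fc_biases[:-1]):
        feat = elu(W @ feat + b)
    # final single-neuron output layer with linear activation
    return float(np.dot(fc_weights[-1], feat) + fc_biases[-1])

# hypothetical block lengths and random weights, for shape checking only
rng = np.random.default_rng(0)
vis, nir = rng.normal(size=300), rng.normal(size=200)   # two spectral blocks
k_vis, k_nir = rng.normal(size=5), rng.normal(size=25)  # per-block kernels
n_feat = (300 - 5 + 1) + (200 - 25 + 1)                 # concatenated length
sizes = [n_feat, 36, 18, 12]
fc_weights = [0.05 * rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
fc_weights.append(0.05 * rng.normal(size=12))           # output layer weights
fc_biases = [np.zeros(o) for o in sizes[1:]] + [0.0]
y_hat = parallel_cnn_forward(vis, nir, k_vis, k_nir, fc_weights, fc_biases)
```

In the actual model the weights would, of course, be learned by Adam against the MSE loss rather than drawn at random; the sketch only shows how the two branches feed a single regression head.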
Fig. 2. Mean spectra of mango for the calibration, validation and test sets. (A) Visible, and (B) near-infrared. (C) Histogram of reference dry matter (DM %) for the calibration, validation and test sets.
Fig. 4. A summary of root mean squared error (RMSE) obtained from the grid search for optimal filter (or kernel) and batch sizes. (A) Calibration set, and (B) validation set. The
optimal hyperparameters correspond to common minima in both maps and are marked as “Optimal” in (B).
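The grid search behind these maps enumerates every combination of the two filter sizes, the batch size and β, and the β criterion compares calibration and validation RMSE. A short sketch follows; the RMSE values are made up for illustration, since in the study they come from actually training each model.

```python
from itertools import product

filter_sizes = [5, 10, 15, 20, 25, 30]                  # conv blocks 1 and 2
batch_sizes = [32, 64, 128, 256, 512]
betas = [0.001, 0.003, 0.008, 0.01, 0.015, 0.02, 0.03]  # L2 strengths

# every hyperparameter combination probed by the grid search
grid = list(product(filter_sizes, filter_sizes, batch_sizes, betas))
n_models = len(grid)  # 6 * 6 * 5 * 7 = 1260 models, as stated in the text

def pick_beta(rmse_cal, rmse_val):
    # "optimal" beta: smallest gap between calibration and validation RMSE,
    # taken in the text as a signal of low overfitting
    return min(rmse_cal, key=lambda b: abs(rmse_val[b] - rmse_cal[b]))

# illustrative (made-up) RMSEs: the gap is smallest at beta = 0.01
rmse_cal = {0.008: 0.72, 0.01: 0.78, 0.015: 0.85}
rmse_val = {0.008: 0.84, 0.01: 0.80, 0.015: 0.88}
best = pick_beta(rmse_cal, rmse_val)
```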
Fig. 6. A summary of the SO-PLS model. (A) Latent variables modelled from the visible data block, (B) complementary latent variables extracted from the near-infrared data block, and (C) performance of the model on the test set. The vertical lines in (A, B) show the optimal number of latent variables extracted from the visible and near-infrared spectral data.
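The orthogonalization step at the heart of SO-PLS (removing from the second block and from the response the variation already captured by the first-block scores) can be sketched with a simple projection. This is a generic sketch, not the MBA-GUI code used in the study, and the first-block PLS score matrix T1 is assumed to have been fitted beforehand.

```python
import numpy as np

def orthogonalize(M, T):
    # remove from M the part lying in the column space of the scores T:
    # M_orth = M - T (T^T T)^-1 T^T M
    return M - T @ np.linalg.pinv(T) @ M

# toy data: 10 samples, first-block scores T1 (2 LVs), second block X2, response y
rng = np.random.default_rng(1)
T1 = rng.normal(size=(10, 2))
X2 = rng.normal(size=(10, 6))
y = rng.normal(size=(10, 1))

X2_orth = orthogonalize(X2, T1)   # second block, already-explained part removed
y_orth = orthogonalize(y, T1)     # response deflated the same way
```

A second PLS model is then fitted on `X2_orth`, so its latent variables are complementary to those of the first block by construction.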
Fig. 8. A summary of root mean squared error (RMSE) obtained from the grid search for optimal kernel sizes for the visible and near-infrared data. (A) Calibration set, and (B) validation set. Three sample models (Models 1, 2 and 3) were selected showing minima in both the calibration and validation sets.
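Reading a kernel-size pair off the two contour maps amounts to finding a grid cell that is a minimum in both the calibration and validation RMSE surfaces. The sketch below is a crude programmatic stand-in for that visual inspection, with hypothetical RMSE grids.

```python
import numpy as np

def common_minimum(rmse_cal, rmse_val):
    # grid cell minimising calibration + validation RMSE jointly: a simple
    # stand-in for visually locating common minima in both contour maps
    combined = rmse_cal + rmse_val
    return np.unravel_index(np.argmin(combined), combined.shape)

filter_sizes = [5, 10, 15, 20, 25, 30]
# hypothetical 6 x 6 RMSE maps over (filter-1-size, filter-2-size)
rmse_cal = np.array([[1.2 + 0.01 * abs(i - 3) + 0.02 * abs(j - 4)
                      for j in range(6)] for i in range(6)])
rmse_val = rmse_cal + 0.05  # assume the validation map shares the minimum
i_best, j_best = common_minimum(rmse_cal, rmse_val)
best_pair = (filter_sizes[i_best], filter_sizes[j_best])
```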
Fig. 9. A summary of the performance of three multiblock models. (A) Model 1, (B) Model 2, and (C) Model 3.
Fig. 10. A summary of the mean activations of the CNN for the single block case (A) and the multiblock CNN case (B).
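Mean activation maps such as those in Fig. 10 average a layer's activations over all samples, showing which spectral regions the convolution responds to most strongly. A minimal sketch with made-up activations:

```python
import numpy as np

def mean_activation_profile(acts):
    # acts: (n_samples, n_positions) activations from one convolutional branch;
    # averaging the magnitude over samples highlights influential regions
    profile = np.abs(acts).mean(axis=0)
    return profile, int(np.argmax(profile))

# made-up activations for three samples over five spectral positions
acts = np.array([[0.1, 0.9, 0.2, 0.0, 0.3],
                 [0.2, 1.1, 0.1, 0.1, 0.2],
                 [0.0, 1.0, 0.3, 0.2, 0.1]])
profile, peak_idx = mean_activation_profile(acts)
```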
CRediT authorship contribution statement

Puneet Mishra: Conceptualization, Methodology, Software, Writing - original draft, Data curation. Dário Passos: Conceptualization, Software, Methodology, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] P. Mishra, J.M. Roger, D.N. Rutledge, A. Biancolillo, F. Marini, A. Nordon, D. Jouan-Rimbaud-Bouveresse, MBA-GUI: a chemometric graphical user interface for multi-block data visualisation, regression, classification, variable selection and automated pre-processing, Chemometr. Intell. Lab. Syst. (2020), 104139.
[2] P. Mishra, J.M. Roger, D. Jouan-Rimbaud-Bouveresse, A. Biancolillo, F. Marini, A. Nordon, D.N. Rutledge, Recent trends in multi-block data analysis in chemometrics for multi-source data integration, Trac. Trends Anal. Chem. (2021), 116206.
[3] P. Mishra, F. Marini, B. Brouwer, J.M. Roger, A. Biancolillo, E. Woltering, E.H.-v. Echtelt, Sequential fusion of information from two portable spectrometers for improved prediction of moisture and soluble solids content in pear fruit, Talanta 223 (2021), 121733.
[4] A. Biancolillo, R. Bucci, A.L. Magrì, A.D. Magrì, F. Marini, Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication, Anal. Chim. Acta 820 (2014) 23-31.
[5] R. Vitale, O.E. de Noord, J.A. Westerhuis, A.K. Smilde, A. Ferrer, Divide et impera: how disentangling common and distinctive variability in multiset data analysis can aid industrial process troubleshooting and understanding, J. Chemometr. (2020), e3266.
[6] P. Mishra, J.M. Roger, D.N. Rutledge, E. Woltering, SPORT pre-processing can improve near-infrared quality prediction models for fresh fruits and agro-materials, Postharvest Biol. Technol. 168 (2020), 111271.
[7] P. Mishra, J.M. Roger, F. Marini, A. Biancolillo, D.N. Rutledge, Parallel pre-processing through orthogonalization (PORTO) and its application to near-infrared spectroscopy, Chemometr. Intell. Lab. Syst. (2020), 104190.
[8] J.-M. Roger, A. Biancolillo, F. Marini, Sequential preprocessing through ORThogonalization (SPORT) and its application to near infrared spectroscopy, Chemometr. Intell. Lab. Syst. 199 (2020), 103975.
[9] R. Bro, A.K. Smilde, Principal component analysis, Anal. Methods 6 (2014) 2812-2831.
[10] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometr. Intell. Lab. Syst. 58 (2001) 109-130.
[11] P. Geladi, B.R. Kowalski, Partial least-squares regression: a tutorial, Anal. Chim. Acta 185 (1986) 1-17.
[12] A.K. Smilde, I. Måge, T. Næs, T. Hankemeier, M.A. Lips, H.A.L. Kiers, E. Acar, R. Bro, Common and distinct components in data fusion, J. Chemometr. 31 (2017), e2900.
[13] Y. Song, J.A. Westerhuis, A.K. Smilde, Separating common (global and local) and distinct variation in multiple mixed types data sets, J. Chemometr. 34 (2020), e3197.
[14] I. Måge, A.K. Smilde, F.M. van der Kloet, Performance of methods that separate common and distinct variation in multiple data blocks, J. Chemometr. 33 (2019), e3085.
[15] A. Biancolillo, T. Næs, R. Bro, I. Måge, Extension of SO-PLS to multi-way arrays: SO-N-PLS, Chemometr. Intell. Lab. Syst. 164 (2017) 113-126.
[16] T. Næs, O. Tomic, N.K. Afseth, V. Segtnan, I. Måge, Multi-block regression based on combinations of orthogonalisation, PLS-regression and canonical correlation analysis, Chemometr. Intell. Lab. Syst. 124 (2013) 32-42.
[17] A. Biancolillo, T. Næs, M. Cocchi, Chapter 6 - the sequential and orthogonalized PLS regression for multiblock regression: theory, examples, and extensions, in: Data Handling in Science and Technology, Elsevier, 2019, pp. 157-177.
[18] P. Mishra, A. Biancolillo, J.M. Roger, F. Marini, D.N. Rutledge, New data preprocessing trends based on ensemble of multiple preprocessing techniques, Trac. Trends Anal. Chem. (2020), 116045.
[19] T. Skotare, D. Nilsson, S. Xiong, P. Geladi, J. Trygg, Joint and unique multiblock analysis for integration and calibration transfer of NIR instruments, Anal. Chem. 91 (2019) 3516-3524.
[20] M. Alinaghi, H.C. Bertram, A. Brunse, A.K. Smilde, J.A. Westerhuis, Common and distinct variation in data fusion of designed experimental data, Metabolomics 16 (2019) 2.
[21] A. Biancolillo, F. Marini, J.-M. Roger, SO-CovSel: a novel method for variable selection in a multiblock framework, J. Chemometr. 34 (2020), e3120.
[22] A.K. Smilde, J.A. Westerhuis, R. Boqué, Multiway multiblock component and covariates regression models, J. Chemometr. 14 (2000) 301-331.
[23] J.A. Westerhuis, T. Kourti, J.F. MacGregor, Analysis of multiblock and hierarchical PCA and PLS models, J. Chemometr. 12 (1998) 301-321.
[24] K.H. Liland, T. Næs, U.G. Indahl, ROSA - a fast extension of partial least squares regression for multiblock data analysis, J. Chemometr. 30 (2016) 651-662.
[25] S. Park, E. Ceulemans, K. Van Deun, Sparse common and distinctive covariates regression, J. Chemometr. (2020), e3270.
[26] W. Ng, B. Minasny, M. Montazerolghaem, J. Padarian, R. Ferguson, S. Bailey, A.B. McBratney, Convolutional neural network for simultaneous prediction of several soil properties using visible/near-infrared, mid-infrared, and their combined spectra, Geoderma 352 (2019) 251-267.
[27] C. Cui, T. Fearn, Modern practical convolutional neural networks for multivariate regression: applications to NIR calibration, Chemometr. Intell. Lab. Syst. 182 (2018) 9-20.
[28] S. Malek, F. Melgani, Y. Bazi, One-dimensional convolutional neural networks for spectroscopic signal regression, J. Chemometr. 32 (2018), e2977.
[29] E.J. Bjerrum, M. Glahder, T. Skov, Data augmentation of spectral data for convolutional neural network (CNN) based deep chemometrics, arXiv preprint arXiv:1710.01927, 2017.
[30] X.J. Yu, H.D. Lu, D. Wu, Development of deep learning method for predicting firmness and soluble solid content of postharvest Korla fragrant pear using Vis/NIR hyperspectral reflectance imaging, Postharvest Biol. Technol. 141 (2018) 39-49.
[31] X. Yu, H. Lu, Q. Liu, Deep-learning-based regression model and hyperspectral imaging for rapid detection of nitrogen concentration in oilseed rape (Brassica napus L.) leaf, Chemometr. Intell. Lab. Syst. 172 (2018) 188-193.
[32] N. Anderson, K. Walsh, P. Subedi, Mango DMC and spectra Anderson et al. 2020, Mendeley Data, 2020.
[33] N.T. Anderson, K.B. Walsh, J.R. Flynn, J.P. Walsh, Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content. II. Local PLS and nonlinear models, Postharvest Biol. Technol. 171 (2021), 111358.
[34] N.T. Anderson, K.B. Walsh, P.P. Subedi, C.H. Hayes, Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content, Postharvest Biol. Technol. 168 (2020), 111202.
[35] P. Mishra, D. Passos, Realizing transfer learning for updating deep learning models of spectral data to be used in a new scenario, Chemometr. Intell. Lab. Syst. (2021), 104283.
[36] P. Mishra, D. Passos, A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared spectroscopy models for dry matter prediction in mango fruit, Chemometr. Intell. Lab. Syst. (2021), 104287.
[37] P. Mishra, D.N. Rutledge, J.-M. Roger, K. Wali, H.A. Khan, Chemometric pre-processing can negatively affect the performance of near-infrared spectroscopy models for fruit quality prediction, Talanta (2021), 122303.
[38] R. Lu, R. Van Beers, W. Saeys, C. Li, H. Cen, Measurement of optical properties of fruits and vegetables: a review, Postharvest Biol. Technol. 159 (2020), 111003.
[39] W. Saeys, N.N. Do Trong, R. Van Beers, B.M. Nicolaï, Multivariate calibration of spectroscopic sensors for postharvest quality evaluation: a review, Postharvest Biol. Technol. 158 (2019).
[40] K.B. Walsh, J. Blasco, M. Zude-Sasse, X. Sun, Visible-NIR 'point' spectroscopy in postharvest fruit and vegetable assessment: the science behind three decades of commercial use, Postharvest Biol. Technol. 168 (2020), 111246.
[41] B.G. Osborne, Near-infrared spectroscopy in food analysis, in: Encyclopedia of Analytical Chemistry, 2006.