QSPR Studies of Carbonyl Hydroxyl Polyene Indices
Abstract
One of the main disadvantages of synthetic and semi-synthetic polymeric materials is their degradation and aging.
The purpose of this study was to use artificial neural networks (ANN) and multiple linear regression (MLR) to predict the
carbonyl, hydroxyl, and polyene indices (ICO, IOH, and IPO) and the viscosity average molecular weight (Mv) of poly(vinyl chloride),
polystyrene, and poly(methyl methacrylate). These physicochemical properties are considered fundamental in the study
of the photostabilization of polymers. Each polymer studied was represented by a structure of five repeating monomer units.
Quantitative structure-property relationship (QSPR) models obtained by using relevant descriptors showed good predictability.
Internal validation {R2, RMSE, and Q2LOO}, external validation {R2, RMSE, Q2pred, rm2, Δrm2, k, and k′}, and the applicability domain
were used to validate these models. A comparison of the results shows that the ANN models perform better than
the MLR models. Accordingly, the QSPR model developed in this study provides excellent predictions and can be used to
predict ICO, IOH, IPO, and Mv of polymers, particularly those that have not been tested.
Keywords
QSPR, photostabilization, polymers, artificial neural network, multiple linear regressions
[Flowchart: model development loop with tests ε1, ε2, and ε3 (Yes/No branches), an imperfect-prediction branch, a comparison with multiple linear regression (MLR), and a performance (decision) step.]
Test ε1: corresponds to the test of (R → 1, E → 0, and Q2 → 1)
Test ε2: corresponds to the test of (IR > 6 %), "weight method"
Test ε3: corresponds to the test of the applicability domain ("Williams plot")
*: We retain all the descriptors tested (DiFR + PiFR, DiF + PiF) > 6 %
**: We only retain (DiFR + PiFR, DiF + PiF) > 6 %; we remove the (DiFR + PiFR, DiF + PiF) < 6 %
First pass: loop with test ε1
Second pass: loop with tests ε1 + ε2
tors (21), WHIM descriptors (91), and 3D Autocorrelation descriptors (80).
Simplified Molecular Input-Line Entry System (SMILES) notations of polymers were obtained from the ChemBio Ultra Software.

2.3 Selection of relevant descriptors

The purpose of pre-processing the database is to eliminate irrelevant descriptors in order to avoid over-fitting. Therefore, we must remove the variables (descriptors) that have little or no influence on
4 H. MAOUZ et al.: QSPR Studies of Carbonyl, Hydroxyl, Polyene Indices, and Viscosity Average..., Kem. Ind. 69 (1-2) (2020) 1−16
Table 1 – List of structures of the unit polymers (UUUUU) used in the development of QSPR models
PVC (100 % wt/wt)14,33,34 PVC with 0.5 % (wt/wt) additive33 PVC with 0.5 % (wt/wt) additive33
PVC with 0.5 % (wt/wt) additive33 PVC with 0.5 % (wt/wt) additive33 PVC with 0.5 % (wt/wt) additive14
PVC with 0.5 % (wt/wt) additive14 PVC with 0.5 % (wt/wt) additive14 PVC with 0.5 % (wt/wt) additive14
PVC with 0.5 % (wt/wt) additive14 PVC with 0.5 % (wt/wt) additive34 PVC with 0.5 % (wt/wt) additive34
PVC with 0.5 % (wt/wt) additive34 PS (100 % wt/wt)32 PS with 0.5 % (wt/wt) additive32
PS with 0.5 % (wt/wt) additive32 PS with 0.5 % (wt/wt) additive32 PS with 0.5 % (wt/wt) additive32
PMMA (100 % wt/wt)35 PMMA with 0.5 % (wt/wt) additive35 PMMA with 0.5 % (wt/wt) additive35
PMMA with 0.5 % (wt/wt) additive35 PMMA with 0.5 % (wt/wt) additive35
the outputs of the network (carbonyl index, hydroxyl index, polyene index, and viscosity average molecular weight). Several methods to simplify the database are available in the literature, for example principal component analysis (PCA), curvilinear methods, or the Gram-Schmidt orthogonalization method. In the present study, the most significant descriptors were selected with the method described by several authors.40–42 It takes place in several stages: (1) The minimum and maximum are calculated for each descriptor using STATISTICA software, and descriptors whose maximum and minimum are equal are removed. (2) Descriptors that have the same value for more than 75 % of the samples are eliminated. (3) The standard deviation (SD) is calculated for each descriptor, and those with SD values less than 0.05 are eliminated. (4) In this stage, Matlab software is used to obtain a correlation matrix between the outputs and the retained descriptors. The descriptors are ranked in decreasing order of their correlation coefficient. The descriptor with the highest correlation is taken and compared to the other descriptors in the matrix; those whose correlation coefficient with it is greater than 0.75 are eliminated in turn. The same is repeated with the descriptor ranked just after the first, and so on. The number of descriptors obtained after this selection was 107 for ICO, 102 for IOH, 67 for IPO, and 107 for Mv. (5) A program based on the stepwise method is used to select the most relevant descriptors from those obtained previously. Finally, the number of descriptors (Table 2) obtained after stepwise selection was 5 for ICO, 5 for IOH, 3 for IPO, and 4 for Mv. The relevant descriptors, together with the duration of exposure (in hours), were used to develop the QSPR prediction model.

2.4 Model development

Descriptors obtained after feature pre-screening were used to develop predictive models. Of the many model development approaches in common use, two were applied here to build the QSPR prediction models.

2.4.1 Linear model

The linear model was developed by applying Multiple Linear Regression (MLR). MLR is one of the most widely used
Table 2 – List of descriptors obtained after stepwise selection and duration of exposure
Columns: Descriptor; Description; Category; VIF; t-Test

Predicted parameter: Carbonyl index (ICO)
MATS2m; Moran autocorrelation – lag 2 / weighted by mass; Autocorrelation descriptors; 1.37; –11.96
minHBd; Minimum E-States for (strong) Hydrogen Bond donors; E-state descriptors; 1.20; –5.62
MATS2s; Moran autocorrelation – lag 2 / weighted by I-state; Autocorrelation descriptors; 1.44; –4.72
VE3_Dzm; Logarithmic coefficient sum of the last eigenvector from Barysz matrix / weighted by mass; Topological descriptors; 1.34; 5.06
MATS7i; Moran autocorrelation – lag 7 / weighted by first ionization potential; Autocorrelation descriptors; 1.44; 3.65
t(h); duration of exposure; duration of exposure; 1.01; 19.97

Predicted parameter: Hydroxyl index (IOH)
TDB5i; 3D topological distance based autocorrelation – lag 5 / weighted by first ionization potential; 3D Autocorrelation descriptors; 1.38; –12.32
geomShape; Petitjean geometric shape index; Geometrical descriptors; 1.41; –9.69
minHBd; Minimum E-States for (strong) Hydrogen Bond donors; E-state descriptors; 1.04; –5.82
ATSC7m; Centered Broto-Moreau autocorrelation – lag 7 / weighted by mass; Autocorrelation descriptors; 1.53; 5.18
P2i; 2nd component shape directional WHIM index / weighted by relative first ionization potential; WHIM descriptors; 1.28; –3.85
t(h); duration of exposure; duration of exposure; 1.01; 17.9

Predicted parameter: Polyene index (IPO)
SpMin2_Bhv; Smallest absolute eigenvalue of Burden modified matrix – n 2 / weighted by relative van der Waals volumes; Burden descriptors; 2.13; –9.33
VE3_Dzi; Logarithmic coefficient sum of the last eigenvector from Barysz matrix / weighted by first ionization potential; Topological descriptors; 1.11; 6.70
VR3_Dzv; Logarithmic Randic-like eigenvector-based index from Barysz matrix / weighted by van der Waals volumes; Topological descriptors; 2.08; –7.01
t(h); duration of exposure; duration of exposure; 1.02; 17.03

Predicted parameter: Viscosity average molecular weight (Mv)
RDF25p; Radial distribution function – 025 / weighted by relative polarizabilities; RDF descriptors 3D; 1.12; 21.56
TDB4i; 3D topological distance based autocorrelation – lag 4 / weighted by first ionization potential; Autocorrelation descriptors; 1.03; –4.99
minHBint8; Minimum E-State descriptors of strength for potential Hydrogen Bonds of path length 8; E-state descriptors; 1.12; 4.71
ATSC3c; Centred Broto-Moreau autocorrelation – lag 3 / weighted by charges; Autocorrelation descriptors; 1.01; 2.67
t(h); duration of exposure; duration of exposure; 1.00; –16.19
and known modelling methods, and is used as the basis for a number of multivariate methods.42 MLR is commonly used in QSPR because of its simplicity, transparency, reproducibility, and ease of interpretation. MLR establishes a quantitative relationship between a group of variables Xi (descriptors) and a response Y, as shown in Eq. (1):

$$Y = a_0 + \sum_{i=1}^{n} a_i X_i \qquad (1)$$

where Y is the response or dependent variable (output), Xi are the molecular descriptors (inputs), a0 is a constant (intercept), and ai are the regression coefficients of the n descriptors. MLR calculations were performed using STATISTICA v. 8.0 (StatSoft, Inc.) and XLSTAT software.

2.4.2 Nonlinear model

The nonlinear model was then developed by submitting the relevant descriptors to a statistical learning method, the Artificial Neural Network (ANN). ANNs are particularly well suited to QSPR/QSAR models because of their capability to extract nonlinear information from the data set.43 The multilayer perceptron (MLP-ANN) is considered the easiest and most commonly used ANN type in the literature.42 The architecture of an MLP-ANN consists of an input layer encompassing the inputs, one or more hidden (intermediate) layers, and an output layer including the outputs. The layers are connected to each other linearly by the weights corresponding to the neurons in the neighbouring layers upstream and downstream. In
this work, the tangent sigmoid (tansig), the log-sigmoid (logsig), and the exponential function were used as transfer functions for the hidden layer, while the exponential function and the linear function (purelin) were used as transfer functions for the output layer. The number of hidden neurons was optimized (from 3 to 30) by a trial-and-error procedure during training. One output neuron was used to represent the experimental values of ICO, IOH, IPO, and Mv. The network was trained using the quasi-Newton BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm.

To assess the contribution of each parameter to the explanation of the dependent variable Y in the MLR models, we used Student's t-test of significance for each parameter. With this statistic, it is possible to test one by one whether each parameter of the multiple linear regression model is zero, and to build confidence intervals on these parameters, which is very useful during the interpretation phase of the model.52–53
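The parameter significance test described above can be illustrated with a minimal sketch. It assumes the usual ordinary least squares definitions (coefficient divided by its standard error, with n − p residual degrees of freedom); the function names and the small data set are hypothetical and are not taken from the study, which used STATISTICA and XLSTAT.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mlr_t_statistics(design, y):
    """Fit Y = a0 + sum(ai * Xi) by least squares and return (coefficients, t values).

    `design` must already contain a leading column of ones for the intercept a0.
    """
    n, p = len(design), len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xtx_inv = invert(xtx)
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    coef = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [yi - sum(c * x for c, x in zip(coef, r)) for r, yi in zip(design, y)]
    sigma2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    std_err = [math.sqrt(sigma2 * xtx_inv[j][j]) for j in range(p)]
    return coef, [c / s for c, s in zip(coef, std_err)]


# Hypothetical data: y depends strongly on x1 and hardly at all on x2.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
noise = [0.1, -0.1, -0.1, 0.1, 0.05, -0.05]
design = [[1.0, a, b] for a, b in zip(x1, x2)]
y = [2.0 + 3.0 * a + e for a, e in zip(x1, noise)]

coef, t_values = mlr_t_statistics(design, y)
print(coef, t_values)
```

A large absolute t value, as for x1 here, indicates a parameter that is unlikely to be zero; the t-Test column of Table 2 is read in the same way.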
[Figure 2: four scatter panels, (a) to (d), plotting the values calculated by the ANN against the experimental values for the training set and the test set, together with a perfect-fit line. R2 values printed in the panels: 0.9917, 0.9940, 0.9939, and 0.9948. The Mv axes are scaled by 10^5.]
Fig. 2 – Plot of observed vs. predicted: (a) ICO, (b) IOH, (c) IPO, and (d) Mv values from the ANN model
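The agreement between observed and predicted values can be quantified with the validation statistics named in the abstract. The sketch below uses common textbook definitions (coefficient of determination, RMSE, Q2pred relative to the training-set mean, and the slopes k and k′ of regressions through the origin); these definitions are assumptions, since the corresponding equations are not reproduced in this section, and the numbers are made-up illustrative values.

```python
import math


def r_squared(y_obs, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def rmse(y_obs, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


def q2_pred(y_test, y_test_pred, y_train_mean):
    """External Q2: test-set errors relative to deviations from the training mean."""
    ss_res = sum((o - p) ** 2 for o, p in zip(y_test, y_test_pred))
    ss_ref = sum((o - y_train_mean) ** 2 for o in y_test)
    return 1.0 - ss_res / ss_ref


def origin_slopes(y_obs, y_pred):
    """Slopes k (observed vs. predicted) and k' (predicted vs. observed) through the origin."""
    cross = sum(o * p for o, p in zip(y_obs, y_pred))
    k = cross / sum(p * p for p in y_pred)
    k_prime = cross / sum(o * o for o in y_obs)
    return k, k_prime


# Hypothetical observed and predicted index values (not data from the study).
y_obs = [0.05, 0.10, 0.20, 0.30]
y_pred = [0.06, 0.09, 0.21, 0.29]

r2 = r_squared(y_obs, y_pred)
err = rmse(y_obs, y_pred)
k, k_prime = origin_slopes(y_obs, y_pred)
q2 = q2_pred(y_obs, y_pred, y_train_mean=0.15)
print(r2, err, k, k_prime, q2)
```

Under these definitions, a model would be judged against thresholds such as 0.85 < k < 1.15 and 0.85 < k′ < 1.15.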
Threshold value50,56: > 0.6; > 0.5; > 0.6; > 0.5; > 0.5; < 0.2; 0.85 < k < 1.15; 0.85 < k′ < 1.15
[Figure 3: four bar-chart panels showing the relative importance, RI / %, of each input (the selected descriptors and t(h)) in the ANN models.]
Fig. 3 – Plot of the relative importance of the descriptors for ANN models: carbonyl index, hydroxyl index, polyene index, and
viscosity average molecular weight
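A relative importance of this kind can be derived from the connection weights of a trained network. The sketch below follows one common formulation of Garson's weight method for a single hidden layer and a single output; whether it matches the exact variant used for Fig. 3 is an assumption, and the weights shown are hypothetical.

```python
def garson_relative_importance(w_input_hidden, w_hidden_output):
    """Relative importance (%) of each input from absolute connection weights.

    w_input_hidden[i][j] is the weight from input i to hidden neuron j;
    w_hidden_output[j] is the weight from hidden neuron j to the output.
    """
    n_inputs = len(w_input_hidden)
    n_hidden = len(w_hidden_output)
    contribution = [0.0] * n_inputs
    for j in range(n_hidden):
        # Share of each input in hidden neuron j, scaled by that neuron's output weight.
        column_total = sum(abs(w_input_hidden[i][j]) for i in range(n_inputs))
        for i in range(n_inputs):
            share = abs(w_input_hidden[i][j]) / column_total
            contribution[i] += share * abs(w_hidden_output[j])
    total = sum(contribution)
    return [100.0 * c / total for c in contribution]


# Hypothetical weights for a network with 2 inputs and 2 hidden neurons.
weights_ih = [[1.0, -2.0],
              [3.0, 2.0]]
weights_ho = [1.0, -1.0]

ri = garson_relative_importance(weights_ih, weights_ho)
retained = [i for i, value in enumerate(ri) if value > 6.0]  # the 6 % cut-off of test ε2
print(ri, retained)
```

Because only absolute values enter the calculation, the result reflects the magnitude of an input's influence, not its direction.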
[Figure 4: four Williams plot panels, (a) to (d), of standardized residual versus leverage for the training set and the test set, with the critical leverage (h*) and the limits of the normal values (+3 and −3) marked. h* values printed in the panels: 0.2059, 0.2035, 0.1579, and 0.1923.]
Fig. 4 – Projection of the training set and the test set in the Williams plot for ANN: (a) ICO, (b) IOH, (c) IPO, and (d) Mv
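The quantities plotted in a Williams plot can be sketched as follows. The example assumes the commonly used definitions: leverage taken from the hat matrix of the training descriptors (with an intercept column) and a critical leverage h* = 3(p + 1)/n for p descriptors and n training samples. The data and function names are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(x_train, x_query):
    """Leverage h = x^T (X^T X)^-1 x of each query sample, with X built from x_train."""
    design = [[1.0] + list(row) for row in x_train]  # intercept column added (assumption)
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xtx_inv = invert(xtx)
    result = []
    for row in x_query:
        q = [1.0] + list(row)
        result.append(sum(q[i] * xtx_inv[i][j] * q[j]
                          for i in range(p) for j in range(p)))
    return result


def critical_leverage(n_train, n_descriptors):
    """h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train


def inside_domain(h, standardized_residual, h_star):
    """True when a sample lies below h* and within +/- 3 standardized residuals."""
    return h <= h_star and abs(standardized_residual) <= 3.0


# Hypothetical one-descriptor training set and one far-away query sample.
x_train = [[0.0], [1.0], [2.0], [3.0]]
h_train = leverages(x_train, x_train)
h_star = critical_leverage(n_train=len(x_train), n_descriptors=1)
h_new = leverages(x_train, [[10.0]])[0]
print(h_train, h_star, h_new, inside_domain(h_new, 0.5, h_star))
```

A sample with leverage above h* is a structural outlier, while a standardized residual beyond ±3 marks a response outlier.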
Table 4 – External validation of the ANN and MLR models
of the test set is underestimated. These eleven response outliers (3 for ICO, 4 for IOH, 1 for IPO, and 3 for Mv) could be associated with errors in the experimental values. The results of the ANN models comply with the third principle of the Organization for Economic Cooperation and Development (OECD).

of the statistical parameters for the ANN model can be noted. In conclusion, the ANN model developed in this study meets all OECD principles for QSPR validation and can be used to predict ICO, IOH, IPO, and Mv of polymers, particularly those that have not been tested, and thus help reduce the experimental determination of these indices, which involves costly experimental studies.
laires. Modélisation QSAR (Doctoral dissertation) (2015).
53. S. Chtita, Modélisation de molécules organiques hétérocycliques biologiquement actives par des méthodes QSAR/QSPR. Recherche de nouveaux médicaments (Doctoral dissertation) (2017).
54. G. D. Garson, Interpreting neural network connection weights, Artif. Intell. Expert. 6 (1991) 46–51.
55. M. Gevrey, I. Dimopoulos, S. Lek, Review and comparison of methods to study the contribution of variables in artificial neural network models, Ecol. Model. 160 (2003) 249–264, doi: https://doi.org/10.1016/S0304-3800(02)00257-0.
56. S. Bitam, M. Hamadache, S. Hanini, QSAR model for prediction of the therapeutic potency of N-benzylpiperidine derivatives as AChE inhibitors, SAR. QSAR. Environ. Res. 28 (2017) 471–489, doi: https://doi.org/10.1080/1062936X.2017.1331467.
SUMMARY
QSPR studies of carbonyl, hydroxyl, and polyene indices and average molecular weight of polymers under photostabilization using the ANN and MLR approach
Hadjira Maouz,a* Latifa Khaouane,a Salah Hanini,a Yamina Ammi,a,b
Mabrouk Hamadache,a and Maamar Laidi a
One of the main disadvantages of using synthetic or semi-synthetic polymeric materials is their degradation and aging. The purpose of this study is to apply artificial neural networks (ANN) and multiple linear regression (MLR) to predict the carbonyl, hydroxyl, and polyene indices (ICO, IOH, and IPO) and the viscosity average molecular mass (Mv) of poly(vinyl chloride), polystyrene, and poly(methyl methacrylate). These physicochemical properties are considered important in the study of the photostabilization of polymers. The structure of each polymer studied is represented by five repeating monomer units. Quantitative structure-property relationship (QSPR) models obtained by using relevant descriptors showed good predictability. To validate these models, internal validation {R2, RMSE, and Q2LOO}, external validation {R2, RMSE, Q2pred, rm2, Δrm2, k, and k′}, and the applicability domain were used. A comparison of the results shows that the ANN models perform better than the MLR models. Accordingly, the QSPR model developed in this study provides excellent predictions and can be used to predict ICO, IOH, IPO, and Mv of polymers, particularly those that have not been tested.
Keywords
QSPR, photostabilization, polymers, artificial neural network, multiple linear regression
a Laboratory of Biomaterials and Transport Phenomena (LBMPT), University of Médéa, Quartier Aïn d’Heb, 26000, Algeria
b University Center, Faculty of Science and Technology, Department of Process Engineering, Relizane, Algeria
Original scientific paper
Received 17 June 2019
Accepted 7 September 2019
Supplementary material
Table S1 (a) – Correlation matrix of the descriptors used for the carbonyl index
t(h) MATS2m minHBd MATS2s VE3_Dzm MATS7i
t(h) 1 –0.056 –0.015 0.053 0.041 –0.038
MATS2m –0.056 1 0.014 –0.447 0.176 –0.200
minHBd –0.015 0.014 1 0.254 –0.029 0.297
MATS2s 0.053 –0.447 0.254 1 0.072 0.187
VE3_Dzm 0.041 0.176 –0.029 0.072 1 –0.441
MATS7i –0.038 –0.200 0.297 0.187 –0.441 1
Table S1 (b) – Correlation matrix of the descriptors used for the hydroxyl index
t(h) TDB5i geomShape minHBd ATSC7m P2i
t(h) 1 0.062 0.034 –0.016 0.080 0.015
TDB5i 0.062 1 –0.198 0.093 0.396 0.063
geomShape 0.034 –0.198 1 –0.091 0.205 –0.333
minHBd –0.016 0.093 –0.091 1 –0.108 0.039
ATSC7m 0.080 0.396 0.205 –0.108 1 0.213
P2i 0.015 0.063 –0.333 0.039 0.213 1
Table S1 (c) – Correlation matrix of the descriptors used for the polyene index
t(h) SpMin2_Bhv VE3_Dzi VR3_Dzv
t(h) 1 0.097 0.061 0.109
SpMin2_Bhv 0.097 1 0.306 0.717
VE3_Dzi 0.061 0.306 1 0.263
VR3_Dzv 0.109 0.717 0.263 1
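A correlation matrix such as Table S1 (c) is sufficient to compute variance inflation factors, because the VIF of a descriptor equals the corresponding diagonal element of the inverse correlation matrix. The sketch below applies this identity to the values tabulated above; the helper function is illustrative, and the results should land close to the VIF column of Table 2, with small differences to be expected from the three-decimal rounding of the correlations.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


names = ["t(h)", "SpMin2_Bhv", "VE3_Dzi", "VR3_Dzv"]

# Correlation matrix copied from Table S1 (c).
corr = [
    [1.000, 0.097, 0.061, 0.109],
    [0.097, 1.000, 0.306, 0.717],
    [0.061, 0.306, 1.000, 0.263],
    [0.109, 0.717, 0.263, 1.000],
]

inverse = invert(corr)
vif = {name: inverse[i][i] for i, name in enumerate(names)}
for name, value in vif.items():
    print(f"{name}: VIF = {value:.2f}")
```

The two descriptors with the largest mutual correlation (0.717) also carry the largest VIF values, which is the expected pattern.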
Table S1 (d) – Correlation matrix of the descriptors used for the viscosity average molecular weight
t(h) RDF25p TDB4i minHBint8 ATSC3c
t(h) 1 –0.007 0.009 –0.005 0.000
RDF25p –0.007 1 0.069 0.313 –0.016
TDB4i 0.009 0.069 1 –0.063 –0.119
minHBint8 –0.005 0.313 –0.063 1 0.004
ATSC3c 0.000 –0.016 –0.119 0.004 1
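All off-diagonal correlations in Tables S1 (a) to (d) are below 0.75 in absolute value, which is consistent with the pre-screening of Section 2.3. Stages (1) to (4) of that procedure can be sketched as below. This is a minimal illustration under stated assumptions: the sample standard deviation is used, correlations are compared by absolute value, and the function names and data are hypothetical (the study itself used STATISTICA and Matlab).

```python
import math
from collections import Counter
from statistics import mean, stdev


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def prescreen(descriptors, target, same_fraction=0.75, min_sd=0.05, max_r=0.75):
    """Return the names of the descriptors that survive stages (1) to (4)."""
    kept = {}
    for name, values in descriptors.items():
        if min(values) == max(values):  # stage 1: constant descriptor
            continue
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count > same_fraction * len(values):  # stage 2
            continue
        if stdev(values) < min_sd:  # stage 3: nearly constant descriptor
            continue
        kept[name] = values
    # Stage 4: rank by correlation with the output, then drop any descriptor
    # that is too strongly correlated with a higher-ranked retained descriptor.
    ranked = sorted(kept, key=lambda n: abs(pearson(kept[n], target)), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(kept[name], kept[s])) <= max_r for s in selected):
            selected.append(name)
    return selected


# Hypothetical descriptor table with 8 samples.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
descriptors = {
    "const": [2.0] * 8,
    "mostly_same": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "low_sd": [1.00, 1.01, 1.02, 1.00, 1.01, 1.02, 1.00, 1.01],
    "a": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0],
    "a_copy": [2.1, 4.1, 6.0, 8.2, 9.9, 12.1, 14.0, 16.2],
    "b": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}

selected = prescreen(descriptors, target)
print(selected)
```

Stage (5), the stepwise selection, is not covered by this sketch.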