Entropy — Article

Causality Analysis with Information Geometry: A Comparison

Heng Jie Choong 1,*, Eun-jin Kim 1 and Fei He 2

1 Centre for Fluid and Complex Systems, Coventry University, Coventry CV1 5FB, UK; [email protected]
2 Centre for Computational Science and Mathematical Modelling, Coventry University, Coventry CV1 5FB, UK; [email protected]
* Correspondence: [email protected]
Abstract: The quantification of causality is vital for understanding various important phenomena
in nature and laboratories, such as brain networks, environmental dynamics, and pathologies. The
two most widely used methods for measuring causality are Granger Causality (GC) and Transfer
Entropy (TE), which rely on measuring the improvement in the prediction of one process based
on the knowledge of another process at an earlier time. However, they have their own limitations,
e.g., in applications to nonlinear, non-stationary data, or non-parametric models. In this study, we
propose an alternative approach to quantify causality through information geometry that overcomes
such limitations. Specifically, based on the information rate that measures the rate of change of the
time-dependent distribution, we develop a model-free approach called information rate causality
that captures the occurrence of the causality based on the change in the distribution of one process
caused by another. This measurement is suitable for analyzing numerically generated non-stationary,
nonlinear data. The latter are generated by simulating different types of discrete autoregressive
models which contain linear and nonlinear interactions in unidirectional and bidirectional time-series
signals. Our results show that information rate causality can capture the coupling of both linear and
nonlinear data better than GC and TE in the several examples explored in the paper.
Keywords: causality; information geometry; Transfer Entropy; Granger Causality; information causal
rate; signal processing; nonlinear models; non-stationary; probability distribution
Citation: Choong, H.J.; Kim, E.-j.; He, F. Causality Analysis with Information Geometry: A Comparison. Entropy 2023, 25, 806. https://doi.org/10.3390/e25050806
Academic Editor: Hector Zenil
Received: 6 March 2023; Revised: 12 May 2023; Accepted: 13 May 2023; Published: 16 May 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
The study of causality aims to explore cause-and-effect relationships between processes. In general, causality can be investigated in two categories: causal discovery and
causal analysis. Causal discovery shows the inherent causal relationship within the dataset
by analyzing and creating causal models based on graph theory. The construction of these
models can be achieved via algorithmic dynamics and evaluating the algorithms against
the observed data [1–3]. Generally, the complexity of the models’ networks is characterized
and measured using graph-theoretic measures. A recent approach using classical information theory, such as Shannon entropy, has been considered as an alternative measurement,
but its high dependence on the distribution can lead to spurious disparate values for the
same complexity, as shown in [4].
Alternatively, causal analysis, which is the focus of this paper, studies the potential
changes in a system caused by another. In practical terms, causal discovery requires more
detailed background information about the provided dataset to accurately conduct the
study. However, the causal analysis does not require detailed underlying knowledge of the
connectivity embedded within the dataset [3,5]. Among the various analysis methods used
to quantify causality, Granger causality (GC) and Transfer Entropy (TE) are extensively
used in various fields, including economics, climate, education, meteorology, neuroscience,
and more [5–16]. For instance, Rebecca M. Pullon et al. utilized GC to demonstrate the
decrease in cortical information flow throughout the brain as the subject loses responsiveness [10]; Yunyuan Gao et al. used TE to investigate the coupling strength of the brain
between the motor and sensory areas [15]. Both of these methods are derived from the
same notion of examining the reliability of a present process towards the past process(es).
Despite sharing the same notion, their approaches to quantifying causality are completely
different. GC evaluates the dependency based on signal models, which means modelling
the signals incorrectly can lead to the wrong conclusion about the causality [11,17,18].
On the other hand, TE is model-free and employs a statistical dependency approach to assess causality. However, it is sensitive to how the probability distribution is estimated, such as the number of bins and the high dimensionality of the distribution due to the number of lags [19,20].
More specifically, inspired by Norbert Wiener, Clive Granger introduced GC [6,21],
which suggested that the causality of two stochastic processes (x1 (t) and x2 (t)) exists
if and only if one of the processes is able to predict another. For instance, if including
the past observation of x1 (t) along with x2 (t) improves the prediction of the current
state of x2 (t) compared to using only past information of x2 (t), then this would suggest
that x1 (t) is Granger-causing x2 (t). This predictivity is quantitatively studied by comparing the errors of the autoregressive model (linear regressive model) that is used in
modelling the signals [6,18]. As natural time series such as electroencephalogram (EEG)
often exhibit oscillatory aspects, it is interesting to study the spectrum of the GC. John
Geweke has discussed the evaluation of GC in the frequency domain instead of the time domain [17,22,23]. Additionally, the time-varying frequency domain of the GC that is studied
in [18,24,25] is able to show both the spectrum of the GC and the period during which causation occurs. The frequency-domain GC is the spectrum of GC, which shows the amplitude of GC associated with each frequency over the entire signals, while the time-varying frequency-domain GC shows when the coupling of the signals occurs alongside the spectrum of GC.
When utilizing the GC, it is necessary to model the provided signals using a linear
autoregressive model [17,18]. However, this linear model cannot accommodate possible
nonlinear and non-stationary effects within real-world signals, such as the brain EEG
signals [26,27]. Therefore, instead of studying the causality through deterministic or parametric linear models, causality can be statistically quantified using information-theoretic
measures like TE [19]. For two stochastic processes, TE measures the additional information required from one process to achieve the current realization of the other
process [28]. Therefore, TE is not limited to any specific model or linearity assumption.
Although TE uses a different approach in evaluating the causality, it can yield similar
results to GC for a Gaussian process, as discussed analytically by Lionel Barnett et al. [29].
Alternatively, the causality can be investigated by examining the changes in the
probability distribution that are caused by the inclusion of the other variables [30]. In a
statistical space, the change in distribution is known as distance and can be quantified via
Kullback–Leibler (KL) divergence. However, KL divergence cannot be well-defined as a
distance, as it is not symmetrical and not path-dependent. Hence, information rate and
information length are used with the intention to have the symmetrical and path-dependent
measurement of the distance [30–33]. Information rate quantifies the rate of the temporal
change in distributions, while information length measures the total statistical change
in the process along the time [30,32]. Refs. [30,31] proposed an information-geometric
causality measure, the so-called information rate causality based on the information rate,
by quantifying causality through the effects of the information rate (see also [34]) and
applied it to a solvable, linear Gaussian process (Kramer model), which has an exact time-dependent distribution.
In this paper, we will further develop this information rate causality for analyzing the
numerically generated time-series signals that are in general time-varying and nonlinear
and compare it with the GC and TE. Specifically, we will generate the signals by simulating
the (discrete) autoregressive equations which contain unidirectional or bidirectional interactions. For the GC, both the frequency-domain and the time-varying frequency-domain GC will be presented. For the TE, the signals are evaluated as a whole at different lags. Next, the net TE is calculated by taking the difference in the TE between the two directions. For instance, T^{net}_{1,2} = T_{1→2} − T_{2→1} represents the net TE of signals 1 and 2. The net TE enables us to identify the appropriate
lag(s) that yield significant results in TE. Using the knowledge of the appropriate lag(s)
from the net TE, the TE is calculated through the window sliding approach, with each
small window of the signals evaluating TE at the designated lag. Similarly, information
rate causality will be evaluated through the window sliding approach with the information
rate calculated within each window. In the scenario where the signals are coupled, it is
expected that there will be spikes in the information rate causality, indicating changes in
the distribution of one signal caused by another. Note that the methodology and concepts
of the analyses used in this paper are mainly from the previous works [17–19,28,30,32].
The remainder of this paper is organized as follows. In Section 2, we briefly review
the concepts for GC, TE, and information rate causality along with their implementation.
The three different types of autoregressive models—unidirectional, interchange unidirectional, and bidirectional models—are introduced and analyzed in Sections 3 to 5, respectively. Section 6 contains our conclusions. Appendices A–C contain some background
theory to make the paper self-contained. As a reference, the basic model of continuous
coupling is studied and analyzed in Appendix D.
2. Methodologies
2.1. Granger Causality
The general idea behind GC is to evaluate the dependency of one process on another
process [17,18]. This dependency is calculated by comparing the errors/noises of the
modelled signals through the autoregressive model. Consider two stochastic processes,
namely x1 (t) and x2 (t), which can be modelled using the information from their respective
signals (as shown in Equations (1) and (2)) or including some information from each other
(as shown in Equations (3) and (4)).
x_1(t) = \sum_{i=1}^{n} a_i x_1(t-i) + \epsilon_1(t),   (1)

x_2(t) = \sum_{i=1}^{n} c_i x_2(t-i) + \epsilon_2(t),   (2)

x_1(t) = \sum_{i=1}^{n} a_i^* x_1(t-i) + \sum_{i=1}^{n} b_i^* x_2(t-i) + \epsilon_1^*(t),   (3)

x_2(t) = \sum_{i=1}^{n} c_i^* x_1(t-i) + \sum_{i=1}^{n} d_i^* x_2(t-i) + \epsilon_2^*(t).   (4)
Here, {a_i, c_i, a_i^*, b_i^*, c_i^*, d_i^*} ∈ R are the constant parameters that represent the fractions of contribution from the past observations towards x1(t) and x2(t). ε_i, i ∈ [1, 2], is the uncorrelated external additive noise needed for modelling the processes, with variance Σ_ii, i ∈ [1, 2]. ε_i^*, i ∈ [1, 2], is the correlated noise with variance Σ_ij^*, (i, j) ∈ [1, 2], which can be represented by the covariance matrix \Sigma = \begin{pmatrix} \Sigma_{11}^* & \Sigma_{12}^* \\ \Sigma_{21}^* & \Sigma_{22}^* \end{pmatrix}.
The total interdependence or total causality index (F_{x1·x2}) between x1(t) and x2(t) is composed of two directional causalities (F_{x1→x2} and F_{x2→x1}) and one instantaneous causality (F_{x1↔x2}). These are defined in Equations (5) to (8).

F_{x_1 \cdot x_2} = \log \frac{\Sigma_{11}\Sigma_{22}}{|\Sigma|} = \log \frac{\Sigma_{11}\Sigma_{22}}{\Sigma_{11}^*\Sigma_{22}^* - (\Sigma_{12}^*)^2} = F_{x_1 \to x_2} + F_{x_2 \to x_1} + F_{x_1 \leftrightarrow x_2},   (5)

F_{x_1 \to x_2} = \log \frac{\Sigma_{22}}{\Sigma_{22}^*},   (6)

F_{x_2 \to x_1} = \log \frac{\Sigma_{11}}{\Sigma_{11}^*},   (7)

F_{x_1 \leftrightarrow x_2} = \log \frac{\Sigma_{11}^*\Sigma_{22}^*}{|\Sigma|}.   (8)
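As a concrete illustration of Equations (1)-(8), the following minimal sketch (our own illustration, not part of the original study; the function names and the default AR order are hypothetical) fits the restricted and full autoregressive models by ordinary least squares and converts the residual variances into the directional GC indices.

```python
import numpy as np

def residual_variance(target, regressors, n=2):
    """Least-squares AR fit of target(t) on lags 1..n of each regressor; returns residual variance."""
    T = len(target)
    X = np.column_stack([reg[n - i:T - i] for reg in regressors for i in range(1, n + 1)])
    y = target[n:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ coef)

def granger_time_domain(x1, x2, n=2):
    # Restricted models (Equations (1) and (2)): each signal regressed on its own past only.
    s11 = residual_variance(x1, [x1], n)
    s22 = residual_variance(x2, [x2], n)
    # Full models (Equations (3) and (4)): each signal regressed on both pasts.
    s11_star = residual_variance(x1, [x1, x2], n)
    s22_star = residual_variance(x2, [x2, x1], n)
    f_x1_to_x2 = np.log(s22 / s22_star)   # directional index of Equation (6)
    f_x2_to_x1 = np.log(s11 / s11_star)   # directional index of Equation (7)
    return f_x1_to_x2, f_x2_to_x1
```

Applied to coupled signals such as those of Equation (24) below, one would expect f_x1_to_x2 to become clearly positive once the coupling is switched on and to stay near zero in the uncoupled case.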
In conjunction, the spectrum of the GC can be evaluated in the frequency domain (ω = 2πf, f ≡ frequency) using the spectral matrix S(ω), the transfer matrix H(ω), and the covariance matrix Σ. They are related according to Equation (9).

S(\omega) = \begin{pmatrix} S_{11}(\omega) & S_{12}(\omega) \\ S_{21}(\omega) & S_{22}(\omega) \end{pmatrix} = \langle x(\omega) x^{\dagger}(\omega) \rangle = H(\omega)\,\Sigma\,H^{\dagger}(\omega).   (9)

Here, x(\omega) = \begin{pmatrix} x_1(\omega) \\ x_2(\omega) \end{pmatrix} and H(\omega) = \begin{pmatrix} H_{11}(\omega) & H_{12}(\omega) \\ H_{21}(\omega) & H_{22}(\omega) \end{pmatrix} = \begin{pmatrix} 1 - \sum_{i=1}^{n} a_i^* e^{-j\omega i} & -\sum_{i=1}^{n} b_i^* e^{-j\omega i} \\ -\sum_{i=1}^{n} c_i^* e^{-j\omega i} & 1 - \sum_{i=1}^{n} d_i^* e^{-j\omega i} \end{pmatrix}^{-1}; ⟨·⟩ is defined as an ensemble average and † denotes the complex conjugate transpose of a matrix.
Similar to the analysis in the time domain, the total interdependence in the frequency domain between x1(ω) and x2(ω) (I_{x1·x2}) also consists of two directional causalities (I_{x1→x2} and I_{x2→x1}) and one instantaneous causality (I_{x1↔x2}). They are expressed as Equations (10) to (13).

I_{x_1 \cdot x_2} = \log \frac{S_{11}(\omega) S_{22}(\omega)}{|S(\omega)|} = I_{x_1 \to x_2} + I_{x_2 \to x_1} + I_{x_1 \leftrightarrow x_2},   (10)

I_{x_1 \to x_2} = \log \frac{S_{22}(\omega)}{\hat{H}_{22}(\omega)\,\Sigma_{22}\,\overline{\hat{H}_{22}(\omega)}},   (11)

I_{x_2 \to x_1} = \log \frac{S_{11}(\omega)}{\tilde{H}_{11}(\omega)\,\Sigma_{11}\,\overline{\tilde{H}_{11}(\omega)}},   (12)

I_{x_1 \leftrightarrow x_2} = \log \frac{\big(\tilde{H}_{11}(\omega)\,\Sigma_{11}\,\overline{\tilde{H}_{11}(\omega)}\big)\big(\hat{H}_{22}(\omega)\,\Sigma_{22}\,\overline{\hat{H}_{22}(\omega)}\big)}{|S(\omega)|}.   (13)

Here, \hat{H}_{22}(\omega) = H_{22}(\omega) + \frac{\Sigma_{12}^*}{\Sigma_{22}^*} H_{21}(\omega) and \tilde{H}_{11}(\omega) = H_{11}(\omega) + \frac{\Sigma_{12}^*}{\Sigma_{11}^*} H_{12}(\omega); the overline denotes the complex conjugate of the element.
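Given the factorised quantities S(ω), H(ω), and Σ at one frequency, Equations (11) and (12) reduce to a few lines of arithmetic; the hedged sketch below (our own illustration; the function name and the 2x2 array layout are assumptions) shows the directional spectral GC at a single frequency.

```python
import numpy as np

def spectral_gc(S, H, Sigma):
    """Directional spectral GC of Equations (11) and (12) at one frequency.

    S, H  : 2x2 complex spectral and transfer matrices at this frequency.
    Sigma : 2x2 noise covariance matrix from the spectral factorisation."""
    H11_tilde = H[0, 0] + (Sigma[0, 1] / Sigma[0, 0]) * H[0, 1]   # H-tilde_11
    H22_hat = H[1, 1] + (Sigma[0, 1] / Sigma[1, 1]) * H[1, 0]     # H-hat_22
    i_x2_to_x1 = np.log(S[0, 0].real / (H11_tilde * Sigma[0, 0] * np.conj(H11_tilde)).real)
    i_x1_to_x2 = np.log(S[1, 1].real / (H22_hat * Sigma[1, 1] * np.conj(H22_hat)).real)
    return i_x1_to_x2, i_x2_to_x1
```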
In this paper, the frequency domain and the time-varying frequency domain of
GC will be calculated through the non-parametric method. The general idea of this
method is to decompose the spectral matrix S(ω) into the transfer matrix H (ω) and
the covariance of the noises Σ. This decomposition can be achieved using Wilson’s algorithm [18,24,25,35,36]. The spectral matrices for the frequency domain and the time-varying
frequency domain are expressed as Equations (14) and (15), respectively.
S(\omega) = \frac{\langle x_i(\omega)\,\overline{x_j(\omega)} \rangle}{T}, \quad i, j \in [1, 2],   (14)

S(t, \omega) = \frac{\langle s_i(t, \omega)\,\overline{s_j(t, \omega)} \rangle}{T}, \quad i, j \in [1, 2].   (15)

Here, T denotes the total period of the signal, x_n(\omega), n ∈ [i, j], is the discrete Fourier transform of the signal, and s_n(t, \omega), n ∈ [i, j], is the short-time Fourier transform of the signal x_n(t). Note that the overline denotes the complex conjugate. The Hann window function is used when evaluating the short-time Fourier transform. The window moves through the whole series with half of the data points overlapping. The general flow of the procedure is shown in Figure 1a.
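The trial-averaged spectral matrix of Equation (14) can be estimated directly from the Fourier transforms of the simulated trials, as in the hedged sketch below (our own illustration; the array layout, the names, and the normalisation by the number of samples are assumptions). The factorisation of S(ω) into H(ω) and Σ via Wilson's algorithm, and the Hann-windowed short-time version of Equation (15), are separate steps not shown here.

```python
import numpy as np

def spectral_matrix(trials, fs=200.0):
    """Trial-averaged spectral matrix S(w) of Equation (14).

    trials : array of shape (n_trials, 2, n_samples) holding x1 and x2 for every trial.
    Returns the frequency axis and S with shape (n_freq, 2, 2)."""
    n_trials, n_signals, n_samples = trials.shape
    X = np.fft.rfft(trials, axis=-1)                    # discrete Fourier transform per signal
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # S_ij(w) = < x_i(w) conj(x_j(w)) > / T, averaged over trials.
    S = np.einsum('tif,tjf->fij', X, np.conj(X)) / (n_trials * n_samples)
    return freqs, S
```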
Figure 1. The procedure of implementing the causality analyses used in this paper; each colour of the lines in the image represents a different simulation. In this study, each window contains 0.5 s of data points and overlaps with the previous window by 0.25 s. The causality analyses are conducted within each window. (a) illustrates the components (S(...), H(...)) calculated within the windows for the non-parametric GC analysis. Refer to Section 2.1 for the corresponding components. (b) shows the estimation of distributions p(...) for TE evaluation. The distributions p(...) are estimated based on the samples x_n^{(1)}, x_{n+1}, and y_n^{(1)}. Refer to Section 2.2 for the definition of TE. (c) demonstrates the evaluation of information rate causality. Each window (labelled as 1st window) is divided into two windows, with the first half labelled as 2nd window. The distribution p(x(t), y(t_i)) estimates the evolution of the distribution of x(t) while y(t_i) is fixed at the 2nd window. Refer to Section 2.3 for the definition of information rate causality.
2.2. Transfer Entropy
TE is a model-free approach used to calculate causality by evaluating the dependency
between processes. The general expression of TE is given by Equation (17), which measures
the additional information required for the realization of one state of the process (x1 (t))
depending on the past states of the processes (e.g., x1 (t − 1) and x2 (t − 1)) [19,28,37].
TE_{x_2(t) \to x_1(t)} = \sum_{x \in x_1(t),\, y \in x_2(t)} p(x_{n+1}, x_n^{(k)}, y_n^{(l)}) \log_2 \frac{p(x_{n+1} \mid x_n^{(k)}, y_n^{(l)})}{p(x_{n+1} \mid x_n^{(k)})}   (16)

= \sum_{x \in x_1(t),\, y \in x_2(t)} p(x_{n+1}, x_n^{(k)}, y_n^{(l)}) \log_2 \frac{p(x_{n+1}, x_n^{(k)}, y_n^{(l)})\, p(x_n^{(k)})}{p(x_{n+1}, x_n^{(k)})\, p(x_n^{(k)}, y_n^{(l)})}.   (17)
Here, x_{n+1} represents the state of x1(t) at the (n+1)-th moment, x_n^{(k)} represents the state at the n-th moment of x1(t) consisting of k previous states of x1(t), and y_n^{(l)} represents the state at the n-th moment of x2(t) consisting of l previous states. Note that the choices of the previous states k and l are arbitrary, depending on the interest of the study. For instance, in a collectively observed stochastic process X = (x_1, x_2, ..., x_10), x_8^{(2)} = (x_n, x_m) with n, m ∈ [1, 8], e.g., (x_1, x_2) or (x_3, x_8), etc. Thus, TE quantifies the additional information needed for the realization of x1(t) at (n+1) from x2(t) under the assumption that x1(t) is independent of x2(t). If x2(t) has no impact on the future evolution of x1(t), then p(x_{n+1} | x_n^{(k)}, y_n^{(l)}) = p(x_{n+1} | x_n^{(k)}) and hence TE_{x_2(t) \to x_1(t)} = 0.
For the TE calculation, the probability distribution p(...) will be estimated via a histogram with a bin size of 5 per dimension; this bin size is determined by the cubic root of Rice's rule to accommodate the 3-dimensional joint probability within TE (p(x_{n+1}, x_n^{(1)}, y_n^{(1)})). Using a larger number of bins would not properly depict the distribution and would bring spurious results when calculating the TE. To simplify the analysis, the number
of past values of k and l is set to 1 (k = l = 1) in this paper. Two different evaluations
will be conducted for TE. First, assuming that the processes are stationary, the net TE
(TEx1 (t),x2 (t) = TEx1 (t)→ x2 (t) − TEx2 (t)→ x1 (t) ) is evaluated on the entire processes at different lags. This will enable us to choose the appropriate past value(s)/lag(s) in evaluating the
TE based on the significance of the net TE. Second, using the knowledge of the appropriate
lag from the net TE, the TE is next evaluated via a sliding window approach. This sliding
window approach is used to calculate the TE at each instance of the interested window.
In this sliding window approach, a small number of data points of the stochastic processes
is sampled within the window and the TE is calculated. These calculations are sketched in
Figure 1b.
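For concreteness, a minimal histogram-based sketch of Equation (17) with k = l = 1 is given below (our own illustration; the function name and the requirement lag ≥ 1 are assumptions, and the fixed bin count of 5 per axis follows the choice described above).

```python
import numpy as np

def transfer_entropy(x, y, lag=1, bins=5):
    """Histogram (plug-in) estimate of TE_{y -> x} as in Equation (17), with k = l = 1.

    lag >= 1 is the shift between the predicted sample x_{n+1} and the past states x_n, y_n."""
    x_next, x_past, y_past = x[lag:], x[:-lag], y[:-lag]
    joint, _ = np.histogramdd((x_next, x_past, y_past), bins=bins)
    p_xyz = joint / joint.sum()                  # p(x_{n+1}, x_n, y_n)
    p_next_past = p_xyz.sum(axis=2)              # p(x_{n+1}, x_n)
    p_past_src = p_xyz.sum(axis=0)               # p(x_n, y_n)
    p_past = p_xyz.sum(axis=(0, 2))              # p(x_n)
    te = 0.0
    for i, j, m in zip(*np.nonzero(p_xyz)):
        num = p_xyz[i, j, m] * p_past[j]
        den = p_next_past[i, j] * p_past_src[j, m]
        if den > 0:
            te += p_xyz[i, j, m] * np.log2(num / den)
    return te
```

With this convention, the net TE of signals 1 and 2 would be transfer_entropy(x2, x1, lag) − transfer_entropy(x1, x2, lag), and the window sliding TE applies the same estimator to the samples pooled within each window.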
2.3. Information Rate Causality
In a dimensionless statistical manifold, the distance between two probability distributions is defined by their statistical difference. One commonly used measure of this
difference is the KL divergence/relative entropy, but this measure is not symmetrical and
not path-dependent. For the time-dependent probability distribution, the changes in the
distribution along with time can be measured through the information rate and information
length. The information rate (Γ(t)) quantifies the rate of change of the distribution, while
information length (L) measures the total change of the distribution. These measures are
defined and expressed as Equations (18) and (19) [30–32].
\Gamma^2(t) = \int dx\, p(x, t)\,[\partial_t \ln p(x, t)]^2 = 4 \int dx\, \big[\partial_t \sqrt{p(x, t)}\big]^2,   (18)

L = \int dt\, \Gamma(t).   (19)

Here p(x, t) is the probability distribution of a stochastic process x at the time t.
For studying causality, we consider two stochastic processes (x1 (t) and x2 (t)) and the
information rate for a joint probability distribution is evaluated. The causality between
these processes can be quantified by the changes in the information rate (statistical changes
or the changes in distribution) while having another process remain static. For instance,
the information rate of x2 (t) causing x1 (t) is defined as Equation (21) and called information
rate causality in this paper.
\Gamma^2(t)_{x_2(t) \to x_1(t)} = \int dx_1\, dx_2\, p(x_1(t), x_2(t_i))\,[\partial_t \ln p(x_1(t), x_2(t_i))]^2   (20)

= 4 \int dx_1\, dx_2\, \big[\partial_t \sqrt{p(x_1(t), x_2(t_i))}\big]^2.   (21)
Referring to Figure 1c, here ti is denoted as the reference time that remains static in the
process. Therefore, p( x1 (t), x2 (ti )) is a probability distribution that is sampled by having
the process x2 (t) remain static at time ti while the process x1 (t) evolves along the time
t. Since the information rate can only quantify changes in the distribution, the information rate causality is evaluated within each window of interest instead of over the whole time series, in order to determine whether the coupling of the processes persists.
To accommodate the calculation, a discretized version of Equation (21) is used and it is
expressed as Equation (22).
\Gamma^2(t)_{x_2 \to x_1} = 4 \sum_{x_1 \in X_1,\, x_2 \in X_2} \frac{\Delta x_1 \Delta x_2}{[\Delta t]^2} \Big[\sqrt{p(x_1(t+\Delta t), x_2(t_i))} - \sqrt{p(x_1(t), x_2(t_i))}\Big]^2.   (22)
For calculating the information rate, the joint probability distribution for the two processes is estimated via a histogram with the bin size (b) determined by the square root of Rice's rule, as expressed in Equation (23). Rice's rule is used because it appropriately determines the bin size for sampling the obtained data [38,39] for a 1-dimensional distribution. Hence, the square root of Rice's rule is taken to accommodate the 2-dimensional distribution in this case.

b = \sqrt{2 \sqrt[3]{n}}, \quad n = \text{number of sampled data}.   (23)
In this paper, we evaluate the information rate causality by employing Equation (22)
to examine the impact of one process, which remains static in time, on the distribution
of another process within a specific time window. To estimate the joint probability distribution utilized in Equation (22), a histogram is employed with a bin size determined by
Equation (23). Figure 1c illustrates the overall procedure of this evaluation.
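A minimal sketch of Equations (22) and (23) is given below (our own illustration; the function names, the use of common bin edges for the two time points, and the handling of the bin widths through normalised densities are assumptions). The samples are drawn across trials, with x2 held at its reference time t_i while x1 is taken at two consecutive time steps.

```python
import numpy as np

def rice_bins_2d(n_samples):
    """Square root of Rice's rule, Equation (23), for a 2-D histogram."""
    return max(2, int(np.round(np.sqrt(2.0 * np.cbrt(n_samples)))))

def info_rate_causality(x1_t, x1_t_next, x2_ref, dt):
    """Discretized Gamma^2 of Equation (22) for x2 -> x1.

    x1_t, x1_t_next : samples of x1 across trials at times t and t + dt.
    x2_ref          : samples of x2 across trials, held fixed at the reference time t_i."""
    b = rice_bins_2d(len(x1_t))
    # Common bin edges so that both joint histograms share the same support.
    e1 = np.linspace(min(x1_t.min(), x1_t_next.min()),
                     max(x1_t.max(), x1_t_next.max()), b + 1)
    e2 = np.linspace(x2_ref.min(), x2_ref.max(), b + 1)
    p_now, _, _ = np.histogram2d(x1_t, x2_ref, bins=[e1, e2], density=True)
    p_next, _, _ = np.histogram2d(x1_t_next, x2_ref, bins=[e1, e2], density=True)
    area = np.outer(np.diff(e1), np.diff(e2))       # dx1 * dx2 for every bin
    diff = np.sqrt(p_next) - np.sqrt(p_now)
    return 4.0 * np.sum(area * diff**2) / dt**2
```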
2.4. Summary of the Procedure: Data And Analysis
We further develop information rate causality (refer to Section 2.3) for the analysis
of our numerically generated data and compare it with the non-parametric GC (refer
to Section 2.1) and window sliding TE (refer to Section 2.2). Our numerical data are
generated by simulating different discrete autoregressive models covering both linear and
nonlinear models. The simulation is conducted for 10,000 trials with each trial running for
25 s (physical time) at 200 Hz (physical sampling frequency). Different coupling/causal
relationships between the signals are considered in this paper, such as unidirectional (refer
to Section 3), interchanging unidirectional (refer to Section 4), and bidirectional (refer to
Section 5). Note that the noncoupling cases are also considered in this paper to check
the robustness of the causality analyses; specifically, we refer to the uncoupling of x1 (t)
towards x2 (t) after the physical time 10 s. Using simulated signals, the causality analyses
(GC, TE, information rate causality) are conducted according to the sketch shown in Figure 1.
To be consistent with the data points used for each window, each sampling window
will contain 0.5 s of data points with an overlap of 0.25 s (0.5 × 0.5 s), which is equivalent to
100 data points for one simulation. Since the models are simulated for 10,000 trials, each
window will consist of 1 × 106 sample points (100 points × 10, 000 trials). Note that this
window is not applied to the frequency domain of GC and the net TE as both are calculated
based on the whole series.
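The following sketch illustrates how such windows could be cut from the trial matrix of one signal (our own illustration; the array layout and names are assumptions, not the authors' code).

```python
import numpy as np

def sliding_windows(trials, fs=200, win_s=0.5, step_s=0.25):
    """Yield (start_time, window) pairs, where each window pools all trials.

    trials : array of shape (n_trials, n_samples) for one signal."""
    win = int(win_s * fs)      # 100 samples per window
    step = int(step_s * fs)    # slide by 50 samples, i.e. 0.25 s overlap
    for start in range(0, trials.shape[1] - win + 1, step):
        yield start / fs, trials[:, start:start + win]

# With 10,000 trials of 5000 samples each, every window pools
# 100 samples x 10,000 trials = 1e6 sample points, as stated above.
```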
2.5. Summary of the Different Models and Key Results
Prior to the discussion of different models in Sections 3 to 5 with the implementation
of different analysis methods, we provide the key results/findings in Table 1.
Table 1. Summary of the performance of GC, TE, and information rate causality for different models
discussed in Sections 3 to 5. In general, GC performs well for linear and stationary signals; TE
performs well only if the lags are chosen correctly; information rate causality performs well and can
recover the underlying lags of the signals.
Method | Unidirectional (Section 3): Linear | Unidirectional (Section 3): Nonlinear | Interchange (Section 4): Linear and Nonlinear | Bidirectional (Section 5): Linear and Nonlinear
GC | performs well | fails in non-stationary signals | performs well | fails to capture the bidirectional coupling of signals
TE | performs well if correct lags are chosen or else produces spurious results (all models)
Information rate causality | performs well and able to recover the underlying lags of coupling (all models)
3. Unidirectional Model
In this part, unidirectional causality, where the first process (x1 (t)) influencing the
second process (x2 (t)), is considered for both linear and nonlinear autoregressive models
given by Equations (24) and (25), respectively. The causality between the processes occurs
as the processes are coupled with each other through the Heaviside step function (H (. . .)).
Linear model:
x1 (t) = 0.55x1 (t − 1) + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(24)
Nonlinear model:
x1 (t) = 0.55x1 (t − 1)e1−| x1 (t−1)| + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(25)
Here, t is the time step and τ is the physical time; they are related through the sampling frequency as t = [sampling frequency] × τ. H(τ − 10) is a Heaviside step function that allows the coupling to occur starting at the physical time of 10 s. ε_n, n ∈ [1, 2], is the Gaussian noise that perturbates the systems. In this study, the noises are set to have zero mean (⟨ε_n⟩ = 0, n ∈ [1, 2]) along with a specific covariance/variance (Σ_nm = ⟨(ε_n − ⟨ε_n⟩)(ε_m − ⟨ε_m⟩)⟩, n, m ∈ [1, 2]), which is expressed as the covariance matrix \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. Due to the exponential term in the nonlinear model, the process will have non-stationary oscillation when the power of the exponential term becomes positive (e^R → diverges). Hence, two different values of the noise covariance are considered to study the stationary and non-stationary oscillation of the processes. They are labelled as large noise and small noise, with the respective matrices \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix} and \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} 0.005 & 0.0 \\ 0.0 & 0.005 \end{pmatrix}. Note that the cross-covariance of the noises is set to zero (Σ12 = Σ21 = 0.0) to ensure that the coupling is purely due to the intrinsic interaction in the model and not through mutual noise. The linear model has stationary oscillation for either large or small noise, and hence only one noise level (the large noise) is used to simulate the linear model.
The general structure of the models is shown in Figure 2.
To explore all possible combinations of cases, we investigate the coupling and noncoupling cases for both linear and nonlinear models. To this end, we investigate the following
six different cases: [linear: couple], [linear: noncouple], [nonlinear: large noise, coupling],
[nonlinear: large noise, noncoupling], [nonlinear: small noise, coupling], and [nonlinear:
small noise, noncoupling].
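A minimal simulation sketch for one trial of Equations (24) and (25) is shown below (our own illustration; the function name and the random-number-generator set-up are assumptions). The large- and small-noise cases correspond to noise variances of 1.0 and 0.005, respectively, as defined above.

```python
import numpy as np

def simulate_unidirectional(nonlinear=False, noise_var=1.0, fs=200, t_max=25.0,
                            couple=True, seed=None):
    """One trial of Equation (24) (linear) or Equation (25) (nonlinear)."""
    rng = np.random.default_rng(seed)
    n = int(t_max * fs)                                    # 5000 samples: 25 s at 200 Hz
    e1, e2 = rng.normal(0.0, np.sqrt(noise_var), (2, n))   # independent Gaussian noises
    x1, x2 = np.zeros(n), np.zeros(n)
    for t in range(2, n):
        if nonlinear:
            x1[t] = 0.55 * x1[t - 1] * np.exp(1.0 - abs(x1[t - 1])) + e1[t]
        else:
            x1[t] = 0.55 * x1[t - 1] + e1[t]
        # Heaviside coupling H(tau - 10): x1 drives x2 only after 10 s of physical time.
        h = 1.0 if (couple and t / fs >= 10.0) else 0.0
        x2[t] = h * x1[t - 2] + e2[t]
    return x1, x2
```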
Figure 2. Model of the flow of information of the processes x1 (t) and x2 (t) for Equations (24) and (25).
In this paper, the equations are simulated with the physical time of 25 s and sampling frequency of
200 Hz (5000 realizations) with either large noise or small noise. The coupling between the processes
occurs at physical time of 10 s.
The result of the simulation for the Equations (24) and (25) is shown in Figure 3.
Notice that the signals for the nonlinear model exhibit non-stationary oscillation when
the perturbating noise is small, as shown in Figure 3e,f. This is due to the divergence of the exponential term of Equation (25) when its power becomes positive (e^{1−|x1(t−1)|} = e^R → diverges when ε1(t) is small noise) and to the influence of the previous state on the current state (x1(t) ← 0.55 x1(t−1)). The rest of the simulated signals have stationary oscillation,
as shown in the phase space where the observed states are localized around ( x1 , x2 ) = (0, 0).
[Figure 3 panels: (a) linear: coupling; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 3. Result of the simulation based on Equations (24) and (25) (refer to Figure 2 for graphical
explanation of the process). The [top left, red] is the result of x1 and [bottom, green] is the result of x2 ;
[right blue] is the phase space of x1 and x2 . Note that the coupling of the processes occurs at 10 s.
3.1. Granger Causality
As shown in Figure 4, the coupling of x1 to x2 can be captured well through nonparametric GC analysis for stationary linear signals, as shown in Figure 4a. Concurrently,
the detection of the causality works well for stationary nonlinear signals, as shown in
Figure 4c. This is because the nonlinear exponential term is well-approximated by a constant value when the spectral matrix is decomposed into the transfer matrix and the covariance matrix
of noise via Wilson’s algorithm. Note that the amplitude of the GC decreases for this
nonlinear stationary signal. This is due to the contribution of the exponential term (e−R )
that eventually affects the S22 (ω ) element of the spectral matrix (refer to Equation (11)) and
hence the amplitude of the causality (Ix1 → x2 ). For non-stationary signals, the frequency
domain of GC is still able to show the causality within the signals, but the time-varying
frequency of GC cannot represent well the period of causation, as shown in Figure 4e.
From the figure, we can see that the nonparametric GC gives a false result, as it suggests
that the coupling between the processes occurs throughout the time. Similarly, the nonparametric GC analysis works fine in evaluating the non-causality cases for stationary cases but
not so well for the non-stationary case (refer to Figure 4b,d,f). GC analysis fails to work on signals with non-stationary oscillations, as such signals cannot be modelled well via the linear autoregressive equations (refer to Equations (3) and (4)).
[Figure 4 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 4. Result of the spectral and time-varying frequency of GC. Note that for each subfigure,
the [left] shows the spectral (frequency domain) of GC, [blue] indicates x2 causes x1 , while [orange]
indicates x1 causes x2 ; [top, right] shows the time-varying frequency of GC of x2 to x1 ; [bottom, right]
shows the time-varying frequency of GC of x1 to x2 . The results shown are based on the processes in
Equations (24) and (25). Refer to Figure 2 for the graphical explanation of the process.
3.2. Transfer Entropy
The results from TE analysis are shown in Figure 5. By first assuming that the signals are stationary, the net TE (TEX,Y = TEX →Y − TEY → X ) is first calculated to find
the suitable lag for evaluating the sliding window TE later. Figure 5a,c,e show that
the net TE between the signals is significant at the lag of 1. This is because the simulated models (Equations (24) and (25)) have the coupling occurring at one lag (compare x1(t) ← x1(t−1) with x2(t) ← x1(t−2)). Next, the window sliding TE is conducted
through the conditional probability (multidimensional probability) that contains the data
from one previous lag (refer to Equation (17)) at each window of the evaluation of TE.
As shown in Figure 5a,c,e, the window sliding TE is able to show the time (10 s physical
time) when the causality occurs. Similarly, the TE is also able to capture the non-causality
situation well for all the cases, as shown in Figure 5b,d,f. Even though TE is able to detect
the causality between the signals for either linear stationary or nonlinear non-stationary
cases, it requires extra inputs/assumptions (such as bin sizes, number of lags, and the
dimension of the multidimensional probability) to work properly/accurately. The failure
of TE in capturing the causality for the linear model (Equation (24)) is shown in Figure 6
when the lag is set to 9 instead of 1 when evaluating the TE.
[Figure 5 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 5. Result of the net TE and window sliding TE. Note that for each subfigure, [left, 4 × 4]
depicts the net TE for the whole signals; while the [right] shows the window sliding TE. The results
shown are based on the processes in Equations (24) and (25). Refer to Figure 2 for the graphical
explanation of the process.
Figure 6. Failure of TE in capturing the causality between the signals x1 and x2 for [linear: couple]
(refer to Equation (24)) when lag 9 is used for evaluation, as compared to Figure 5a, where the TE
accurately shows that the coupling occurs after 10 s.
3.3. Information Rate Causality
Both GC and TE detect causality by examining the increase in the predictability of
the stochastic processes (measured by the inverse of entropy) when the past information
is included. In comparison, information rate causality evaluates causality by quantifying
the rate of change of the probability distribution of one variable conditioned on another,
as noted previously. The results are shown in Figures 7 and 8 where the changes of the
distribution due to the causality between the processes are well-reflected via the changes in
the information rate (Γ). For the cases of noncoupling, the information rate is not changing
and remains constant, which would suggest that none of the signals cause the changes
to the joint distribution (refer to Figure 7b,d,f). Hence, no causality occurs between the
signals. Similarly, prior to the coupling of the signals, the causality is not seen, and hence
the information rate again remains constant. Notice that the information rate does not stay at zero when no causality happens; this is due to the noise/spikes of the estimated distribution, as shown in Figure S1.
In the presence of the couplings, the changes in the distribution due to the causality
among the signals can be observed via the changes in the information rate. For instance,
the changes in the information rate due to the causality of x1 (t) to x2 (t) can be observed
in Figure 7a,c,e (zoom-in: Figure 8a,c,e). Referring to the models, the signals are coupled
with the lag difference of one (x1 (t) = x1 (t − 1) and x2 (t) = x1 (t − 2)), and hence the
peak of the information rate is observed after one lag. From Figure 8a,c,e, the changes
in the information rate prior to the coupling (at physical time 10 s) are observed as the
evaluation window of [9.75 s to 10.25 s] picks up the causality of the signals from [10 s to
10.25 s]. Referring to the starting time (near 0 s) of Figure 7e,f, there is a sharp jump in the information rate. This is due to the non-stationary oscillation of x1(t), which causes a large disturbance of the distribution.
[Figure 7 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 7. Result of information rate causality for Equations (24) and (25).
[Figure 8 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 8. Zoom-in of Figure 7 for the information rate causality between 9.6 s and 10.4 s. Note that
the change in the causality occurs at 10 s.
To demonstrate the capability of information rate causality to detect lag differences
within a signal, we simulated and analyzed Equation (26), which has the coupling starting
at 10 s with a lag difference of 9 (compares x1 = x1 (t − 1) and x2 = x1 (t − 10)). Figure 9
shows the section of the information rate from 9.6 s to 10.4 s. The figure reveals that the peak
of the information rate appears at the 9th time step (0.045 s) at every window of evaluation,
which accurately reflects the lag difference specified in Equation (26). This result suggests
that information rate causality can effectively capture the underlying causality of the signals
at each evaluation window.
x1 (t) = 0.55x1 (t − 1) + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 10) + ǫ2 (t).
(26)
Figure 9. The information rate causality of Equation (26) between 9.6 s and 10.4 s where the causality
occurs at 10 s.
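As a rough illustration of how the lag can be read off each evaluation window (our own sketch; gamma_sq is assumed to hold the Γ² values at successive time steps within one window), the offset of the largest spike gives the lag directly:

```python
import numpy as np

def peak_lag(gamma_sq, fs=200):
    """Index and physical delay of the largest information-rate spike within one window."""
    k = int(np.argmax(gamma_sq))
    return k, k / fs   # e.g. k = 9 corresponds to 9 / 200 Hz = 0.045 s, as in Figure 9
```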
4. Interchange of Unidirectional Model
In this part, similar models from Section 3 are used with slight modifications to study the
interchange of the unidirectional causality. The models are modified to Equations (27) and (28)
for linear and nonlinear models, respectively. The modification with the Heaviside step
functions (H(...)) in the models means that the signal x2(t) causes x1(t) prior to the physical time (τ) of 10 s, while the signal x1(t) causes x2(t) after 10 s. The general structure of
Equations (27) and (28) is illustrated in Figure 10.
Linear model:
x1 (t) = H (10 − τ )0.55x2 (t − 1) + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(27)
Nonlinear model:
x1 (t) = H (10 − τ )0.55x2 (t − 1)e1−| x2 (t−1)| + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(28)
Here, t is the time step and τ is the physical time. The Heaviside step functions are used with H(10 − τ) indicating that the coupling occurs from the beginning of the simulation until the physical time of 10 s; conversely, H(τ − 10) indicates that the coupling occurs after 10 s. Similar
to Section 3, 6 possible cases are simulated based on Equations (27) and (28). The simulated
signals will be evaluated by GC, TE, and information rate causality.
The result of the simulations is shown in Figure 11, which suggests that all the signals
oscillate in a stationary manner. Unlike Equation (25), in which the signal x1(t) is perturbated by its previous state by a factor of 0.55 (0.55 x1(t−1)) and the small noise (ε1(t)), in Equation (28) x1(t) is perturbated by the past state of x2(t) (x2(t−1)) instead. Hence, the power of the
exponential will always be negative; consequently, all the simulated signals have stationary
oscillations. Note that the noncoupling is referring to the decoupling of x1 (t) to x2 (t)
(H (τ − 10) = 0).
Figure 10. Model of the flow of information of the processes x1 (t) and x2 (t) for Equations (27)
and (28). In this paper, the equations are simulated with the physical time of 25 s and sampling
frequency of 200 Hertz (5000 realizations) with either large noise or small noise. Note that the process
x2 (t) coupling with x1 (t) occurs before 10 s and the interchange of coupling occurs after 10 s. In the
context of noncoupling, it is referring to H (τ − 10) = 0.
[Figure 11 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 11. Result of simulation based on Equations (27) and (28) (refer to Figure 10 for graphical
explanation of the process). The [top left, red] is the result of x1 and [bottom, green] is the result of x2 ;
[right, blue] is the phase space of x1 and x2 . Note that in the context of noncoupling, it is referring to
H (τ − 10) = 0.
4.1. Granger Causality
Based on the models, the signal x2 (t) will first cause x1 (t) for the beginning of the
process until the physical time of 10 s, and x1 (t) causes x2 (t) after the 10 s. This pattern
is captured in the time-varying frequency of GC, as shown in Figure 12 for all the cases.
Prior to 10 s, x1(t) is driven by x2(t) with a factor of 0.55, and hence a corresponding level of
GC is shown in the time-varying frequency domain. Similarly, the coupling between the
signals can also be observed in the frequency domain of GC. Notice that the [nonlinear:
small noise] cases are able to show the significance of x2 (t) causing x1 (t) prior to 10 s
because, with the small noise, the exponential term increases the effective coupling of x2(t) to x1(t) to about 1.5 (e^1 × 0.55 ≈ 1.5).
[Figure 12 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 12. Result of the spectral and time-varying frequency of GC. Note that for each subfigure,
the [left] shows the spectral (frequency domain) of GC, [blue] indicates x2 causes x1 , while [orange]
indicates x1 causes x2 ; [top, right] shows the time-varying frequency of GC of x2 to x1 ; [bottom, right]
shows the time-varying frequency of GC of x1 to x2 . The results shown are based on the processes in
Equations (27) and (28). Refer to Figure 10 for the graphical explanation of the process. Note that the
noncoupling is referring to H (τ − 10) = 0 for all the cases.
4.2. Transfer Entropy
TE is able to show the correct direction of the causality between the signals if the correct lag is chosen. Therefore, for models (like Equations (27) and (28)) in which the couplings occur at different time lags, the window sliding TE evaluated at a single lag cannot properly show the direction of the causality. For instance, the window sliding TE shows that x2(t) is causing x1(t) when lag 0 is chosen, while the reverse is shown when lag 1 is used, as shown in Figure 13 (for lag 0) and Figure 14 (for lag 1). This reflects the underlying models (Equations (27) and (28)), in which x1(t) causes x2(t) one lag later. As a result, TE cannot show the whole causality of the signals at just one lag.
[Figure 13 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 13. Result of the net TE and window sliding TE with the lag of 0. Note that for each subfigure,
[left, 4 × 4] depicts the net TE for the whole signals, while the [right] shows the window sliding TE.
The results shown are based on the processes in Equations (27) and (28). Refer to Figure 10 for the
graphical explanation of the process. Note that the noncoupling is referring to H (τ − 10) = 0 for all
the cases.
[Figure 14 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 14. Result of the net TE and window sliding TE with the lag of 1. Note that for each subfigure,
[left, 4 × 4] depicts the net TE for the whole signals, while the [right] shows the window sliding TE.
The results shown are based on the processes in Equations (27) and (28). Refer to Figure 10 for the
graphical explanation of the process. Note that the noncoupling is referring to H (τ − 10) = 0 for all
the cases.
4.3. Information Rate Causality
Alternatively, the information rate causality is able to show the coupling of the signals along with the lag of coupling in each window of the evaluation. As discussed,
the information rate causality evaluates the causality by observing the changes in the
probability distribution of one signal after it is influenced by another signal. As shown
in Figures 15 and 16, the information rate causality can capture the interchange of the coupling among the signals of the simulated models (Equations (27) and (28)). As discussed in the TE analysis, signal x2(t) causes x1(t) at a lag of 0, and x1(t) causes x2(t) at a lag of 1. These lags can be observed in each window of the information rate evaluation. For instance, as shown in Figure 16, the information rate of 2 to 1 (Γ², 2 to 1) has a spike at the beginning of the window, while the information rate of 1 to 2 (Γ², 1 to 2) has a spike after one time step (one lag). Hence, the information rate causality is capable of showing the underlying causality
of the signals.
[Figure 15 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 15. Result of information rate causality for Equations (27) and (28). Note that the interchange
of the causality occurs at 10 s.
[Figure 16 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 16. Zoom-in of Figure 15 for the information rate causality between 9.6 s and 10.4 s. Note that
the interchange of the causality occurs at 10 s.
5. Bidirectional Model
In this section, the bidirectional causality is studied by modifying the linear and
nonlinear models from Section 3 as Equations (29) and (30), respectively. These modified
models have the signals x2 (t) continuously influencing x1 (t) at all times t ≥ 1. In addition,
at time 10 s, the Heaviside step function (H (τ − 10)) enables the coupling of x1 (t) and
x2 (t), which allows x1 (t) to influence x2 (t) while the previous coupling remains intact.
The general structure of the Equations (29) and (30) is illustrated in Figure 17.
Linear model:
x1 (t) = 0.55x2 (t − 1) + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(29)
Nonlinear model:
x1 (t) = 0.55x2 (t − 1)e1−| x2 (t−1)| + ǫ1 (t),
x2 (t) = H (τ − 10) x1 (t − 2) + ǫ2 (t).
(30)
Here, t is the time step and τ is the physical time; H (τ − 10) is the Heaviside step
function that allows the coupling of the signal starting at a physical time of 10 s. Similar to
the previous model, 6 cases are simulated and the signals are used to evaluate through GC,
TE, and information rate causality. Recall that the noncoupling cases here are referring to
the decoupling of x1 (t) to x2 (t) (H (τ − 10) = 0).
Figure 17. Model of the flow of information of the processes x1 (t) and x2 (t) for Equations (29)
and (30). In this paper, the equations are simulated with physical time of 25 s and sampling frequency
of 200 Hz (5000 realizations) with either large noise or small noise. Note that the x2 (t) coupled with
x1 (t) throughout the process and the occurrence of bidirectional causality happens after 10 s. In the
context of noncoupling, it is referring to H (τ − 10) = 0.
The simulation result from all aforementioned cases is shown in Figure 18. From the
observation of the individual signal and the phase space, the scenario of [nonlinear: small
noise, couple] has non-stationary oscillation after the coupling, as shown in Figure 18e. This
is due to the divergence caused by the nonlinear exponential term embedded within the simulated equation (e^{1−|x2(t−1)|} = e^R with R > 0) along with the factor of 0.55 applied to the previous state of x2(t) (0.55 x2(t−1)). On the other hand, [nonlinear: small noise, noncouple] has a stationary
oscillation, as shown in Figure 18f; this is because x1 (t) only receives the input from x2 (t), which
keeps the power of the exponential term negative and hence results in a stationary oscillation.
[Figure 18 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 18. Result of simulation based on Equations (29) and (30) (refer to Figure 17 for graphical
explanation of the process). The [top left, red] is the result of x1 ; [bottom, green] shows the result
of x2 ; [right, blue] shows the phase space of x1 and x2 . Note that the noncoupling in the context is
referring to H (τ − 10) = 0.
5.1. Granger Causality
From the GC analysis, the coupling from x2(t) to x1(t) cannot be captured well except for the [nonlinear: small noise] cases. This result is due to a similar reason as discussed in Section 4, i.e., the coupling between the signals is amplified by a factor of about 1.5 (e^1 × 0.55 ≈ 1.5). Therefore, the remaining cases cannot show a significant contrast in the causality from x2(t) to x1(t), as shown in Figure 19. Furthermore, referring to the time-varying GC, the bidirectional coupling/causality between x1(t) and x2(t) that occurs after 10 s is not captured in all the coupling cases.
[Figure 19 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 19. Result of the spectral and time-varying frequency of GC. Within each subfigure, the [left]
shows the spectral (frequency domain) of GC, with [blue] showing that x2 causes x1 (Ix2 → x1 ) and
[orange] indicating that x1 causes x2 (Ix1 → x2 ). The [top, right] figure shows that the time-varying
frequency of GC of x2 causes x1 and [bottom, right] shows that x1 causes x2 . The results shown are
based on the expression in Equations (29) and (30). Refer to Figure 17 for the graphical explanation of
the process. Note that the noncoupling is referring to H (τ − 10) = 0 for all the cases.
5.2. Transfer Entropy
The results of TE shown in Figures 20 and 21 once again are quite similar to those
shown in Section 4. The window sliding TE, evaluated with lag 0, shows that signal
x2 (t) causes x1 (t) throughout the time, as shown in Figure 20. For the coupling cases,
the amplitude of the TE changes when x1(t) feeds back to x2(t) (after time 10 s). Similar
to Section 4, the window sliding TE only shows that signal x1 (t) is causing x2 (t) when
evaluated at lag 1 for the coupling cases shown in Figure 21. Hence, TE cannot fully capture
the causality between the signals with just one lag.
[Figure 20 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 20. Result of the net TE and window sliding TE with the lag at 0. Within each subfigure,
the [left, 4 × 4] figures show the net TE for the whole signals, while the [right] shows the window
sliding TE. The results shown are based on the expression in Equations (29) and (30). Refer to
Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to
H (τ − 10) = 0 for all the cases.
[Figure 21 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 21. Result of the net TE and window sliding TE with the lag at 1. Within each subfigure,
the [left, 4 × 4] figures show the net TE for the whole signals, while the [right] shows the window
sliding TE. The results shown are based on the expression in Equations (29) and (30). Refer to
Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to
H (τ − 10) = 0 for all the cases.
5.3. Information Rate Causality
The information rate causality analysis evaluates the causality through the change in the probability distribution of one signal after including another signal that is assumed to cause it. The information rate causality is able to show the coupling
among the signals well, as shown in Figures 22 and 23. Similar to Section 4, prior to time
10 s, the causality from signal x2 (t) to x1 (t) can be depicted with the presence of the peak
of Γ2 2 to 1. The peak presented at the beginning of the evaluation window suggests that
the coupling occurs at lag 0. After 10 s, the mutual feedback between the signals is shown
within the evaluation window of the information rate causality. This mutual causality
between the signals is observed through the alternating oscillations of the peak occurrence
in the information rate within the window. Hence, this can reflect well the underlying
model of the signals. Note that this is different from Section 4, which involves an interchange of unidirectional causality in which the feedback passes solely from one signal to the other without either signal retaining information from itself.
[Figure 22 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 22. Result of information rate causality for Equations (29) and (30). Note that the bidirectional
causation begins at 10 s. The results shown are based on the expression in Equations (29) and (30).
Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring
to H (τ − 10) = 0 for all the cases.
[Figure 23 panels: (a) linear: couple; (b) linear: noncouple; (c) nonlinear: large noise, couple; (d) nonlinear: large noise, noncouple; (e) nonlinear: small noise, couple; (f) nonlinear: small noise, noncouple.]
Figure 23. Zoom-in of Figure 22 for the information rate causality between 9.6 s and 10.4 s. Note
that the bidirectional causation begins at 10 s. The results shown are based on the expression in
Equations (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the
noncoupling is referring to H (τ − 10) = 0 for all the cases.
6. Conclusions
In this paper, we proposed an alternative causal analysis method to the popular GC
and TE for causality quantification. Specifically, based on information rate, which measures
the rate of temporal change in the time-dependent distribution, we developed a model-free,
information-geometric measure of causality—information rate causality—that is suitable
for analyzing numerically generated non-stationary, nonlinear data. To compare the GC,
TE, and information rate causality, we applied the methods to numerical data generated
by simulating different types of discrete autoregressive models which contain linear and
nonlinear interactions. In Section 3, we showed that information rate causality performs
equally well compared to GC in the standard linear stationary signals. Later, we showed
that the GC did not perform well for non-stationary and/or nonlinear signals. Furthermore,
it failed to capture mutual feedback between the signals, as shown specifically in Section 5 in
all the coupling cases. TE performed slightly better than GC, since it is model-independent.
However, the information on the lag is needed to properly evaluate the TE, as shown in
Sections 4 and 5. In comparison with GC and TE, information rate causality has shown
to be able to uncover the underlying mechanism of causality in the signals, such as the
interchanging oscillatory feedback between the signals that is discussed in Section 5, which
was not captured by either GC or TE.
While our results in this paper were obtained based on the interaction of two time
series of different types of (non)linearity and (non)stationarity, they have at least pointed
out some limitations of GC and TE that can be resolved by employing information rate
causality. It remains as future work to extend this work and to investigate a larger class of
time series data, including real data such as EEG signals.
Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/e25050806/s1, Figure S1: Estimation of the joint probability
distribution via histogram.
Author Contributions: Conceptualization, E.-j.K.; Methodology, E.-j.K., F.H. and H.J.C.; software,
H.J.C.; Writing–original draft, H.J.C.; Writing–review & editing, E.-j.K.; Supervision, E.-j.K. and F.H.
All authors have read and agreed to the published version of the manuscript.
Funding: E.-j.K. and F.H. are partly supported by EPSRC Grant EP/W036770/1.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. Granger Causality
Appendix A.1. Autoregressive Process
Given a stochastic process x(t), it can be modelled through an autoregressive (AR) model using an appropriate number n of past observations of the process; such a model is known as an autoregressive model of order n, AR(n). The general expression of the AR(n) model is given in Equation (A1):
x(t) = \sum_{i=1}^{n} \alpha_i x(t - i) + \epsilon(t),  (A1)
such that t is the time step and α_i is the constant coefficient that weights the contribution of the ith past observation to the process x(t). ε(t) is the additive noise with variance Σ, which indicates the proportion of external influence on the process. Hence, a process that is strongly related to its past observations will have small noise (e.g., small Σ).
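For readers who wish to experiment numerically, the following is a minimal Python sketch of simulating Equation (A1); the order, coefficient values, and noise level are illustrative assumptions, not values used elsewhere in the paper.

```python
import numpy as np

def simulate_ar(alpha, n_steps, noise_std=1.0, seed=0):
    """Simulate the AR(n) model of Equation (A1):
    x(t) = sum_i alpha_i * x(t - i) + eps(t)."""
    rng = np.random.default_rng(seed)
    n = len(alpha)
    x = np.zeros(n_steps + n)                  # zero initial conditions
    eps = rng.normal(0.0, noise_std, n_steps + n)
    for t in range(n, n_steps + n):
        x[t] = sum(alpha[i] * x[t - 1 - i] for i in range(n)) + eps[t]
    return x[n:]

# Illustrative stationary AR(2); the coefficients are arbitrary.
x = simulate_ar(alpha=[0.5, -0.2], n_steps=5000)
```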
Appendix A.2. Granger Causality: Time Domain
Given two stochastic processes (x1 (t) and x2 (t)), they can be modelled independently,
as shown in Equation (A1) and expressed as Equations (A2) and (A3).
x_1(t) = \sum_{i=1}^{n} a_i x_1(t - i) + \epsilon_1(t),  (A2)

x_2(t) = \sum_{i=1}^{n} c_i x_2(t - i) + \epsilon_2(t),  (A3)
such that ε1(t) and ε2(t) are the noises with variances Σ11 and Σ22, respectively; ai and ci are the constant parameters that weight the contributions of the past observations to x1(t) and x2(t), respectively.
If there is a present causal relationship between two processes, they can be modelled
with the past values from each other, as shown in Equations (A4) and (A5).
x_1(t) = \sum_{i=1}^{n} a_i x_1(t - i) + \sum_{i=1}^{n} b_i x_2(t - i) + \epsilon^*_1(t),  (A4)

x_2(t) = \sum_{i=1}^{n} c_i x_1(t - i) + \sum_{i=1}^{n} d_i x_2(t - i) + \epsilon^*_2(t).  (A5)
Here ai, bi, ci, and di are the constant parameters. The noises ε*1(t) and ε*2(t) have variances collected in the covariance matrix Σ shown in Equation (A6):
\Sigma = \begin{pmatrix} \Sigma^*_{11} & \Sigma^*_{12} \\ \Sigma^*_{21} & \Sigma^*_{22} \end{pmatrix}.  (A6)
Similar to Equations (A2) and (A3), Σ*11 and Σ*22 are the variances of ε*1(t) and ε*2(t), respectively, while Σ*12 = Σ*21 is the covariance of the noises ε*1(t) and ε*2(t).
The causal relationship between x1(t) and x2(t) can be determined by evaluating the noises from Equations (A2) to (A5). For instance, Σ*11 < Σ11 would indicate that the predictability of x1(t) increases when the past information of x2(t) is included. Hence, this would suggest a causal relationship from x2(t) to x1(t), which is known as GC (x2(t) Granger-causes x1(t), Fx2→x1). Similarly, Σ*22 < Σ22 would suggest that x1(t) Granger-causes x2(t), Fx1→x2. A nonzero covariance of the noises, Σ*12 = Σ*21, would suggest that the processes are correlated with one another via a common external force that drives them; consequently, it indicates an instantaneous causality (Fx1↔x2).
The idea behind GC is to evaluate the total interdependence (also known as total
causality index) between x1 (t) and x2 (t), which consists of two directional causalities
(Fx1 → x2 and Fx2 → x1 ) and one instantaneous causality (Fx1 ↔ x2 ). The total interdependence
between x1 (t) and x2 (t) is defined as Equation (A7).
F_{x_1 \cdot x_2} = \log \frac{\Sigma_{11}\Sigma_{22}}{|\Sigma|} = \log \frac{\Sigma_{11}\Sigma_{22}}{\Sigma^*_{11}\Sigma^*_{22} - (\Sigma^*_{12})^2} = F_{x_1 \to x_2} + F_{x_2 \to x_1} + F_{x_1 \leftrightarrow x_2},  (A7)
such that |Σ| is the determinant of the covariance matrix Σ. If either of the signals Granger-causes or correlates with the other, then Fx1·x2 > 0, because Σ*11 < Σ11, Σ*22 < Σ22, or Σ*12 ≠ 0. On the contrary, Fx1·x2 = 0 indicates that there is no GC relationship and no correlation between the signals.
The directional causalities and instantaneous causality are defined as Equations (A8)
to (A10).
F_{x_1 \to x_2} = \log \frac{\Sigma_{22}}{\Sigma^*_{22}},  (A8)

F_{x_2 \to x_1} = \log \frac{\Sigma_{11}}{\Sigma^*_{11}},  (A9)

F_{x_1 \leftrightarrow x_2} = \log \frac{\Sigma^*_{11}\Sigma^*_{22}}{|\Sigma|}.  (A10)
Similar to Equation (A7), the individual causalities in Equations (A8) to (A10) take values in the range [0, ∞), with F(...) = 0 indicating the absence of a causal relationship.
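To make the time-domain procedure concrete, the sketch below estimates the residual variances of the restricted (Equations (A2) and (A3)) and full (Equations (A4) and (A5)) models by ordinary least squares and forms the directional measures of Equations (A8) and (A9); the model order n and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lagged_design(series, n):
    """Stack lagged copies x(t-1), ..., x(t-n) of each column of `series`."""
    T = series.shape[0]
    cols = [series[n - i:T - i, :] for i in range(1, n + 1)]
    return np.hstack(cols)                     # shape (T - n, n * dim)

def residual_var(y, X):
    """Variance of the OLS residual of y regressed on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ beta)

def granger_time_domain(x1, x2, n=2):
    """Sketch of Equations (A2)-(A9): compare restricted and full AR(n) fits."""
    X = np.column_stack([x1, x2])
    target1, target2 = x1[n:], x2[n:]
    D1 = lagged_design(x1[:, None], n)         # own past of x1 only
    D2 = lagged_design(x2[:, None], n)         # own past of x2 only
    D12 = lagged_design(X, n)                  # past of both signals
    S11, S22 = residual_var(target1, D1), residual_var(target2, D2)
    S11s, S22s = residual_var(target1, D12), residual_var(target2, D12)
    F_x1_to_x2 = np.log(S22 / S22s)            # Eq. (A8)
    F_x2_to_x1 = np.log(S11 / S11s)            # Eq. (A9)
    return F_x1_to_x2, F_x2_to_x1
```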
Appendix A.3. Granger Causality: Frequency Domain
The spectral version of GC is studied in the frequency domain of the AR model. Unlike the time domain, the frequency-domain GC requires both the constant parameters and the noise variances for its evaluation. To understand the formulation, the lag operator L^k is used and defined as Equation (A11):

L^k x(t) = x(t - k).  (A11)
Hence, by applying the lag operator expressed in Equation (A11) to Equations (A4)
and (A5), they can be rewritten as
x_1(t) = \sum_{i=1}^{n} a_i L^i x_1(t) + \sum_{i=1}^{n} b_i L^i x_2(t) + \epsilon^*_1(t),
x_2(t) = \sum_{i=1}^{n} c_i L^i x_1(t) + \sum_{i=1}^{n} d_i L^i x_2(t) + \epsilon^*_2(t),  (A12)

\left(1 - \sum_{i=1}^{n} a_i L^i\right) x_1(t) - \sum_{i=1}^{n} b_i L^i x_2(t) = \epsilon^*_1(t),
- \sum_{i=1}^{n} c_i L^i x_1(t) + \left(1 - \sum_{i=1}^{n} d_i L^i\right) x_2(t) = \epsilon^*_2(t),  (A13)

\begin{pmatrix} a(L) & b(L) \\ c(L) & d(L) \end{pmatrix} \begin{pmatrix} x_1(t) \\ x_2(t) \end{pmatrix} = \begin{pmatrix} \epsilon^*_1(t) \\ \epsilon^*_2(t) \end{pmatrix}.  (A14)

Applying the Fourier transform,

\begin{pmatrix} a(\omega) & b(\omega) \\ c(\omega) & d(\omega) \end{pmatrix} \begin{pmatrix} x_1(\omega) \\ x_2(\omega) \end{pmatrix} = \begin{pmatrix} \epsilon^*_1(\omega) \\ \epsilon^*_2(\omega) \end{pmatrix} \equiv A(\omega)\, x(\omega) = \epsilon^*(\omega),  (A15)

\begin{pmatrix} x_1(\omega) \\ x_2(\omega) \end{pmatrix} = \begin{pmatrix} a(\omega) & b(\omega) \\ c(\omega) & d(\omega) \end{pmatrix}^{-1} \begin{pmatrix} \epsilon^*_1(\omega) \\ \epsilon^*_2(\omega) \end{pmatrix} = \begin{pmatrix} H_{11}(\omega) & H_{12}(\omega) \\ H_{21}(\omega) & H_{22}(\omega) \end{pmatrix} \begin{pmatrix} \epsilon^*_1(\omega) \\ \epsilon^*_2(\omega) \end{pmatrix},  (A16)

x(\omega) = H(\omega)\, \epsilon^*(\omega),  (A17)
such that A(ω) is known as the coefficient matrix. The elements in the coefficient
matrix are expressed as
a(\omega) = 1 - \sum_{i=1}^{n} a_i e^{-j\omega i},  (A18)

b(\omega) = - \sum_{i=1}^{n} b_i e^{-j\omega i},  (A19)

c(\omega) = - \sum_{i=1}^{n} c_i e^{-j\omega i},  (A20)

d(\omega) = 1 - \sum_{i=1}^{n} d_i e^{-j\omega i},  (A21)
which are obtained via the Fourier transform from the time domain to the frequency domain. Note that j is the imaginary unit, while ω is the angular frequency, defined as ω = 2πf with f the frequency. The matrix H(ω) is known as the transfer matrix. The spectral matrix S(ω) can be obtained by proper ensemble averaging of Equation (A17), as expressed in Equation (A22):
S(\omega) = \begin{pmatrix} S_{11}(\omega) & S_{12}(\omega) \\ S_{21}(\omega) & S_{22}(\omega) \end{pmatrix} = \left\langle x(\omega)\, x^{\dagger}(\omega) \right\rangle = H(\omega)\, \Sigma\, H^{\dagger}(\omega),  (A22)
such that ⟨·⟩ denotes an ensemble average and † denotes the complex conjugate transpose of a matrix. The elements Snm with n = m are known as auto-spectra and those with n ≠ m as cross-spectra. Similar to Equation (A7), the total interdependence between x1(ω) and x2(ω) is defined in Equation (A23), which consists of two directional causalities (Ix1→x2 and Ix2→x1) and one instantaneous causality (Ix1↔x2).
I_{x_1 \cdot x_2} = \log \frac{S_{11}(\omega)\, S_{22}(\omega)}{|S(\omega)|} = I_{x_1 \to x_2} + I_{x_2 \to x_1} + I_{x_1 \leftrightarrow x_2},  (A23)
such that |S(ω)| is the determinant of spectral matrix S(ω). The individual causalities
are defined as
I_{x_1 \to x_2} = \log \frac{S_{22}(\omega)}{\hat{H}_{22}(\omega)\, \Sigma_{22}\, \hat{H}^*_{22}(\omega)},  (A24)

I_{x_2 \to x_1} = \log \frac{S_{11}(\omega)}{\tilde{H}_{11}(\omega)\, \Sigma_{11}\, \tilde{H}^*_{11}(\omega)},  (A25)

I_{x_1 \leftrightarrow x_2} = \log \frac{\tilde{H}_{11}(\omega)\, \Sigma_{11}\, \tilde{H}^*_{11}(\omega)\, \hat{H}_{22}(\omega)\, \Sigma_{22}\, \hat{H}^*_{22}(\omega)}{|S(\omega)|}.  (A26)
Here \hat{H}_{22}(\omega) = H_{22}(\omega) + \frac{\Sigma_{12}}{\Sigma_{22}} H_{21}(\omega) and \tilde{H}_{11}(\omega) = H_{11}(\omega) + \frac{\Sigma_{12}}{\Sigma_{11}} H_{12}(\omega); * denotes the complex conjugate of the element [17].
As shown in Equations (A23) to (A26), the elements from the transfer matrix and
covariance matrix of the noises are needed to evaluate the GC in the frequency domain.
The elements can be obtained via the constant parameters of Equations (A4) and (A5),
which can be estimated via Yule–Walker equations [18,40]. This approach of evaluating
GC through the knowledge of constant parameters is known as the parametric method; it
has the disadvantage that the information of lags (number of past values) is needed when
estimating the constant parameters. Alternatively, the non-parametric method conducts the
GC analysis by directly decomposing the spectral matrix (S(ω)) into the transfer matrix
(H (ω)) and the covariance matrix of the noises (Σ). The decomposition is conducted via a
recursive algorithm known as the Wilson algorithm [18,35]. The derivation of the Wilson
algorithm will be discussed in the next section.
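As an illustration of the parametric route described above, the following sketch assembles A(ω), H(ω), and S(ω) from Equations (A18)-(A22) and evaluates the directional measures of Equations (A24) and (A25). It assumes the lag coefficients and the noise covariance have already been estimated (e.g., via the Yule–Walker equations); the function name, argument layout, and frequency handling are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def spectral_gc(A_coeffs, Sigma, freqs, fs=1.0):
    """Parametric frequency-domain GC sketch, Eqs. (A18)-(A26).
    A_coeffs: array (n, 2, 2) of lag coefficients [[a_i, b_i], [c_i, d_i]].
    Returns I_{x1->x2}(f) and I_{x2->x1}(f) on the given frequency grid."""
    n = A_coeffs.shape[0]
    I12, I21 = [], []
    for f in freqs:
        w = 2 * np.pi * f / fs
        # Coefficient matrix A(w) = I - sum_i A_i exp(-j w i), Eqs. (A18)-(A21).
        A = np.eye(2, dtype=complex)
        for i in range(1, n + 1):
            A -= A_coeffs[i - 1] * np.exp(-1j * w * i)
        H = np.linalg.inv(A)                              # transfer matrix, Eq. (A16)
        S = H @ Sigma @ H.conj().T                        # spectral matrix, Eq. (A22)
        Ht11 = H[0, 0] + (Sigma[0, 1] / Sigma[0, 0]) * H[0, 1]  # H-tilde_11
        Hh22 = H[1, 1] + (Sigma[0, 1] / Sigma[1, 1]) * H[1, 0]  # H-hat_22
        I12.append(np.log(S[1, 1].real / (np.abs(Hh22) ** 2 * Sigma[1, 1])))  # Eq. (A24)
        I21.append(np.log(S[0, 0].real / (np.abs(Ht11) ** 2 * Sigma[0, 0])))  # Eq. (A25)
    return np.array(I12), np.array(I21)
```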
Appendix A.4. Wilson Algorithm
The spectral matrix S(ω) that is defined in the range [−π, π ] is known as the spectral
density function. It can be expressed as Equation (A27).
S(\omega) = \sum_{i=-\infty}^{\infty} C_i e^{j\omega i},  (A27)

C_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, e^{-j\omega i}\, d\omega,  (A28)
such that C_i is known as the correlation coefficient, obtained via the inverse Fourier transform in Equation (A28).
Following Wilson’s factorization theorem (theorem 1.1 in [35]), the spectral density
function is the product of generating function ψ(e jω ) with its complex conjugate transpose
that is defined on a unit circle |z| = 1:
S(\omega) = \psi(e^{j\omega})\, \psi(e^{j\omega})^{\dagger},  (A29)

\psi(e^{j\omega}) = \sum_{i=0}^{\infty} A_i e^{j\omega i},  (A30)

A_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} \psi(e^{j\omega})\, e^{-j\omega i}\, d\omega.  (A31)
Here A_i is the moving-average coefficient, which can be obtained by the inverse Fourier transform as expressed in Equation (A31). With Equation (A30) holomorphically extended to the interior of the unit circle (|z| < 1), it can be rewritten as Equation (A32).
\psi(e^{j\omega}) = \psi(z) = \sum_{i=0}^{\infty} A_i z^i = A_0 + \sum_{i=1}^{\infty} A_i z^i.  (A32)

\therefore S(z) = \psi(z)\psi(z)^{\dagger} = \left( \sum_{i=0}^{\infty} A_i z^i \right) \left( \sum_{i'=0}^{\infty} A_{i'}^{\top} z^{-i'} \right) = \sum_{i=0}^{\infty} \sum_{i'=0}^{\infty} A_i A_{i'}^{\top} z^{i - i'}.  (A33)
Comparing Equations (A22) and (A33) at z = e^{j\omega} = 0:

S(0) = \psi(0)\psi(0)^{\dagger} = H(0)\, \Sigma\, H^{\dagger}(0) = I \Sigma I^{\top} = \Sigma = A_0 A_0^{\top},  (A34)

such that I is the identity matrix and [\cdot]^{\top} indicates the transpose of the matrix. Since A_0^{-1} A_0 A_0^{\top} A_0^{-\top} = I, the transfer matrix H(\omega) can be obtained via

S(z) = \psi(z)\, I\, \psi(z)^{\dagger} = \psi(z) A_0^{-1} A_0 A_0^{\top} A_0^{-\top} \psi(z)^{\dagger} = \psi(z) A_0^{-1} \Sigma A_0^{-\top} \psi(z)^{\dagger},  (A35)

\therefore H(\omega) = \psi(e^{j\omega})\, A_0^{-1}, \quad H(\omega)^{\dagger} = A_0^{-\top} \psi(e^{j\omega})^{\dagger}.  (A36)
As shown, the transfer matrix H(ω) and the noise covariance matrix Σ can be obtained once the generating function ψ(e^{jω}) is known. G.T. Wilson introduced an iterative solution for ψ(e^{jω}) using the plus operator [·]_+ of a given spectral density function f(ω), as expressed in Equations (A37) to (A39).
f(\omega) = \sum_{i=-\infty}^{\infty} \beta_i e^{j\omega i}, \quad \beta_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(\omega)\, e^{-j\omega i}\, d\omega,  (A37)

[f(\omega)]_+ = 0.5\, \beta_0 + \sum_{i=1}^{\infty} \beta_i e^{j\omega i},  (A38)

\psi_{i+1} = \psi_i \left[ \psi_i^{-1}\, S\, \left(\psi_i^{-1}\right)^{\dagger} + I \right]_+,  (A39)

where the bracketed term in Equation (A39) plays the role of f(ω) in Equation (A38).
From the previous work [35], it has been shown that the iteration in Equation (A39) converges uniquely if the initial generating function ψ0 is an upper-triangular matrix, which can be obtained via Cholesky factorization.
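The following is a rough Python sketch of the iteration in Equations (A37)-(A39), not the authors' implementation. The frequency-grid convention, the initialization, the iteration count, and the tolerance are assumptions; practical implementations of the Wilson algorithm (e.g., following [35]) additionally keep only the upper-triangular part of the zero-lag term inside the plus operator, which is omitted here for brevity.

```python
import numpy as np

def plus_operator(g):
    """[.]_+ of Eq. (A38): keep non-negative lags, halve the zero-lag term.
    g has shape (nfreq, m, m), sampled on omega_k = 2*pi*k/nfreq."""
    nfreq = g.shape[0]
    beta = np.fft.ifft(g, axis=0)              # lag-domain coefficients beta_i
    keep = np.zeros(nfreq)
    keep[0] = 0.5                              # 0.5 * beta_0
    keep[1:nfreq // 2 + 1] = 1.0               # positive lags; negative lags dropped
    return np.fft.fft(beta * keep[:, None, None], axis=0)

def wilson_factorize(S, n_iter=200, tol=1e-10):
    """Iterate Eq. (A39) so that S(omega) ~ psi psi^dagger, then recover
    Sigma = A0 A0^T (Eq. (A34)) and H(omega) = psi A0^{-1} (Eq. (A36))."""
    nfreq, m, _ = S.shape
    # One common initialization: an upper-triangular factor of the averaged spectrum.
    S0 = S.mean(axis=0).real
    psi = np.tile(np.linalg.cholesky(S0).conj().T, (nfreq, 1, 1)).astype(complex)
    for _ in range(n_iter):
        psi_inv = np.linalg.inv(psi)
        g = psi_inv @ S @ psi_inv.conj().transpose(0, 2, 1) + np.eye(m)
        psi_new = psi @ plus_operator(g)
        if np.max(np.abs(psi_new - psi)) < tol:
            psi = psi_new
            break
        psi = psi_new
    A0 = np.fft.ifft(psi, axis=0)[0].real      # zero-lag coefficient of Eq. (A32)
    Sigma = A0 @ A0.T                          # noise covariance, Eq. (A34)
    H = psi @ np.linalg.inv(A0)                # transfer matrix, Eq. (A36)
    return H, Sigma
```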
Appendix B. Transfer Entropy
Given a stochastic process (X), the average information/uncertainty needed to describe the process can be defined by Shannon entropy (IX ) and expressed as Equation (A40).
I_X = - \sum_{x \in X} p(x) \log_2 p(x),  (A40)
such that p(x) is the discrete probability distribution at state x. The unit of the Shannon entropy varies with the base of the logarithm (i.e., the unit is bits when the logarithm is base 2) [28,41,42].
For the situation of two stochastic processes (X and Y), the statistical dependency
of these two processes can be evaluated via Mutual Information (MI) and defined as
Equation (A41).
MI = \sum_{x \in X,\, y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}.  (A41)
The definition of MI is derived from the Kullback–Leibler (KL) divergence. KL
divergence measures the differences between two probability distributions (p( x ) and q( x ))
non-symmetrically; it is defined as Equation (A42) [37].
D_{KL}(P \| Q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}.  (A42)
Hence, MI measures the difference between the product of the marginal probability distributions (p(x)p(y)) and the joint probability distribution (p(x,y)), which quantifies the common information shared between the processes. Therefore, MI is zero when the processes are statistically independent and no common information is shared between them, since the joint probability distribution then equals the product of the marginals (p(x,y) = p(x)p(y) → MI = \sum_{x \in X, y \in Y} p(x,y) \log_2 1 = 0) [19,37].
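As a concrete illustration of Equation (A41), a histogram-based estimate of MI can be written in a few lines; the bin count below is an arbitrary choice and the estimator is biased for small samples, so this is only a sketch.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram estimate of Equation (A41); `bins` is an arbitrary choice."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)        # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)        # marginal p(y)
    mask = pxy > 0                             # avoid log(0) on empty bins
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))
```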
As discussed above, MI only measures the commonly shared information and consequently does not indicate the directional flow of information. In contrast, Transfer Entropy (TE) is able to reveal the directional flow of information by considering the conditional probability p(·|·); it is defined in Equation (A43).
TE_{Y \to X} = \sum_{x \in X,\, y \in Y} p(x_{n+1}, x_n^{(k)}, y_n^{(l)}) \log_2 \frac{p(x_{n+1} \mid x_n^{(k)}, y_n^{(l)})}{p(x_{n+1} \mid x_n^{(k)})},  (A43)
such that x_{n+1} is the state of X at the (n+1)-th moment; x_n^{(k)} is the state of X at the n-th moment consisting of its k previous states; and y_n^{(l)} is the state of Y at the n-th moment consisting of its l previous states. Note that the numbers of previous states k and l are arbitrary and depend on the interest of the study. For instance, given a collectively observed stochastic process X = (x_1, x_2, \ldots, x_{10}), x_8^{(2)} = (x_n, x_m) with n, m \in [1, 8], e.g., (x_1, x_2) or (x_3, x_8). Hence, TE measures the additional information needed from Y for the realization of X at (n+1), relative to the assumption that X is independent of Y. If Y does not affect the future evolution of X, then p(x_{n+1} \mid x_n^{(k)}, y_n^{(l)}) = p(x_{n+1} \mid x_n^{(k)}) and consequently TE_{Y \to X} = 0. Using the general definition of the conditional probability distribution (p(x \mid y) = p(x, y)/p(y)), Equation (A43) can be rewritten as Equation (A44) [19].
TE_{Y \to X} = \sum_{x \in X,\, y \in Y} p(x_{n+1}, x_n^{(k)}, y_n^{(l)}) \log_2 \frac{p(x_{n+1}, x_n^{(k)}, y_n^{(l)})\, p(x_n^{(k)})}{p(x_{n+1}, x_n^{(k)})\, p(x_n^{(k)}, y_n^{(l)})}.  (A44)
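A minimal histogram-based sketch of Equation (A44) for the simplest case k = l = 1 is given below; the lag, the number of bins, and the function name are illustrative assumptions, and more careful estimators (see, e.g., [37]) are preferable in practice.

```python
import numpy as np

def transfer_entropy(x, y, lag=1, bins=8):
    """Histogram sketch of TE_{Y->X}, Eq. (A44), with k = l = 1.
    `lag` and `bins` are arbitrary analysis choices."""
    x_next, x_now, y_now = x[lag:], x[:-lag], y[:-lag]
    # Joint distribution p(x_{n+1}, x_n, y_n) on a 3-D grid.
    pxxy, _ = np.histogramdd(np.column_stack([x_next, x_now, y_now]), bins=bins)
    pxxy /= pxxy.sum()
    pxx = pxxy.sum(axis=2)                     # p(x_{n+1}, x_n)
    pxy = pxxy.sum(axis=0)                     # p(x_n, y_n)
    px = pxxy.sum(axis=(0, 2))                 # p(x_n)
    num = pxxy * px[None, :, None]
    den = pxx[:, :, None] * pxy[None, :, :]
    mask = (pxxy > 0) & (den > 0)              # skip empty bins
    return np.sum(pxxy[mask] * np.log2(num[mask] / den[mask]))
```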
Appendix C. Information Rate, Information Length, and Information Rate Causality
In a dimensionless statistical manifold, the distance between two probability distributions is defined by their differences. One such distance is discussed in Appendix B: the KL divergence (relative entropy), given by Equation (A42). As mentioned, the KL divergence measures the differences non-symmetrically, and hence it is hardly considered a metric distance because it does not satisfy the triangle inequality [32]. A symmetric version of the KL divergence is introduced as the Jensen divergence, expressed in Equation (A45). The square root of the Jensen divergence has been shown to be a metric distance that satisfies the triangle inequality [32,43,44].
D_{JS} = \frac{1}{2} \left[ D_{KL}(P \| Q) + D_{KL}(Q \| P) \right].  (A45)
As depicted in Equations (A42) and (A45), the distance is measured between the end points (e.g., the initial and final probabilities), and hence it is not path-dependent. A path-dependent distance can be evaluated with the information length L, which is computed from the information rate Γ. The detailed discussion and derivation are given in [32]; a brief derivation of Γ and L is given below.
Consider a time-dependent probability distribution p(x, t) that evolves over time; for instance, p(x, t) changes to p(x, t + dt) over a small time step dt → 0. The difference between the two distributions can be expanded with dt as the leading order, as in Equation (A46):
\lim_{dt \to 0} \frac{1}{(dt)^2} D_{KL}\big(p(x, t+dt) \,\|\, p(x, t)\big) = \lim_{dt \to 0} \frac{1}{(dt)^2} D_{KL}\big(p(x, t) \,\|\, p(x, t+dt)\big) = \lim_{dt \to 0} \frac{1}{(dt)^2} D_{JS},  (A46)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx\; p(x, t+dt) \ln \frac{p(x, t+dt)}{p(x, t)},  (A47)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx \left[ p(x, t) + (dt)\partial_t p + \frac{1}{2}(dt)^2 \partial_{tt} p + O\big((dt)^3\big) \right] \ln\!\left[ 1 + \frac{(dt)\partial_t p}{p(x, t)} + \frac{1}{2} \frac{(dt)^2 \partial_{tt} p}{p(x, t)} + O\big((dt)^3\big) \right],  (A48)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx \left[ p(x, t) + (dt)\partial_t p \right] \left[ \frac{(dt)\partial_t p}{p(x, t)} - \frac{1}{2} \left( \frac{(dt)\partial_t p}{p(x, t)} \right)^{\!2} \right],  (A49)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx \left\{ p(x, t) \left[ \frac{(dt)\partial_t p}{p(x, t)} - \frac{1}{2} \left( \frac{(dt)\partial_t p}{p(x, t)} \right)^{\!2} \right] + (dt)\partial_t p \left[ \frac{(dt)\partial_t p}{p(x, t)} - \frac{1}{2} \left( \frac{(dt)\partial_t p}{p(x, t)} \right)^{\!2} \right] \right\},  (A50)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx \left[ (dt)\partial_t p - \frac{1}{2} \frac{(dt)^2 (\partial_t p)^2}{p(x, t)} + \frac{(dt)^2 (\partial_t p)^2}{p(x, t)} - \frac{1}{2} \frac{(dt)^3 (\partial_t p)^3}{(p(x, t))^2} \right],  (A51)

= \lim_{dt \to 0} \frac{1}{(dt)^2} \int dx \left[ (dt)\partial_t p + \frac{1}{2} \frac{(dt)^2 (\partial_t p)^2}{p(x, t)} - \frac{1}{2} \frac{(dt)^3 (\partial_t p)^3}{(p(x, t))^2} \right],  (A52)

= \lim_{dt \to 0} \left[ \frac{1}{dt} \int dx\; \partial_t p + \int dx\; \frac{(\partial_t p)^2}{2\, p(x, t)} - (dt) \int dx\; \frac{(\partial_t p)^3}{2\,(p(x, t))^2} \right] = 0 + \int dx\; \frac{(\partial_t p)^2}{2\, p(x, t)} - 0,  (A53)

\therefore \lim_{dt \to 0} \frac{1}{(dt)^2} D_{JS} = \frac{1}{2} \Gamma^2(t),  (A54)

\because \int dx\; \partial_t p(x, t) = \int dx\; \partial_{tt} p(x, t) = 0 \quad \because \int dx\; p(x, t) = 1,  (A55)

\therefore \Gamma^2(t) = \int dx\; p(x, t) \left[\partial_t \ln p(x, t)\right]^2 = 4 \int dx \left[ \partial_t \sqrt{p(x, t)} \right]^2.  (A56)
Note that the Taylor series is used in Equation (A48) and the Mercator series in Equation (A49). The final result, Γ(t), is defined as the information rate, which measures the rate of statistical change (e.g., due to new input of information) from one state to another. Hence, the total statistical change in p(x, t), or the total path-dependent distance of p(x, t) as it evolves through time, is expressed in Equation (A57) and defined as the information length L:
L = \int dt\; \Gamma(t).  (A57)
To accommodate the computation, the continuous differentiations (df(t)/dt, ∂f(x,t)/∂t) and integrations (∫ dt f(t), ∫ dt ∂_t f(x,t)) are discretized, as shown in Equations (A58) to (A61):

\frac{d f(t)}{dt} = \frac{f(t + \Delta t) - f(t)}{\Delta t},  (A58)

\frac{\partial f(x, t)}{\partial t} = \frac{f(x, t + \Delta t) - f(x, t)}{\Delta t},  (A59)

\int dt\; f(t) = \sum_{t_i = 0}^{N} \Delta t\, f(t_i), \quad \Delta t = t_{i+1} - t_i,  (A60)

\int dt\; \partial_t f(x, t) = \sum_{t_i = 0}^{N} \Delta t\, \partial_t f(x, t_i), \quad \Delta t = t_{i+1} - t_i.  (A61)
Hence, the discretized versions of the information rate Γ, the information length L, and the information rate causality Γ_{x_n → x_m} (n ≠ m, (n, m) ∈ [1, 2]), shown in Equations (A62) to (A64), are used when computing the results included in the paper.
\Gamma^2 = 4 \sum_{x \in X} \frac{\Delta x}{[\Delta t]^2} \left[ \sqrt{p(x, t + \Delta t)} - \sqrt{p(x, t)} \right]^2,  (A62)

L = \sum_{t \in T} \Delta t\; \Gamma(t),  (A63)

\Gamma^2_{x_2 \to x_1} = 4 \sum_{x_1 \in X_1,\, x_2 \in X_2} \frac{\Delta x_1 \Delta x_2}{[\Delta t]^2} \left[ \sqrt{p(x_1(t + \Delta t), x_2(t_i))} - \sqrt{p(x_1(t), x_2(t_i))} \right]^2.  (A64)
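The discretized expressions (A62) and (A64) translate almost directly into code. The sketch below estimates p(x, t) by histogramming an ensemble of realizations at each time step, mirroring in outline the histogram estimation referred to in Figure S1; the bin counts, array layout, and function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def information_rate(samples, dt, bins=64):
    """Discretized information rate Gamma^2(t), Eq. (A62).
    `samples` has shape (n_time, n_realisations): an ensemble of x(t)."""
    edges = np.linspace(samples.min(), samples.max(), bins + 1)
    dx = edges[1] - edges[0]
    # Histogram estimate of p(x, t) at each time step.
    p = np.array([np.histogram(s, bins=edges, density=True)[0] for s in samples])
    dsqrtp = np.diff(np.sqrt(p), axis=0)       # sqrt(p(x, t+dt)) - sqrt(p(x, t))
    return 4.0 * np.sum(dx * dsqrtp ** 2, axis=1) / dt ** 2

def information_rate_causality(x1, x2, dt, bins=32):
    """Discretized information rate causality Gamma^2_{x2->x1}, Eq. (A64).
    x1, x2 have shape (n_time, n_realisations); x2 is held at the earlier
    time index while x1 is advanced by one step."""
    e1 = np.linspace(x1.min(), x1.max(), bins + 1)
    e2 = np.linspace(x2.min(), x2.max(), bins + 1)
    dx1, dx2 = e1[1] - e1[0], e2[1] - e2[0]
    gamma2 = []
    for t in range(x1.shape[0] - 1):
        p_now, _, _ = np.histogram2d(x1[t], x2[t], bins=[e1, e2], density=True)
        p_next, _, _ = np.histogram2d(x1[t + 1], x2[t], bins=[e1, e2], density=True)
        d = np.sqrt(p_next) - np.sqrt(p_now)
        gamma2.append(4.0 * dx1 * dx2 * np.sum(d ** 2) / dt ** 2)
    return np.array(gamma2)
```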
Appendix D. Basic Model—Stationary and Continuous Coupling
In this part, we investigate the coupling between two processes, namely x1 (t) and
x2 (t), where the first process (x1 (t)) influences the second process (x2 (t)) throughout the
entire process. We consider both linear and nonlinear cases, which are represented by
Equations (A65) and (A66), respectively.
Linear model:

x_1(t) = 0.55\, x_1(t-1) + \epsilon_1(t),
x_2(t) = x_1(t-2) + \epsilon_2(t).  (A65)

Nonlinear model:

x_1(t) = 0.55\, x_1(t-1)\, e^{1 - |x_1(t-1)|} + \epsilon_1(t),
x_2(t) = x_1(t-2) + \epsilon_2(t).  (A66)
Here, t is the time step and ε_n, n ∈ [1, 2], is Gaussian white noise. The white noise has zero mean (⟨ε_n⟩ = 0, n ∈ [1, 2]) and unit variance, i.e., \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}, with Σ_{nm} = ⟨(ε_n − ⟨ε_n⟩)(ε_m − ⟨ε_m⟩)⟩, n, m ∈ [1, 2]. Additionally, we also consider the noncoupling scenario, where the process x2(t) depends solely on the noise ε2(t) (x2(t) = ε2(t), since the x1(t − 2) term is set to zero).
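A minimal Python sketch of simulating Equations (A65) and (A66) is given below; the number of time steps, the seed, and the function name are arbitrary choices made for illustration.

```python
import numpy as np

def simulate_basic_model(n_steps=2000, nonlinear=False, coupled=True, seed=0):
    """Simulate Equation (A65) (linear) or (A66) (nonlinear);
    coupled=False replaces x1(t-2) by zero so that x2(t) = eps2(t)."""
    rng = np.random.default_rng(seed)
    e1, e2 = rng.normal(size=n_steps), rng.normal(size=n_steps)  # unit-variance noise
    x1, x2 = np.zeros(n_steps), np.zeros(n_steps)
    for t in range(2, n_steps):
        if nonlinear:
            x1[t] = 0.55 * x1[t - 1] * np.exp(1 - abs(x1[t - 1])) + e1[t]
        else:
            x1[t] = 0.55 * x1[t - 1] + e1[t]
        x2[t] = (x1[t - 2] if coupled else 0.0) + e2[t]
    return x1, x2
```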
The simulation results for Equations (A65) and (A66) are presented in Figure A1.
From the figures, it can be observed that both the linear and nonlinear processes exhibit
stationary oscillation, regardless of whether they are coupled or noncoupled.
[Panels of Figure A1: (a) linear: coupling; (b) linear: noncoupling; (c) nonlinear: coupling; (d) nonlinear: noncoupling.]
Figure A1. Results of the simulations based on Equations (A65) and (A66). The [top left, red] panel shows x1, the [bottom left, green] panel shows x2, and the phase space of x1 and x2 is shown in the [right, blue] panel. Note that noncoupling here means the processes oscillate by themselves without sharing any information with each other.
Appendix D.1. Granger Causality
The results of the GC for all the cases are shown in Figure A2. Based on these results, both the coupling and noncoupling cases are captured well by the nonparametric GC approach (refer to Section 2.1). The continuous causality from x1 to x2 is clearly visible in Figure A2a,c, and the noncoupling between the signals is equally clear in Figure A2b,d.
[Panels of Figure A2: (a) linear: coupling; (b) linear: noncoupling; (c) nonlinear: coupling; (d) nonlinear: noncoupling.]
Figure A2. Results of the spectral and time-varying frequency GC. In each subfigure, the [left panel] shows the spectral (frequency-domain) GC, with the [blue line] indicating that x2 causes x1 and the [orange line] indicating that x1 causes x2. The [top right] panel illustrates the time-varying frequency GC from x2 to x1, while the [bottom right] panel shows that from x1 to x2. The results are based on Equations (A65) and (A66).
Appendix D.2. Transfer Entropy
The TE results shown in Figure A3 lead to similar findings as GC for both coupling and noncoupling of the signals. As depicted in Figure A3a,c, TE can effectively capture the coupling from x1 to x2 when an appropriate lag is selected. A suitable lag can be determined by locating the peak of the net TE (TEx1,x2 = TEx1→x2 − TEx2→x1), as it indicates the lag at which a significant result can be expected. Similarly, the noncoupling cases are depicted well in Figure A3b,d, where the TE between the two signals is extremely low, on the order of 10^{-4}.
[Panels of Figure A3: (a) linear: coupling; (b) linear: noncoupling; (c) nonlinear: coupling; (d) nonlinear: noncoupling.]
Figure A3. Results of the net TE and window-sliding TE. In each subfigure, the [left, 4 × 4] panel depicts the net TE for the whole signal, and the [right] panel shows the window-sliding TE. The results shown are based on Equations (A65) and (A66).
Appendix D.3. Information Rate Causality
Information rate causality is a quantitative measure of the change in the probability distribution of one signal caused by another. In the coupling cases, a high peak is expected, indicating a significant change in the distribution due to the causal relationship between the signals, as depicted in Figure A4a,c; magnifications of these figures are shown in Figure A5a,c. In this scenario, the causation from x1(t) to x2(t) is represented well by peaks on the order of 10^4. In the noncoupling cases, shown in Figure A4b,d (with magnifications in Figure A5b,d), information rate causality does not reduce to zero because of the noise present in the estimation of the probability distribution (refer to Figure S1). However, this noise does not affect the evaluation, as it does not produce peaks of the magnitude observed in the coupling cases.
[Panels of Figure A4: (a) linear: coupling; (b) linear: noncoupling; (c) nonlinear: coupling; (d) nonlinear: noncoupling.]
Figure A4. Result of information rate causality for Equations (A65) and (A66).
[Panels of Figure A5: (a) linear: coupling; (b) linear: noncoupling; (c) nonlinear: coupling; (d) nonlinear: noncoupling.]
Figure A5. Zoom-in of Figure A4 for the information rate causality between 9.6 s and 10.4 s.
References
1. Zenil, H.; Kiani, N.A.; Marabita, F.; Deng, Y.; Elias, S.; Schmidt, A.; Ball, G.; Tegnér, J. An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems. iScience 2019, 19, 1160–1172.
2. Zenil, H.; Kiani, N.A.; Zea, A.A.; Tegnér, J. Causal deconvolution by algorithmic generative models. Nat. Mach. Intell. 2019, 1, 58–66.
3. Nogueira, A.R.; Pugnana, A.; Ruggieri, S.; Pedreschi, D.; Gama, J. Methods and tools for causal discovery and causal inference. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1449.
4. Zenil, H.; Kiani, N.A.; Tegnér, J. Low-algorithmic-complexity entropy-deceiving graphs. Phys. Rev. E 2017, 96, 012308.
5. Danks, D.; Davis, I. Causal inference in cognitive neuroscience. WIREs Cogn. Sci. 2023, e1650.
6. Granger, C.W.J. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econom. J. Econom. Soc. 1969, 37, 424–438.
7. Guo, R.; Cheng, L.; Li, J.; Hahn, P.R.; Liu, H. A Survey of Learning Causality with Data: Problems and Methods. ACM Comput. Surv. (CSUR) 2020, 53, 1–37.
8. Kaufmann, R.K.; Stern, D.I. Evidence for human influence on climate from hemispheric temperature relations. Nature 1997, 388, 39–44.
9. Imbens, G.W. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. Rev. Econ. Stat. 2004, 86, 4–29.
10. Pullon, R.M.; Yan, L.; Sleigh, J.W.; Warnaby, C.E. Granger causality of the electroencephalogram reveals abrupt global loss of cortical information flow during propofol-induced loss of responsiveness. Anesthesiology 2020, 133, 774–786.
11. Zhao, Y.; Billings, S.A.; Wei, H.; He, F.; Sarrigiannis, P.G. A new NARX-based Granger linear and nonlinear casual influence detection method with applications to EEG data. J. Neurosci. Methods 2013, 212, 79–86.
12. Ebert-Uphoff, I.; Deng, Y. Causal discovery for climate research using graphical models. J. Clim. 2012, 25, 5648–5665.
13. Dehejia, R.H.; Wahba, S. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. J. Am. Stat. Assoc. 1999, 94, 1053–1062.
14. Shovon, M.H.I.; Nandagopal, N.; Vijayalakshmi, R.; Du, J.T.; Cocks, B. Directed Connectivity Analysis of Functional Brain Networks during Cognitive Activity Using Transfer Entropy. Neural Process. Lett. 2017, 45, 807–824.
15. Gao, Y.; Su, H.; Li, R.; Zhang, Y. Synchronous analysis of brain regions based on multi-scale permutation transfer entropy. Comput. Biol. Med. 2019, 109, 272–279.
16. Seth, A.K.; Barrett, A.B.; Barnett, L. Granger causality analysis in neuroscience and neuroimaging. J. Neurosci. 2015, 35, 3293–3297.
17. Ding, M.; Chen, Y.; Bressler, S.L. Granger Causality: Basic Theory and Application to Neuroscience. In Handbook of Time Series Analysis: Recent Theoretical Developments and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2006.
18. Lima, V.; Dellajustina, F.J.; Shimoura, R.O.; Girardi-Schappo, M.; Kamiji, N.L.; Pena, R.F.; Roque, A.C. Granger causality in the frequency domain: Derivation and applications. Rev. Bras. Ensino Fis. 2020, 42.
19. van Milligen, B.P.; Birkenmeier, G.; Ramisch, M.; Estrada, T.; Hidalgo, C.; Alonso, A. Causality detection and turbulence in fusion plasmas. Nucl. Fusion 2014, 54, 023011.
20. Sigtermans, D. Is Information Theory Inherently a Theory of Causation? arXiv 2020, arXiv:2010.01932.
21. Bressler, S.L.; Seth, A.K. Wiener-Granger Causality: A well established methodology. Neuroimage 2011, 58, 323–329.
22. Geweke, J. Measurement of Linear Dependence and Feedback Between Multiple Time Series. J. Am. Stat. Assoc. 1982, 77, 304–313.
23. Geweke, J.F. Measures of Conditional Linear Dependence and Feedback Between Time Series. J. Am. Stat. Assoc. 1984, 79, 907–915.
24. Dhamala, M.; Rangarajan, G.; Ding, M. Estimating Granger causality from Fourier and wavelet transforms of time series data. Phys. Rev. Lett. 2008, 100, 018701.
25. Dhamala, M.; Rangarajan, G.; Ding, M. Analyzing information flow in brain networks with nonparametric Granger causality. NeuroImage 2008, 41, 354–362.
26. Klonowski, W. Everything you wanted to ask about EEG but were afraid to get the right answer. Nonlinear Biomed. Phys. 2009, 3, 2.
27. He, F.; Yang, Y. Nonlinear System Identification of Neural Systems from Neurophysiological Signals. Neuroscience 2021, 458, 213–228.
28. Schreiber, T. Measuring Information Transfer. Phys. Rev. Lett. 2000, 85, 461–464.
29. Barnett, L.; Barrett, A.B.; Seth, A.K. Granger causality and transfer entropy are equivalent for Gaussian variables. Phys. Rev. Lett. 2009, 103, 238701.
30. Kim, E.J.; Guel-Cortez, A.J. Causal information rate. Entropy 2021, 23, 1087.
31. Kim, E.J. Information geometry and non-equilibrium thermodynamic relations in the over-damped stochastic processes. J. Stat. Mech. Theory Exp. 2021, 2021, 93406.
32. Kim, E.-j. Information geometry, fluctuations, non-equilibrium thermodynamics, and geodesics in complex systems. Entropy 2021, 23, 1393.
33. Ito, S.; Dechant, A. Stochastic Time Evolution, Information Geometry, and the Cramér-Rao Bound. Phys. Rev. X 2020, 10, 021056.
34. O'Donnell, R.T.; Korb, K.B.; Allison, L. Causal KL: Evaluating Causal Discovery. arXiv 2021, arXiv:2111.06029.
35. Wilson, G.T. The Factorization of Matricial Spectral Densities. SIAM J. Appl. Math. 1972, 23, 420–426.
36. Detto, M.; Molini, A.; Katul, G.; Stoy, P.; Palmroth, S.; Baldocchi, D. Causality and persistence in ecological systems: A nonparametric spectral Granger causality approach. Am. Nat. 2012, 179, 524–535.
37. Gencaga, D.; Knuth, K.H.; Rossow, W.B. A recipe for the estimation of information flow in a dynamical system. Entropy 2015, 17, 438–470.
38. Sahann, R.; Muller, T.; Schmidt, J. Histogram binning revisited with a focus on human perception. In Proceedings of the 2021 IEEE Visualization Conference (VIS), New Orleans, LA, USA, 24–29 October 2021.
39. Terrell, G.R.; Scott, D.W. Oversmoothed nonparametric density estimates. J. Am. Stat. Assoc. 1985, 80, 209–214.
40. Schelter, B.; Winterhalder, M.; Timmer, J. Handbook of Time Series Analysis; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2006; pp. 335–491.
41. Shannon, C.E. A Mathematical Theory of Communication. Part III: Mathematical Preliminaries. Bell Syst. Tech. J. 1948, 27, 623–656.
42. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
43. Virosztek, D. The metric property of the quantum Jensen-Shannon divergence. Adv. Math. 2021, 380, 107595.
44. Briët, J.; Harremoës, P. Properties of classical and quantum Jensen-Shannon divergence. Phys. Rev. A 2009, 79, 052311.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.