Energies 16 06770 v2
Article
Battery and Hydrogen Energy Storage Control in a Smart
Energy Network with Flexible Energy Demand Using Deep
Reinforcement Learning
Cephas Samende 1, *, Zhong Fan 2 , Jun Cao 3 , Renzo Fabián 3 , Gregory N. Baltas 3 and Pedro Rodriguez 3,4
Abstract: Smart energy networks provide an effective means to accommodate high penetrations of
variable renewable energy sources like solar and wind, which are key for the deep decarbonisation of
energy production. However, given the variability of the renewables as well as the energy demand,
it is imperative to develop effective control and energy storage schemes to manage the variable
energy generation and achieve desired system economics and environmental goals. In this paper,
we introduce a hybrid energy storage system composed of battery and hydrogen energy storage to
handle the uncertainties related to electricity prices, renewable energy production, and consumption.
We aim to improve renewable energy utilisation and minimise energy costs and carbon emissions
while ensuring energy reliability and stability within the network. To achieve this, we propose a
multi-agent deep deterministic policy gradient approach, which is a deep reinforcement learning-
based control strategy to optimise the scheduling of the hybrid energy storage system and energy
demand in real time. The proposed approach is model-free and does not require explicit knowledge
and rigorous mathematical models of the smart energy network environment. Simulation results
based on real-world data show that (i) integration and optimised operation of the hybrid energy
storage system and energy demand reduce carbon emissions by 78.69%, improve cost savings by
23.5%, and improve renewable energy utilisation by over 13.2% compared to other baseline models;
and (ii) the proposed algorithm outperforms the state-of-the-art self-learning algorithms like the
deep-Q network.
Keywords: deep reinforcement learning; multi-agent deep deterministic policy gradient; battery
and hydrogen energy storage systems; decarbonisation; renewable energy; carbon emissions;
deep-Q network
Citation: Samende, C.; Fan, Z.; Cao, J.; Fabián, R.; Baltas, G.N.; Rodriguez, P. Battery and Hydrogen Energy
Storage Control in a Smart Energy Network with Flexible Energy Demand Using Deep Reinforcement
Learning. Energies 2023, 16, 6770. https://doi.org/10.3390/en16196770
Academic Editors: Teke Gush and Raza Haider
Smart energy networks (SEN) (also known as micro-grids), which are autonomous
local energy systems equipped with RESs and energy storage systems (ESS) as well as vari-
ous types of loads, are an effective means of integrating and managing high penetrations of
variable RESs in the energy system [6]. Given the uncertainties with RES energy generation
as well as the energy demand, ESSs such as battery energy storage systems (BESS) have
been proven to play a crucial role in managing the uncertainties while providing reliable
energy services to the network [7]. However, due to low capacity density, BESSs cannot be
used to manage high penetration of variable RESs [8].
Hydrogen energy storage systems (HESS) are emerging as promising high-capacity
density energy storage carriers to support high penetrations of RESs. This is mainly due to
falling costs for electricity from RESs and improved electrolyser technologies, whose costs
have fallen by more than 60% since 2010 [9]. During periods of over-generation from the
RESs, HESSs convert the excess power into hydrogen gas, which can be stored in a tank.
The stored hydrogen can be sold externally as fuel such as for use in fuel-cell hybrid electric
vehicles [10] or converted into power during periods of minimum generation from the RES
to complement other ESSs such as the BESS.
The SEN combines power engineering with information technology to manage the
generation, storage, and consumption to provide a number of technical and economic
benefits, such as increased utilisation of RESs in the network, reduced energy losses and
costs, increased power quality, and enhanced system stability [11]. However, this requires
an effective smart control strategy to optimise the operation of the ESSs and energy demand
to achieve the desired system economics and environmental outcomes.
Many studies have proposed control strategies that optimise the operation of ESSs to
minimise utilisation costs [12–15]. Others have proposed control models for optimal sizing
and planning of the microgrid [16–18]. Other studies have modelled optimal energy
sharing in the microgrid [19]. Despite this rich history, the proposed control approaches
are model-based: they require explicit knowledge and rigorous mathematical
models of the microgrid to capture complex real-world dynamics. Model errors and model
complexity make it difficult to apply these approaches and optimise the ESSs in real time. Moreover, even if
an accurate and efficient model exists, developing and maintaining the control approaches
is often a cumbersome and fallible process when the uncertainties
of the microgrid are dynamic in nature [20].
In this paper, we propose a model-free control strategy based on reinforcement learning
(RL), a machine learning paradigm, in which an agent learns the optimal control policy
by interacting with the SEN environment [21]. Through trial and error, the agent selects
control actions that maximise a cumulative reward (e.g., revenue) based on its observation
of the environment. Unlike model-based optimisation approaches, model-free
algorithms do not require explicit knowledge and rigorous mathematical models of the
environment, making them capable of determining optimal control actions in real time
even for complex control problems like peer-to-peer energy trading [22]. Further, artificial
neural networks can be combined with RL to form deep reinforcement learning (DRL),
making model-free approaches capable of handling even more complex control problems
[23]. Examples of commonly used DRL-based algorithms are value-based algorithms such
as Deep Q-networks (DQN) [23] and policy-based algorithms such as the deep deterministic
policy gradient (DDPG) [24].
Recent studies on the optimised control of SENs with multiple ESSs, such as a hybrid
of a BESS and an HESS, are reported in [8,30–33]. In [8,30], a DDPG-based algorithm is
proposed to minimise building carbon emissions in an SEN that includes a BESS, an HESS,
and constant building loads. Similarly, operating costs are minimised in [31] using DDPG
and in [32] using DQN. However, these studies use a single control agent to manage the
multiple ESSs. Energy management of an SEN is usually a multi-agent problem where
an action of one agent affects the actions of others, making the SEN environment non-
stationary from an agent’s perspective [22]. Single agents have been found to perform
poorly in non-stationary environments [34].
A multi-agent-based control approach for the optimal operation of hydrogen-based
multi-energy systems is proposed in [33]. Despite the approach addressing the drawbacks
of the single agent, the flexibility of the electrical load is not investigated. With the introduc-
tion of flexible loads like heat pumps which run on electricity in SENs [35], the dynamics
of the electrical load are expected to change the technical economics and the environmental
impacts of the SEN.
Compared with the existing works, we investigate an SEN that has a BESS, an HESS,
and a schedulable energy demand. We explore the energy cost and carbon emission
minimisation problem of such an SEN while capturing the time-coupled storage dynamics
of the BESS and the HESS, as well as the uncertainties related to RES, varying energy prices,
and the flexible demand. A multi-agent deep deterministic policy gradient (MADDPG)
algorithm is developed to reduce the system cost and carbon emissions and to improve
the utilisation of RES while addressing the drawbacks of a single agent in a non-stationary
environment. To the authors’ knowledge, this study is the first to comprehensively apply
the MADDPG algorithm to optimally schedule the operation of the hybrid BESS and HESS
as well as the energy demand in an SEN.
1.2. Contributions
The main contributions of this paper are as follows:
• We formulate the SEN system cost minimisation problem, complete with a BESS, an
HESS, flexible demand, and solar and wind generation, as well as dynamic energy
pricing as a function of energy costs and carbon emissions cost. The system cost
minimisation problem is then reformulated as a continuous action-based Markov
game with unknown probability to adequately obtain the optimal energy control
policies without explicitly estimating the underlying model of the SEN and relying on
future information.
• We propose a data-driven, self-learning MADDPG algorithm to solve the Markov game
in real time; it outperforms a model-based solution and the other DRL-based algorithms
used as benchmarks. This also includes the use of a novel real-world
generation and consumption data set collected from the Smart Energy Network
Demonstrator (SEND) project at Keele University [36].
• We conduct a simulation analysis of a SEN model for five different scenarios to
demonstrate the benefits of integrating a hybrid of BESS and HESS and scheduling
the energy demand in the network.
• Simulation results based on SEND data show that the proposed algorithm can increase
cost savings and reduce carbon emissions by 41.33% and 56.3%, respectively, compared
with other bench-marking algorithms and baseline models.
The rest of the paper is organised as follows. A description of the SEN environment
is presented in Section 2. Formulation of the optimisation problem is given in Section 3.
A brief background to RL and the description of the proposed self-learning algorithm
is presented in Section 4. Simulation results are provided in Section 5, with conclusions
presented in Section 6.
Energies 2023, 16, 6770 4 of 20
Figure 1. Basic structure of the grid-connected smart energy network, which consists of solar, wind
turbines (WT), flexible energy demand, battery energy storage system (BESS), and hydrogen energy
storage system (HESS). The HESS consists of three main components, namely electrolyser (EL),
storage tank, and fuel cell (FC). Solid lines represent electricity flow. The dotted lines represent the
flow of hydrogen gas.
where $\eta_{c,t} \in (0, 1]$ and $\eta_{d,t} \in (0, 1]$ are the dynamic BESS charge and discharge efficiencies as
calculated in [38], respectively; $E^{b}_{t}$ is the BESS energy (kWh); and $\Delta t$ is the duration of BESS
charge or discharge.
The BESS charge level is limited by the storage capacity of the BESS as
where Emin and Emax are lower and upper boundaries of the BESS charge level.
To avoid charging and discharging the BESS at the same time, we have
That is, at any particular time t, either Pc,t or Pd,t is zero. Further, the charging and
discharging power is limited by maximum battery terminal power Pmax as specified by
manufacturers as
0 ≤ Pc,t , Pd,t ≤ Pmax , ∀t (4)
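As a rough illustration of how the BESS limits above interact, the following sketch advances the BESS energy by one time slot and reports whether the step is feasible. The update rule is an assumption (a common efficiency-weighted form, since the storage dynamics equation is not reproduced in this text), and the function name `bess_step` and its argument names are hypothetical.

```python
def bess_step(energy, p_charge, p_discharge, eta_c, eta_d, dt,
              e_min, e_max, p_max):
    """Advance the BESS energy (kWh) by one slot of length dt (hours).

    Assumed dynamics (not taken from the paper):
        E_next = E + (eta_c * P_c - P_d / eta_d) * dt
    Returns (next_energy, feasible); the energy is unchanged if infeasible.
    """
    # Constraint (4): both powers must lie in [0, P_max].
    if not (0.0 <= p_charge <= p_max and 0.0 <= p_discharge <= p_max):
        return energy, False
    # No simultaneous charge and discharge: at least one power is zero.
    if p_charge > 0.0 and p_discharge > 0.0:
        return energy, False
    next_energy = energy + (eta_c * p_charge - p_discharge / eta_d) * dt
    # Capacity limits: E_min <= E <= E_max.
    if not (e_min <= next_energy <= e_max):
        return energy, False
    return next_energy, True
```

A learning agent would typically receive a penalty whenever `feasible` is `False`, rather than having the action silently corrected.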
During operation, the BESS wear cannot be avoided due to repeated BESS charge and
discharge processes. The wear cost can have a great impact on the economics of the SEN.
The empirical wear cost of the BESS can be expressed as [39]
$$C^{t}_{BESS} = \frac{C^{ca}_{b}\,|E^{b}_{t}|}{L_c \times 2 \times DoD \times E_{nom} \times (\eta_{c,t} \times \eta_{d,t})^{2}} \tag{5}$$
where $E_{nom}$ is the BESS nominal capacity, $C^{ca}_{b}$ is the BESS capital cost, $DoD$ is the depth of
discharge at which the BESS is cycled, and $L_c$ is the BESS life cycle.
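The wear cost in Equation (5) can be evaluated directly. The sketch below is a minimal transcription of that formula; the function name, argument names, and the numbers in the usage comment are illustrative assumptions rather than values from the paper.

```python
def bess_wear_cost(energy_kwh, capital_cost, life_cycles, dod,
                   e_nom, eta_c, eta_d):
    """Empirical BESS wear cost following Equation (5).

    energy_kwh   : BESS energy E_t^b (the absolute value is used)
    capital_cost : BESS capital cost C_b^ca
    life_cycles  : BESS life cycle L_c
    dod          : depth of discharge at which the BESS is cycled
    e_nom        : nominal capacity E_nom (kWh)
    eta_c, eta_d : charge and discharge efficiencies in (0, 1]
    """
    denominator = life_cycles * 2.0 * dod * e_nom * (eta_c * eta_d) ** 2
    return capital_cost * abs(energy_kwh) / denominator


# Hypothetical usage: 100 kWh moved through a 2 MWh battery.
cost = bess_wear_cost(100.0, 200000.0, 5000, 0.8, 2000.0, 0.95, 0.95)
```

Because the efficiencies enter squared in the denominator, the wear cost grows as the round-trip efficiency falls.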
where $P_{el,t}$ and $P_{fc,t}$ are the electrolyser power input and fuel cell output power, respectively;
$H_t$ (in Nm³) is the hydrogen gas level in the tank; $r_{el,t}$ (in Nm³/kWh) and $r_{fc,t}$
(in kWh/Nm³) are the hydrogen generation and consumption ratios associated with the
electrolyser and fuel cell, respectively.
The hydrogen level is limited by the storage capacity of the tank as
where Hmin and Hmax are the lower and upper boundaries imposed on the hydrogen level
in the tank.
As the electrolyser and the fuel cell cannot operate at the same time, we have
where $P^{el}_{max}$ and $P^{fc}_{max}$ are the rated power values of the electrolyser and fuel cell, respectively.
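The tank limits and the mutual exclusion of electrolyser and fuel cell can be illustrated with a one-step update. The dynamics used here are an assumption inferred from the stated units of the two ratios (Nm³/kWh for generation, kWh/Nm³ for consumption); the storage equation itself is not reproduced in this text, and `hess_step` is a hypothetical name.

```python
def hess_step(h_level, p_el, p_fc, r_el, r_fc, dt,
              h_min, h_max, p_el_max, p_fc_max):
    """Advance the hydrogen level (Nm^3) by one slot of length dt (hours).

    Assumed dynamics (inferred from units, not taken from the paper):
        H_next = H + r_el * P_el * dt - P_fc * dt / r_fc
    Returns (next_level, feasible); the level is unchanged if infeasible.
    """
    # Rated power limits of the electrolyser and the fuel cell.
    if not (0.0 <= p_el <= p_el_max and 0.0 <= p_fc <= p_fc_max):
        return h_level, False
    # The electrolyser and the fuel cell cannot operate at the same time.
    if p_el > 0.0 and p_fc > 0.0:
        return h_level, False
    next_level = h_level + r_el * p_el * dt - p_fc * dt / r_fc
    # Tank capacity limits: H_min <= H <= H_max.
    if not (h_min <= next_level <= h_max):
        return h_level, False
    return next_level, True
```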
If the HESS is selected to store the excess energy, the cost of producing hydrogen
through the electrolyser and later becoming fuel cell energy is given as [40]
$$C^{el-fc}_{t} = \frac{(C^{ca}_{el}/L_{el} + C^{om}_{el}) + (C^{ca}_{fc}/L_{fc} + C^{om}_{fc})}{\eta_{fc,t}\,\eta_{el,t}} \tag{11}$$
$$C^{fc}_{t} = \frac{C^{ca}_{fc}}{L_{fc}} + C^{om}_{fc} \tag{12}$$
The total cost of operating the HESS at time t can be expressed as follows:
$$C^{t}_{HESS} = \begin{cases} C^{el-fc}_{t}, & \text{if } P_{el,t} > 0 \\ C^{fc}_{t}, & \text{if } P_{fc,t} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{13}$$
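The piecewise HESS cost in Equations (11)–(13) maps naturally onto a small function. The sketch below follows the three cases as written; reading the superscripts `ca` and `om` as capital and operation-and-maintenance costs and `L` as lifetime is an assumption made by analogy with the BESS symbols, and all names are hypothetical.

```python
def hess_cost(p_el, p_fc, c_ca_el, l_el, c_om_el,
              c_ca_fc, l_fc, c_om_fc, eta_el, eta_fc):
    """Operating cost of the HESS at one time step, Equations (11)-(13)."""
    # Fuel cell term shared by Equations (11) and (12).
    fc_term = c_ca_fc / l_fc + c_om_fc
    if p_el > 0.0:
        # Equation (11): electrolyser is running; the cost covers producing
        # hydrogen and later converting it back through the fuel cell.
        el_term = c_ca_el / l_el + c_om_el
        return (el_term + fc_term) / (eta_fc * eta_el)
    if p_fc > 0.0:
        # Equation (12): only the fuel cell is running.
        return fc_term
    # Equation (13), third case: the HESS is idle.
    return 0.0
```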
$$\Delta d_t = D_t - d_t \quad \forall t \tag{14}$$
As reducing the energy demand inconveniences the energy users, the ∆dt can be
constrained as follows:
where ζ (e.g., ζ = 30%) is a constant factor that specifies the maximum percentage of
original demand that can be reduced.
The inconvenience cost for reducing the energy demand can be estimated using a
convex function as follows:
$$C^{t}_{inc.} = \alpha_d (d_t - D_t)^{2} \quad \forall t \tag{16}$$
where αd is a small positive number that quantifies the amount of flexibility to reduce the
energy demand, as shown in Figure 2. A lower value of αd indicates that less attention is
paid to the inconvenience cost and a larger share of the energy demand can be reduced
to minimise the energy costs. A higher value of αd indicates that high attention is paid to
the inconvenience cost, and the energy demand can hardly be reduced to minimise the
energy costs.
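To make the role of αd and ζ concrete, the sketch below evaluates the quadratic inconvenience cost of Equation (16) and checks a demand reduction against the ζ limit. Only the upper bound on the reduction described in the text is checked; the exact form of the constraint is an assumption, as are the function names and sample values.

```python
def inconvenience_cost(d_scheduled, d_original, alpha_d):
    """Quadratic inconvenience cost of Equation (16)."""
    return alpha_d * (d_scheduled - d_original) ** 2


def reduction_within_limit(d_scheduled, d_original, zeta=0.30):
    """Check that the reduction D_t - d_t does not exceed zeta * D_t.

    Assumed reading of the constraint; increases in demand are not
    restricted by this check.
    """
    reduction = d_original - d_scheduled   # Equation (14)
    return reduction <= zeta * d_original


# Hypothetical usage with D_t = 250 kW, as in Figure 2.
low_attention = inconvenience_cost(200.0, 250.0, alpha_d=0.001)
high_attention = inconvenience_cost(200.0, 250.0, alpha_d=0.01)
```

For the same 50 kW reduction, the larger αd yields a cost ten times higher, which is the mechanism by which a high αd discourages demand reduction.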
Figure 2. Impact of αd parameter on the inconvenience cost of the energy demand, when Dt = 250 kW
and when dt takes values from 0 to 450 kW.
where Pg,t is power import if Pg,t > 0, and power export otherwise. We assume that the
SEN is well-sized and that Pg,t is always within the allowed export and import power
limits.
Let πt and λt be the export and import grid prices at time t, respectively. As grid
electricity is the major source of carbon emissions, the cost of utilising the main grid to
meet the supply–demand balance in the SEN is the sum of both the energy cost and the
environmental cost due to carbon emissions as follows:
$$C^{t}_{grid} = \Delta t \begin{cases} \lambda_t P_{g,t} + \mu_c P_{g,t}, & \text{if } P_{g,t} \ge 0 \\ -\pi_t |P_{g,t}|, & \text{otherwise} \end{cases} \tag{18}$$
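Equation (18) can be transcribed as a short function, shown here as a sketch. The carbon term is applied exactly as it appears in the equation (µc multiplied by the imported power); the function and argument names are hypothetical.

```python
def grid_cost(p_grid, import_price, export_price, mu_c, dt):
    """Grid cost for one time slot following Equation (18).

    p_grid > 0 means import (energy cost plus carbon term);
    p_grid < 0 means export (negative cost, i.e. revenue).
    """
    if p_grid >= 0.0:
        return dt * (import_price * p_grid + mu_c * p_grid)
    return dt * (-export_price * abs(p_grid))
```

With a 30 min slot (`dt = 0.5`), importing 100 kW at the peak price of 0.234 GBP/kWh and µc = 0.23314 gives 0.5 × (23.4 + 23.314), while exporting 100 kW at 0.05 GBP/kWh gives a cost of −2.5.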
3. Problem Formulation
The key challenge in operating the SEN is how to optimally schedule the operation of
the BESS, the HESS, and the flexible energy demand to minimise energy costs and carbon
emissions as well as to increase renewable energy utilisation. The operating costs associated
with PV and wind generation are neglected for being comparatively smaller than those for
energy storage units and energy demand [12].
The action values are bounded according to their respective boundaries given by (4),
(9), (10), and (15).
As constraints given in (2) and (7) should always be satisfied, the second part of the
reward is a penalty for violating the constraints as follows:
$$r^{(2)}_{t} = -\begin{cases} K, & \text{if (2) or (7) is violated} \\ 0, & \text{otherwise} \end{cases} \tag{22}$$
$$R = \sum_{t=0}^{T} \gamma^{t} r_t \tag{24}$$
where $T$ is the time horizon and $\gamma \in [0, 1]$ is a discount factor, which makes the agent
value rewards obtained sooner more than rewards obtained later.
As electricity prices, RES energy generation, and demand are volatile in nature, it is
generally impossible to obtain with certainty the state transition probability function $\mathcal{P}$
required to derive an optimal policy $\pi(a_t \mid s_t)$ needed to maximise $R$. To circumvent this
difficulty, we propose the use of RL as discussed in Section 4.
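The penalty of Equation (22) and the discounted return of Equation (24) are simple to compute once a reward sequence is available. The sketch below uses hypothetical function names, and the reward values in the usage lines are made up for illustration.

```python
def constraint_penalty(violated, k):
    """Penalty part of the reward, Equation (22): -K on violation, else 0."""
    return -k if violated else 0.0


def discounted_return(rewards, gamma):
    """Cumulative discounted reward R = sum_t gamma^t * r_t, Equation (24)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical usage: the same three rewards under two discount factors.
far_sighted = discounted_return([1.0, 1.0, 1.0], gamma=1.0)
short_sighted = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller discount factor shrinks the contribution of later rewards, so `short_sighted` is smaller than `far_sighted`.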
4. Reinforcement Learning
4.1. Background
An RL framework is made up of two main components, namely the environment
and the agent. The environment denotes the problem to be solved. The agent denotes the
learning algorithm. The agent and environment continuously interact with each other [21].
At every time $t$, the agent learns for itself the optimal control policy $\pi(a_t \mid s_t)$ through
trial and error by selecting control actions $a_t$ based on its perceived state $s_t$ of the environment.
In return, the agent receives a reward $r_t$ and the next state $s_{t+1}$ from the environment
without explicitly having knowledge of the transition probability function $\mathcal{P}$. The goal
of the agent is to improve the policy so as to maximise the cumulative reward R. The
environment has been described in Section 3. Next, we describe the learning algorithms.
4.2.1. DQN
The DQN algorithm was developed by Google DeepMind in 2015 [23]. It was developed
to enhance a classic RL algorithm called Q-learning [21] through the addition
of deep neural networks and a novel technique called experience replay. In Q-learning,
the agent learns the best policy $\pi(a_t \mid s_t)$ based on the notion of an action-value Q-function
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R \mid s_t = s, a_t = a]$. By exploring the environment, the agent updates the
$Q^{\pi}(s, a)$ estimates iteratively using the Bellman equation as follows:
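For orientation, the textbook tabular form of the Q-learning update is sketched here. This is the standard rule from the RL literature and is an assumption about what is meant, since the update equation is not reproduced in this text; the step size `alpha`, the function name, and the toy states are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.95):
    """One tabular Q-learning step.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    `q` maps (state, action) pairs to value estimates.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage with two discrete actions.
q_table = defaultdict(float)
q_learning_update(q_table, "s0", "charge", 1.0, "s1", ["charge", "idle"])
```

The DQN replaces the table with a neural network, which is why it still needs a discretised action set over which the maximum can be taken.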
4.2.2. DDPG
The DDPG algorithm was proposed in [24] to handle control problems with continuous
action spaces, which are otherwise impractical to handle with Q-learning and DQN. The
DDPG consists of two independent neural networks: an actor network and a critic network. The
actor network is used to approximate the policy $\pi(a_t \mid s_t)$. The input to the actor network
is the environment state $s_t$ and the output is the action $a_t$. The critic network is used to
approximate the Q-function $Q(s_t, a_t)$. It is only used to train the agent and
is discarded during the deployment of the agent. The input to the critic network is the
concatenation of the state $s_t$ and the action $a_t$ from the actor network, and the output is the
Q-function $Q(s_t, a_t)$.
Similar to the DQN, the DDPG stores an experience, $e_t = \langle s_t, a_t, r_t, s_{t+1} \rangle$, in a replay
buffer $\mathcal{D}$ at each time step $t$ to improve training and for better data efficiency. To add more
stability to the training, two target neural networks, which are identical to the (original)
actor network and (original) critic network, are also created. Let the network parameters of
the original actor network, original critic network, target actor network, and target critic
network be denoted as $\theta^{\mu}$, $\theta^{Q}$, $\theta^{\mu'}$, and $\theta^{Q'}$, respectively. Before training starts, $\theta^{\mu}$ and $\theta^{Q}$
are randomly initialised, and $\theta^{\mu'}$ and $\theta^{Q'}$ are initialised as $\theta^{\mu'} \leftarrow \theta^{\mu}$ and $\theta^{Q'} \leftarrow \theta^{Q}$.
To train the original actor and critic networks, a minibatch of $B$ experiences $\{\langle s_t^{j}, a_t^{j}, r_t^{j}, s_{t+1}^{j} \rangle\}_{j=1}^{B}$
is randomly sampled from $\mathcal{D}$, where $j = 1, \dots, B$ is the sample index. The original critic network
parameters $\theta^{Q}$ are updated through gradient descent using the mean-square Bellman
error function:
$$L(\theta^{Q}) = \frac{1}{B} \sum_{j=1}^{B} \left( y_j - Q(s_t^{j}, a_t^{j}; \theta^{Q}) \right)^{2} \tag{28}$$
where $Q(s_t^{j}, a_t^{j}; \theta^{Q})$ is the predicted output of the original critic network and $y_j$ is its target
value expressed as
$$y_j = r_t^{j} + \gamma Q'\left(s_{t+1}^{j}, \mu'(s_{t+1}^{j}; \theta^{\mu'}); \theta^{Q'}\right) \tag{29}$$
where $\mu'(s_{t+1}^{j}; \theta^{\mu'})$ is the output (action) from the target actor network and
$Q'(s_{t+1}^{j}, \mu'(s_{t+1}^{j}; \theta^{\mu'}); \theta^{Q'})$ is the output (Q-value) from the target critic network.
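The critic target and the mean-square Bellman error can be traced on a toy minibatch. In the sketch below, plain Python callables stand in for the neural networks (an illustrative simplification, not the PyTorch implementation used in the paper), and all names and numbers are hypothetical.

```python
def critic_targets(batch, target_actor, target_critic, gamma):
    """Target values y_j = r_j + gamma * Q'(s'_j, mu'(s'_j)) for a minibatch.

    Each experience in `batch` is a tuple (s, a, r, s_next).
    """
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_s, _a, r, s_next) in batch]


def critic_loss(batch, targets, critic):
    """Mean-square Bellman error over the minibatch."""
    errors = [(y - critic(s, a)) ** 2
              for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy stand-ins for the networks, with scalar states and actions.
toy_target_actor = lambda s: 0.5 * s
toy_target_critic = lambda s, a: s + a
toy_critic = lambda s, a: s + a

toy_batch = [(1.0, 0.0, 1.0, 2.0), (0.0, 1.0, 0.0, 4.0)]
toy_targets = critic_targets(toy_batch, toy_target_actor, toy_target_critic, 0.5)
toy_loss = critic_loss(toy_batch, toy_targets, toy_critic)
```

In a real implementation the loss would be minimised with respect to the critic parameters by an optimiser, while the target networks are held fixed during the gradient step.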
At the same time, parameters of the original actor network are updated by maximising
the policy objective function J (θ µ ):
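$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \frac{1}{B} \sum_{j=1}^{B} \nabla_{\theta^{\mu}} \mu(s; \theta^{\mu}) \nabla_{a} Q(s, a; \theta^{Q}) \tag{30}$$
where $s = s_t^{j}$, $a = \mu(s_t^{j}; \theta^{\mu})$ is the output (action) from the original actor network, and
$Q(s, a; \theta^{Q})$ is the output (Q-value) from the original critic network.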
After the parameters of the original actor network and original critic network are
updated, the parameters of the two target networks are updated through the soft update
technique as
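$$\begin{cases} \theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'} \\ \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'} \end{cases} \tag{31}$$
where $\tau$ is the soft-update rate, i.e., the learning rate of the target networks.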
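The soft update of Equation (31) is an element-wise blend of the original and target parameters. The sketch below uses flat lists of floats in place of network weights, which is a simplification for illustration; the function name is hypothetical.

```python
def soft_update(target_params, source_params, tau):
    """Soft update of Equation (31), applied element-wise.

    theta_target <- tau * theta_source + (1 - tau) * theta_target
    Returns a new list; the inputs are left unmodified.
    """
    return [tau * source + (1.0 - tau) * target
            for source, target in zip(source_params, target_params)]


# Hypothetical usage: with a small tau the target moves only slightly.
updated = soft_update([0.0, 10.0], [1.0, 0.0], tau=0.1)
```

A small value of τ makes the target networks trail the original networks slowly, which is what stabilises the target values in Equation (29).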
To ensure that the agent explores the environment, a random process [41] is used to
generate a noise Nt , which is added to every action as follows:
at = µ(st ; θ µ ) + Nt (32)
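Equation (32) can be illustrated with a simple noise source. The sketch below swaps in zero-mean Gaussian noise for the random process of [41] and clips the result to [−1, 1]; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
import random


def noisy_action(deterministic_action, sigma, rng, low=-1.0, high=1.0):
    """Exploration action a_t = mu(s_t) + N_t, following Equation (32).

    N_t is drawn here from a zero-mean Gaussian (an assumption; the paper
    uses the random process of [41]). Clipping to [low, high] is also an
    assumption, chosen to keep the action in a bounded range.
    """
    noise = rng.gauss(0.0, sigma)
    return min(high, max(low, deterministic_action + noise))


# Hypothetical usage with a seeded generator for repeatability.
rng = random.Random(0)
action = noisy_action(0.9, sigma=0.5, rng=rng)
```

Reducing `sigma` over the course of training is a common way to shift from exploration to exploitation.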
Figure 4. MADDPG structure and training process. The BESS agent and demand agent have the
same internal structure as the HESS agent.
5. Simulation Results
5.1. Experimental Setup
In this paper, real-world RES (solar and wind) generation and consumption data,
which are obtained from the Smart Energy Network Demonstrator (SEND), are used for
the simulation studies [36]. We use the UK’s time-of-use (ToU) electricity price as grid
electricity buying price, which is divided into peak price 0.234 GBP/kWh (4 pm–8 pm), flat
price 0.117 GBP/kWh (2 pm–4 pm and 8 pm–11 pm), and the valley price 0.07 GBP/kWh
(11 pm–2 pm). The electricity price for selling electricity back to the main grid is a flat price
πt = 0.05 GBP/kWh, which is lower than the ToU to avoid any arbitrage behaviour by the
BESS and HESS. A carbon emission conversion factor µc = 0.23314 kg CO2/kWh is used to
quantify the carbon emissions generated by using electricity from the main grid to meet the
energy demand in the SEN [42]. We set the initial BESS state of charge and hydrogen level
in the tank as E0 = 1.6 MWh and H0 = 5 Nm³, respectively. Other technical–economic
parameters of the BESS and HESS are tabulated in Table 1. A day is divided into 48 time
slots, i.e., each time slot is equivalent to 30 min.
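The ToU tariff described above can be written as a lookup over the 48 half-hour slots. The sketch assumes that slot 0 starts at midnight and that each price band includes its start time and excludes its end time; neither convention is stated in the text, and the names are hypothetical.

```python
EXPORT_PRICE = 0.05  # GBP/kWh, flat price for selling to the main grid


def tou_price(slot):
    """Grid buying price (GBP/kWh) for half-hour slot 0..47.

    Assumes slot 0 covers 00:00-00:30 and half-open price bands.
    """
    if not 0 <= slot < 48:
        raise ValueError("slot must be in the range 0..47")
    hour = slot / 2.0
    if 16 <= hour < 20:                      # peak: 4 pm-8 pm
        return 0.234
    if 14 <= hour < 16 or 20 <= hour < 23:   # flat: 2 pm-4 pm, 8 pm-11 pm
        return 0.117
    return 0.07                              # valley: 11 pm-2 pm


daily_prices = [tou_price(slot) for slot in range(48)]
```

Every buying price in `daily_prices` exceeds `EXPORT_PRICE`, which is the condition the text relies on to rule out arbitrage.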
The actor and critic networks for each MADDPG agent are designed using hyper-
parameters tabulated in Table 2. We use the rectified linear unit (ReLU) as an activation
function for the hidden layers and the output of the critic networks. A Tanh activation
function is used in the output layer of each actor network. We set the capacity of the
replay buffer to be $K = 1 \times 10^{6}$ and the maximum training steps in an episode to be $T = 48$.
Algorithm 1 is developed and implemented in Python using the PyTorch framework [43].
5.2. Benchmarks
We verify the performance of the proposed MADDPG algorithm by comparing it with
three other benchmark algorithms:
• Rule-based (RB) algorithm: This is a model-based algorithm that follows the standard
practice of wanting to meet the energy demand of the SEN using the RES generation
without guiding the operation of the BESS, HESS, and flexible demands towards
periods of low/high electricity price to save energy costs. In the event that there is
surplus energy generation, the surplus is first stored in the short-term BESS, followed
by the long-term HESS, and any extra is sold to the main grid. If the energy demand
exceeds RES generation, the deficit is first provided by the BESS followed by the HESS,
and then the main grid.
• DQN algorithm: As discussed in Section 4, this is a value-based DRL algorithm, which
intends to optimally schedule the operation of the BESS, HESS, and flexible demand
using a single agent and a discretised action space.
• DDPG algorithm: This is a policy-based DRL algorithm, which intends to optimally
schedule the operation of the BESS, HESS, and flexible demand using a single agent
and a continuous action space, as discussed in Section 4.
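The priority order of the rule-based benchmark can be sketched as a single dispatch function. The version below works in energy per time slot and ignores efficiencies and power ratings, which is a deliberate simplification; the function name, arguments, and sign convention are assumptions.

```python
def rule_based_dispatch(net_demand, bess_room, bess_avail,
                        hess_room, hess_avail):
    """Split one slot's net demand (demand minus RES generation, kWh).

    Positive outputs are energy supplied to the SEN (discharge or import);
    negative outputs are energy absorbed (charge or export).
    """
    if net_demand < 0.0:
        # Surplus: fill the BESS first, then the HESS, then export the rest.
        surplus = -net_demand
        to_bess = min(surplus, bess_room)
        to_hess = min(surplus - to_bess, hess_room)
        to_grid = surplus - to_bess - to_hess
        return {"bess": -to_bess, "hess": -to_hess, "grid": -to_grid}
    # Deficit: draw from the BESS first, then the HESS, then import the rest.
    from_bess = min(net_demand, bess_avail)
    from_hess = min(net_demand - from_bess, hess_avail)
    from_grid = net_demand - from_bess - from_hess
    return {"bess": from_bess, "hess": from_hess, "grid": from_grid}
```

Because the rule never looks at the electricity price, it cannot shift storage use or demand towards cheap periods, which is the limitation the learning-based algorithms are meant to address.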
As shown in Figure 5, all algorithms achieve convergence after 2000 episodes. The
DQN reaches convergence faster than MADDPG and DDPG due to the DQN’s discretised
and low-dimensional action space, making the determination of the optimal scheduling
policy relatively easier and quicker than the counterpart algorithms with continuous and
high-dimensional action spaces. As a discretised action space cannot accurately capture the
complexity and dynamics of the SEN energy management, the DQN algorithm converges
to the worst policy, as indicated by the lowest average reward value (−16,572.5). On the
other hand, the MADDPG algorithm converges to a high average reward value (−6858.1),
which is slightly higher than the reward value (−8361.8) for the DDPG, mainly due to
enhanced cooperation between the operation of the controlled assets.
Figure 6. Control action results (for a 7-day period) by BESS, HESS, and flexible demand agents in
response to net demand.
Figure 7 shows that in order to minimise the multi-objective function given by P1, the
algorithm prioritises the flexible demand agent to aggressively respond to price changes
compared to the BESS and HESS agents. As shown in Figure 7, the scheduled demand
reduces sharply whenever the electricity price is the highest, and increases when the price
is lowest compared to the actions by the BESS and HESS.
Together, Figures 6 and 7 demonstrate how the algorithm allocates different priorities
to the agents to achieve a collective goal: minimising carbon, energy, and operational
costs. In this case, the BESS and HESS agents are trained to respond more aggressively to
changes in energy demand and generation, and maximise the benefits thereof like minimum
carbon emissions. On the other hand, scheduling the flexible demand guides the SEN
towards low energy costs.
Figure 7. Control action results (for a 7-day period) by BESS, HESS, and flexible demand agents in
response to the ToU price.
Table 3. Cost savings and carbon emissions for different SEN models. A tick (✓) indicates that a
model is considered and a cross (×) indicates that a model is not considered.
As shown in Table 3, integrating BESS and HESS in the SEN as well as scheduling the
energy demand achieves the highest cost savings and reduction in carbon emission. For
example, the cost savings and carbon emissions are 23.5% and 78.69% higher and lower,
respectively, than those for the SEN model without BESS (i.e., the ‘No BESS’ model), mainly
due to improved RES utilisation for the proposed SEN model.
the highest RES utilisation, with 59.6% self-consumption and 100% self-sufficiency. This
demonstrates the potential of integrating HESS in future SENs for absorbing more RES,
thereby accelerating the rate of power system decarbonisation.
Table 4. Self-consumption and self-sufficiency for different SEN models. A tick (✓) indicates that a
model is considered and a cross (×) indicates that a model is not considered.
Figure 8. Performance of the MADDPG algorithm compared to the bench-marking algorithms for
(a) cost savings and carbon emissions; (b) self-consumption and self-sufficiency.
The MADDPG algorithm obtained the most stable and competitive performance in all
the performance metrics considered, i.e., cost savings, carbon emissions, self-consumption,
and self-sufficiency. This is mainly due to its multi-agent feature, which ensures a better
learning experience in the environment. For example, the MADDPG improved the cost
savings and reduced carbon emissions by 41.33% and 56.3%, respectively, relative to the RB
approach. The rival DDPG algorithm achieved the highest cost savings at the expense of
carbon emissions and self-sufficiency. As more controllable assets are expected in future
SENs due to the digitisation of power systems, multi-agent-based algorithms are therefore
expected to play a key energy management role.
having an energy demand that is sensitive to electricity prices is crucial for reducing carbon
emissions and promoting the use of RESs.
6. Conclusions
In this paper, we investigated the problem of minimising energy costs and carbon
emissions as well as increasing renewable energy utilisation in a smart energy network
(SEN) with BESS, HESS, and schedulable energy demand. A multi-agent deep deterministic
policy gradient algorithm was proposed as a real-time control strategy to optimally schedule
the operation of the BESS, HESS, and schedulable energy demand while ensuring that
the operating constraints and time-coupled storage dynamics of the BESS and HESS are
satisfied. Simulation results based on real-world data showed increased cost savings,
reduced carbon emissions, and improved renewable energy utilisation with the proposed
algorithm and SEN. On average, the cost savings were 23.5% higher and the carbon emissions
78.69% lower with the proposed SEN model than with the baseline SEN
models. The simulation results also verified the efficacy of the proposed algorithm in
managing the SEN: it outperformed the other benchmark algorithms, including the DDPG and
DQN algorithms. Overall, the results have shown great potential for integrating HESS in
SENs and using self-learning algorithms to manage the operation of the SEN.
References
1. Ritchie, H.; Roser, M.; Rosado, P. Carbon Dioxide and Greenhouse Gas Emissions, Our World in Data. 2020. Available online:
https://ourworldindata.org/co2-and-greenhouse-gas-emissions (accessed on 10 July 2023).
2. Allen, M.R.; Babiker, M.; Chen, Y.; de Coninck, H.; Connors, S.; van Diemen, R.; Dube, O.P.; Ebi, K.L.; Engelbrecht, F.; Ferrat, M.;
et al. Summary for policymakers. In Global Warming of 1.5: An IPCC Special Report on the Impacts of Global Warming of 1.5 °C above
Pre-Industrial Levels and Related Global Greenhouse Gas Emission Pathways, in the Context of Strengthening the Global Response to the
Threat of Climate Change, Sustainable Development, and Efforts to Eradicate Poverty; IPCC: Geneva, Switzerland, 2018.
3. Fuller, A.; Fan, Z.; Day, C.; Barlow, C. Digital twin: Enabling technologies, challenges and open research. IEEE Access 2020, 8,
108952–108971. [CrossRef]
4. Bouckaert, S.; Pales, A.F.; McGlade, C.; Remme, U.; Wanner, B.; Varro, L.; D’Ambrosio, D.; Spencer, T. Net Zero by 2050: A Roadmap
for the Global Energy Sector; International Energy Agency: Paris, France, 2021.
5. Paul, D.; Ela, E.; Kirby, B.; Milligan, M. The Role of Energy Storage with Renewable Electricity Generation; National Renewable Energy
Laboratory: Golden, CO, USA, 2010.
6. Harrold, D.J.; Cao, J.; Fan, Z. Renewable energy integration and microgrid energy trading using multi-agent deep reinforcement
learning. Appl. Energy 2022, 318, 119151. [CrossRef]
7. Arbabzadeh, M.; Sioshansi, R.; Johnson, J.X.; Keoleian, G.A. The role of energy storage in deep decarbonization of electricity
production. Nature Commun. 2019, 10, 3413. [CrossRef]
8. Desportes, L.; Fijalkow, I.; Andry, P. Deep reinforcement learning for hybrid energy storage systems: Balancing lead and hydrogen
storage. Energies 2021, 14, 4706. [CrossRef]
9. Qazi, U.Y. Future of hydrogen as an alternative fuel for next-generation industrial applications; challenges and expected
opportunities. Energies 2022, 15, 4741. [CrossRef]
10. Correa, G.; Muñoz, P.; Falaguerra, T.; Rodriguez, C. Performance comparison of conventional, hybrid, hydrogen and electric
urban buses using well to wheel analysis. Energy 2017, 141, 537–549. [CrossRef]
11. Harrold, D.J.; Cao, J.; Fan, Z. Battery control in a smart energy network using double dueling deep q-networks. In Proceedings of
the 2020 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), Virtual, 26–28 October 2020; pp. 106–110.
12. Vivas, F.; Segura, F.; Andújar, J.; Caparrós, J. A suitable state-space model for renewable source-based microgrids with hydrogen
as backup for the design of energy management systems. Energy Convers. Manag. 2020, 219, 113053. [CrossRef]
13. Cau, G.; Cocco, D.; Petrollese, M.; Kær, S.K.; Milan, C. Energy management strategy based on short-term generation scheduling
for a renewable microgrid using a hydrogen storage system. Energy Convers. Manag. 2014, 87, 820–831. [CrossRef]
14. Enayati, M.; Derakhshan, G.; Hakimi, S.M. Optimal energy scheduling of storage-based residential energy hub considering smart
participation of demand side. J. Energy Storage 2022, 49, 104062. [CrossRef]
15. HassanzadehFard, H.; Tooryan, F.; Collins, E.R.; Jin, S.; Ramezani, B. Design and optimum energy management of a hybrid
renewable energy system based on efficient various hydrogen production. Int. J. Hydrogen Energy 2020, 45, 30113–30128.
[CrossRef]
16. Castaneda, M.; Cano, A.; Jurado, F.; Sánchez, H.; Fernández, L.M. Sizing optimization, dynamic modeling and energy management
strategies of a stand-alone pv/hydrogen/battery-based hybrid system. Int. J. Hydrogen Energy 2013, 38, 3830–3845. [CrossRef]
17. Liu, J.; Xu, Z.; Wu, J.; Liu, K.; Guan, X. Optimal planning of distributed hydrogen-based multi-energy systems. Appl. Energy 2021,
281, 116107. [CrossRef]
18. Pan, G.; Gu, W.; Lu, Y.; Qiu, H.; Lu, S.; Yao, S. Optimal planning for electricity-hydrogen integrated energy system considering
power to hydrogen and heat and seasonal storage. IEEE Trans. Sustain. Energy 2020, 11, 2662–2676. [CrossRef]
19. Tao, Y.; Qiu, J.; Lai, S.; Zhao, J. Integrated electricity and hydrogen energy sharing in coupled energy systems. IEEE Trans. Smart
Grid 2020, 12, 1149–1162. [CrossRef]
20. Nakabi, T.A.; Toivanen, P. Deep reinforcement learning for energy management in a microgrid with flexible demand. Sustain.
Energy Grids Netw. 2021, 25, 100413. [CrossRef]
21. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
22. Samende, C.; Cao, J.; Fan, Z. Multi-agent deep deterministic policy gradient algorithm for peer-to-peer energy trading considering
distribution network constraints. Appl. Energy 2022, 317, 119123. [CrossRef]
23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.;
Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef]
24. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep
reinforcement learning. arXiv 2015, arXiv:1509.02971.
25. Wan, T.; Tao, Y.; Qiu, J.; Lai, S. Data-driven hierarchical optimal allocation of battery energy storage system. IEEE Trans. Sustain.
Energy 2021, 12, 2097–2109. [CrossRef]
26. Bui, V.-H.; Hussain, A.; Kim, H.-M. Double deep q-learning-based distributed operation of battery energy storage system
considering uncertainties. IEEE Trans. Smart Grid 2020, 11, 457–469. [CrossRef]
27. Sang, J.; Sun, H.; Kou, L. Deep reinforcement learning microgrid optimization strategy considering priority flexible demand side.
Sensors 2022, 22, 2256. [CrossRef]
28. Gao, S.; Xiang, C.; Yu, M.; Tan, K.T.; Lee, T.H. Online optimal power scheduling of a microgrid via imitation learning. IEEE Trans.
Smart Grid 2022, 13, 861–876. [CrossRef]
29. Mbuwir, B.V.; Geysen, D.; Spiessens, F.; Deconinck, G. Reinforcement learning for control of flexibility providers in a residential
microgrid. IET Smart Grid 2020, 3, 98–107. [CrossRef]
30. Chen, T.; Gao, C.; Song, Y. Optimal control strategy for solid oxide fuel cell-based hybrid energy system using deep reinforcement
learning. IET Renew. Power Gener. 2022, 16, 912–921. [CrossRef]
31. Zhu, Z.; Weng, Z.; Zheng, H. Optimal operation of a microgrid with hydrogen storage based on deep reinforcement learning.
Electronics 2022, 11, 196. [CrossRef]
32. Tomin, N.; Zhukov, A.; Domyshev, A. Deep reinforcement learning for energy microgrids management considering flexible
energy sources. In EPJ Web of Conferences; EDP Sciences: Les Ulis, France, 2019; Volume 217, p. 01016.
33. Yu, L.; Qin, S.; Xu, Z.; Guan, X.; Shen, C.; Yue, D. Optimal operation of a hydrogen-based building multi-energy system based on
deep reinforcement learning. arXiv 2021, arXiv:2109.10754.
34. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive
environments. Adv. Neural Inf. Process. Syst. 2017, 30. [CrossRef]
35. Wright, G. Delivering Net Zero: A Roadmap for the Role of Heat Pumps, HPA. Available online:
https://www.heatpumps.org.uk/wp-content/uploads/2019/11/A-Roadmap-for-the-Role-of-Heat-Pumps.pdf (accessed on 26 May 2023).
36. Keele University, The Smart Energy Network Demonstrator. Available online:
https://www.keele.ac.uk/business/businesssupport/smartenergy/ (accessed on 19 September 2023).
37. Samende, C.; Bhagavathy, S.M.; McCulloch, M. Distributed state of charge-based droop control algorithm for reducing power
losses in multi-port converter-enabled solar dc nano-grids. IEEE Trans. Smart Grid 2021, 12, 4584–4594. [CrossRef]
38. Samende, C.; Bhagavathy, S.M.; McCulloch, M. Power loss minimisation of off-grid solar dc nano-grids—Part ii: A
quasi-consensus-based distributed control algorithm. IEEE Trans. Smart Grid 2022, 13, 38–46. [CrossRef]
39. Han, S.; Han, S.; Aki, H. A practical battery wear model for electric vehicle charging applications. Appl. Energy 2014, 113, 1100–1108.
[CrossRef]
40. Dufo-Lopez, R.; Bernal-Agustín, J.L.; Contreras, J. Optimization of control strategies for stand-alone renewable energy systems
with hydrogen storage. Renew. Energy 2007, 32, 1102–1126. [CrossRef]
41. Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823. [CrossRef]
42. RenSMART, UK CO2(eq) Emissions due to Electricity Generation. Available online:
https://www.rensmart.com/Calculators/KWH-to-CO2 (accessed on 20 June 2023).
43. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32.
44. Luthander, R.; Widén, J.; Nilsson, D.; Palm, J. Photovoltaic self-consumption in buildings: A review. Appl. Energy 2015, 142, 80–94.
[CrossRef]
45. Long, C.; Wu, J.; Zhou, Y.; Jenkins, N. Peer-to-peer energy sharing through a two-stage aggregated battery control in a community
microgrid. Appl. Energy 2018, 226, 261–276. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.