Medium Access using Distributed Reinforcement Learning for IoTs with Low-Complexity Wireless Transceivers
Abstract - This paper proposes a distributed Reinforcement Learning based protocol synthesis framework that can be used for synthesizing MAC layer protocols in wireless networks. The proposed framework does not rely on complex hardware capabilities such as carrier sensing and its associated algorithmic complexities, which are often not supported in the wireless transceivers of low-cost and low-energy IoT devices. Network protocols are first formulated as Markov Decision Processes (MDPs) and then solved using reinforcement learning. A distributed and multi-agent RL framework is used as the basis for protocol synthesis. Distributed behavior makes the nodes independently learn optimal transmission strategies without having to rely on full network-level information or direct knowledge of the other nodes' behavior. The nodes learn to minimize packet collisions such that optimal throughput can be attained and maintained for loading conditions that are higher than what known benchmark protocols such as ALOHA can support. In addition, the nodes are observed to learn to act optimally in the presence of heterogeneous loading and network topological conditions. Finally, the proposed learning approach allows the wireless bandwidth to be fairly distributed among network nodes in a way that is not dependent on those heterogeneities. Via extensive simulation experiments, the paper demonstrates the performance of the learning paradigm and its ability to make nodes adapt their optimal transmission strategies on the fly in response to various network dynamics.

Index Terms - Reinforcement Learning (RL), Markov Decision Process (MDP), Distributed Learning, Medium Access Control (MAC), Protocol Synthesis

I. INTRODUCTION

Protocol design in wireless networks is mostly driven by heuristics and the past experience of network designers. The best practice for programming wireless MAC logic is to use known standardized protocols, including ALOHA, CSMA, CSMA-CA, and their higher derivatives, such as WiFi, Bluetooth, and Zigbee. The specific family of protocols is chosen for an application based on the cost budget, the lower-layer hardware capabilities for carrier sensing and collision detection, and network properties such as round-trip propagation delay. A notable drawback of this design approach based on available standardized protocols is that the resulting logic often underperforms in application-specific non-standard scenarios caused by topology- and load-heterogeneity, and other factors. For example, with baseline ALOHA MAC, the throughput of a network starts falling at higher loads due to collisions. The situation is compounded in the presence of heterogeneity. In a network with an arbitrary mesh topology, some nodes may be in a more disadvantageous position than others in terms of the collisions they experience. As a result, those nodes receive a smaller share of the available wireless bandwidth. The root cause of these limitations is that the nodes in the network are statically programmed with standardized protocol logic, and they lack the flexibility needed to learn optimal behavior in specific scenarios.

The goal of this work is to develop a MAC layer protocol based on reinforcement learning that can address these challenges. With the long-term objective of making the learning approach generalized, we start with an initial scenario of a network whose nodes run the simplest MAC logic without relying on complex and energy-expensive lower-level capabilities such as carrier sensing. The developed framework would be useful for the simple transceivers used in low-cost IoT devices and wireless sensor nodes.

The key idea behind distributed Reinforcement Learning (RL) for protocol design [1] is to formulate the protocol layer logic in each network node as a Markov Decision Process (MDP) and use RL as a temporal difference method to solve the MDP. The solution of this MDP is a set of transmission actions taken by individual nodes, thus resulting in a network protocol. Since the RL mechanisms learn online, the need for prior data collection and training is avoided in this approach. However, in distributed RL, a key requirement is that an agent (i.e., a network node) needs to have as much network-level information as possible in order to avoid collisions and share the wireless bandwidth in a fair manner. This becomes a problem in a partially connected network, in which the information available to a node is usually limited to its immediate neighborhood.

An obvious solution to this problem is to employ centralized learning, in which a centralized agent with access to complete network-level information learns the optimal node behavior (i.e., a protocol) and downloads it to the individual nodes. However, this approach may not be feasible in decentralized practical scenarios. In this paper, we develop a distributed RL-based protocol synthesis approach that can work with partial/local network information availability in partially connected networks. The approach is built on our prior work [1], which assumed full network-level information availability for fully connected networks. In this work, the distributed learning mechanism is developed such that the RL agents (i.e., network nodes) rely only on the localized network information available in their immediate neighborhoods. It is shown that the proposed mechanism allows nodes to learn to reduce collisions so as to mimic the baseline ALOHA behavior for low network traffic, reach the maximum ALOHA throughput, and, unlike baseline ALOHA, maintain that maximum value for higher traffic loads. An important feature of the developed framework is that the wireless bandwidth is distributed in a fair manner irrespective of the underlying network load and topology, even when they are heterogeneous.

This work has the following scope and contributions. First, a distributed learning framework for synthesizing MAC layer protocols with localized network information visibility is designed. Second, it is shown that the synthesized protocol can obtain and maintain the known benchmark throughput of standard protocols (ALOHA in this case). Third, the proposed learning approach
is shown to work for arbitrary mesh network topologies. Finally, it is shown that distributed learning can handle heterogeneity both in terms of load and network topology.

The rest of the paper is organized as follows. In Section II, the network model used in this work is presented. In Section III, MAC logic synthesis is posed as an MDP problem, and RL is introduced as a viable solution approach. In Section IV, the proposed RL framework and all its components are discussed in detail. Section V presents all experimental results and their interpretations. In Section VI, the existing related works are reviewed. Finally, the work is summarized, and conclusions are drawn in Section VII.
II. NETWORK AND TRAFFIC MODEL

Arbitrary mesh topologies of low-cost sensor nodes without carrier-sensing abilities are considered. Fig. 1(a) depicts the generalized topology representation, in which a solid line represents physical node connectivity and the dashed lines indicate the data paths. For example, nodes i and k transmit packets to node j, and node j transmits packets to nodes i and k. The MAC layer traffic load model is created such that all packets generated in a node are sent to its 1-hop neighbors following a uniform random distribution. In other words, if a node i has K one-hop neighbors and its MAC layer load is g_i Erlang following a Poisson distribution, node i sends g_i/K Erlangs of traffic to each of its one-hop neighbors. The objective of the proposed learning mechanism is to make the nodes learn transmission strategies that can obtain and maintain an optimal (and fair) distribution of wireless bandwidth by reducing packet collisions across heterogeneous topologies and loading conditions.

Fig. 1. Network and data sharing model for (a) generalized, (b) partially connected, and (c) fully connected topology.

Fig. 1(b) and 1(c) are the minimally connected and fully connected examples of the generalized topology depicted in Fig. 1(a). In 1(b), a node sends data to only two of its 1-hop neighbors, whereas in 1(c) a node sends data to all other nodes in the network. As described in Section IV, the network model includes the availability of piggybacking for sending very low data-rate control information using the data packets themselves. Such control information is used for the local information sharing needed by the Reinforcement Learning mechanism.
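To make the traffic model concrete, the following is a minimal C sketch (not the paper's simulator code; the Poisson sampler, the epoch-length interpretation, and the enqueue callback are illustrative assumptions) of how a node with offered load g_i Erlangs could generate packets in one epoch and spread them uniformly over its K one-hop neighbors.

```c
#include <stdlib.h>
#include <math.h>

/* Illustrative sketch: sample a Poisson-distributed packet count with the
 * given mean using Knuth's method (adequate for small means). */
static int poisson_sample(double mean)
{
    double limit = exp(-mean);
    double prod  = (double)rand() / RAND_MAX;
    int    count = 0;
    while (prod > limit) {
        prod *= (double)rand() / RAND_MAX;
        count++;
    }
    return count;
}

/* Generate one epoch of traffic for node i with MAC-layer load g_i Erlangs
 * (packets per packet duration) over epoch_len packet durations.  Each
 * packet is addressed to a uniformly chosen one-hop neighbor, so each of
 * the K neighbors receives g_i/K Erlangs on average.  The enqueue callback
 * stands in for the simulator's transmit queue. */
void generate_traffic(int i, double g_i, int epoch_len,
                      const int *neighbors, int K,
                      void (*enqueue)(int src, int dst))
{
    int n = poisson_sample(g_i * epoch_len);
    for (int p = 0; p < n; p++)
        enqueue(i, neighbors[rand() % K]);
}
```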
III. PROTOCOL SYNTHESIS AS A MARKOV DECISION PROCESS

A Markov Decision Process (MDP) is used for modelling discrete-stage sequential decision making in a stochastic environment [2]. In an MDP, the system transitions stochastically among a set of states S = {S_1, S_2, S_3, ..., S_N}. State transitions take place as a result of an agent's actions defined in the action space A = {A_1, A_2, ..., A_M}. The probability of transition from one state to another because of a specific action is defined by a set of transition probabilities T. For each action taken in a particular state, there is an associated numerical reward, which defines the physical benefit of taking that action in that state. An important property of an MDP is that the probability of transition from the current state to the next state depends only on the current state and action; hence the attribute "Markov". Formally, an MDP can be defined by the tuple (S, A, T, R), where S is the set of all possible states, A is the set of actions, T is the set of transition probabilities, and R is the reward function. Thus, for any dynamic system whose behavior can be modelled by an MDP, there exists a set of optimal actions for each state that yields the maximum expected long-term reward [3].

Fig. 2. Reinforcement Learning for Network Protocol MDP. (Agent: network nodes; state: congestion status; actions: Tx, Wait; reward: throughput; environment: the network.)

Reinforcement Learning (RL) [4] is a class of algorithms for solving an MDP. The advantage of reinforcement learning over other MDP solution techniques (e.g., dynamic programming) is that RL does not always require an exact mathematical model of the system. One such model-free RL approach is Q-learning [5], in which the agent learns an optimal set of actions by repeatedly taking all possible actions in each state and evaluating their rewards. The best set of actions is the one that maximizes the expected long-term reward. During the learning process, the agents maintain a Q-table whose entries are the Q-values for each state-action pair. For a specific state, the action with the highest Q-value in the table is preferred by the agent. At the end of each decision epoch, the Q-value for state-action pair (s, a) is updated as shown in Eq. (1), where r, α, γ, and s′ indicate the received reward, the learning rate, the discount factor, and the next state caused by action a, respectively.

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a′∈A} Q(s′, a′) − Q(s, a) ]        (1)

The dynamic behavior of a wireless network in different operating conditions can be represented as a Markov Decision Process. An annotation of the different components of an MDP representing a wireless network protocol is shown in Fig. 2. In this formulation, the nodes of the network represent distributed agents that individually take actions to transmit with various probabilities or to wait. The environment with which an agent/node interacts is the wireless network itself. The congestion level in the network represents the agent-perceivable state space of the environment, which changes depending on the independent agents' actions. The reward can be a function of different performance metrics (such as throughput, energy efficiency, etc.) that need to be optimized.
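For concreteness, a minimal C sketch of the tabular update in Eq. (1) is given below. This is illustrative and not the authors' simulator code; the 24 collision-level states and 20 transmission-probability actions anticipate the discretization defined later in Section IV.

```c
/* Illustrative sketch: tabular Q-learning update of Eq. (1) for one node's MDP. */
#define NUM_STATES  24
#define NUM_ACTIONS 20

static double Q[NUM_STATES][NUM_ACTIONS];   /* per-node Q-table */

/* One decision-epoch update: state s, action a, reward r, next state s2. */
void q_update(int s, int a, double r, int s2, double alpha, double gamma)
{
    double best_next = Q[s2][0];             /* max over a' of Q(s', a') */
    for (int ap = 1; ap < NUM_ACTIONS; ap++)
        if (Q[s2][ap] > best_next)
            best_next = Q[s2][ap];

    /* Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ] */
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a]);
}
```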
IV. DISTRIBUTED RL WITH LOCALIZED INFORMATION

The developed learning framework is termed Distributed RL with Localized Information (DRLI) for synthesizing wireless MAC behavior. With DRLI, each wireless node acts as an independent learning agent and implements a specific flavor of learning, namely Hysteretic Q-Learning [6]. A specific feature of DRLI is that, unlike many prior works [1, 7], the learning agents/nodes do not rely on global network information. Instead, they use localized network information from the immediate neighborhood for learning effective MAC layer behavior in terms of transmission policies.

A. Hysteretic Q-Learning

Hysteretic Q-Learning is a distributed Q-learning algorithm used in a multi-agent cooperative setting to achieve a specific goal as a team. In Hysteretic Q-Learning, the agents do not participate in explicit communication and are unaware of each other's actions during the learning process. The primary difference of this approach compared with traditional Q-learning is its use of two learning rates, α and β. The Q-table update rule is:

δ = r + γ max_{a′∈A} Q(s′, a′) − Q(s, a)
Q(s, a) ← Q(s, a) + α·δ,  if δ ≥ 0        (2)
Q(s, a) ← Q(s, a) + β·δ,  otherwise

The parameter δ in Eq. (2) controls which learning rate is used in the Q-table update in a particular epoch. In a multi-agent setting, an agent may be penalized even for an optimal action if the actions taken by the other agents in the team were bad. Thus, the rewards and penalties received depend not only on an agent's own actions, but also on the other agents' actions. That is the reason behind the use of two learning rates in Hysteretic Q-Learning. As can be seen from the equation, a positive value of δ leads to the use of α as the learning rate, and a negative value of δ makes the algorithm use β, where α and β are the increase and decrease rates of the Q-values, respectively. The parameter δ is positive if the actions taken were beneficial for attaining the desired optimum of the system, and vice versa. Hence, β is chosen such that it is always less than α in order to assign less importance to the penalties.

In the DRLI learning framework, nodes act as hysteretic learning agents and iteratively learn their individual optimal transmission strategies in a distributed manner. The objective is to maximize network performance and fairness.
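A minimal C sketch of the Eq. (2) update is shown below (illustrative only; it reuses the Q-table and action-space size declared in the Eq. (1) sketch). The only difference from standard Q-learning is the choice of learning rate based on the sign of the temporal-difference error δ.

```c
/* Illustrative sketch of the Eq. (2) update (reuses Q and NUM_ACTIONS from
 * the Eq. (1) sketch): increase Q-values aggressively with alpha, decrease
 * them cautiously with beta < alpha. */
void hysteretic_q_update(int s, int a, double r, int s2,
                         double alpha, double beta, double gamma)
{
    double best_next = Q[s2][0];             /* max over a' of Q(s', a') */
    for (int ap = 1; ap < NUM_ACTIONS; ap++)
        if (Q[s2][ap] > best_next)
            best_next = Q[s2][ap];

    double delta = r + gamma * best_next - Q[s][a];   /* TD error of Eq. (2) */
    Q[s][a] += (delta >= 0.0 ? alpha : beta) * delta;
}
```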
B. Learning with Localized Network Information

In a cooperative setting where a team of learning agents shares a common objective, solving an MDP in a distributed fashion is easier with global information availability. However, in the context of a wireless network, it is not always feasible for an RL agent (i.e., a node) to have global information such as network-wide congestion and throughput. The problem is particularly compounded for partially connected mesh topologies.

In the proposed learning framework, the reward function for an agent/node is designed such that only information from its one-hop neighbor nodes is used. Such information is gathered using an in-band mechanism [8] in which the relevant control information is piggybacked onto regular MAC layer PDUs. Consider a scenario in which s_j is the current throughput of node-j, and s_j→i is the part of s_j for which node-i (i.e., a 1-hop neighbor of node-j) is the intended receiver. Node-i computes the quantity s_j→i by keeping track of the non-collided transmissions from node j that are intended for node i. Node-i periodically sends the s_j→i information to node-j by piggybacking it within its wireless MAC PDUs. Node j periodically computes its own overall throughput s_j = Σ_i s_j→i, where the sum runs over all its 1-hop neighbors i. Node j also periodically sends its own recently computed throughput s_j to all its 1-hop neighbors using piggybacking. A typical MAC layer PDU from node-i is shown in Fig. 3.

Fig. 3: MAC layer PDU from node i, consisting of the source address, the destination address, and the DATA field carrying the piggybacked information.

The above process is periodically executed across all 1-hop neighbors. The process creates an information distribution model such that, at any given point in time, each network node-j possesses the throughput information of all its one-hop neighbors. As explained in the following section, the RL reward function is developed using this 1-hop neighborhood throughput information.
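The sketch below illustrates one possible realization of this in-band exchange in C. The PDU field names and the array-based bookkeeping are assumptions for illustration, not the paper's packet format: each non-collided PDU received from a neighbor delivers that neighbor's own throughput s_k and the feedback value s_me→k that the neighbor measured, while the receiver separately counts s_k→me for feedback in the opposite direction.

```c
/* Illustrative sketch only: field names and bookkeeping are assumptions. */
#define MAX_NEIGHBORS 16

typedef struct {
    int    src_addr;             /* source address                          */
    int    dst_addr;             /* destination address                     */
    double piggy_s_own;          /* s_src: sender's own overall throughput  */
    double piggy_s_dst_to_src;   /* s_{dst->src}: throughput the sender has */
                                 /* measured from the destination, fed back */
    /* ... DATA payload ... */
} mac_pdu_t;

/* Per-node bookkeeping at node "me", indexed by 1-hop neighbor id. */
static double s_to_me[MAX_NEIGHBORS];       /* s_{k->me}, counted locally       */
static double s_of_neighbor[MAX_NEIGHBORS]; /* s_k, reported by each neighbor k */
static double s_feedback[MAX_NEIGHBORS];    /* s_{me->k}, reported back by k    */

/* Called for every PDU that is received without a collision. */
void on_pdu_received(int me, const mac_pdu_t *pdu)
{
    if (pdu->dst_addr != me)
        return;                                 /* not addressed to this node */
    int k = pdu->src_addr;
    s_to_me[k] += 1.0;                          /* one more non-collided packet from k:
                                                   contributes to s_{k->me}, normalized
                                                   per epoch elsewhere */
    s_of_neighbor[k] = pdu->piggy_s_own;        /* neighbor k's throughput s_k */
    s_feedback[k]    = pdu->piggy_s_dst_to_src; /* s_{me->k} as measured by k  */
}

/* s_me = sum over 1-hop neighbors k of s_{me->k}, from piggybacked feedback. */
double my_overall_throughput(int num_neighbors)
{
    double s = 0.0;
    for (int k = 0; k < num_neighbors; k++)
        s += s_feedback[k];
    return s;
}
```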
It is important to note that the lack of global network information visibility is different from the concept of a Partially Observable MDP (POMDP) [9, 10]. In a POMDP, the agents lack visibility of their current states, and they infer the states from a set of observations made after the actions [11]. In this work, on the other hand, an agent/node has visibility of its own state, which is defined by the 1-hop congestion status. It is just that the lack of global information visibility forces the state to be defined based on local information, which can have performance implications for the MDP solutions.

C. Distributed Reinforcement Learning Components

Each node runs an RL agent, which is used for solving its own MDP based on the local information model described in Section IV-B. The following are the primary components of the proposed distributed RL based learning framework.

Action space: The actions are defined as probabilities of packet transmission. In order to keep the action space discrete, the transmission probability is divided into 20 equal steps in the range [0, 1]. Formally stated, an action a ∈ {a_1, a_2, ..., a_20} represents the probability with which a packet is transmitted following the Hysteretic RL logic explained in Section IV-A.

State space: The state space is defined as the network congestion level estimated from the packet collision status. The state perceived by an RL agent running in a node at an RL epoch is defined as the probability of packet collisions experienced by the node during that epoch. As done for the action space, the states are discretized into 24 equal steps in the range [0, 1]. Hence, the state of node i at epoch t is denoted as ŝ_i(t) ∈ Ŝ_i = {ŝ_i,1, ŝ_i,2, ..., ŝ_i,24}, where Ŝ_i is the state space for node i. Each state ŝ_i,k represents a level of collision observed by node-i. The collision level is quantified as:

PC = (Number of packet collisions) / (Number of transmitted packets)        (3)
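A minimal C sketch of this discretization is shown below (illustrative only; the exact placement of the 20 probability steps within [0, 1] is an assumption, since the text does not state whether the end points are included).

```c
/* Illustrative sketch of the Section IV-C action and state discretization. */

/* Transmission probability for action index a in 0..19 (20 equal steps). */
double action_to_tx_prob(int a)
{
    return (double)(a + 1) / 20.0;          /* 0.05, 0.10, ..., 1.00 (assumed) */
}

/* Eq. (3): PC = packet collisions / transmitted packets (0 if nothing sent). */
double collision_probability(long collisions, long transmitted)
{
    return transmitted > 0 ? (double)collisions / (double)transmitted : 0.0;
}

/* Map the per-epoch collision probability onto one of the 24 states. */
int pc_to_state(double pc)
{
    int s = (int)(pc * 24.0);
    return s >= 24 ? 23 : s;                /* clamp the PC = 1.0 case */
}
```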
Rewards: The observables of a node i are defined as the information about its own throughput s_i and its one-hop neighbors' throughputs s_j (for all neighbors j), where s_i and s_j are expressed in packets per packet duration, or Erlang. This throughput information is exchanged using piggybacking, as explained in Subsection A.

TABLE I: REWARD STRUCTURE FORMULATION
  #    Δs_i − ε    Δf_i    R_i
  1       +          +     +50
  2       +          −     −30
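Table I is truncated in this excerpt and Δf_i is not defined here; purely as a hypothetical illustration, the sketch below encodes only the two visible rows, assuming Δs_i is the epoch-to-epoch change in the node's own throughput, Δf_i the change in a fairness measure computed over the 1-hop throughputs, and ε a small positive threshold.

```c
/* Hypothetical sketch: only the two rows visible in Table I are encoded;
 * the remaining cases are not specified in this excerpt. */
double reward_from_table1(double delta_s, double delta_f, double eps)
{
    if (delta_s - eps > 0.0 && delta_f > 0.0)
        return  50.0;   /* row 1: throughput and fairness both improve      */
    if (delta_s - eps > 0.0 && delta_f < 0.0)
        return -30.0;   /* row 2: throughput improves but fairness degrades */
    return 0.0;         /* rows not shown in the excerpt: placeholder       */
}
```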
known-best performance in the absence of carrier sensing and network time-synchronization (i.e., slotting) capabilities. The learning hyperparameters for DRLI-MAC are shown in Table II. The experiments are conducted using a C-language based simulator.

TABLE II: BASELINE EXPERIMENTAL PARAMETERS
  No. of packets per epoch    1000
  α                           0.9
  β                           0.1
  γ                           0.95
  ε                           1.0 × e^(−epoch ID / 1000)
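The excerpt does not spell out how ε is used; assuming a standard ε-greedy selection over the Q-table of the earlier sketches, with the exploration rate decayed as in Table II (ε = e^(−epoch ID/1000)), the per-epoch action choice could look like the following.

```c
#include <stdlib.h>
#include <math.h>

/* Epsilon-greedy action selection (assumed scheme) over one Q-table row,
 * with the exploration rate decayed per Table II. */
int choose_action(const double q_row[20], int epoch_id)
{
    double eps = exp(-(double)epoch_id / 1000.0);   /* 1.0 * e^(-epoch/1000) */

    if ((double)rand() / RAND_MAX < eps)
        return rand() % 20;                         /* explore: random action   */

    int best = 0;                                   /* exploit: argmax_a Q(s,a) */
    for (int a = 1; a < 20; a++)
        if (q_row[a] > q_row[best])
            best = a;
    return best;
}
```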
Fig. 4. (a) Load vs Throughput for 3-node linear network; 𝑠𝑖 , 𝑔𝑖 are the individual throughputs and loads respectively for node 𝑖, 𝑆 is the network
throughput, (b) Convergence plots for actions of node 2 and (c) Convergence plots for effective load and throughput of node 2 and network throughput
is more evident in heterogeneous traffic conditions, as shown in Fig. 5. With the ALOHA access strategy, there is a high variation of throughput among the three nodes for heterogeneous load distributions. In contrast, with DRLI-MAC, the differences in the throughputs of the individual nodes are significantly smaller. In each of the three plots in Fig. 5, the loads from node-1 (g_1) and node-2 (g_2) are kept fixed at different values, and the node-level throughput variations are observed for a varying load from node-3 (g_3). These represent the scenarios g_1 ≤ ĝ, g_2 ≤ ĝ; g_1 ≤ ĝ, g_2 > ĝ (or g_1 > ĝ, g_2 ≤ ĝ); and g_1 > ĝ, g_2 > ĝ. It can be observed that with DRLI-MAC, the RL agents in the nodes learn to adjust the transmission probability such that the available wireless bandwidth is fairly distributed. Also notable is the fact that the DRLI-MAC logic can hold the optimal throughput for higher network loads, even under heterogeneous loading conditions. It is also to be noted that the learning relies only on localized network information, thus rendering the technique highly scalable.
Fig. 8. (a) Surface plot indicating the network throughput variation of ALOHA with different g_1 (= g_4) and g_2 (= g_3) combinations, (b) ALOHA throughput with fair bandwidth distribution, (c) Load vs Throughput of DRLI-MAC for a 4-node partially connected network; s_i, g_i are the throughputs and loads respectively for node i, S is the network throughput, ŝ_i is the maximum ALOHA individual throughput with fair bandwidth distribution.
negligible. However, this is not a desirable behavior, since in addition to maximizing the network throughput, the objective is to distribute the throughput fairly. The dashed line in this 3-D plot indicates the situation in which the throughput is fairly distributed among the nodes. A projection of this line on the g_1 (= g_4) plane is shown in Fig. 8(b). From these plots, it can be observed that the desired behavior (maximum throughput with fair bandwidth distribution) is achieved when g_1 = g_4 ≈ 0.2 and g_2 = g_3 ≈ 0.3. The optimum individual node throughput in this situation is ŝ_i = 0.058.

It can be observed from Fig. 8(c) that for regular ALOHA, the node throughputs not only reduce at higher loads, but they are also unfairly distributed. Because of higher exposure to collisions (i.e., due to higher topology degrees), nodes 2 and 3 have lower throughputs compared to nodes 1 and 4. In addition to addressing this fairness issue, DRLI-MAC is able to achieve the optimal throughput (ŝ_i) obtained by the known benchmark protocol ALOHA, as shown in Fig. 8(b). The RL framework also allows the nodes to hold the optimal throughput for higher traffic in the network.
Fig. 11: Performance of DRLI-MAC in large networks with (a) 8 nodes and (b) 12 nodes of arbitrary topology
network-level information and the transmission strategies of all the other nodes. This makes the proposed paradigm generalized for networks with arbitrary mesh topologies. Future work on this topic will include: a) making the learning generalized to incorporate stateless RL so that the system does not degrade for dense networks with large node degrees, b) incorporating carrier sensing in the learning for synthesizing a robust CSMA family of protocols with better fairness and the ability to handle different types of heterogeneity compared to the existing hand-crafted CSMA protocols, and c) incorporating time-slotting to cater to a broader set of network applications that are sensitive to non-throughput performance indices, including energy and delay.

VIII. REFERENCES
1. Dutta, Hrishikesh, and Subir Biswas. "Towards Multi-agent
Reinforcement Learning for Wireless Network Protocol Synthesis." 2021
International Conference on COMmunication Systems & NETworkS
(COMSNETS). IEEE, 2021.
2. White III, Chelsea C., and Douglas J. White. "Markov decision
processes." European Journal of Operational Research 39.1 (1989): 1-16.
3. Leon-Garcia, Alberto. "Probability, statistics, and random processes for
electrical engineering." (2017).
4. Van Otterlo, Martijn, and Marco Wiering. "Reinforcement learning and
markov decision processes." Reinforcement Learning. Springer, Berlin,
Heidelberg, 2012. 3-42.
5. Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8.3-4 (1992): 279-292.
6. Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat.
"Hysteretic q-learning: an algorithm for decentralized reinforcement
learning in cooperative multi-agent teams." 2007 IEEE/RSJ International
Conference on Intelligent Robots and Systems. IEEE, 2007.
7. Xu, Yi-Han, et al. "Reinforcement Learning (RL)-based energy efficient
resource allocation for energy harvesting-powered wireless body area
network." Sensors 20.1 (2020): 44.
8. Pi, Xurong, Yueming Cai, and Guoliang Yao. "An energy-efficient
cooperative MAC protocol with piggybacking for wireless sensor
networks." 2011 International Conference on Wireless Communications
and Signal Processing (WCSP). IEEE, 2011.
9. Cassandra, Anthony R., Leslie Pack Kaelbling, and Michael L. Littman.
"Acting optimally in partially observable stochastic domains." Aaai. Vol.
94. 1994.
10. Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra.
"Planning and acting in partially observable stochastic domains."
Artificial intelligence 101.1-2 (1998): 99-134.
11. Cassandra, Anthony R., Leslie Pack Kaelbling, and Michael L. Littman.
"Acting optimally in partially observable stochastic domains." Aaai. Vol.
94. 1994.
12. Modiano, Eytan. "A novel architecture and medium access control
protocol for WDM networks." Optical Networks and Their Applications.
Optical Society of America, 1998.
13. Y. Chu, P. D. Mitchell and D. Grace, "ALOHA and Q-Learning based
medium access control for Wireless Sensor Networks," 2012 International
Symposium on Wireless Communication Systems (ISWCS), Paris, 2012,
pp. 511-515, doi: 10.1109/ISWCS.2012.6328420.
14. Savaglio, Claudio, et al. "Lightweight reinforcement learning for energy
efficient communications in wireless sensor networks." IEEE Access 7
(2019): 29355-29364.
15. S. H. Park, P. D. Mitchell and D. Grace, "Reinforcement Learning Based
MAC Protocol (UW-ALOHA-Q) for Underwater Acoustic Sensor
Networks," in IEEE Access, vol. 7, pp. 165531-165542, 2019, doi:
10.1109/ACCESS.2019.2953801.
16. S. H. Park, P. D. Mitchell and D. Grace, "Performance of the ALOHA-Q
MAC Protocol for Underwater Acoustic Networks," 2018 International
Conference on Computing, Electronics & Communications Engineering
(iCCECE), Southend, United Kingdom, 2018, pp. 189-194, doi:
10.1109/iCCECOME.2018.8658631.
17. Lee, Taegyeom, and Ohyun Jo Shin. "CoRL: Collaborative
Reinforcement Learning-Based MAC Protocol for IoT Networks."
Electronics 9.1 (2020): 143.
18. Yu, Yiding, Taotao Wang, and Soung Chang Liew. "Deep-reinforcement learning multiple access for heterogeneous wireless networks." IEEE Journal on Selected Areas in Communications 37.6 (2019): 1277-1290.
19. Ali, Rashid, et al. "(ReLBT): A Reinforcement learning-enabled listen before talk mechanism for LTE-LAA and Wi-Fi coexistence in IoT." Computer Communications 150 (2020): 498-505.
20. Park, Huiung, et al. "Multi-Agent Reinforcement-Learning-Based Time-Slotted Channel Hopping Medium Access Control Scheduling Scheme." IEEE Access 8 (2020): 139727-139736.
21. Liu, Zhenzhen, and I. Elhanany. "RL-MAC: A QoS-Aware Reinforcement Learning based MAC Protocol for Wireless Sensor Networks." 2006 IEEE International Conference on Networking, Sensing and Control, Ft. Lauderdale, FL, 2006, pp. 768-773, doi: 10.1109/ICNSC.2006.1673243.