CoRide: Joint Order Dispatching and Fleet Management
for Multi-Scale Ride-Hailing Platforms
Jiarui Jin1 , Ming Zhou1 , Weinan Zhang1 , Minne Li2 , Zilong Guo1 , Zhiwei Qin3 , Yan Jiao3 ,
Xiaocheng Tang3 , Chenxi Wang3 , Jun Wang2 , Guobin Wu4 , Jieping Ye3
1 Shanghai Jiao Tong University, 2 University College London, 3 DiDi AI Labs, 4 DiDi Research
{jinjiarui97,mingak,wnzhang,gzlong}@sjtu.edu.cn,{minne.li,jun.wang}@cs.ucl.ac.uk,{qinzhiwei,yanjiao,
tangxiaocheng,wangchenxi,wuguobin,yejieping}@didiglobal.com
ABSTRACT
How to optimally dispatch orders to vehicles and how to trade
off between immediate and future returns are fundamental questions for a typical ride-hailing platform. We model ride-hailing as a
large-scale parallel ranking problem and study the joint decision-making task of order dispatching and fleet management in online
ride-hailing platforms. This task brings unique challenges in the
following four aspects. First, to enable a huge number of vehicles
to act and learn efficiently and robustly, we treat each region cell
as an agent and build a multi-agent reinforcement learning framework. Second, to coordinate the agents from different regions to
achieve long-term benefits, we leverage the geographical hierarchy
of the region grids to perform hierarchical reinforcement learning.
Third, to deal with the heterogeneous and variant action space
for joint order dispatching and fleet management, we design the
action as the ranking weight vector to rank and select the specific
order or the fleet management destination in a unified formulation.
Fourth, to support a multi-scale ride-hailing platform, we conduct the decision-making process in a hierarchical way, where a multi-head attention mechanism is utilized to incorporate the impacts of neighbor agents and capture the key agent in each scale. We name the resulting framework CoRide. Extensive experiments based on real-world data from multiple cities, as well as analytic synthetic data, demonstrate that CoRide provides superior performance in terms of platform revenue and user experience in the task of city-wide hybrid order dispatching and fleet management over strong
baselines.
CCS CONCEPTS
· Computing methodologies → Multi-agent reinforcement
learning; · Theory of computation → Multi-agent reinforcement
learning; · Applied computing → Transportation.
KEYWORDS
Hierarchical Reinforcement Learning; Multi-agent Reinforcement
Learning; Ride-Hailing; Order Dispatching; Fleet Management
CIKM '19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11. . . $15.00
https://doi.org/10.1145/3357384.3357978
ACM Reference Format:
Jiarui Jin, Ming Zhou, Weinan Zhang, Minne Li, Zilong Guo, Zhiwei Qin,
Yan Jiao, Xiaocheng Tang, Chenxi Wang, Jun Wang, Guobin Wu and Jieping
Ye. 2019. CoRide: Joint Order Dispatching and Fleet Management for Multi-Scale Ride-Hailing Platforms. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3357384.3357978
1 INTRODUCTION
Online ride-hailing platforms such as Uber and Didi Chuxing have
substantially transformed our lives by sharing and reallocating
transportation resources to greatly improve transportation efficiency. In a general view, there are two major decision-making
tasks for such ride-hailing platforms, namely (i) order dispatching,
i.e., to match the orders and vehicles in real time to directly deliver
the service to the users [24, 43, 45], and (ii) fleet management, i.e.,
to reposition the vehicles to certain areas in advance to prepare for
the later order dispatching [15, 21, 26].
Apparently, the decision-making of matching an order-vehicle
pair or repositioning a vehicle to an area needs to account for the
future situation of the vehicle’s location and the orders nearby.
Thus, much prior work has modeled order dispatching and fleet management as a sequential decision-making problem and solved it with reinforcement learning (RL) [15, 30, 36, 39]. Most previous work deals with either order dispatching or fleet management without considering the high correlation of these two tasks, especially for large-scale ride-hailing platforms in large cities, which leads to sub-optimal performance. In order to approach near-optimal performance, inspired by thermodynamics, we view the whole ride-hailing platform in terms of dispatch (order dispatching) and reposition (fleet management). As illustrated in Figure 1, we regard vehicles and orders as different kinds of molecules and aim to build up system stability by reducing their numbers through dispatch and reposition. To address this complex objective, we provide two novel views: (i) interconnecting order dispatching and fleet management, and (ii) jointly considering intra-district (grid-level) and inter-district (district-level) allocation. With such a practical motivation, we focus on modeling joint order dispatching and fleet management with a multi-scale decision-making system. There are several significant technical challenges in learning a highly efficient allocation policy for a real-time ride-hailing platform:
Figure 1: Ride-hailing task in thermodynamics view.

Large-scale Agents. A fundamental question in any ride-hailing system is how to deal with a large number of orders and vehicles. One alternative is to model each available vehicle as an agent [21, 37, 39]. However, such a setting needs to maintain thousands of
agents interacting with the environment, which brings a huge
computational cost. Instead, we utilize the region grid world (as
will be further discussed in Figure 2), which regards each region as
an agent, and naturally models the ride-hailing system in a hierarchical
learning setting. This formulation allows decentralized learning
and control with distributed implementation.
Immediate & Future Rewards. A key challenge in seeking an
optimal control policy is to find a trade-off between immediate
and future rewards in terms of accumulated driver income (ADI).
Greedily matching vehicles with long-distance orders can receive
high immediate gain at a single order dispatching stage, but it
would harm order response rate (ORR) and future revenue especially
during rush hour because of its long drive time and unpopular
destination. Recent attempts [21, 37, 39] deployed RL to combine
instant order reward from online planning with future state-value
as the final matching value. However, the coordination between
different regions is still far from optimal. Inspired by hierarchical RL
[34], we introduce the geographical hierarchical structure of region
agents. We treat each large district as a manager agent and each small grid as a worker agent. The manager operates at a lower spatial and temporal resolution and sets abstract goals which are conveyed to its workers. The worker takes specific actions and interacts with the environment, coordinated by the manager-level goal and worker-level messages. This decoupled structure facilitates very long timescale credit assignment [34] and helps balance immediate and future revenue.
Heterogeneous & Variant Action Space. Traditional RL models
require a fixed action space [17]. If we model picking an order as
an RL action, there is no guarantee of a fixed action space as the
available orders keep changing. Zhang et al. [43] proposed to learn
a state-action value function to evaluate each valid order-vehicle
match, and then use a combinatorial optimization method such as the Kuhn-Munkres (KM) algorithm [18] to filter the matches. However, such
a method faces another important challenge that order dispatching
and fleet management are different tasks, which results in heterogeneous action spaces. To address this issue, we redefine action as
the weight vector for ranking orders and fleet management, where
the fleet controls are regarded as fake orders, and all the orders are
ranked and matched with vehicles in each agent. Thus, it bypasses
the issue of heterogeneous and variant action space as well as high
computational costs.
Multi-Scale Ride-Hailing. Xu et al. [39] introduced a policy-evaluation-based RL method to learn the dynamics for each grid. As their results show, orders and vehicles often concentrate in different districts (e.g. uptown and downtown in Figure 1). How to combine large hotspots in the city (inter-district) with small ones within districts (intra-district) is another challenge that has received little attention. In order to take both inter-district and intra-district allocation into consideration, we adopt and extend the attention mechanism in a hierarchical way (as will be further discussed in Figure 3). Compared with learning a value function for each grid homogeneously [39],
this attention-based structure can not only capture the impacts
of neighbor agents, but also learn to distinguish the key grid and
district in worker (grid) and manager (district) scales respectively.
Wrapping all modules together, we propose CoRide, a hierarchical multi-agent reinforcement learning framework to resolve the
aforementioned challenges. The main contributions are listed as
follows:
• We propose a novel framework that learns to collaborate in a hierarchical multi-agent setting for ride-hailing platforms.
• We conduct extensive experiments based on real-world data of
multiple cities, as well as analytic synthetic data, which demonstrate that CoRide provides superior performance in terms of ADI
and ORR in the task of city-wide hybrid order dispatching and
fleet management over strong baselines.
• To the best of our knowledge, CoRide is the first work (i) to apply
the hierarchical reinforcement learning on ride-hailing platform;
(ii) to address the task of joint order dispatching and fleet management of online ride-hailing platforms; (iii) to introduce and study
multi-scale ride-hailing task.
This structure conveys several benefits: (i) In addition to balancing long-term and short-term reward, it also facilitates adaptation
in a dynamic real-world situation by assigning different goals to each worker. (ii) Instead of considering all of the matches between available orders and vehicles globally, these tasks are distributed to each
worker and manager agent and fulfilled in a parallel way.
2 RELATED WORK
Decision-making for Ride-hailing. Order dispatching and fleet
management are two major decision-making tasks for online ride-hailing platforms, which have attracted much attention from academia and industry in recent years.
To tackle these challenging transportation problems, rule-based approaches have addressed the order dispatching problem in either centralized or decentralized settings. To improve global
performance, Zhang et al. [43] proposed a novel model based on
centralized combinatorial optimization by concurrently matching
multiple vehicle-order pairs within a short time window. However,
this approach needs to compute all available vehicle-order matches
and requires feature engineering, which would be infeasible and prevents it from being adopted in the large-scale taxi-order dispatching situation. In the decentralized setting, Seow et al. [24] addressed
this problem with a collaborative multi-agent taxi dispatching system. However, this method requires rounds of direct communications between agents, so it is limited to a local area with a small
number of vehicles.
Instead of rule-based approaches, which require additional hand-crafted heuristics, the current trend is to incorporate reinforcement learning algorithms into complicated traffic management problems. Xu et al. [39] proposed a learning and planning
method based on reinforcement learning to optimize vehicle utilization and user experience in a global and more farsighted view. In
[21], the authors leveraged the graph structure of the road network
and expanded distributed DQN formulation to maximize entropy
in the agents’ learning policy with soft Q-learning, to improve
performance of fleet management. Wei et al. [37] introduced a reinforcement learning method, which takes the uncertainty of future
requests into account and can make a look-ahead decision to help
the operator improve the global level-of-service of a shared-vehicle
system through fleet management. To capture the complicated stochastic demand-supply variations in high-dimensional space, Lin
et al. [15] proposed a contextual multi-agent actor-critic framework
to achieve explicit coordination among a large number of agents
adaptive to different contexts in the fleet management system.
Different from all aforementioned methods, our approach is the
first, to the best of our knowledge, to consider the joint modeling of
order dispatching and fleet management and also the only current
work introducing and studying the multi-scale ride-hailing task.
Hierarchical Reinforcement Learning. Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional
reinforcement learning (RL) methods to solve tasks with long-term
dependency or multi-level interaction patterns [5, 6]. Recent works
have suggested that several interesting and standout results can be
induced by training multi-level hierarchical policy in a multi-task
setup [8, 25] or implementing hierarchical setting in sparse reward
problems [23, 34].
The options framework [22, 28, 29] formulates the problem with a two-level hierarchy, where the low level, the option, is a sub-policy with a termination condition. Since the traditional options framework relies on prior knowledge for designing options, jointly learning the high-level policy with the low-level policy has been proposed by [2]. However, this actor-critic HRL approach needs to learn either sub-policies for each time step or one sub-policy for the whole episode. Therefore, the performance of the whole module is often contingent on learning useful sub-policies. To guarantee effective sub-policies, one alternative direction is to provide auxiliary rewards for the low-level policy: hand-designed rewards based on prior domain knowledge [11, 31] or mutual information [4, 7, 10]. Given that access to a well-designed and suitable reward is often a luxury, Vezhnevets et al. [34] proposed FeUdal Networks (FuN), where a generic reward is utilized for low-level policy learning, thus avoiding hand-crafted reward design. Several works extend and improve FuN with off-policy training [19], a form of hindsight [12], and representation learning [20].
Our work is also developed from FuN [34], originally inspired by
feudal RL [5]. FuN employs only one pair of manager and worker
and connects them with a parameterized goal and intrinsic reward. Instead, we model CoRide with multiple managers. Unlike
our method, in FuN the manager and worker modules are paired one-to-one, share the same observation, and operate at different temporal but the same spatial resolution. In CoRide, there are multiple workers learning to collaborate under one manager while the managers are also coordinating. The manager takes a joint observation of all workers, and each worker produces its action based on its specific observation and the shared goal. In other words, FuN [34] is actually a special case of CoRide, where a single manager along with its only worker is employed. Building on this one-to-many setting, the manager can not only operate with long-timescale credit assignment but also act at a lower spatial resolution. Recently, Ahilan and Dayan [1] introduced a novel architecture named FMH for cooperation in multi-agent RL. Different from that method, CoRide not only extends the scale of the multi-agent environment but also facilitates communication through a multi-head attention mechanism, which computes the influence of interactions and differentiates the impact on each agent. Yet, the majority of current HRL methods
require careful task-specific design, making them difficult to apply
in real-world scenarios [19]. To the best of our knowledge, CoRide
is the first work to apply hierarchical reinforcement learning on
the ride-hailing problem.
3 PROBLEM FORMULATION
We formulate the problem of controlling large-scale homogeneous
vehicles in online ride-hailing platforms, which combines the order dispatching system with the fleet management system, with the goal of
maximizing the city-level ADI and ORR. In practice, vehicles are
divided into two groups: order dispatching (OD) group and fleet
management (FM) group. For OD group, we match these vehicles
with available orders pair-wise; whereas for the FM group, we need to reposition them to certain locations or dispatch orders to them (the same as for the
OD group). The illustration of the problem is shown in Figure 2. We
use the hexagonal-grid world to represent the map and take a grid as
an agent. Considering that only orders within the pick-up distance
can be dispatched to a vehicle, we set the distance between grids based on the pick-up distance. In our setting, vehicles in the same spatial-temporal node are homogeneous, i.e., vehicles located at the same grid share the same setting. As such, we can model
order dispatching as a large-scale parallel ranking task, where we
rank orders and match them with homogeneous vehicles in each
grid. The fleet control for fleet management, i.e. repositioning the
vehicle to neighbor grids or staying at the current grid, is treated as
fake orders (as will be further discussed in Section 6) and conducted
in the ranking procedure as same as order dispatching.
Since each agent can only reposition vehicles located in the grid it manages, we propose to formulate the problem as a Partially Observable Markov Decision Process (POMDP) [27] in a hierarchical
multi-agent reinforcement learning setting for both order dispatching and fleet management. Thus, we decompose the original complicated tasks into many local ones and transform a high-dimensional
problem into multiple low-dimensional problems.
Formally, we model this task as a Markov game G for N agents,
which is defined by a tuple G = (N , S, A, P, R, γ ), where N , S,
A, P, R, γ are the number of agents, set of states, set of actions,
state transition probability, reward function, and a future reward
discount factor, respectively. The definitions of major components
are as follows.
Agent. We consider each region cell as an agent identified by i ∈ I,
where I = {i | i = 1, . . . , N }. In detail, a single grid represents
a worker agent, and a district containing multiple grids represents a manager agent. An example is presented in Figure 2. Each individual grid serves as a worker agent, and a grid together with its 6 neighbor grids, shown in the same color, composes a manager agent. Note that although
the number of vehicles and orders varies over time, the number of
agents is fixed.
State st ∈ S, Observation ot ∈ O. Although there are two different types of agents - manager and worker, their observations
only differ in scale. Observation of each manager is actually a joint
observation of its workers. At each timestep t, agent i draws private
observations o_t^i ∈ O correlated with the environment state s_t ∈ S. In our setting, the state input used in our method is expressed as S = ⟨N_v, N_o, E, N_f, D_o⟩, where the inner elements represent the number of vehicles, the number of orders, the entropy, the number of vehicles in the FM group, and the distribution of order features (e.g. price, duration) in the current grid, respectively. Note that both dispatching and repositioning are resource allocation processes similar to a thermodynamic
system (Figure 1), and once order dispatching or fleet management
occurs, dispatched or fleeted items slip out of the system. Namely,
only idle vehicles and available orders can contribute to disorder
and unevenness of the ride-hailing system. Therefore, we introduce
and extend the concept of entropy here, defined as:
E = -k_B \sum_i \rho_i \log \rho_i := -k_B \cdot \rho_0 \log \rho_0 \quad (1)
where k_B is a Boltzmann constant and ρ_i denotes the probability of each state: ρ_1 for dispatched and fleeted items, ρ_0 otherwise. As analyzed above, once an order and a vehicle are combined into an order-vehicle pair, both leave the ride-hailing platform. Therefore, we choose to ignore items at state 1 (ρ_1) and compute ρ_0 as the proportion of available order-vehicle pairs among all potential pairs:
\rho_0 = \frac{\min(N_v, N_o) \times \min(N_v, N_o)}{N_v \times N_o} = \frac{N_v \times N_v}{N_v \times N_o} = \frac{N_v}{N_o} \quad (2)

which holds in the N_v < N_o case and is straightforward to transform to other cases.
Action a_t ∈ A, State Transition P. In our hierarchical RL setting, the manager's action is to generate an abstract, intrinsic goal for its workers, and each worker needs to provide a ranking list of relevant real orders (OD) and fleet controls served as fake orders (FM). Thus, the action of a worker is defined as the weight vector for the ranking features; changing a worker's action means changing the weight vector for the ranking features (as will be further discussed in Section 6). At each timestep the whole multi-agent system produces a joint action over all managers and workers a_t ∈ A, where A := A_1 × ... × A_N, which induces a transition in the environment according to the state transition P(s_{t+1} | s_t, a_t).
Reward R. Like previous hierarchical RL settings [34], only the manager interacts with the environment and receives feedback from it. This extrinsic reward function determines the direction of optimization and is proportional to both immediate profit and potential value, while the intrinsic reward is set to encourage the worker to follow the instruction from its manager. The details will be further discussed in Eq. (4) and Eq. (6).

Figure 2: Illustration of the grid world and problem setting.
More specifically, we give a simple example based on the above
problem setting in Figure 2. At time t = 0, the worker agent 0 ranks
available real orders and potential fake orders for fleet control, and
selects the Selected-2 (as will be further discussed in Eq. (9)) options:
a real order from grid 0 to grid 17, a fake order from grid 0 to grid
5. After the driver finishes, the manager agent whose sub-workers include worker agent 0 receives the corresponding reward.
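To make the state definition concrete, the following minimal Python sketch computes the entropy-based observation of a single grid agent as defined in Eq. (1)-(2). The function names and the unit value chosen for k_B are illustrative assumptions rather than details from the paper.

```python
import math

K_B = 1.0  # Boltzmann-like constant k_B in Eq. (1); treated as a unit scale here


def grid_entropy(num_vehicles: int, num_orders: int) -> float:
    """Entropy of one grid following Eq. (1)-(2): only idle vehicles and available
    orders contribute, with rho_0 = min(Nv, No)^2 / (Nv * No) (= Nv/No when Nv < No)."""
    if num_vehicles == 0 or num_orders == 0:
        return 0.0
    rho_0 = (min(num_vehicles, num_orders) ** 2) / (num_vehicles * num_orders)
    return -K_B * rho_0 * math.log(rho_0)


def grid_state(num_vehicles, num_orders, num_fm_vehicles, order_feature_dist):
    """Worker-level state S = <Nv, No, E, Nf, Do> for one grid agent."""
    return (num_vehicles, num_orders,
            grid_entropy(num_vehicles, num_orders),
            num_fm_vehicles, order_feature_dist)
```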
4 METHODOLOGIES
4.1 Overall Architecture
As shown in Figure 3, CoRide employs two layers of agents, namely
the layer of manager agents and the layer of worker agents. Each
agent is associated with a communication component for exchanging messages. As both the agents and the decision-making process are organized hierarchically, the multi-head attention mechanism serving for communication is extended into a multi-layer setting. Compared with the traditional one-to-one manager-worker control in hierarchical RL [34], we design a one-to-many manager-worker pattern, extend the scale, and learn to collaborate on two layers of
agents. The manager internally computes a latent state representation h_t^M as an input to the manager-level attention network, and outputs a goal vector g_t. The worker produces its action and the input for the worker-level attention conditioned on its private observation o_t^W, peer message m_{t-1}^W, and the manager's goal g_t. The manager-level and worker-level attention networks share the same architecture introduced in Section 4.4. The details and training procedures for the manager and worker modules are given in the following parts.

Figure 3: Overall architecture of CoRide.
4.2 Manager Module

Figure 4: Manager Module.

The architecture of the manager module is presented in Figure 4. The manager network is a two-layer perceptron (MLP) and a dilated RNN [34]. Note that the structure of CoRide and the formulation of the RNN enable the manager to operate both at a lower spatial resolution, by taking the joint observation of its workers, and at a lower temporal resolution, via the dilated convolutional network [41]. At timestep t, the agent receives an observation o_t^M from the environment and feeds it into the dilated RNN together with the peer messages m_{t-1}^M. The goal g_t and the input for the manager-level attention h_t^M are generated as the output of the RNN, governed by the following equations:

h_t^M, \hat{g}_t = \mathrm{RNN}(s_t^M, h_{t-1}^M; \theta_{rnn}); \quad g_t = \hat{g}_t / \|\hat{g}_t\| \quad (3)
where θ_rnn denotes the parameters of the RNN network. The environment responds with a new observation o_{t+1}^M and a scalar reward r_t. The goal of the agent is to maximize the discounted return R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}^M with γ ∈ [0, 1]. Specifically, in the ride-hailing setting, we design our global reward to take both ADI and ORR into account, which can be formulated as:

r_t^M = r_{ADI} + r_{ORR} \quad (4)

where r_ADI denotes the accumulated driver income, computed according to the price of each served order, while r_ORR encourages a higher ORR and is calculated from several correlated factors as:

r_{ORR} = \sum_{grid} (E - \bar{E})^2 + \sum_{area} D_{KL}(P_t^o \,\|\, P_t^v) \quad (5)

where E and Ē are the manager's entropy and the global average entropy, respectively. An area, different from a grid, often means a certain region which needs extra attention; in our experiments, we select several grids whose entropy differs greatly from the average as areas. D_{KL}(P_t^o \| P_t^v) denotes the Kullback-Leibler (KL) divergence, which measures the margin between the vehicle and order distributions of a certain area at timestep t. P_t^o and P_t^v are realized with Poisson distributions, a common choice for vehicle routing [9] and arrival [16]. In practice, the distribution parameters can be estimated from real trip data using the mean and standard deviation of orders and vehicles in each grid at each timestep. Such a combined ORR reward design helps optimization both globally and locally.
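As a concrete illustration of Eq. (5), the sketch below computes the ORR-oriented reward from per-grid entropies and per-area Poisson rate estimates. It assumes the Poisson rates for orders and vehicles have already been fitted from trip data, and uses the closed-form KL divergence between two Poisson distributions; the helper names are ours, not from the paper.

```python
import math


def poisson_kl(lam_o: float, lam_v: float) -> float:
    """Closed-form KL divergence D_KL(Poisson(lam_o) || Poisson(lam_v))."""
    return lam_o * math.log(lam_o / lam_v) + lam_v - lam_o


def orr_reward(grid_entropies, area_order_rates, area_vehicle_rates):
    """Sketch of Eq. (5): squared deviation of each grid's entropy from the global
    average, plus KL(P_t^o || P_t^v) summed over the selected areas."""
    e_bar = sum(grid_entropies) / len(grid_entropies)   # global average entropy E_bar
    entropy_term = sum((e - e_bar) ** 2 for e in grid_entropies)
    kl_term = sum(poisson_kl(lo, lv)
                  for lo, lv in zip(area_order_rates, area_vehicle_rates))
    return entropy_term + kl_term
```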
4.3 Worker Module

Figure 5: Worker Module.

We adopt the goal embedding from FeUdal Networks (FuN) [34] in our worker framework (see Figure 5), where w_t is generated as a goal-embedding vector via linear projection. At each timestep t, the agent receives an observation o_t^W from the environment and feeds it into a regular RNN together with the peer message m_{t-1}^W. As Figure 5 shows, the output of the RNN, u_t^W, together with w_t generates the primitive action, i.e., the ranking weight vector ω_t.

Given that the worker needs to be encouraged to follow the goal generated by its manager, we adopt the intrinsic reward proposed by [34], defined as:

r_t^W = \frac{1}{c} \sum_{i=1}^{c} d_{cos}(o_t^W - o_{t-i}^W, g_{t-i}) \quad (6)

where d_{cos}(α, β) = α^T β / (|α| · |β|) is the cosine similarity between two vectors. Notice that g_t now represents an advantageous direction in the latent state space at a horizon c [34]. Such an intrinsic reward design provides a directional shift for the workers to follow.

Different from traditional FuN [34], the procedure by which our worker module produces an action consists of two steps: (i) parameter generating, and (ii) action generating, inspired by [44]. In the parameter generating step, we utilize a state-specific scoring function f_{θ^W} to map the current state o_t^W to a list of weight vectors ω_t as:

f_{\theta^W}: o_t^W \rightarrow \omega_t, h_t^W \quad (7)

which is computed by the neural network shown in Figure 5. In the action generating step, noting that it is straightforward to extend linear relations with non-linear ones, we combine the scoring function parameter ω_t and the ranking feature e_i of order i as:

score_i = \omega_t^T e_i \quad (8)

The detailed formulation of e_i will be discussed in Section 5. We then build the item space I by adding real orders as well as potential fleet controls, i.e., repositioning to a neighbor grid or staying at the current grid, as fake orders. After computing scores for all available options in I, instead of directly ranking and selecting the Top-k items for order dispatching and fleet management, we adopt a Boltzmann softmax selector to generate the Selected-k items:

\text{Selected-}k = \frac{\exp(score_i / \tau)}{\sum_{i=1}^{M} \exp(score_i / \tau)} \quad (9)

where k = min(N_v, N_o), τ denotes a temperature hyper-parameter to control the exploration rate, and M is the number of scored order candidates. In practice, we set the initial temperature to 1.0, then gradually reduce it to 0.01 to limit exploration. This approach not only equips the action selection procedure with controllable exploration but also diversifies the policy's decisions, avoiding groups of vehicles being fleeted to the same destination.
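The worker's action generation of Eqs. (8)-(9) can be sketched as below: every real or fake order is scored by the weight vector, and k = min(N_v, N_o) items are drawn from the Boltzmann softmax. Treating Selected-k as sampling without replacement, and the NumPy-based formulation itself, are our assumptions for illustration.

```python
import numpy as np


def selected_k(omega, item_features, num_vehicles, num_orders, tau=1.0, rng=None):
    """Score each item in the item space (Eq. 8) and sample k = min(Nv, No) of them
    from the Boltzmann softmax over scores (Eq. 9), without replacement."""
    rng = rng or np.random.default_rng()
    scores = item_features @ omega              # score_i = omega^T e_i
    probs = np.exp((scores - scores.max()) / tau)
    probs /= probs.sum()                        # Boltzmann softmax with temperature tau
    k = min(num_vehicles, num_orders, len(scores))
    return rng.choice(len(scores), size=k, replace=False, p=probs)


# omega comes from the worker network; each row of item_features is a ranking
# feature e_i (origin grid, destination grid, price, duration, real-or-fake flag).
```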
Algorithm 1 CoRide for joint multi-scale OD & FM
Require: current observations o_t^M, o_t^W; mutual communication messages m_{t-1}^M, m_{t-1}^W.
1: for each manager in the grid world do
2:     Generate g_t, h_t^M according to Eq. (3).
3:     for each worker of the manager do
4:         Generate ω_t, h_t^W according to Eq. (7).
5:         Add orders and fleet control items to item space I.
6:         Rank items in I according to Eq. (8).
7:         Generate Selected-k items according to Eq. (9).
8:     end for
9:     The worker-level attention mechanism generates m_t^W according to Eq. (12).
10:    The manager receives extrinsic reward r_t^M, and its workers receive intrinsic reward r_t^W according to Eq. (4) and Eq. (6), respectively.
11: end for
12: The manager-level attention mechanism generates m_t^M according to Eq. (12).
13: Update parameters according to Algorithm 2.
4.4 Multi-head Attention for Coordination

Note that the manager and the worker share the same multi-head attention setting, so the agent in this subsection can refer to either of them. We use h_{t-1}^i to denote the cooperating information for the i-th agent generated by its RNN at timestep t-1, and extend the self-attention mechanism to learn to evaluate each available interaction as:

h_{t-1}^{ij} = (h_{t-1}^i W_T) \cdot (h_{t-1}^j W_S)^T \quad (10)

where h_{t-1}^i W_T and h_{t-1}^j W_S are embeddings of the messages from the target agent and the source agent, respectively. We can regard h_{t-1}^{ij} as the value of the communication between the i-th agent and the j-th agent. To retrieve a general attention value between source and target agents, we further normalize this value within the neighborhood scope as:

\alpha_{t-1}^{ij} = \mathrm{softmax}(h_{t-1}^{ij}) = \frac{\exp(h_{t-1}^{ij}/\iota)}{\sum_{j \in \mathcal{N}_i} \exp(h_{t-1}^{ij}/\iota)} \quad (11)

where N_i is the neighborhood scope, i.e., the set of agents available for communication with the target agent, and ι denotes a temperature factor. To jointly attend to the neighborhood from different representation subspaces at different grids, we leverage multi-head attention as in previous work [32, 33, 38, 42] to extend the observation as:

m_t^i = \sigma\Big(W_q \cdot \frac{1}{H}\sum_{n=1}^{H}\sum_{j \in \mathcal{N}_i} \alpha_{t-1}^{ijn}(h_{t-1}^j W_C^n) + b_q\Big) \quad (12)

where H is the number of attention heads, and W_T, W_S, W_C are multiple sets of trainable parameters. Thus, the peer message m_t is generated and will be fed into the corresponding module to produce the cooperative information h_t. We present the overall CoRide procedure for joint order dispatching and fleet management in Algorithm 1.
4.5 Training

As described in Algorithm 1, managers generate specific goals based on their observations and peer messages (line 2). The workers under each manager generate the weight vectors according to their private observations and the shared goal (line 4). We then build a general item space I for order dispatching and fleet management (line 5), and rank the items in I (line 6). Considering that our action is conditioned on the minimum of the number of vehicles and the number of orders, we generate Selected-k items as the final action (line 7).

We extend the learning approaches of FuN [34] and HIRO [19] to train the manager and worker modules in a similar way. In CoRide, we utilize the DDPG algorithm [14] to train the parameters of both the manager and worker modules, for the following reasons.
Algorithm 2 Parameters Training with DDPG
1: Randomly initialize the critic network Q(m_{t-1}, o_t, a_t | θ^Q) and actor µ(m_{t-1}, o_t | θ^µ) with weights θ^Q and θ^µ.
2: Initialize target networks Q' and µ' with weights θ^{Q'} ← θ^Q, θ^{µ'} ← θ^µ.
3: Initialize replay buffer R.
4: for each training episode do
5:     for agent i = 1 to M do
6:         m_0 = initial message, t = 1.
7:         while t < T and o_t ≠ terminal do
8:             Select the action a_t = µ_t(m_{t-1}, o_t | θ^µ) for the active agent;
9:             Receive reward r_t and new observation o_{t+1};
10:            Generate message m_t = Attention(h_{t-1}^0, h_{t-1}^1, ..., h_{t-1}^K), where h_{t-1}^k is the latent vector in the RNN and K denotes the number of neighboring agents;
11:        end while
12:        Store episode {m_0, o_1, a_1, r_1, m_1, o_2, a_2, r_2, ...} in R.
13:    end for
14:    Sample a random minibatch of transitions T: ⟨m_{t-1}, o_t, a_t, r_t, o_{t+1}⟩ from replay buffer R.
15:    for each transition T do
16:        Set y_t = r_t + γ Q'(m_t, o_{t+1}, µ'(m_t, o_{t+1} | θ^{µ'}) | θ^{Q'});
17:        Update the critic by minimizing the loss: L(θ^Q) = (y_t − Q(m_{t-1}, o_t, a_t | θ^Q))^2;
18:        Update the actor policy by maximizing the critic: J(θ^µ) = Q(m_{t-1}, o_t, a_t | θ^Q)|_{a=µ(m_{t-1}, o_t | θ^µ)};
19:        Update the communication component.
20:    end for
21: end for
Classically, the critic is designed to leverage an approximator to learn an action-value function Q(o_t, a_t) and to direct the actor in updating its parameters. The optimal action-value function Q^*(o_t, a_t) should follow the Bellman equation [3]:

Q^*(o_t, a_t) = \mathbb{E}_{o_{t+1}}\big[r_t + \gamma \max_{a_{t+1}} Q^*(o_{t+1}, a_{t+1}) \mid o_t, a_t\big] \quad (13)

which requires |A| evaluations to select the optimal action. This prevents Eq. (13) from being adopted in real-world scenarios, e.g. the ride-hailing setting, with enormous state and action spaces. However, the actor architecture proposed in Section 4.3 generates a deterministic action for the critic. Furthermore, Lillicrap et al. [14] proposed a flexible and practical method that uses an approximator function to estimate the action-value function, i.e. Q(o, a) ≈ Q(o, a; θ^µ). In practice, we leverage a DQN-style update: a neural network function approximator can be trained by minimizing a sequence of loss functions L(θ^µ):

L(\theta^\mu) = \mathbb{E}_{s_t, a_t, r_t, o_{t+1}}\big[(y_t - Q(o_t, a_t; \theta^\mu))^2\big] \quad (14)

where y_t = \mathbb{E}_{o_{t+1}}[r_t + \gamma Q'(o_{t+1}, a_{t+1}; \theta^{\mu'}) \mid o_t, a_t] is the target for the current iteration. According to the aforementioned analysis, the general training algorithm for the manager and worker modules is presented in Algorithm 2.
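A minimal PyTorch sketch of the resulting update (Algorithm 2, lines 16-18, and Eq. (14)) is given below. The module signatures that take the peer message as an extra input, and the optimizer handling, are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def ddpg_update(critic, critic_target, actor, actor_target,
                critic_opt, actor_opt, batch, gamma=0.99):
    """One DDPG step with message-augmented inputs. `batch` holds tensors
    (m_prev, o, a, r, m, o_next) sampled from the replay buffer."""
    m_prev, o, a, r, m, o_next = batch
    with torch.no_grad():
        a_next = actor_target(m, o_next)
        y = r + gamma * critic_target(m, o_next, a_next)         # target y_t
    critic_loss = F.mse_loss(critic(m_prev, o, a), y)             # Eq. (14)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(m_prev, o, actor(m_prev, o)).mean()      # maximize the critic
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```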
5 SIMULATOR
The trial-and-error nature of reinforcement learning requires a
dynamic simulation environment for training and evaluation. Thus,
we adopt and extend the grid-based simulator designed by Lin et al.
[15] to support joint order dispatching and fleet management.
5.1 Data Description
The real-world data provided by Didi Chuxing† includes order information and vehicle trajectories in the central areas of three large cities, with millions of orders over four consecutive weeks. The data of each day contains million-level orders and tens of thousands of vehicles in each city. The order information includes order price, origin, destination, and duration. The trajectories contain the positions (latitude and longitude) and status (on-line, off-line, on-service) of all vehicles every few seconds. As the radius of a grid is approximately 1.3 kilometers, the central area of each city is covered by a hexagonal grid world consisting of 182, 126, and 112 grids in the three cities, respectively. In order to adapt to the grid-based simulator, we use a unique gridID to represent position information.

† A similar dataset supported by Didi Chuxing can be found via the GAIA open dataset (https://outreach.didichuxing.com/research/opendata/en/).
5.2 Simulator Design
In the grid-based simulator, the city is covered by a hexagonal grid world as illustrated in Figure 2. At each timestep t, the simulator provides an observation o_t with a set of idle vehicles and a set of available orders, including real orders and the aforementioned fake orders for fleet control. All such fake orders share the same attributes as real orders, except that some attributes, such as price, are set to fixed values. All real orders are generated by bootstrapping from the real-world dataset introduced above. More specifically, supposing the current timestep of the simulator is t, we randomly sample real orders occurring in the same period, i.e. happening between t_∆ × t and t_∆ × (t + 1), where t_∆ denotes the timestep interval. In practice, we set the sampling rate to 100%. Like orders, vehicles are set online and offline
according to a distribution learned from the real-world dataset via maximum likelihood estimation. Each order feature, i.e. the ranking feature e_i in Eq. (8), includes the origin gridID, the destination gridID, price, duration, and the type of order indicating a real or fake order; each vehicle takes its gridID as a feature, and vehicles located at the same grid are regarded as homogeneous. Moreover, as the travel distance between neighboring grids is approximately 1.3 kilometers and the timestep interval t_∆ is 10 minutes, we assume that vehicles will not automatically move to other grids before taking another order. The ride-hailing platform then provides an optimal list of vehicle-order pairs according to the current policy. After receiving the list, the simulator returns a new observation o_{t+1} and a list of order fees. Based on this feedback, the reward r_t for each agent is calculated and the corresponding record (o_t, a_t, r_t, o_{t+1}) is stored into a replay buffer. The network parameters are then updated using a batch of samples from the replay buffer.
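The sketch below illustrates, under our own naming assumptions, how the fake orders for fleet control can mirror the real-order schema described above and how transition records are kept in a replay buffer; attribute values such as the fixed price and duration of fake orders are placeholders, since the paper only states that some attributes are held stationary.

```python
import random
from collections import deque


def fake_orders(grid_id, neighbor_ids, fixed_price=0.0, fixed_duration=1):
    """Fleet controls as fake orders sharing the real-order schema
    (origin_grid, destination_grid, price, duration, is_fake)."""
    reposition = [(grid_id, n, fixed_price, fixed_duration, True) for n in neighbor_ids]
    stay = [(grid_id, grid_id, fixed_price, fixed_duration, True)]
    return reposition + stay


class ReplayBuffer:
    """Stores (o_t, a_t, r_t, o_{t+1}) records returned by the simulator."""
    def __init__(self, capacity=100_000):
        self.records = deque(maxlen=capacity)

    def push(self, record):
        self.records.append(record)

    def sample(self, batch_size):
        return random.sample(self.records, batch_size)
```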
The effectiveness of the grid-based simulator is evaluated based
on its calibration against the real data in terms of the most important
performance measurement: accumulated driver income (ADI) [15].
The coefficient of determination r 2 between simulated ADI and real
ADI is 0.9331 and the Pearson correlation is 0.9853 with p-value
p < 0.00001.
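For reference, calibration statistics of this kind can be computed as in the hedged sketch below; the linear-fit definition of r^2 is one common convention and is an assumption here, not necessarily the one used by the authors.

```python
import numpy as np
from scipy import stats


def calibration_metrics(simulated_adi, real_adi):
    """Compare simulated and real daily ADI series: Pearson correlation with its
    p-value, plus the coefficient of determination of a linear fit."""
    sim, real = np.asarray(simulated_adi, float), np.asarray(real_adi, float)
    pearson_r, p_value = stats.pearsonr(sim, real)
    slope, intercept = np.polyfit(real, sim, 1)
    residuals = sim - (slope * real + intercept)
    r_squared = 1.0 - residuals.var() / sim.var()
    return r_squared, pearson_r, p_value
```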
6 EXPERIMENT
In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed method in the joint order dispatching and fleet management environment. Since there are no published methods that fit our task exactly, we first compare our proposed method with other models, either widely used in industry or published as academic papers, in a single order dispatching environment. Then, we further evaluate our proposed method in the joint setting and compare it with its performance in the single setting.
6.1 Compared Methods
As discussed in [15], learning-based methods, currently regarded as
state-of-the-art methods, usually outperform rule-based methods.
Thus, we employ six learning-based methods and a random method as benchmarks for comparison in our experiments.
• RAN: A random dispatching algorithm considering no additional
information. It only assigns idle vehicles with available orders
randomly at each timestep.
• DQN: Li et al. [13] conducted action-value function approximation based on a Q-network. The Q-network is parameterized by an MLP with four hidden layers, with ReLU activations between the hidden layers and a final linear output.
• MDP: Xu et al. [39] implemented dispatching through a learning and planning approach: each vehicle-order pair is valued in
consideration of both immediate rewards and future gains in
the learning step, and dispatch is solved using a combinatorial
optimization algorithm in the planning step.
• DDQN: Wang et al. [36] introduced a double-DQN with spatial-temporal action search. The network architecture is similar to
the one described in DQN except that a selected action space is
utilized and network parameters are updated via double-DQN.
• MFOD: Li et al. [13] modeled the order dispatching problem
with MFRL [40] and simplified the local interactions by taking
an average action among neighborhoods.
• CoRide: Our proposed model as detailed in Section 4.
• CoRide-: In order to further evaluate the performance of the hierarchical setting and agent communication, we use CoRide without the multi-head attention mechanism as one of the baselines.

Table 1: Performance comparison of competing methods in terms of ADI and ORR with respect to the performance of RAN. For a fair comparison, the random seeds that control the dynamics of the environment are set to be the same across all methods.

         |           City A              |           City B              |           City C
Method   | Norm. ADI      | Norm. ORR    | Norm. ADI      | Norm. ORR    | Norm. ADI      | Norm. ORR
DQN      | +5.71% ± 0.02% | +2.67% ± 0.01% | +6.30% ± 0.01% | +3.01% ± 0.02% | +6.11% ± 0.02% | +3.04% ± 0.01%
MDP      | +7.11% ± 0.05% | +2.71% ± 0.03% | +7.89% ± 0.05% | +3.13% ± 0.04% | +7.53% ± 0.03% | +3.19% ± 0.03%
DDQN     | +6.68% ± 0.04% | +3.19% ± 0.04% | +7.75% ± 0.06% | +4.06% ± 0.05% | +7.62% ± 0.04% | +4.58% ± 0.05%
MFOD     | +6.62% ± 0.03% | +3.71% ± 0.02% | +7.91% ± 0.04% | +4.01% ± 0.02% | +7.32% ± 0.02% | +4.60% ± 0.01%
CoRide-  | +9.27% ± 0.04% | +4.23% ± 0.03% | +8.73% ± 0.03% | +4.35% ± 0.02% | +9.06% ± 0.03% | +4.23% ± 0.04%
CoRide   | +9.80% ± 0.04% | +4.81% ± 0.05% | +8.94% ± 0.06% | +4.89% ± 0.04% | +9.23% ± 0.05% | +5.19% ± 0.04%
6.2 Result Analysis
For all learning methods, following [13], we run 20 episodes for
training, store the trained model periodically, and conduct the evaluation on the stored model with 5 random seeds. We compare the
performance of different models regarding two criteria, including
ADI, computed as the total income in a day, and ORR, calculated
by the number of orders taken divided by the number of orders
generated.
Experimental Results and Analysis. As shown in Table 1, CoRide surpasses state-of-the-art models like DDQN and industry-deployed models like MDP. DDQN, along with DQN, is mainly limited by the lack of interaction and cooperation in the multi-agent environment. MDP mainly focuses on the order price but ignores other order features such as duration, which works against finding a balance between earning higher income per order and taking more orders. Instead, our proposed method achieves higher growth in terms of ADI, not only by considering every feature of each order concurrently but also by learning to collaborate hierarchically. MFOD tries to capture dynamic demand-supply variations by propagating many local interactions between vehicles and the environment through the mean field. Note that the number and information of available grids are relatively stationary, while the number and features of active vehicles are more dynamic. Thus, CoRide, which takes the grid as the agent, can more easily learn to cooperate from the interactions between agents.
Apart from cooperation, the multi-head attention network also enables CoRide to capture demand-supply dynamics at both the district (manager) and grid (worker) scales (as will be further discussed in Figure 6). Such a novel combined-scale setting makes CoRide both effective and efficient.
Visualization Analysis. Apart from quantitative results, we also analyze through visualization whether the learned multi-head attention network can capture the demand-supply relation (see Figure 6(b)). As shown in Figure 3, our communication mechanism operates in a hierarchical way: attention among the managers learns to collaborate abstractly and globally, while peers in the worker layer operate and determine key grids locally.
The values of several example managers and a group of workers belonging to the same manager are visualized in Figure 6(a). Taking a closer look at Figure 6, we can observe that areas with a high demand-supply gap are indeed concentrated at certain places, which is well captured at the manager scale. Such district-level attention values allow precious vehicle resources to be dispatched efficiently from a global view. Apart from the manager-scale values, the multi-head attention network also provides worker-scale attention values, which focus on local allocation. With this multi-scale dispatching design, CoRide can operate like a microscope, where coarse and fine focuses work together to obtain precise actions.
Figure 6: Sampled attention value and demand-supply gap in
the city center during peak hours. Grids with more orders or
higher attention value are shown in red (in green if opposite)
and the gap is proportional to the shade of colors.
Ablation Study. In this subsection, we evaluate the effectiveness
of the components of CoRide. Notice that the manager and worker modules serve as key components and are integrated through the hierarchical multi-agent architecture, as Figure 3 shows. Thus, we choose to investigate the performance of the multi-head attention network here and use CoRide- as a variant of the proposed method. As shown in the last two rows of Table 1, CoRide- achieves significant advantages over the aforementioned baselines, especially in City A. Similar results occur with CoRide. This phenomenon can be explained by the fact that City A is the largest city according to Section 5.1, which requires frequent and large-scale transportation among regions. Multi-scale guidance via the multi-head attention network and the hierarchical multi-agent architecture is therefore particularly helpful, especially in such a large-scale case.
Table 2: Performance comparison of competing methods in terms of AST and TNF under three different discounted rates (DR). The numbers in each Trajectory denote gridIDs in Figure 7, and their colors denote the districts they are located in. O and W mean the vehicle is On-service or Waiting at the current grid, respectively. Underlined numbers denote fleet management.
DR = 20%:
  RES      Trajectory: 13,9,W,14,W,13,8,2,7,11    AST: 8   TNF: 8
  REV      Trajectory: O,O,15,W,20,O,O,O,O,11     AST: 9   TNF: 3
  CoRide   Trajectory: 13,9,W,O,0,4,O,2,O,5       AST: 9   TNF: 6
  CoRide+  Trajectory: 13,9,W,O,0,4,O,2,O,5       AST: 9   TNF: 6

DR = 30%:
  RES      Trajectory: 13,W,14,W,W,13,8,W,7,11    AST: 6   TNF: 6
  REV      Trajectory: O,O,15,W,W,W,O,O,O,11      AST: 7   TNF: 2
  CoRide   Trajectory: 13,W,O,11,W,W,O,O,0,5      AST: 7   TNF: 4
  CoRide+  Trajectory: 8,3,0,2,O,4,O,2,O,5        AST: 8   TNF: 5

DR = 40%:
  RES      Trajectory: 13,W,14,W,W,W,W,19,O,9     AST: 5   TNF: 4
  REV      Trajectory: O,O,15,W,W,W,20,W,O,14     AST: 6   TNF: 3
  CoRide   Trajectory: 13,W,W,O,17,W,W,O,0,3      AST: 6   TNF: 3
  CoRide+  Trajectory: 8,3,0,2,O,4,O,2,O,5        AST: 8   TNF: 5
Figure 7: Illustration of the grid world in the case study, where the color of the grids denotes their entropy (from max to min; unavailable grids are marked separately; the districts correspond to downtown and uptown areas).
Case Study. The above experimental results show that the success rate of our model is significantly better than the others in the single dispatching task. In order to evaluate the performance of CoRide in joint dispatching and repositioning, and to further differentiate the formulations of the models, we construct a synthetic dataset containing 3 districts with 21 grids, as shown in Figure 7. The synthetic datasets are obtained by sampling the real-world dataset supported by Didi Chuxing. More concretely, the order distributions of all grids are sampled from the average distribution in the real-world dataset; namely, the order distributions in each grid are homogeneous. In order to distinguish downtown areas from uptown areas, we introduce a sampling rate, which denotes the popularity of each grid in the real world. We set downtown (red grids in Figure 7) with a stationary sampling rate of 100%. The other regions are sampled with sampling rate := 100% − discounted rate for the yellow district and 100% − 2 × discounted rate for the green district. Specifically, we set the discounted rate to 20%, 30%, and 40%, respectively, and further verify our proposed model by comparing against the following 3 methods:
• RES: This response-based method aims to achieve higher ORR,
which corresponds to Total Number of Finished orders (TNF) in
this section. Orders with short durations are given high priority and dispatched first. If there are multiple orders with the same estimated trip time, orders with higher prices are served first.
• REV: The revenue-based algorithm focuses on a higher ADI,
which corresponds to Accumulated on-Service Time (AST) in
this section, by assigning vehicles to high-price orders. Following a similar principle as described above, the price and duration of orders are considered as primary and secondary factors, respectively.
• CoRide+: To distinguish CoRide running in different environments, i.e., single order dispatching versus joint order dispatching and fleet management, we denote the former CoRide and the latter CoRide+.
In order to analyze these performances in a more straightforward way, we mainly employ rule-based methods here. We also introduce AST, calculated as the accumulated on-service time, and TNF, computed as the total number of finished orders, as new metrics corresponding to ADI and ORR, respectively. In order to further analyze these performances from a long-term perspective, we select one vehicle starting at grid 12, trace it for 10 timesteps, record its trajectory (as Figure 7 shows), and summarize the results in Table 2. Although we only record the first 10 timesteps, we can observe that both of our proposed methods, CoRide+ and CoRide, guide the vehicle to regions with larger entropy. This benefit comes from the architecture, where the states of both manager and worker as well as the ranking feature e_i take the grid information into consideration. In contrast, the other methods greedily optimize either AST (ADI) or TNF (ORR) and ignore this information. Taking a closer look at Table 2, we find that CoRide and CoRide+ share the same trajectory at a discounted rate of 20% and differ greatly when the discounted rate moves to 30% and 40%. This can be explained by regarding CoRide+ as a combination of our proposed model CoRide and the joint order dispatching and fleet management setting; namely, CoRide is actually a special case of CoRide+ where fleet management is disabled. Equipped with fleet management, CoRide+ allows the vehicle to move to and serve orders in the hotspots more directly than CoRide. Also, when the discounted rate varies from 20% to 40%, fleet management gives CoRide+ better adaptation and more stable performance, and can even make it insensitive to the dynamics of the order distributions in some cases.
According to the aforementioned analysis, we can conclude that (i) CoRide+ achieves not only state-of-the-art but also more stable results, benefiting from the joint order dispatching and fleet management setting; (ii) both CoRide and CoRide+ can direct the vehicle to grids with larger entropy by taking the grid information into consideration.
7 CONCLUSION AND FUTURE WORK
In this paper, we proposed CoRide, a hierarchical multi-agent reinforcement learning solution to combine order dispatching and fleet
management for multi-scale ride-hailing platforms. The results on
multi-city real-world data as well as analytic synthetic data show
that our proposed algorithm achieves (i) a higher ADI and ORR than
aforementioned methods, (ii) a multi-scale decision-making process,
(iii) a hierarchical multi-agent architecture in the ride-hailing task
and (iv) a more stable method at different cases. Note that CoRide
could achieve fully decentralized execution and incorporate closely
with other geographical information based model like estimating
time of arrival (ETA) [35] theoretically. Thus, it’s interesting to
conduct further evaluation and investigation. Also, we notice that
applying hierarchical reinforcement learning in real-world scenarios is very challenge and our work is just a start. There is much
work for future research to improve both stability and performance
of hierarchical reinforcement learning methods on real-world tasks.
Acknowledgments. The corresponding author Weinan Zhang
thanks the support of National Natural Science Foundation of China
(Grant No. 61702327, 61772333, 61632017). We would also like to thank our colleagues at DiDi for their constant support and encouragement.
REFERENCES
[1] Sanjeevan Ahilan and Peter Dayan. 2019. Feudal multi-agent hierarchies for
cooperative reinforcement learning. arXiv preprint arXiv:1901.08492 (2019).
[2] Pierre-Luc Bacon, Jean Harb, and Doina Precup. 2017. The Option-Critic Architecture. In AAAI. 1726–1734.
[3] Richard Bellman. 2013. Dynamic programming. Courier Corporation.
[4] Christian Daniel, Gerhard Neumann, and Jan Peters. 2012. Hierarchical relative
entropy policy search. In Artificial Intelligence and Statistics. 273–281.
[5] Peter Dayan and Geoffrey E Hinton. 1993. Feudal reinforcement learning. In
Advances in neural information processing systems. 271–278.
[6] Thomas G Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ
value function decomposition. Journal of Artificial Intelligence Research 13 (2000),
227–303.
[7] Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017. Stochastic neural networks
for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012 (2017).
[8] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. 2017.
Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767 (2017).
[9] Gianpaolo Ghiani, Francesca Guerriero, Gilbert Laporte, and Roberto Musmanno.
2003. Real-time vehicle routing: Solution concepts, algorithms and parallel
computing strategies. European Journal of Operational Research 151, 1 (2003),
1–11.
[10] Xiangyu Kong, Bo Xin, Fangchen Liu, and Yizhou Wang. 2017. Effective masterslave communication on a multiagent deep reinforcement learning system. In
Hierarchical Reinforcement Learning Workshop at the 31st Conference on NIPS,
Long Beach, CA, USA.
[11] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum.
2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction
and intrinsic motivation. In Advances in neural information processing systems.
3675–3683.
[12] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. 2018. Learning
multi-level hierarchies with hindsight. (2018).
[13] Minne Li, Yan Jiao, Yaodong Yang, Zhichen Gong, Jun Wang, Chenxi Wang,
Guobin Wu, Jieping Ye, et al. 2019. Efficient Ridesharing Order Dispatching with
Mean Field Multi-Agent Reinforcement Learning. arXiv (2019).
[14] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with
deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[15] Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. 2018. Efficient Large-Scale
Fleet Management via Multi-Agent Deep Reinforcement Learning. arXiv preprint
arXiv:1802.06444 (2018).
[16] Dominique Lord, Simon P Washington, and John N Ivan. 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention 37, 1 (2005), 35–46.
[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep
reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[18] James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957),
32–38.
[19] Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. 2018. Data-Efficient
Hierarchical Reinforcement Learning. arXiv preprint arXiv:1805.08296 (2018).
[20] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. 2018. Near-Optimal
Representation Learning for Hierarchical Reinforcement Learning. arXiv preprint
arXiv:1810.01257 (2018).
[21] Takuma Oda and Yulia Tachibana. 2018. Distributed Fleet Control with Maximum
Entropy Deep Reinforcement Learning. (2018).
[22] Doina Precup. 2000. Temporal abstraction in reinforcement learning. University
of Massachusetts Amherst.
[23] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias
Springenberg. 2018. Learning by Playing-Solving Sparse Reward Tasks from
Scratch. arXiv preprint arXiv:1802.10567 (2018).
[24] Kiam Tian Seow, Nam Hai Dang, and Der-Horng Lee. 2010. A collaborative
multiagent taxi-dispatch system. IEEE Transactions on Automation Science and
Engineering 7, 3 (2010), 607–616.
[25] Olivier Sigaud and Freek Stulp. 2018. Policy Search in Continuous Action Domains: an Overview. arXiv preprint arXiv:1803.04706 (2018).
[26] Hugo P Simao, Jeff Day, Abraham P George, Ted Gifford, John Nienow, and
Warren B Powell. 2009. An approximate dynamic programming algorithm for
large-scale fleet management: A case application. Transportation Science 43, 2
(2009), 178–197.
[27] Matthijs TJ Spaan. 2012. Partially observable Markov decision processes. In
Reinforcement Learning. Springer, 387–414.
[28] Martin Stolle and Doina Precup. 2002. Learning options in reinforcement learning.
In International Symposium on abstraction, reformulation, and approximation.
Springer, 212–223.
[29] Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement learning.
Artificial intelligence 112, 1-2 (1999), 181–211.
[30] Xiaocheng Tang and Zhiwei Qin. 2018. A Deep Value-network Based Approach
for Multi-Driver Order Dispatching. Technical Report (2018).
[31] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor.
2017. A Deep Hierarchical Approach to Lifelong Learning in Minecraft.. In AAAI,
Vol. 3. 6.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In NeurIPS.
[33] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv (2017).
[34] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max
Jaderberg, David Silver, and Koray Kavukcuoglu. 2017. Feudal networks for
hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161 (2017).
[35] Zheng Wang, Kun Fu, and Jieping Ye. 2018. Learning to estimate the travel time.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. ACM, 858ś866.
[36] Zhaodong Wang, Zhiwei Qin, Xiaocheng Tang, Jieping Ye, and Hongtu Zhu. 2018.
Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order
Dispatching. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE,
617–626.
[37] Chong Wei, Yinhu Wang, Xuedong Yan, and Chunfu Shao. 2018. Look-Ahead
Insertion Policy for a Shared-Taxi System Based on Reinforcement Learning.
IEEE Access 6 (2018), 5716–5726.
[38] Hua Wei, Nan Xu, Huichu Zhang, Guanjie Zheng, Xinshi Zang, Chacha Chen,
Weinan Zhang, Yanmin Zhu, Kai Xu, and Zhenhui Li. 2019. CoLight: Learning Network-level Cooperation for Traffic Signal Control. arXiv preprint
arXiv:1905.05717 (2019).
[39] Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan,
Chunyang Liu, Wei Bian, and Jieping Ye. 2018. Large-Scale Order Dispatch in
On-Demand Ride-Hailing Platforms: A Learning and Planning Approach. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. ACM, 905–913.
[40] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang.
2018. Mean Field Multi-Agent Reinforcement Learning (ICML).
[41] Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated
convolutions. arXiv preprint arXiv:1511.07122 (2015).
[42] Huichu Zhang, Siyuan Feng, Chang Liu, Yaoyao Ding, Yichen Zhu, Zihan Zhou,
Weinan Zhang, Yong Yu, Haiming Jin, and Zhenhui Li. 2019. CityFlow: A MultiAgent Reinforcement Learning Environment for Large Scale City Traffic Scenario.
arXiv preprint arXiv:1905.05217 (2019).
[43] Lingyu Zhang, Tao Hu, Yue Min, Guobin Wu, Junying Zhang, Pengcheng Feng,
Pinghua Gong, and Jieping Ye. 2017. A taxi order dispatch model based on
combinatorial optimization. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2151–2159.
[44] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang
Tang. 2017. Deep Reinforcement Learning for List-wise Recommendations. arXiv
preprint arXiv:1801.00209 (2017).
[45] Qingnan Zou, Guangtao Xue, Yuan Luo, Jiadi Yu, and Hongzi Zhu. 2013. A novel
taxi dispatch system for smart city. In International Conference on Distributed,
Ambient, and Pervasive Interactions. Springer, 326–335.