Lu 2020
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Generally, when a passenger taps the card at the station, the passenger's location and time are recorded in the AFC system. In some transit systems, such as Beijing metro and Shanghai metro, passengers need to tap the card when boarding and alighting, which provides accurate OD information; this kind of system is called a closed system. However, in some cities, passengers only tap the card when entering or exiting the system [13]. The origin or destination information is lost, and we call it an open system. The trip chaining approach is usually applied to estimate complete trips in the open system, which will be discussed in Section IV.

2) Automatic Vehicle Location (AVL): In response to growing passenger demands for operation reliability, many transit operators are seeking to improve transit vehicle operations by investing in AVL technology [14]. These systems collect the location of vehicles, usually by broadcasting the sensors' values at intervals of 10-30 seconds depending on the radio capacity. Typically, AVL systems are based on GPS measurements [15].

Early AVL systems were offline, and their data could not be transmitted to the main server in time. With the development of online technology, AVL systems can now produce continuous data streams. Specifically, each vehicle transmits its data to the main server with a very short (but fixed) period, which is known as a real-time AVL system [16]. Based on the real-time data, many researchers provide real-time decision models to support operation control, such as travel time and dwell time prediction [16], [17] and real-time rescheduling [18].

3) Automatic Passenger Counting (APC): APC data is an important passenger information supplement for AVL. It relies on estimation techniques based on door loop counts or weight sensors installed in vehicles [15]. The APC system records passenger activities such as boarding and alighting [14]. Based on the APC and AVL data, researchers can estimate the passenger O-D matrix [14], dwell time [19] and passenger assignment [20].

4) General Transit Feed Specification (GTFS): GTFS provides a common format standard for open public transportation schedules and associated geographic information. It is used by over 150 cities around the world and covers, in particular, information on trips, routes, stop times and stop locations. Some recent studies [21]–[23] have used GTFS to analyze transit accessibility by estimating the point-to-point travel times at different time periods of the day. Fortin et al. [24] also imported GTFS data to perform transit network analysis for dynamic network connectivity, service frequency at stops and service speed on routes. In addition, a few papers use GTFS data for schedule-based transit system modeling and optimization [3], [25], [26]. Recently, a real-time version, named GTFS-realtime (GTFS-rt), has begun to emerge to allow agencies to update real-time trip information, vehicle location, and service alerts. To address the prediction errors of real-time GTFS data, Barbeau [27] developed an open-source tool to monitor and validate data and further produce statistics for all validations. Additionally, the Transportation Research Board in the US has awarded a grant to improve the quality of GTFS real-time feeds.

B. Advanced Data Collecting Technologies

1) Smart Phone Data (SPD): The smartphone integrates GPS, Wi-Fi, and accelerometers. It provides a new way to track individual long-term travel data, which could help track passenger mobility and behaviors in transit systems [28], [29]. Generally, mobile phone data have high penetration rates, so their anonymized flow data have been mined to perform transit analysis and optimization [30].

In order to record users' travel data, some researchers have developed several cell phone applications. Once the applications are activated, the cell phone sends periodic and anonymized location updates to a central tracking server. Those data could help distinguish whether the passenger is on a vehicle or not. The GPS trajectory data are used as the input for the route matching algorithm that determines whether the user is in a bus or another vehicle [31].

Another advantage of smartphone data is the collection of passengers' attitudes toward real-time information. Watkins et al. [32], using the OneBusAway application, observed passengers arriving at Seattle-area bus stops to measure their waiting time while asking a series of questions, including how long they believed they had waited. They found that for passengers without real-time information, the perceived waiting time is greater than the measured waiting time. However, when passengers have real-time information, their perceived time is no longer more than the experienced waiting time. Moreover, real-time information users wait almost 2 minutes less than those using traditional schedule information. Also, mobile real-time information can improve the experience of transit passengers by providing available transit system information before their trips.

2) Bluetooth Technology: Bluetooth is a wireless technology standard for exchanging data over short distances, which provides a new way of analyzing passenger behavior at specific locations in stations, such as stairs or elevators. Wu et al. [33] represented one of the earliest attempts to use Bluetooth technology as an information collection application in an Android-based smartphone for the Beijing Metro in peak hours. From a total of 41,806 records during 120 days, they performed extensive analysis on a variety of statistics for the number of neighbors, the lifetime of the neighbors, the flow speed of the nearby passengers and the battery usage rate. The results show that Bluetooth is well suited to building a relatively small multi-hop wireless network with a network size of four nodes, and that it is applicable and may promote new applications in an underground environment during the peak hour. Meanwhile, taking advantage of its short-distance connection, Bluetooth technology has also been used to estimate the crowdedness on stairs and in elevators [34].

3) Wi-Fi Technology: Wi-Fi is a technology for wireless local area networking with location information. The passenger's OD information could be detected by backward tracking Wi-Fi signals of mobile devices carried by transit passengers. The O-D flows, determined directly from the Wi-Fi data for a specific bus trip, do not correspond well to the ground truth flows, but the results demonstrate the promise of using
Wi-Fi signal data for O-D flow determination when the data are aggregated across multiple bus trips for a time-of-day period, especially when being used in conjunction with APC data [35].

4) Biometric Face Recognition: It is vital to detect the passenger distribution in large-scale transportation systems. One of the main difficulties in tracking passengers is the lack of detailed individual information in the transit system. Biometric facial recognition technologies make it possible to obtain the individual position of each transit user [36]. Biometric facial recognition technologies contain automated methods for verifying or recognizing the identity of a person on the basis of a facial image. The basic structure of an automated facial recognition system consists of four fundamental blocks: face detector, feature extractor, database and classifier [37]. Mikłasz et al. [38] first attempted to use facial recognition to obtain the pedestrian distribution in an interchange station. Passenger transfer matrices obtained from optical analysis and a traditional survey prove the effectiveness and potential of image analysis methods. In addition to obtaining pedestrian distribution, this technology has been used in metro and railway security check-in [39].

5) Social Media Data: Twitter, Facebook, and WeChat are widely-used social media applications. They collect social interactions of a large number of people and are therefore a valuable resource for predicting various large-scale trends [40]. Social media data have been used to improve models for predicting flu trends and to detect seismic activity after earthquakes [41], [42]. Similar research in the field of transportation monitoring has been carried out as well. Sentiment analysis was used to reveal public opinions regarding transit agencies [43].

Table I summarizes the applications of different data collection technologies across different research directions, such as passenger travel time estimation, passenger trip distribution, demand forecast and timetable rescheduling. They range from passenger behavior to system optimal operations. As observed in Table I, one specific research problem can be analyzed or optimized by utilizing multi-source sensor data. Therefore, how those different data can be incorporated and evaluated is important for model validation, system performance optimization and long-term system planning.

III. THE SYSTEMATIC IMPLEMENTATION OF TRANSIT BIG DATA

In order to derive valuable insights for decision support and automation from huge volumes of heterogeneous sensor data, a number of big data technologies have attracted attention and investment from different enterprises and organizations. Those technologies mainly belong to three categories: data storage (such as Hadoop, data lakes, NoSQL databases, etc.), data processing (such as Spark, Hadoop, data governance, etc.), and data analytics (Spark, cloud computing, edge computing, artificial intelligence, modeling and optimization, etc.). Specifically, the Hadoop ecosystem has been widely recognized due to its reliable, efficient and scalable distributed processing of large data sets [44]. In intelligent transportation systems, different frameworks for big data analytics on transportation management and operations are introduced by using Hadoop with MapReduce or Spark [45]–[48]. Wang et al. [44] show that the performance of processing mass GPS data via the Hadoop platform is improved by 4000% compared with the serial program. Focusing on the urban rail transit system, historical passenger travel data from AFC are processed to derive the passenger arrival rate, passenger alighting proportion, and travel patterns in a Hadoop big data platform [49], [50]. In addition, the Hadoop platform is also applied in the Bus Rapid Transit (BRT) system to analyze passenger travel patterns by improving the scalability of K-means++ clustering on large datasets [51] and to identify fraud using transaction profiling from 165 million records [52]. However, our paper will mainly focus on different methodologies used in big data analytics, including statistics, optimization, simulation and machine learning, which will be explained in detail in the following sections.

Specifically, this paper will focus on the review of big data applications and analytics in transit systems, classified into three aspects: passenger behavior analysis, operation planning and policy making, whose connection is illustrated in Fig. 3 by a tree-based graph.

The words in blue represent the three applications of transit data. The words in green represent the research branches and subtopics of each data application. The words in purple represent the objectives of each subtopic. The words in black represent the constraints and methodologies.
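As an illustration of the AFC processing mentioned in this section (deriving passenger arrival rates on a Hadoop-style platform), the following is a minimal single-machine sketch of the map and reduce steps. The record layout, the 15-minute window, and the names `TAP_INS`, `map_record`, `reduce_counts` and `arrival_rates` are assumptions made for illustration; they are not taken from the cited studies [49], [50], and a real deployment would run the same two steps on a distributed framework.

```python
from collections import Counter
from datetime import datetime

# Hypothetical AFC tap-in records: (card_id, station, ISO timestamp).
TAP_INS = [
    ("c1", "A", "2020-01-06T08:01:00"),
    ("c2", "A", "2020-01-06T08:07:00"),
    ("c3", "A", "2020-01-06T08:16:00"),
    ("c4", "B", "2020-01-06T08:03:00"),
]


def map_record(record, window_min=15):
    """Map step: emit ((station, window_start_minute), 1) for one tap-in."""
    _, station, timestamp = record
    t = datetime.fromisoformat(timestamp)
    minute_of_day = t.hour * 60 + t.minute
    window_start = (minute_of_day // window_min) * window_min
    return (station, window_start), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def arrival_rates(records, window_min=15):
    """Return passengers per minute for each (station, window start) key."""
    counts = reduce_counts(map_record(r, window_min) for r in records)
    return {key: n / window_min for key, n in counts.items()}


rates = arrival_rates(TAP_INS)
```

Here the window start is expressed in minutes after midnight, so the key `("A", 480)` refers to station A in the window starting at 08:00.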
are the fundamental elements of travel pattern (see Table III) [58]–[60]. Clustering is the basic method for mobility analysis, including two-level models with temporal trip clustering and spatial trip clustering [61], density-based spatial clustering to cope with the huge amount of data [62], [63], network-based clustering methods [64], and a two-level model with membership clustering and a Gaussian mixture model for temporal trip clustering [65].

Daily habitual travel patterns possess regularity as a basis, but more consideration of long-term trends is required for transportation planners. The variability and correlation of travel patterns and time changes for transit network planning have gained great attention, with methods such as analysis of variance (ANOVA) in trip patterns for one month [59], variability statistics for regularity analysis of demand patterns [66], and statistical analysis and correlation matrices [64]. Other approaches for
demand pattern analysis include iterative self-organizing data analysis [67] and flow-comap-based visualization [26].

While commuters make the majority of daily journeys [67], extreme travelers have recently received increasing attention from both academia and mass media due to the increased numbers of the unemployed, self-employed and part-timers, the rise of telecommuters, and low-paying jobs relocated to cheaper places inside or outside a region/country [68]. At first, extreme travelers were defined as passengers who take excessively long trips [69]. There are three more types of extreme travelers in the context of Chinese society: public transit passengers who (1) make significantly more trips (‘recurring itinerants’), (2) travel significantly earlier than average passengers (the ‘early birds’) during weekdays, and (3) ride in unusually late hours (the ‘night owls’) during weekdays [68]. The applied methods include an Extreme Index (EI)-based Gaussian mixture model [71] and kernel density estimation for four extreme transit behaviors [70].

Beyond demand patterns, Sun et al. [72] applied a simulation tool (MATSim) for transfer behavior detection; Luo et al. aggregated the demand of spatially close stations for transit demand construction by k-means-based station aggregation based on passenger flow and spatial station distance [73]; four corresponding metrics (the minimum, actual, random and maximum travels) were calculated to reflect transit riders' elasticity of distance travelled (EDT) relative to the cost of travel to monitor the “transit-served areas” (TSAs) [74]; Sun et al. applied regression analysis to the dynamics of boarding/alighting activities and their impact on bus dwell times [75]; and Ma et al. developed a Markov-chain-based Bayesian decision tree algorithm to estimate passengers' origins in Beijing's flat-rate bus system [76].

2) Network Accessibility: Passenger accessibility relies heavily on the accessibility of the transit network and its schedule. In the current literature, accessibility can be calculated for different categories of opportunities (see Table IV). The range of the classic models indicates its complexity and significance. At the disaggregate level, every user can be viewed as the representative of a unique class, which corresponds to agent-based models. In addition, a classification based on trip purpose, vehicle type, and transportation mode would provide reasonable realism while minimizing the classes/flows in the models. A number of studies focus on work accessibility with gravity models [79], food accessibility for residents who rely on public transit [80] and activity accessibility for Bus Rapid Transit (BRT) [81]. Initially, researchers only considered spatial facility accessibility [82], such as walking service quality [83], [84]. Mobility is measured by spatial accessibility [82], [85], the time-space prism [86], and connectivity [87]. However, for the schedule-based transit network, especially
for the low-frequency system, it is necessary to evaluate the accessibility based on the timetable [22], [88].

The analysis methodologies depend mostly on statistical analysis [80], [81], [84], stochastic frontier modeling [86], accessibility indicators [22], [79], [82], [83], [87]–[89] and GIS-based tools for visualization [90].

3) Activity and Trip Purpose: Trip purpose is one of the main aspects when passengers determine their behaviors in the transit network. For work trips, passengers prefer to choose the shortest path to save time, but passengers on shopping trips usually prefer better comfort and may choose the path that is less crowded. It is possible to obtain the purpose-based behavior characteristics when aggregated passenger behavior includes different trip purposes, which are then used for customized service. Trip purpose is one of the main topics for analyzing passenger behavior (see Table V). Passengers, especially commuters, may change their choices at different times of the day. In the morning peak hour, passengers pay more attention to service reliability and prefer the metro. In the off-peak hour, bus or taxi may be preferred by passengers to finish their trips given the amount of travel time. The discrete choice model [91], [92], clustering methods [93] and data mining and machine learning [94], [95] are the most popular algorithms for analyzing trip activity and trip purpose.

Generally, the trip purpose can be classified into 3 categories: work, home, and others [91]. Activity duration time, land use of the station location and station frequency are the basic parameters to determine the trip purpose [91]–[93]. In addition, the daily trip symmetry characteristics are also applied to determine the trip purpose [96]. The work and home trip purposes usually appear in the first and last trips within one day, and the passenger may have other activities after work. To obtain more detailed personal information, household surveys and GPS logs are also taken as supplements for AFC data [97]. Some new data analysis methods such as machine learning have also been used to analyze those emerging data [95].

4) Trip Chaining: Trip chaining is an important research topic for obtaining the destination location and transfer stations for each passenger. These results could help designers optimize the operation schedule for successive legs to save transfer time and optimize the stop locations. In open AFC systems, such as the London bus system and Metro Transit in Minneapolis-St Paul [99], only boarding information such as boarding time and location is available, and the alighting information is not. Based on other supporting data resources such as automated vehicle location (AVL) and automated data collection (ADC), more detailed information about boarding and alighting can be inferred. Researchers have worked on individual trip destination estimation for tap-in only transit systems (see Table VI) in the past decades. There are two basic assumptions for the trip-chaining algorithms.
• A high percentage of passengers return to the destination station of their previous trip to begin their next trip.
• A high percentage of passengers end their last trip of the day at the station where they began their first trip of the day.

Some other parameters, such as walking distance and the time interval between two trip legs, were studied when matching separate journeys for an individual cardholder, especially for the transfer determination [100]. When applying the trip-chaining methodology to infer the alighting station for the current trip, each cardholder should have more than one trip in the transit system, and there should be no private transportation mode trip segment, such as car or bicycle, between consecutive transit trip segments in a daily trip sequence [13]. In some research, the single trip destination is inferred based on other days' records, and if it only appears once, the alighting station is invalid. The sensitivity analysis is applied with
onboard survey data to validate the feasibility of the method. However, the onboard survey is expensive and data samples are usually limited. The methodologies used mainly depend on the logic of the trip chain based on available AFC, AVL, ADC and onboard survey data, OD estimation approaches, sampling analysis, sensitivity analysis, etc.

5) Travel Time Reliability and Route Choice: Passengers are normally quite sensitive to the waiting time in their trips, which could be represented by the reliability of travel time. They may change their routes based on the path travel time reliability, especially when accidents occur. These reliability-based route choice characteristics and the real-time information-based behavior are important references for accident rescheduling. Passenger route choice and travel time reliability is an ongoing research endeavour (see Table VII). Travel time is the foundation for route choice, and it could be estimated by link travel time or trip time from AFC data. Generally, the link travel time and trip time share a closed-form time distribution based on the additive property [102]. Reliability indicator calculation [103], Gaussian mixture [104], [105], Markov chain [76], and Bayesian inference [106] are the most popular methods to estimate the travel time distribution and trip time. The key for the route choice model is to find the probability of different paths. The logit model is the basic one [107]–[109]. Considering passenger preference and travel strategy, the nested and mixed logit models are also applied in research [110]. Kim et al. performed regression analysis on route stickiness [111] and Nassir et al. developed a statistical inference to deduce the set of attractive routes for public transit passengers [112]. More detailed information is discussed below. Generally, researchers assume that passengers make their decisions by a maximum utility function, which is usually the generalized path cost [76]. The generalized cost contains the total travel time and the penalty for transfers and crowdedness. As such, crowdedness plays a crucial role in passenger route choice behavior [25], [107]. In busy transit systems such as those in Beijing and Shanghai, some passengers fail to board the incoming train and need to wait for the next one in the morning peak hour. Sun and Schonfeld [108] proposed a method to estimate the ‘failing to board’ phenomenon. Liu and Zhou [25] show how this kind of failing to board under tight capacity constraints can invoke bounded rationality behavior. In addition, a number of generalized user equilibria are also studied by considering different route choice assumptions, such as experienced least-cost path selection [113], the optimal strategy for expected least-cost path set selection [114], path selection with perceived random error [115], and non-cooperative behavior with a number of exogenous priority loading rules [116].

Modelling travel time is fundamental for the route choice model. Before performing passenger assignment, a path set is usually generated for each passenger. The number of passenger paths and link uncertainty matter for the efficiency of the assignment algorithm. Passengers generally pay more attention to the travel time reliability (service reliability) than to the total travel time [102], [117], especially for public transit and other transit modes [110], [118]. The algorithms become much more complex when mapping the timetable to the physical map. An onboard survey showed that some of the passengers always selected the same route (high stickiness) compared with those with more varied patterns of route selection (low stickiness). The number of feasible paths will decrease with the stickiness index [111]. Another major
application of the route choice is in the cases of disruption occurrence. When accidents or disruptions happen in the system, the operation agencies need to adjust the train schedule and provide a quick response service for passengers to dispatch the stranded passengers [119].

B. Operation Optimization

Operation optimization is one of the most important topics in transit data application (see Table VIII). Operation optimization contains the operation plan, the schedule and the fare scheme. In this part, we mainly focus on the plan and schedule. The ticket scheme and fare optimization will be discussed in detail in the following policy part.

Due to the model complexity and some concave constraints, the Genetic Algorithm (GA) [123]–[125] is the most popular algorithm for operation optimization. To reduce the solution search time, some updated algorithms are further applied, such as Branch-and-bound GA [126], Simulation-based GA [127] and the non-dominated sorting genetic (NSGA-II) based algorithm [128]. In addition, other heuristic algorithms, such as the hybrid artificial bee colony algorithm [129] and Tabu search [130], are also considered.
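To make the genetic algorithm idea concrete, the following is a minimal sketch of a generic GA that picks one headway per time period to trade off passenger waiting time against operating cost. It is not the formulation of any of the cited studies: the demand values, the headway bounds, the cost weight `TRIP_COST`, the assumption that average waiting time is half the headway, and all function names are hypothetical choices made for illustration.

```python
import random

DEMAND = [300, 1200, 600]  # hypothetical passengers/hour in three periods
H_MIN, H_MAX = 2, 15       # assumed allowed headways (minutes)
TRIP_COST = 40.0           # assumed cost of one vehicle trip, in waiting-minute units


def cost(headways):
    """Waiting time (half the headway per passenger) plus weighted trips per hour."""
    wait = sum(d * h / 2 for d, h in zip(DEMAND, headways))
    trips = sum(60 / h for h in headways)
    return wait + TRIP_COST * trips


def tournament(pop, rng, k=3):
    """Selection: return the cheapest of k randomly drawn individuals."""
    return min(rng.sample(pop, k), key=cost)


def crossover(a, b, rng):
    """One-point crossover between two parent headway vectors."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(ind, rng, rate=0.2):
    """Mutation: resample each headway with probability `rate`."""
    return [rng.randint(H_MIN, H_MAX) if rng.random() < rate else h for h in ind]


def genetic_search(pop_size=30, generations=60, seed=0):
    """Run the GA with elitism and return the best headway vector found."""
    rng = random.Random(seed)
    pop = [[rng.randint(H_MIN, H_MAX) for _ in DEMAND] for _ in range(pop_size)]
    for _ in range(generations):
        children = [min(pop, key=cost)]  # elitism: carry over the current best
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    return min(pop, key=cost)


best = genetic_search()
```

Because the best individual is copied into each new generation, the best cost never gets worse from one generation to the next; the search is heuristic, however, and gives no guarantee of reaching the global optimum.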
Besides those heuristic approaches, Lagrangian relaxation is also widely used in timetable optimization [131]–[133]. Different simulation methods, such as discrete events/ordered time simulation [134] and time-driven passenger microscopic simulation [135], are also adopted to evaluate the timetable. More detailed information is discussed below.

During daily operations, transit vehicles follow the timetable which has been set before the operation. Despite the high fixed headway and dwell time in the peak hour, the service may still not meet the high passenger demand. Ignoring such demand dynamics may result in minor disruptions and poor service reliability. Therefore, it is necessary to understand the spatial-temporal travel pattern of the passengers and to design demand-sensitive timetables to meet the demand uncertainty [53]–[55], [126]. The objectives of the research on demand-driven timetables are to minimize the total passenger waiting time or operation costs. Some researchers applied a more flexible stopping plan with elastic demands, such as the skip-stop schedule [129], [130].

Rescheduling and dispatching that address external disruptions (such as traffic congestion and road accidents) and internal system delays (such as dynamic vehicle running time and stochastic passenger demand) in the transit system are much more difficult than determining the daily dynamic schedule. Based on the real-time AVL and APC data, a number of studies have focused on real-time vehicle bunching and control to reduce headway deviation and passenger waiting time. Adaptive control schemes are proposed by dynamically determining bus holding times at a route's control points [136] and adjusting a bus's cruising speed [137] based on real-time headway information. In addition, two holding methods were investigated, including threshold-based control logic and real-time control based on preceding and following headways, to determine the locations and numbers of control points and the optimal control strength [138]. A combination of
different real-time data-driven prediction approaches was also the population around the corridors or the desired land value
proposed to optimize holding control strategies and station increases have resulted in significant population displacement.
skipping [139]–[142]. Other approaches, such as, dynamic Kim et al. [156] investigated the relationship between transit
bus propagation model [142], boarding person limit [143], are investment and urban land use change in a parcel-level land use
also proposed to improve the system efficiency. Recently, the in Southern California by developing a multinomial logistic
performance of different holding methods was compared in regression model, which shows that vacant parcels within
terms of headway instability and mean holding time with and the vicinity of new transit stations are more likely to be
without real-time predictions [144], and real-time transfer syn- developed not only for residential but also for other urban
chronization was considered to reconcile single-line regularity purposes. Hu et al. [157] proposed three machine-learning
and inter-line arrivals [145]. Most of performance evaluations models to quantify the interdependencies between land use and
are conducted in different simulation-based environments. public transport ridership and further provided the guidance
In addition, some adjusted Genetic algorithms were proposed of development of city regional center and amenity resource
to address the train delay in metro systems [125], [146], and allocations in Singapore.
some researchers also conducted tests on real-time automatic The smart cards had been used for collecting the fare and
rescheduling strategies to make the dynamic train timetable improving the profit of the operators. Some operators changed
more flexible [134], [147]. the fare scheme to make more profit. In 2014, Beijing metro
In addition, incorporating travel behaviors in timetable substituted a flat-fare policy for a distance-based fare policy.
design is still a challenging issue. The travelers’ response Mining the passenger response to the fare change is significant
to the adjusted transit schedules should be considered in a for the next round of fare adjustment. Lots of work have
network-level transit system. Particular studies conducted by been working on the transit fare estimation based on the fare
Liu and Zhou is firstly to propose an agent-based modeling elasticity. Generally, the elasticity is taken as −0.3 regardless
framework to consider the optimal design in transit schedule of case-by-case difference [158]. The variety of passengers’
network with boundedly rational travelers in the operational responses to fare change exists at a station level and three fare
planning [125]. increase alternatives (high, medium, and low) were evaluated
Another topic of interest is the study of timetable in terms of their impacts on ridership and revenue [10]. For
evaluation. The evaluation could be time-driven microscopic each alternative, the majority of the total trips with a length
simulation method [148] and demand uncertainty for robust- of around 15 km are the most sensitive to fare increases.
ness [149], [150]. Other works are related to the special Meanwhile, travel responses are influenced by many factors,
cases in timetable optimization, such as cyclic railway timetabling [124] and
scheduling for loop lines [123].

C. Policy Applications

In this section, we collect various works related to land use, financial
applications, and data privacy with transit big data. Public transit services
can affect long-term land use patterns, such as how people select their home
and business locations; on the other hand, land use patterns further influence
the demand for transportation related to travel distances. Recognizing that the
public transportation/land-use relationship is extremely complex, Polzin [151]
conducted an in-depth analysis to better understand this relationship in terms
of accessibility improvements, complementary policies, and momentum and
promotion. Johnson [152] pointed out that there are three main approaches to
enhancing transit ridership through land use planning near transit corridors:
increasing residential density near transit corridors, mixed land use, and
retail development. In addition, Ratner and Goetz [153] focused on how
transit-oriented development (TOD) reshapes land use and urban form throughout
the entire Denver region and found that transit stations attract different
types of land use and development based on their urban locations. For the
sustainability of mass rail transit (MRT), Li et al. [154] presented a TOD
planning model to find the optimal schemes of land-use type and land-use
density in the planned region around Shenzhen Metro Line 3 in China. After
reviewing a number of papers related to BRT development and land use,
Stokenberga [155] concluded that it is still not clear whether BRT has improved
accessibility for

not only price but also age, gender, income, day of week, time of day and trip
purpose [159]–[161]. The new ticket scheme should consider several effects and
provide more flexible fare schemes for different purposes, such as accumulated
discounts for commuters, early-bird discounts for elderly passengers, and daily
passes for visitors.

Advances in technology make it possible to share detailed individual travel
information. This information provides great benefits, including improved
services for customers and increased revenues and decreased costs for
businesses. However, it has also raised important issues such as the misuse of
personal information and loss of privacy [162]. Chen, Fung and Desai proposed
an efficient data-dependent differentially private transit data sanitization
approach based on a hybrid-granularity prefix tree structure [163].

As a short summary, there are a few common research topics between bus transit
and rail transit, such as network accessibility, route choice, activity purpose
analysis, and timetable optimization and rescheduling. However, big data
applications still vary by mode due to their specific characteristics. In metro
systems, one-pay tickets with origin and destination recorded in smart cards
lead more research to focus on temporal and spatial demand pattern analysis,
and seamless transfers lead to more studies on passenger transfer estimation
and first/last train connection. On the other hand, in urban bus systems, the
unrecorded trip destination results in more research on trip origin-destination
demand estimation and trip chain analysis.
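
To make the trip-chain idea mentioned in the summary concrete, the following is a minimal sketch, not the method of any cited work. It assumes an open system that records only boardings, and it simplifies the usual trip chaining assumption by taking the next boarding stop of the same card as the destination of the current trip, with the last trip of the day returning to the first boarding stop. The record fields (`card_id`, `time`, `stop`) and the function name `infer_destinations` are hypothetical; practical implementations typically match the nearest stop on the boarded route within a walking-distance threshold, which is omitted here.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Boarding(NamedTuple):
    card_id: str  # anonymized card identifier (hypothetical field)
    time: int     # boarding time, minutes after midnight (hypothetical field)
    stop: str     # boarding stop name (hypothetical field)


def infer_destinations(boardings):
    """Return (card_id, time, origin, inferred_destination) per boarding.

    Simplified trip chaining for one day of data: the destination of a
    trip is assumed to be the next boarding stop of the same card; the
    last trip of the day is assumed to end at the first boarding stop.
    The destination is None when no chain can be formed.
    """
    by_card = defaultdict(list)
    for b in boardings:
        by_card[b.card_id].append(b)

    trips = []
    for card_id, taps in by_card.items():
        taps.sort(key=lambda b: b.time)
        for i, tap in enumerate(taps):
            dest: Optional[str] = None
            if len(taps) > 1:
                # Wrap around so the last trip chains back to the first stop.
                candidate = taps[(i + 1) % len(taps)].stop
                # A trip cannot end where it started.
                if candidate != tap.stop:
                    dest = candidate
            trips.append((card_id, tap.time, tap.stop, dest))
    return trips
```

For example, a card that boards at "Home" in the morning and at "Work" in the evening yields a Home-to-Work trip and a Work-to-Home trip, while a card with a single boarding is left without a destination.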
V. FUTURE RESEARCH DIRECTIONS

Transit big data applications have evolved rapidly over the past three decades,
fueled by diverse technologies ranging from the traditional smart card
technique to smartphone data, while providing rich data resources for research.
Characterized by inherent applications and detailed passenger travel patterns,
the field has spawned a vast body of literature that encompasses a broad gamut
of research directions. While previous research has rapidly advanced the
understanding of passenger behavior and operation adjustment, there are still a
number of challenges and trends worth investigating in the future.

• Accurate transit demand acquisition with induced demand: One of the public
transit target markets is the commuter, but some passengers may choose to use a
car or shared mobility for their commute because of possibly low-quality
transit services. In this case, transit data sets record only the passengers
who take transit, who account for only part of the total transit demand. Once
multiple transportation datasets are merged, it is possible to identify
potential passengers who could be attracted from other transportation modes by
analyzing the spatial and temporal distribution within each individual's
completed trip chain. Meanwhile, smartphone communication datasets also make it
possible to identify individual passengers who work on weekends or work very
late. The potential transit demand can be obtained by considering the social
characteristics of passengers, such as income and workplace. In addition, it is
worth studying the interaction of travel demand and transit network structure
as a closed loop [164].

• Data visualization: Compared with traditional statistical analysis, data
visualization provides a more direct view of passenger travel patterns and
passenger travel clustering [165], [148]. Long-term cumulative multi-source
data make it possible to obtain the variance for passengers and vehicles during
the week or within one year [166], [167]. Visualizations clarify the spatial
and temporal changes in passenger flow and loading factor in the network under
different conditions (e.g., a regular day versus a national holiday, a sunny
day versus a rainy day). In this respect, as stated above, data visualization
tools with GIS and GPS will become a hot topic in the next few years.

• Real-time information and rescheduling: AFC and AVL data can be packaged and
sent to the main server every 15 min to satisfy the requirements of data
analysis and operation optimization. In addition, when a signal failure or
accident happens in the transit system, it becomes feasible to inform
passengers and obtain real-time station and network operation conditions to
reschedule the timetable [168], [169]. Meanwhile, real-time operation
information will help passengers choose a better pre-trip or en-route travel
strategy.

• Cross-validation and data sample size: Most transit data, although containing
a lot of detailed passenger travel information, are not designed to capture all
information about passengers, such as socioeconomic information and trip
purpose. To better understand passenger behavior, some researchers have already
merged household survey data with transit data to detect passengers’ social
groups and travel patterns [68], [170], [171]. Beyond household surveys, there
will be more interaction with land use [172], [173] and census data to better
reveal passenger behavior for long-term and short-term public policy decisions
and city planning. At the same time, the number of data samples required to
analyze passenger behaviors is another tough research question.

• New analysis methodology: More and more research focuses on agent-based
behavior and disaggregated models and approaches due to the availability of
individual travel information. Some new analysis methods from computer science
and statistics have been applied to transit data analysis, such as machine
learning and data fusion clustering [174]. There will be more new analysis
methods, especially for variation and clustering studies, along with
connections to other disciplines and fields, such as data mining and artificial
intelligence.

• Shared Mobility: Transit systems provide only stop-level service in the city.
Transportation service from the trip origin to the transit stop is also
necessary from the perspective of multi-modal transportation systems. The First
Mile Last Mile (FMLM) challenge garners significant attention as a means to
assess the accessibility of the first leg to public transit and the last leg
from transit [175]. In recent years, shared mobility, including car sharing,
personal vehicle sharing (peer-to-peer car sharing and fractional ownership),
bike sharing and customized bus, has proliferated in global cities not only as
an innovative transportation mode enhancing urban mobility but also as a
potential solution to address first- and last-mile connectivity with public
transit [176]. The research challenges and hot topics of shared mobility are as
follows.

1) Bike sharing: In recent years, bicycle-sharing programs, such as CityCycle
[177] and NiceRide, have received increasing attention with initiatives to
increase bike usage, better meet the demand of a more mobile population, and
lessen the environmental impacts of our transportation activities [178]–[180].
China is taking a leading role in both public bike share and private electric
bike (e-bike) growth. Based on bike trajectories and the factors affecting bike
share demand, such as distance, temperature and user heterogeneities, Campbell
et al. [181] analyzed the viability of deploying large-scale shared e-bike
systems in China. While shared bikes provide better accessibility to transit
systems, they have caused some serious social problems such as unregulated bike
parking and road occupancy by broken bikes. It is possible to draw on the
available transit data to optimize parking spots so as to normalize user
parking behavior, and to forecast the proper bike demand to avoid capacity
waste.

2) Customized transit service: Feeder bus or customized transit service
provides the personalization and flexibility that travelers need to access or
egress from a bus or rail ‘trunk line’, especially for commuters, whereas
public transit is often constrained by fixed routes, driver availability, and
vehicle scheduling [132], [182], [183]. Big data in transit systems provide
travel information such as demand, activity, passenger category and travel
pattern. These characteristics offer a better picture of the customer persona,
which helps researchers optimize bus stop locations, routes, timetables, and
passenger-to-vehicle assignments to provide better feeder transit. Tong et al.
developed a joint optimization model, providing flexible public transportation
