Geo-Friends Recommendation in GPS-Based Cyber-Physical Social Network
Geo-Friends Recommendation in GPS-Based Cyber-Physical Social Network
Geo-Friends Recommendation in GPS-Based Cyber-Physical Social Network
Abstract—The popularization of GPS-enabled mobile devices football game, or book club. Geo-friends have a much higher
provides social network researchers a taste of cyber-physical probability to participate these real life events than other
social network in advance. Traditional link prediction methods friends from virtual social network.
are designed to find friends solely relying on social network
information. With location and trajectory data available, we can Following example demonstrate the idea of recommending
generate more accurate and geographically related results, and geo-friends in web-based social network.
help web-based social service users find more friends in the real Example 1: Alex wants to find some new geo-friends to
world. Aiming to recommend geographically related friends in join him in a local charity event. There are three candidates:
social network, a three-step statistical recommendation approach Bob who shares a large number of friends with Alex, but lives
is proposed for GPS-enabled cyber-physical social network. By
combining GPS information and social network structures, we in another country; Carlos who works in the same company
build a pattern-based heterogeneous information network. Links with Alex, but shares no similarity in terms of social network
inside this network reflect both people’s geographical infor- structure; David who shares couple of common friends, and
mation, and their social relationships. Our approach estimates also go to the same gym, same comic book store as Alex does
link relevance and finds promising geo-friends by employing every week.
a random walk process on the heterogeneous information net-
work. Empirical studies from both synthetic datasets and real-
After analyzing both social structure and recently collected
life dataset demonstrate the power of merging GPS data and GPS data, social network services should recommend David
social graph structure, and suggest our method outperforms as Alex’s geo-friends, since he has a higher probability to
other methods for friends recommendation in GPS-based cyber- participate the local event with Alex.
physical social network. Previous approaches of link prediction which usually only
rely on social network structure would recommend Bob. But
I. I NTRODUCTION
apparently, Bob is not a good candidate for Alex’s social
Popular social network services like Facebook and Twitter events, since he lives in another country. Also, solely relying
allow users to store and share both digital information, e.g., on location or trajectory information for geo-friend finding
web content, and also locations and trajectories collected from does not work as well. Carlos who has a very high positive
the real world. Analyzing these extra data from physical world geographical correlation with Alex shares no social interests
can help us better understand people’s daily activities, social with Alex. Recommending Carlos to Alex is pointless as well.
areas and life patterns. Social network with data collected from In this paper, we propose a a three-step approach, named
sensors is usually refered as Cyber-Physical Social Network GEo-Friends Recommendation framework (a.k.a., GEFR).
[3]. In this paper, we study friend recommendation problem First, interesting and discriminative GPS patterns are extracted
in cyber-physical social network. With location and trajectory from a large amount of raw GPS data. Then we combines
information available, we improve the accuracy of the results both geo-information and social network in a pattern-based
and make on-line social services much closer to users’ real heterogeneous information network. By applying random walk
life. to reproduce friends making process on the network, we can
One major difference between virtual web-based social net- effectively identify potential geo-friends for a specific user.
work and real life social network is, new friends in real world The contributions of this paper are summarized as follows:
tend to be geographically related. Geographical similarity is • Propose geo-friends recommendation problem, and dis-
hiding in users’ recently GPS data. To help web-based social cuss the differences from previously studied link predic-
network users find more friends in their real life, we define tion problem.
potential real life friends, who have both social similarities • Define and generate a set of GPS patterns to describe
and geographical correlation as Geo-Friends, and denote Geo- people’s real life social interaction and correlation.
Friends Finding Problem as real life friends discovery on web- • Propose a random walk-based statistical framework for
based social network. geo-friend recommendation (GEFR).
The reason why we want to isolate geo-friends from general • Design and conduct a series of experiments on both
web-based social network friends is intuitive. Geo-friends play synthetic and real-world datasets. Demonstrate the power
an important role in off-line social events, e.g., holiday party, of GEFR in various situations.
II. BACKGROUNDS AND P RELIMINARIES A. GPS Pattern Extraction
We briefly introduce the related data model, and define geo- Most GPS applications use raw GPS data directly, e.g.,
friends finding problem in context. storage or visualization. However, raw GPS data are in huge
size and hard to provide people with a semantic understanding
A. Data Model of human behavior, In order to better understand the hidden
GPS data are continuous in both spacial and temporal information, we first present four different heuristics on geo-
dimensions. However, different devices have different data graphically correlation based on empirical observations.
sample rate, which leads to shifting time intervals. Without Common Location: GPS locations can reflect people’s in-
losing generality, we assume constant sample rate within one terests, and people tend to go to their interests related locations
application. more often. If two people share common locations, which
Definition 1 (GPS Trajectory): A GPS trajectory can be suggests they might share common interests, the probability
generated by sequentially connecting GPS records of a specific that they become friends would be higher.
user following the ascending order of timestamps. We denote Common Routine: GPS trajectory segments indicate peo-
(1) (2) (n)
GPS trajectory for a person j as Sj = hsj , sj , ..., sj i, ple’s habits and routines. People who share similar routines,
(i)
where sj is a GPS location record. tend to become friends.
Definition 2 (GPS-CPN): A GPS-Based Cyber-Physical Meeting: If two people share same locations at the same
Social Network can be defined as G(S, V, E), similarly to timestamps in their GPS trajectory, they should be geograph-
social network, V is the set of vertices, represents all the ically related.
people in the network. E is the set of edges, represents all Hanging Out: Two people share same routine in a specific
the links between people. S is the set of GPS trajectories, and time period, which indicates they are hanging out in that
each trajectory in S is associated with a specific person in V , time period. If two people hang out, the probability of they
represents this person’s movements. becoming geo-friends would be higher.
Based on above empirical observations and heuristics, we
propose four different GPS patterns to capture these infor-
mation. We first convert raw GPS trajectory dataset S to
User
Layer
User
Layer
categorical dataset Scat , and sequential dataset Sseq 1 . In Scat ,
we simply discard temporal information and keep discretized
locations in a unordered manner. While in Sseq , locations are
Pattern A Pattern C still sequentially connected by the order of timestamps. With
GPS GPS
Data Patterns
Pattern B
categorical dataset Scat , sequential dataset Sseq and original
(a) Cyber Physical Social Network (b) Pattern-Based Heterogeneous In- GPS dataset S, we define GPS patterns as follow.
formation Network Definition 3 (FL-Pattern): Closed frequent patterns with
Fig. 1. GPS Networks support ≥ 2 in Scat is defined as Frequent Location Patterns
(a.k.a., FL-Patterns), following Common Location heuristic.
Notice that, people who carry GPS devices is a subset of Definition 4 (FT-Pattern): Closed sequential pattern with
vertices V in this network, so |S| ≤ |V |. An example of GPS- support ≥ 2 and length ≥ 2 in Sseq is defined as Frequent
CPN can be found in Figure 1(a). Trajectory Pattern (a.k.a., FT-Pattern), following Common
Routine heuristic.
B. Problem Definition Definition 5 (FLT-Pattern): For each FL-Pattern, if loca-
Intuitively, geo-friends recommendation is trying to find tions share the same timestamp in all corresponding GPS
potential real life friends on web-based social network. With trajectories, and no super-pattern with the same support can
data model defined above, we formally define this problem as: be generated by adding another time constrained location,
Given a GPS-CPN G(S, V, E), and a specific query person v∗, this pattern can be defined as Frequent Location with Time
one method should return a ranked list of people nodes in V Constraint Patterns (a.k.a., FLT-Patterns), following Meeting
and also for each element v ′ in the list, hv ′ , v∗i ∈
/ E. What’s heuristic.
more, the ranking score in the process should consider both Definition 6 (FTT-Pattern): Similarly to FLT-Pattern, Fre-
GPS trajectory S and social network (V, E). quent Trajectory with Time Constraint Pattern (a.k.a., FTT-
Pattern) can be defined as closed sequential pattern with
III. G EO F RIENDS F INDING F RAMEWORK support ≥ 2 and length ≥ 2 in Sseq and it shares the
This section describes the three-step geo-friends recommen- same time period in corresponding GPS trajectories, following
dation framework (GEFR) in a given cyber-physical social Hanging Out heuristic.
network, including GPS pattern extraction, pattern-based het-
1 GPS data discretization process are related to GPS error analysis and actual
erogeneous information network building and random walk
GPS coordinates. GPS datasets in the experiments of this paper have already
on the network. Details of the three-step approach will be been discretized and preprocessed. For detailed methods, please refer to [4]
presented in the following subsections. and [5].
To mine FL-Patterns, we apply FP-Growth [6] on Scat and can be found in Equation 1.
generate closed frequent patterns with support ≥ 2. After j=i k=L
FL-Pattern generation, there are two methods to generate FT-
X X
argmaxi = − pj log(pj ) − pk log(pk ) (1)
Patterns from Sseq . 1) Directly apply PrefixSpan [10] on Sseq , j=0 k=i+1
and extract all closed sequential frequent patterns, or 2) first
where pj = nj / a=i
P Pb=i
calculate the permutation set of each FL-Pattern, generate a=0 na , pk = nk / b=0 nb and n is the
combination set for each permutation, and then simply check pattern frequency.
each combination against Sseq , to make sure ascending times- We apply this measure to MIT Reality Mining dataset [2].
tamp order still hold. By collecting combinations in ascending As presented in Figure 2, after calculation of entropy-based
timestamp order, FT-Pattern set could be generated as well. thresholds, we first select patterns in support histogram with
Method 2 could be more efficient if FL-Pattern set is small, threshold 13 and 18, and then filter out patterns with length
and we used Method 2 in our experiments. lower than 4. This pattern selection strategy provides similar
FLT-Patterns can be generated based on FL-Patterns. We results as our manual parameter tuning experiments.
first calculate the combination set for each FL-Pattern, and
5
10 350
check the same timestamp constraint in each combination
300
in the raw GPS dataset. By collecting combinations holding 4
10
250
Num of Patterns
Num of Patterns
same timestamp constraint, FLT-Pattern set can be generated 3
10 200
10 0
constraint, so it is necessary to calculate the combination 0 10 20
Support
30 40 0 2 4
Length
6 8 10
set before check the time constraint. FTT-Patterns can be (a) Support Count Histogram (b) Length Count Histogram
generated from FT-Patterns in a similar way.
Fig. 2. GPS Pattern Selection using Entropy Measure
B. Building Pattern-Based Information Network After the construction of the heterogeneous information
network, edge weights between nodes need to be defined
By combining GPS patterns generated in last subsection, before random walk process. By adding filtered pattern nodes
and the given social network, we can build a pattern-based into social network, edge set E ′ now contains three types of
heterogeneous information network H(P, V, E ′ ) as follows. edges, including, edges from GPS pattern node p to person
Given GPS-CPS G(S, V, E), we first discard raw GPS trajec- node v, edges from person node v to pattern node p, and also
tory set S. For each GPS pattern, we create an additional node edges from person node v to person node v ′ . No edges are
p, and link corresponding person node v with p if this GPS defined between GPS patterns. We first define edge weights
pattern exists in person v’s GPS trajectory history. And then from different types of GPS pattern nodes to person nodes.
create a new edge hv, pi, and add it to E ′ . Notice that, edge We use w( v, p) to represent raw edge weights, which will be
set E ′ in heterogeneous information network contains three normalized before random walk process.
types of edges, which are edges between people, edges from
person nodes to pattern nodes, as well as edges from pattern log (1 + length(p))
wF L (v, p) = ·α (2)
nodes to person nodes. |N bp (v)|
An example of GPS pattern-based heterogeneous informa-
tion network is presented in Figure 1(b). log (1 + length(p))
wF T (v, p) = ·β (3)
One needs to notice that, adding a large number of GPS |N bp (v)|
patterns without selection, can decrease the performance of log (1 + length(p))
our method badly. This can be explained from different per- wF LT (v, p) = ·γ (4)
|N bp (v)|
spectives. 1) High frequency patterns usually indicate common
public locations people can all related to, e.g., a bus station,
log (1 + length(p) · timespan(p))
or a hospital. These locations do not carry any discriminative wF T T (v, p) = ·θ (5)
GPS information, and they do not follow any heuristics we |N bp (v)|
mentioned above. 2) The number of low length and support where N bp (v) denotes the set of pattern nodes connecting to
patterns are gigantic. Adding such patterns into the network person node v, length(p) denotes the length of pattern p, and
will lead to a substantial increase of link search space for timespan(p) denotes time span of a time constraint pattern
random walk process, which will reduce both the efficiency p, in terms of number of timestamps. Parameter α, β, γ and
and precision. θ controls pattern importance, and setting will be discussed in
Instead of manually refine patterns, we employ an entropy- next section.
based thresholding measure similar to [7] to filter out pattern We define edge weights starting from pattern nodes to
with high or low support and length. Threshold calculation person nodes as follows.
Algorithm 1: Geo Friends Recommendation Framework
1
w(p, v) = (6) Input: A GPS-based cyber-physical social network
|N bv (p)|
G(S, V, E), where S is the GPS trajectory
where N bv (p) denotes the set of person nodes connecting to dataset, V is person node set, and E is edge set.
pattern node p. A recommending target person v∗. Number of
The third type of edge are from person nodes to person recommended friends K. Parameters α, β, γ and
nodes, we define the edge weight from person node v to its θ for edge weights definitions, and λ for random
neighbor v ′ as: walk process.
1 Output: A ranked list of recommended geo-friends
w(v, v ′ ) = (7)
|N bv (v)| 1, Generate FL-Patterns, FT-Patterns, FLT-Patterns and
FTT-Patterns based on Definitions 4 to 7;
where N bv (v) denotes the set of person nodes connected to
person node v. 2, Filter out biased patterns and shrink link search space
From above weight definitions for edges from person nodes by using Equation 1;
to pattern nodes, one can notice that, edges weights from 3, Construct heterogeneous information network
person nodes to pattern nodes are various based on pattern H(P, N, E ′ ) by adding pattern nodes into GPS-based
types, and pattern attributes, including length and timespan. cyber-physical social network, link pattern nodes with
Parameters α, β, γ and θ are designed to adjust the im- corresponding person nodes, and define edge weiths
portance of different types of pattern in different scenarios. following Equations from 2 to 7;
Edge weights from pattern nodes to person nodes, and from
4, Generate transition probability matrix and normalize
person nodes to person nodes are solely based on number of
each column following Equation 8;
neighbors. If number of people neighbor is smaller, the edge
(t)
weight will be larger. 5, Iteratively update RN using Equations 10 or 11
regarding input query node v∗ until it converges;
C. Random Walk on Heterogeneous Information Network
6, Rank link relevance based on the results from last
Random walk process on graph has been widely used in step, and return the top-k nodes as the final results;
social network analysis [15] [11] [17]. To apply random walk
on GPS pattern-based information network, we first need to
define a transition probability matrix to describe all transition
probability on the edge set of H. To represent all possible
transitions on H, the size of the matrix should be (|V | + P r(V ) P r(A)
|P |) · (|V | + |P |). We sort all person nodes based on their ID P r(H) = (9)
in GPS trajectory dataset, and then append a sorted pattern P r(B) 0
nodes following the order they were extracted. By combining
where P r(V ) is an |V | × |V | matrix representing the tran-
equations 2 to 7, we can generate a transition weight matrix
∗ sition probability between person nodes to person nodes, as
P r(H) .
defined in Equation 7. P r(V ) is not a symmetric matrix, since
One thing we need to notice is that, by adding additional
transition probability between two people nodes are defined
pattern attributes into edge weights definitions, including
based on number of neighbors of the starting node. P r(A) is
length and timespan, patterns are more semantic but the sum
a |P | × |V | matrix representing the transition probability from
of weights on out going edges of person nodes is no longer
GPS pattern nodes to person nodes, as defined in Equation
equals to 1, we need to normalize edge weights of pattern
∗ 6. P r(B) is a |V | × |P | matrix representing the transition
nodes in transition weight matrix P r(H) to get transition
probability from person nodes to GPS pattern nodes, as defined
probability matrix P r(H) of the heterogeneous information
in Equation 2, 3, 4, 5 and 8.
network, following Equation 8.
We choose random walk with restart process to simulate
w(v, n) geo-friends finding process, and estimate link relevance for a
pr(v, n) = P (8) specific query person v∗ in heterogeneous information network
m∈N b(v) w(v, m)
for several reasons. This statistical approach is stable and
where n is a node connected to person node v, and N bv denote robust in different datasets, and also random walk with restart
all nodes connected to person node v. No normalization is process satisfies some basic heuristics for identifying geo-
required for pattern nodes, because the weight on out going friends.
edges of pattern nodes only depend on the number of person First we present a list of GPS pattern utilizing heuristics,
node neighbors, the sum of which is already 1. We denote which random walk process can simulate:
normalized matrix as transition probability matrix P r(H) . (1) If a GPS pattern contains more geographical information
For ease of presentation, we simplify the representation of or has a stronger semantic meaning, the in-coming probability
P r(H) as from person nodes to this pattern should be higher, which
increases the probability from one person to another via this 1) Datasets: We generate 4 synthetic datasets with different
GPS pattern. sizes, attributes and distributions in order to cover differ-
(2) If two people share more GSP patterns, the overall ent scenarios and thoroughly test our framework. Synthetic
probability for one person link to another via these GPS pattern datasets are designed to cautiously simulate different types of
nodes would be higher. relationships and communities, including, friends with high
(3) If one GPS pattern is rare, the out-going probability of positive correlation in GPS histories, friends with different
this node would be larger, so that people connected to this geographical social areas, neighbors who scatter in the whole
pattern would have a higher probability to be linked together. social social network with high positive geographical cor-
Then we discuss social network structure heuristics which relation, friends within certain community, friends between
random walk with restart process can simulate: communities, as well as a 10% noisy links adding to the whole
(1) Two people who share large number of common friends social graph, etc.
tend to be connected together, since the sum of transition Each synthetic dataset contains two parts, a social network,
probability between these two person nodes is higher. and a set of GPS history for each person, with 2-month
(2) If person nodes are close to each other in the network in timespan. The GPS datasets are designed by adding different
terms of number of hops, the probability of connecting them types of GPS patterns on people’s random movements.
together is larger. We also attempt to apply our method on MIT Reality
One can notice that, random walk with restart process Mining dataset [2], which is collected from a project con-
satisfies all above heuristics; hence this method is a good fit ducted from 2004 to 2005 in MIT Media Lab. The positioning
for geo-friends recommendation in heterogeneous information method in this dataset is based on cellular tower estimation
network H(P, E, V ′ ). We denote the query person as v∗. The instead of GPS. Due to this limitation, we can not directly
random walk process can be represented as: mine accurately mine GPS patterns as we defined in previous
sections, instead, we accommodate this dataset by defining
(t)
RN = (1 − λ)P r(H) RN
(t−1) (0)
+ λP r(N ) (10) approximate FT-Pattern and FLT-Pattern. Approximate FP-
Patterns in real dataset are generated by cellular tower IDs,
where RN is a vector, that represents the link relevance from while approximate FLT-Patterns are generated based on Blue-
(t)
all the nodes in H to query person v∗, and RN represents tooth history. Blue-tooth is actually a more reliable technique
th in terms of identifying and positioning other subjects within
the link relevance of each node at the t iteration . We assign
(0) certain radius. Also, to match GPS history time span of
Pn (v∗) = 1 where v∗ is the query nodes, and all the other
elements to 0. Similar to represent matrix P r(H) , we can re- synthetic datasets, we only use two month’s GPS data (Jan
write RN = [RV RP ]T , based on Equation 10, we have: 2005 and Feb 2005) in the real dataset for each subject in the
network.
(t)
A more detailed summarization of all the datasets can be
RN found in Table I. And also we present graph visualization
(t−1) (t−1)
" # " #
(0)
P r(V ) × R(V ) + P r(A) × R(P ) RV results of synthetic dataset gpsnet120 as well as the real dataset
= (1 − λ) (t−1) +λ (0)
P r(B) × R(V ) + 0 RP in Figure 3(a) and 3(e). From these figures, one can notice that,
(11) the social network structure of the real dataset is relatively
Based on Equation 11, we can iteratively update RN until sparse compared to the synthetic dataset. Although social
it converges. By ranking link relevance scores in R(V ) , top- network parts of synthetic datasets have been disarranged by
k nodes can be generated easily. Overall framework can be adding certain amount of noisy edges, based on observation,
found in Algorithm 1. we still can roughly find different communities on the overall
network.
IV. E XPERIMENT
And also, we should mention that, to demonstrate the idea of
In this section, we test our methods against several com- the power of our framework GEFR on identifying geo-friends,
petitor approaches on both synthetic and real datasets. major links in our synthetic datasets are geographically related.
2) Competitor Methods: To demonstrate the effectiveness
A. Experiment Setup of our results, we also build and apply a set of baseline
methods on all the datasets. These baseline methods including
TABLE I popular method widely used in on-line social network services.
D ATASETS S UMMARIZATION
These methods include:
Datasets Graph Size # GPS Pattern # Communities Random: random selection. Same Edge: choose friends
(|V |, |E|) based on number of same friends. GPS Similarity: choose
gpsnet120 120, 435 108 6 friends by measuring GPS location and trajectory similarity.
gpsnet240 240, 1441 192 6
Random Walk without GPS Patterns: Recommend friends
gpsnet600 600, 3583 576 12
gpsnet1200 1200, 18556 1170 15 by applying random walk with restart on the original social
mit reality 106, 88 FT: 49, BT:106 NA network.
For real dataset, we also add another baseline method named
Bluetooth. Bluetooth technique can capture people meeting in GPS trajectory histories is not a crucial friends identification
frequency very accurately. So in this method, we recommend criterion. However, in general, friends still share similar GPS
friends by returning people who share high meeting frequency. trajectory than non-friends after all. This explains the increase
The recommended friend list by this method is sorted by the performance in larger result sets. This method also performs
meeting frequency detected by bluetooth devices. badly in real dataset, however, as we mentioned before, users’
3) Evaluation: We use 4-fold cross validation method to locations in MIT Reality dataset are not directly captured
finally evaluate performances of different methods. Since the by GPS device, instead they are estimated by cellular tower
problem we study in this paper is friend finding, the search positions, which can be very inaccurate. This could be the
space for this problem should be the overall link space. major reason why GPSSim failed in real dataset.
To apply 4-fold cross validation on our datasets, we first While SameEdge and GPSSim perform poorly in real
need to randomly partition the overall link space into 4 subsets, dataset. Bluetooth method provides acceptable results on both
with corresponding people nodes as well as GPS trajectory precision and recall measures. And also, Precision@1 of
histories. And test each of the subset with all of the methods, Bluetooh is much lower than Precision@1 of GEFR (17.6%
while using the other 3 subsets as training dataset. For each vs. 75%), while Precision@10 of Bluetooh is much closer to
test set, we sample 50 people nodes from the corresponding Precision@10 of GEFR (6.48% vs. 12.76%). This again is a
person node space, and use them as queries. To estimate the good evidence of existence of neighbor relationship.
recommending results, if the recommended link exists in the RWR (Random Walk with Restart) method finds friends by
testing set, we denote this recommendation as a hit, otherwise using Random Walk with Restart process on social network to
it would be denoted as a miss. By counting hits and misses, we estimate link relevance. By simulating friends finding process
calculate precision, recall as well as precision recall curve to on the graph, RWR gives a consistently good performance
finally estimate the overall performance of different methods on all synthetic datasets and real dataset on both precision
on both synthetic and real dataset. and recall. This indicates graph structure information plays
an important role in friends recommendation process and we
TABLE III
P ERFORMANCE ON R EAL DATASETS need more adaptive and suitable method like RWR to capture
enough information.
Methods P@1 P@5 P@10 R@10 R@20 Compared with the baseline methods, our method (GEFR)
Random 0.0132 0.0168 0.0168 0.0949 0.1760 gives a significantly better performance on all synthetic
SameEdge 0.0827 0.0659 0.0406 0.2513 0.3034 datasets on both precision and recall measures. These re-
GPSSim 0.0985 0.0394 0.0288 0.2266 0.3542
sults are much better than any method solely using only
BLT 0.1760 0.0880 0.0648 0.4520 0.6080
RWR 0.7419 0.1984 0.1056 0.7688 0.7809 GPS information (GPSSim) or graph structure (SameEdge or
GEFR 0.7500 0.2016 0.1276 0.8011 0.8696 RWR). This indicates, by combining GPS information and
graph topological information and analyze in a heterogeneous
information work fashion, GEFR can successfully extract
B. Results discriminative patterns and useful information hidden in dif-
Precision and recall results of different methods can be ferent data sources, and increase performance by mutually
found in Table II and III. SameEdge method performs rel- reinforce of different information. One might challenge that,
atively consistent on precision measures in all synthetic Precisions at different positions of GEFR is somehow close to
datasets, which suggests, by simply counting the same num- corresponding measures of RWR. We hereby studied the pre-
ber of edges, SameEdge can capture partial link relevance cision and recall curve between GEFR and RWR on synthetic
information in graph structure. However, recall of SameEdge dataset gpsnet120. As presented in Figure 3(d), on average,
decreases with the graph size increasing (72.54% in gpsnet120 at the same recall level, GEFR surpasses RWR about 20%
vs. 38.99% in gpsnet1200). This suggests, SameEdge method on precision. Specially, when recall is 90%, the precision of
works better in smaller and relatively sparse social network. GEFR is around 60% while the precision of RWR is only
When the social network grows larger, more sophisticated 25%. This result provides good evidence when comparing
methods are needed to help users find friends. However, performances of these two methods on synthetic dataset.
SameEdge performs poorly on the real dataset (Precision@1 However, the precision and recall measures of these two
is 8.27% and Recall@20 is 30.34%). Because the average methods on real dataset are very close. Similarly, we studied
number of edge in the real dataset is much lower than the the precision and recall curve of these two methods, the differ-
synthetic datasets, graph structure is not able to carry enough ences of which are not as significant as in synthetic datasets,
link relevance information, which leads to a disappointing but GEFR is also better than RWR. One possible explanation
performance in this dataset. is, as we mentioned before, locations in this datasets are
GPSSim method in synthetic datasets gives bad performance represented by cellular tower IDs instead of actually GPS
if user only requires a small number of new friends, however, positions, so we can only mine approximate GPS patterns
performance increases on both precision and recall if user based on tower IDs and bluetooth data. Although we try to
queries more results. This observation reveals the common ex- filter and select good patterns after mining, the overall quality
istence of neighbor relationship. Having a positive correlation of GPS patterns we are using in the information network
TABLE II
P ERFORMANCE ON SYNTHETIC DATASETS
Precision
Precision
Recall
0.6
0.5
0.4 0.4
0.4
0.2
0 0 0.1
P@1 P@3 P@5 P@10 R@3 R@5 R@10 R@20 0.5 0.6 0.7 0.8 0.9 1
Position Position Recall
(a) gpsnet120 dataset social network (b) gpsnet120 dataset precision (c) gpsnet120 dataset recall (d) gpsnet120 dataset precision-recall
curve
0.8 1
Random 1
GEFR
0.7 SameEdge
GPSSim RWR
0.8 0.8
0.6 RWR
GEFR
0.5 BlueTooth
Precision
0.6
Precision
0.6
Recall
0.4
0.2
0.2 0.2
0.1
0 0 0
P@1 P@3 P@5 P@10 R@3 R@5 R@10 R@20 0.7 0.75 0.8 0.85 0.9 0.95
Position Position Recall
(e) mit reality dataset social network (f) mit dataset precision (g) mit dataset recall (h) mit dataset precision-recall curve
estimate link relevance and correlation by applying probabilis- to reproduce and distribute reprints for Government purposes
tic inference [8]. Inference based models are usually hard to notwithstanding any copyright notation here on.
interpret while training process is usually very time consuming
R EFERENCES
and not scalable. All these methods are designed to find co-
authorship or online friends, while none of them can detect [1] X. Cao, G. Cong, C. S. Jensen. Mining Significant Semantic Locations
From GPS Data In PVLDB, 2010.
geo-friends with compatibility of GPS data analysis. [2] N. Eagle, A. Pentland, and D. Lazer. Inferring Social Network Structure
using Mobile Phone Data In Proceedings of the National Academy of
VI. C ONCLUSIONS Sciences, 106(36), pp. 15274-15278, 2009.
[3] R.K. Ganti, Y. Tsai, and T.F. Abdelzaher. SenseWorld: Towards Cyber-
We propose the problem of identifying geographically re- Physical Social Networks. In International Conference on Information
lated friends, and also a three-step statistical framework which Processing in Sensor Networks, pp.563-564, 22-24 April 2008.
combines geo-information with social analysis. [4] GPS Error Sources: http://www.kowoma.de/en/gps/errors.htm
[5] GPS Distance: http://en.wikipedia.org/wiki/Great-circle distance
We first capture different types of GPS information by [6] J. Han, J. Pei, Y. Yin, R. Mao. Mining frequent patterns without
defining and generating four types of GPS patterns from GPS candidate generation: A frequent-pattern tree approach. In Data mining
history data. Then, we build a pattern based heterogeneous and knowledge discovery, 2004.
[7] J.N. Kapur, P.K. Sahoo and A.K.C. Wong. A new method for gray-level
information network, and defined transition probability matrix picture thresholding using the entropy of the histogram In Computer
following GPS pattern definitions. By applying random walk Vision, Graphics, and Image Processing, March 1985.
process on this information network, link relevance between [8] H. Kashimaoki and N. Abe. A Parameterized Probabilistic Model of
Network Evolution for Supervised Link Prediction In Proc. of ICDM,
different nodes could be estimated, and potential geo-friends 2006.
would be recommended to a specific query person. [9] D. Liben-Nowell and J. Kleinberg. The link prediction problem for
Interesting future work includes, domain-oriented GPS pat- social networks. In Proc. of CIKM, 2003.
[10] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M. Hsu.
tern definitions, friends recommendation based on query per- PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
son and specific interests, and also, real time friends recom- Pattern Growth In 17th International Conference on Data Engineering,
mendation by tracking user’s GPS usage on the fly. 2001.
[11] P. Pons and M. Latapy. Computing communities in large networks
using random walks. In Journal of Graph Algorithms and Applications,
ACKNOWLEDGEMENT 10(2):191–218, 2006.
The work was supported in part by U.S. National Sci- [12] L. Sha, S. Gopalakrishnan, X. Liu and Q. Wang. Cyber-Physical
Systems: A New Frontier. In Machine Learning in Cyber Trust, 2009.
ence Foundation grants CNS-0931975, IIS-0905215, the U.S. [13] Microsoft SensorMap: http://tinyurl.com/6nqem2
Army Research Laboratory under Cooperative Agreement [14] L. Tang, X. Yu, S. Kim, J. Han, C. Hung and W Peng. Tru-Alarm:
No. W911NF-09-2-0053 (NS-CTA), U.S. Air Force Office Trustworthiness Analysis of Sensor Networks in Cyber-Physical System.
In Proc. ICDM, 2010.
of Scientific Research MURI award FA9550-08-1-0265, and [15] Z. Yin, M. Gupta, T. Weninger, and J. Han. A Unified Framework for
Boeing company. The views and conclusions contained in Link Recommendation Using Random Walks. In Proc. ASONAM, 2010.
this document are those of the authors and should not be [16] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations
and travel sequences from GPS trajectories. In Proc. WWW, 2009.
interpreted as representing the official policies, either ex- [17] Y. Zhou, H. Cheng and J. Yu. Graph clustering based on struc-
pressed or implied, of the Army Research Laboratory or tural/attribute similarities In Proc. VLDB Endow., 2009.
the U.S. Government. The U.S. Government is authorized