
LIP: A LIfetime and Popularity Based Ranking Approach

to Filter out Fake Files in P2P File Sharing Systems∗

Qinyuan Feng, Yafei Dai


(Department of Computer Science and Technology, Peking University, Beijing, China)
{fqy, dyf}@net.pku.edu.cn

Abstract

P2P file sharing systems often use incentive policies to encourage sharing. With the decrease of free riders, the amount of cheating behavior has increased. Some users rename a common file with a popular name to attract downloads from other users in order to gain unfair advantages from incentive policies; we call the renamed file a fake file. While techniques have been proposed to combat fake files, an effective approach to filter out fake files in existing systems is lacking, especially before a real file comes out. In this paper, we design two detectors to identify fake files by mining historical logs and find that fake files are indeed pervasive. We introduce a metric called lifetime, which is a file's average retention time in users' computers, and show that it can be used to distinguish between real and fake files. We then propose a lifetime and popularity based ranking approach to filter out fake files. Experiments are designed with the real and fake files collected by the two detectors, and the results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

∗ Supported by NSFC (60673183) and Doctoral Funding of MOE (20060001044).

1. Introduction

P2P file sharing systems often use incentive policies to encourage sharing. Some incentive policies [1] are point based: peers are rewarded points for uploading and spend points for successful downloading, and service differentiation is used to reward and punish users for their behaviors. Such policies give downloading preference to users with higher scores and sometimes apply a bandwidth quota to the downloads of users with lower scores.

With these incentive policies, the number of free riders decreases; however, the amount of cheating behavior has increased. Some users rename a common file with a popular name to attract downloads from other users in order to gain unfair advantages from the incentive policies, and we call the renamed file a fake file. We design two detectors to identify fake files by mining historical logs and find that fake files are indeed pervasive. To make things worse, some users even publish fake files before the real file comes out.

Two kinds of reputation mechanisms have been proposed to resolve this problem. One evaluates a user's reputation from his behaviors [2], but a user's reputation sometimes differs from a file's reputation; the other evaluates a file's reputation directly [3], which can reflect the file's essential attribute.

Users tend to retain a real file longer and delete a fake file more quickly. From this phenomenon, we introduce a metric called lifetime, which is a file's average retention time in users' computers, to analyze a file's quality. We show that it can be used to distinguish between real and fake files. Based on it, we propose a lifetime and popularity based ranking approach to filter out fake files. Experiments are designed with the real and fake files collected by the two detectors, and the results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

The road map of this paper is as follows. In Section 2, we cover the related works. We design two detectors and analyze the time characteristics of files in Section 3. We then propose a lifetime and popularity based ranking approach in Section 4 and design an experiment to evaluate it in Section 5. We draw a brief conclusion and identify some directions for future research in Section 6.

2. Related Works

In the pollution measurement aspect, J. Liang et al. [4] analyzed the severe pollution launched by some companies to protect copyrights in P2P file sharing systems; we analyze the severe pollution launched by some users to gain unfair advantages from incentive policies in P2P file sharing systems, and our approach can also be used in their circumstances. U. Lee et al. [5] analyzed the time interval between download and quality checking; we find the same situation in our work. D. Dumitriu et al. [6] concluded that a user's behaviors, such as willingness to share and removing pollution quickly, have a great
impact on the effect of file-targeted attacks; our work improves these results through a quantitative analysis of users' behaviors with lifetime. N. Christin et al. [7] analyzed the differences between pollution and poisoning and their respective impacts on content availability in P2P file sharing networks; they define the deliberate injection of decoys to reduce the availability of targeted files as poisoning, and the accidental injection of unusable files as pollution. Our work finds that a user's deliberate behavior can also come from cheating incentive policies. J. Liang et al. [8] analyzed index poisoning, and our work is a complement of theirs.

In the pollution identification aspect, D. Dumitriu et al. [6] discussed a random algorithm; J. Liang et al. [8] recommended distributed blacklists; S. Kamvar et al. [2] proposed the EigenTrust algorithm; K. Walsh et al. [3] proposed an object reputation system; and J. Liang et al. [4] proposed an approach that identifies pollution automatically based on whether a file is decodable and on the warp of its duration. But some of these approaches require users to participate actively; some cannot identify fake files before a real file comes out; some need to download all or part of a file; some need a lot of system resources; and some can only identify certain types of files. Our approach calculates the retention times of files and filters out fake files automatically, so it bypasses the limitations above.

3. Analysis of Fake Files

3.1. Term Specification

• Title: the common description of a particular content, such as "Movie: The Matrix" or "Music: Yesterday Once More". We use T (Title) to express it.

• File: the sharing object in a P2P file sharing system. We use F (File) to express it. A file is composed of two parts:

  – Name: the name of a file; it should belong to a title. We use N (Name) to express it.

  – Content hash: the hash value generated by digesting the content of a file with a hash function; it is independent of the name. We use H (Hash) to express it.

• Fake file: a file whose name does not match its content hash.

• File owner: a user who has ever owned the file.

[Figure 1: Specification sketch]

Figure 1 gives an explanation of these specifications. In figure A, U1 owns real files F1 and F2, U2 owns real file F1, and N1 belongs to T1 whereas N2 belongs to T2. In figure B, U2 changes F1's name to N3, which belongs to T2, and thereby creates fake file F3. In figure C, U3 downloads F2 from U1 and U4 downloads F3 from U2. We can draw the following conclusions: T1 has real file F1, whose owners are U1 and U2; T2 has real file F2, whose owners are U1 and U3; T2 has fake file F3, whose owners are U2 and U4; content hash H1 corresponds to two names, N1 and N3; content hash H2 corresponds to one name, N2.

3.2. Detection of Fake Files

Our historical logs are collected from Maze [9], a large deployed P2P file sharing system with more than 2 million registered users and more than 10,000 users online at any given time. A log server records every downloading action, and each log entry contains the uploading peer-id, the downloading peer-id, the global time, the file's content hash, and the file's name. The logs from Oct 11, 2005 to Aug 11, 2006 are selected. Though our analysis is based on Maze, Maze is similar to many other P2P file sharing systems, so the approach and conclusions are general.

We choose five representative popular titles and use T1, T2, T3, T4, and T5 to denote them. From users' feedback in the Maze forum, we know there are some users who rename a common file with a popular name to attract downloads from other users, so we take the change of names and the first publication time of a file as hints to detect fake files.

When a user creates a fake file by renaming a file, the file's content hash comes to correspond to at least two names: one belongs to the original title and the other belongs to the popular title, and the former should appear earlier than the latter. Furthermore, all the files of a title which are published before the title's real file comes out are certain to be fake.

So two detectors are designed to detect the fake files of a title:

• Detector 1: for each file of a title, the detector collects all the names that correspond to the file's content hash; if there exists a name which belongs to another title and appears before this file's first publication time, the detector identifies this file as fake.

• Detector 2: for a title, the detector obtains the real file's first publication time from Divx [10] and identifies the files published before this time as fake.
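The two detectors can be sketched as follows. This is a simplified in-memory illustration, not the Maze implementation; the log record layout and function names are assumptions:

```python
from collections import defaultdict

def detect_fakes(logs, title_of_name, real_first_pub):
    """Sketch of Detectors 1 and 2 over an in-memory log.

    logs:           iterable of (time, content_hash, name) download records
    title_of_name:  maps a file name to its title
    real_first_pub: maps a title to its real file's first publication time
                    (e.g. taken from an external database such as [10])
    Returns {(hash, name): (flagged_by_detector1, flagged_by_detector2)}."""
    first_seen = {}              # (hash, name) -> first appearance time
    names_of = defaultdict(set)  # hash -> all names seen for this content
    for t, h, name in sorted(logs):
        first_seen.setdefault((h, name), t)
        names_of[h].add(name)

    def detector1(h, name):
        # Fake if the same content hash appeared earlier under a name
        # that belongs to a different title.
        t_this = first_seen[(h, name)]
        return any(title_of_name[n] != title_of_name[name]
                   and first_seen[(h, n)] < t_this
                   for n in names_of[h] if n != name)

    def detector2(h, name):
        # Fake if published before the title's real file came out.
        return first_seen[(h, name)] < real_first_pub[title_of_name[name]]

    return {(h, name): (detector1(h, name), detector2(h, name))
            for (h, name) in first_seen}

# The Figure 1 scenario: H1 first appears as N1 (title T1), later as N3 (title T2).
logs = [(1, "H1", "N1"), (3, "H2", "N2"), (5, "H1", "N3")]
result = detect_fakes(logs,
                      title_of_name={"N1": "T1", "N2": "T2", "N3": "T2"},
                      real_first_pub={"T1": 0, "T2": 2})
print(result[("H1", "N3")])  # (True, False): the renamed file F3 is flagged
```

Detector 2 alone would miss F3 here because N3 appears after T2's real file; combining both detectors covers the cases described above.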
When we are identifying a title, we first collect all the files that belong to the title, then use the two detectors to identify them, marking each file as real file (identified by neither detector), detector 1 only, detector 2 only, or detector 1&2 (identified by both detectors). Table 1 shows the results with the percentage of files in each category.

Table 1: Identification results

      Real file     Detector 1 only  Detector 2 only  Detector 1&2
T1    827 (51.5%)   53 (3.3%)        432 (26.9%)      294 (18.3%)
T2    409 (83.3%)   12 (2.4%)        21 (4.3%)        49 (10.0%)
T3    291 (91.5%)   8 (2.5%)         2 (0.6%)         17 (5.3%)
T4    411 (68.3%)   42 (7.0%)        49 (8.1%)        100 (16.6%)
T5    151 (74.4%)   52 (25.6%)       0 (0.0%)         0 (0.0%)

The five titles all have their own characteristics: T1 is the most popular title and its fake files are the most pervasive; T2 has low-grade fake files; T3 has only been published for a short time and has the lowest grade of fake files; T4 has middle-grade fake files; T5's first publication time is earlier than the logs' collection time, so it represents the titles with incomplete logs.

[Figure 2: The change of cumulative numbers with time]

Figure 2 shows the change of T1's real and fake files' cumulative numbers with time. The x-axis is the time, and the y-axis is the cumulative number of real and fake files that have ever appeared (the other four titles' figures are similar to this one). It can be deduced from the figure that fake files can appear both before and after a real file comes out, so the approach to filter out fake files should be effective in both conditions.

From the analysis above, we know that the two simple but efficient detectors can identify a lot of fake files, but there are still some questions: Why is it better to serve a fake file than the real one? Why not download a popular file and serve the real one? Doesn't this give the same benefits as serving a fake file? Why can't users use a completely new file as their fake files? The answer is the same for all these questions: renaming is the simplest way and it does work. All a cheater needs to do is rename a file; especially if it is renamed after a popular movie which has not yet appeared, all the users who want to download this movie will find him. In fact, we have tried some other ways to produce fake files, but they only make things more complex and gain no more benefits.

Even so, we should recognize that the two detectors are not feasible in real systems: they require considerable preparation for each title, which is the reason we only choose five titles. Even if they were deployed, users could easily cheat them. But they do collect the real and fake files from the logs for our analysis. With the analysis below, a lifetime and popularity based ranking approach will be designed to filter out fake files automatically. Experiments are also designed with the real and fake files collected by the two detectors to evaluate the effect of the approach.

3.3. Time Characteristic Analysis

Users tend to retain a real file longer and delete a fake file more quickly. From this phenomenon, we introduce a metric called lifetime, which is a file's average retention time in users' computers, to analyze the file's quality.

Definition of Lifetime and Popularity

If file F has n owners, and each owner's retention time of F is t_i (1 <= i <= n), then we define L_F = (t_1 + t_2 + ... + t_n) / n as file F's lifetime and n as file F's popularity.

In order to calculate a file's lifetime, we check file and user appearance times in the logs. For each file, its owners are gathered first. Then each owner's first and last appearance times with this file are collected, and the difference between the two is treated as this owner's retention time of the file. For example, if U1 downloads F1 at 10:30 and uploads F1 at 14:00 and 16:30, we calculate U1's retention time of F1 as (16.5 − 10.5) × 60 × 60 = 21600 s. Some owners close their client software after downloading a file or move the file out of the sharing directory, so the retention time calculated by this method will be smaller than the real retention time; but the bias is similar for all files, and since we mainly compare the lifetimes of real and fake files, the metric is relatively reliable.

[Figure 3: The change of lifetimes with popularity]

Differentiation and Convergence of Lifetime
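The lifetime and popularity computation described in Section 3.3 can be sketched as follows; this is a minimal version, and the appearance-record layout is an assumption:

```python
from collections import defaultdict

def lifetime_and_popularity(appearances):
    """appearances: iterable of (user, file, time) records, one per log
    line in which the user appears with the file (download or upload).
    An owner's retention time is (last appearance - first appearance);
    a file's lifetime is the average retention time over its owners,
    and its popularity is the number of owners."""
    span = {}  # (user, file) -> [first_time, last_time]
    for user, f, t in appearances:
        if (user, f) in span:
            span[(user, f)][0] = min(span[(user, f)][0], t)
            span[(user, f)][1] = max(span[(user, f)][1], t)
        else:
            span[(user, f)] = [t, t]

    retention = defaultdict(list)  # file -> list of retention times
    for (user, f), (first, last) in span.items():
        retention[f].append(last - first)

    return {f: (sum(ts) / len(ts), len(ts)) for f, ts in retention.items()}

# U1 downloads F1 at 10:30 and uploads it at 14:00 and 16:30, as in the
# example above: retention time (16.5 - 10.5) * 60 * 60 = 21600 s.
recs = [("U1", "F1", 10.5 * 3600), ("U1", "F1", 14.0 * 3600), ("U1", "F1", 16.5 * 3600)]
lifetime, popularity = lifetime_and_popularity(recs)["F1"]
print(lifetime, popularity)  # 21600.0 1
```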
Figure 3 shows five real and five fake files, belonging to the five titles respectively, and the change of their lifetimes as their popularities increase. We can draw the following conclusions: when the popularity is small, the lifetime fluctuates greatly; when the popularity reaches 100, the lifetime of a fake file converges below 100,000 s, so we can distinguish fake files from real files at that point. Because the lifetime of a file with small popularity does not converge, we only analyze files whose popularities are larger than 20 below.

Distribution of Lifetime

[Figure 4: CDF of real and fake files' lifetimes]

Figure 4 shows the CDF of T1's real and fake files' lifetimes. The x-axis is the lifetime, and the y-axis is the percentage of real and fake files whose lifetimes are within x (the other four titles' figures are similar to this one). We can conclude from the figure that real files' lifetimes are holistically higher than fake files', which means that, compared to fake files, there are more real files with high lifetimes.

4. Lifetime and Popularity Based Ranking Approach

We will describe this approach with some implementation details in DHT.

Lifetime and Popularity Based Ranking Approach

From the analysis above, we know that lifetime can be used to distinguish between real and fake files. But when the popularity of a file has not reached a certain threshold, its lifetime does not converge, so we should combine popularity with lifetime. In figure 5, files are divided into four zones by their lifetimes and popularities. The files in zone 1, whose popularities are large and lifetimes are high, are more likely to be real files; the files in zone 4, whose popularities are large but lifetimes are low, are more likely to be fake files; the files in zones 2 and 3 should be judged later.

[Figure 5: Classification of files]

So we design a lifetime and popularity based ranking approach (LIP): it divides files into four zones with lifetime and popularity thresholds, filters out the files in zone 4, and recommends that the user select the file with the highest lifetime in zones 1, 2 and 3. The settings of the lifetime and popularity thresholds will be discussed in the next section.

Information Collection

This approach only needs the retention times of files. In DHT, the publication node periodically publishes its sharing information, such as a file's name and content hash, to an index node. When the index node first receives a node's sharing information, it records not only the sharing information but also the publication time. When the index node receives the same sharing information from the same node again, it updates this user's retention time for the file. For example, if U1 publishes the information of F1 to U2 at 13:00 and again at 15:00, U2 will calculate U1's retention time of F1 as (15 − 13) × 60 × 60 = 7200 s.

Information Processing

When calculating a file's lifetime, an index node simply sums up all the owners' retention times and divides by the file's popularity. In DHT, each index node records all the retention times within its responsible area, so it can calculate the popularity and lifetime by itself.

Information Feedback

The index node filters out fake files with the lifetime and popularity thresholds, and sends files' lifetimes along with search results to users. In DHT, each index node can do this by itself.

5. Experiment and Results

In this section, we first describe the experiment, then use a training set to choose the lifetime and popularity thresholds, and finally evaluate the effect of LIP.

In the experiment, we simulate a sequence of downloading actions of a title. For each downloading action, we first find all the files which still have replicas and calculate their numbers of replicas, popularities and lifetimes. Then we use an approach to choose a file from them and calculate a retention time for this downloaded file.
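A single downloading action of this simulation, including the zone filtering of Section 4, can be sketched as follows. The bookkeeping of replicas as (file, start, end) intervals is an illustrative assumption, and the default thresholds anticipate the values chosen in Section 5.1:

```python
import random

def zone(lifetime, popularity, pop_t=80, life_t=80_000):
    """LIP's four zones (Section 5.1 thresholds as defaults)."""
    if popularity >= pop_t:
        return 1 if lifetime >= life_t else 4
    return 2 if lifetime >= life_t else 3

def simulation_step(now, intervals, retention_sets, approach="LIP"):
    """One downloading action.  intervals: (file, start, end) replica
    records, with end=None while the replica is still shared.  Returns
    the chosen file and a retention time drawn from its retention set."""
    stats = {}  # file -> [replicas, popularity, total_retention]
    for f, start, end in intervals:
        alive = end is None or end > now
        s = stats.setdefault(f, [0, 0, 0.0])
        s[0] += 1 if alive else 0
        s[1] += 1
        # A finished owner contributes (end - start); a live one, so far.
        s[2] += (now - start) if alive else (end - start)

    candidates = {f: s for f, s in stats.items() if s[0] > 0}
    lifetimes = {f: s[2] / s[1] for f, s in candidates.items()}

    if approach == "RP":
        chosen = max(candidates, key=lambda f: candidates[f][0])
    elif approach == "SR":
        chosen = random.choice(list(candidates))
    else:  # LP, or LIP after filtering out zone-4 files
        pool = lifetimes if approach == "LP" else \
            {f: l for f, l in lifetimes.items()
             if zone(l, candidates[f][1]) != 4}
        chosen = max(pool, key=pool.get)
    return chosen, random.choice(retention_sets[chosen])

# The Figure 7 example at time 10: F2 has no replica left, F1's lifetime
# is ((10-1)+(10-8))/2 = 5.5 and F3's is ((10-3)+(9-4))/2 = 6.
intervals = [("F1", 1, None), ("F1", 8, None),
             ("F2", 2, 7), ("F3", 3, None), ("F3", 4, 9)]
rsets = {"F1": [9, 2], "F3": [7, 5]}
print(simulation_step(10, intervals, rsets, approach="LP")[0])  # F3
print(simulation_step(10, intervals, rsets, approach="RP")[0])  # F1
```

With the toy time scale of the sketch, both files fall into zones 2/3, so LIP here behaves like LP; in the real simulation the times are in seconds.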

We can collect the following contents from the logs:

• File set for a title: all the files of a title compose the file set for this title. With the two detectors in Section 3, we also label each file as fake or real.

• Retention time set for a file: all the owners' retention times of a file compose the retention time set for this file.

• Downloading action set for a title: all the downloading actions of a title compose the downloading action set for this title.

[Figure 6: Log collection sketch]

Figure 6 shows an explanation of how to collect these contents from the logs of a title. The x-axis is the time and each unit means one second. (This is only a sketch, so the time begins from 1 and the range is very small; in a real situation, it is the global time of the log server and the range is very large.) Each rectangular interval is a user's retention time; the user, the file, the first appearance time, the last appearance time and the retention time are written in each interval. We can see from the figure that F1 has two owners, U1 and U5, whose retention times of F1 are 12 and 7 respectively; these two retention times compose the retention time set for F1. F2 has four owners, U2, U3, U4 and U6, whose retention times of F2 are 4, 8, 9, and 6 respectively; these four retention times compose the retention time set for F2. The six users' downloading actions are (U1, 1), (U2, 2), (U3, 4), (U4, 6), (U5, 7), (U6, 9), and they compose the downloading action set for this title.

We should also simulate the following contents:

• File's retention time: in the simulation, when a user downloads a file, we randomly select a retention time from the retention time set for this file as this user's retention time of the file.

• Five comparative approaches:

  1. Replica priority (RP): select the file with the maximum number of replicas.

  2. Lifetime priority (LP): select the file with the highest lifetime.

  3. Select randomly (SR).

  4. Lifetime and popularity based ranking approach (LIP): select the file with the highest lifetime after filtering.

  5. Actual state (AS): the actual state of the logs.

[Figure 7: Simulation sketch]

Figure 7 shows a concrete example of the simulation with a title. It has simulated U1 to U5's downloading actions at times 1, 2, 3, 4, 8 respectively and is going to simulate U6's downloading action at time 10. It can be seen from the figure that F2's replica has disappeared by time 7 and only F1 and F3 still have replicas. Their numbers of replicas are 2 and 1, their popularities are 2 and 2, and their lifetimes are ((10 − 1) + (10 − 8))/2 = 5.5 and ((10 − 3) + (9 − 4))/2 = 6 respectively. If we use RP, we choose F1 and randomly select a retention time from the retention time set for F1 as U6's retention time of F1; if we use LP, we choose F3 and randomly select a retention time from the retention time set for F3.

5.1. Training Set

LIP needs two thresholds: a lifetime threshold and a popularity threshold. Different thresholds have different effects. If the popularity threshold is too high, a fake file can only be identified after a lot of downloads. If the popularity threshold is too low, the lifetime of a file may not have converged, so we may identify a real file as fake. If the lifetime threshold is too high, we may also identify a real file as fake. If the lifetime threshold is too low, we may identify a fake file as real.

Our approach should reduce the download volume of fake files both before and after a real file comes out, and we can see from figure 2 that T1 has a lot of fake files in both periods, so we use T1 as the training set to choose the lifetime and popularity thresholds. From figure 3, we know a fake file's lifetime converges below 100,000 s when its popularity reaches 100, so we choose popularity and lifetime thresholds near 100 and 100,000 respectively. We simulate with different threshold settings, and from the perspectives of reducing the download volume of fake files while guaranteeing the download volume of real files, we finally choose a popularity threshold of 80 and a lifetime threshold of 80,000. It is interesting that 80,000 seconds is about one day, which means most real files' lifetimes are longer than one day and most fake files' are shorter than one day.

5.2. Testing Set

We use T2, T3, T4 and T5 as testing sets to evaluate the effect of LIP.
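The threshold selection of Section 5.1 amounts to a small grid search around the convergence point observed in figure 3. A sketch follows, where `evaluate` stands in for replaying the training title's downloading actions under LIP; the toy evaluation function is purely illustrative, not the paper's simulator:

```python
from itertools import product

def choose_thresholds(evaluate, pop_grid, life_grid):
    """Pick the (popularity, lifetime) thresholds that minimize the
    download volume of fake files, breaking ties in favour of a larger
    download volume of real files."""
    best_key, best = None, None
    for pop_t, life_t in product(pop_grid, life_grid):
        fake, real = evaluate(pop_t, life_t)
        key = (fake, -real)  # fewer fake downloads first, then more real
        if best is None or key < best:
            best, best_key = key, (pop_t, life_t)
    return best_key

# Toy stand-in with a minimum at (80, 80000), for illustration only:
def toy_eval(pop_t, life_t):
    fake = abs(pop_t - 80) + abs(life_t - 80_000) / 1_000
    return fake, 1_000 - fake

print(choose_thresholds(toy_eval, [60, 80, 100], [60_000, 80_000, 100_000]))
# (80, 80000)
```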
Figure 8 gives the comparison of the download volume of fake files between AS and LIP. We can conclude that LIP significantly reduces the download volume of fake files most of the time.

[Figure 8: Comparison of the download volume of fake files]

When we use the other titles in our experiments, there are also some instances in which the download volume of real files decreases or the download volume of fake files increases. Three factors may influence the effect of LIP:

1. The quality of fake files. If a fake file's quality is better than the real file's, which means the original movie is more attractive, users may choose to retain the fake file longer than the real file.

2. The percentage of low-quality files among real files, because LIP may filter out some real files of low quality.

3. The real file's first publication period. Before a real file's lifetime converges, LIP cannot confirm whether it is real, so it may filter out some doubtful real files.

5.3. Comparison of Different Approaches

We continue to use T1 to compare the five different approaches and analyze their effect on the download volume of fake files both before and after a real file comes out. From the downloading action set for T1, we know T1's real file first comes out at the 5837th downloading action, which means that during the previous 5836 downloading actions there are only fake files.

[Figure 9: Comparison of five different approaches]

Figure 9 shows the download volume of fake files with the five different approaches. The x-axis is the number of downloading actions that have happened, and the y-axis is the download volume of fake files. It can be concluded from the figure that RP is the worst, and so is SR; LP can significantly reduce the download volume of fake files after a real file comes out, but only LIP can reduce it both before and after a real file comes out.

6. Conclusion and Future Work

Some users create fake files to gain unfair advantages from incentive policies, and this can make fake files pervasive. We design two detectors to collect real and fake files from logs for our analysis and experiments. We show that lifetime can be used to distinguish between real and fake files, and we propose a lifetime and popularity based ranking approach to rank the reputation of files and filter out fake files in P2P file sharing systems. The experimental results show that this approach can reduce the download volume of fake files both before and after a real file comes out.

There are many aspects in which to improve LIP, such as security considerations, the trust of feedback, and how to choose the lifetime and popularity thresholds. Moreover, LIP is not only a binary value to identify real and fake files; it can also reflect a file's quality. We will realize this and see how it works in a real situation.

References

[1] Q. Lian, Y. Peng, M. Yang, Z. Zhang, Y. Dai, X. Li, Robust Incentives via Multilevel Tit-for-tat, In Proc. of IPTPS, Santa Barbara, CA, 2005.
[2] S. Kamvar, M. Schlosser, and H. Garcia-Molina, The EigenTrust Algorithm for Reputation Management in P2P Networks, ACM WWW'03, pages 640-651, 2003.
[3] K. Walsh and E. G. Sirer, Experience with an Object Reputation System for Peer-to-Peer Filesharing, USENIX NSDI, 2006.
[4] J. Liang, R. Kumar, Y. Xi, and K. Ross, Pollution in P2P File Sharing Systems, Proc. IEEE INFOCOM'05, Miami, FL, 2005.
[5] U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, M. Gerla, Understanding Pollution Dynamics in P2P File Sharing, In Proc. of IPTPS, Santa Barbara, CA, 2006.
[6] D. Dumitriu, E. Knightly, A. Kuzmanovic, I. Stoica, and W. Zwaenepoel, Denial-of-Service Resilience in Peer-to-Peer File Sharing Systems, Proc. ACM SIGMETRICS'05, Banff, AB, Canada, 2005.
[7] N. Christin, A. S. Weigend, and J. Chuang, Content Availability, Pollution and Poisoning in Peer-to-Peer File Sharing Networks, In Proc. of the Sixth ACM Conference on Electronic Commerce (EC'05), Vancouver, BC, Canada, 2005.
[8] J. Liang, N. Naoumov, and K. W. Ross, The Index Poisoning Attack in P2P File-Sharing Systems, IEEE INFOCOM 2006, Barcelona, Spain, 2006.
[9] M. Yang, B. Zhao, Y. Dai and Z. Zhang, Deployment of a Large Scale Peer-to-Peer Social Network, In Proc. of the 1st Workshop on Real, Large Distributed Systems, San Francisco, CA, USA, 2004.
[10] Divx Database, http://divx.thu.cn, 2006.
