Lip: A Lifetime and Popularity Based Ranking Approach To Filter Out Fake Files in P2P File Sharing Systems
impact on the effect of file-targeted attacks; our work improves these results through a quantitative analysis of users' behaviors with lifetime. N. Christin et al. [7] analyzed the differences between pollution and poisoning and their respective impact on content availability in P2P file sharing networks; they define the deliberate injection of decoys to reduce the availability of targeted files as poisoning, and the accidental injection of unusable files as pollution. Our work finds that a user's deliberate behavior can also come from cheating incentive policies. J. Liang et al. [8] analyzed index poisoning, and our work complements theirs.
In the pollution identification aspect, D. Dumitriu et al. [6] discussed a random algorithm; J. Liang et al. [8] recommended a distributed blacklist; S. Kamvar et al. [2] proposed the EigenTrust algorithm; K. Walsh et al. [3] proposed an object reputation system; and J. Liang et al. [4] proposed an approach that identifies pollution automatically based on whether a file is decodable and the warp of its duration. But some of these approaches require users to participate actively; some cannot identify fake files before a real file comes out; some need to download all or part of a file; some need a lot of system resources; and some can only identify certain types of files. Our approach calculates the retention times of files and filters out fake files automatically, so it bypasses these limitations.
3. Analysis of Fake Files

3.1. Term Specification

• Title: the common description of a particular content, such as "Movie: The Matrix" or "Music: Yesterday Once More". We use T (Title) to express it.

• File: the sharing object in a P2P file sharing system. We use F (File) to express it. A file is composed of two parts:

  – Name: the name of a file; it should belong to a title. We use N (Name) to express it.

  – Content hash: the hash value generated by digesting the content of a file with a hash function; it is independent of the name. We use H (Hash) to express it.

• Fake file: a file whose name does not match its content hash.

• File owner: a user who has ever owned the file.
Figure 1 gives an explanation of these specifications. In figure A, U1 owns real files F1 and F2, U2 owns real file F1, and N1 belongs to T1 whereas N2 belongs to T2. In figure B, U2 changes F1's name to N3, which belongs to T2, and thereby creates fake file F3. In figure C, U3 downloads F2 from U1 and U4 downloads F3 from U2. We can draw the following conclusions: T1 has real file F1, whose owners are U1 and U2; T2 has real file F2, whose owners are U1 and U3; T2 has fake file F3, whose owners are U2 and U4; content hash H1 corresponds to two names, N1 and N3; content hash H2 corresponds to one name, N2.

Figure 1: Specification sketch
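These terms map onto a simple data model. Below is a minimal sketch in Python, with hypothetical helper names of our own, of the fake-file predicate: a file is fake exactly when the title its name claims differs from the title its content actually belongs to.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class File:
        name: str           # N: the file's name, which belongs to some title
        content_hash: str   # H: digest of the content, independent of the name

    def is_fake(f: File, title_of_name, title_of_content) -> bool:
        """Fake file: the title claimed by the name does not match the
        title the content behind the hash really belongs to. Both
        mappings are assumed to be given."""
        return title_of_name(f.name) != title_of_content(f.content_hash)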
3.2. Detection of Fake Files

Our historical logs are collected from Maze [9], a large deployed P2P file sharing system with more than 2 million registered users and more than 10,000 users online at any given time. A log server records every downloading action, and each log entry contains the uploading peer-id, the downloading peer-id, the global time, the file's content hash, and the file's name. The logs from Oct 11, 2005 to Aug 11, 2006 are selected. Though our analysis is based on Maze, Maze is similar to many other P2P file sharing systems, so the approach and conclusions are generally applicable.

We choose five representative popular titles and use T1, T2, T3, T4, and T5 to denote them.

From users' feedback in the Maze forum, we know that some users rename a common file with a popular name to attract downloads from other users, so we get the hint to detect fake files from the change of names and the first publication time of a file.

When a user creates a fake file by renaming a file, the file's content hash comes to correspond to at least two names. One belongs to the original title and the other belongs to the popular title, and the former should appear earlier than the latter. Furthermore, all the files of a title that are published before the title's real file comes out are sure to be fake files.

So, two detectors are designed to detect the fake files of a title; a minimal sketch of both follows this list.

• Detector 1: for each file of a title, the detector collects all the names that correspond to the file's content hash; if there exists a name that belongs to another title and appears before this file's first publication time, the detector identifies this file as fake.

• Detector 2: for a title, the detector obtains the real file's first publication time from Divx [10] and identifies the files published before this time as fake.
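Both detectors can be expressed directly over per-file records derived from the logs. The following minimal sketch assumes a hypothetical record layout: each file carries the title its name belongs to, its first publication time, and the (title, first-seen time) pairs of all names observed for its content hash.

    from dataclasses import dataclass, field

    @dataclass
    class FileRecord:
        title: str        # the title this file's name belongs to
        first_pub: float  # this file's first publication time (epoch seconds)
        names: list = field(default_factory=list)  # (title, first_seen) pairs
                                                   # for this content hash

    def detector1(f: FileRecord) -> bool:
        """Fake if some name of the same content hash belongs to another
        title and appeared before this file's first publication time."""
        return any(t != f.title and seen < f.first_pub for t, seen in f.names)

    def detector2(f: FileRecord, real_first_pub: float) -> bool:
        """Fake if published before the title's real file first came out;
        that time is obtained from an external source such as Divx."""
        return f.first_pub < real_first_pub

    def label(f: FileRecord, real_first_pub: float) -> str:
        """Categorizes a file the way Table 1 categorizes it."""
        d1, d2 = detector1(f), detector2(f, real_first_pub)
        if d1 and d2:
            return "detector 1&2"
        return "detector 1 only" if d1 else ("detector 2 only" if d2 else "real file")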
When identifying a title, we first collect all the files that belong to the title, then use the two detectors to identify them and mark each file as real file (not identified by either detector), detector 1 only (identified only by detector 1), detector 2 only (identified only by detector 2), or detector 1&2 (identified by both detectors). Table 1 shows the results with the percentage of files belonging to each category. The five titles each have their own characteristics: T1 is the most popular title and its fake files are the most pervasive; T2 has low-grade fake files; T3 has only been published for a short time and has the lowest grade of fake files; T4 has middle-grade fake files; T5's first publication time is earlier than the start of log collection, so it represents the titles with incomplete logs.

Table 1: Identification results

      Real file     Detector 1 only  Detector 2 only  Detector 1&2
T1    827 (51.5%)   53 (3.3%)        432 (26.9%)      294 (18.3%)
T2    409 (83.3%)   12 (2.4%)        21 (4.3%)        49 (10.0%)
T3    291 (91.5%)   8 (2.5%)         2 (0.6%)         17 (5.3%)
T4    411 (68.3%)   42 (7.0%)        49 (8.1%)        100 (16.6%)
T5    151 (74.4%)   52 (25.6%)       0 (0.0%)         0 (0.0%)
[Figure 2: cumulative number of real and fake files (y-axis, Number of Files) versus time in days (x-axis).]

Figure 2: The change of cumulative numbers with time
Figure 2 shows the change of T1's real and fake files' cumulative numbers with time. The x-axis is the time, and the y-axis is the cumulative number of real and fake files that have ever appeared (the figures for the other four titles are similar). It can be deduced from the figure that fake files can appear both before and after a real file comes out, so an approach that filters out fake files should be effective in both conditions.

From the analysis above, we know that the two simple but efficient detectors can identify a lot of fake files, but some questions remain: Why is it better to serve a fake file than the real one? Why not download a popular file and serve the real one? Doesn't this give the same benefits as serving a fake file? Why can't users use a completely new file as their fake file? The answer to all of these questions is the same: because renaming is the simplest way and it works. All you have to do is rename a file, and especially if you rename it after a popular movie that has not yet appeared, all the users who want to download this movie will find you. In fact, we have tried some other ways to produce fake files, but they only make things more complex and gain no more benefits.

Even so, we should recognize that the two detectors are not feasible in real systems: a lot of preparation is needed to analyze a title, which is the reason we only choose five titles. Even if they were deployed, users could easily cheat the two detectors. But they do collect the real and fake files from the logs for our analysis. Based on the analysis below, a lifetime and popularity based ranking approach will be designed to filter out fake files automatically. Experiments are also designed with the real and fake files collected by the two detectors to evaluate the effect of the approach.
3.3. Time Characteristic Analysis

Users tend to retain a real file longer and delete a fake file more quickly. From this phenomenon, we introduce a metric called lifetime, a file's average retention time in users' computers, to analyze the file's quality.

Definition of Lifetime and Popularity
If file F has n owners, and each owner's retention time of F is t_i (1 <= i <= n), then we define L_F = (Σ_{i=1}^{n} t_i) / n as file F's lifetime and n as file F's popularity.

In order to calculate a file's lifetime, we check the file's and users' appearance times in the logs. For each file, its owners are gathered first. Then each owner's first and last appearance times with this file are collected, and the difference between the two times is treated as this owner's retention time of this file. For example, if U1 downloads F1 at 10:30 and uploads F1 at 14:00 and 16:30, we calculate U1's retention time of F1 as (16.5 − 10.5) × 60 × 60 = 21600s. Some owners will close their client software after downloading a file or move the file out of the sharing directory, so the retention time calculated by this method will be smaller than the real retention time; but the warps are similar for all files, and we are mainly comparing the lifetimes of real and fake files, so the method is relatively reliable.
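As a minimal one-pass sketch of this calculation, assume each log entry is a tuple (time in seconds, uploading peer-id, downloading peer-id, content hash) as recorded by the log server; the tuple layout is our assumption.

    from collections import defaultdict

    def lifetime_and_popularity(logs):
        """logs: iterable of (time_s, uploader_id, downloader_id, hash).
        Returns {hash: (lifetime_s, popularity)} per the definition above."""
        first, last = {}, {}
        for t, up, down, h in logs:
            # both the uploader and the downloader are owners of the file
            for owner in (up, down):
                key = (owner, h)
                first[key] = min(t, first.get(key, t))
                last[key] = max(t, last.get(key, t))
        retention = defaultdict(list)
        for (owner, h), t0 in first.items():
            retention[h].append(last[(owner, h)] - t0)  # owner's retention time
        return {h: (sum(ts) / len(ts), len(ts)) for h, ts in retention.items()}

For the example above, U1's appearances with F1 at 10:30, 14:00, and 16:30 give a first appearance of 10:30 and a last appearance of 16:30, hence a retention time of 21600s.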
[Figure 3: lifetime (y-axis, in units of 10,000s) versus popularity (x-axis) for real and fake files.]

Figure 3: The change of lifetimes with popularity

Differentiation and Astringency of Lifetime
Figure 3 shows five real and five fake files, one pair from each of the five titles, and the change of their lifetimes as their popularities increase. We can draw the following conclusions. When the popularity is small, the lifetime fluctuates greatly. When the popularity reaches 100, the lifetime of a fake file converges below 100,000s, so from that point we can distinguish fake files from real files. Because the lifetime of a file with small popularity does not converge, below we only analyze the files whose popularities are larger than 20.
[Figure 4: percentage of files (y-axis) whose lifetimes are within x, versus lifetime (x-axis, in units of 10,000s), for real and fake files.]

Figure 4: CDF of real and fake files' lifetimes

Distribution of Lifetime
Figure 4 shows the CDF of T1's real and fake files' lifetimes. The x-axis is the lifetime, and the y-axis is the percentage of real and fake files whose lifetimes are within x (the figures for the other four titles are similar). We can conclude from the figure that real files' lifetimes are holistically higher than fake files'; that is, compared to fake files, there are more real files with high lifetimes.

4. Lifetime and Popularity Based Ranking Approach

We will describe this approach with some implementation details in DHT.

[Figure 5: lifetime (y-axis, in units of 10,000s) versus popularity (x-axis), with files divided into zones 1-4.]

Figure 5: Classification of files

Lifetime and Popularity Based Ranking Approach
From the analysis above, we know lifetime can be used to distinguish between real and fake files. But when the popularity of a file has not reached a certain threshold, its lifetime does not converge, so we should combine popularity with lifetime. In figure 5, files are divided into four zones by their lifetimes and popularities. The files in zone 1, whose popularities are large and lifetimes are high, are more likely to be real files; the files in zone 4, whose popularities are large but lifetimes are low, are more likely to be fake files; the files in zones 2 and 3 should be judged later.

So we design a lifetime and popularity based ranking approach (LIP): it divides files into four zones with lifetime and popularity thresholds, filters out the files in zone 4, and recommends that the user select the file with the highest lifetime in zones 1, 2, and 3. The settings of the lifetime and popularity thresholds will be discussed in the next section. A minimal sketch of this ranking step appears below.
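The sketch assumes the lifetime threshold l_th and popularity threshold p_th are already chosen; the function names are our own.

    def zone(lifetime, popularity, l_th, p_th):
        """Zones of figure 5: 1 = large popularity, high lifetime (likely
        real); 4 = large popularity, low lifetime (likely fake); 2 and 3 =
        small popularity, to be judged later."""
        if popularity >= p_th:
            return 1 if lifetime >= l_th else 4
        return 2 if lifetime >= l_th else 3

    def lip_rank(files, l_th, p_th):
        """files: list of (file_id, lifetime, popularity). Filters out the
        zone-4 files and ranks the rest by lifetime, highest first, so the
        recommended file is the head of the returned list."""
        kept = [f for f in files if zone(f[1], f[2], l_th, p_th) != 4]
        return sorted(kept, key=lambda f: f[1], reverse=True)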
Information Collection
This approach only needs the retention times of files. In DHT, the publication node periodically publishes its sharing information, such as a file's name and content hash, to an index node. When the index node first receives a node's sharing information, it records not only the sharing information but also the publication time. When the index node receives the same sharing information from the same node again, it updates that user's retention time for the file. For example, if U1 publishes the information of F1 to U2 at 13:00 and again at 15:00, U2 will calculate U1's retention time of F1 as (15 − 13) × 60 × 60 = 7200s.
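On the index node side, the record keeping can be sketched as follows; the class and method names are our own assumptions, not an actual DHT API.

    class IndexNode:
        """Tracks, per (publisher, content hash), when the sharing
        information was first received and how long it has kept arriving."""

        def __init__(self):
            self.first_seen = {}  # (peer_id, content_hash) -> publication time
            self.retention = {}   # (peer_id, content_hash) -> retention time (s)

        def on_publish(self, peer_id, content_hash, now_s):
            key = (peer_id, content_hash)
            if key not in self.first_seen:
                # first time: record the sharing information's publication time
                self.first_seen[key] = now_s
                self.retention[key] = 0.0
            else:
                # same sharing information from the same node: extend retention
                self.retention[key] = now_s - self.first_seen[key]

For the example above, calling on_publish for U1 and F1 at 13:00 and again at 15:00 leaves U1's retention time of F1 at 7200s.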
Information Processing
When calculating a file's lifetime, the index node simply sums up all the owners' retention times and divides by the file's popularity. In DHT, each index node records all the retention times within its responsible area, so it can calculate the popularity and lifetime by itself.

Information Feedback
The index node filters out fake files with the lifetime and popularity thresholds, and sends the files' lifetimes along with the search results to users. In DHT, each index node can do this by itself.
real file comes out, but only LIP can reduce the download
2 volume of fake files both before and after a real file comes
out.
0
T2 T3 T4 T5 6. Conclusion and Future Work
Title
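The simulation loop can be sketched as follows, under our own interface assumptions: choose_file stands for one of the selection approaches compared below (e.g. LIP), and the retention time of each simulated download is drawn from the file's retention time set collected from the logs.

    import random

    def simulate(download_times, is_fake, retention_sets, choose_file):
        """Replays a title's downloading action set. is_fake: {file_id:
        bool} from the labeled file set; retention_sets: {file_id:
        [retention times]}. Returns the download volume of fake files."""
        history = {fid: [] for fid in is_fake}  # fid -> [(start_t, retention_t)]
        fake_downloads = 0
        for t in download_times:
            stats = {}  # files that still have replicas, with their statistics
            for fid, reps in history.items():
                replicas = sum(1 for s, r in reps if s + r > t)
                if replicas:
                    popularity = len(reps)
                    lifetime = sum(r for _, r in reps) / popularity
                    stats[fid] = (replicas, popularity, lifetime)
            if not stats:  # bootstrap: before any replica exists
                stats = {fid: (0, 0, 0.0) for fid in is_fake}
            fid = choose_file(stats)  # the selection approach under test
            fake_downloads += is_fake[fid]
            history[fid].append((t, random.choice(retention_sets[fid])))
        return fake_downloads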
[Figure 9: download volume of fake files (in thousands) versus number of downloading actions (in thousands), with curves for LIP, SR, and AS.]

Figure 9: Comparison of five different approaches

In Figure 9, the x-axis is the number of downloading actions that have happened, and the y-axis is the download volume of fake files. It can be concluded from the figure that RP is the worst, as is SR; LP can significantly reduce the download volume of fake files after a real file comes out, but only LIP can reduce the download volume of fake files both before and after a real file comes out.

[Further figure, caption not preserved in this copy: download volume of fake files (in thousands) for titles T2-T5 under LIP and AS.]

6. Conclusion and Future Work

References

... 2005.
[7] N. Christin, A. S. Weigend, and J. Chuang, Content Availability, Pollution and Poisoning in Peer-to-Peer File Sharing Networks, in Proc. of the Sixth ACM Conference on Electronic Commerce (EC'05), Vancouver, BC, Canada, 2005.
[8] J. Liang, N. Naoumov, and K. W. Ross, The Index Poisoning Attack in P2P File-Sharing Systems, in Proc. of IEEE Infocom 2006, Barcelona, Spain, 2006.
[9] M. Yang, B. Zhao, Y. Dai, and Z. Zhang, Deployment of a large scale peer-to-peer social network, in Proc. of the 1st Workshop on Real, Large Distributed Systems, San Francisco, CA, USA, 2004.
[10] Divx Database, http://divx.thu.cn, 2006.