

A Survey on Transfer Learning


Sinno Jialin Pan and Qiang Yang, Fellow, IEEE

Abstract—A major assumption in many machine learning and data mining algorithms is that the training and future data must be
in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold.
For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another
domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases,
knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding expensive data-labeling
efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on
categorizing and reviewing the current progress on transfer learning for classification, regression and clustering problems. In this survey,
we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multi-
task learning and sample selection bias, as well as co-variate shift. We also explore some potential future issues in transfer learning
research.

Index Terms—Transfer Learning, Survey, Machine Learning, Data Mining.

Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong. Emails: {sinnopan, qyang}@cse.ust.hk

1 INTRODUCTION

Data mining and machine learning technologies have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering (e.g., [1], [2]). However, many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data. In many real-world applications, it is expensive or impossible to re-collect the needed training data and rebuild the models. It would be nice to reduce the need and effort to re-collect the training data. In such cases, knowledge transfer or transfer learning between task domains would be desirable.

Many examples in knowledge engineering can be found where transfer learning can truly be beneficial. One example is Web document classification [3], [4], [5], where our goal is to classify a given Web document into several predefined categories. As an example in the area of Web-document classification (see, e.g., [6]), the labeled examples may be the university Web pages that are associated with category information obtained through previous manual-labeling efforts. For a classification task on a newly created Web site where the data features or data distributions may be different, there may be a lack of labeled training data. As a result, we may not be able to directly apply the Web-page classifiers learned on the university Web site to the new Web site. In such cases, it would be helpful if we could transfer the classification knowledge into the new domain.

The need for transfer learning may arise when the data can be easily outdated. In this case, the labeled data obtained in one time period may not follow the same distribution in a later time period. For example, in indoor WiFi localization problems, which aim to detect a user's current location based on previously collected WiFi data, it is very expensive to calibrate WiFi data for building localization models in a large-scale environment, because a user needs to label a large collection of WiFi signal data at each location. However, the WiFi signal-strength values may be a function of time, device or other dynamic factors. A model trained in one time period or on one device may therefore perform poorly for location estimation in another time period or on another device. To reduce the re-calibration effort, we might wish to adapt the localization model trained in one time period (the source domain) for a new time period (the target domain), or to adapt the localization model trained on a mobile device (the source domain) for a new mobile device (the target domain), as done in [7].

As a third example, consider the problem of sentiment classification, where our task is to automatically classify the reviews on a product, such as a brand of camera, into positive and negative views. For this classification task, we need to first collect many reviews of the product and annotate them. We would then train a classifier on the reviews with their corresponding labels. Since the distribution of review data among different types of products can be very different, to maintain good classification performance we need to collect a large amount of labeled data in order to train the review-classification models for each product. However, this data-labeling process can be very expensive. To reduce the effort of annotating reviews for various products, we may want to adapt a classification model that is trained on some products to help learn classification models for some other products. In such cases, transfer learning can save a significant amount of labeling effort [8].

In this survey article, we give a comprehensive overview of transfer learning for classification, regression and clustering developed in the machine learning and data mining areas.

There has been a large amount of work on transfer learning for reinforcement learning in the machine learning literature (e.g., [9], [10]). However, in this paper, we only focus on transfer learning for classification, regression and clustering problems, which are more closely related to data mining tasks. By doing the survey, we hope to provide a useful resource for the data mining and machine learning community.

The rest of the survey is organized as follows. In the next four sections, we first give a general overview and define some notations we will use later. We then briefly survey the history of transfer learning, give a unified definition of transfer learning and categorize transfer learning into three different settings (given in Table 2 and Figure 2). For each setting, we review the different approaches, given in Table 3, in detail. After that, in Section 6, we review some current research on the topic of "negative transfer", which happens when knowledge transfer has a negative impact on target learning. In Section 7, we introduce some successful applications of transfer learning and list some published data sets and software toolkits for transfer learning research. Finally, we conclude the article with a discussion of future work in Section 8.

2 OVERVIEW

2.1 A Brief History of Transfer Learning

Traditional data mining and machine learning algorithms make predictions on the future data using statistical models that are trained on previously collected labeled or unlabeled training data [11], [12], [13]. Semi-supervised classification [14], [15], [16], [17] addresses the problem that the labeled data may be too few to build a good classifier, by making use of a large amount of unlabeled data and a small amount of labeled data. Variations of supervised and semi-supervised learning for imperfect datasets have been studied; for example, Zhu and Wu [18] have studied how to deal with the noisy class-label problem, and Yang et al. [19] considered cost-sensitive learning when additional tests can be made on future samples. Nevertheless, most of these methods assume that the distributions of the labeled and unlabeled data are the same. Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different. In the real world, we observe many examples of transfer learning. For example, we may find that learning to recognize apples might help to recognize pears. Similarly, learning to play the electronic organ may help facilitate learning the piano. The study of transfer learning is motivated by the fact that people can intelligently apply knowledge learned previously to solve new problems faster or with better solutions. The fundamental motivation for transfer learning in the field of machine learning was discussed in the NIPS-95 workshop on "Learning to Learn"^1, which focused on the need for lifelong machine-learning methods that retain and reuse previously learned knowledge.

Research on transfer learning has attracted more and more attention since 1995 under different names: learning to learn, life-long learning, knowledge transfer, inductive transfer, multi-task learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, meta learning, and incremental/cumulative learning [20]. Among these, a closely related learning technique to transfer learning is the multi-task learning framework [21], which tries to learn multiple tasks simultaneously even when they are different. A typical approach to multi-task learning is to uncover the common (latent) features that can benefit each individual task.

In 2005, the Broad Agency Announcement (BAA) 05-29 of the Defense Advanced Research Projects Agency (DARPA)'s Information Processing Technology Office (IPTO)^2 gave a new mission of transfer learning: the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks. In this definition, transfer learning aims to extract the knowledge from one or more source tasks and apply the knowledge to a target task. In contrast to multi-task learning, rather than learning all of the source and target tasks simultaneously, transfer learning cares most about the target task. The roles of the source and target tasks are no longer symmetric in transfer learning.

Figure 1 shows the difference between the learning processes of traditional machine learning and transfer learning techniques. As we can see, traditional machine learning techniques try to learn each task from scratch, while transfer learning techniques try to transfer the knowledge from some previous tasks to a target task when the latter has fewer high-quality training data.

[Fig. 1. Different Learning Processes between Traditional Machine Learning and Transfer Learning. (a) Traditional Machine Learning. (b) Transfer Learning.]

Today, transfer learning methods appear in several top venues, most notably in data mining (ACM KDD, IEEE ICDM and PKDD, for example), machine learning (ICML, NIPS, ECML, AAAI and IJCAI, for example) and applications of machine learning and data mining (ACM SIGIR, WWW and ACL, for example)^3. Before we give different categorizations of transfer learning, we first describe the notations used in this article.

1. http://socrates.acadiau.ca/courses/comp/dsilver/NIPS95_LTL/transfer.workshop.1995.html
2. http://www.darpa.mil/ipto/programs/tl/tl.asp
3. We summarize a list of conferences and workshops where transfer learning papers have appeared in recent years on the following webpage for reference: http://www.cse.ust.hk/~sinnopan/conferenceTL.htm

2.2 Notations and Definitions

In this section, we introduce some notations and definitions that are used in this survey. First of all, we give the definitions of a "domain" and a "task", respectively.

In this survey, a domain D consists of two components: a feature space X and a marginal probability distribution P(X), where X = {x_1, ..., x_n} ∈ X.
For example, if our learning task is document classification, and each term is taken as a binary feature, then X is the space of all term vectors, x_i is the i-th term vector corresponding to some document, and X is a particular learning sample. In general, if two domains are different, then they may have different feature spaces or different marginal probability distributions.

Given a specific domain, D = {X, P(X)}, a task consists of two components: a label space Y and an objective predictive function f(.) (denoted by T = {Y, f(.)}), which is not observed but can be learned from the training data, which consist of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The function f(.) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y|x). In our document classification example, Y is the set of all labels, which is {True, False} for a binary classification task, and y_i is "True" or "False".

For simplicity, in this survey, we only consider the case where there is one source domain D_S and one target domain D_T, as this is by far the most popular setting among the research works in the literature. More specifically, we denote the source domain data as D_S = {(x_S1, y_S1), ..., (x_SnS, y_SnS)}, where x_Si ∈ X_S is the data instance and y_Si ∈ Y_S is the corresponding class label. In our document classification example, D_S can be a set of term vectors together with their associated true or false class labels. Similarly, we denote the target domain data as D_T = {(x_T1, y_T1), ..., (x_TnT, y_TnT)}, where the input x_Ti is in X_T and y_Ti ∈ Y_T is the corresponding output. In most cases, 0 <= n_T << n_S.

We now give a unified definition of transfer learning.

Definition 1 (Transfer Learning): Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T.

In the above definition, a domain is a pair D = {X, P(X)}. Thus the condition D_S ≠ D_T implies that either X_S ≠ X_T or P_S(X) ≠ P_T(X). For example, in our document classification example, this means that between a source document set and a target document set, either the term features are different between the two sets (e.g., they use different languages), or their marginal distributions are different.

Similarly, a task is defined as a pair T = {Y, P(Y|X)}. Thus the condition T_S ≠ T_T implies that either Y_S ≠ Y_T or P(Y_S|X_S) ≠ P(Y_T|X_T). When the target and source domains are the same, i.e., D_S = D_T, and their learning tasks are the same, i.e., T_S = T_T, the learning problem becomes a traditional machine learning problem. When the domains are different, then either (1) the feature spaces between the domains are different, i.e., X_S ≠ X_T, or (2) the feature spaces between the domains are the same but the marginal probability distributions between the domain data are different, i.e., P(X_S) ≠ P(X_T). As an example, in our document classification example, case (1) corresponds to when the two sets of documents are described in different languages, and case (2) may correspond to when the source domain documents and the target domain documents focus on different topics.

Given specific domains D_S and D_T, when the learning tasks T_S and T_T are different, then either (1) the label spaces between the domains are different, i.e., Y_S ≠ Y_T, or (2) the conditional probability distributions between the domains are different, i.e., P(Y_S|X_S) ≠ P(Y_T|X_T), where y_Si ∈ Y_S and y_Ti ∈ Y_T. In our document classification example, case (1) corresponds to the situation where the source domain has binary document classes whereas the target domain has ten classes to classify the documents into. Case (2) corresponds to the situation where the source and target documents are very unbalanced in terms of the user-defined classes.

In addition, when there exists some relationship, explicit or implicit, between the feature spaces of the two domains, we say that the source and target domains are related.

2.3 A Categorization of Transfer Learning Techniques

In transfer learning, we have the following three main research issues: (1) what to transfer, (2) how to transfer, and (3) when to transfer.

"What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific to individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue.

"When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer. Most current work on transfer learning focuses on "what to transfer" and "how to transfer", by implicitly assuming that the source and target domains are related to each other. However, how to avoid negative transfer is an important open issue that is attracting more and more attention.

Based on the definition of transfer learning, we summarize the relationship between traditional machine learning and various transfer learning settings in Table 1, where we categorize transfer learning under three sub-settings, inductive transfer learning, transductive transfer learning and unsupervised transfer learning, based on different situations between the source and target domains and tasks.

TABLE 1
Relationship between Traditional Machine Learning and Various Transfer Learning Settings

Learning Settings                                  | Source and Target Domains | Source and Target Tasks
Traditional Machine Learning                       | the same                  | the same
Transfer Learning: Inductive Transfer Learning     | the same                  | different but related
Transfer Learning: Unsupervised Transfer Learning  | different but related     | different but related
Transfer Learning: Transductive Transfer Learning  | different but related     | the same

1) In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model f_T(.) for use in the target domain. In addition, according to different situations of labeled and unlabeled data in the source domain, we can further categorize the inductive transfer learning setting into two cases:

(1.1) A lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multi-task learning setting. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while multi-task learning tries to learn the target and source tasks simultaneously.

(1.2) No labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught learning setting, which was first proposed by Raina et al. [22]. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies that the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable.

2) In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available while a lot of labeled data in the source domain are available. In addition, according to different situations between the source and target domains, we can further categorize the transductive transfer learning setting into two cases:

(2.1) The feature spaces between the source and target domains are different, X_S ≠ X_T.

(2.2) The feature spaces between the domains are the same, X_S = X_T, but the marginal probability distributions of the input data are different, P(X_S) ≠ P(X_T).

The latter case of the transductive transfer learning setting is related to domain adaptation for knowledge transfer in text classification [23] and to sample selection bias [24] or co-variate shift [25], whose assumptions are similar.

3) Finally, in the unsupervised transfer learning setting, similarly to the inductive transfer learning setting, the target task is different from but related to the source task. However, unsupervised transfer learning focuses on solving unsupervised learning tasks in the target domain, such as clustering, dimensionality reduction and density estimation [26], [27]. In this case, there are no labeled data available in either the source or the target domain in training.

The relationships between the different settings of transfer learning and the related areas are summarized in Table 2 and Figure 2.

Approaches to transfer learning in the above three settings can be summarized into four cases based on "what to transfer"; Table 3 shows these four cases with brief descriptions. The first case can be referred to as the instance-based transfer learning (or instance-transfer) approach [6], [28], [29], [30], [31], [24], [32], [33], [34], [35], which assumes that certain parts of the data in the source domain can be reused for learning in the target domain by re-weighting. Instance re-weighting and importance sampling are two major techniques in this case.

A second case can be referred to as the feature-representation-transfer approach [22], [36], [37], [38], [39], [8], [40], [41], [42], [43], [44]. The intuitive idea behind this case is to learn a "good" feature representation for the target domain. In this case, the knowledge used to transfer across domains is encoded into the learned feature representation. With the new feature representation, the performance of the target task is expected to improve significantly.

A third case can be referred to as the parameter-transfer approach [45], [46], [47], [48], [49], which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyper-parameters of the models. The transferred knowledge is encoded into the shared parameters or priors. Thus, by discovering the shared parameters or priors, knowledge can be transferred across tasks.

Finally, the last case can be referred to as the relational-knowledge-transfer problem [50], which deals with transfer learning for relational domains. The basic assumption behind this case is that some relationship among the data in the source and target domains is similar. Thus, the knowledge to be transferred is the relationship among the data. Recently, statistical relational learning techniques dominate this context [51], [52].

Table 4 shows which approaches are used in each transfer learning setting. We can see that the inductive transfer learning setting has been studied in many research works, while the unsupervised transfer learning setting is a relatively new research topic and has only been studied in the context of the feature-representation-transfer case. In addition, the feature-representation-transfer problem has been proposed for all three settings of transfer learning. However, the parameter-transfer and the relational-knowledge-transfer approaches have only been studied in the inductive transfer learning setting, which we discuss in detail below.

TABLE 2
Different Settings of Transfer Learning

Transfer Learning Settings     | Related Areas                                              | Source Domain Labels | Target Domain Labels | Tasks
Inductive Transfer Learning    | Multi-task Learning                                        | Available            | Available            | Regression, Classification
Inductive Transfer Learning    | Self-taught Learning                                       | Unavailable          | Available            | Regression, Classification
Transductive Transfer Learning | Domain Adaptation, Sample Selection Bias, Co-variate Shift | Available            | Unavailable          | Regression, Classification
Unsupervised Transfer Learning |                                                            | Unavailable          | Unavailable          | Clustering, Dimensionality Reduction

[Fig. 2. An Overview of Different Settings of Transfer Learning]

TABLE 3
Different Approaches to Transfer Learning

Transfer Learning Approaches    | Brief Description
Instance-transfer               | Re-weight some labeled data in the source domain for use in the target domain [6], [28], [29], [30], [31], [24], [32], [33], [34], [35].
Feature-representation-transfer | Find a "good" feature representation that reduces the difference between the source and the target domains and the error of classification and regression models [22], [36], [37], [38], [39], [8], [40], [41], [42], [43], [44].
Parameter-transfer              | Discover shared parameters or priors between the source domain and target domain models, which can benefit transfer learning [45], [46], [47], [48], [49].
Relational-knowledge-transfer   | Build a mapping of relational knowledge between the source domain and the target domain. Both domains are relational domains and the i.i.d. assumption is relaxed in each domain [50], [51], [52].

TABLE 4
Different Approaches Used in Different Settings

                                | Inductive Transfer Learning | Transductive Transfer Learning | Unsupervised Transfer Learning
Instance-transfer               | √                           | √                              |
Feature-representation-transfer | √                           | √                              | √
Parameter-transfer              | √                           |                                |
Relational-knowledge-transfer   | √                           |                                |

3 INDUCTIVE TRANSFER LEARNING

Definition 2 (Inductive Transfer Learning): Given a source domain D_S and a learning task T_S, a target domain D_T and a learning task T_T, inductive transfer learning aims to help improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where T_S ≠ T_T.

Based on the above definition of the inductive transfer learning setting, a few labeled data in the target domain are required as the training data to induce the target predictive function. As mentioned in Section 2.3, this setting has two cases: (1) labeled data in the source domain are available; (2) labeled data in the source domain are unavailable while unlabeled data in the source domain are available. Most transfer learning approaches in this setting focus on the former case.

3.1 Transferring Knowledge of Instances

The instance-transfer approach to the inductive transfer learning setting is intuitively appealing: although the source domain data cannot be reused directly, there are certain parts of the data that can still be reused together with a few labeled data in the target domain.

Dai et al. [6] proposed a boosting algorithm, TrAdaBoost, which is an extension of the AdaBoost algorithm, to address inductive transfer learning problems. TrAdaBoost assumes that the source and target domain data use exactly the same set of features and labels, but that the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, due to the difference in distributions between the source and the target domains, some of the source domain data may be useful in learning for the target domain while some of them may not be and could even be harmful. It attempts to iteratively re-weight the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more to the target domain. In each round of iteration, TrAdaBoost trains the base classifier on the weighted source and target data. The error is only calculated on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost to update the incorrectly classified examples in the target domain, while using a different strategy from AdaBoost to update the incorrectly classified source examples in the source domain. A theoretical analysis of TrAdaBoost is also given in [6]. A code sketch of this re-weighting loop appears at the end of this subsection.

Jiang and Zhai [30] proposed a heuristic method to remove "misleading" training examples from the source domain based on the difference between the conditional probabilities P(y_T|x_T) and P(y_S|x_S). Liao et al. [31] proposed a new active learning method to select the unlabeled data in a target domain to be labeled with the help of the source domain data. Wu and Dietterich [53] integrated the source domain (auxiliary) data into an SVM framework for improving the classification performance.
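What follows is a minimal, illustrative sketch of the TrAdaBoost-style re-weighting loop described above, written in Python with NumPy and a scikit-learn decision stump as the base classifier. It is not the implementation of [6]; the function name, the choice of base learner, the number of rounds and the edge-case handling are our own assumptions, and binary labels in {0, 1} are assumed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_sketch(Xs, ys, Xt, yt, n_rounds=20):
    """Illustrative TrAdaBoost-style instance re-weighting (labels in {0, 1}).

    Each round trains on the union of weighted source and target data,
    measures the error on the target portion only, down-weights
    misclassified source instances and up-weights misclassified target ones.
    """
    n_s = len(ys)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt]).astype(float)
    w = np.ones(len(y))                             # instance weights
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    learners, betas = [], []

    for _ in range(n_rounds):
        p = w / w.sum()
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X, y, sample_weight=p)              # weighted base classifier
        err = np.abs(clf.predict(X) - y)            # 0/1 loss per instance
        eps = np.sum(p[n_s:] * err[n_s:]) / np.sum(p[n_s:])   # target error only
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        w[:n_s] *= beta_src ** err[:n_s]            # shrink "bad" source weights
        w[n_s:] *= beta_t ** (-err[n_s:])           # grow hard target weights
        learners.append(clf)
        betas.append(beta_t)

    def predict(Xnew):
        # weighted vote of the later boosting rounds
        start = n_rounds // 2
        votes, total = np.zeros(len(Xnew)), 0.0
        for clf, b in zip(learners[start:], betas[start:]):
            alpha = np.log(1.0 / max(b, 1e-10))
            votes += alpha * clf.predict(Xnew)
            total += alpha
        return (votes >= total / 2).astype(int)

    return predict

The final prediction above mirrors the convention of voting over the later boosting rounds; for the exact constants and the theoretical guarantees, see [6].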
3.2 Transferring Knowledge of Feature Representations

The feature-representation-transfer approach to the inductive transfer learning problem aims at finding "good" feature representations to minimize domain divergence and classification or regression model error. The strategies used to find "good" feature representations differ for different types of source domain data. If a lot of labeled data in the source domain are available, supervised learning methods can be used to construct a feature representation. This is similar to common feature learning in the field of multi-task learning [40]. If no labeled data in the source domain are available, unsupervised learning methods are proposed to construct the feature representation.

3.2.1 Supervised Feature Construction

Supervised feature construction methods for the inductive transfer learning setting are similar to those used in multi-task learning. The basic idea is to learn a low-dimensional representation that is shared across related tasks. In addition, the learned new representation can reduce the classification or regression model error of each task as well. Argyriou et al. [40] proposed a sparse feature learning method for multi-task learning. In the inductive transfer learning setting, the common features can be learned by solving the following optimization problem:

\arg\min_{A,U} \sum_{t \in \{T,S\}} \sum_{i=1}^{n_t} L\big(y_{t_i}, \langle a_t, U^T x_{t_i} \rangle\big) + \gamma \|A\|_{2,1}^2    (1)
s.t.  U \in O^d

In this equation, S and T denote the tasks in the source domain and target domain, respectively. A = [a_S, a_T] \in R^{d \times 2} is a matrix of parameters, and U is a d x d orthogonal matrix (mapping function) for mapping the original high-dimensional data to low-dimensional representations. The (r, p)-norm of A is defined as \|A\|_{r,p} := \big(\sum_{i=1}^{d} \|a^i\|_r^p\big)^{1/p}. The optimization problem (1) estimates the low-dimensional representations U^T X_T, U^T X_S and the parameters A of the model at the same time. It can be further transformed into an equivalent convex optimization formulation and solved efficiently. In a follow-up work, Argyriou et al. [41] proposed a spectral regularization framework on matrices for multi-task structure learning.

Lee et al. [42] proposed a convex optimization algorithm for simultaneously learning meta-priors and feature weights from an ensemble of related prediction tasks. The meta-priors can be transferred among different tasks. Jebara [43] proposed to select features for multi-task learning with SVMs. Ruckert et al. [54] designed a kernel-based approach to inductive transfer, which aims at finding a suitable kernel for the target data.

3.2.2 Unsupervised Feature Construction

In [22], Raina et al. proposed to apply sparse coding [55], which is an unsupervised feature construction method, to learn higher-level features for transfer learning. The basic idea of this approach consists of two steps. In the first step, higher-level basis vectors b = {b_1, b_2, ..., b_s} are learned on the source domain data by solving the optimization problem (2):

\min_{a,b} \sum_i \Big\| x_{S_i} - \sum_j a_{S_i}^j b_j \Big\|_2^2 + \beta \|a_{S_i}\|_1    (2)
s.t.  \|b_j\|_2 \le 1, \; \forall j \in 1, \dots, s

In this equation, a_{S_i}^j is a new representation of basis b_j for input x_{S_i}, and \beta is a coefficient that balances the feature construction term and the regularization term. After learning the basis vectors b, in the second step, an optimization algorithm (3) is applied on the target domain data to learn higher-level features based on the basis vectors b:

a^*_{T_i} = \arg\min_{a_{T_i}} \Big\| x_{T_i} - \sum_j a_{T_i}^j b_j \Big\|_2^2 + \beta \|a_{T_i}\|_1    (3)

Finally, discriminative algorithms can be applied to {a^*_{T_i}} with the corresponding labels to train classification or regression models for use in the target domain; a code sketch of these two steps appears at the end of this subsection. One drawback of this method is that the so-called higher-level basis vectors learned on the source domain in the optimization problem (2) may not be suitable for use in the target domain.

Recently, manifold learning methods have been adapted for transfer learning. In [44], Wang and Mahadevan proposed a Procrustes analysis based approach to manifold alignment without correspondences, which can be used to transfer knowledge across domains via the aligned manifolds.
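As an illustration of the two-step procedure above, the following is a minimal sketch using scikit-learn's dictionary-learning and sparse-coding utilities: basis vectors are learned on unlabeled source data (problem (2)), the labeled target data are re-encoded against that basis (problem (3)), and a standard classifier is trained on the new codes. The specific estimators, parameter values and variable names are illustrative choices on our part, not the implementation used in [22].

import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder
from sklearn.linear_model import LogisticRegression

def self_taught_features(X_source, X_target, y_target, n_basis=64, beta=1.0):
    """Sparse-coding feature construction in the spirit of problems (2)-(3)."""
    # Step 1: learn higher-level basis vectors b on (unlabeled) source data.
    dico = DictionaryLearning(n_components=n_basis, alpha=beta,
                              transform_algorithm="lasso_lars",
                              transform_alpha=beta, max_iter=200)
    dico.fit(X_source)

    # Step 2: encode the target data against the learned basis (L1-regularized).
    coder = SparseCoder(dictionary=dico.components_,
                        transform_algorithm="lasso_lars",
                        transform_alpha=beta)
    A_target = coder.transform(X_target)        # the new representations a*_Ti

    # Step 3: train any discriminative model on the new target representation.
    clf = LogisticRegression(max_iter=1000).fit(A_target, y_target)
    return dico, coder, clf

# Usage with random data, just to show the shapes involved:
rng = np.random.RandomState(0)
Xs = rng.rand(200, 30)                            # unlabeled source instances
Xt, yt = rng.rand(40, 30), rng.randint(0, 2, 40)  # a few labeled target instances
_, _, model = self_taught_features(Xs, Xt, yt)

In practice, the basis size and the sparsity parameter beta would be tuned on held-out target data.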
3.3 Transferring Knowledge of Parameters

Most parameter-transfer approaches to the inductive transfer learning setting assume that individual models for related tasks should share some parameters or prior distributions of hyper-parameters. Most approaches described in this section, including a regularization framework and a hierarchical Bayesian framework, are designed to work under multi-task learning. However, they can be easily modified for transfer learning. As mentioned above, multi-task learning tries to learn both the source and target tasks simultaneously and equally well, while transfer learning only aims at boosting the performance of the target domain by utilizing the source domain data. Thus, in multi-task learning, the weights of the loss functions for the source and target data are the same. In contrast, in transfer learning, the weights in the loss functions for different domains can be different. Intuitively, we may assign a larger weight to the loss function of the target domain to make sure that we can achieve better performance in the target domain.

Lawrence and Platt [45] proposed an efficient algorithm known as MT-IVM, which is based on Gaussian Processes (GP), to handle the multi-task learning case. MT-IVM tries to learn the parameters of a Gaussian Process over multiple tasks by sharing the same GP prior. Bonilla et al. [46] also investigated multi-task learning in the context of GP. The authors proposed to use a free-form covariance matrix over tasks to model inter-task dependencies, where a GP prior is used to induce correlations between tasks. Schwaighofer et al. [47] proposed to use a hierarchical Bayesian framework (HB) together with GP for multi-task learning.

Besides transferring the priors of GP models, some researchers also proposed to transfer the parameters of SVMs under a regularization framework. Evgeniou and Pontil [48] borrowed the idea of HB to SVMs for multi-task learning. The proposed method assumed that the parameter w of the SVM for each task can be separated into two terms: a common term over tasks and a task-specific term. In inductive transfer learning,

w_S = w_0 + v_S  and  w_T = w_0 + v_T,

where w_S and w_T are the parameters of the SVMs for the source task and the target learning task, respectively; w_0 is a common parameter while v_S and v_T are specific parameters for the source task and the target task, respectively. By assuming f_t = w_t \cdot x to be a hyper-plane for task t, an extension of SVMs to the multi-task learning case can be written as follows:

\min_{w_0, v_t, \xi_{t_i}} J(w_0, v_t, \xi_{t_i}) = \sum_{t \in \{S,T\}} \sum_{i=1}^{n_t} \xi_{t_i} + \frac{\lambda_1}{2} \sum_{t \in \{S,T\}} \|v_t\|^2 + \lambda_2 \|w_0\|^2    (4)
s.t.  y_{t_i} (w_0 + v_t) \cdot x_{t_i} \ge 1 - \xi_{t_i},
      \xi_{t_i} \ge 0, \; i \in \{1, 2, ..., n_t\} and t \in \{S, T\}.

By solving the optimization problem above, we can learn the parameters w_0, v_S and v_T simultaneously.

Several researchers have pursued the parameter-transfer approach further. Gao et al. [49] proposed a locally weighted ensemble learning framework to combine multiple models for transfer learning, where the weights are dynamically assigned according to a model's predictive power on each test example in the target domain.

3.4 Transferring Relational Knowledge

Different from the other three contexts, the relational-knowledge-transfer approach deals with transfer learning problems in relational domains, where the data are non-i.i.d. and can be represented by multiple relations, such as networked data and social network data. This approach does not assume that the data drawn from each domain are independent and identically distributed (i.i.d.), as is traditionally assumed. It tries to transfer the relationship among data from a source domain to a target domain. In this context, statistical relational learning techniques are proposed to solve these problems.

Mihalkova et al. [50] proposed an algorithm, TAMAR, that transfers relational knowledge with Markov Logic Networks (MLNs) across relational domains. MLNs [56] are a powerful formalism for statistical relational learning, which combines the compact expressiveness of first-order logic with the flexibility of probability. In MLNs, entities in a relational domain are represented by predicates and their relationships are represented in first-order logic. TAMAR is motivated by the fact that if two domains are related to each other, there may exist mappings to connect entities and their relationships from a source domain to a target domain. For example, a professor can be considered as playing a similar role in an academic domain as a manager in an industrial management domain. In addition, the relationship between a professor and his or her students is similar to the relationship between a manager and his or her workers. Thus, there may exist a mapping from professor to manager and a mapping from the professor-student relationship to the manager-worker relationship. In this vein, TAMAR tries to use an MLN learned for a source domain to aid in the learning of an MLN for a target domain.

Basically, TAMAR is a two-stage algorithm. In the first step, a mapping is constructed from a source MLN to the target domain based on the weighted pseudo log-likelihood measure (WPLL). In the second step, a revision is done for the mapped structure in the target domain through the FORTE algorithm [57], which is an inductive logic programming (ILP) algorithm for revising first-order theories. The revised MLN can be used as a relational model for inference or reasoning in the target domain.

In the AAAI-2008 workshop on transfer learning for complex tasks^4, Mihalkova et al. [51] extended TAMAR to the single-entity-centered setting of transfer learning, where only one entity in a target domain is available. Davis et al. [52] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. The basic idea of the algorithm is to discover structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and to instantiate these formulas with predicates from the target domain.

4. http://www.cs.utexas.edu/~mtaylor/AAAI08TL/

4 TRANSDUCTIVE TRANSFER LEARNING

The term transductive transfer learning was first proposed by Arnold et al. [58], who required that the source and target tasks be the same, although the domains may be different. On top of these conditions, they further required that all unlabeled data in the target domain be available at training time, but we believe that this condition can be relaxed; instead, in our definition of the transductive transfer learning setting, we only require that part of the unlabeled target data be seen at training time in order to obtain the marginal probability for the target data.

Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning [59] refers to the situation where all test data are required to be seen at training time, and the learned model cannot be reused for future data. Thus, when some new test data arrive, they must be classified together with all existing data. In our categorization of transfer learning, in contrast, we use the term transductive to emphasize the concept that in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.

Definition 3 (Transductive Transfer Learning): Given a source domain D_S and a corresponding learning task T_S, a target domain D_T and a corresponding learning task T_T, transductive transfer learning aims to improve the learning of the target predictive function f_T(.) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T and T_S = T_T. In addition, some unlabeled target-domain data must be available at training time.

This definition covers the work of Arnold et al. [58], since the latter considered domain adaptation, where the difference lies between the marginal probability distributions of the source and target data; i.e., the tasks are the same but the domains are different.

Similar to the traditional transductive learning setting, which aims to make the best use of the unlabeled test data for learning, in our classification scheme under transductive transfer learning, we also assume that some target-domain unlabeled data are given. In the above definition of transductive transfer learning, the source and target tasks are the same, which implies that one can adapt the predictive function learned in the source domain for use in the target domain through some unlabeled target-domain data. As mentioned in Section 2.3, this setting can be split into two cases: (a) the feature spaces between the source and target domains are different, X_S ≠ X_T, and (b) the feature spaces between the domains are the same, X_S = X_T, but the marginal probability distributions of the input data are different, P(X_S) ≠ P(X_T). This is similar to the requirements in domain adaptation and sample selection bias. Most approaches described in the following sections are related to case (b) above.

4.1 Transferring the Knowledge of Instances

Most instance-transfer approaches to the transductive transfer learning setting are motivated by importance sampling. To see how importance-sampling-based methods may help in this setting, we first review the problem of empirical risk minimization (ERM) [60]. In general, we might want to learn the optimal parameters \theta^* of the model by minimizing the expected risk,

\theta^* = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x,y) \in P} \,[\, l(x, y, \theta) \,],

where l(x, y, \theta) is a loss function that depends on the parameter \theta. However, since it is hard to estimate the probability distribution P, we choose to minimize the empirical risk instead,

\theta^* = \arg\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} l(x_i, y_i, \theta),

where n is the size of the training data.

In the transductive transfer learning setting, we want to learn an optimal model for the target domain by minimizing the expected risk,

\theta^* = \arg\min_{\theta \in \Theta} \; \sum_{(x,y) \in D_T} P(D_T) \, l(x, y, \theta).

However, since no labeled data in the target domain are observed in the training data, we have to learn a model from the source domain data instead. If P(D_S) = P(D_T), then we may simply learn the model by solving the following optimization problem for use in the target domain,

\theta^* = \arg\min_{\theta \in \Theta} \; \sum_{(x,y) \in D_S} P(D_S) \, l(x, y, \theta).

Otherwise, when P(D_S) ≠ P(D_T), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain, as follows:

\theta^* = \arg\min_{\theta \in \Theta} \; \sum_{(x,y) \in D_S} \frac{P(D_T)}{P(D_S)} P(D_S) \, l(x, y, \theta)
         \approx \arg\min_{\theta \in \Theta} \; \sum_{i=1}^{n_S} \frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})} \, l(x_{S_i}, y_{S_i}, \theta).    (5)

Therefore, by adding a different penalty value to each instance (x_{S_i}, y_{S_i}) with the corresponding weight P_T(x_{T_i}, y_{T_i}) / P_S(x_{S_i}, y_{S_i}), we can learn a precise model for the target domain. A small code sketch of this re-weighting strategy follows.
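The following is a minimal, hedged sketch of this covariate-shift correction under case (b) above, where the joint-density ratio in (5) reduces to a ratio of input densities. It estimates the density ratio indirectly with a logistic-regression domain classifier, which is only one simple alternative to the estimators surveyed next; the estimator choice and all names here are illustrative, not taken from any specific cited method.

import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target):
    """Approximate the density ratio p_T(x) / p_S(x) for each source instance.

    A probabilistic classifier is trained to separate source (label 0) from
    target (label 1) inputs; the ratio of its predicted probabilities,
    rescaled by the sample sizes, approximates the density ratio.
    """
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = domain_clf.predict_proba(X_source)[:, 1]
    ratio = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return ratio * (len(X_source) / len(X_target))

def train_reweighted(X_source, y_source, X_target):
    """Fit a classifier on source data, re-weighted toward the target domain."""
    w = covariate_shift_weights(X_source, X_target)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_source, y_source, sample_weight=w)   # weighted ERM as in (5)
    return model

Because only the input distributions differ in case (b), the weights are fitted from the inputs alone; the labeled source data are then used in the weighted training step.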
Furthermore, since P(Y_T|X_T) = P(Y_S|X_S), the difference between P(D_S) and P(D_T) is caused by P(X_S) and P(X_T), and

\frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})} = \frac{P(x_{S_i})}{P(x_{T_i})}.

If we can estimate P(x_{S_i}) / P(x_{T_i}) for each instance, we can solve the transductive transfer learning problems.

There exist various ways to estimate P(x_{S_i}) / P(x_{T_i}). Zadrozny [24] proposed to estimate the terms P(x_{S_i}) and P(x_{T_i}) independently by constructing simple classification problems. Fan et al. [35] further analyzed the problem by using various classifiers to estimate the probability ratio. Huang et al. [32] proposed a kernel-mean matching (KMM) algorithm to learn P(x_{S_i}) / P(x_{T_i}) directly by matching the means between the source domain data and the target domain data in a reproducing-kernel Hilbert space (RKHS). KMM can be rewritten as the following quadratic programming (QP) optimization problem:

\min_{\beta} \; \frac{1}{2} \beta^T K \beta - \kappa^T \beta    (6)
s.t.  \beta_i \in [0, B]  and  \Big| \sum_{i=1}^{n_S} \beta_i - n_S \Big| \le n_S \epsilon,

where K = \begin{bmatrix} K_{S,S} & K_{S,T} \\ K_{T,S} & K_{T,T} \end{bmatrix} and K_{ij} = k(x_i, x_j). K_{S,S} and K_{T,T} are kernel matrices for the source domain data and the target domain data, respectively, and \kappa_i = \frac{n_S}{n_T} \sum_{j=1}^{n_T} k(x_i, x_{T_j}), where x_i \in X_S \cup X_T, while x_{T_j} \in X_T. It can be proved that \beta_i = P(x_{S_i}) / P(x_{T_i}) [32]. An advantage of using KMM is that it can avoid performing density estimation of either P(x_{S_i}) or P(x_{T_i}), which is difficult when the size of the data set is small.
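For concreteness, here is a small sketch of the KMM quadratic program (6), written with NumPy, scikit-learn's RBF kernel and the cvxpy modelling package (an assumption on our part; any QP solver would do). It follows the standard KMM formulation of [32], with beta defined over the source instances, and the parameter defaults are illustrative.

import numpy as np
import cvxpy as cp
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(X_source, X_target, gamma=1.0, B=1000.0, eps=None):
    """Kernel-mean matching: solve the QP in (6) for source-instance weights beta."""
    n_s, n_t = len(X_source), len(X_target)
    eps = eps if eps is not None else (np.sqrt(n_s) - 1) / np.sqrt(n_s)

    K = rbf_kernel(X_source, X_source, gamma=gamma)       # K_{S,S}
    K = K + 1e-8 * np.eye(n_s)                            # keep K numerically PSD
    kappa = (n_s / n_t) * rbf_kernel(X_source, X_target, gamma=gamma).sum(axis=1)

    beta = cp.Variable(n_s)
    objective = cp.Minimize(0.5 * cp.quad_form(beta, K) - kappa @ beta)
    constraints = [beta >= 0, beta <= B,
                   cp.abs(cp.sum(beta) - n_s) <= n_s * eps]
    cp.Problem(objective, constraints).solve()
    return np.asarray(beta.value).ravel()

The resulting beta can then be passed directly as per-instance weights (for example, via the sample_weight argument of a scikit-learn estimator) when training the model for the target domain.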
Sugiyama et al. [34] proposed an algorithm known as the Kullback-Leibler Importance Estimation Procedure (KLIEP) to estimate P(x_{S_i}) / P(x_{T_i}) directly, based on the minimization of the Kullback-Leibler divergence. KLIEP can be integrated with cross-validation to perform model selection automatically in two steps: (1) estimating the weights of the source domain data; (2) training models on the re-weighted data. Bickel et al. [33] combined the two steps in a unified framework by deriving a kernel-logistic regression classifier. Besides sample re-weighting techniques, Dai et al. [28] extended a traditional Naive Bayes classifier for the transductive transfer learning problems. For more information on importance sampling and re-weighting methods for co-variate shift or sample selection bias, readers can refer to a recently published book [29] by Quionero-Candela et al. One can also consult a tutorial on sample selection bias by Fan and Sugiyama at ICDM-08^5.

5. Tutorial slides can be found at http://www.cs.columbia.edu/~fan/PPT/ICDM08SampleBias.ppt

4.2 Transferring Knowledge of Feature Representations

Most feature-representation-transfer approaches to the transductive transfer learning setting work under unsupervised learning frameworks. Blitzer et al. [38] proposed a structural correspondence learning (SCL) algorithm, which extends [37], to make use of the unlabeled data from the target domain to extract some relevant features that may reduce the difference between the domains. The first step of SCL is to define a set of pivot features^6 (the number of pivot features is denoted by m) on the unlabeled data from both domains. Then, SCL removes these pivot features from the data and treats each pivot feature as a new label vector. In this way, m classification problems can be constructed. By assuming that each problem can be solved by a linear classifier of the form

f_l(x) = \mathrm{sgn}(w_l^T \cdot x), \quad l = 1, \dots, m,

SCL can learn a matrix W = [w_1 w_2 ... w_m] of parameters. In the third step, singular value decomposition (SVD) is applied to the matrix W. Let W = U D V^T; then \theta = U_{[1:h,:]}^T (h is the number of shared features) is the matrix (linear mapping) whose rows are the top left singular vectors of W. Finally, standard discriminative algorithms can be applied to the augmented feature vector to build models. The augmented feature vector contains all the original features x_i appended with the new shared features \theta x_i. As mentioned in [38], if the pivot features are well designed, then the learned mapping \theta encodes the correspondence between the features from the different domains. Although Ben-David and Schuller [61] showed experimentally that SCL can reduce the difference between domains, how to select the pivot features is difficult and domain-dependent. In [38], Blitzer et al. used a heuristic method to select pivot features for natural language processing (NLP) problems, such as tagging of sentences. In their follow-up work, the researchers proposed to use Mutual Information (MI) to choose the pivot features instead of the more heuristic criteria [8]. MI-SCL tries to find pivot features that have high dependence on the labels in the source domain.

6. The pivot features are domain specific and depend on prior knowledge.

Transfer learning in the NLP domain is sometimes referred to as domain adaptation. In this area, Daume [39] proposed a kernel-mapping function for NLP problems, which maps the data from both source and target domains to a high-dimensional feature space, where standard discriminative learning methods are used to train the classifiers. However, the constructed kernel-mapping function is driven by domain knowledge, and it is not easy to generalize it to other areas or applications. Blitzer et al. [62] analyzed the uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risks.

In [36], Dai et al. proposed a co-clustering based algorithm to propagate the label information across different domains. In [63], Xing et al. proposed a novel algorithm known as bridged refinement to correct the labels predicted by a shift-unaware classifier towards a target distribution, taking the mixture distribution of the training and test data as a bridge to better transfer from the training data to the test data. In [64], Ling et al. proposed a spectral classification framework for the cross-domain transfer learning problem, where the objective function is introduced to seek consistency between the in-domain supervision and the out-of-domain intrinsic structure. In [65], Xue et al. proposed a cross-domain text classification algorithm that extended the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data from different but related domains into a unified probabilistic model. The new model is called Topic-bridged PLSA, or TPLSA.

Transfer learning via dimensionality reduction was recently proposed by Pan et al. [66]. In this work, Pan et al. exploited the Maximum Mean Discrepancy Embedding (MMDE) method, originally designed for dimensionality reduction, to learn a low-dimensional space in which the difference between the distributions of different domains is reduced, for transductive transfer learning. However, MMDE may suffer from its computational burden. Thus, in [67], Pan et al. further proposed an efficient feature extraction algorithm, known as Transfer Component Analysis (TCA), to overcome the drawback of MMDE.

5 UNSUPERVISED TRANSFER LEARNING

Definition 4 (Unsupervised Transfer Learning): Given a source domain D_S with a learning task T_S, a target domain D_T and a corresponding learning task T_T, unsupervised transfer learning aims to help improve the learning of the target predictive function f_T(.)^7 in D_T using the knowledge in D_S and T_S, where T_S ≠ T_T and Y_S and Y_T are not observable.

7. In unsupervised transfer learning, the predicted labels are latent variables, such as clusters or reduced dimensions.

Based on the definition of the unsupervised transfer learning setting, no labeled data are observed in the source and target domains in training. So far, there is little research work on this setting. Recently, the Self-Taught Clustering (STC) [26] and Transferred Discriminative Analysis (TDA) [27] algorithms have been proposed to address transferred clustering and transferred dimensionality reduction problems, respectively.

5.1 Transferring Knowledge of Feature Representations

Dai et al. [26] studied a new case of clustering problems, known as self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning, which aims at clustering a small collection of unlabeled data in the target domain with the help of a large amount of unlabeled data in the source domain. STC tries to learn a common feature space across domains, which helps in clustering in the target domain. The objective function of STC is shown as follows:

J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}) = I(X_T, Z) - I(\tilde{X}_T, \tilde{Z}) + \lambda \big[ I(X_S, Z) - I(\tilde{X}_S, \tilde{Z}) \big],    (7)

where X_S and X_T are the source and target domain data, respectively, Z is a feature space shared by X_S and X_T, and I(.,.) is the mutual information between two random variables. Suppose that there exist three clustering functions C_{X_T}: X_T -> \tilde{X}_T, C_{X_S}: X_S -> \tilde{X}_S and C_Z: Z -> \tilde{Z}, where \tilde{X}_T, \tilde{X}_S and \tilde{Z} are the corresponding clusters of X_T, X_S and Z, respectively. The goal of STC is to learn \tilde{X}_T by solving the optimization problem (7):

\arg\min_{\tilde{X}_T, \tilde{X}_S, \tilde{Z}} J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}).    (8)

An iterative algorithm for solving the optimization function (8) was given in [26].

Similarly, [27] proposed a Transferred Discriminative Analysis (TDA) algorithm to solve the transferred dimensionality reduction problem. TDA first applies clustering methods to generate pseudo-class labels for the target unlabeled data. It then applies dimensionality reduction methods to the target data and labeled source data to reduce the dimensions. These two steps run iteratively to find the best subspace for the target data.
Wu and Dietterich [53] proposed to use both inadequate target domain data and plenty of low quality source domain data for image classification problems. Arnold et al. [58] proposed to use transductive transfer learning methods to solve named-entity recognition problems. In [75], [76], [77], [78], [79], transfer learning techniques are proposed to extract knowledge from WiFi localization models across time periods, space and mobile devices, to benefit WiFi localization tasks in other settings. Zhuo et al. [80] studied how to transfer domain knowledge to learn relational action models across domains in automated planning.

In [81], Raykar et al. proposed a novel Bayesian multiple-instance learning algorithm, which can automatically identify the relevant feature subset and use inductive transfer for learning multiple, but conceptually related, classifiers for computer aided design (CAD). In [82], Ling et al. proposed an information-theoretic approach for transfer learning to address the cross-language classification problem for translating Web pages from English to Chinese. The approach addressed the problem when there are plenty of labeled English text data whereas there are only a small number of labeled Chinese text documents. Transfer learning across the two feature spaces is achieved by designing a suitable mapping function as a bridge.

So far, there are at least two international competitions based on transfer learning, which made available some much needed public data. In the ECML/PKDD-2006 discovery challenge⁸, the task was to handle personalized spam filtering and generalization across related learning tasks. For training a spam-filtering system, we need to collect a lot of emails from a group of users with corresponding labels (spam or not spam) and train a classifier based on these data. For a new email user, we might want to adapt the learned model for the user. The challenge is that the distributions of emails for the first set of users and the new user are different. Thus, this problem can be modeled as an inductive transfer learning problem, which aims to adapt an old spam-filtering model to a new situation with fewer training data and less training time.

A second data set was made available through the ICDM-2007 Contest, in which a task was to estimate a WiFi client's indoor locations using the WiFi signal data obtained over different periods of time [83]. Since the values of WiFi signal strength may be a function of time, space and devices, distributions of WiFi data over different time periods may be very different. Thus, transfer learning must be designed to reduce the data re-labeling effort.

8. http://www.ecmlpkdd2006.org/challenge.html
Data Sets for Transfer Learning: So far, several data sets have been published for transfer learning research. We denote the text mining data sets, the Email spam-filtering data set, the WiFi localization over time periods data set and the Sentiment classification data set by Text, Email, WiFi and Sen, respectively.

Text: Three data sets, 20 Newsgroups, SRAA and Reuters-21578⁹, have been preprocessed for a transfer learning setting by some researchers. The data in these data sets are categorized into a hierarchical structure. Data from different sub-categories under the same parent category are considered to be from different but related domains. The task is to predict the labels of the parent category (a construction of this kind is sketched after this list).

Email: This data set is provided by the 2006 ECML/PKDD discovery challenge.

WiFi: This data set is provided by the ICDM-2007 Contest¹⁰. The data were collected inside a building of about 145.5 × 37.5 m² for localization, in two different time periods.

Sen: This data set was first used in [8]¹¹. It contains product reviews downloaded from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars).

9. http://apex.sjtu.edu.cn/apex_wiki/dwyak
10. http://www.cse.ust.hk/∼qyang/ICDMDMC2007
11. http://www.cis.upenn.edu/∼mdredze/datasets/sentiment/
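As an illustration of the Text setting described above, the following sketch builds one possible cross-domain split of 20 Newsgroups with scikit-learn, where the top-level category (rec vs. talk) is the label and the source and target domains use disjoint sub-categories. The particular sub-category assignment is hypothetical and need not match the splits used in [6], [84], [49].

```python
from sklearn.datasets import fetch_20newsgroups

# A hypothetical "rec vs. talk" split: the top-level label (rec = 0, talk = 1) is the
# prediction target, while source and target domains use disjoint sub-categories.
SOURCE_CATS = ["rec.autos", "rec.motorcycles", "talk.politics.guns", "talk.politics.misc"]
TARGET_CATS = ["rec.sport.baseball", "rec.sport.hockey",
               "talk.politics.mideast", "talk.religion.misc"]

def load_domain(categories):
    """Fetch the given sub-categories and relabel each document by its parent category."""
    data = fetch_20newsgroups(subset="all", categories=categories,
                              remove=("headers", "footers", "quotes"))
    labels = [0 if data.target_names[t].startswith("rec.") else 1 for t in data.target]
    return data.data, labels

source_docs, source_labels = load_domain(SOURCE_CATS)  # labeled source domain
target_docs, target_labels = load_domain(TARGET_CATS)  # related target domain (labels kept only for evaluation)
```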
Empirical Evaluation: To show how much benefit transfer learning methods can bring as compared to traditional learning methods, researchers have used some public data sets. We show a list taken from some published transfer learning papers in Table 5. In [6], [84], [49], the authors used the 20 Newsgroups data¹² as one of the evaluation data sets. Due to the differences in the preprocessing steps of the algorithms by different researchers, it is hard to compare the proposed methods directly. Thus, we denote them by 20-Newsgroups1, 20-Newsgroups2 and 20-Newsgroups3, respectively, and show the comparison results between the proposed transfer learning methods and non-transfer learning methods in the table.

On the 20 Newsgroups1 data, Dai et al. [6] showed comparison experiments between a standard Support Vector Machine (SVM) and the proposed TrAdaBoost algorithm. On 20 Newsgroups2, Shi et al. [84] compared an active learning algorithm that selects important instances for transfer learning (AcTraK) with TrAdaBoost and a standard SVM. Gao et al. [49] evaluated their proposed locally weighted ensemble learning algorithms, pLWE and LWE, on 20 Newsgroups3, compared to SVM and Logistic Regression (LR).

In addition, in the table, we also show the comparison results on the sentiment classification data set reported in [8]. On this data set, SGD denotes the stochastic gradient-descent algorithm with Huber loss, SCL represents a linear predictor on the new representations learned by the Structural Correspondence Learning algorithm, and SCL-MI is an extension of SCL that applies Mutual Information to select the pivot features for the SCL algorithm.

12. http://people.csail.mit.edu/jrennie/20Newsgroups/
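The pivot-selection step that distinguishes SCL-MI from plain SCL can be sketched roughly as follows: among features that occur often enough in both domains, rank them by mutual information with the source labels. This shows only the selection step, not the structural correspondence projection itself, and the vectorizer settings, thresholds and function name are our own choices.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def select_pivots(source_docs, source_labels, target_docs, n_pivots=100, min_df=5):
    """Rank candidate pivot features by mutual information with the source labels,
    keeping only features that also occur reasonably often in the target domain."""
    vectorizer = CountVectorizer(binary=True, min_df=min_df)
    X_src = vectorizer.fit_transform(source_docs)
    X_tgt = vectorizer.transform(target_docs)
    frequent_in_target = np.asarray(X_tgt.sum(axis=0)).ravel() >= min_df
    mi = mutual_info_classif(X_src, source_labels, discrete_features=True)
    mi[~frequent_in_target] = -np.inf          # rule out domain-specific features
    top = np.argsort(mi)[::-1][:n_pivots]
    feature_names = vectorizer.get_feature_names_out()
    return [feature_names[i] for i in top]
```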
Finally, on the WiFi localization data set, we show the comparison results reported in [67], where the baselines are a regularized least squares regression model (RLSR), which is a standard regression model, and KPCA, which denotes applying RLSR to new representations of the data learned by Kernel Principal Component Analysis. The compared transfer learning methods include Kernel Mean Matching (KMM) and the proposed algorithm, Transfer Component Analysis (TCA).
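KMM, one of the transfer learning methods above, reweights the source instances so that their mean in a kernel feature space matches that of the unlabeled target data; the weights are then used as per-instance sample weights when training on the source domain. The snippet below is a simplified, unconstrained stand-in for the constrained quadratic program of the original method: it solves a ridge-regularized linear system and clips the weights, and the kernel bandwidth and clipping bound are arbitrary choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kmm_weights(X_src, X_tgt, gamma=0.5, ridge=1e-3, max_weight=10.0):
    """Approximate KMM importance weights for the source instances.

    Solves (K_ss + ridge*I) beta = kappa, where kappa_i is proportional to the
    average similarity between source point i and the target sample, then clips
    the result to [0, max_weight]. This replaces the constrained QP of the
    original KMM formulation with a rough unconstrained surrogate.
    """
    n_s, n_t = len(X_src), len(X_tgt)
    K_ss = rbf_kernel(X_src, X_src, gamma)
    kappa = (n_s / n_t) * rbf_kernel(X_src, X_tgt, gamma).sum(axis=1)
    beta = np.linalg.solve(K_ss + ridge * np.eye(n_s), kappa)
    return np.clip(beta, 0.0, max_weight)

# Usage: pass the weights when fitting a source-domain model, e.g.
# model.fit(X_src, y_src, sample_weight=kmm_weights(X_src, X_tgt)).
```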
For more detail about the experimental results, the readers may refer to the reference papers shown in the table. From these comparison results, we can find that transfer learning methods designed appropriately for real-world applications can indeed improve the performance significantly compared to non-transfer learning methods.

Toolboxes for Transfer Learning: Researchers at UC Berkeley provided a MATLAB toolkit for transfer learning¹³. The toolkit contains algorithms and benchmark data sets for transfer learning. In addition, it provides a standard platform for developing and testing new algorithms for transfer learning.

13. http://multitask.cs.berkeley.edu/

7.1 Other Applications of Transfer Learning

Transfer learning has found many applications in sequential machine learning as well. For example, [85] proposed a graph-based method for identifying previously encountered games, and applied this technique to automate domain mapping for value function transfer and speed up reinforcement learning on variants of previously played games. A new approach to transfer between entirely different feature spaces is proposed in translated learning, which is made possible by learning a mapping function for bridging features in two entirely different domains (images and text) [86]. Finally, Li et al. [87], [88] have applied transfer learning to collaborative filtering problems to solve the cold-start and sparsity problems. In [87], Li et al. learned a shared rating-pattern mixture model, known as a Rating-Matrix Generative Model (RMGM), in terms of the latent user- and item-cluster variables. RMGM bridges multiple rating matrices from different domains by mapping the users and items in each rating matrix onto the shared latent user and item spaces in order to transfer useful knowledge.

In [88], they applied co-clustering algorithms to the users and items in an auxiliary rating matrix. They then constructed a cluster-level rating matrix known as a codebook. By assuming the target rating matrix (on movies) is related to the auxiliary one (on books), the target domain can be reconstructed by expanding the codebook, completing the knowledge transfer process.
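To illustrate the codebook idea at a high level, the sketch below builds a cluster-level rating matrix from an auxiliary domain and uses it to fill in the missing entries of a sparse target matrix. It assumes the auxiliary user/item clusters and the target cluster memberships have already been estimated (in [88] this is done by co-clustering and alternating membership updates, which are not reproduced here), and all function names are our own.

```python
import numpy as np

def build_codebook(R_aux, user_labels, item_labels, k_users, k_items):
    """Cluster-level rating matrix ("codebook") from an auxiliary matrix:
    the mean observed rating in each (user-cluster, item-cluster) block."""
    B = np.zeros((k_users, k_items))
    for u in range(k_users):
        for i in range(k_items):
            block = R_aux[np.ix_(user_labels == u, item_labels == i)]
            observed = block[block > 0]
            B[u, i] = observed.mean() if observed.size else 0.0
    return B

def expand_codebook(R_tgt, B, tgt_user_clusters, tgt_item_clusters):
    """Reconstruct missing target ratings (encoded as 0) from the codebook,
    given cluster memberships for the target users and items."""
    filled = R_tgt.astype(float).copy()
    predicted = B[np.asarray(tgt_user_clusters)[:, None],
                  np.asarray(tgt_item_clusters)[None, :]]
    missing = R_tgt == 0
    filled[missing] = predicted[missing]
    return filled
```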

8 CONCLUSIONS

In this survey article, we have reviewed several current trends of transfer learning. Transfer learning is classified into three different settings: inductive transfer learning, transductive transfer learning and unsupervised transfer learning. Most previous works focused on the former two settings. Unsupervised transfer learning may attract more and more attention in the future.

Furthermore, each of the approaches to transfer learning can be classified into four contexts based on "what to transfer" in learning. They include the instance-transfer approach, the feature-representation-transfer approach, the parameter-transfer approach and the relational-knowledge-transfer approach, respectively. The former three contexts have an i.i.d. assumption on the data while the last context deals with transfer learning on relational data. Most of these approaches assume that the selected source domain is related to the target domain.

In the future, several important research issues need to be addressed. First, how to avoid negative transfer is an open problem. As mentioned in Section 6, many proposed transfer learning algorithms assume that the source and target domains are related to each other in some sense. However, if the assumption does not hold, negative transfer may happen, which may cause the learner to perform worse than not transferring at all. Thus, how to make sure that no negative transfer happens is a crucial issue in transfer learning. In order to avoid negative transfer, we need to first study transferability between source domains or tasks and target domains or tasks. Based on suitable transferability measures, we can then select relevant source domains or tasks to extract knowledge from for learning the target tasks. To define the transferability between domains and tasks, we also need to define the criteria to measure the similarity between domains or tasks. Based on the distance measures, we can then cluster domains or tasks, which may help measure transferability. A related issue is, when an entire domain cannot be used for transfer learning, whether we can still transfer part of the domain for useful learning in the target domain.
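In line with the discussion above, one crude safeguard used in practice is to hold out a small amount of labeled target data and accept transfer only if a model that uses the source data beats a target-only baseline. The helper below is a hypothetical illustration of that check, with data pooling as the simplest form of transfer; it is not a transferability measure from the literature, and its name and defaults are our own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def transfer_helps(estimator, X_src, y_src, X_tgt, y_tgt, test_size=0.5, random_state=0):
    """Crude negative-transfer check: does pooling the source data beat training on
    the small labeled target set alone, as measured on held-out target data?"""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_tgt, y_tgt, test_size=test_size, random_state=random_state)
    target_only = clone(estimator).fit(X_tr, y_tr).score(X_te, y_te)
    pooled = clone(estimator).fit(
        np.vstack([X_src, X_tr]), np.concatenate([y_src, y_tr])).score(X_te, y_te)
    return pooled >= target_only, {"target_only": target_only, "pooled": pooled}
```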
In addition, most existing transfer learning algorithms have so far focused on improving generalization across different distributions between source and target domains or tasks. In doing so, they assumed that the feature spaces between the source and target domains are the same. However, in many applications, we may wish to transfer knowledge across domains or tasks that have different feature spaces, and transfer from multiple such source domains. We refer to this type of transfer learning as heterogeneous transfer learning.

Finally, so far transfer learning techniques have been mainly applied to small-scale applications with a limited variety, such as sensor-network-based localization, text classification and image classification problems. In the future, transfer learning techniques will be widely used to solve other challenging applications, such as video classification, social network analysis and logical inference.

Acknowledgment

We thank the support of Hong Kong CERG Project 621307 and a grant from NEC China Lab.
TABLE 5
Comparison between transfer learning and non-transfer learning methods

Data Set (reference) Source v.s. Target Baselines TL Methods
20 Newsgroups1 ([6]) SVM TrAdaBoost
ACC (unit: %) rec v.s. talk 87.3% 92.0%
rec v.s. sci 83.6% 90.3%
sci v.s. talk 82.3% 87.5%
20 Newsgroups2 ([84]) SVM TrAdaBoost AcTraK
ACC (unit: %) rec v.s. talk 60.2% 72.3% 75.4%
rec v.s. sci 59.1% 67.4% 70.6%
comp v.s. talk 53.6% 74.4% 80.9%
comp v.s. sci 52.7% 57.3% 78.0%
comp v.s. rec 49.1% 77.2% 82.1%
sci v.s. talk 57.6% 71.3% 75.1%
20 Newsgroups3 ([49]) SVM LR pLWE LWE
ACC (unit: %) comp v.s. sci 71.18% 73.49% 78.72% 97.44%
rec v.s. talk 68.24% 72.17% 72.17% 99.23%
rec v.s. sci 78.16% 78.85% 88.45% 98.23%
sci v.s. talk 75.77% 79.04% 83.30% 96.92%
comp v.s. rec 81.56% 83.34% 91.93% 98.16%
comp v.s. talk 93.89% 91.76% 96.64% 98.90%
Sentiment Classification ([8]) SGD SCL SCL-MI
ACC (unit: %) DVD v.s. book 72.8% 76.8% 79.7%
electronics v.s. book 70.7% 75.4% 75.4%
kitchen v.s. book 70.9% 66.1% 68.6%
book v.s. DVD 77.2% 74.0% 75.8%
electronics v.s. DVD 70.6% 74.3% 76.2%
kitchen v.s. DVD 72.7% 75.4% 76.9%
book v.s. electronics 70.8% 77.5% 75.9%
DVD v.s. electronics 73.0% 74.1% 74.1%
kitchen v.s. electronics 82.7% 83.7% 86.8%
book v.s. kitchen 74.5% 78.7% 78.9%
DVD v.s. kitchen 74.0% 79.4% 81.4%
electronics v.s. kitchen 84.0% 84.4% 85.9%
WiFi Localization ([67]) RLSR KPCA KMM TCA
AED (unit: meter) Time A v.s. Time B 6.52 3.16 5.51 2.37
REFERENCES

[1] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. F. M. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37, 2008.
[2] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology and Decision Making, vol. 5, no. 4, pp. 597-604, 2006.
[3] G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu, "Text classification without negative examples revisit," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 6-20, 2006.
[4] H. Al-Mubaid and S. A. Umair, "A new text categorization technique using distributional clustering and learning logic," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156-1165, 2006.
[5] K. Sarinnapakorn and M. Kubat, "Combining subclassifiers in text categorization: A dst-based solution and a case study," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 12, pp. 1638-1651, 2007.
[6] W. Dai, Q. Yang, G. Xue, and Y. Yu, "Boosting for transfer learning," in Proceedings of the 24th International Conference on Machine Learning, Corvalis, Oregon, USA, June 2007, pp. 193-200.
[7] S. J. Pan, V. W. Zheng, Q. Yang, and D. H. Hu, "Transfer learning for wifi-based indoor localization," in Proceedings of the Workshop on Transfer Learning for Complex Tasks of the 23rd AAAI Conference on Artificial Intelligence, Chicago, Illinois, USA, July 2008.
[8] J. Blitzer, M. Dredze, and F. Pereira, "Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 2007, pp. 432-439.
[9] J. Ramon, K. Driessens, and T. Croonenborghs, "Transfer learning in reinforcement learning problems through partial policy recycling," in ECML '07: Proceedings of the 18th European Conference on Machine Learning. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 699-707.
[10] M. E. Taylor and P. Stone, "Cross-domain transfer for reinforcement learning," in ICML '07: Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 879-886.
[11] X. Yin, J. Han, J. Yang, and P. S. Yu, "Efficient classification across multiple database relations: A crossmine approach," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 770-783, 2006.
[12] L. I. Kuncheva and J. J. Rodríguez, "Classifier ensembles with a random linear oracle," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 4, pp. 500-508, 2007.
[13] E. Baralis, S. Chiusano, and P. Garza, "A lazy approach to associative classification," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 156-171, 2008.
[14] X. Zhu, "Semi-supervised learning literature survey," University of Wisconsin-Madison, Tech. Rep. 1530, 2006.
[15] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using em," Machine Learning, vol. 39, no. 2-3, pp. 103-134, 2000.
[16] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92-100.
[17] T. Joachims, "Transductive inference for text classification using support vector machines," in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp. 825-830.
[18] X. Zhu and X. Wu, "Class noise handling for effective cost-sensitive learning by cost-guided iterative classification filtering," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1435-1440, 2006.
[19] Q. Yang, C. Ling, X. Chai, and R. Pan, "Test-cost sensitive classification on data with missing values," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 5, pp. 626-638, 2006.
[20] S. Thrun and L. Pratt, Eds., Learning to Learn. Norwell, MA, USA: Kluwer Academic Publishers, 1998.
[21] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41-75, 1997.
[22] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: Transfer learning from unlabeled data," in Proceedings of the 24th International Conference on Machine Learning, Corvalis, Oregon, USA, June 2007, pp. 759-766.
[23] H. Daumé III and D. Marcu, "Domain adaptation for statistical classifiers," Journal of Artificial Intelligence Research, vol. 26, pp. 101-126, 2006.
[24] B. Zadrozny, "Learning and evaluating classifiers under sample selection bias," in Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, July 2004.
[25] H. Shimodaira, "Improving predictive inference under covariate shift by weighting the log-likelihood function," Journal of Statistical Planning and Inference, vol. 90, pp. 227-244, 2000.
[26] W. Dai, Q. Yang, G. Xue, and Y. Yu, "Self-taught clustering," in Proceedings of the 25th International Conference on Machine Learning. ACM, July 2008, pp. 200-207.
[27] Z. Wang, Y. Song, and C. Zhang, "Transferred dimensionality reduction," in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008. Antwerp, Belgium: Springer, September 2008, pp. 550-565.
[28] W. Dai, G. Xue, Q. Yang, and Y. Yu, "Transferring naive bayes classifiers for text classification," in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 2007, pp. 540-545.
[29] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press, 2009.
[30] J. Jiang and C. Zhai, "Instance weighting for domain adaptation in nlp," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 264-271.
[31] X. Liao, Y. Xue, and L. Carin, "Logistic regression with an auxiliary data source," in Proceedings of the 21st International Conference on Machine Learning, Bonn, Germany, August 2005, pp. 505-512.
[32] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," in Proceedings of the 19th Annual Conference on Neural Information Processing Systems, 2007.
[33] S. Bickel, M. Brückner, and T. Scheffer, "Discriminative learning for differing training and test distributions," in Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 81-88.
[34] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, "Direct importance estimation with model selection and its application to covariate shift adaptation," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2008.
[35] W. Fan, I. Davidson, B. Zadrozny, and P. S. Yu, "An improved categorization of classifier's sensitivity on sample selection bias," in Proceedings of the 5th IEEE International Conference on Data Mining, 2005.
[36] W. Dai, G. Xue, Q. Yang, and Y. Yu, "Co-clustering based classification for out-of-domain documents," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 2007.
[37] R. K. Ando and T. Zhang, "A high-performance semi-supervised learning method for text chunking," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 2005, pp. 1-9.
[38] J. Blitzer, R. McDonald, and F. Pereira, "Domain adaptation with structural correspondence learning," in Proceedings of the Conference on Empirical Methods in Natural Language, Sydney, Australia, July 2006, pp. 120-128.
[39] H. Daumé III, "Frustratingly easy domain adaptation," in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, June 2007, pp. 256-263.
[40] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proceedings of the 19th Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2007, pp. 41-48.
[41] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, "A spectral regularization framework for multi-task structure learning," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, pp. 25-32.
[42] S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller, "Learning a meta-level prior for feature relevance from multiple related tasks," in Proceedings of the 24th International Conference on Machine Learning. Corvalis, Oregon, USA: ACM, July 2007, pp. 489-496.
[43] T. Jebara, "Multi-task feature and kernel selection for svms," in Proceedings of the 21st International Conference on Machine Learning. Banff, Alberta, Canada: ACM, July 2004.
[44] C. Wang and S. Mahadevan, "Manifold alignment using procrustes analysis," in Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM, July 2008, pp. 1120-1127.
[45] N. D. Lawrence and J. C. Platt, "Learning to learn with the informative vector machine," in Proceedings of the 21st International Conference on Machine Learning. Banff, Alberta, Canada: ACM, July 2004.
[46] E. Bonilla, K. M. Chai, and C. Williams, "Multi-task gaussian process prediction," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, pp. 153-160.
[47] A. Schwaighofer, V. Tresp, and K. Yu, "Learning gaussian process kernels via hierarchical bayes," in Proceedings of the 17th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2005, pp. 1209-1216.
[48] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington, USA: ACM, August 2004, pp. 109-117.
[49] J. Gao, W. Fan, J. Jiang, and J. Han, "Knowledge transfer via multiple model local structure mapping," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, Nevada: ACM, August 2008, pp. 283-291.
[50] L. Mihalkova, T. Huynh, and R. J. Mooney, "Mapping and revising markov logic networks for transfer learning," in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 2007, pp. 608-614.
[51] L. Mihalkova and R. J. Mooney, "Transfer learning by mapping with minimal target data," in Proceedings of the AAAI-2008 Workshop on Transfer Learning for Complex Tasks, Chicago, Illinois, USA, July 2008.
[52] J. Davis and P. Domingos, "Deep transfer via second-order markov logic," in Proceedings of the AAAI-2008 Workshop on Transfer Learning for Complex Tasks, Chicago, Illinois, USA, July 2008.
[53] P. Wu and T. G. Dietterich, "Improving svm accuracy by training on auxiliary data sources," in Proceedings of the 21st International Conference on Machine Learning. Banff, Alberta, Canada: ACM, July 2004.
[54] U. Rückert and S. Kramer, "Kernel-based inductive transfer," in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, ser. Lecture Notes in Computer Science. Antwerp, Belgium: Springer, September 2008, pp. 220-233.
[55] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proceedings of the 19th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007, pp. 801-808.
[56] M. Richardson and P. Domingos, "Markov logic networks," Machine Learning Journal, vol. 62, no. 1-2, pp. 107-136, 2006.
[57] S. Ramachandran and R. J. Mooney, "Theory refinement of bayesian networks with hidden variables," in Proceedings of the 14th International Conference on Machine Learning, Madison, Wisconsin, USA, July 1998, pp. 454-462.
[58] A. Arnold, R. Nallapati, and W. W. Cohen, "A comparative study of methods for transductive transfer learning," in Proceedings of the 7th IEEE International Conference on Data Mining Workshops. Washington, DC, USA: IEEE Computer Society, 2007, pp. 77-82.
[59] T. Joachims, "Transductive inference for text classification using support vector machines," in Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, USA, 1999, pp. 200-209.
[60] V. N. Vapnik, Statistical Learning Theory. New York: Wiley-Interscience, September 1998.
[61] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, "Analysis of representations for domain adaptation," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007, pp. 137-144.
[62] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, "Learning bounds for domain adaptation," in Proceedings of the 21st Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, pp. 129-136.
[63] D. Xing, W. Dai, G.-R. Xue, and Y. Yu, "Bridged refinement for transfer learning," in 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science. Warsaw, Poland: Springer, September 2007, pp. 324-335.
[64] X. Ling, W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, "Spectral domain-transfer learning," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, Nevada: ACM, August 2008, pp. 488-496.
[65] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu, "Topic-bridged plsa for cross-domain text classification," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore: ACM, July 2008, pp. 627-634.
[66] S. J. Pan, J. T. Kwok, and Q. Yang, "Transfer learning via dimensionality reduction," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Chicago, Illinois, USA, July 2008, pp. 677-682.
[67] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, 2009.
[68] M. M. H. Mahmud and S. R. Ray, "Transfer learning using kolmogorov complexity: Basic theory and empirical evaluations," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, pp. 985-992.
[69] E. Eaton, M. desJardins, and T. Lane, "Modeling transfer relationships between learning tasks for improved inductive transfer," in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, ser. Lecture Notes in Computer Science. Antwerp, Belgium: Springer, September 2008, pp. 317-332.
[70] M. T. Rosenstein, Z. Marx, and L. P. Kaelbling, "To transfer or not to transfer," in NIPS-05 Workshop on Inductive Transfer: 10 Years Later, December 2005.
[71] S. Ben-David and R. Schuller, "Exploiting task relatedness for multiple task learning," in Proceedings of the Sixteenth Annual Conference on Learning Theory. San Francisco: Morgan Kaufmann, 2003, pp. 825-830.
[72] B. Bakker and T. Heskes, "Task clustering and gating for bayesian multitask learning," Journal of Machine Learning Research, vol. 4, pp. 83-99, 2003.
[73] A. Argyriou, A. Maurer, and M. Pontil, "An algorithm for transfer learning in a heterogeneous environment," in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, ser. Lecture Notes in Computer Science. Antwerp, Belgium: Springer, September 2008, pp. 71-85.
[74] R. Raina, A. Y. Ng, and D. Koller, "Constructing informative priors using transfer learning," in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, USA, June 2006, pp. 713-720.
[75] J. Yin, Q. Yang, and L. M. Ni, "Adaptive temporal radio maps for indoor location estimation," in Proceedings of the 3rd IEEE International Conference on Pervasive Computing and Communications, Kauai Island, Hawaii, USA, March 2005.
[76] S. J. Pan, J. T. Kwok, Q. Yang, and J. J. Pan, "Adaptive localization in a dynamic WiFi environment through multi-view learning," in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 2007, pp. 1108-1113.
[77] V. W. Zheng, Q. Yang, W. Xiang, and D. Shen, "Transferring localization models over time," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Chicago, Illinois, USA, July 2008, pp. 1421-1426.
[78] S. J. Pan, D. Shen, Q. Yang, and J. T. Kwok, "Transferring localization models across space," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Chicago, Illinois, USA, July 2008, pp. 1383-1388.
[79] V. W. Zheng, S. J. Pan, Q. Yang, and J. J. Pan, "Transferring multi-device localization models using latent multi-task learning," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Chicago, Illinois, USA, July 2008, pp. 1427-1432.
[80] H. Zhuo, Q. Yang, D. H. Hu, and L. Li, "Transferring knowledge from another domain for learning action models," in Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence, December 2008.
[81] V. C. Raykar, B. Krishnapuram, J. Bi, M. Dundar, and R. B. Rao, "Bayesian multiple instance learning: automatic feature selection and inductive transfer," in Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: ACM, July 2008, pp. 808-815.
[82] X. Ling, G.-R. Xue, W. Dai, Y. Jiang, Q. Yang, and Y. Yu, "Can chinese web pages be classified with english data source?" in Proceedings of the 17th International Conference on World Wide Web. Beijing, China: ACM, April 2008, pp. 969-978.
[83] Q. Yang, S. J. Pan, and V. W. Zheng, "Estimating location using Wi-Fi," IEEE Intelligent Systems, vol. 23, no. 1, pp. 8-13, 2008.
[84] X. Shi, W. Fan, and J. Ren, "Actively transfer domain knowledge," in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, ser. Lecture Notes in Computer Science. Antwerp, Belgium: Springer, September 2008, pp. 342-357.
[85] G. Kuhlmann and P. Stone, "Graph-based domain mapping for transfer learning in general games," in 18th European Conference on Machine Learning, ser. Lecture Notes in Computer Science. Warsaw, Poland: Springer, September 2007, pp. 188-200.
[86] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu, "Translated learning," in Proceedings of the 21st Annual Conference on Neural Information Processing Systems, 2008.
[87] B. Li, Q. Yang, and X. Xue, "Transfer learning for collaborative filtering via a rating-matrix generative model," in Proceedings of the 26th International Conference on Machine Learning, Montreal, Quebec, Canada, June 2009.
[88] B. Li, Q. Yang, and X. Xue, "Can movies and books collaborate? - cross-domain collaborative filtering for sparsity reduction," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 2009.

Sinno Jialin Pan is a PhD candidate in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. He received the MS and BS degrees from the Applied Mathematics Department, Sun Yat-sen University, China, in 2003 and 2005, respectively. His research interests include transfer learning, semi-supervised learning, and their applications in pervasive computing and Web mining. He is a member of AAAI. Contact him at the Department of Computer Science and Engineering, Hong Kong Univ. of Science and Technology, Clearwater Bay, Kowloon, Hong Kong; [email protected]; http://www.cse.ust.hk/∼sinnopan.

Qiang Yang is a faculty member in the Hong Kong University of Science and Technology's Department of Computer Science and Engineering. His research interests are data mining and machine learning, AI planning and sensor based activity recognition. He received his PhD degree in Computer Science from the University of Maryland, College Park, and Bachelor's degree from Peking University in Astrophysics. He is a fellow of IEEE, member of AAAI and ACM, a former associate editor for the IEEE TKDE, and a current associate editor for IEEE Intelligent Systems. Contact him at the Department of Computer Science and Engineering, Hong Kong Univ. of Science and Technology, Clearwater Bay, Kowloon, Hong Kong; [email protected]; http://www.cse.ust.hk/∼qyang.