Report - Adv in CS 1st
Machine learning (ML) models have become an essential tool for many applications. However,
they have also raised serious security concerns, particularly regarding privacy and confidentiality. One such security
concern is the Membership Inference Attack (MIA), in which an adversary aims to determine whether a specific
sample was used during the training of the target model. Such an attack can reveal sensitive information about the
training data, leading to data breaches and privacy violations. Various techniques have been proposed to mitigate
the risk of MIA, including regularization, differential privacy, and adversarial training. Nevertheless, MIA remains
a significant challenge for machine learning models.
In this paper, titled "Membership Inference Attacks by Exploiting Loss Trajectory," Yiyong Liu, Zhengyu Zhao,
Michael Backes, and Yang Zhang propose a new membership inference attack method that exploits the
membership information from the entire training process of the target model. The proposed method, called
TRAJECTORYMIA, leverages knowledge distillation to represent the membership information through the loss
trajectory evaluated on a sequence of intermediate models at different distillation epochs. Experimental results
show that TRAJECTORYMIA outperforms existing methods across different metrics, for example achieving a
true-positive rate at a low false-positive rate of 0.1% that is at least 6× higher than that of existing methods.
In this report, we discuss the basic idea and motivation behind the paper, the threat model it assumes, the
experimental setup and results, and the relevance of the results. Finally, we provide our perspective on the paper's
strengths and weaknesses and discuss possible extensions.
Basic Idea and Motivation:
Membership inference attacks (MIAs) are a serious threat to the privacy and confidentiality of machine learning
models. Various methods have been proposed to mitigate this threat, including regularization, differential privacy,
and adversarial training. However, these methods have limitations, and MIAs remain a significant challenge.
Existing MIA methods commonly exploit the output information, mostly the losses, solely from the given target
model. In practical scenarios, where both member and non-member samples yield similarly small losses, these
methods are unable to differentiate between them. To address this limitation, the authors propose a new attack
method, called TRAJECTORYMIA, which can exploit the membership information from the whole training
process of the target model to improve the attack performance. The proposed method represents the
membership information through the loss trajectory evaluated on a sequence of intermediate models at different
distillation epochs, together with the loss from the given target model.
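To make this concrete, the following is a minimal sketch (not the authors' code) of how a distilled loss trajectory could be assembled and fed to a membership classifier. The function names, the attack-model architecture, and the use of PyTorch are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def loss_trajectory(intermediate_models, final_model, x, y):
        # Per-sample cross-entropy loss at each distillation checkpoint,
        # followed by the loss of the final (target or shadow) model itself.
        feats = []
        with torch.no_grad():
            for m in intermediate_models:      # one checkpoint per distillation epoch
                feats.append(F.cross_entropy(m(x), y, reduction="none"))
            feats.append(F.cross_entropy(final_model(x), y, reduction="none"))
        return torch.stack(feats, dim=1)       # shape: (batch, n_epochs + 1)

    def make_attack_model(n_epochs):
        # Simple binary classifier over trajectories; trained on the shadow
        # side, where membership labels are known, then applied to the target.
        return nn.Sequential(
            nn.Linear(n_epochs + 1, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

The point of the trajectory representation is that the classifier sees how a sample's loss evolves over the distillation process rather than only its final value, which is what allows it to separate members from non-members that end up with similarly small losses.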
The authors conduct experiments on four image datasets and two non-image datasets to demonstrate the
effectiveness of the proposed method. The experimental results show that TRAJECTORYMIA consistently
outperforms existing attacks, for example achieving a true-positive rate at a false-positive rate of 0.1% that is at
least 6× higher than that of existing methods.
Threat Model:
In this paper, the authors focus on the commonly adopted black-box scenario of MIAs, in which the adversary
only has access to the posterior output from the target model. They assume that the adversary has an auxiliary
dataset that comes from the same distribution as the target model's training set. This follows the standard setting
of most of the advanced MIAs. Both the data used to train the shadow model and the data used to distill the
target/shadow model for obtaining the corresponding distilled loss trajectory are sampled from this auxiliary
dataset. Furthermore, the authors assume that the adversary knows the architecture of the target model. In Section
5 of the paper, they show that both assumptions, about the training data distribution and about the architecture of
the target model, can be relaxed.
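To illustrate the mechanics this threat model permits, the sketch below distills the target model using only its posterior outputs on a distillation split of the auxiliary dataset and keeps one student checkpoint per epoch. The architecture factory, data loader, optimizer settings, and epoch count are assumptions made for the sketch, not the paper's exact configuration.

    import copy
    import torch
    import torch.nn.functional as F

    def distill(target_model, make_student, distill_loader, n_epochs, device="cpu"):
        # Black-box distillation: the adversary only queries the target model
        # for posteriors and trains a student of the (assumed known) architecture.
        student = make_student().to(device)
        opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
        checkpoints = []
        target_model.eval()
        for _ in range(n_epochs):
            for x, _ in distill_loader:
                x = x.to(device)
                with torch.no_grad():
                    soft = F.softmax(target_model(x), dim=1)   # black-box posteriors
                loss = F.kl_div(F.log_softmax(student(x), dim=1), soft,
                                reduction="batchmean")
                opt.zero_grad()
                loss.backward()
                opt.step()
            checkpoints.append(copy.deepcopy(student).eval())  # intermediate model
        return checkpoints

The same procedure would be applied to the shadow model trained on the other part of the auxiliary dataset, so that the attack model can be trained on trajectories whose membership labels are known.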
Experimental Setup:
In their experiments, the authors considered four image datasets and two non-image datasets. The image datasets
were CIFAR-10, CINIC-10, CIFAR-100, and GTSRB, while the non-image datasets were Purchase and Location.
For the image datasets, they used the ResNet-18 and VGG16 architectures. For the non-image datasets, they used
a 2-layer fully connected neural network.
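As a reference point for the non-image setup, a 2-layer fully connected classifier of the kind described could look like the sketch below; the hidden width and the input/output dimensions are placeholders rather than values taken from the paper.

    import torch.nn as nn

    def make_mlp(n_features, n_classes, hidden=128):
        # Two fully connected layers with a ReLU in between; sizes are assumed.
        return nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )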
The authors used the standard train/test split for each dataset, with 80% of the data used for training and 20% for
testing. They also assumed a black-box scenario, where the adversary only has access to the posterior output of
the target model. In addition, the adversary has an auxiliary dataset D_a that comes from the same distribution as
the target model's training set D_t.
To evaluate the effectiveness of their attack, the authors used several metrics, including true positive rate (TPR),
false positive rate (FPR), balanced accuracy, and area under the receiver operating characteristic curve (AUC-
ROC). They also evaluated their attack against five advanced baseline methods and a state-of-the-art method
called LiRA. Finally, they evaluated their attack in the label-only scenario and on a model defended with DP-
SGD.
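For reference, these metrics can be computed from attack scores roughly as follows. This is a generic sketch, and the thresholding choices (including the 0.5 cut-off used for balanced accuracy here) are assumptions, not the paper's evaluation code.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score, balanced_accuracy_score

    def evaluate_attack(scores, labels, target_fpr=1e-3):
        # `scores`: attack-model membership scores; `labels`: 1 = member, 0 = non-member.
        fpr, tpr, _ = roc_curve(labels, scores)
        auc = roc_auc_score(labels, scores)
        # TPR at the largest operating point whose FPR does not exceed target_fpr.
        tpr_at_low_fpr = tpr[np.searchsorted(fpr, target_fpr, side="right") - 1]
        # Balanced accuracy at a simple 0.5 threshold (a choice made for this sketch).
        bal_acc = balanced_accuracy_score(labels, scores > 0.5)
        return {"auc": auc, "tpr@0.1%fpr": tpr_at_low_fpr, "balanced_acc": bal_acc}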
Experimental Results:
The authors evaluated their attack on both image and non-image datasets, and they found that their method
consistently outperformed the baseline methods in terms of TPR, FPR, balanced accuracy, and AUC-ROC.
On CIFAR-10, their attack achieved a TPR of 91.3% at 0.1% FPR, which was 6× higher than the best-performing
baseline method (LiRA). At the same 0.1% FPR, it achieved a TPR of 98.6% on CINIC-10 (9.8× higher than the
best baseline), 89.9% on CIFAR-100 (7.3× higher), and 96.2% on GTSRB (6.1× higher).
For the non-image datasets, their attack achieved a TPR of 99.8% on Purchase and a TPR of 99.9% on Location
at 0.1% FPR.
The authors also evaluated their attack in the label-only scenario, where the adversary only has access to the
predicted labels of the target model. They found that their attack still outperformed the baseline methods in this
scenario, with a TPR of 76.9% on CIFAR-10 and a TPR of 97.4% on CINIC-10 at 0.1% FPR.
Finally, the authors evaluated their attack on a model defended with DP-SGD. They found that their attack still
achieved a TPR of 76.1% on CIFAR-10 and a TPR of 96.6% on CINIC-10 at 0.1% FPR, which was higher than
the best-performing baseline method.
Overall, the experimental results showed that the proposed TRAJECTORYMIA attack method was highly
effective in the black-box setting and outperformed the state-of-the-art and baseline methods on various image
and non-image datasets.
Relevance of the Results:
Membership inference attacks are a significant concern in machine learning, particularly in applications where
sensitive data is used to train models: an attacker who can infer whether a particular sample was part of a model's
training data can cause a serious privacy breach. The results of this paper are therefore directly relevant to machine
learning privacy and security. On the one hand, the proposed method can be used by attackers to infer training-set
membership more reliably than previous attacks, highlighting the importance of developing effective defenses
against such attacks. On the other hand, it can be used by researchers to evaluate the privacy of machine learning
models and to develop more robust defenses against MIAs.
Perspective:
Strengths:
• The proposed method is a significant improvement over existing methods that solely exploit the output
information from the target model. By exploiting the whole training process of the target model, the
proposed method improves the attack performance in scenarios where the member and non-member
samples yield similar losses.
• The authors have evaluated the proposed method on different datasets and model architectures,
demonstrating its effectiveness in a variety of scenarios. Moreover, the authors have explored the
transferability of the attack and shown that it is effective against different architectures.
• The authors have provided detailed experimental results, including a comparison with five advanced
baseline methods and a state-of-the-art method. The results show that the proposed method outperforms
existing methods in the black-box scenario and is still effective in the label-only scenario.
Weaknesses:
• The proposed method relies on knowledge distillation, which can be computationally expensive. This
may limit the practicality of the method in some scenarios.
• The paper assumes that the adversary has access to an auxiliary dataset that comes from the same
distribution as the target model's training set. This assumption may not always hold in practical scenarios.
Possible Extensions:
• Multi-target models: In this paper, the authors assume that the attacker has only one target model.
However, in practical scenarios, an adversary may have multiple target models or access to an ensemble
of models. It would be interesting to explore how the proposed method can be extended to such scenarios.
• Other loss functions: This paper focuses on exploiting the loss trajectory of a model trained with cross-
entropy loss. However, there are other loss functions used in deep learning, such as mean squared error
and hinge loss. It would be interesting to investigate how the proposed method can be applied to models
trained with different loss functions.
• Robustness against defense mechanisms: The proposed attack method is shown to be effective against
models without any defense mechanisms. However, it would be interesting to evaluate its performance
against models that use various defense mechanisms such as differential privacy, adversarial training,
and watermarking.
• Exploring other sources of information: This paper proposes a method that exploits the loss trajectory of
a model to perform membership inference attacks. However, there may be other sources of information
that can be exploited, such as model activations or gradients. It would be interesting to explore how such
sources of information can be leveraged to perform membership inference attacks.
• Practical implications: While the proposed attack method is effective in the black-box setting, it would
be interesting to study its practical implications in real-world scenarios. For example, how realistic are
the assumptions made in this paper, such as access to an auxiliary dataset that comes from the same
distribution as the target model's training set? How can we mitigate the risk of membership inference
attacks in practice?