Arc2Face: A Foundation Model of Human Faces
1 Introduction
Learning statistical priors from data to generate new facial images is a highly
popular problem at the intersection of Computer Vision and Machine Learning.
Arguably, the most well-known early method was the so-called Eigenfaces [79],
which applies the Karhunen-Loeve expansion, also known as Principal Compo-
nent Analysis (PCA), to a set of facial images. The success of Eigenfaces with
well-aligned frontal facial images captured under controlled conditions spurred
a significant line of research into linear and nonlinear component analysis over
25 years ago. Probably the pinnacle of PCA applications was in 3D Morphable
Models (3DMMs) [6], where it was used to learn a prior for facial textures rep-
resented in a UV map. However, when directly applied to 2D images, PCA had
Fig. 1: Given the ID-embedding from [14], Arc2Face can generate high-quality images
of any subject with compelling similarity. Using popular extensions, such as Control-
Net [96], we can explicitly control facial attributes such as the pose or expression.
many shortcomings, including the requirement for images to be perfectly pixel-
aligned. Furthermore, it could not describe hair (facial or not) and also struggled
with non-linearities introduced by facial expressions and lighting conditions.
The recent evolution of deep neural networks (DNNs) has made it feasible
to overcome many of the limitations inherent in simple linear statistical models,
such as PCA, enabling the learning of statistical priors from large-scale facial
data captured under varied recording conditions. Arguably, the most success-
ful method, based on Generative Adversarial Networks (GANs) [25], is Style-
GAN [38] and its variants [37, 39]. StyleGAN learns a decoder governed by low-
dimensional latent codes, which can be sampled to generate images. Moreover,
StyleGAN can be used to fit images to its distribution (i.e., find the best latent code that reproduces a specific image) [85], proving highly useful as a prior for
many tasks, including 3D face reconstruction for facial texture generation [24].
Additionally, several studies have discussed how StyleGAN’s latent spaces can
be utilized for controllable manipulations (i.e., changing the expressions or the
pose of facial images) [27, 73, 102]. However, to date, identity drift remains a
challenge in latent space manipulations using StyleGAN.
Face recognition (FR) was one of the first problems to find a robust solution with the advent of deep learning; it is generally approached by learning features that represent facial identities (also known as ID-embeddings). These features are powerful enough to be used for face verification on smartphones and have given rise to a vast research field. Arguably, the most widely used ID-embeddings are
those produced by the ArcFace [14] loss. ArcFace features have been utilized as
perceptual features for 3D face reconstruction and generation. However, robustly
controlling the identity within the StyleGAN framework remains challenging,
mainly due to the necessity of large high-resolution datasets with significant
intra-class variability.
In the past couple of years, another revolution concerning generative methods
has occurred in deep learning. The emergence of the so-called Diffusion Mod-
els [74] has demonstrated not only the possibility of modeling the distribution
of images and sampling from it [17, 29, 76], but also the feasibility of guiding the
generation process with features that correlate images with textual descriptions,
such as CLIP [59, 62]. Recently, it has been shown that other features, e.g., ID-
embeddings, can be incorporated to steer the generation process for faces, to
not only adhere to textual descriptions but also to the facial characteristics of a
person’s image [11, 56, 80, 83, 90].
A challenge arises when combining CLIP features with ID-embeddings, because CLIP features contain ID-related information (otherwise, generating the image of a famous celebrity from a textual prompt would not be possible) and may also contradict the ID-embeddings (e.g., when requesting a photo of a “Viking” with the ID-embeddings of a Chinese person). Hence, although these models are useful for entertainment purposes and demonstrate the power of diffusion models, they are suitable neither for the controlled generation of facial images nor for sampling conditioned on the ID features of a specific subject.
In this paper, we meticulously study the problem of high-resolution facial
image synthesis conditioned on ID-embeddings and propose a large-scale foun-
dation model. Developing such a model poses a significant challenge due to the
limited availability of high-quality facial image databases. Specifically:
• We show that smaller, single-image-per-person databases [38] are insufficient
to train a robust foundation model; thus, we introduce a large dataset of high-
resolution facial images with consistent identity and intra-class variability
derived from WebFace42M.
• We adapt a large facial identity encoder trained solely on ID-embeddings,
in contrast to text-based models where ID interferes with language.
• We introduce the first ID-conditioned face foundation model, which we make
available to the public.
• We propose a robust benchmark for evaluating ID-conditioned models. Our
experiments demonstrate superior performance compared to existing ap-
proaches.
2 Related Work
2.1 Generative Models for the Human Face
The task of facial image generation has witnessed great success in recent years,
with the advent of style-based GANs [37–39,51], even for view-consistent synthe-
sis [3,8,9,16,26,53,70]. Despite early attempts to condition GANs on modalities
such as text [33, 34, 67], the recent Diffusion Models [17, 29, 76, 88] have shown
tremendous progress. Trained on large-scale datasets, several prominent models,
including DALLE-2 [60], Imagen [66], and Stable Diffusion [62], have emerged, marking a breakthrough in creative image generation. Later models have enabled an even higher level of detail [57]. Nevertheless, dataset memorization remains an issue in such models [75]. Moreover, recent methods have even achieved 3D human generation using general [42, 55] and text-guided diffusion models [32, 95].
In terms of controlling the generation output, adding images alongside text
has been proposed. For example, ILVR [12] conditions sampling on a target
image through iterative refinement, while SDEdit [50] adds noise to the input
before editing it according to the target description. DiffusionRig [18] enables ex-
plicit face editing, using a crude 3DMM [20] rendering to condition generation on
pose, illumination, and expression. Perhaps the most notable approach, Control-
Net [96], first achieved accurate spatial control of text-to-image models, with the
addition of trainable copies of network parts. In a different direction, universal
guidance was introduced [5] to avoid re-training the model during conditioning.
While the combined use of text and spatial control can enable manipulation of
an input photo to some extent, it cannot accurately reproduce a specific identity under arbitrary conditions, a limitation addressed by subject-driven generative models.
3 Method
3.1 Preliminaries
Latent Diffusion Models: Diffusion models [17, 29, 76] employ a denoising
mechanism to approximate the distribution of real images x. During training,
images undergo distortion through the addition of Gaussian noise via a predetermined diffusion schedule at different timesteps t, given by xt = √ᾱt x0 + √(1 − ᾱt) ϵ. At the same time, a denoising autoencoder ϵθ(xt, t) is trained to recover the normally distributed noise ϵ ∼ N(0, I) by minimizing the prediction error:
\mathcal{L} = \mathbb{E}_{x_t, t, \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})}\left[ \| \epsilon - \epsilon_{\theta}(x_t, t) \|_2^2 \right] \qquad (1)
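For concreteness, the forward noising and the objective of Eq. (1) can be written as a short PyTorch sketch; the linear β-schedule and the eps_model noise predictor below are illustrative placeholders, not the exact configuration of any specific model.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # predetermined diffusion schedule (illustrative)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, i.e. \bar{alpha}_t

def diffusion_loss(eps_model, x0):
    """One training step of the noise-prediction objective in Eq. (1)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # random timesteps
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    a_bar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)             # || eps - eps_theta(x_t, t) ||_2^2
```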
The Latent Diffusion (LD) model [62] is a widely adopted architecture within the
diffusion framework. Notably, a Variational Autoencoder (VAE), E, is employed
to compress images into a lower-dimensional latent space z = E(x) for efficient
training. Moreover, LD introduces a universal conditioning mechanism, using a
UNet [63] as the backbone for ϵθ , with cross-attention layers to map auxiliary fea-
tures C to its intermediate layers for conditional noise prediction ϵθ (zt , t, C). Par-
ticularly, Stable Diffusion (SD) represents a text-conditioned LD model, which
uses text embeddings C, produced by a CLIP encoder [59], to enable stochastic
text-to-image (T2I) generation. Trained on the LAION-5B database [68], SD
stands as a foundation open-source T2I model which is commonly utilized by
the research community as a powerful image prior for downstream tasks.
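As an illustration of this architecture, the sketch below loads the SD 1.5 components with the diffusers library and performs one conditional noise prediction ϵθ(zt, t, C); it is a minimal usage example of the public checkpoint referenced in the implementation details, not the training setup of this paper.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"  # public SD 1.5 checkpoint
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# Conditioning C: CLIP token features of a text prompt.
tokens = tokenizer(["photo of a person"], padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
C = text_encoder(tokens.input_ids)[0]                  # shape (1, 77, 768)

# Conditional noise prediction eps_theta(z_t, t, C) in the VAE latent space:
# the UNet cross-attends to C at its intermediate layers.
z_t = torch.randn(1, unet.config.in_channels, 64, 64)  # noisy 64x64 latent for a 512x512 image
t = torch.tensor([500])
eps_pred = unet(z_t, t, encoder_hidden_states=C).sample
```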
Expanding upon this finding (cf. Fig. 3), we then develop Arc2Face, which enables the
synthesis of photo-realistic images at higher resolutions. This model is derived by
fine-tuning SD on a restored version of WebFace42M [103], along with FFHQ [38]
and CelebA-HQ [36] for enhanced quality, as detailed below.
(a) Synth-500 (b) AgeDB
Fig. 3: ArcFace [14] similarity distributions between input and generated faces from LD
models trained on the ID-to-image task. We use two different datasets of input IDs for
evaluation (500 and 400 IDs respectively) and generate 5 images per ID. We compare
models trained on three datasets: FFHQ, WebFace42M-10%, and WebFace42M.
3.3 Arc2Face
training, we consistently employ this default pseudo-prompt for all images. This
intentional choice directs the encoder’s attention solely to the ID vector, disre-
garding any irrelevant contextual information. Consequently, through extensive
fine-tuning, we effectively transform the text encoder into a face encoder specif-
ically tailored for projecting ArcFace embeddings into the CLIP latent space.
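The sketch below illustrates this conditioning mechanism under our reading of the method: a fixed pseudo-prompt with a placeholder token whose embedding is overwritten by the (projected) ArcFace vector before passing through the fine-tuned text encoder. The prompt string, the placeholder token, and the linear projection are illustrative assumptions (the excerpt above does not spell out the exact prompt or injection details), and the snippet shows inference-time conditioning only.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Hypothetical pseudo-prompt with a placeholder token for the identity.
PROMPT = "photo of a <id> person"
clip_name = "openai/clip-vit-large-patch14"  # backbone of the SD 1.5 text encoder
tokenizer = CLIPTokenizer.from_pretrained(clip_name)
text_encoder = CLIPTextModel.from_pretrained(clip_name)
tokenizer.add_tokens(["<id>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# Illustrative projection of the 512-d ArcFace vector to the CLIP token dimension.
proj = torch.nn.Linear(512, text_encoder.config.hidden_size)

def encode_id(arcface_vec):
    """Project an ArcFace embedding into the CLIP latent space via the fixed pseudo-prompt."""
    placeholder = tokenizer.convert_tokens_to_ids("<id>")
    with torch.no_grad():
        # overwrite the placeholder's token embedding with the projected ID vector
        text_encoder.get_input_embeddings().weight[placeholder] = proj(arcface_vec)
    ids = tokenizer(PROMPT, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    return text_encoder(ids).last_hidden_state  # ID-conditioned features C for the UNet
```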
Dataset: We rely on WebFace42M [103] for its vast size and intra-class variability; however, it suffers from low-resolution data and a tight facial crop. Moreover, the pre-trained SD backbone is designed for a resolution of 512 × 512, and a similar resolution is required to fine-tune Arc2Face. To alleviate this, we meticulously upsample its images using GFPGAN (v1.4), a state-of-the-art blind face restoration network [84], denoted as γ. We perform degradation removal and 4× upscaling to 448 × 448, so that x̂ = γ(x). GFPGAN is a GAN-based upsampling method with a strong face prior, trained with both adversarial and ID-preserving losses, which achieves crisp and faithful restoration. We apply this process to a large portion of the original database, given our computational limits, acquiring approximately 21M images of 1M identities at 448 × 448 pixels. Using the restored images x̂, we train Arc2Face. Albeit of higher quality, this dataset is still limited to the tightly cropped facial area used for FR training. While it allows the model to learn a robust ID prior, a complete face image is usually preferred. Thus, we further fine-tune the model on FFHQ [38] and CelebA-HQ [36], which consist of less constrained face images. Our final model generates FFHQ-aligned images at 512 × 512 pixels.
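A possible implementation of the restoration step γ with the public GFPGAN package is sketched below; the checkpoint path, the face-detection options, and the final resize are our own assumptions, since the exact preprocessing pipeline is not fully specified here.

```python
import cv2
from gfpgan import GFPGANer

# gamma: blind face restoration with GFPGAN v1.4 (checkpoint path is a placeholder).
restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=4,
                    arch="clean", channel_multiplier=2, bg_upsampler=None)

def restore(path):
    """x_hat = gamma(x): remove degradations from a WebFace42M crop and upscale 4x."""
    x = cv2.imread(path)  # 112x112 BGR face crop
    _, _, x_hat = restorer.enhance(x, has_aligned=False,
                                   only_center_face=True, paste_back=True)
    # resize to the 448x448 training resolution (exact post-processing is an assumption)
    return cv2.resize(x_hat, (448, 448), interpolation=cv2.INTER_AREA)
```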
4 Experiments
4.1 ID-consistent Generation
We perform extensive quantitative comparisons to evaluate the performance of
recent ID-conditioned models in generating both diverse and faithful images of a
subject. Details about the methods and comparison process are provided below.
(a) Synth-500 (b) AgeDB
Fig. 4: Distribution of ArcFace similarity between input IDs, synthetic (a) or real (b), and generated images of them by different models. As all non-CLIP-based methods use [14] for conditioning, we evaluate them with [14]. For an evaluation with a different network, please refer to the Supp. Material, where similar observations can be made.
Fig. 5: Percentage of user votes received by our method and InstantID re. ID fidelity.
Prompt Selection Since the aforementioned methods are tailored for text-
driven synthesis, their ability to generate consistent IDs depends heavily on the
input prompt. Although this allows creative stylizations, it demands prompt
engineering and user inspection. To eliminate subjective bias, we suggest an
automatic evaluation using a simple prompt, “photo of a person”, for all samples.
This enables us to assess ID-retention performance based on face features without
elaborate text descriptions, ensuring a fair comparison across all methods.
Datasets and Metrics We construct two datasets for our comparison. The
first, hereafter denoted as Synth-500, includes 500 images of never-before-seen
identities, generated by [1]. The second is a selection of 400 real images from the
public AgeDB [52] database, chosen based on higher resolution. For each image,
we generate 5 samples using all methods and calculate the following metrics: 1)
ID similarity: we compute the cosine similarity of ID features between the input
and generated faces. The corresponding distributions are plotted in Fig. 4. 2)
LPIPS distance: similarly to [45], we assess the diversity of generation by calculating the pairwise perceptual distance (LPIPS) [97] between images of the same ID. The average distance over all pairs is reported in Tab. 1. To focus on the actual facial diversity, we first detect and align all faces and remove the background using an off-the-shelf parser [2]. 3) Exp./Pose diversity: we predict the expression and pose (jaw/neck articulation) parameters of the FLAME model [44] for each sample using EMOCA v2 [13, 21] and compute the average pairwise ℓ2 expression and pose distances between images of the same ID. 4) FID: we use the FID [28, 71] metric to assess the quality of samples (after face detection and alignment) with respect to the input images. Results are presented in Tab. 1.
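The first two metrics can be computed roughly as follows; the sketch assumes pre-extracted ArcFace features and aligned, background-removed face crops, and uses the lpips package for the perceptual distance.

```python
import itertools
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def id_similarity(emb_in, emb_gen):
    """1) Cosine similarity between ArcFace features of input and generated faces."""
    return torch.nn.functional.cosine_similarity(emb_in, emb_gen, dim=-1)

def pairwise_lpips(faces):
    """2) Mean pairwise LPIPS over the samples of one ID.
    `faces`: aligned, background-removed crops in [-1, 1], shape (N, 3, H, W)."""
    dists = [lpips_fn(faces[i:i + 1], faces[j:j + 1]).item()
             for i, j in itertools.combinations(range(len(faces)), 2)]
    return sum(dists) / len(dists)
```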
As depicted in Fig. 4, existing methods, particularly [45, 86] that rely on
CLIP features, struggle to preserve the ID without detailed text descriptions
of the subject. InstantID [83] attains the highest similarity among them, yet remains behind Arc2Face.
Table 1: Quantitative comparison between Arc2Face and [45, 83, 86, 90] on 500 syn-
thetic and 400 real IDs. We produce 5 samples per ID for all methods and assess the
diversity of generated faces using perceptual and 3DMM-based distances, as well as
their quality based on FID. Bold values denote the best results in each metric.
However, this results in limited pose/exp. diversity as it further uses the in-
put facial landmarks to constrain generation. Arc2Face does not require any
text or spatial conditions and achieves the highest facial similarity as well as
diversity and realism across both datasets. This particularly highlights the effi-
ciency of ID-embeddings in the context of face generation against CLIP image
or text features. A visual comparison with the above methods is provided in
Fig. 6, whereas additional qualitative results and visualizations are provided in
the Supp. Material. We further conducted a user study to compare with the
second best-performing method in terms of ID similarity, i.e. InstantID [83]. In
particular, we asked 50 users to choose the method whose result best resembles
the input face, regardless of quality or realism, for a randomly selected set of 30
IDs from our datasets. The two methods were randomly presented side-by-side.
Fig 5 confirms a strong preference for Arc2Face in terms of ID resemblance.
Fig. 6: Visual comparison of Arc2Face with state-of-the-art methods [45, 83, 86, 90]
using the abstract prompt “photo of a person” to focus on their ID-conditioning ability.
Methods Venue # images (# IDs × # imgs/ID) LFW CFP-FP CPLFW AgeDB CALFW Avg
SynFace ICCV21 0.5M (10K × 50) 91.93 75.03 70.43 61.63 74.73 74.75
DigiFace WACV23 0.5M (10K × 50) 95.4 87.4 78.87 76.97 78.62 83.45
DCFace CVPR23 0.5M (10K × 50) 98.55 85.33 82.62 89.70 91.60 89.56
Arc2Face - 0.5M (10K × 50) 98.81 91.87 85.16 90.18 92.63 91.73
DigiFace WACV23 1.2M (10K × 72 + 100K × 5) 96.17 89.81 82.23 81.10 82.55 86.37
DCFace CVPR23 1.2M (20K × 50 + 40K × 5) 98.58 88.61 85.07 90.97 92.82 91.21
Arc2Face - 1.2M (20K × 50 + 40K × 5) 98.92 94.58 86.45 92.45 93.33 93.14
CASIA-WebFace (Real) 0.49M (approx. 10.5K × 47) 99.42 96.56 89.73 94.08 93.32 94.62
Our model can be trivially combined with ControlNet [96] for spatial control of
the output. In particular, we use EMOCA v2 [13, 21] to perform 3D reconstruc-
tion on FFHQ [38], and train a ControlNet module, conditioned on the rendered
face normals. During inference, we can render the 3D face normals of a source
person under the expression and pose extracted from a reference image. This
rendering is used to guide the synthesis of the source identity as shown in Fig. 7.
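A minimal sketch of this setup with the diffusers ControlNet pipeline is given below; the checkpoint paths, the placeholder ID features and normals image, and the zeroed negative embeddings are illustrative assumptions rather than the released implementation.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholders for the fine-tuned Arc2Face weights and the normals-conditioned ControlNet.
controlnet = ControlNetModel.from_pretrained("path/to/arc2face-controlnet-normals")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/arc2face", controlnet=controlnet, safety_checker=None)

id_features = torch.randn(1, 77, 768)     # placeholder for the ID-conditioned features C of the source
normals = Image.open("face_normals.png")  # placeholder: rendered face normals of the reference pose/expression

images = pipe(prompt_embeds=id_features,
              negative_prompt_embeds=torch.zeros_like(id_features),  # illustrative unconditional input
              image=normals, num_inference_steps=25, guidance_scale=3.0).images
```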
Fig. 7: Samples from Arc2Face, conditioned on a 3DMM [13,44] using ControlNet [96].
[Figure panels: identity similarity distributions for Arc2Face and Arc2Face w/ MLP on (a) Synth-500 and (b) AgeDB; (a) cumulative explained variance over the number of components (up to 512) and (b) PCA projections.]
Fig. 10: ID interpolations between pairs of subjects. Arc2Face generates plausible faces
along the trajectory connecting their ArcFace vectors.
Fig. 11: We show that Arc2Face does not replicate its training data. We generate
images for 500 unseen IDs from Synth-500 and retrieve the train image with the highest
ID similarity for each one. The average similarity between each output and its closest
match is 0.37, whereas the similarity between the output and input features is 0.74.
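The nearest-match retrieval behind this analysis amounts to a cosine-similarity search over training-set ArcFace features; a sketch is given below, assuming the embeddings have been pre-extracted and L2-normalized.

```python
import numpy as np

def closest_train_match(gen_embs, train_embs):
    """For each generated face, retrieve the training image with the highest ArcFace similarity.
    Both arrays are assumed L2-normalized: gen_embs (N_gen, 512), train_embs (N_train, 512)."""
    sims = gen_embs @ train_embs.T                   # cosine similarities
    idx = sims.argmax(axis=1)                        # closest training sample per generated image
    return idx, sims[np.arange(len(gen_embs)), idx]  # indices and maximum similarities
```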
We start from stable-diffusion-v1-5 and fine-tune both the UNet and CLIP
encoder with AdamW [46] and a learning rate of 1e-6, using 8 NVIDIA A100
GPUs and a batch size of 4 per GPU. First, we train on 21M restored images
from WebFace42M for 5 epochs with a resolution of 448×448, and then fine-tune
on 512 × 512-sized FFHQ and CelebA-HQ images for another 15 epochs. All
our results shown in this paper are generated using DPM-Solver [47, 48] with
25 inference steps and a classifier-free guidance scale of 3, which we empirically
found to produce highly realistic images. For ID-embedding extraction, we use a
frozen IR-100 ArcFace [14] trained on WebFace42M and normalize embeddings
to unit magnitude. Regarding the methods we compare against, we use the
official implementations and, in all cases, select the default hyper-parameters.
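A rough sketch of this inference configuration follows; the Arc2Face checkpoint path and the insightface model pack are placeholders (the paper uses a frozen IR-100 ArcFace trained on WebFace42M), and the mapping from the ID vector to prompt embeddings is only indicated by a placeholder tensor (see Sec. 3.3).

```python
import cv2
import numpy as np
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline
from insightface.app import FaceAnalysis

# Sampler settings from above (checkpoint path is a placeholder).
pipe = StableDiffusionPipeline.from_pretrained("path/to/arc2face", safety_checker=None)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# ID-embedding extraction and normalization to unit magnitude ('antelopev2' pack is illustrative).
app = FaceAnalysis(name="antelopev2")
app.prepare(ctx_id=0, det_size=(640, 640))
face = app.get(cv2.imread("input.jpg"))[0]
id_vec = face.embedding / np.linalg.norm(face.embedding)

# id_vec is mapped to prompt embeddings as in Sec. 3.3, then sampled with DPM-Solver.
id_features = torch.randn(1, 77, 768)  # placeholder for the ID-conditioned features of id_vec
images = pipe(prompt_embeds=id_features,
              negative_prompt_embeds=torch.zeros_like(id_features),
              num_inference_steps=25, guidance_scale=3.0).images
```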
6 Conclusion
References
1. https://thispersondoesnotexist.com/
2. https://github.com/zllrunning/face-parsing.PyTorch
3. An, S., Xu, H., Shi, Y., Song, G., Ogras, U.Y., Luo, L.: Panohead: Geometry-aware
3d full-head synthesis in 360°. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 20950–20959 (2023)
4. Bae, G., de La Gorce, M., Baltrusaitis, T., Hewitt, C., Chen, D., Valentin, J.,
Cipolla, R., Shen, J.: Digiface-1m: 1 million digital face images for face recogni-
tion. In: WACV (2023)
5. Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping,
J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops. pp. 843–852 (June 2023)
6. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Pro-
ceedings of the 26th Annual Conference on Computer Graphics and Interac-
tive Techniques. p. 187–194. SIGGRAPH ’99, ACM Press/Addison-Wesley Pub-
lishing Co., USA (1999). https://doi.org/10.1145/311535.311556
7. Boutros, F., Damer, N., Kirchbuchner, F., Kuijper, A.: Elasticface: Elastic margin
loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. pp. 1578–1587 (2022)
8. Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo,
O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d
generative adversarial networks. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. pp. 16123–16133 (2022)
9. Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-gan: Peri-
odic implicit generative adversarial networks for 3d-aware image synthesis. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recog-
nition. pp. 5799–5809 (2021)
10. Chen, L., Zhao, M., Liu, Y., Ding, M., Song, Y., Wang, S., Wang, X., Yang, H.,
Liu, J., Du, K., et al.: Photoverse: Tuning-free image customization with text-to-
image diffusion models (2023)
11. Chen, Z., Fang, S., Liu, W., He, Q., Huang, M., Zhang, Y., Mao, Z.: Dreamiden-
tity: Improved editability for efficient face-identity preserved image generation.
arXiv preprint arXiv:2307.00300 (2023)
12. Choi, J., Kim, S., Jeong, Y., Gwon, Y., Yoon, S.: Ilvr: Conditioning method
for denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 14367–14376 (2021)
13. Daněček, R., Black, M.J., Bolkart, T.: Emoca: Emotion driven monocular face
capture and animation. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 20311–20322 (2022)
14. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for
deep face recognition. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 4690–4699 (2019)
15. Deng, J., Guo, J., Yang, J., Lattas, A., Zafeiriou, S.: Variational prototype learn-
ing for deep face recognition. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 11906–11915 (2021)
16. Deng, Y., Yang, J., Xiang, J., Tong, X.: Gram: Generative radiance manifolds
for 3d-aware image generation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 10673–10683 (2022)
17. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances
in neural information processing systems 34, 8780–8794 (2021)
18. Ding, Z., Zhang, X., Xia, Z., Jebe, L., Tu, Z., Zhang, X.: Diffusionrig: Learn-
ing personalized priors for facial appearance editing. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12736–
12746 (2023)
19. Duong, C.N., Truong, T.D., Luu, K., Quach, K.G., Bui, H., Roy, K.: Vec2face:
Unveil human faces from their blackbox features in face recognition. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 6132–6141 (2020)
20. Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d
face model from in-the-wild images. ACM Transactions on Graphics (ToG) 40(4),
1–13 (2021)
21. Filntisis, P.P., Retsinas, G., Paraperas-Papantoniou, F., Katsamanis, A., Roussos,
A., Maragos, P.: Visual speech-aware perceptual 3d facial expression reconstruc-
tion from videos. arXiv preprint arXiv:2207.11094 (2022)
22. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G.,
Cohen-Or, D.: An image is worth one word: Personalizing text-to-image genera-
tion using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
23. Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.:
Encoder-based domain tuning for fast personalization of text-to-image models.
ACM Transactions on Graphics (TOG) 42(4), 1–13 (2023)
24. Gecer, B., Ploumpis, S., Kotsia, I., Zafeiriou, S.: Ganfit: Generative adversarial
network fitting for high fidelity 3d face reconstruction. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June
2019)
25. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural
information processing systems 27 (2014)
26. Gu, J., Liu, L., Wang, P., Theobalt, C.: Stylenerf: A style-based 3d aware genera-
tor for high-resolution image synthesis. In: International Conference on Learning
Representations (2022)
27. Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: Ganspace: Discovering in-
terpretable gan controls. Advances in neural information processing systems 33,
9841–9850 (2020)
28. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans
trained by a two time-scale update rule converge to a local nash equilibrium.
Advances in neural information processing systems 30 (2017)
29. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances
in Neural Information Processing Systems. vol. 33, pp. 6840–6851 (2020)
30. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L.,
Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685 (2021)
31. Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled Faces in the Wild:
A database for studying face recognition in unconstrained environments. In: Tech.
Report (2008)
32. Huang, X., Shao, R., Zhang, Q., Zhang, H., Feng, Y., Liu, Y., Wang, Q.: Human-
norm: Learning normal diffusion model for high-quality and realistic 3d human
generation. arXiv preprint arXiv:2310.01406 (2023)
33. Kang, M., Shin, J., Park, J.: Studiogan: a taxonomy and benchmark of gans for
image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2023)
34. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.:
Scaling up gans for text-to-image synthesis. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 10124–10134 (2023)
35. Kansy, M., Raël, A., Mignone, G., Naruniec, J., Schroers, C., Gross, M., Weber,
R.M.: Controllable inversion of black-box face recognition models via diffusion.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
(ICCV) Workshops. pp. 3167–3177 (2023)
36. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im-
proved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
37. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila,
T.: Alias-free generative adversarial networks. Advances in Neural Information
Processing Systems 34, 852–863 (2021)
38. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 4401–4410 (2019)
39. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 8110–8119 (2020)
40. Kim, M., Jain, A.K., Liu, X.: Adaface: Quality adaptive margin for face recogni-
tion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. pp. 18750–18759 (2022)
41. Kim, M., Liu, F., Jain, A., Liu, X.: Dcface: Synthetic face generation with dual
condition diffusion model. In: CVPR (2023)
42. Kirschstein, T., Giebenhain, S., Nießner, M.: Diffusionavatars: Deferred diffusion
for high-fidelity 3d head avatars. arXiv preprint arXiv:2311.18635 (2023)
43. Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept cus-
tomization of text-to-image diffusion. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition. pp. 1931–1941 (2023)
44. Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial
shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIG-
GRAPH Asia) 36(6), 194:1–194:17 (2017). https://doi.org/10.1145/3130800.3130813
45. Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M.M., Shan, Y.: Photomaker:
Customizing realistic human photos via stacked id embedding. arXiv preprint
arXiv:2312.04461 (2023)
46. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017)
47. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver
for diffusion probabilistic model sampling in around 10 steps. Advances in Neural
Information Processing Systems 35, 5775–5787 (2022)
48. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver++: Fast
solver for guided sampling of diffusion probabilistic models. arXiv preprint
arXiv:2211.01095 (2022)
49. Mai, G., Cao, K., Yuen, P.C., Jain, A.K.: On the reconstruction of face images
from deep face templates. IEEE transactions on pattern analysis and machine
intelligence 41(5), 1188–1202 (2018)
50. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit:
Guided image synthesis and editing with stochastic differential equations. In:
International Conference on Learning Representations (2022)
51. Mensah, D., Kim, N.H., Aittala, M., Laine, S., Lehtinen, J.: A hybrid generator
architecture for controllable face synthesis. In: ACM SIGGRAPH 2023 Conference
Proceedings. pp. 1–10 (2023)
52. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.:
Agedb: the first manually collected, in-the-wild age database. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition Workshop
(2017)
53. Niemeyer, M., Geiger, A.: Giraffe: Representing scenes as compositional genera-
tive neural feature fields. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 11453–11464 (2021)
54. Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-g: Gener-
ating images in context with multimodal large language models. arXiv preprint
arXiv:2310.02992 (2023)
55. Papantoniou, F.P., Lattas, A., Moschoglou, S., Zafeiriou, S.: Relightify: Re-
lightable 3d faces from a single image via diffusion models. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8806–
8817 (October 2023)
56. Peng, X., Zhu, J., Jiang, B., Tai, Y., Luo, D., Zhang, J., Lin, W., Jin, T., Wang,
C., Ji, R.: Portraitbooth: A versatile portrait model for fast identity-preserved
personalization. arXiv preprint arXiv:2312.06354 (2023)
57. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image
synthesis. arXiv preprint arXiv:2307.01952 (2023)
58. Qiu, H., Yu, B., Gong, D., Li, Z., Liu, W., Tao, D.: SynFace: Face recognition
with synthetic data. In: ICCV (2021)
59. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning.
pp. 8748–8763. PMLR (2021)
60. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
1(2), 3 (2022)
61. Razzhigaev, A., Kireev, K., Kaziakhmedov, E., Tursynbek, N., Petiushko, A.:
Black-box face recovery from identity features. In: Computer Vision–ECCV 2020
Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. pp. 462–
475. Springer (2020)
62. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
63. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 234–241. Springer (2015)
79. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of cognitive neuro-
science 3(1), 71–86 (1991)
80. Valevski, D., Lumen, D., Matias, Y., Leviathan, Y.: Face0: Instantaneously con-
ditioning a text-to-image model on a face. In: SIGGRAPH Asia 2023 Conference
Papers. pp. 1–10 (2023)
81. Vendrow, E., Vendrow, J.: Realistic face reconstruction from deep embeddings.
In: NeurIPS 2021 Workshop Privacy in Machine Learning (2021). https://openreview.net/forum?id=-WsBmzWwPee
82. Wang, Q., Jia, X., Li, X., Li, T., Ma, L., Zhuge, Y., Lu, H.: Stableidentity: In-
serting anybody into anywhere at first sight. arXiv preprint arXiv:2401.15975
(2024)
83. Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A.: Instantid: Zero-shot identity-
preserving generation in seconds. arXiv preprint arXiv:2401.07519 (2024)
84. Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration
with generative facial prior. In: The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2021)
85. Xia, W., Zhang, Y., Yang, Y., Xue, J.H., Zhou, B., Yang, M.H.: Gan inversion: A
survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3),
3121–3138 (2022)
86. Xiao, G., Yin, T., Freeman, W.T., Durand, F., Han, S.: Fastcomposer: Tuning-
free multi-subject image generation with localized attention. arXiv preprint
arXiv:2305.10431 (2023)
87. Yan, Y., Zhang, C., Wang, R., Zhou, Y., Zhang, G., Cheng, P., Yu, G., Fu, B.:
Facestudio: Put your face everywhere in seconds. arXiv preprint arXiv:2312.02663
(2023)
88. Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B.,
Yang, M.H.: Diffusion models: A comprehensive survey of methods and applica-
tions. ACM Computing Surveys 56(4), 1–39 (2023)
89. Yang, Z., Zhang, J., Chang, E.C., Liang, Z.: Neural network inversion in adver-
sarial setting via background knowledge alignment. In: Proceedings of the 2019
ACM SIGSAC Conference on Computer and Communications Security. pp. 225–
240 (2019)
90. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati-
ble image prompt adapter for text-to-image diffusion models. arXiv preprint
arXiv:2308.06721 (2023)
91. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image
prompt adapter for text-to-image diffusion models. GitHub repository: https://github.com/tencent-ailab/IP-Adapter (2024)
92. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv
preprint arXiv:1411.7923 (2014)
93. Yuan, G., Cun, X., Zhang, Y., Li, M., Qi, C., Wang, X., Shan, Y., Zheng,
H.: Inserting anybody in diffusion models via celeb basis. arXiv preprint
arXiv:2306.00926 (2023)
94. Yucer, S., Tektas, F., Al Moubayed, N., Breckon, T.P.: Measuring hidden bias
within face recognition via racial phenotypes. In: Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision. pp. 995–1004 (2022)
95. Zhang, L., Qiu, Q., Lin, H., Zhang, Q., Shi, C., Yang, W., Shi, Y., Yang, S., Xu,
L., Yu, J.: Dreamface: Progressive generation of animatable 3d faces under text
guidance. arXiv preprint arXiv:2304.03117 (2023)
96. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image
diffusion models. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 3836–3847 (2023)
97. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. pp. 586–595 (2018)
98. Zheng, T., Deng, W.: Cross-Pose LFW: A database for studying cross-pose face
recognition in unconstrained environments. Tech. Report (2018)
99. Zheng, T., Deng, W., Hu, J.: Cross-Age LFW: A database for studying cross-age
face recognition in unconstrained environments. Tech. Report (2017)
100. Zheng, Y., Yang, H., Zhang, T., Bao, J., Chen, D., Huang, Y., Yuan, L., Chen,
D., Zeng, M., Wen, F.: General facial representation learning in a visual-linguistic
manner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 18697–18709 (2022)
101. Zhou, Y., Zhang, R., Sun, T., Xu, J.: Enhancing detail preservation for cus-
tomized text-to-image generation: A regularization-free approach. arXiv preprint
arXiv:2305.13579 (2023)
102. Zhu, J., Feng, R., Shen, Y., Zhao, D., Zha, Z.J., Zhou, J., Chen, Q.: Low-rank
subspaces in gans. Advances in Neural Information Processing Systems 34, 16648–
16658 (2021)
103. Zhu, Z., Huang, G., Deng, J., Ye, Y., Huang, J., Chen, X., Zhu, J., Yang, T., Lu,
J., Du, D., et al.: Webface260m: A benchmark unveiling the power of million-scale
deep face recognition. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 10492–10502 (2021)
Fig. 12: Visual comparison of ID-conditioned latent diffusion models trained from
scratch on FFHQ and WebFace42M. FFHQ samples are aligned to the WebFace42M
template (112 × 112 pixels) for simplicity.
Fig. 13: Effect of averaging ArcFace features from multiple input images. Increasing the number of ID vectors used (indicated at the bottom left of each result) leads to improved fidelity of samples.
Fig. 14: ID similarity [40] between 400 input IDs and generated images of them (5 per ID) by different models.
Fig. 15: Faces generated from Arc2Face and their closest train samples, determined
by ArcFace similarity (displayed at the bottom left). The generated faces correspond
to ID vectors from our synthetic dataset (see Sec. 4.1). For the purpose of comparison,
the generated faces are cropped to the training face template.
Fig. 16: Comparison of Arc2Face with [45, 83, 86, 90]. As described in the paper, we
use the abstract prompt “photo of a person” for the text-based methods to focus on
their ID-conditioning ability.
Fig. 17: Multiple samples produced by our model conditioned on the input ID (leftmost
column).
Fig. 18: Multiple samples produced by our model conditioned on the input ID (leftmost
column).
Fig. 19: Additional results from Arc2Face, conditioned on a 3DMM [13, 44] using
ControlNet [96].
Fig. 20: Additional results from Arc2Face, conditioned on a 3DMM [13, 44] using
ControlNet [96].