RESEARCH ARTICLE

Speech-Driven Facial Animation with Spectral Gathering and Temporal Attention
Yujin Chai et al.

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract  In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory and a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be better synthesized than using vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller but achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.

Keywords  speech-driven facial animation, spectral-dimensional bidirectional long short-term memory, temporal attention, deformation gradients

1 Introduction

[…] time, with nothing but their own face [1, 2].

To push the limit further, people naturally wondered whether an excerpt of vocal recording or a transcript would be sufficient to deduce a corresponding facial animation, as a driving face might be absent in certain scenarios. Such examples may include a virtual assistant whose utterances are synthesized on the fly. It turns out that some surprisingly faithful results can be achieved [3–7].

However, while face-driven virtual avatars (such as Apple's Animoji and Memoji) have already been entertaining worldwide users for a while, we still need to solve at least two challenges before speech-driven facial animation techniques can reach that same maturity: robust audio processing and effortless motion retargeting. Together they would allow an end-to-end system to generalize well to both unheard voices and unseen avatars—an ability vital to consumer-facing products.
1.1 Vocal audio processing
While a high-quality, pre-trained automatic speech recognition (ASR) module can be readily used to extract robust speech features [6, 11], it imposes a significant overhead on model complexity and runtime performance.

Our insight here is that mapping speech to facial animation is different enough from common speech recognition tasks that a feature extractor designed for the latter may not be the ideal choice for the former. We thus propose a new speech encoder that is tailored to our specific task. Unlike a fully-convolutional network [4], it employs a bidirectional long short-term memory (LSTM) [12] network along the spectral dimension, which can better capture long-distance correlations of formants in the mel spectrogram. We further introduce a temporal attention mechanism that allows the model to focus on the few audio frames that are influential to the facial motion but otherwise easy to miss.

The proposed encoder network compares favorably against the state-of-the-art [6] built on a pre-trained ASR module [13] in terms of accuracy and robustness, but is only a fraction of its size.

1.2 Motion representation

The other key challenge in speech-driven facial animation lies in the way facial motions are represented. Being able to effortlessly drive an arbitrary 3D face model, even one with a drastically different mesh topology from those seen and learned by the model, would be highly desirable. Two popular choices in this regard are low-dimensional expression coefficients (e.g., blendshapes) [5, 11, 14–17] and per-vertex offsets from a globally aligned expressionless template mesh [4, 6]. But a potential limitation shared by both representations is that the face models used for training and inference must have exactly the same underlying structure: same rig, same blendshape bases, or same mesh tessellation.

Inspired by the study of deformation transfer [18–20], we let our model's decoder network output deformation gradients as an intermediate representation of the target facial motion, from which we can reconstruct the facial mesh using either the original template or a new 3D face, possibly with a different topology.

1.3 Contributions

The key technical contributions of our approach include:

1. A lightweight, robust speech encoder designed specifically for the task of animating 3D face avatars from input vocal audio.

2. The use of deformation gradients as the motion representation, for better handling of non-rigidity and easier generalization to topologically different faces.

Based on these ideas, we present an end-to-end speech-driven facial animation algorithm (Fig. 1). It outperforms state-of-the-art methods [4, 6] in several challenging cases (e.g., those involving lip closures or lasting vowels). Without using any pre-trained ASR module, our model is compact and runs in real time with low latency. Once trained, it generalizes robustly to unheard voices and unseen face models. To assess the quality of speech animations generated by our algorithm, please watch the supplementary video1).

1) Also available at: https://chaiyujin.github.io/sdfa

2 Related works

We briefly review the prior art most pertinent to the task of generating facial animations from audio. For the broader topic of facial capture and dynamic manipulation, we refer the readers to the survey by Orvalho et al. [21].

Procedural methods map phonemes in the audio to visemes following certain predefined rules. One of the main challenges is how to realistically handle coarticulations [22–24]. Cohen and Massaro [25] propose dominance functions to evaluate the degree of a certain viseme in a given context. Xu et al. [26] use phone bigrams to handle coarticulation. But it is difficult to cover all possible coarticulation cases in real-world speech. Edwards et al. [3] propose a jaw-lip action model with an emphasis on artistic control.

Bregler et al. [27] propose an example-based method to rewrite video frames to match a new audio clip via automatic mouth tracking and image warping. Ezzat et al. [28] map the phonemes into clustered principal component analysis (PCA) coefficients that represent the shape and texture of the lower face. Taylor et al. [29] use an active appearance model (AAM) to model variations in the shape and texture of the lower face, and match variable-length phoneme substrings with similar appearances into dynamic visemes.

Brand [30] estimates a hidden Markov model (HMM) from facial landmarks in the video and synthesizes the most probable sequence through trajectory optimization. Xie and Liu [31] model the movements of articulators with dynamic Bayesian networks. Wang et al. [32] map mel-frequency cepstral coefficients (MFCCs) to PCA coefficients with an HMM, which is further extended by Zhang et al. [33] with a context-dependent deep neural network hidden Markov
model (CD-DNN-HMM) for more robust audio feature extraction.

Recurrent neural networks (RNNs) and their variants have been exploited by many [9, 14, 15, 17, 34–38] due to the sequential nature of audio and visual data. In order to take both past and future context into account, Fan et al. [35] adopt a bidirectional long short-term memory (BiLSTM), but the dependency over all time frames prevents their method from running in real time. Suwajanakorn et al. [9] propose a time-delay long short-term memory that looks only at the short-term future. Notably, a study conducted by Websdale et al. [39] shows that at least 70 ms of look-ahead is necessary in order to synthesize plausible coarticulations. Schwartz and Savariaux [40] discuss the asynchrony between visual and auditory events and show that it typically ranges from a 30–50 ms auditory lead to a 170–200 ms visual lead, caused, for example, by preparatory lip gestures. This range should be covered by the input audio context to handle such asynchronous events.

Taylor et al. [10] introduce a sliding window method that maps overlapping windows of phoneme subsequences to per-frame AAM parameters using a deep neural network. Karras et al. [4] adopt an encoder-decoder architecture, where a two-phase convolutional neural network (CNN) performs (feature-dimensional) formant analysis and (temporal) articulation analysis over a sliding window of linear predictive coding (LPC) features. Fully connected layers are then used to decode the mesh vertex offsets of the frame central to the sliding window. Following this idea, Pham et al. [5] and Tzirakis et al. [16] both choose to replace the fully connected layers with RNNs to decode blendshape coefficients of template face rigs. Hati et al. [7] prepend a text-to-speech module powered by Tacotron2 [41] and WaveGlow [42] to a similar CNN-based architecture to generate speech and facial animation simultaneously from text.

Cudeiro et al. [6] present the impressive VOCASET dataset along with two notable ideas. First, by conditioning on speaker labels they are able to decouple motion and speaking style from face shapes. Second, by integrating a pre-trained ASR module, DeepSpeech [13], the audio feature extraction becomes much more robust. Our model also utilizes speaker labels to distinguish idiosyncratic styles, but replaces the DeepSpeech module with a carefully tailored speech encoding network that offers comparable robustness while being much smaller.

In a recent work, Tian et al. [15] also combine a BiLSTM with an attention mechanism to generate facial animation from audio. There are two major differences between their method and the one proposed in this paper. In their pipeline, windowed audio features are flattened and fed directly to a stateful BiLSTM, over which an attention mechanism keeps track of the entire history. In contrast, our speech encoder processes the window in two orthogonal phases, one frequency-wise (using a BiLSTM) and the other frame-wise (using attention); our attention mechanism focuses only on the given window, which makes training a robust attention module easier than attending over the entire history. Furthermore, the pipeline of Tian et al. [15] outputs blendshape coefficients of a predefined face rig, limiting its ability to animate unrigged avatars.

Note that Karras et al. [4] also demonstrate a result where the facial motion is transferred from a known face model to a new one using deformation gradients [18] as a post-processing step. Our use of deformation gradients is different in that we integrate them as the decoding network's direct output, so that the decoupled motion can be immediately applied to an arbitrary face mesh.

Vougioukas et al. [43] propose a method based on generative adversarial networks (GANs) to generate realistic talking-head video from audio and a single still face image. It uses a frame discriminator, which judges the realism of each individual frame, and two temporal discriminators, which judge the realism and the synchronization of the video, respectively. Chen et al. [44] instead hierarchically regress facial landmarks and generate video from them. An attention-based dynamic pixel-wise loss addresses pixel jittering in regions uncorrelated with the audio, and a regression discriminator judges both the realism of the entire video and the accuracy of the per-frame landmarks. These GAN-based works tackle speech-driven facial animation in the image space, unlike 3D mesh-based approaches such as Cudeiro et al. [6] and ours.

3 Method

Our overall algorithm follows an encoder-decoder architecture [4–6, 16]. The raw input audio sequence is processed by a sliding window. The signal within each window is converted into a mel spectrogram (§ 3.1) before being fed to a three-stage deep neural network. In the first stage (§ 3.2), we perform formant analysis in the spectral dimension with a bidirectional long short-term memory. In the second stage (§ 3.3), temporal transitions are aggregated with a frame-wise attention mechanism to yield a robust encoding of the windowed audio signal. The third stage (§ 3.4), controlled by the one-hot
subject label [6], follows to decode the facial motion in the form of deformation gradients [18]. Finally (§ 3.5), the deformation gradients, together with a static template mesh, are combined to reconstruct the output facial mesh corresponding to the center frame of the temporal window. Fig. 1 illustrates the entire pipeline.

3.1 Audio preprocessing

We first convert the raw audio into spectrogram frames using the short-time Fourier transform. Each frame has a duration of FFT_win, and consecutive frames are separated by FFT_hop. To build the input mel spectrogram window, we use L frames and F mel-frequency bins. We further stack the first and second temporal derivatives as auxiliary features, resulting in a final tensor of shape 3 × F × L.
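As a concrete illustration of this preprocessing step, the sketch below computes such a window with librosa. The sample rate and the use of log-compression are not stated in the paper and are assumptions here; the window and hop durations follow § 4.4.

```python
import numpy as np
import librosa

def mel_window(audio, sr=16000, n_mels=128, n_frames=64,
               fft_win=0.064, fft_hop=0.008):
    """Build the 3 x F x L input tensor: mel spectrogram plus its first and
    second temporal derivatives (F = 128, L = 64 as in § 4.4; sr is assumed)."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=int(fft_win * sr),
        hop_length=int(fft_hop * sr), n_mels=n_mels)
    mel = librosa.power_to_db(mel)[:, :n_frames]   # F x L (log-mel assumed)
    d1 = librosa.feature.delta(mel, order=1)       # first temporal derivative
    d2 = librosa.feature.delta(mel, order=2)       # second temporal derivative
    return np.stack([mel, d1, d2], axis=0)         # 3 x F x L
```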
3.2 Formant analysis with spectral gathering

Most CNN-based methods [5, 16] treat (mel) spectrograms as plain images. However, as noted by Abdel-Hamid et al. [45], standard CNN kernels may not be suitable for the spectral domain, as signals in different frequency bands may behave quite differently. Simply using a large kernel size is likely to cause overfitting due to the prevalence of unimportant partials.

Motivated by the successful application of spectral-dimensional long short-term memory to ASR [46] and pitch tracking [47] tasks, we propose a hybrid network architecture (Table 1): the mel spectrogram feature (3 × F × L) is fed to two 2D convolution layers with kernel size 3 × 1, each followed by a max pooling with stride 2 × 1 along the spectral dimension to detect simple local features, and then a 1 × 1 convolution, producing an output of dimensions C_conv × F/4 × L. A spectral-dimensional bidirectional long short-term memory (Spec-BiLSTM) is then applied, effectively gathering information in the spectral dimension.

Finally, the outputs at all frequency bands are stacked and consumed by a fully connected layer, yielding a spectral feature z_spec (C_spec × 1 × L). Its shape can be squeezed into C_spec × L, as the size of the spectral dimension is 1.
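A PyTorch sketch of this formant analysis stage is shown below. The channel sizes follow Table 1 (§ 4.4); the padding scheme and the exact placement of batch normalization are assumptions.

```python
import torch
import torch.nn as nn

class FormantAnalyzer(nn.Module):
    """Sketch of the formant analysis stage: 2D convolutions over the mel
    spectrogram, then a BiLSTM run along the frequency axis (Spec-BiLSTM)."""
    def __init__(self, c_spec=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, (3, 1), padding=(1, 0)), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, (3, 1), padding=(1, 0)), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(64, 64, (1, 1)), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
        )
        self.spec_bilstm = nn.LSTM(64, 32, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(32 * 64, c_spec)     # 32 bands x 64 channels = 2048

    def forward(self, x):                        # x: (B, 3, F=128, L=64)
        h = self.conv(x)                         # (B, 64, F/4=32, L)
        B, C, Fq, L = h.shape
        h = h.permute(0, 3, 2, 1).reshape(B * L, Fq, C)   # run LSTM along frequency
        h, _ = self.spec_bilstm(h)               # (B*L, Fq, 64)
        h = h.reshape(B, L, Fq * 64)             # stack all frequency bands
        z_spec = self.fc(h)                      # (B, L, C_spec)
        return z_spec.transpose(1, 2)            # (B, C_spec, L)
```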
3.3 Articulation analysis with temporal attention

We propose an attention-based [48, 49] articulation analysis network that replaces the convolution layers commonly found in state-of-the-art works [4–6, 16]. As shown in Fig. 2 and Table 2, the spectral feature z_spec from the previous formant analyzer is fed to two temporal bidirectional long short-term memories (Time-BiLSTMs) to get a memory m with shape C_time × L. This step makes sure that each frame has some knowledge about its context. The content-based attention proposed by Bahdanau et al. [49] is then used to decide the weight of each time frame. The central K_qry frames of m are processed by a 1D convolution operator with kernel size K_qry and projected linearly to get the query term q_att with shape C_att × 1. The memory m is also projected linearly to get the key term k_att with shape C_att × L. The query q_att is repeated and added element-wise to k_att. We apply a tanh activation and project the summed array into a shape of 1 × L as per-frame scores. A softmax normalization along the time frames then yields the attention weights.

The final output, z_att with shape C_time × 1, is the weighted sum of m along the temporal dimension. Its shape can be squeezed into C_time, as the size of the temporal dimension is 1.
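The sketch below mirrors this articulation analysis stage in PyTorch. C_time = 512 follows Table 2; K_qry and C_att are not given in the text, so their values here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Sketch of the articulation analysis stage: two stacked Time-BiLSTMs
    followed by content-based attention over the temporal window."""
    def __init__(self, c_spec=256, c_time=512, c_att=128, k_qry=5):
        super().__init__()
        self.k_qry = k_qry
        self.bilstm = nn.LSTM(c_spec, c_time // 2, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.query_conv = nn.Conv1d(c_time, c_time, kernel_size=k_qry)
        self.query_proj = nn.Linear(c_time, c_att)
        self.key_proj = nn.Linear(c_time, c_att)
        self.score_proj = nn.Linear(c_att, 1)

    def forward(self, z_spec):                        # z_spec: (B, C_spec, L)
        m, _ = self.bilstm(z_spec.transpose(1, 2))    # memory m: (B, L, C_time)
        B, L, C = m.shape
        c0 = (L - self.k_qry) // 2                    # K_qry central frames of m
        q = self.query_conv(m.transpose(1, 2)[:, :, c0:c0 + self.k_qry])  # (B, C_time, 1)
        q = self.query_proj(q.squeeze(-1))            # query q_att: (B, C_att)
        k = self.key_proj(m)                          # key k_att:  (B, L, C_att)
        scores = self.score_proj(torch.tanh(k + q.unsqueeze(1)))   # (B, L, 1)
        w = F.softmax(scores, dim=1)                  # per-frame attention weights
        z_att = (w * m).sum(dim=1)                    # (B, C_time)
        return z_att, w.squeeze(-1)
```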
3.4 Motion decoding with deformation gradients

When representing the 3D deformation caused by facial motion, a common choice is vertex offsets [4, 6], i.e., the per-vertex displacements of a deformed facial mesh in frame t with respect to a static expressionless template mesh. However, due to the complex non-linearity of human faces, it is difficult to single out "shape-independent" vertex offsets, even with the help of conditioning on the speaker label during training [6] (which nevertheless helps the model learn motion patterns across multiple speakers).

Instead of focusing on vertices directly, we adopt deformation gradients [18] as a local descriptor of the non-rigid deformation between the expressionless template mesh and one in motion. More concretely, let v_i^(k) and ṽ_i^(k), k ∈ {1, 2, 3}, denote the three vertices of the i-th triangle in the expressionless template and the deformed mesh, respectively. To handle the deformation perpendicular to the triangle, we also compute a fourth vertex v_i^(4).
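For reference, deformation transfer [18] constructs this fourth vertex along the scaled triangle normal,

$$ v_i^{(4)} \;=\; v_i^{(1)} \;+\; \frac{\bigl(v_i^{(2)} - v_i^{(1)}\bigr) \times \bigl(v_i^{(3)} - v_i^{(1)}\bigr)}{\sqrt{\bigl\lVert \bigl(v_i^{(2)} - v_i^{(1)}\bigr) \times \bigl(v_i^{(3)} - v_i^{(1)}\bigr) \bigr\rVert}}, $$

and defines the deformation gradient of triangle i as T_i = Ṽ_i V_i^{-1}, where V_i = [v_i^(2) − v_i^(1), v_i^(3) − v_i^(1), v_i^(4) − v_i^(1)] and Ṽ_i is assembled analogously from the deformed triangle.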
[Fig. 1: pipeline overview. Labels recovered from the figure: Mel Spectrogram → Spec-BiLSTM → Attention → Scaling/Shear and Rotation → Deformed mesh, with the Static Template and the Speaker One-hot label as additional inputs.]
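A minimal sketch of the per-triangle feature extraction, following the standard deformation-transfer construction [18] and assuming a polar-decomposition split into a rotation vector and a symmetric scaling/shear part (consistent with the 6N scaling/shear and 3N rotation dimensions given in § 4.1); the function names are illustrative.

```python
import numpy as np
from scipy.linalg import polar
from scipy.spatial.transform import Rotation

def frame_vectors(v1, v2, v3):
    """Edge matrix [v2-v1, v3-v1, v4-v1] using the fourth vertex of [18]."""
    e1, e2 = v2 - v1, v3 - v1
    n = np.cross(e1, e2)
    e3 = n / np.sqrt(np.linalg.norm(n))
    return np.stack([e1, e2, e3], axis=1)        # 3 x 3

def triangle_features(tpl_tri, def_tri):
    """Deformation gradient T = Ṽ V^-1, split into 3 rotation parameters
    (rotation vector) and 6 symmetric scaling/shear parameters."""
    V = frame_vectors(*tpl_tri)
    V_def = frame_vectors(*def_tri)
    T = V_def @ np.linalg.inv(V)
    R, S = polar(T)                               # T = R S, S symmetric
    rot = Rotation.from_matrix(R).as_rotvec()     # 3 rotation parameters
    shear = S[np.triu_indices(3)]                 # 6 scaling/shear parameters
    return rot, shear
```

At inference time, the predicted per-triangle gradients {T_i} are stacked into the vector c used by the least-squares reconstruction below.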
E(x̃) = ‖c − Ax̃‖²        (5)

where c is the tensor stacked from the per-triangle deformation gradients {T_i}, i = 1, …, N, and A is a large, sparse matrix that relates x̃ to c. x̃ can be solved for in closed form by setting the gradient of E(x̃) with respect to x̃ to zero:

AᵀAx̃ = Aᵀc        (6)

Because A only depends on the static template, Aᵀ and AᵀA can be pre-computed once.
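In code, the pre-factorization and the per-frame solve can look like the following SciPy sketch; how A is assembled from the template is not shown here, and anchoring at least one vertex (so that AᵀA is non-singular) is assumed.

```python
from scipy.sparse.linalg import splu

def build_solver(A):
    """Pre-factorize A^T A once; A is the sparse, template-dependent matrix."""
    A = A.tocsc()
    lu = splu((A.T @ A).tocsc())     # sparse LU; a Cholesky factor would also work
    return A, lu

def reconstruct_vertices(A, lu, c):
    """Solve A^T A x = A^T c for the deformed vertex positions of one frame."""
    return lu.solve(A.T @ c)
```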
4 Model training

4.1 Dataset

We train our model on VOCASET [6], a dataset of 4D face scans with accompanying speech. The dataset contains a total of 480 sequences captured at 60 fps from 12 subjects. For each sequence, an aligned mesh sequence is provided, as well as a per-subject static expressionless mesh that serves as the template. All meshes share the same topology and contain N = 9976 triangles.

We split the data 8 : 2 : 2 into training, validation, and test sets, in the same way as the original VOCA paper [6]. All sets are fully disjoint, i.e., there is no overlap of subjects or sentences.

For each frame t in a data sequence, we compute the mel spectrogram and the corresponding deformation gradients to form the data pair (x_t, y_t), where y_t can be further split into the scaling/shear term s_t ∈ R^{6N} and the rotation term r_t ∈ R^{3N}.

4.2 Loss function

The decoder outputs the deformation gradients as two separate components: s̃_t and r̃_t. For each component, we consider the L2 loss of both the values and the temporal derivatives. The latter encourages temporal smoothness of the results [4, 6].

Specifically, we can define the two loss terms for the scaling/shear component as:
L_v^s = ‖s_t − s̃_t‖²        (7)

L_d^s = ‖(s_t − s_{t−1}) − (s̃_t − s̃_{t−1})‖²        (8)

The losses for the rotation component, L_v^r and L_d^r, are defined similarly. The final loss is a weighted sum of the above four terms. The weights are determined automatically using the dynamic scalars proposed by Karras et al. [4].

4.3 Data augmentation

We augment our training data in three ways. (i) Similar to Karras et al. [4], we randomly shift each frame by ±0.5 frames (about 8.3 ms). For the adjacent frames used to calculate the temporal derivative loss terms, we use the same shifting amount to ensure correctness. (ii) We also follow the common practice in ASR model training and augment the audio signal by randomly adding white or pink noise and by pre-emphasizing the signal with a coefficient randomly picked in [0, 0.95]. (iii) To cover a wider range of spectral variations, we further apply several augmentation schemes to the mel spectrograms: first, we randomly pad zeros at the lowest or the highest frequency bins, then resize back to the original number of bins; second, we randomly squeeze or stretch the time dimension, then resample into the original temporal window; third, we randomly set some bins to zero; fourth, we scale the mel-frequency bins by a random sine curve. These augmentations prove to be useful in boosting the robustness of our model.

4.4 Model details

For mel spectrogram extraction, we use L = 64 frames and F = 128 mel-frequency bins. Each frame processed by the short-time Fourier transform has a duration of FFT_win = 0.064 s, and consecutive frames are separated by FFT_hop = 0.008 s. The extracted mel spectrogram thus represents a window of 0.568 s of audio, which is enough for capturing coarticulation according to Websdale et al. [39] and for handling audiovisual asynchrony [40].

Table 1 summarizes the layers of the formant analysis module. Leaky ReLU activations with leaky rate 0.2 and batch normalization are used for all convolution layers. Table 2 describes the layers of the articulation analysis module. In the motion decoding module, the PCA bases of each branch cover about 97% of the variance, namely 85-dimensional vectors for the scaling/shear branch and 180-dimensional vectors for the rotation branch. N is 9976, as mentioned in § 4.1. Table 3 depicts the layers of the shared part and the two separate branches. Since the training set contains 8 subjects, the size of the one-hot speaker label is 8 as well.

For the entire model, weight normalization is performed on all weights.

The model is built with PyTorch [50]. We train it for 50 epochs using Adam [51] with a constant learning rate of 0.0001 and a batch size of 100. In each batch, we randomly choose 50 pairs of adjacent frames to calculate the temporal derivative terms in our loss function. Training takes about 5 hours on a GeForce GTX 1080 Ti GPU.
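A rough sketch of the training setup described in §§ 4.2 and 4.4 follows; the model itself is represented by a stand-in module, the dynamic loss weights of Karras et al. [4] are omitted, and applying weight normalization layer by layer is one possible reading of the statement above.

```python
import torch
from torch import nn
from torch.nn.utils import weight_norm

def apply_weight_norm(model: nn.Module) -> nn.Module:
    """Wrap every linear/convolution layer with weight normalization."""
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)):
            weight_norm(m)
    return model

def loss_terms(s, s_prev, s_hat, s_hat_prev, r, r_prev, r_hat, r_hat_prev):
    """Value and temporal-derivative L2 terms of Eqs. (7)-(8) for both branches
    (the reduction over elements is an assumption)."""
    def pair(y, y_prev, y_hat, y_hat_prev):
        l_v = torch.mean((y - y_hat) ** 2)
        l_d = torch.mean(((y - y_prev) - (y_hat - y_hat_prev)) ** 2)
        return l_v, l_d
    return (*pair(s, s_prev, s_hat, s_hat_prev), *pair(r, r_prev, r_hat, r_hat_prev))

# Stand-in for the full encoder-decoder; § 4.4: Adam, lr 1e-4, 50 epochs, batch 100.
model = apply_weight_norm(nn.Sequential(nn.Linear(512, 256)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```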
Our entire uncompressed model has a memory footprint of about 66.9 MB. In comparison, the VOCA model [6], with its integrated DeepSpeech module [13], is around 477.3 MB—more than 7 times as large as ours. Of our 66.9 MB, the trainable parameters account for only 26.63 MB; the fixed components from PCA occupy the rest. The small number of trainable parameters helps to avoid overfitting. Furthermore, by converting 32-bit floats into 16-bit floats and removing some of the less important PCA components, one can compress the model greatly and migrate it to mobile devices.
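As a rough illustration of that compression path (the number of retained PCA components is an assumption):

```python
import numpy as np
from torch import nn

def compress(model: nn.Module, pca_basis: np.ndarray, keep: int = 64):
    """Cast trainable weights to 16-bit floats and keep only the leading
    `keep` PCA components; both steps trade a little accuracy for size."""
    model = model.half()                             # 32-bit -> 16-bit floats
    basis = pca_basis[:, :keep].astype(np.float16)   # drop less important components
    return model, basis
```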
Table 1  Layers of the formant analysis module.

  Layer            Kernel size 2)   Stride 2)   Activation     Output shape
  Mel spectrogram  –                –           –              3 × 128 × 64
  Convolution2d    3 × 1            1 × 1       lrelu:0.2 1)   32 × 128 × 64
  MaxPool2d        2 × 1            2 × 1       –              32 × 64 × 64
  Convolution2d    3 × 1            1 × 1       lrelu:0.2      64 × 64 × 64
  MaxPool2d        2 × 1            2 × 1       –              64 × 32 × 64
  Convolution2d    1 × 1            1 × 1       lrelu:0.2      64 × 32 × 64
  Spec-BiLSTM      –                –           –              64 × 32 × 64
  Frequency stack  –                –           –              2048 × 1 × 64
  Fully connected  –                –           –              256 × 1 × 64
  Squeezing        –                –           –              256 × 64

  1) lrelu:0.2: Leaky ReLU activation with leaky rate 0.2.
  2) Both kernel size and stride are given in Spectral × Temporal shape.

Table 2  Layers of the articulation analysis module.

  Layer         Output shape
  Time-BiLSTM   512 × 64
  Time-BiLSTM   512 × 64
  Attention     512 × 1
  Squeezing     512

[…] sense may be picked up immediately by human eyes as something uncannily wrong [53]. This may explain why none of the most relevant works (e.g., by Karras et al. [4] and Cudeiro et al. [6]) includes a quantitative evaluation. Therefore, we conduct several user studies and a qualitative evaluation.

5.1 User studies

Three blind user studies are published on Amazon Mechanical Turk (AMT) in the form of A/B choices. In each study, a user is presented with two side-by-side synchronized animation clips driven by the same audio and asked to choose the better one. The template meshes and rendering configurations remain the same for both results. The user can also play the animation at half speed to better compare motion details.

To prevent participants from randomly picking answers without careful evaluation, we adopt two measures. First, a user can only give an answer after watching both clips. Second, several pairs with an obvious answer are included as qualification questions, and we only accept answers from users who have passed the qualification.

We have collected a total of 5600 HITs (human intelligence tasks), each representing one participant making one choice between two clips. Notably, the numbers of both utterances and collected HITs in our user studies are several times larger than those of the similar studies conducted by Cudeiro et al. [6].

5.1.1 Comparison with captured data

We have collected 50 HITs for each result pair, totaling 2400 HITs. Overall, the participants' preference leans only slightly toward deformation gradients (53.58% ± 3.42%). But interestingly, deformation gradients seem to work significantly better in sentences involving a transition from the phoneme /s/ to any of /m/, /b/, or /p/. As shown in Fig. 3, the first five columns from the left are sentences with the aforementioned phonetic transitions. Please watch the supplementary video for a dynamic comparison.

[Fig. 3: User study comparing deformation gradients and vertex offsets; vertical axis: percentage of choice (%), horizontal axis: sentences 1–12.]
[Fig. 5: Comparison with Karras et al. [4]. (a) Karras et al., (b) their model + our data, (c) ours. The speaker is about to pronounce the phoneme /b/ at this moment, when the lips are supposedly pressed.]

[Fig. 7: Data augmentation in the audio signal. (a) Without audio augmentation, (b) our full model. The consonant /b/ from a test sentence is pronounced at this moment.]
[…] around pressed lips, especially well. Our model is significantly smaller than those featuring pre-trained ASR modules, but offers comparable robustness and higher quality results in […]

References

10. Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 2017.
24. […] expressive facial animation system. In Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games, pages 21–26. ACM, 2007.
25. Michael M Cohen and Dominic W Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, pages 139–156. Springer, 1993.
26. Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131–140. ACM, 2013.
27. Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 353–360, 1997.
28. Tony Ezzat, Gadi Geiger, and Tomaso Poggio. Trainable videorealistic speech animation. ACM Transactions on Graphics (TOG), 21(3):388–398, July 2002.
29. Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation, pages 275–284, 2012.
30. Matthew Brand. Voice puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 21–28. ACM Press/Addison-Wesley Publishing Co., 1999.
31. Lei Xie and Zhi-Qiang Liu. Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Transactions on Multimedia, 9(3):500–510, 2007.
32. Lijuan Wang, Wei Han, Frank K Soong, and Qiang Huo. Text driven 3D photo-realistic talking head. In Interspeech 2011, pages 3307–3308, 2011.
33. Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, and Frank K Soong. A new language independent, photo-realistic talking head driven by voice only. In Interspeech 2013, pages 2743–2747, 2013.
34. Taiki Shimba, Ryuhei Sakurai, Hirotake Yamazoe, and Joo-Ho Lee. Talking heads synthesis from audio with deep neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII), pages 100–105. IEEE, 2015.
35. Bo Fan, Lei Xie, Shan Yang, Lijuan Wang, and Frank K Soong. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications, 75(9):5287–5309, 2016.
36. Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. Generating talking face landmarks from speech. In International Conference on Latent Variable Analysis and Signal Separation, pages 372–381. Springer, 2018.
37. Deepali Aneja and Wilmot Li. Real-time lip sync for live 2D animation. arXiv preprint arXiv:1910.08685, 2019.
38. David Greenwood, Iain Matthews, and Stephen Laycock. Joint learning of facial expression and head pose from speech. In Proc. Interspeech 2018, pages 2484–2488, 2018.
39. Danny Websdale, Sarah Taylor, and Ben Milner. The effect of real-time constraints on automatic speech animation. In Proc. Interspeech 2018, pages 2479–2483, 2018.
40. Jean-Luc Schwartz and Christophe Savariaux. No, there is no 150 ms lead of visual speech on auditory speech, but a range of audiovisual asynchronies varying from small audio lead to large audio lag. PLOS Computational Biology, 10(7):e1003743, 2014.
41. Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
42. Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
43. Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 128(5):1398–1413, May 2020.
44. Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7832–7841, 2019.
45. Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
46. Tara N Sainath and Bo Li. Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In Interspeech 2016, pages 813–817, 2016.
47. Yuzhou Liu and DeLiang Wang. Time and frequency domain long short-term memory for noise robust pitch tracking. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5600–5604. IEEE, 2017.
48. Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.
49. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
50. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff, 2017.
51. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
52. Paul Ekman, Wallace V Friesen, and Joseph C Hager. Facial Action Coding System: The Manual on CD-ROM. Instructor's Guide. Salt Lake City: Network Information Research Co., 2002.
53. Masahiro Mori, Karl F MacDorman, and Norri Kageki. The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19(2):98–100, 2012.
54. Changil Kim, Hijung Valentina Shin, Tae-Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, and Wojciech Matusik. On learning associations of faces and voices. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 276–292. Springer, 2018.