RESEARCH ARTICLE

Speech-Driven Facial Animation with Spectral Gathering and Temporal Attention
Yujin Chai et al.

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract  In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory and a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be better synthesized than using vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller but achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.

Keywords  speech-driven facial animation, spectral-dimensional bidirectional long short-term memory, temporal attention, deformation gradients

1 Introduction

[…] time, with nothing but their own face [1, 2].

To push the limit further, people naturally wondered whether an excerpt of vocal recording or a transcript would be sufficient to deduce a corresponding facial animation, as a driving face might be absent in certain scenarios. Such examples may include a virtual assistant whose utterances are synthesized on the fly. It turns out that some surprisingly faithful results can be achieved [3–7].

However, while face-driven virtual avatars (such as Apple's Animoji and Memoji) have already been entertaining worldwide users for a while, we still need to solve at least two challenges before speech-driven facial animation techniques can reach that same maturity: robust audio processing and effortless motion retargeting. Together they would allow an end-to-end system to generalize well to both unheard voices and unseen avatars—an ability vital to consumer-facing products.
1.1 Vocal audio processing
While a high-quality, pre-trained automatic speech recognition (ASR) module can be readily used to extract robust speech features [6, 11], it imposes a significant overhead on model complexity and runtime performance.

Our insight here is that mapping speech to facial animation is different enough from common speech recognition tasks that a feature extractor designed for the latter may not be the ideal choice for the former. We thus propose a new speech encoder that is tailored to our specific task. Unlike a fully-convolutional network [4], it employs a bidirectional long short-term memory (LSTM) [12] network along the spectral dimension, which can better capture long-distance correlations of formants in the mel spectrogram. We further introduce a temporal attention mechanism that allows the model to focus on the few audio frames that are influential to the facial motion but otherwise easy to miss.

The proposed encoder network compares favorably against the state-of-the-art [6] built on a pre-trained ASR module [13] in terms of accuracy and robustness, but is only a fraction of its size.

1.2 Motion representation

The other key challenge in speech-driven facial animation lies in the way facial motions are represented. Being able to effortlessly drive an arbitrary 3D face model, even one with a drastically different mesh topology from those seen and learned by the model, would be highly desirable. Two popular choices in this regard are low-dimensional expression coefficients (e.g., blendshapes) [5, 11, 14–17] and per-vertex offsets from a globally aligned expressionless template mesh [4, 6]. But a potential limitation shared by both representations is that the face models used for training and inference must have exactly the same underlying structure: same rig, same blendshape bases, or same mesh tessellation.

Inspired by the study of deformation transfer [18–20], we let our model's decoder network output deformation gradients as an intermediate representation of the target facial motion, from which we can reconstruct the facial mesh using either the original template or a new 3D face, possibly with a different topology.

1.3 Contributions

The key technical contributions of our approach include:

1. A lightweight, robust speech encoder designed specifically for the task of animating 3D face avatars from input vocal audio.

2. The use of deformation gradients as the motion representation, for better handling of non-rigidity and easier generalization to topologically different faces.

Based on these ideas, we present an end-to-end speech-driven facial animation algorithm (Fig. 1). It outperforms state-of-the-art methods [4, 6] in several challenging cases (e.g., those involving lip closures or lasting vowels). Without using any pre-trained ASR module, our model is compact and runs in real time with low latency. Once trained, it generalizes robustly to unheard voices and unseen face models. To assess the quality of speech animations generated by our algorithm, please watch the supplementary video1).

1) Also available at: https://chaiyujin.github.io/sdfa

2 Related works

We briefly review the prior art most pertinent to the task of generating facial animations from audio. For the broader topic of facial capture and dynamic manipulation, we refer the readers to the survey by Orvalho et al. [21].

Procedural methods map phonemes in the audio to visemes following certain predefined rules. One of the main challenges is how to realistically handle coarticulations [22–24]. Cohen and Massaro [25] propose dominance functions to evaluate the degree of a certain viseme in a given context. Xu et al. [26] use phone bigrams to handle coarticulation. But it is difficult to cover all possible coarticulation cases in real-world speech. Edwards et al. [3] propose a jaw-lip action model with an emphasis on artistic control.

Bregler et al. [27] propose an example-based method to rewrite video frames to match a new audio clip via automatic mouth tracking and image warping. Ezzat et al. [28] map the phonemes into clustered principal component analysis (PCA) coefficients that represent the shape and texture of the lower face. Taylor et al. [29] use an active appearance model (AAM) to model variations in the shape and texture of the lower face, and match variable-length phoneme substrings with similar appearances into dynamic visemes.

Brand [30] estimates a hidden Markov model (HMM) from facial landmarks in the video and synthesizes the most probable sequence through trajectory optimization. Xie and Liu [31] model the movements of articulators with dynamic Bayesian networks. Wang et al. [32] map mel-frequency cepstral coefficients (MFCCs) to PCA coefficients with an HMM, which is further extended by Zhang et al. [33] with a context-dependent deep neural network hidden Markov
model (CD-DNN-HMM) for more robust audio feature extraction.

Recurrent neural networks (RNNs) and their variants have been exploited by many [9, 14, 15, 17, 34–38] due to the sequential nature of audio and visual data. In order to take both past and future context into account, Fan et al. [35] adopt a bidirectional long short-term memory (BiLSTM), but the dependency over all time frames prevents their method from running in real time. Suwajanakorn et al. [9] propose a time-delay long short-term memory that looks only at the short-term future. Notably, a study conducted by Websdale et al. [39] shows that at least 70 ms of look-ahead is necessary in order to synthesize plausible coarticulations. Schwartz and Savariaux [40] discuss the asynchrony between visual and auditory events and show that it typically ranges from a 30–50 ms auditory lead to a 170–200 ms visual lead, caused, for example, by preparatory lip gestures. This range should be covered by the input audio context to handle such asynchronous events.

Taylor et al. [10] introduce a sliding window method that maps overlapping windows of phoneme subsequences to per-frame AAM parameters using a deep neural network. Karras et al. [4] adopt an encoder-decoder architecture, where a two-phase convolutional neural network (CNN) performs (feature-dimensional) formant analysis and (temporal) articulation analysis over a sliding window of linear predictive coding (LPC) features. Fully connected layers are then used to decode the mesh vertex offsets of the frame central to the sliding window. Following this idea, Pham et al. [5] and Tzirakis et al. [16] both choose to replace the fully connected layers with RNNs to decode blendshape coefficients of template face rigs. Hati et al. [7] prepend a text-to-speech module powered by Tacotron2 [41] and WaveGlow [42] to a similar CNN-based architecture to generate speech and facial animation simultaneously from text.

Cudeiro et al. [6] present the impressive VOCASET dataset along with two notable ideas. First, by conditioning on speaker labels they are able to decouple motion and speaking style from face shapes. Second, by integrating a pre-trained ASR module, DeepSpeech [13], the audio feature extraction becomes much more robust. Our model also utilizes speaker labels to distinguish idiosyncratic styles, but replaces the DeepSpeech module with a carefully tailored speech encoding network that offers comparable robustness while being much smaller.

In a recent work, Tian et al. [15] also combine a BiLSTM with an attention mechanism to generate facial animation from audio. There are two major differences between their method and the one proposed in this paper. In their pipeline, windowed audio features are flattened and fed directly to a stateful BiLSTM, over which an attention mechanism keeps track of the entire history. In contrast, our speech encoder processes the window in two orthogonal phases, one frequency-wise (using a BiLSTM) and the other frame-wise (using attention); our attention mechanism focuses only on the given window, which makes training a robust attention module easier than attending over the entire history. Furthermore, the pipeline of Tian et al. [15] outputs blendshape coefficients of a predefined face rig, limiting its ability to animate unrigged avatars.

Note that Karras et al. [4] also demonstrate a result where the facial motion is transferred from a known face model to a new one using deformation gradients [18] as a post-processing step. Our use of deformation gradients is different in that we integrate them as the decoding network's direct output, so that the decoupled motion can be immediately applied to an arbitrary face mesh.

Vougioukas et al. [43] propose a method based on generative adversarial networks (GANs) to generate realistic talking-head video from audio and a single still face image. It uses a frame discriminator, which judges the realism of each individual frame, and two temporal discriminators, which judge the realism and the synchronization of the video, respectively. Chen et al. [44] instead hierarchically regress facial landmarks and generate video from them. An attention-based dynamic pixel-wise loss addresses pixel jittering in regions uncorrelated with the audio, and a regression discriminator judges both the realism of the entire video and the accuracy of the per-frame landmarks. These GAN-based works tackle speech-driven facial animation in the image space, unlike 3D mesh-based approaches such as Cudeiro et al. [6] and ours.

3 Method

Our overall algorithm follows an encoder-decoder architecture [4–6, 16]. The raw input audio sequence is processed by a sliding window. The signal within each window is converted into a mel spectrogram (§ 3.1) before being fed to a three-stage deep neural network. In the first stage (§ 3.2), we perform formant analysis in the spectral dimension with a bidirectional long short-term memory. In the second stage (§ 3.3), temporal transitions are aggregated with a frame-wise attention mechanism to yield a robust encoding of the windowed audio signal. The third stage (§ 3.4), controlled by the one-hot
subject label [6], follows to decode the facial motion in the form of deformation gradients [18]. Finally (§ 3.5), the deformation gradients, together with a static template mesh, are combined to reconstruct the output facial mesh corresponding to the center frame of the temporal window. Fig. 1 illustrates the entire pipeline.

3.1 Audio preprocessing

We first convert the raw audio into spectrogram frames using the short-time Fourier transform. Each frame has a duration of FFT_win, and consecutive frames are separated by FFT_hop. To build the input mel spectrogram window, we use L frames and F mel-frequency bins. We further stack the first and second temporal derivatives as auxiliary features, resulting in a final tensor of shape 3 × F × L.
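As a concrete illustration of this preprocessing step, the sketch below computes such a window with librosa. The sample rate and the use of log-compression are not stated in the paper and are assumptions here; the window and hop durations follow § 4.4.

```python
import numpy as np
import librosa

def mel_window(audio, sr=16000, n_mels=128, n_frames=64,
               fft_win=0.064, fft_hop=0.008):
    """Build the 3 x F x L input tensor: mel spectrogram plus its first and
    second temporal derivatives (F = 128, L = 64 as in § 4.4; sr is assumed)."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=int(fft_win * sr),
        hop_length=int(fft_hop * sr), n_mels=n_mels)
    mel = librosa.power_to_db(mel)[:, :n_frames]   # F x L (log-mel assumed)
    d1 = librosa.feature.delta(mel, order=1)       # first temporal derivative
    d2 = librosa.feature.delta(mel, order=2)       # second temporal derivative
    return np.stack([mel, d1, d2], axis=0)         # 3 x F x L
```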
3.2 Formant analysis with spectral gathering

Most CNN-based methods [5, 16] treat (mel) spectrograms as plain images. However, as noted by Abdel-Hamid et al. [45], standard CNN kernels may not be suitable for the spectral domain, as signals in different frequency bands may behave quite differently. Simply using a large kernel size is likely to cause overfitting due to the prevalence of unimportant partials.

Motivated by the successful application of spectral-dimensional long short-term memory to ASR [46] and pitch tracking [47] tasks, we propose a hybrid network architecture (Table 1): the mel spectrogram feature (3 × F × L) is fed to two 2D convolution layers with kernel size 3 × 1, each followed by a max pooling with stride 2 × 1 along the spectral dimension to detect simple local features, and then a 1 × 1 convolution, producing an output of dimensions C_conv × F/4 × L. A spectral-dimensional bidirectional long short-term memory (Spec-BiLSTM) is then applied, effectively gathering information in the spectral dimension.

Finally, the outputs at all frequency bands are stacked and consumed by a fully connected layer, yielding a spectral feature z_spec (C_spec × 1 × L). Its shape can be squeezed into C_spec × L, as the size of the spectral dimension is 1.
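A PyTorch sketch of this formant analysis stage is shown below. The channel sizes follow Table 1 (§ 4.4); the padding scheme and the exact placement of batch normalization are assumptions.

```python
import torch
import torch.nn as nn

class FormantAnalyzer(nn.Module):
    """Sketch of the formant analysis stage: 2D convolutions over the mel
    spectrogram, then a BiLSTM run along the frequency axis (Spec-BiLSTM)."""
    def __init__(self, c_spec=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, (3, 1), padding=(1, 0)), nn.BatchNorm2d(32), nn.LeakyReLU(0.2),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, (3, 1), padding=(1, 0)), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(64, 64, (1, 1)), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
        )
        self.spec_bilstm = nn.LSTM(64, 32, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(32 * 64, c_spec)     # 32 bands x 64 channels = 2048

    def forward(self, x):                        # x: (B, 3, F=128, L=64)
        h = self.conv(x)                         # (B, 64, F/4=32, L)
        B, C, Fq, L = h.shape
        h = h.permute(0, 3, 2, 1).reshape(B * L, Fq, C)   # run LSTM along frequency
        h, _ = self.spec_bilstm(h)               # (B*L, Fq, 64)
        h = h.reshape(B, L, Fq * 64)             # stack all frequency bands
        z_spec = self.fc(h)                      # (B, L, C_spec)
        return z_spec.transpose(1, 2)            # (B, C_spec, L)
```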
3.3 Articulation analysis with temporal attention

We propose an attention-based [48, 49] articulation analysis network that replaces the convolution layers commonly found in state-of-the-art works [4–6, 16]. As shown in Fig. 2 and Table 2, the spectral feature z_spec from the previous formant analyzer is fed to two temporal bidirectional long short-term memories (Time-BiLSTMs) to get a memory m with shape C_time × L. This step makes sure that each frame has some knowledge about its context. The content-based attention proposed by Bahdanau et al. [49] is then used to decide the weight of each time frame. The central K_qry frames of m are processed by a 1D convolution operator with kernel size K_qry and projected linearly to get the query term q_att with shape C_att × 1. The memory m is also projected linearly to get the key term k_att with shape C_att × L. The query q_att is repeated and added element-wise to k_att. We apply a tanh activation and project the summed array into a shape of 1 × L as per-frame scores. A softmax normalization along the time frames then yields the attention weights.

The final output, z_att with shape C_time × 1, is the weighted sum of m along the temporal dimension. Its shape can be squeezed into C_time, as the size of the temporal dimension is 1.
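The sketch below mirrors this articulation analysis stage in PyTorch. C_time = 512 follows Table 2; K_qry and C_att are not given in the text, so their values here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Sketch of the articulation analysis stage: two stacked Time-BiLSTMs
    followed by content-based attention over the temporal window."""
    def __init__(self, c_spec=256, c_time=512, c_att=128, k_qry=5):
        super().__init__()
        self.k_qry = k_qry
        self.bilstm = nn.LSTM(c_spec, c_time // 2, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.query_conv = nn.Conv1d(c_time, c_time, kernel_size=k_qry)
        self.query_proj = nn.Linear(c_time, c_att)
        self.key_proj = nn.Linear(c_time, c_att)
        self.score_proj = nn.Linear(c_att, 1)

    def forward(self, z_spec):                        # z_spec: (B, C_spec, L)
        m, _ = self.bilstm(z_spec.transpose(1, 2))    # memory m: (B, L, C_time)
        B, L, C = m.shape
        c0 = (L - self.k_qry) // 2                    # K_qry central frames of m
        q = self.query_conv(m.transpose(1, 2)[:, :, c0:c0 + self.k_qry])  # (B, C_time, 1)
        q = self.query_proj(q.squeeze(-1))            # query q_att: (B, C_att)
        k = self.key_proj(m)                          # key k_att:  (B, L, C_att)
        scores = self.score_proj(torch.tanh(k + q.unsqueeze(1)))   # (B, L, 1)
        w = F.softmax(scores, dim=1)                  # per-frame attention weights
        z_att = (w * m).sum(dim=1)                    # (B, C_time)
        return z_att, w.squeeze(-1)
```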
3.4 Motion decoding with deformation gradients

When representing the 3D deformation caused by facial motion, a common choice is vertex offsets [4, 6], i.e., the per-vertex displacements of a deformed facial mesh in frame t with respect to a static expressionless template mesh. However, due to the complex non-linearity of human faces, it is difficult to single out "shape-independent" vertex offsets, even with the help of conditioning on the speaker label during training [6] (which nevertheless helps the model learn motion patterns across multiple speakers).

Instead of focusing on vertices directly, we adopt deformation gradients [18] as a local descriptor of the non-rigid deformation between the expressionless template mesh and one in motion. More concretely, let v_i^(k) and ṽ_i^(k), k ∈ {1, 2, 3}, denote the three vertices of the i-th triangle in the expressionless template and the deformed mesh, respectively. To handle the deformation perpendicular to the triangle, we also compute a fourth vertex v_i^(4).
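For reference, deformation transfer [18] constructs this fourth vertex along the scaled triangle normal,

$$ v_i^{(4)} \;=\; v_i^{(1)} \;+\; \frac{\bigl(v_i^{(2)} - v_i^{(1)}\bigr) \times \bigl(v_i^{(3)} - v_i^{(1)}\bigr)}{\sqrt{\bigl\lVert \bigl(v_i^{(2)} - v_i^{(1)}\bigr) \times \bigl(v_i^{(3)} - v_i^{(1)}\bigr) \bigr\rVert}}, $$

and defines the deformation gradient of triangle i as T_i = Ṽ_i V_i^{-1}, where V_i = [v_i^(2) − v_i^(1), v_i^(3) − v_i^(1), v_i^(4) − v_i^(1)] and Ṽ_i is assembled analogously from the deformed triangle.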
[Fig. 1: pipeline overview. Labels recovered from the figure: Mel Spectrogram → Spec-BiLSTM → Attention → Scaling/Shear and Rotation → Deformed mesh, with the Static Template and the Speaker One-hot label as additional inputs.]
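A minimal sketch of the per-triangle feature extraction, following the standard deformation-transfer construction [18] and assuming a polar-decomposition split into a rotation vector and a symmetric scaling/shear part (consistent with the 6N scaling/shear and 3N rotation dimensions given in § 4.1); the function names are illustrative.

```python
import numpy as np
from scipy.linalg import polar
from scipy.spatial.transform import Rotation

def frame_vectors(v1, v2, v3):
    """Edge matrix [v2-v1, v3-v1, v4-v1] using the fourth vertex of [18]."""
    e1, e2 = v2 - v1, v3 - v1
    n = np.cross(e1, e2)
    e3 = n / np.sqrt(np.linalg.norm(n))
    return np.stack([e1, e2, e3], axis=1)        # 3 x 3

def triangle_features(tpl_tri, def_tri):
    """Deformation gradient T = Ṽ V^-1, split into 3 rotation parameters
    (rotation vector) and 6 symmetric scaling/shear parameters."""
    V = frame_vectors(*tpl_tri)
    V_def = frame_vectors(*def_tri)
    T = V_def @ np.linalg.inv(V)
    R, S = polar(T)                               # T = R S, S symmetric
    rot = Rotation.from_matrix(R).as_rotvec()     # 3 rotation parameters
    shear = S[np.triu_indices(3)]                 # 6 scaling/shear parameters
    return rot, shear
```

At inference time, the predicted per-triangle gradients {T_i} are stacked into the vector c used by the least-squares reconstruction below.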
E(x̃) = ‖c − Ax̃‖²        (5)

where c is the tensor stacked from the per-triangle deformation gradients {T_i}, i = 1, …, N, and A is a large, sparse matrix that relates x̃ to c. x̃ can be solved for in closed form by setting the gradient of E(x̃) with respect to x̃ to zero:

AᵀAx̃ = Aᵀc        (6)

Because A only depends on the static template, Aᵀ and AᵀA can be pre-computed once.
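In code, the pre-factorization and the per-frame solve can look like the following SciPy sketch; how A is assembled from the template is not shown here, and anchoring at least one vertex (so that AᵀA is non-singular) is assumed.

```python
from scipy.sparse.linalg import splu

def build_solver(A):
    """Pre-factorize A^T A once; A is the sparse, template-dependent matrix."""
    A = A.tocsc()
    lu = splu((A.T @ A).tocsc())     # sparse LU; a Cholesky factor would also work
    return A, lu

def reconstruct_vertices(A, lu, c):
    """Solve A^T A x = A^T c for the deformed vertex positions of one frame."""
    return lu.solve(A.T @ c)
```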
4 Model training

4.1 Dataset

We train our model on VOCASET [6], a dataset of 4D face scans with accompanying speech. The dataset contains a total of 480 sequences captured at 60 fps from 12 subjects. For each sequence, an aligned mesh sequence is provided, as well as a per-subject static expressionless mesh that serves as the template. All meshes share the same topology and contain N = 9976 triangles.

We split the data 8 : 2 : 2 into training, validation, and test sets, in the same way as the original VOCA paper [6]. All sets are fully disjoint, i.e., there is no overlap of subjects or sentences.

For each frame t in a data sequence, we compute the mel spectrogram and the corresponding deformation gradients to form the data pair (x_t, y_t), where y_t can be further split into the scaling/shear term s_t ∈ R^{6N} and the rotation term r_t ∈ R^{3N}.

4.2 Loss function

The decoder outputs the deformation gradients as two separate components: s̃_t and r̃_t. For each component, we consider the L2 loss of both the values and the temporal derivatives. The latter encourages temporal smoothness of the results [4, 6].

Specifically, we can define the two loss terms for the scaling/shear component as:
L_v^s = ‖s_t − s̃_t‖²        (7)

L_d^s = ‖(s_t − s_{t−1}) − (s̃_t − s̃_{t−1})‖²        (8)

The losses for the rotation component, L_v^r and L_d^r, are defined similarly. The final loss is a weighted sum of the above four terms. The weights are determined automatically using the dynamic scalars proposed by Karras et al. [4].

4.3 Data augmentation

We augment our training data in three ways. (i) Similar to Karras et al. [4], we randomly shift each frame by ±0.5 frames (about 8.3 ms). For the adjacent frames used to calculate the temporal derivative loss terms, we use the same shifting amount to ensure correctness. (ii) We also follow the common practice in ASR model training and augment the audio signal by randomly adding white or pink noise and by pre-emphasizing the signal with a coefficient randomly picked in [0, 0.95]. (iii) To cover a wider range of spectral variations, we further apply several augmentation schemes to the mel spectrograms: first, we randomly pad zeros at the lowest or the highest frequency bins, then resize back to the original number of bins; second, we randomly squeeze or stretch the time dimension, then resample into the original temporal window; third, we randomly set some bins to zero; fourth, we scale the mel-frequency bins by a random sine curve. These augmentations prove to be useful in boosting the robustness of our model.

4.4 Model details

For mel spectrogram extraction, we use L = 64 frames and F = 128 mel-frequency bins. Each frame processed by the short-time Fourier transform has a duration of FFT_win = 0.064 s, and consecutive frames are separated by FFT_hop = 0.008 s. The extracted mel spectrogram thus represents a window of 0.568 s of audio, which is enough for capturing coarticulation according to Websdale et al. [39] and for handling audiovisual asynchrony [40].

Table 1 summarizes the layers of the formant analysis module. Leaky ReLU activations with leaky rate 0.2 and batch normalization are used for all convolution layers. Table 2 describes the layers of the articulation analysis module. In the motion decoding module, the PCA bases of each branch cover about 97% of the variance, namely 85-dimensional vectors for the scaling/shear branch and 180-dimensional vectors for the rotation branch. N is 9976, as mentioned in § 4.1. Table 3 depicts the layers of the shared part and the two separate branches. Since the training set contains 8 subjects, the size of the one-hot speaker label is 8 as well.

For the entire model, weight normalization is performed on all weights.

The model is built with PyTorch [50]. We train it for 50 epochs using Adam [51] with a constant learning rate of 0.0001 and a batch size of 100. In each batch, we randomly choose 50 pairs of adjacent frames to calculate the temporal derivative terms in our loss function. Training takes about 5 hours on a GeForce GTX 1080 Ti GPU.
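A rough sketch of the training setup described in §§ 4.2 and 4.4 follows; the model itself is represented by a stand-in module, the dynamic loss weights of Karras et al. [4] are omitted, and applying weight normalization layer by layer is one possible reading of the statement above.

```python
import torch
from torch import nn
from torch.nn.utils import weight_norm

def apply_weight_norm(model: nn.Module) -> nn.Module:
    """Wrap every linear/convolution layer with weight normalization."""
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)):
            weight_norm(m)
    return model

def loss_terms(s, s_prev, s_hat, s_hat_prev, r, r_prev, r_hat, r_hat_prev):
    """Value and temporal-derivative L2 terms of Eqs. (7)-(8) for both branches
    (the reduction over elements is an assumption)."""
    def pair(y, y_prev, y_hat, y_hat_prev):
        l_v = torch.mean((y - y_hat) ** 2)
        l_d = torch.mean(((y - y_prev) - (y_hat - y_hat_prev)) ** 2)
        return l_v, l_d
    return (*pair(s, s_prev, s_hat, s_hat_prev), *pair(r, r_prev, r_hat, r_hat_prev))

# Stand-in for the full encoder-decoder; § 4.4: Adam, lr 1e-4, 50 epochs, batch 100.
model = apply_weight_norm(nn.Sequential(nn.Linear(512, 256)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```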
Our entire uncompressed model has a memory footprint of about 66.9 MB. In comparison, the VOCA model [6], with its integrated DeepSpeech module [13], is around 477.3 MB—more than 7 times as large as ours. Of our 66.9 MB, the trainable parameters account for only 26.63 MB; the fixed components from PCA occupy the rest. The small number of trainable parameters helps to avoid overfitting. Furthermore, by converting 32-bit floats into 16-bit floats and removing some of the less important PCA components, one can compress the model greatly and migrate it to mobile devices.
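As a rough illustration of that compression path (the number of retained PCA components is an assumption):

```python
import numpy as np
from torch import nn

def compress(model: nn.Module, pca_basis: np.ndarray, keep: int = 64):
    """Cast trainable weights to 16-bit floats and keep only the leading
    `keep` PCA components; both steps trade a little accuracy for size."""
    model = model.half()                             # 32-bit -> 16-bit floats
    basis = pca_basis[:, :keep].astype(np.float16)   # drop less important components
    return model, basis
```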
Table 1  Layers of the formant analysis module.

  Layer            Kernel size 2)   Stride 2)   Activation     Output shape
  Mel spectrogram  –                –           –              3 × 128 × 64
  Convolution2d    3 × 1            1 × 1       lrelu:0.2 1)   32 × 128 × 64
  MaxPool2d        2 × 1            2 × 1       –              32 × 64 × 64
  Convolution2d    3 × 1            1 × 1       lrelu:0.2      64 × 64 × 64
  MaxPool2d        2 × 1            2 × 1       –              64 × 32 × 64
  Convolution2d    1 × 1            1 × 1       lrelu:0.2      64 × 32 × 64
  Spec-BiLSTM      –                –           –              64 × 32 × 64
  Frequency stack  –                –           –              2048 × 1 × 64
  Fully connected  –                –           –              256 × 1 × 64
  Squeezing        –                –           –              256 × 64

  1) lrelu:0.2: Leaky ReLU activation with leaky rate 0.2.
  2) Both kernel size and stride are given in Spectral × Temporal shape.

Table 2  Layers of the articulation analysis module.

  Layer         Output shape
  Time-BiLSTM   512 × 64
  Time-BiLSTM   512 × 64
  Attention     512 × 1
  Squeezing     512

[…] sense may be picked up immediately by human eyes as something uncannily wrong [53]. This may explain why none of the most relevant works (e.g., by Karras et al. [4] and Cudeiro et al. [6]) includes a quantitative evaluation. Therefore, we conduct several user studies and a qualitative evaluation.

5.1 User studies

Three blind user studies are published on Amazon Mechanical Turk (AMT) in the form of A/B choices. In each study, a user is presented with two side-by-side synchronized animation clips driven by the same audio and asked to choose the better one. The template meshes and rendering configurations remain the same for both results. The user can also play the animation at half speed to better compare motion details.

To prevent participants from randomly picking answers without careful evaluation, we adopt two measures. First, a user can only give an answer after watching both clips. Second, several pairs with an obvious answer are included as qualification questions, and we only accept answers from users who have passed the qualification.

We have collected a total of 5600 HITs (human intelligence tasks), each representing one participant making one choice between two clips. Notably, the numbers of both utterances and collected HITs in our user studies are several times larger than those of the similar studies conducted by Cudeiro et al. [6].

5.1.1 Comparison with captured data

We have collected 50 HITs for each result pair, totaling 2400 HITs. Overall, the participants' preference leans only slightly toward deformation gradients (53.58% ± 3.42%). But interestingly, deformation gradients seem to work significantly better in sentences involving a transition from the phoneme /s/ to any of /m/, /b/, or /p/. As shown in Fig. 3, the first five columns from the left are sentences with the aforementioned phonetic transitions. Please watch the supplementary video for a dynamic comparison.

[Fig. 3: User study comparing deformation gradients and vertex offsets; vertical axis: percentage of choice (%), horizontal axis: sentences 1–12.]
[Fig. 5: Comparison with Karras et al. [4]. (a) Karras et al., (b) their model + our data, (c) ours. The speaker is about to pronounce the phoneme /b/ at this moment, when the lips are supposedly pressed.]

[Fig. 7: Data augmentation in the audio signal. (a) Without audio augmentation, (b) our full model. The consonant /b/ from a test sentence is pronounced at this moment.]
[…] around pressed lips, especially well. Our model is significantly smaller than those featuring pre-trained ASR modules, but offers comparable robustness and higher quality results in […]

References

10. Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 2017.
24. […] expressive facial animation system. In Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games, pages 21–26. ACM, 2007.
25. Michael M Cohen and Dominic W Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, pages 139–156. Springer, 1993.
26. Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131–140. ACM, 2013.
27. Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 353–360, 1997.
28. Tony Ezzat, Gadi Geiger, and Tomaso Poggio. Trainable videorealistic speech animation. ACM Transactions on Graphics (TOG), 21(3):388–398, July 2002.
29. Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation, pages 275–284, 2012.
30. Matthew Brand. Voice puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 21–28. ACM Press/Addison-Wesley Publishing Co., 1999.
31. Lei Xie and Zhi-Qiang Liu. Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Transactions on Multimedia, 9(3):500–510, 2007.
32. Lijuan Wang, Wei Han, Frank K Soong, and Qiang Huo. Text driven 3D photo-realistic talking head. In Interspeech 2011, pages 3307–3308, 2011.
33. Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, and Frank K Soong. A new language independent, photo-realistic talking head driven by voice only. In Interspeech 2013, pages 2743–2747, 2013.
34. Taiki Shimba, Ryuhei Sakurai, Hirotake Yamazoe, and Joo-Ho Lee. Talking heads synthesis from audio with deep neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII), pages 100–105. IEEE, 2015.
35. Bo Fan, Lei Xie, Shan Yang, Lijuan Wang, and Frank K Soong. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications, 75(9):5287–5309, 2016.
36. Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. Generating talking face landmarks from speech. In International Conference on Latent Variable Analysis and Signal Separation, pages 372–381. Springer, 2018.
37. Deepali Aneja and Wilmot Li. Real-time lip sync for live 2D animation. arXiv preprint arXiv:1910.08685, 2019.
38. David Greenwood, Iain Matthews, and Stephen Laycock. Joint learning of facial expression and head pose from speech. In Proc. Interspeech 2018, pages 2484–2488, 2018.
39. Danny Websdale, Sarah Taylor, and Ben Milner. The effect of real-time constraints on automatic speech animation. In Proc. Interspeech 2018, pages 2479–2483, 2018.
40. Jean-Luc Schwartz and Christophe Savariaux. No, there is no 150 ms lead of visual speech on auditory speech, but a range of audiovisual asynchronies varying from small audio lead to large audio lag. PLOS Computational Biology, 10(7):e1003743, 2014.
41. Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
42. Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
43. Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 128(5):1398–1413, May 2020.
44. Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7832–7841, 2019.
45. Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
46. Tara N Sainath and Bo Li. Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In Interspeech 2016, pages 813–817, 2016.
47. Yuzhou Liu and DeLiang Wang. Time and frequency domain long short-term memory for noise robust pitch tracking. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5600–5604. IEEE, 2017.
48. Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.
49. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
50. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff, 2017.
51. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
52. Paul Ekman, Wallace V Friesen, and Joseph C Hager. Facial Action Coding System: The Manual on CD-ROM. Instructor's Guide. Salt Lake City: Network Information Research Co., 2002.
53. Masahiro Mori, Karl F MacDorman, and Norri Kageki. The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19(2):98–100, 2012.
54. Changil Kim, Hijung Valentina Shin, Tae-Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, and Wojciech Matusik. On learning associations of faces and voices. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 276–292. Springer, 2018.