Lip Reading Word Classification
Abiel Gutierrez, Stanford University
Zoe-Alanah Robert, Stanford University
paper. Chung & Zisserman [13] made use of the work of Graves et al. by using spatiotemporal CNNs for word classification on the BBC TV dataset. Assael et al. [1] created LipNet, a phrase predictor that uses spatiotemporal convolutions and bidirectional GRUs and achieved an 11.4% WER on unseen speakers. Our model is primarily inspired by this work. We also took inspiration from Garg et al. [14], where a pre-trained VGG was used for transfer learning on the MIRACL-VC1 dataset. A much more comprehensive list of lip reading works can be found in Zhou et al. [15].

Figure 1: (left to right) Original input image (part of a sequence) in the MIRACL-VC dataset; OpenCV and dlib facial recognition software labelling key points around a detected face; final cropped image.
3. Dataset and Features

We used the MIRACL-VC1 data set [0], which contains both depth and color images of fifteen speakers uttering ten words and ten phrases, ten times each. Each sequence of images represents low-quality video frames. The data set contains 3,000 sequences of varying lengths, made up of 640 x 480 pixel images in both color and depth representations, collected at 15 frames per second. The lengths of these sequences range from 4 to 27 image frames. The words and phrases are as follows:

Words: begin, choose, connection, navigation, next, previous, start, stop, hello, web
Phrases: Stop navigation, Excuse me, I am sorry, Thank you, Good bye, I love this game, Nice to meet you, You are welcome, How are you, Have a good time

To save time and keep data sizes manageable, we focused on building a classifier that identifies which word is being uttered from a sequence of images of the speaker. We ignored the phrase data as well as the depth images for the spoken-word data. We built classifiers for both seen and unseen people. (Seen means that the model is trained on all people saying all words but holds out certain trials for validation and testing. Unseen removes people from training and assigns them exclusively to either validation or testing; the split is thirteen people for training, one for validation, and one for testing.) The resulting datasets contain (1200/150/150) (train/test/validation) examples for seen and (1300/100/100) (train/test/validation) examples for unseen. The class label distribution for the dataset is even, as each person performs the same number of trials per word.
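A minimal sketch of these two splits, assuming each sequence is indexed by person, word, and trial; the specific held-out trial indices for the seen split are an assumption, since the paper only states that certain trials are reserved:

def seen_split(samples):
    """Seen split: every person appears in training; two trials per word are held out."""
    train = [s for s in samples if s.trial_id < 8]    # 15 people x 10 words x 8 trials = 1200
    val   = [s for s in samples if s.trial_id == 8]   # 150
    test  = [s for s in samples if s.trial_id == 9]   # 150
    return train, val, test

def unseen_split(samples):
    """Unseen split: thirteen people for training, one for validation, one for testing."""
    train = [s for s in samples if s.person_id <= 12]  # 13 x 10 x 10 = 1300
    val   = [s for s in samples if s.person_id == 13]  # 100
    test  = [s for s in samples if s.person_id == 14]  # 100
    return train, val, test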
Preprocessing was an important part of working with this dataset. First, we utilized a Python facial recognition library, dlib, in conjunction with OpenCV and a pre-trained model [2] to locate the facial landmark points in each image and crop it to include only the face of the speaker, excluding any background that could interfere with the training of the model. We limited every facial crop to a 90x90 pixel square in order to create uniform input data sequences for the model.
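A minimal sketch of this cropping step, assuming dlib's frontal face detector and OpenCV for image handling; how frames without a detected face were handled is not stated in the paper, so they are simply skipped here:

import cv2
import dlib

CROP_SIZE = 90                               # 90x90 crops give uniform input sequences
detector = dlib.get_frontal_face_detector()  # dlib's pre-trained HOG-based face detector

def crop_face(image_path):
    """Detect the largest face in a frame and return a 90x90 crop of it."""
    img = cv2.imread(image_path)
    faces = detector(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 1)  # upsample once for small faces
    if len(faces) == 0:
        return None                          # no face found in this frame
    face = max(faces, key=lambda r: r.width() * r.height())
    top, bottom = max(face.top(), 0), face.bottom()
    left, right = max(face.left(), 0), face.right()
    return cv2.resize(img[top:bottom, left:right], (CROP_SIZE, CROP_SIZE))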
One issue with this data set is its small size. To increase the number of training sequences, we performed data augmentation. We tripled the data set in size by adding a horizontally flipped version of each image and a randomly pixel-jittered version of each image.
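A sketch of this augmentation, assuming each training example is a list of cropped frames; the magnitude of the pixel jitter is an assumption, since the paper does not specify it:

import numpy as np

def augment_sequence(frames):
    """Return the original sequence plus a flipped copy and a pixel-jittered copy."""
    flipped = [np.fliplr(f) for f in frames]
    def jitter(f):
        noise = np.random.randint(-10, 11, f.shape)            # assumed jitter range
        return np.clip(f.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    jittered = [jitter(f) for f in frames]
    return [frames, flipped, jittered]                          # triples the data set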
Figure 2: (left to right) Original input image (part of a sequence) in the MIRACL-VC dataset; a horizontally flipped image; a jittered image.

In summary, each model receives a single image sequence as input, with anywhere from 4 to 27 images in the sequence, and produces a single word classification label as output.

Figure 3: Example full input sequence of length 5. The subject is speaking "begin."

Figure 4: Example full input sequence of length 10. The subject is speaking "hello."

4. Methods

In this section we describe the different models that we created to solve the lip reading problem. We created four models: a baseline CNN + LSTM network; a deeper, more robust CNN + LSTM network inspired by DeepMind's LipNet [1]; an LSTM network placed on top of bottleneck features produced by a VGG-16 network pre-trained on ImageNet; and the same LSTM network on top of VGG-16 with fine-tuning of the last convolutional block.

4.1 CNN + LSTM Baseline

Our first model ran every image of our sequenced input through a Convolutional Neural Network and then fed the flattened outputs as a sequence into a Long Short-Term Memory Recurrent Neural Network, which produced a single output, making it a many-to-one RNN. We then added a Fully Connected layer that mapped to 10 units and used a softmax activation layer to produce the probabilities of every word, of which we took the highest.
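A minimal Keras [22] sketch of this baseline is shown below. The filter count, LSTM width, padded sequence length, and optimizer are assumptions, since the paper does not list these hyperparameters:

from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, LSTM, Dense

MAX_FRAMES, SIZE, NUM_WORDS = 27, 90, 10       # sequences padded to the longest length

model = Sequential([
    # The same single-layer CNN is applied to every frame of the sequence.
    TimeDistributed(Conv2D(32, (3, 3), activation='relu'),
                    input_shape=(MAX_FRAMES, SIZE, SIZE, 3)),
    TimeDistributed(MaxPooling2D((2, 2))),
    TimeDistributed(Flatten()),
    LSTM(256),                                 # many-to-one: only the final state is kept
    Dense(NUM_WORDS, activation='softmax'),    # probabilities over the 10 words
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])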
The LSTM was added to package the entire sequence of CNN outputs into a single representation without losing the temporal structure of the video frames. In particular, an LSTM addresses the vanishing gradient problem present in vanilla RNNs, which inhibits gradients from propagating back through long sequences [16]. It does so by adding four gates (input i, forget f, output o, and new memory c) whose activations are learned, in order to control whether or not to hold on to information:
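\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

where \sigma is the logistic sigmoid, \odot denotes elementwise multiplication, x_t is the input at time step t, h_{t-1} is the previous hidden state, and the W, U, and b terms are learned parameters, following the standard formulation of [16].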
Given that we use softmax as our last activation, our loss function is the cross-entropy loss:
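L = -\sum_{i=1}^{10} y_i \log \hat{y}_i

where y is the one-hot ground-truth label over the 10 word classes and \hat{y} is the vector of softmax outputs.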
4.3 ImageNet Pretrained VGG-16 Features + LSTM

Given that we decided to focus only on words and not phrases, we were limited to 1,500 data points (15 people uttering 10 words for 10 iterations each). This is a very small dataset by deep learning standards; as a result, we decided to employ transfer learning by making use of a VGG-16 network pre-trained on ImageNet [17]. VGG networks [18] were developed under the premise that a smaller kernel size allows for deeper networks without increasing the number of parameters, thus increasing the number of non-linearities without requiring more memory.

ImageNet is a dataset currently consisting of more than 14 million images, some of which contain annotations and object bounding boxes. 952,000 of those images are of humans, which is why this network can be usefully applied to the lip reading problem [17].

We extracted the bottleneck features of our dataset up to the last convolutional layer of VGG. We then fed this result into the top portion of our second model (the Bidirectional LSTM) and performed the training. This meant that we froze the VGG model weights and only updated the weights of our LSTM and Dense layers. This sped up training considerably, given that our data only had to be processed by the VGG network once per training trial.
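A sketch of this setup, assuming Keras' built-in VGG-16 application [22]; the LSTM width and the padded sequence length of 27 frames are assumptions, and the 2x2x512 feature shape follows from 90x90 inputs passing through VGG-16's five pooling stages:

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import TimeDistributed, Flatten, Bidirectional, LSTM, Dense

vgg = VGG16(weights='imagenet', include_top=False, input_shape=(90, 90, 3))

def bottleneck_features(sequence):
    """sequence: (num_frames, 90, 90, 3) array of cropped frames -> frozen VGG features."""
    return vgg.predict(sequence)              # VGG weights are never updated

# Only this top model is trained, on the pre-computed bottleneck features.
top_model = Sequential([
    TimeDistributed(Flatten(), input_shape=(27, 2, 2, 512)),   # 27 padded frames
    Bidirectional(LSTM(256)),
    Dense(10, activation='softmax'),
])
top_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])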
4.4 Fine-tuned VGG-16 + LSTM

Our last model consisted of unfreezing the last convolutional block of the pretrained VGG-16 model (3 CNN layers with 512 3x3 kernels each) and training it alongside the Bidirectional LSTM and Dense layers that we placed on top. This was done with the intention of capturing the more complex features of our input images, since later CNN layers tend to capture less obvious characteristics of an image. Given that the unfrozen convolutional block was initialized with pretrained ImageNet weights, we also initialized the LSTM and Dense layers with the weights of the top portion of our third model (the LSTM and Dense layers trained on top of the VGG-16 bottleneck features), to prevent a random weight initialization from recklessly modifying the VGG weights during backpropagation. We also guarded against this by switching to a plain SGD optimizer rather than Adam, since SGD updates weights more gently.
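A sketch of the fine-tuning stage under the same assumptions; the learning rate, momentum, and the weight file used to initialize the top layers are illustrative:

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import TimeDistributed, Flatten, Bidirectional, LSTM, Dense
from keras.optimizers import SGD

vgg = VGG16(weights='imagenet', include_top=False, input_shape=(90, 90, 3))
for layer in vgg.layers:
    layer.trainable = layer.name.startswith('block5')   # unfreeze only the last conv block

model = Sequential([
    TimeDistributed(vgg, input_shape=(27, 90, 90, 3)),  # run VGG-16 on every frame
    TimeDistributed(Flatten()),
    Bidirectional(LSTM(256)),
    Dense(10, activation='softmax'),
])
# model.load_weights('bottleneck_top_weights.h5', by_name=True)  # hypothetical file from model 3

# Plain SGD with a small learning rate updates the pretrained weights more gently than Adam.
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])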
Model                    Train   Validation   Test
Frozen VGG + LSTM        100%    76%          55%
Fine-tuned VGG + LSTM    100%    79%          59%

Figure 9: Seen Subject Accuracy

Figure 10: Seen Subject Accuracy Comparison Graph
5.1 CNN + LSTM Baseline

Our baseline got surprisingly good results for seen subjects, with 39% accuracy on the test set. As a result, we suspect that it is relatively easy to distinguish between uttered words when both the word and the subject have been seen multiple times before. This would mean that there are several high-level features that make this distinction easy, which is why only one CNN layer was required to obtain decent results.

5.2 Deep Layered CNN + LSTM

Our deep layered model performed worse than our baseline, with a test accuracy of 25%, significantly lower than validation (39%) and training (52%).

Building on the analysis made for the baseline, it seems that classifying words spoken by seen subjects is not a complex task, so the richer representations output by our three-layer convolutional network were unnecessary. Another distinction between this model and the baseline was the addition of dropout and batch normalization. Again, regularization appears to have been unnecessary for seen subjects, so it harmed this model's performance. The jagged loss plot below suggests that the model had trouble navigating the validation loss surface, which might indicate that it did not develop a good intuition for which features to look for when making predictions.

Figure 11: Loss and Accuracy Plots for Deep-Layered CNN + LSTM

5.3 ImageNet Pretrained VGG-16 Features + LSTM
Figure 13: Confusion Matrix for VGG-16 + LSTM
Our model is predicting "stop" for 92% of the words. We realized that cross-validation could have helped mitigate this issue; a possible explanation for this result is that the person in the test set spoke faster than any other subject, and as a result most of the words uttered by that subject are classified as "stop", since "stop" has perhaps the shortest pronunciation in the dataset.
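For reference, a sketch of how such a per-word confusion matrix can be computed with scikit-learn [23]; model, test_sequences, and test_labels are placeholder names for a trained classifier and its held-out unseen-subject data:

import numpy as np
from sklearn.metrics import confusion_matrix

WORDS = ['begin', 'choose', 'connection', 'navigation', 'next',
         'previous', 'start', 'stop', 'hello', 'web']

y_true = np.argmax(test_labels, axis=1)                     # one-hot labels -> word indices
y_pred = np.argmax(model.predict(test_sequences), axis=1)   # predicted word indices

cm = confusion_matrix(y_true, y_pred, labels=range(10))
cm = cm / cm.sum(axis=1, keepdims=True)                     # row-normalize: rows are true words
print(dict(zip(WORDS, np.diag(cm))))                        # per-word recall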
The confusion matrix of the unseen subject for our fine-tuned VGG model also suggests that short words are frequently confused with one another.
Given more time and resources, the models outlined in this paper could be greatly improved. We think the addition of regularization would further reduce the overfitting in our models. We also have yet to experiment with the number of filters in the convolutional layers: we used only 3 filters per layer, just as LipNet did, while other papers used anywhere from 64 to 512 filters per CNN layer. Additionally, accuracy improvements could be found with further hyperparameter tuning and investigation of more optimizer types. We also would have gotten value from saliency maps; without them it is hard to know whether the model is actually focusing on mouth data or on other aspects of the input sequences. Finally, analyzing confusion matrices earlier in our exploration process could have helped us mitigate the problems that we ran into with unseen subjects, since we could have adjusted our models based on the patterns we observed.

This project is easily extensible and raises the question of how to perform visual speech recognition on a much larger corpus (perhaps the entire English dictionary). How could the addition of audio data improve our ability to interpret the video as text? Is it easier to understand speech from video of a single word being spoken or of entire phrases and sentences? This question could readily be investigated, since the MIRACL-VC1 dataset includes phrase inputs, and would be an interesting area of exploration. Additionally, most real-life speech recognition tasks require phrase inputs rather than single words.

7. References

[0] Ahmed Rekik, Achraf Ben-Hamadou, and Walid Mahdi. A new visual speech recognition approach for RGB-D cameras. In Image Analysis and Recognition - 11th International Conference, ICIAR 2014, Vilamoura, Portugal, October 22-24, 2014, Proceedings, Part II, pages 21–28, 2014.
[1] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. LipNet: Sentence-level lipreading. CoRR, abs/1611.01599, 2016.
[2] Yuru Pei, Tae-Kyun Kim, and Hongbin Zha. Unsupervised random forest manifold alignment for lipreading. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[3] A. Rekik, A. Ben-Hamadou, and W. Mahdi. Human machine interaction via visual speech spotting. In S. Battiato, J. Blanc-Talon, G. Gallo, W. Philips, D. Popescu, and P. Scheunders (eds.), Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, vol. 9386. Springer, Cham, 2015.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[5] C. Sui, R. Togneri, and M. Bennamoun. Extracting deep bottleneck features for visual speech recognition. In ICASSP, pages 1518–1522, 2015.
[6] S. Petridis and M. Pantic. Deep complementary bottleneck features for visual speech recognition. In IEEE ICASSP, pages 2304–2308, 2016.
[7] Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki. Lip reading using a dynamic feature of lip images and convolutional neural networks. In IEEE/ACIS Intl. Conf. on Computer and Information Science, pages 1–6, 2016.
[8] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
[9] Ayaz A. Shaikh, Dinesh K. Kumar, Wai C. Yau, M. Z. Che Azemin, and Jayavardhana Gubbi. Lip reading using optical flow and support vector machines. 2010 3rd International Congress on Image and Signal Processing, 1:327–330, 2010.
[10] O. Koller, H. Ney, and R. Bowden. Deep learning of mouth shapes for sign language. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 85–91, 2015.
[11] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, Pittsburgh, USA, 2006.
[12] M. Wand, J. Koutnik, and J. Schmidhuber. Lipreading with long short-term memory. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6115–6119, 2016.
[13] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, 2016.
[14] Ravi Garg, Vijay Kumar B. G., and Ian D. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. CoRR, abs/1603.04992, 2016.
[15] Z. Zhou, X. Hong, G. Zhao, and M. Pietikäinen. A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1), 2014.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. http://image-net.org
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[19] JJ Allaire, Dirk Eddelbuettel, Nick Golding, and Yuan Tang. tensorflow: R Interface to TensorFlow, 2016. https://github.com/rstudio/tensorflow
[20] G. Bradski. OpenCV. Dr. Dobb's Journal of Software Tools, 2000.
[21] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13:22–30, 2011. DOI:10.1109/MCSE.2011.37
[22] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[23] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[24] R. Kotikalapudi. Keras Visualization Toolkit. MIT. https://github.com/raghakot/keras-vis
[25] John D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9:90–95, 2007. DOI:10.1109/MCSE.2007.55
[26] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[27] "Building powerful image classification models using very little data." https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
[28] "Real-time facial landmark detection with OpenCV, Python, and dlib." http://www.pyimagesearch.com/2017/04/17/real-time-facial-landmark-detection-opencv-python-dlib/