Facial recognition has been an active field in AI and ML for several years, and its cultural and social implications are far-reaching. However, a performance gap between the human visual system and machines still limits the applications of facial recognition.
To close that gap and deliver human-level accuracy, Meta (then Facebook) introduced DeepFace, a facial recognition framework. The DeepFace model is trained on a large facial dataset that differs significantly from the datasets used to construct the evaluation benchmarks, and it has the potential to outperform existing frameworks with minimal adaptation. Furthermore, DeepFace produces a compact face representation, whereas other systems produce thousands of facial appearance features.
The proposed DeepFace framework uses deep learning to train on a large dataset consisting of different forms of data including images, videos, and graphics. The DeepFace network architecture assumes that once alignment is complete, the location of every facial region is fixed at the pixel level. It is therefore possible to learn directly from the raw RGB pixel values without stacking multiple convolutional layers, as other frameworks do.
The conventional pipeline of modern facial recognition frameworks comprises four stages: Detection, Alignment, Representation, and Classification. The DeepFace framework employs explicit 3D face modeling to apply a piecewise affine transformation, and uses a nine-layer deep neural network to derive a facial representation. The DeepFace framework attempts to make the following contributions:
- Develop an effective deep neural network (DNN) architecture that can leverage a large dataset to create a facial representation that generalizes to other datasets.
- Use explicit 3D modeling to develop an effective facial alignment system.
Understanding the Working of the DeepFace Model
Face alignment is a technique that rotates the image of a person according to the angle of the eyes. It is a popular preprocessing step for facial recognition, and aligned datasets improve the accuracy of recognition algorithms by giving them a normalized input. However, aligning faces in unconstrained settings is challenging because of factors such as non-rigid expressions, body pose, and more. Several sophisticated alignment techniques, such as using an analytical 3D model of the face or searching for fiducial points in an external dataset, can help developers overcome these challenges.
Although alignment is the most popular method for dealing with unconstrained face verification and recognition, there is no perfect solution at the moment. 3D models are also used, but their popularity has dropped significantly in the past few years, especially in unconstrained environments. However, because human faces are 3D objects, 3D modeling may be the right approach if used correctly. The DeepFace model uses fiducial points to build an analytical 3D model of the face. This 3D model is then used to warp a facial crop to a 3D frontal mode.
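As a minimal sketch of the eye-based rotation described above, the snippet below computes the in-plane angle of the line joining the two eyes and rotates a point so that this line becomes horizontal. The function names and the example coordinates are illustrative assumptions, not part of DeepFace.

```python
import math

def eye_alignment_angle(left_eye, right_eye):
    """Angle (degrees) of the line joining the eyes, in image coordinates."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_point(point, center, angle_deg):
    """Rotate `point` around `center` by -angle_deg, levelling the eye line."""
    theta = math.radians(-angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(theta) - y * math.sin(theta),
            center[1] + x * math.sin(theta) + y * math.cos(theta))

# Hypothetical eye positions in pixels (x, y).
left_eye, right_eye = (30.0, 40.0), (70.0, 60.0)
angle = eye_alignment_angle(left_eye, right_eye)
levelled_right_eye = rotate_point(right_eye, left_eye, angle)
```

After the rotation, both eyes share the same vertical coordinate, which is the normalized input that the recognition stage expects.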
Furthermore, like most alignment practices, the DeepFace alignment uses fiducial point detectors to direct the alignment process. Although the DeepFace model uses a simple point detector, it applies it over several iterations to refine the output. At each iteration, a Support Vector Regressor (SVR) trained to predict point configurations extracts the fiducial points from an image descriptor. DeepFace's image descriptor is based on LBP histograms, although other features can also be considered.
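To make the idea of an LBP histogram descriptor concrete, here is a minimal sketch of the basic 3×3 Local Binary Pattern. It is an assumption that this plain variant is close to what the descriptor uses; the text above only says the descriptor is based on LBP histograms. The function name and input layout are illustrative.

```python
def lbp_histogram(gray):
    """Histogram of basic 3x3 LBP codes over the interior pixels.

    `gray` is a list of rows of grayscale intensities.
    """
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            center = gray[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if gray[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

The resulting 256-bin histogram is the kind of fixed-length vector that a regressor such as an SVR can take as input.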
The DeepFace model starts the alignment process by detecting six fiducial points within the detection crop, centered at the middle of the eyes, the mouth locations, and the tip of the nose. These points are used to rotate, scale, and translate the image onto six anchor locations, and the process iterates on the warped image until there is no visible change. The aggregated transformation then generates a 2D-aligned crop. The alignment method is quite similar to the one used in LFW-a, which has been used over the years to boost model accuracy.
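The rotate, scale, and translate step can be sketched as a least-squares fit of a 2D similarity transform from detected points to anchor points. The version below treats each point as a complex number, which gives a short closed-form solution. The function names are assumptions made for illustration, and the iterative re-detection loop is omitted.

```python
def fit_similarity(points, anchors):
    """Least-squares similarity transform mapping points onto anchors.

    Each 2D point is treated as z = x + iy, so the transform is w = a * z + b,
    where |a| is the scale and the phase of a is the rotation angle.
    """
    z = [complex(x, y) for x, y in points]
    w = [complex(x, y) for x, y in anchors]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    a = num / den
    b = w_mean - a * z_mean
    return a, b

def apply_similarity(a, b, points):
    """Apply w = a * z + b to a list of (x, y) points."""
    out = [a * complex(x, y) + b for x, y in points]
    return [(p.real, p.imag) for p in out]
```

In an iterative scheme, the fitted transform would be applied to the image, the points re-detected on the warped result, and the individual transforms composed until the update becomes negligible.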
To align faces with out-of-plane rotations, the DeepFace framework uses a generic 3D shape model and registers a 3D affine camera that is used to warp the 2D-aligned crop to the image plane of the 3D shape. This generates the 3D-aligned version of the crop, and it is achieved by localizing an additional 67 fiducial points in the 2D-aligned crop using a second Support Vector Regressor (SVR).
The 67 anchor points are placed manually on the 3D shape, which gives full correspondence between the 3D references and the detected fiducial points. In the next step, a 3D-to-2D affine camera is fitted using the generalized least squares solution to a linear system with a known covariance matrix, that is, the camera that minimizes the covariance-weighted residual.
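The fitting step above can be written compactly as a generalized least squares problem. The notation here is introduced only for illustration and is an assumption: $x_{2d}$ stacks the detected 2D fiducial coordinates, $X_{3d}$ is the design matrix built from the 3D reference points, $\vec{P}$ holds the unknown entries of the affine camera, and $\Sigma$ is the known covariance matrix.

```latex
\vec{P}^{*} = \arg\min_{\vec{P}} \; r^{\top} \Sigma^{-1} r,
\qquad r = x_{2d} - X_{3d}\,\vec{P}

% Standard closed-form solution of generalized least squares:
\vec{P}^{*} = \left( X_{3d}^{\top} \Sigma^{-1} X_{3d} \right)^{-1}
              X_{3d}^{\top} \Sigma^{-1} x_{2d}
```

With $\Sigma$ equal to the identity, this reduces to ordinary least squares.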
Since non-rigid deformations and full perspective projections are not modeled, the fitted 3D-to-2D camera serves only as an approximation. To reduce the corruption of important identity-bearing factors in the final warp, the DeepFace model adds the corresponding residuals to the x-y components of each reference fiducial point. This relaxation is plausible because it warps the 2D image with less distortion to the identity. Without it, all faces would be warped into the same 3D shape, losing important discriminative factors in the process.
Finally, the model achieves frontalization by using a piecewise affine transformation directed by the Delaunay triangulation derived from the 67 fiducial points.
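A minimal sketch of this relaxation, assuming the residual of each point is simply the detected 2D position minus the position predicted by the fitted camera, and that image-plane and reference-shape units are directly comparable. The function and argument names are hypothetical.

```python
def relax_reference_points(reference_3d, projected_2d, detected_2d):
    """Add 2D residuals to the x-y components of the 3D reference points.

    reference_3d: list of (x, y, z) reference fiducial points
    projected_2d: list of (x, y) positions predicted by the fitted camera
    detected_2d:  list of (x, y) fiducial points detected in the 2D-aligned crop
    """
    relaxed = []
    for (x, y, z), (px, py), (dx, dy) in zip(reference_3d, projected_2d, detected_2d):
        rx, ry = dx - px, dy - py  # residual left over after the camera fit
        relaxed.append((x + rx, y + ry, z))  # z is left untouched
    return relaxed
```

Because each face contributes its own residuals, the relaxed reference points differ from face to face, so the warp does not force every face into one identical shape.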
- Detected face with 6 fiducial points.
- Induced 2D-aligned crop.
- 67 fiducial points on the 2D-aligned crop.
- Reference 3D shape transformed to the 2D-aligned crop image plane.
- Triangle visibility with respect to the 3D-2D camera.
- 67 fiducial points induced by the 3D model.
- 3D-aligned version of the final crop.
- New view generated by the 3D model.
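The piecewise affine warp used for frontalization can be sketched as follows: each triangle of the triangulation defines its own affine map, and a point is moved with the map of the triangle that contains it. This sketch assumes the triangulation is already given as matching lists of source and target triangles, and the function names are illustrative. Computing the Delaunay triangulation itself is omitted.

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def warp_point(p, src_tri, dst_tri):
    """Map p from src_tri to dst_tri with the affine map fixed by the vertices."""
    l1, l2, l3 = barycentric(p, src_tri)
    x = l1 * dst_tri[0][0] + l2 * dst_tri[1][0] + l3 * dst_tri[2][0]
    y = l1 * dst_tri[0][1] + l2 * dst_tri[1][1] + l3 * dst_tri[2][1]
    return (x, y)

def piecewise_affine(p, src_tris, dst_tris):
    """Warp p using the first source triangle that contains it."""
    for s, d in zip(src_tris, dst_tris):
        if all(l >= -1e-12 for l in barycentric(p, s)):
            return warp_point(p, s, d)
    return None  # p lies outside the triangulated region
```

Because neighbouring triangles share vertices, the individual affine maps agree along shared edges, which keeps the overall warp continuous.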
As the amount of training data has grown, learning-based methods have proved more efficient and accurate than engineered features, primarily because learning-based methods can discover and optimize features for a specific task.
DNN Architecture and Training
The DeepFace DNN is trained on a multi-class facial recognition task that classifies the identity of a face image.
The above figure represents the overall architecture of the DeepFace model. A 3D-aligned, 3-channel RGB image of size 152×152 pixels is fed to a convolutional layer (C1) with 32 filters of size 11×11×3, which produces 32 feature maps. These feature maps are then fed to a max pooling layer (M2) that takes the maximum over 3×3 spatial neighborhoods with a stride of 2, separately for each channel. It is followed by another convolutional layer (C3) that comprises 16 filters, each of size 9×9×16. The primary purpose of these layers is to extract low-level features such as texture and simple edges. The advantage of max pooling layers is that they make the output of the convolutional layers more robust to local translations, and when applied to aligned face images, they make the network much more robust to small registration errors.
Multiple levels of pooling do make the network more robust in certain situations, but they also cause the network to lose information about the precise position of micro-textures and detailed facial structures. To avoid losing this information, the DeepFace model applies max pooling only after the first convolutional layer. The model treats these first layers as a front-end adaptive pre-processing step. Although they account for most of the computation, they hold few parameters, and they merely expand the input into a set of local features.
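The spatial sizes implied by these layers can be checked with a little arithmetic. The sketch below assumes unpadded convolutions with stride 1 and a pooling layer that keeps the last partial window (ceiling rounding); neither assumption is stated in the text above, and the helper names are illustrative.

```python
import math

def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output width of pooling that keeps the last partial window."""
    return math.ceil((size - kernel) / stride) + 1

c1 = conv_out(152, 11)    # C1: 11x11 filters on a 152x152 input
m2 = pool_out(c1, 3, 2)   # M2: 3x3 max pooling with stride 2
c3 = conv_out(m2, 9)      # C3: 9x9 filters
print(c1, m2, c3)         # prints: 142 71 63
```

Under these assumptions the feature maps shrink from 152 to 142, then to 71 after pooling, and to 63 after the second convolution.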
The following layers (L4, L5, and L6) are locally connected. Like a convolutional layer, they apply a filter bank, but every location in the feature map learns a different set of filters. Because different regions of an aligned image have different local statistics, the spatial stationarity assumption of convolution does not hold. For example, the area between the eyebrows and the eyes has a higher discrimination ability than the area between the mouth and the nose. The use of local layers affects the number of parameters subject to training but does not affect the computational burden of feature extraction.
The DeepFace model can afford three locally connected layers only because it has a large amount of well-labeled training data. The use of locally connected layers is further justified by the fact that each output unit of a locally connected layer is affected by a large patch of the input.
Finally, the top two layers are fully connected, with each output unit connected to all inputs. These layers can capture correlations between features from distant parts of the face image, such as the position and shape of the mouth and the position and shape of the eyes. The network uses the output of the first fully connected layer (F7) as its raw face representation feature vector. The output of the last fully connected layer (F8) is then fed to a K-way softmax that produces a distribution over class labels.
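The trade-off between parameters and computation can be illustrated with a quick count. The layer size used here (16 filters of 9×9×16 producing a 55×55 output map) is a hypothetical example chosen for the arithmetic, biases are ignored, and the helper names are illustrative.

```python
def conv_params(n_filters, k, in_ch):
    """Weights in a convolutional layer: one filter bank shared by all locations."""
    return n_filters * k * k * in_ch

def local_params(n_filters, k, in_ch, out_size):
    """Weights in a locally connected layer: one filter bank per output location."""
    return out_size * out_size * conv_params(n_filters, k, in_ch)

def multiply_adds(n_filters, k, in_ch, out_size):
    """Multiply-accumulate operations per forward pass, identical for both types."""
    return out_size * out_size * n_filters * k * k * in_ch

shared = conv_params(16, 9, 16)          # 20736
unshared = local_params(16, 9, 16, 55)   # 62726400
ops = multiply_adds(16, 9, 16, 55)       # 62726400 for either layer type
```

The locally connected layer holds 55 × 55 = 3025 times more weights than the convolutional one, yet both perform the same number of multiply-accumulate operations, since every output unit still applies one filter to one patch.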
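The K-way softmax mentioned above turns the K output scores of the last layer into a probability distribution over identities. A minimal, numerically stable sketch in plain Python follows; the example scores are made up.

```python
import math

def softmax(scores):
    """Convert K raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for K = 4 identities.
probs = softmax([2.0, 1.0, 0.1, -1.0])
predicted_class = probs.index(max(probs))
```

During training, the network is typically optimized to maximize the probability assigned to the correct identity, for example with a cross-entropy loss.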
The DeepFace model uses a combination of datasets with the Social Face Classification or SFC dataset being the primary one. Furthermore, the DeepFace model also uses the LFW dataset, and the YTF dataset.
The SFC dataset is collected from pictures on Facebook, and it consists of 4.4 million labeled images of 4,030 people, each with 800 to 1,200 faces. The most recent 5% of each identity's face images are held out for testing.
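The time-based hold-out described above can be sketched as a per-identity split. The data layout (a dictionary mapping each identity to a list of timestamped image IDs), the rounding rule, and the function name are assumptions made for this illustration.

```python
def split_most_recent(images_by_identity, holdout=0.05):
    """Hold out the most recent fraction of each identity's images for testing.

    images_by_identity maps identity -> list of (timestamp, image_id).
    """
    train, test = {}, {}
    for identity, items in images_by_identity.items():
        ordered = sorted(items)  # oldest first
        # Assumption: always keep at least one test image per identity.
        n_test = max(1, round(len(ordered) * holdout))
        train[identity] = ordered[:-n_test]
        test[identity] = ordered[-n_test:]
    return train, test
```

Splitting by time rather than at random means the test images of each identity are newer than anything seen during training.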
The LFW dataset consists of 13,323 photos of over five thousand celebrities that are then divided into 6,000 face pairs across 10 splits.
The YTF dataset consists of 3,425 videos of 1,595 subjects, and it is a subset of the celebrities in the LFW dataset.
Without frontalization, using only the 2D alignment, the model achieves an accuracy of only about 94.3%. When the model uses the center crop of the face detection without any alignment, accuracy drops to 87.9% because some parts of the facial region may fall outside the center crop. To evaluate the discriminative capability of the face representation in isolation, the model follows the unsupervised setting and directly compares the inner product of normalized features. In this setting, the model reaches a mean accuracy of 95.92%.
The above table compares the performance of the DeepFace model with that of other state-of-the-art facial recognition models.
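Comparing the inner product of normalized features amounts to a cosine similarity between two face representation vectors. A minimal sketch follows; the decision threshold is a hypothetical value that would have to be chosen on validation data, and the function names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(f1, f2):
    """Inner product of the two L2-normalized feature vectors."""
    a, b = l2_normalize(f1), l2_normalize(f2)
    return sum(x * y for x, y in zip(a, b))

def same_person(f1, f2, threshold=0.5):
    """Declare a match when the similarity reaches the (hypothetical) threshold."""
    return similarity(f1, f2) >= threshold
```

No classifier is trained on the verification pairs in this setting, so the score reflects the quality of the representation itself.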
The above picture depicts the ROC curves on the dataset.
Ideally, a face classifier would recognize faces with the accuracy of a human and deliver high accuracy irrespective of image quality, pose, expression, or illumination. Furthermore, an ideal facial recognition framework could be applied to a variety of applications with little or no modification. Although DeepFace is one of the most advanced and efficient facial recognition frameworks available, it is not perfect, and it may not deliver accurate results in certain situations. Even so, the DeepFace framework is a significant milestone in the facial recognition industry: it narrows the performance gap by making use of a powerful metric learning technique, and it is likely to become more efficient over time.