Face recognition has been one of the most extensively researched problems of the past decades, and it has recently attracted considerable commercial interest because it has become far more reliable. The main reason for this improvement is the rapid advance in both deep learning techniques and processing hardware.
Face recognition has arguably reached a point where the main challenges in bringing a commercial system to market are scalability, cost-efficiency, and user interfacing. However, applying face recognition to actual surveillance systems is harder: noise and low resolution are more critical there, and they come with additional challenges such as unconstrained poses, occlusion, and cluttered scenes. In addition, state-of-the-art models designed for high-resolution images are usually ill-suited to native low-quality surveillance footage, and test performance on artificially synthesized low-resolution images does not reflect how well recognition systems work on actual surveillance systems.
Typically, surveillance systems that rely on multiple cameras need human supervision to identify potential threats. Either those in charge of surveillance must watch the site continuously or, if the site is monitored only at certain times, there is no information about who was present while it was unmonitored unless footage is recorded, which means storing large amounts of video.
Figure 1. Different possible conditions for surveillance person identification. From left to right: outdoors, indoors, crowded fast-paced traffic.
Incorporating computer vision into surveillance systems aims to address two problems: the detection and the recognition of people. Depending on how the camera is installed at the site of interest, the required system will operate slightly differently, for example under outdoor, indoor, or crowded conditions.
Initially, for this demo, the detection and identification of people relies mainly on detecting and recognizing faces rather than the full body, for several reasons:
-Public-domain datasets for facial recognition are easier to access.
-Facial recognition methods are well documented, and dozens of libraries and tools facilitate the rapid development of a proof of concept (as was required for this development).
-We consider that recognizing a person’s full body can be a good complement to facial recognition. However, since this is a proof of concept, only facial recognition was used, for resource reasons.
-The development of this proof of concept relies on several open-source libraries whose licenses permit commercial use, modification, and distribution, such as FaceNet-PyTorch, OpenCV, scikit-learn, and face_recognition. As time to market is key, we decided to set up the main pipelines for training (and improving), performing real-time recognition, and updating the database, which are described in detail below.
3. Demo architecture:
The following sections describe the models used or implemented for this proof of concept, covering the methods used and the proposed improvements for each of them.
3.1 General solution naïve model:
The general naïve model for the product is represented below. It consists of stages that cover data acquisition, storage, communication, prediction using computer vision, and user notification via an app.
Figure 2. Naïve architecture of the complete product.
The stages of main concern during the demo development are those that involve computer vision (face recognition, face embedding, and live recognition).
3.2 Model (Face Detection):
The current model for performing face detection is based on FaceNet (Schroff et al., 2015) and is briefly described below.
3.2.1 Current model (FaceNet):
FaceNet uses a deep convolutional network based on the architecture of Zeiler & Fergus (2013) with added 1×1 convolutions, shown below, which requires about 1.6 GFLOPs per image.
Figure 3. Face detection module
For this project, the performance of the model should be evaluated by comparing the faces that the algorithm detects with the faces that are unambiguously present in the stream. A perfect algorithm would detect every face that appears in a stream without making mistakes. However, labeling every possible face would be an extensive and tedious task. For this proof of concept, we therefore considered performance to be good when using the lowest threshold that produces no false positives. The recognition considerations are described below.
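The threshold choice described above can be sketched as follows. This is only an illustration, not the demo's code: it assumes a small hand-labelled sample of detections, each with a confidence score and a flag saying whether it is really a face. The function name `min_threshold_without_false_positives` and the sample values are hypothetical.

```python
from typing import List, Optional, Tuple


def min_threshold_without_false_positives(
    detections: List[Tuple[float, bool]],
) -> Optional[float]:
    """Return the lowest score threshold that accepts no false positives.

    `detections` holds (confidence, is_real_face) pairs from a hand-labelled
    sample. A detection is accepted when confidence >= threshold.
    Returns None when no threshold taken from the sample is safe.
    """
    # Highest confidence given to something that is not a face.
    worst_false = max(
        (score for score, is_face in detections if not is_face), default=None
    )
    if worst_false is None:
        # No false detections at all: the lowest observed score is safe.
        return min((score for score, _ in detections), default=None)
    # Safe thresholds are the scores of true faces above every false one.
    safe = [score for score, is_face in detections if is_face and score > worst_false]
    return min(safe) if safe else None


# Hypothetical labelled sample: (detector confidence, is it really a face?)
sample = [(0.99, True), (0.95, True), (0.90, False),
          (0.97, True), (0.60, False), (0.85, True)]
threshold = min_threshold_without_false_positives(sample)  # 0.95
```

Note the trade-off this makes visible: with the sample above, the true face scored 0.85 is rejected in order to exclude the false detection scored 0.90.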
3.2.2 Proposed improvements:
The current model implementation is not optimal, and some project-specific considerations still need to be addressed.
3.3 Model (Face Embedding):
3.3.1 Current model:
FaceNet uses a deep convolutional network, as described above, for face detection and verification. This project, however, requires recognition and clustering (or classification). The embeddings are produced by a deep neural network that is trained to encode a very large number of faces of different identities and learns to represent them in a multidimensional vector space. An appropriate and mainstream approach to training the face encoding is the triplet loss, which enforces a margin between each pair of faces from one person and all other faces.
Figure 4. Left: Model structure. Right: Triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between an anchor and a negative of a different identity. [citation]
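As a small worked illustration of the triplet loss, for an anchor a, a positive p, a negative n, an embedding function f, and a margin α, the loss for one triplet is max(‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + α, 0). The sketch below computes it in plain Python on already-computed embeddings. The function names and the toy two-dimensional vectors are assumptions made for the example, and the default margin of 0.2 is only an illustrative value.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # illustrative value, an assumption of this sketch
) -> float:
    """Triplet loss for a single (anchor, positive, negative) triplet."""
    pos = squared_distance(anchor, positive)  # same identity: should be small
    neg = squared_distance(anchor, negative)  # other identity: should be large
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(pos - neg + margin, 0.0)


# Toy embeddings: the negative is already far away, so the loss is zero.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # 0.0
# Swapping positive and negative violates the margin and yields a positive loss.
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

During training, the loss is summed over many triplets, and only triplets that violate the margin contribute a gradient.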
These embeddings, or encodings, live in a space in which distances separate the identities of the faces. For fast classification of faces in the live stream, a k-nearest-neighbors model is trained on the encoded faces and used to predict the label of each face found in a frame.
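The classification step above can be sketched as a nearest-neighbour vote with a distance cut-off for unknown people. This is a plain-Python stand-in written for illustration, not the demo's implementation (which could equally rely on a library classifier such as scikit-learn's `KNeighborsClassifier`). The function name `knn_identify`, the values of `k` and `max_distance`, the toy two-dimensional embeddings, and the identity "Anna" are all hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

UNKNOWN = "unknown"


def knn_identify(
    query: Sequence[float],
    gallery: List[Tuple[Sequence[float], str]],
    k: int = 3,
    max_distance: float = 0.8,  # hypothetical rejection threshold
) -> str:
    """Label `query` by majority vote among its k nearest gallery embeddings.

    Neighbours farther than `max_distance` do not vote, so a face that is far
    from every enrolled identity is reported as unknown.
    """
    nearest = sorted(
        (math.dist(query, embedding), label) for embedding, label in gallery
    )[:k]
    votes = Counter(label for distance, label in nearest if distance <= max_distance)
    if not votes:
        return UNKNOWN
    return votes.most_common(1)[0][0]


# Toy gallery of enrolled embeddings (real embeddings have hundreds of dimensions).
gallery = [
    ([0.0, 0.0], "Jack"), ([0.1, 0.0], "Jack"), ([0.0, 0.1], "Jack"),
    ([1.0, 1.0], "Anna"), ([1.1, 1.0], "Anna"), ([1.0, 1.1], "Anna"),
]
known = knn_identify([0.05, 0.05], gallery)   # "Jack"
stranger = knn_identify([5.0, 5.0], gallery)  # "unknown"
```

The brute-force search is linear in the gallery size, which is acceptable for a sketch; a larger gallery would call for an indexed neighbour search.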
3.3.2 Proposed improvements:
Even though FaceNet is a good starting point for face recognition within the scope of this demonstration, alternatives such as SphereFace (Liu et al., 2017), CentreFace (Wen et al., 2016), DeepID2 (Sun et al., 2014a), and VggFace (Parkhi et al., 2015) should be benchmarked against it in order to find the best solution. Designing a model specifically for surveillance is another option worth considering.
3.4 Model (Live Recognition):
3.4.1 Current model:
The main architecture of the live recognition model is based on the structure below.
Figure 5. Stream recognition architecture
3.4.2 Proposed improvements:
For the live stream, it is essential to define the IoT devices that will be used at the edge nodes and their computing capacity. The principal ideas to consider are the use of multithreading or multiprocessing, coupled with a more efficient C++ implementation that avoids loading unnecessary libraries. The coupling with the communication systems also needs to be defined in order to determine the capabilities of the system.
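One way the multithreading idea could look is sketched below: a capture thread fills a small bounded buffer and discards the oldest frame when the buffer is full, so that recognition always works on recent frames instead of falling behind. This is a hedged illustration using only the Python standard library. The names `capture` and `recognise_stream`, the buffer size, and the use of a plain iterable as the frame source are assumptions; in a real system the frames would come from a camera and `recognise` would be the face pipeline.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel that marks the end of the stream


def capture(frames: Iterable[Any], buffer: "queue.Queue") -> None:
    """Producer: push frames, dropping the oldest one when the buffer is full."""
    for frame in frames:
        try:
            buffer.put_nowait(frame)
        except queue.Full:
            try:
                buffer.get_nowait()  # discard the stalest frame
            except queue.Empty:
                pass  # the consumer emptied the buffer in the meantime
            buffer.put(frame)
    buffer.put(_STOP)


def recognise_stream(
    frames: Iterable[Any],
    recognise: Callable[[Any], Any],
    buffer_size: int = 4,  # hypothetical value; depends on the edge device
) -> List[Any]:
    """Consumer: run `recognise` on buffered frames until the stream ends."""
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(target=capture, args=(frames, buffer), daemon=True)
    worker.start()
    results = []
    while True:
        frame = buffer.get()
        if frame is _STOP:
            break
        results.append(recognise(frame))
    worker.join()
    return results


# Stand-in stream of 100 numbered "frames" and a trivial recognition function.
processed = recognise_stream(range(100), lambda frame: frame * 2)
```

How many frames are dropped depends on timing, so the length of `processed` varies between runs. In CPython, threads mainly help when capture is I/O-bound or when the heavy work happens in native code that releases the interpreter lock; CPU-bound Python code would need multiprocessing instead.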
4. Preliminary results.
The algorithm was run on several videos in which some people had been pre-identified. The goal of this experiment was to see whether the algorithm correctly identifies the pre-identified people according to their labels and rejects unknown people, given the chosen threshold.
Figure 6. Probability for two different people of being Jack (Left. False, Right. True).
After performing identification, the system stores relevant data, such as the faces (in order to send a notification to the user) and IDs. It also generates a report of the faces predicted during the run, either when the program is stopped or on request.
Figure 7. Predicted labels stored as files. The “Not embedded” folder stores the faces that are not good enough to be embedded.
Figure 8. Duplicates of unknown faces. So far, there is no way to save only unique faces; implementing a system that recognizes unique faces per time lapse (or event) is a good opportunity for improvement.
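The improvement suggested in the caption above, saving only unique faces per time lapse, could be approached as in the following sketch. It is not part of the current demo: the class name `UniqueFaceFilter`, the distance threshold, the window length, and the toy embeddings are all hypothetical. A face is saved only if no sufficiently similar embedding was saved within the last `window_s` seconds.

```python
import math
from typing import List, Sequence, Tuple


class UniqueFaceFilter:
    """Accept a face only if no similar face was saved in the recent time window."""

    def __init__(self, max_distance: float = 0.6, window_s: float = 30.0) -> None:
        self.max_distance = max_distance  # embeddings closer than this are duplicates
        self.window_s = window_s          # length of the time window, in seconds
        self._recent: List[Tuple[float, Sequence[float]]] = []  # (timestamp, embedding)

    def should_save(self, embedding: Sequence[float], timestamp: float) -> bool:
        # Forget saved faces that have fallen out of the time window.
        self._recent = [
            (t, e) for t, e in self._recent if timestamp - t <= self.window_s
        ]
        for _, seen in self._recent:
            if math.dist(embedding, seen) <= self.max_distance:
                return False  # duplicate of a recently saved face
        self._recent.append((timestamp, embedding))
        return True


face_filter = UniqueFaceFilter(max_distance=0.5, window_s=10.0)
first = face_filter.should_save([0.0, 0.0], timestamp=0.0)    # True: new face
repeat = face_filter.should_save([0.1, 0.0], timestamp=1.0)   # False: same person again
other = face_filter.should_save([2.0, 2.0], timestamp=2.0)    # True: different person
later = face_filter.should_save([0.0, 0.0], timestamp=20.0)   # True: window expired
```

Because a duplicate does not refresh the stored timestamp, a person who stays in view is saved about once per window rather than once per frame.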
The algorithm still works on crowded images. However, it is not yet suited to frames that contain more than 10 faces, because the lag in stream playback is not yet bounded under those conditions. This needs improvement.
A brief, preliminary description of what the product is intended to be was presented, along with a description of what has been used and some considerations for improving it.
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition.
Zeiler M, Fergus R (2013) Visualizing and understanding convolutional networks. CoRR, abs/1311.290
Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017) SphereFace: Deep hypersphere embedding for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition.
Wen Y, Zhang K, Li Z, Qiao Y (2016) A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision.
Sun Y, Chen Y, Wang X, Tang X (2014a) Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems.
Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. In: British Machine Vision Conference.