September 25, 2022

How FaceNet Works, And How To Work With FaceNet

FaceNet is a face recognition system based on a deep neural network, introduced in 2015 by researchers at Google. The main idea is to train a CNN to extract a 128-element vector, called an Embedding, from a face image. Vectors extracted from images of the same person should be very close to each other, while the distance between vectors extracted from images of two different persons should be noticeably larger. To reach this condition, the CNN is trained on a triplet of images at each step, named Anchor, Positive and Negative. The Anchor and Positive images belong to the same person, and the Negative comes from a different individual.

FaceNet training architecture

The Triplet Loss function pushes the CNN to reduce the distance between the Anchor’s Embedding and the Positive’s Embedding while, at the same time, increasing the distance between the Anchor’s Embedding and the Negative’s Embedding:

FaceNet Loss Function
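For reference, the triplet loss as defined in the FaceNet paper can be written as below. The notation is an assumption taken from the paper rather than from this post: f(x) is the Embedding the CNN produces for image x, and N is the number of triplets.

```latex
L = \sum_{i=1}^{N} \Big[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \Big]_+
```

The bracket [z]_+ means max(z, 0), so a triplet contributes no loss once the Negative is farther from the Anchor than the Positive by at least alpha (measured in squared distance).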

Here a is the anchor, p is the positive, n is the negative, and alpha is the margin we specify: the Anchor-Negative distance must exceed the Anchor-Positive distance by at least alpha. In the original paper, the authors of FaceNet describe the goal of Triplet Loss as:

Here we want to ensure that an image A (anchor) of a specific person is closer to all other images P (positive) of the same person than it is to any image N (negative) of any other person.
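To make that goal concrete, here is a minimal NumPy sketch of the triplet loss on toy 2-D vectors. The function name `triplet_loss`, the toy values, and the default `alpha=0.2` are illustrative assumptions, not code from FaceNet itself.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss for batches of embeddings with shape (batch, dim)."""
    # Squared Euclidean distance Anchor-Positive and Anchor-Negative
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)
    # Only triplets that violate the margin contribute to the loss
    return np.maximum(pos_dist - neg_dist + alpha, 0.0).sum()

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
easy_negative = np.array([[1.0, 0.0]])   # already far from the anchor
hard_negative = np.array([[0.2, 0.0]])   # almost as close as the positive

print(triplet_loss(anchor, positive, easy_negative))  # zero: margin satisfied
print(triplet_loss(anchor, positive, hard_negative))  # positive: margin violated
```

The easy negative yields no loss, so it teaches the network nothing, while the hard negative produces a non-zero loss that pushes its Embedding away from the Anchor.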

Visualization of Embeddings of the MNIST images, produced by a CNN model trained with Triplet Loss function

Once the CNN is well trained, it produces a consistent Embedding for any given face, which represents that face’s features:

Extracting embedding of face using FaceNet Model

Now we have a numerical representation of any person’s face image, and a well-trained model keeps the Embeddings of different persons’ faces apart by roughly the margin alpha (the loss encourages this on the training data; it is not a hard guarantee for unseen faces). So we can apply any comparison method to find the face in the dataset that is nearest to a given face, based on their Embeddings. The process of finding the best match (nearest face) to a given face is called Inference:

Finding nearest face to the given face among 4 images based on their Embedding
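The nearest-face search in the figure can be sketched with plain NumPy. The names, the 3-element vectors, and the helper `nearest_face` are made up for illustration; real FaceNet Embeddings have 128 elements.

```python
import numpy as np

def nearest_face(query, database):
    """Return (name, distance) of the stored embedding closest to `query`."""
    names = list(database)
    stacked = np.stack([database[n] for n in names])
    # Euclidean distance from the query to every stored embedding
    distances = np.linalg.norm(stacked - query, axis=1)
    best = int(np.argmin(distances))
    return names[best], float(distances[best])

# Toy database of four faces (hypothetical values)
database = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
    "carol": np.array([0.0, 0.0, 1.0]),
    "dave": np.array([0.5, 0.5, 0.5]),
}
query = np.array([0.8, 0.2, 0.1])

name, dist = nearest_face(query, database)
print(name, round(dist, 3))
```

A linear scan like this is fine for a few hundred faces; it compares the query against every stored Embedding on each lookup.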

We can use well-known distance measures, such as Euclidean distance (the Minkowski distance with p = 2) or cosine distance, to calculate the distance between two Embeddings. It is then easy to find the minimum among the calculated distances. So what happens if a company hires new employees and wants to identify them at the entrance gate? They simply take a photo of each employee’s face, give it to the same FaceNet CNN model to produce its Embedding, and add it to the existing dataset:

Adding new employee’s face Embedding to existing dataset
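Enrolling a new person then amounts to storing one more vector, with no retraining of the network. A minimal sketch, assuming the database is a plain dictionary and using a hand-written vector as a stand-in for the model's output:

```python
import numpy as np

def enroll(database, name, embedding):
    """Add (or replace) a person's embedding in the database."""
    database[name] = np.asarray(embedding, dtype=float)
    return database

# Existing database (toy 2-element vectors for illustration)
database = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}

# Stand-in for the Embedding the FaceNet model would produce for the new photo
new_embedding = np.array([0.7, 0.7])
enroll(database, "carol", new_embedding)

print(sorted(database))
```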

How to Work with FaceNet in Python

Here we will write a code snippet that detects and identifies a face using a FaceNet model. First, we use a pretrained FaceNet model to build our database of Embeddings corresponding to an existing dataset of face images. Next, we test the system on a single image containing a face that belongs to one of the persons in the database, and try to identify it. An important point is that we must detect the face area before we can perform face identification. In this article we use the MTCNN library for that: it is a simple tool that detects every face present in the given image and returns their bounding-box coordinates. Once we have the bounding box of a face, we can easily crop the face region and use it in our system. Let’s dive into the code and see what each part does:

import glob
import os

import cv2
import numpy as np
from mtcnn import MTCNN
from scipy.spatial.distance import cosine
from tensorflow.keras.models import load_model

detector = MTCNN()
facenet_model = load_model("facenet_keras.h5")
encoding_dict = {}  # person name -> Embedding


def extract_face_and_preprocessing(image):
    # OpenCV loads images as BGR; convert to RGB before detection.
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(image)
    if not faces:
        raise ValueError("no face detected in image")
    # Use the first detected face; the box is (x, y, width, height).
    x1, y1, width, height = faces[0]["box"]
    x1, y1 = abs(x1), abs(y1)  # guard against negative coordinates
    x2, y2 = x1 + width, y1 + height
    face = image[y1:y2, x1:x2]
    face = cv2.resize(face, (160, 160))
    face = np.expand_dims(face, axis=0)  # add the batch dimension
    face = (face - face.mean()) / face.std()  # standardize pixel values
    return face, (x1, x2, y1, y2)


def get_encode():
    # Expects one sub-folder per person: faces/<person name>/<image file>.
    for item in glob.glob(os.path.join("faces", "*", "*")):
        img = cv2.imread(item)
        face, _ = extract_face_and_preprocessing(img)
        encode = facenet_model.predict(face)[0]  # 1-D Embedding
        # Keyed by folder name, so only one Embedding is kept per person.
        encoding_dict[os.path.basename(os.path.dirname(item))] = encode


get_encode()

name = "Unknown"
distance = float("inf")
test_image = cv2.imread(os.path.join("test", "6.jpg"))
face, (x1, x2, y1, y2) = extract_face_and_preprocessing(test_image)
encode = facenet_model.predict(face)[0]

# Keep the closest database entry whose cosine distance is below 0.5.
for db_name, db_encode in encoding_dict.items():
    dist = cosine(encode, db_encode)
    if dist < 0.5 and dist < distance:
        name = db_name
        distance = dist

if name == "Unknown":
    cv2.rectangle(test_image, (x1, y1), (x2, y2), (0, 0, 255), 2)
    cv2.putText(test_image, "Unknown", (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
else:
    cv2.rectangle(test_image, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(test_image, "{}:{:.2f}".format(name, distance), (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

cv2.imshow("image", test_image)
cv2.waitKey(0)  # keep the window open until a key is pressed
cv2.destroyAllWindows()

extract_face_and_preprocessing: This function detects the faces in the given image (detector.detect_faces), crops the first detected face, resizes it to 160x160 (cv2.resize), standardizes its pixel values, and returns the face together with its coordinates in the original image.

get_encode: Retrieves all images belonging to persons in the faces folder (glob.glob), reads each one (cv2.imread) and extracts the face area (extract_face_and_preprocessing). It then produces the Embedding of each face with facenet_model.predict and adds it to encoding_dict, a dictionary that we use as a database of persons’ Embeddings, keyed by the name of the folder the image came from.
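The crop and standardization steps inside extract_face_and_preprocessing can be tried in isolation, without a detector or a model. The sketch below is an assumption-laden variant: the helpers `crop_box` and `standardize` are hypothetical names, the image is random noise, and the box is clamped to the image bounds rather than passed through abs() as in the code above.

```python
import numpy as np

def crop_box(image, box):
    """Crop an (x, y, width, height) box from an H x W x C image, clamped to its bounds."""
    x, y, w, h = box
    img_h, img_w = image.shape[:2]
    x1, y1 = max(x, 0), max(y, 0)
    x2, y2 = min(x + w, img_w), min(y + h, img_h)
    return image[y1:y2, x1:x2]

def standardize(face):
    """Shift and scale pixel values to zero mean and unit standard deviation."""
    face = face.astype(np.float32)
    return (face - face.mean()) / face.std()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(100, 200, 3), dtype=np.uint8)

# A box that starts 5 pixels left of the image edge
face = crop_box(image, (-5, 10, 50, 60))
print(face.shape)  # (60, 45, 3)

normalized = standardize(face)
print(normalized.mean(), normalized.std())  # close to 0 and 1
```

Clamping trims the part of the box that falls outside the image, whereas abs() shifts the box; either keeps the slice indices valid.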

After defining both functions, we call get_encode to build the dictionary of persons. It should look like this:

Dictionary of persons’ embedding

Next, we read the test image containing the person we want to identify (cv2.imread) and extract the face area with extract_face_and_preprocessing. Before we can identify the new face, we must produce its Embedding, which we do by passing the cropped face to facenet_model.predict. With that Embedding in hand, we calculate the cosine distance between it and each person’s Embedding already stored in the database. Finally, we keep the entry with the minimum distance. Note that this code does not accept a distance higher than 0.5 as an identification result, so a face with no match below that threshold is labeled Unknown.
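The matching rule described above can be isolated into a small function and tried on toy vectors. This is a sketch, not the post's exact code: `identify` and `cosine_distance` are hypothetical helpers, and the cosine distance is written out in NumPy (one minus the cosine similarity) so the example has no SciPy dependency.

```python
import numpy as np

def cosine_distance(u, v):
    """One minus the cosine similarity of two 1-D vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def identify(encode, encoding_dict, threshold=0.5):
    """Return (name, distance) of the best match below `threshold`, else ("Unknown", inf)."""
    name, best = "Unknown", float("inf")
    for db_name, db_encode in encoding_dict.items():
        dist = cosine_distance(encode, db_encode)
        if dist < threshold and dist < best:
            name, best = db_name, dist
    return name, best

# Toy database with 2-element vectors (real Embeddings have 128 elements)
encoding_dict = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}

print(identify(np.array([0.9, 0.1]), encoding_dict))    # matches alice
print(identify(np.array([-1.0, -1.0]), encoding_dict))  # no entry within 0.5
```

The threshold is what lets the system reject strangers; without it, every face would be assigned to whichever database entry happens to be nearest.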

See the WYNA post, which implements FaceNet in a tiny bot application.

Hope you find this post helpful!
