
Facial Landmark Detection

Facial landmark detection is a key component of several of our projects at ZOZO New Zealand. In particular, we need a detector that is fast, portable (runs well on mobile devices), accurate, and stable. In this blog post, we’ll examine a few of the options we’ve tried and how we evaluated them, as well as how the family of open-source libraries and frameworks provided by OpenMMLab helped us quickly run and evaluate candidate models.

In terms of compatibility requirements, models we investigate need to be runnable in Python (ideally PyTorch) for testing and research purposes, and also must be convertible to a CoreML trace so they can run on iOS.

For all of the landmark detectors, we use an initial cropping method (which differs depending on the model) to determine a region of interest containing a single human face, which is then passed to the detector as input.
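As a minimal sketch of that cropping step, the snippet below expands a face bounding box by a relative margin, clamps it to the image, and maps landmarks predicted inside the crop back to full-image coordinates. The function names and the 25% margin are assumptions made for illustration; the actual cropping method differs per model.

```python
def expand_box(box, image_size, margin=0.25):
    """Expand a face box by a relative margin and clamp it to the image.

    box: (x0, y0, x1, y1) in pixels; image_size: (width, height).
    The default margin is an illustrative assumption, not a tuned value.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    return (
        max(0, int(round(x0 - pad_x))),
        max(0, int(round(y0 - pad_y))),
        min(width, int(round(x1 + pad_x))),
        min(height, int(round(y1 + pad_y))),
    )


def to_image_coords(landmarks, crop_box):
    """Map landmarks predicted in crop coordinates back to the full image."""
    x0, y0, _, _ = crop_box
    return [(x + x0, y + y0) for x, y in landmarks]
```

The detector only ever sees the cropped region, so its outputs have to be shifted by the crop origin before they can be drawn on the original frame.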

Please note that the detectors in this blog are being used for internal evaluation purposes only, and those which are trained on datasets allowing only for non-commercial use will be retrained on an internally-developed licensed dataset before being released as part of any commercial product we develop.

OpenMMLab Frameworks

OpenMMLab provides a family of PyTorch-based Python libraries which act as foundations, toolkits, benchmarks, and model zoos for modern machine learning. At ZOZO NZ, we’ve found these libraries extremely useful for our computer vision work, especially MMPose, MMGeneration, and MMDetection. They are highly configurable, and there is a library for almost every type of computer vision task. The config-file-based setup lets users tweak existing models or define new ones while keeping configuration decisions separate from code, and it defines workflows for data loading and augmentation, training, and deployment.
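To illustrate the idea of tweaking an existing model through configuration rather than code, here is a toy sketch of a derived config overriding a few fields of a base config. This is not OpenMMLab's implementation or API: `merge_config` and every field name below are hypothetical, chosen only to show the pattern.

```python
def merge_config(base, overrides):
    """Recursively apply overrides on top of a base config dict.

    Returns a new dict; the base config is left unmodified.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical field names, for illustration only.
base_config = dict(
    model=dict(backbone=dict(type="ExampleBackbone", width=1.0), num_landmarks=68),
    train=dict(batch_size=32, max_epochs=60),
)

# A smaller variant only states what differs from the base.
small_config = merge_config(
    base_config,
    dict(model=dict(backbone=dict(width=0.5)), train=dict(batch_size=64)),
)
```

Because the variant lists only its differences, experiments stay short and easy to compare against the base they derive from.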

Landmarking Schemes

Most facial landmark detectors follow one of a handful of placement schemes, defining how many landmarks should be detected and their locations. Some of the most common schemes are:

  • WFLW (Wider Facial Landmarks in-the-wild): 98-landmark scheme

Detector Models


Face Alignment Network

The first facial landmark detector we tried was an implementation of the Face Alignment Network (FAN), which we used as our baseline, since it achieved both reasonable accuracy and good performance on an iPhone.

This model consists of four stacked “hourglass blocks” and outputs one heatmap channel per landmark; taking the argmax of each channel gives that landmark’s coordinates. The version of FAN we used outputs landmarks in the Multi-PIE (68-point) scheme.
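The heatmap decoding step can be sketched as follows. This is a simplified, dependency-free illustration using nested lists; a real pipeline would operate on tensors and may refine the peak to sub-pixel precision. The function name is an assumption.

```python
def heatmaps_to_landmarks(heatmaps):
    """Return one (x, y) per heatmap channel by taking the argmax.

    heatmaps: list of channels, each a list of rows of floats.
    Coordinates are in heatmap pixels, not input-image pixels.
    """
    landmarks = []
    for channel in heatmaps:
        best_value, best_xy = float("-inf"), (0, 0)
        for y, row in enumerate(channel):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value, best_xy = value, (x, y)
        landmarks.append(best_xy)
    return landmarks
```

Heatmaps are usually smaller than the input crop, so the resulting coordinates need to be scaled by the ratio between crop size and heatmap size before use.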


Dlib

The Dlib C++ machine learning toolkit provides Python bindings, including a facial landmark detector following the Multi-PIE (68-point) scheme. Dlib’s model is based on this paper and uses an ensemble of regression trees.

Sparse Local Patch Transformer

While most landmark detectors regress heatmaps for each landmark, and use these heatmaps to estimate the most likely landmark locations, SLPT instead predicts the locations of local patches of the image centred around each landmark. Information in these local patches is then aggregated using the deep attention mechanism (which has exploded in popularity due to recent use in Large Language Models and various image-based techniques), allowing the network to learn relations between the landmarks, rather than independently predicting each separately. The version we tested used the WFLW (98-point) scheme.
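To make the aggregation idea concrete, here is a generic scaled dot-product self-attention sketch over one feature vector per landmark patch. It is not SLPT's implementation: it omits the learned query, key, and value projections, multiple heads, and positional information that a real transformer uses, and the function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: one equal-length vector per landmark patch.
    Each output is a weighted mix of all patch vectors, so every
    landmark's representation depends on every other landmark.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, features)) for i in range(dim)]
        )
    return outputs
```

The all-pairs score computation is where the large matrix multiplications come from: the cost grows with the square of the number of landmarks.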

While we found the SLPT model to be highly accurate and very robust against jitter, we were unable to port it to iOS at real-time speed. The attention modules rely heavily on large matrix multiplications, which caused execution to switch between CPU and GPU on the iPhone and massively reduced the framerate.

Simulated Landmarks

We also tried a proof-of-concept implementation based on Microsoft’s “Fake It Till You Make It” paper. The model architecture was fairly simple (a pre-made ResNet backbone with a 2D coordinate regression head, as described in the paper); the main point of difference is that the landmark detector was trained on a simulated dataset. If this proof of concept showed promise, our plan was to construct or buy an in-house simulated dataset which we could customise with denser landmarks and face angles more suited to our use case. Our implementation detected 68 boundary-unaware landmarks, following the Multi-PIE scheme.

This detector turned out to get decent accuracy with fairly low jitter, and its simple architecture meant it was easy to port to iOS with fast performance.

RTMPose Face2D

The team at OpenMMLab was kind enough to provide us with a preview build of their 2D face landmark detector, which is part of their RTMPose model zoo. This detector uses the CSPNeXt architecture as a backbone, which provides a good trade-off between speed and accuracy, and it was easy to port to a CoreML model which achieved a real-time framerate on iOS devices. An added benefit of this model is that, as part of the OpenMMLab framework, it is highly configurable and can be easily tweaked and retrained through the MMPose library. This model outputs boundary-aware landmarks in the 106-point LaPa scheme (the highest density on this list).


Evaluation

Our most important criteria for assessing a model are its stability and its speed on an iPhone. We are looking for a model which achieves a good trade-off between all assessment criteria: a model which provides highly stable landmarks is of no use to us if it is extremely slow, and vice versa.

To assess stability, we created a small Python script which measures the “jitter” of landmarks between frames by moving a still image around a cropped window and comparing the movement of the detected landmarks with the movement of the image, effectively measuring robustness against variability in the cropping window. We also include a subjective “visual” stability measure which is assessed by eye, since the jitter script can’t measure stability for real-time moving faces, including changing expressions and pose.
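One way such a jitter measurement could be sketched is shown below. This is an assumption about the general approach, not our actual script or its exact formula: `detect` is a hypothetical callable standing in for any of the detectors, and the score here is the mean distance of each landmark from its average position after the known crop shift has been undone.

```python
import math


def jitter_score(detect, image, offsets, crop_size):
    """Mean landmark deviation, in pixels, across shifted crop windows.

    detect(image, crop_box) is a hypothetical callable returning landmarks
    in crop coordinates for crop_box = (x0, y0, x1, y1).
    offsets: list of (dx, dy) positions for the crop window's top-left corner.
    """
    per_offset = []
    for dx, dy in offsets:
        crop_box = (dx, dy, dx + crop_size, dy + crop_size)
        # Undo the known shift so that a perfectly stable detector yields
        # identical image-space landmarks for every offset.
        per_offset.append([(x + dx, y + dy) for x, y in detect(image, crop_box)])

    count = len(per_offset)
    num_landmarks = len(per_offset[0])
    total = 0.0
    for i in range(num_landmarks):
        mean_x = sum(points[i][0] for points in per_offset) / count
        mean_y = sum(points[i][1] for points in per_offset) / count
        total += sum(
            math.hypot(points[i][0] - mean_x, points[i][1] - mean_y)
            for points in per_offset
        ) / count
    return total / num_landmarks
```

A score of zero means the landmarks moved exactly with the image; anything above zero is movement introduced by the detector's sensitivity to the crop position.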

All speeds were measured on an iPhone 12 Pro using Xcode’s model profiling tool. Since the Dlib model is called via a C++ interface on iPhone (rather than as a CoreML model like the rest), it is excluded from the speed test: the comparison would not be fair, seeing as the CoreML models can run on the Neural Engine.


[Results table: for each detector, jitter score (lower is better), visual stability, iPhone speed (milliseconds), and number of landmarks.]

While the RTMPose model doesn’t provide the lowest “jitter score” or the fastest speed, it provided the best visual stability and an excellent balance between stability and speed, while also having the densest landmark placement, making it the best fit for our purposes. Plus, being natively integrated into MMPose means we can easily retrain the model with an in-house dataset to better suit our purposes, and configure the model to be larger or smaller as needed to tune the frame-rate/accuracy trade-off. Additionally, this evaluation was performed on an alpha preview provided by the OpenMMLab team, and hopefully the full release will improve the model even further.

