Recognizing Facial Expressions with ARKit and iPhone X
I love ARKit, Apple’s Augmented Reality framework for iOS that was introduced with iOS 11. So when Apple announced that ARKit was gaining face tracking features with iPhone X, I was curious what it could do.
To learn about its capabilities, I spent a couple of hours making a quick game that tells you facial expressions to perform and gives you points based on how quickly you complete them (for example, Smile, Open Wide, etc).
It’s called FaceOff, and it’s totally open source if you want to play with it.
Setting up ARKit for faces
Recognizing facial expressions with ARKit turns out to be relatively simple. I’ll assume you know the basics of how ARKit works, and if you don’t, this article is a great place to start.
First, create new ARSession and ARFaceTrackingConfiguration objects. Make sure to set a delegate on your ARSession object and implement the ARSessionDelegate anchor methods, since this is how you’ll receive facial data. Once configured, call run() on your ARSession with your ARFaceTrackingConfiguration object and you’re good to go. ARKit will run and use the front-facing sensors on iPhone X to detect faces as they move.
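Concretely, that setup can be sketched like this (a minimal sketch, not FaceOff’s exact code; the view controller name is illustrative):

```swift
import ARKit
import UIKit

// Illustrative view controller; only the session wiring matters here.
class FaceViewController: UIViewController, ARSessionDelegate {
    let session = ARSession()

    override func viewDidLoad() {
        super.viewDidLoad()
        // Face tracking requires the TrueDepth camera (iPhone X and later)
        guard ARFaceTrackingConfiguration.isSupported else { return }
        session.delegate = self
        session.run(ARFaceTrackingConfiguration())
    }
}
```

The isSupported check matters because ARFaceTrackingConfiguration only works on devices with the front-facing TrueDepth sensors.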
Recognizing faces & expressions in real time
Once running, ARSession will begin calling your delegate functions with a special type of anchor: ARFaceAnchor. Importantly, it will call session(_ session: ARSession, didUpdate anchors: [ARAnchor]) many times per second with an up-to-date ARFaceAnchor object each time.
ARFaceAnchor has a property called blendShapes, a dictionary full of information about each part of the user’s face. Each facial part is keyed by an ARBlendShapeLocation and has a numerical value from 0–1. For example, faceAnchor.blendShapes[.mouthSmileLeft] returns a number between 0 and 1 telling you how much the user is smiling on the left side of their face (note that “left” in ARKit terms is from an external point of view, and not from your point of view).
There’s an impressive number of facial parts that can be tracked. As of iOS 11.2, ARKit defines 50 different ARBlendShapeLocation keys, covering everything from nose sneers to mouth rolls to cheek puffing. Every one of them is tracked and updated each time a new ARFaceAnchor is passed to ARSession’s delegate.
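To make the callback concrete, here’s a sketch of a delegate that reads one blend-shape value each frame (the class name and the printed label are illustrative, not FaceOff’s code):

```swift
import ARKit

// Hypothetical delegate object; FaceOff's actual delegate does more than this.
class FaceDelegate: NSObject, ARSessionDelegate {
    func session(_ session: ARSession, didUpdate anchors: [ARAnchor]) {
        // A face tracking session delivers ARFaceAnchor objects many times per second
        for case let faceAnchor as ARFaceAnchor in anchors {
            // Every blend-shape value is an NSNumber between 0 and 1
            let jawOpen = faceAnchor.blendShapes[.jawOpen]?.floatValue ?? 0
            print("jaw open: \(jawOpen)")
        }
    }
}
```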
It’s easy to write code that determines the user’s expression by piecing these parts together. For example, here’s a snippet from FaceOff that returns true if the user is smiling:

```swift
func isExpressing(from: ARFaceAnchor) -> Bool {
    guard let smileLeft = from.blendShapes[.mouthSmileLeft],
          let smileRight = from.blendShapes[.mouthSmileRight] else {
        return false
    }
    // from testing: 0.5 is a lightish smile and 0.9 is an exaggerated smile
    return smileLeft.floatValue > 0.5 && smileRight.floatValue > 0.5
}
```
Putting it together in a game
I’m impressed with the extremely low latency of ARKit’s facial tracking. It’s this low latency that makes a game like FaceOff possible, since there’s nearly no lag between when a user emotes and when an app receives an updated ARFaceAnchor.
FaceOff takes advantage of this by defining one base class called Expression, then implementing a bunch of subclasses like SmileExpression, EyeBlinkLeftExpression, EyebrowsRaisedExpression, etc. The game picks one expression at random and asks the user to perform it.
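A class hierarchy like that might be shaped roughly as follows. This is a framework-free sketch, not FaceOff’s real code: FaceOff’s classes inspect an ARFaceAnchor directly, but here a plain dictionary stands in for blendShapes so the structure is easy to see.

```swift
// Base class: each subclass names an expression and checks its own blend shapes.
class Expression {
    var name: String { return "Neutral" }
    // Subclasses override this with their own blend-shape thresholds.
    func isExpressing(blendShapes: [String: Float]) -> Bool { return false }
}

class SmileExpression: Expression {
    override var name: String { return "Smile!" }
    override func isExpressing(blendShapes: [String: Float]) -> Bool {
        // Require both corners of the mouth, mirroring the snippet above
        return (blendShapes["mouthSmileLeft"] ?? 0) > 0.5 &&
               (blendShapes["mouthSmileRight"] ?? 0) > 0.5
    }
}

class JawOpenExpression: Expression {
    override var name: String { return "Open Wide!" }
    override func isExpressing(blendShapes: [String: Float]) -> Bool {
        return (blendShapes["jawOpen"] ?? 0) > 0.6
    }
}
```

The game loop can then keep an array of these, pick one at random, and poll isExpressing against each new frame of facial data.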
Because ARKit has such low latency, it’s possible to make a game that feels like it’s instantly reacting as the user’s face changes. The effect is kind of magical.
Bonus: Showing a floating face mask
One other thing I did was show a 3D representation of the user’s face in our own SceneKit 3D view, like so:
FaceOff does this by creating a new ARSCNFaceGeometry object, assigning it to an SCNNode, then adding that node to an SCNView. Then, to keep the mask updated as the user’s face changes, all we have to do is call ARSCNFaceGeometry’s update(from:) function and pass in faceAnchor.geometry each time we get a new ARFaceAnchor. (You can see all this in action in FaceOff’s Mask class.)
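Sketched out, that mask setup and per-frame update might look something like this (an illustrative sketch; FaceOff’s Mask class is structured differently, and the Metal device would come from your view):

```swift
import ARKit
import SceneKit

// Hypothetical mask wrapper around ARSCNFaceGeometry.
class FaceMask {
    let node: SCNNode

    init?(device: MTLDevice) {
        // ARSCNFaceGeometry needs a Metal device to build its mesh
        guard let geometry = ARSCNFaceGeometry(device: device) else { return nil }
        node = SCNNode(geometry: geometry)
    }

    // Call this from session(_:didUpdate:) with each fresh face anchor
    func update(with faceAnchor: ARFaceAnchor) {
        guard let geometry = node.geometry as? ARSCNFaceGeometry else { return }
        geometry.update(from: faceAnchor.geometry)
    }
}
```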
Limitations
ARKit’s facial recognition isn’t perfect. In lower light conditions it will struggle to provide values for some facial parts — mainly smaller parts of the face like eye and cheek movement.
And even though simple 0–1 values for each ARBlendShapeLocation make tracking faces easy, there’s no single threshold that works across all users for a given expression. For example, when I smile naturally, ARBlendShapeLocationMouthSmileLeft returns 0.5, but it may return 0.4, 0.6, or other values for different users.
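One way to soften this in practice (an idea, not something FaceOff currently does) is to sample a per-user baseline while the face is neutral, then trigger on movement relative to that baseline instead of an absolute value:

```swift
// Hypothetical per-user calibration: record a neutral baseline for a
// blend-shape coefficient, then count the user as "expressing" when the
// live value rises a fixed margin above that baseline.
struct CalibratedTrigger {
    let baseline: Float   // value sampled while the user holds a neutral face
    let margin: Float     // how far above baseline counts as expressing

    func isTriggered(by value: Float) -> Bool {
        return value > baseline + margin
    }
}
```

With this, a user whose natural smile peaks at 0.4 and one whose smile peaks at 0.6 each trigger relative to their own resting face.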
Trying it out
If you’re interested, I encourage you to download FaceOff’s source code and build/run it on your iPhone X. There are a few other surprises in there that you may enjoy :).
Shameless plug time! If you’re looking to have an iOS app developed, get in touch. We’re available for consulting work.