Augmented Reality for Developers
Jonathan Linowes, Krystian Babilinski
Target-based AR
The following image illustrates a more traditional target-based AR. The device camera captures a frame of video. The software analyzes the frame, looking for a familiar target, such as a pre-programmed marker, using a technique called photogrammetry. As part of target detection, the target's deformation (for example, its size and skew) is analyzed to determine its distance, position, and orientation relative to the camera in three-dimensional space.
From that, the camera pose (position and orientation) in 3D space is determined. These values are then used in the computer graphics calculations to render virtual objects. Finally, the rendered graphics are merged with the video frame and displayed to the user:

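To make the rendering math concrete, here is a minimal sketch, in plain Python with NumPy, of how an estimated camera pose and the camera intrinsics can be used to project a virtual 3D point into pixel coordinates of the video frame. The variable names and the simple pinhole model are illustrative assumptions, not any particular AR SDK's API:

```python
import numpy as np

def project_point(point_world, R, t, K):
    """Project a 3D point (world space) into 2D pixel coordinates.

    R, t -- camera pose estimated from the detected target
            (3x3 rotation matrix and 3-vector translation)
    K    -- 3x3 camera intrinsics (focal lengths and principal point)
    """
    # Transform the point from world space into camera space.
    p_cam = R @ point_world + t
    # Apply the pinhole projection: multiply by the intrinsics,
    # then divide by depth to get pixel coordinates.
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

# Example: identity rotation, marker origin 0.5 m in front of the camera.
# The marker's origin projects to the image center (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 0.5])
print(project_point(np.array([0.0, 0.0, 0.0]), R, t, K))  # -> [320. 240.]
```

In practice, the graphics engine performs this kind of projection (plus lens distortion correction) on the GPU for every vertex of every virtual object, every frame.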
iOS and Android phones typically have a refresh rate of 60 Hz. This means the image on your screen is updated 60 times a second, or about once every 16.7 milliseconds. A lot of work goes into each of these quick updates, and much effort has been invested in optimizing the software: minimizing wasted calculations, eliminating redundancy, and applying other tricks that improve performance without negatively impacting the user experience. For example, once a target has been recognized, the software will simply track and follow it as it appears to move from one frame to the next, rather than re-recognizing the target from scratch each time.
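The detect-once-then-track optimization can be sketched, very roughly, as the loop below. The detect_target and track_target functions are hypothetical stand-ins for an AR engine's internals, not a real API; the point is simply that the expensive full-frame search only runs when no target is currently being tracked:

```python
# Hypothetical stand-ins for an AR engine's internals (illustration only).

def detect_target(frame):
    """Expensive path: search the entire frame for the known marker."""
    return {"id": "marker-1", "corners": frame["corners"]} if frame["marker_visible"] else None

def track_target(frame, previous_target):
    """Cheap path: update the previous detection using local tracking."""
    if not frame["marker_visible"]:
        return None  # tracking lost; fall back to full detection next frame
    return {**previous_target, "corners": frame["corners"]}

def run(frames):
    target = None
    for frame in frames:
        if target is None:
            target = detect_target(frame)          # full-frame detection
            path = "detect"
        else:
            target = track_target(frame, target)   # lightweight tracking
            path = "track"
        print(frame["index"], path, "found" if target else "lost")

# Simulate a short video where the marker is out of view on frames 0 and 4.
frames = [{"index": i, "marker_visible": i not in (0, 4), "corners": (i, i)}
          for i in range(8)]
run(frames)
```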
To interact with virtual objects on your mobile screen, the input processing required is a lot like that of any mobile app or game. As illustrated in the following image, the app detects a touch event on the screen. Then, it determines which object you intended to tap by mathematically casting a ray from the screen's X/Y position into 3D space, using the current camera pose. If the ray intersects an object, the app may respond to the tap (for example, by moving or modifying the object's geometry). The next time the frame is updated, these changes will be rendered on the screen:

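Here is a minimal sketch of that ray cast in plain Python with NumPy, using a hypothetical sphere-shaped virtual object as the hit target. Game engines expose this through their own raycasting or hit-test APIs; the math below is only meant to show what happens under the hood:

```python
import numpy as np

def screen_ray(touch_xy, K, R, cam_pos):
    """Turn a screen touch into a world-space ray using the camera pose.

    K       -- 3x3 camera intrinsics
    R       -- 3x3 camera-to-world rotation
    cam_pos -- camera position in world space (the ray origin)
    """
    # Unproject the tapped pixel into a direction in camera space...
    pixel = np.array([touch_xy[0], touch_xy[1], 1.0])
    dir_cam = np.linalg.inv(K) @ pixel
    # ...then rotate it into world space and normalize.
    dir_world = R @ dir_cam
    return cam_pos, dir_world / np.linalg.norm(dir_world)

def ray_hits_sphere(origin, direction, center, radius):
    """Simple hit test: does the ray pass within `radius` of `center`?"""
    to_center = center - origin
    along = np.dot(to_center, direction)         # distance along the ray
    if along < 0:
        return False                             # object is behind the camera
    closest = to_center - along * direction      # perpendicular offset from the ray
    return np.linalg.norm(closest) <= radius

# Example: camera at the origin looking down +Z, virtual object 2 m ahead.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
origin, direction = screen_ray((320, 240), K, np.eye(3), np.zeros(3))
print(ray_hits_sphere(origin, direction, np.array([0.0, 0.0, 2.0]), 0.1))  # True
```

The ray starts at the camera position and passes through the tapped pixel; whichever object it hits first is the one the user meant to touch.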
A distinguishing characteristic of handheld mobile AR is that you experience it from an arm's-length viewpoint. Holding the device out in front of you, you look through its screen like a portal to the augmented real world. The field of view is defined by the size of the device screen and how close you hold it to your face. And it's not an entirely hands-free experience: unless you're using a tripod or something similar to hold the device, you're holding it with one or both hands at all times.
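As a rough back-of-the-envelope figure, assuming a flat screen viewed head-on, the angular width of that portal is about 2 * atan(screen width / (2 * viewing distance)):

```python
import math

def portal_fov_degrees(screen_width_m, viewing_distance_m):
    """Approximate angular width of the handheld 'portal' (flat screen, viewed head-on)."""
    return math.degrees(2 * math.atan(screen_width_m / (2 * viewing_distance_m)))

# A roughly 7 cm wide phone screen held at arm's length (about 40 cm away)
# covers only about 10 degrees of your visual field.
print(round(portal_fov_degrees(0.07, 0.40), 1))  # -> 10.0
```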
Snapchat's popular augmented reality selfies go even further. Using the phone's front-facing camera, the app analyzes your face with complex AI pattern-matching algorithms to identify significant points, or nodes, that correspond to the features of your face: eyes, nose, lips, chin, and so on. It then constructs a 3D mesh, like a mask of your face. Using that, it can apply alternative graphics that line up with your facial features, and even morph and distort your actual face for play and entertainment. See this Vox video for a detailed explanation from Snapchat's engineers: https://www.youtube.com/watch?v=Pc2aJxnmzh0. The ability to do all of this in real time is both remarkably fun and serious business:

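This is not Snapchat's actual pipeline, but as a toy illustration of lining graphics up with detected facial features, the sketch below (in plain Python with NumPy, using made-up landmark coordinates) computes the scale, roll angle, and position needed to pin a 2D overlay to two detected eye landmarks:

```python
import numpy as np

def sticker_transform(left_eye, right_eye, template_eye_distance=1.0):
    """Compute scale, rotation, and position to pin a 2D graphic to eye landmarks.

    left_eye, right_eye   -- pixel coordinates from a (hypothetical) face landmark detector
    template_eye_distance -- eye spacing in the sticker's own coordinate system
    """
    delta = np.array(right_eye) - np.array(left_eye)
    scale = np.linalg.norm(delta) / template_eye_distance  # how big to draw the sticker
    angle = np.degrees(np.arctan2(delta[1], delta[0]))     # head roll, in degrees
    center = (np.array(left_eye) + np.array(right_eye)) / 2  # where to place it
    return scale, angle, center

# Example: eyes detected about 120 pixels apart with a slight head tilt.
print(sticker_transform((200, 300), (318, 310)))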

Perhaps by the time you are reading this book, there will be mobile devices with built-in depth sensors, such as Google Project Tango and Intel RealSense technologies, capable of scanning the environment and building a 3D spatial map mesh that could be used for more advanced tracking and interactions. We will explain these capabilities in the next topic and explore them later in this book in the context of wearable AR headsets, but they may apply to new mobile devices too.