# Head-Tracking Library For Immersive Audio

This library handles the processing of head-tracking information, necessary for
Immersive Audio functionality. It covers the path from raw sensor readings to
the final pose fed into a virtualizer.

## Basic Usage

The main entry point into this library is the `HeadTrackingProcessor` class.
This class is provided with the following inputs:

- Head pose, relative to some arbitrary world frame.
- Screen pose, relative to some arbitrary world frame.
- Display orientation, defined as the angle between the "physical" screen and
  the "logical" screen.
- Transform between the screen and the sound stage.
- Desired operational mode:
  - Static: only the sound stage pose is taken into account. This will result
    in an experience where the sound stage moves with the listener's head.
  - World-relative: both the head pose and stage pose are taken into account.
    This will result in an experience where the sound stage is perceived to be
    located at a fixed place in the world.
  - Screen-relative: the head pose, screen pose and stage pose are all taken
    into account. This will result in an experience where the sound stage is
    perceived to be located at a fixed place relative to the screen.

Once inputs are provided, the `calculate()` method will make the following
outputs available:

- Stage pose, relative to the head. This aggregates all the inputs mentioned
  above and is ready to be fed into a virtualizer.
- Actual operational mode. May deviate from the desired one in cases where the
  desired mode cannot be calculated (for example, as a result of dropped
  messages from one of the sensors).

A `recenter()` operation is also available, which indicates to the system that
whatever pose the screen and head are currently at should be considered the
"center" pose, or frame of reference.

## Pose-Related Conventions

### Naming and Composition

When referring to poses in code, it is good practice to follow a naming
convention that makes the reference and target frames explicit:

Bad:

```
Pose3f headPose;
```

Good:

```
Pose3f worldToHead;  // “world” is the reference frame,
                     // “head” is the target frame.
```

By following this convention, it is easy to verify correct composition of
poses, by making sure adjacent frames are identical:

```
Pose3f aToD = aToB * bToC * cToD;
```

And similarly, inverting the transform simply flips the reference and target:

```
Pose3f aToB = bToA.inverse();
```
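For example, the library's main output, the stage pose relative to the head,
can be written as a chain over the frames defined further below. This is a
hypothetical illustration only: the variable names follow the convention but
are not actual library symbols, and the real processing also involves
prediction, recentering and mode fallback:

```
// Hypothetical composition of this library's frames (see "Frames of Interest"):
Pose3f headToStage = worldToHead.inverse()  // head-to-world
                   * worldToScreen          // world-to-screen
                   * screenToStage;         // screen-to-stage
```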
### Twist

“Twist” is to pose what velocity is to distance: it is the time-derivative of a
pose, representing the change in pose over a short period of time. Its naming
convention always states one frame, e.g. `Twist3f headTwist;`.

This means that this twist represents the head-at-time-T to head-at-time-T+dt
transform. Twists are not composable in the same way as poses.

### Frames of Interest

The frames of interest in this library are defined as follows:

#### Head

This is the listener’s head. The origin is at the center point between the
eardrums, the X-axis goes from the left ear to the right ear, the Y-axis goes
from the back of the head towards the face and the Z-axis goes from the bottom
of the head to the top.

#### Screen

This is the primary screen that the user will be looking at, which is relevant
for some Immersive Audio use-cases, such as watching a movie. We will follow a
different convention for this frame than what the Sensor framework uses. The
origin is at the center of the screen. The X-axis goes from left to right, the
Z-axis goes from the screen bottom to the screen top, and the Y-axis goes
“into” the screen (away from the viewer). The up/down/left/right directions of
the screen are defined as the logical directions used for display. So when
flipping the display orientation between “landscape” and “portrait”, the frame
of reference will change with respect to the physical screen.

#### Stage

This is the frame of reference used by the virtualizer for positioning sound
objects. It is not associated with any physical frame. In a typical
multi-channel scenario, the listener is at the origin, the X-axis goes from
left to right, the Y-axis from back to front and the Z-axis from down to up.
For example, a front-right speaker is located at positive X and Y with Z = 0,
and a height speaker will have a positive Z.

#### World

It is sometimes convenient to use an intermediate frame when dealing with
head-to-screen transforms. The “world” frame is a frame of reference in the
physical world, relative to which we can measure the head pose and screen
pose. It is arbitrary, but expected to be stable (fixed).

## Processing Description

![Pose processing graph](PoseProcessingGraph.png)

The diagram above illustrates the processing that takes place from the inputs
to the outputs.

### Predictor

The Predictor block gets a pose + twist (pose derivative) and extrapolates to
obtain a predicted head pose (with a given prediction latency).

### Bias

The Bias blocks establish the reference frame for the poses by having the
ability to set the current pose as the reference for future poses
(recentering).

### Orientation Compensation

The Orientation Compensation block applies the display orientation to the
screen pose to obtain the pose of the “logical screen” frame, in which the
Y-axis points in the direction of the logical screen “up” rather than the
physical one.

### Screen-Relative Pose

The Screen-Relative Pose block is provided with a head pose and a screen pose
and estimates the pose of the head relative to the screen. Optionally, this
module may indicate that the user is likely not in front of the screen via the
“valid” output.

### Stillness Detector

The Stillness Detector blocks detect when their incoming pose stream has been
stable for a given amount of time (allowing for a configurable amount of
error). When the head is considered still, a recenter operation is triggered
(“auto-recentering”), and when the screen is considered not still, the Mode
Selector uses this information to force static mode.

### Mode Selector

The Mode Selector block aggregates the various sources of pose information into
a head-to-stage pose that is going to feed the virtualizer. It is controlled by
the “desired mode” signal that indicates whether the preference is static,
world-relative or screen-relative mode.

The actual mode may diverge from the desired mode. It is determined as follows
(see the sketch after this list):

- If the desired mode is static, the actual mode is static.
- If the desired mode is world-relative:
  - If the head and screen poses are fresh and the screen is stable (the
    stillness detector output is true), the actual mode is world-relative.
  - Otherwise, the actual mode is static.
- If the desired mode is screen-relative:
  - If the head and screen poses are fresh and the “valid” signal is asserted,
    the actual mode is screen-relative.
  - Otherwise, the same rules as for a desired mode of world-relative apply.
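The fallback rules above can be restated compactly in code. This is a minimal
sketch; the enum and the boolean inputs are illustrative stand-ins, not the
library's actual types or signal names:

```
// Illustrative restatement of the mode-selection fallback rules. The names
// used here are assumptions for the sake of the example, not library symbols.
enum class Mode { STATIC, WORLD_RELATIVE, SCREEN_RELATIVE };

Mode selectActualMode(Mode desired, bool posesFresh, bool screenStable,
                      bool screenPoseValid) {
    switch (desired) {
        case Mode::STATIC:
            return Mode::STATIC;
        case Mode::WORLD_RELATIVE:
            // World-relative requires fresh poses and a stable screen.
            return (posesFresh && screenStable) ? Mode::WORLD_RELATIVE
                                                : Mode::STATIC;
        case Mode::SCREEN_RELATIVE:
            if (posesFresh && screenPoseValid) {
                return Mode::SCREEN_RELATIVE;
            }
            // Otherwise, apply the world-relative rules.
            return (posesFresh && screenStable) ? Mode::WORLD_RELATIVE
                                                : Mode::STATIC;
    }
    return Mode::STATIC;  // Unreachable, but keeps compilers happy.
}
```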
### Rate Limiter

A Rate Limiter block is applied to the final output to smooth out abrupt
transitions caused by any of the following events:

- Mode switch.
- Display orientation switch.
- Recenter operation.
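One way such smoothing can be achieved is by clamping how far the output pose
may move toward its target on each processing step. The sketch below only
illustrates that idea; the actual Rate Limiter block may work differently, and
the use of Eigen types and the parameter names here are assumptions:

```
// Conceptual sketch of per-step rate limiting of a pose stream. This is not
// the library's implementation; Eigen types and the limits are assumptions.
#include <Eigen/Geometry>

struct PoseSmoother {
    Eigen::Vector3f position{Eigen::Vector3f::Zero()};
    Eigen::Quaternionf rotation{Eigen::Quaternionf::Identity()};

    // Move the emitted pose toward the target, limiting translation (meters)
    // and rotation (radians) per step to avoid audible jumps.
    void step(const Eigen::Vector3f& targetPosition,
              const Eigen::Quaternionf& targetRotation,
              float maxTranslationPerStep, float maxRotationPerStep) {
        Eigen::Vector3f delta = targetPosition - position;
        float distance = delta.norm();
        if (distance > maxTranslationPerStep) {
            delta *= maxTranslationPerStep / distance;
        }
        position += delta;

        float angle = rotation.angularDistance(targetRotation);
        float fraction =
                (angle > maxRotationPerStep) ? maxRotationPerStep / angle : 1.0f;
        rotation = rotation.slerp(fraction, targetRotation);
    }
};
```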