# Head-Tracking Library For Immersive Audio

This library handles the processing of head-tracking information, necessary for
Immersive Audio functionality. It converts raw sensor readings into the final
pose fed into a virtualizer.

## Basic Usage

The main entry point into this library is the `HeadTrackingProcessor` class.
This class is provided with the following inputs:

- Head pose, relative to some arbitrary world frame.
- Screen pose, relative to some arbitrary world frame.
- Display orientation, defined as the angle between the "physical" screen and
  the "logical" screen.
- Transform between the screen and the sound stage.
- Desired operational mode:
    - Static: only the sound stage pose is taken into account. This will result
      in an experience where the sound stage moves with the listener's head.
    - World-relative: both the head pose and stage pose are taken into account.
      This will result in an experience where the sound stage is perceived to be
      located at a fixed place in the world.
    - Screen-relative: the head pose, screen pose and stage pose are all taken
      into account. This will result in an experience where the sound stage is
      perceived to be located at a fixed place relative to the screen.

Once inputs are provided, the `calculate()` method will make the following
outputs available:

- Stage pose, relative to the head. This aggregates all the inputs mentioned
  above and is ready to be fed into a virtualizer.
- Actual operational mode. May deviate from the desired one in cases where the
  desired mode cannot be calculated (for example, as a result of dropped
  messages from one of the sensors).

A `recenter()` operation is also available, which indicates to the system that
whatever pose the screen and head are currently at should be considered the
"center" pose, or frame of reference.

## Pose-Related Conventions

### Naming and Composition

When referring to poses in code, it is always good practice to follow
conventional naming, which highlights the reference and target frames clearly:

Bad:

```
Pose3f headPose;
```

Good:

```
Pose3f worldToHead;  // "world" is the reference frame,
                     // "head" is the target frame.
```

By following this convention, it is easy to follow correct composition of poses,
by making sure adjacent frames are identical:

```
Pose3f aToD = aToB * bToC * cToD;
```

And similarly, inverting the transform simply flips the reference and target:

```
Pose3f aToB = bToA.inverse();
```

### Twist

“Twist” is to pose what velocity is to position: it is the time-derivative of a
pose, representing the change in pose over a short period of time. Its naming
convention always states one frame, e.g.:

```
Twist3f headTwist;
```

This means that the twist represents the head-at-time-T to head-at-time-T+dt
transform. Twists are not composable in the same way as poses.

### Frames of Interest

The frames of interest in this library are defined as follows:

#### Head

This is the listener’s head. The origin is at the center point between the
ear-drums, the X-axis goes from left ear to right ear, Y-axis goes from the back
of the head towards the face and Z-axis goes from the bottom of the head to the
top.

#### Screen

This is the primary screen that the user will be looking at, which is relevant
for some Immersive Audio use-cases, such as watching a movie. We will follow a
different convention for this frame than what the Sensor framework uses. The
origin is at the center of the screen. X-axis goes from left to right, Z-axis
goes from the screen bottom to the screen top, and Y-axis goes “into” the screen
(away from the viewer). The up/down/left/right of the screen are defined as the
logical directions used for display. So when flipping the display orientation
between “landscape” and “portrait”, the frame of reference will change with
respect to the physical screen.

#### Stage

This is the frame of reference used by the virtualizer for positioning sound
objects. It is not associated with any physical frame. In a typical
multi-channel scenario, the listener is at the origin, the X-axis goes from left
to right, Y-axis from back to front and Z-axis from down to up. For example, a
front-right speaker is located at positive X and Y with Z=0, while a height
speaker has a positive Z.
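
To make this convention concrete, the short sketch below writes down rough
direction vectors for a few canonical speaker positions in the stage frame. The
`Vec3` type and the specific values are purely illustrative and are not part of
the library.

```
// Stage-frame convention: X = left->right, Y = back->front, Z = down->up.
struct Vec3 { float x, y, z; };  // illustrative type, not a library type

// Approximate direction vectors for a few canonical speaker positions.
const Vec3 kFrontCenter     = { 0.0f,  1.0f, 0.0f};  // straight ahead
const Vec3 kFrontRight      = { 0.7f,  0.7f, 0.0f};  // positive X and Y, Z = 0
const Vec3 kSurroundLeft    = {-0.7f, -0.7f, 0.0f};  // behind and to the left
const Vec3 kHeightFrontLeft = {-0.5f,  0.5f, 0.7f};  // positive Z (height layer)
```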

#### World

It is sometimes convenient to use an intermediate frame when dealing with
head-to-screen transforms. The “world” frame is a frame of reference in the
physical world, relative to which we can measure the head pose and screen pose.
It is arbitrary, but expected to be stable (fixed).

## Processing Description

![Pose processing graph](PoseProcessingGraph.png)

The diagram above illustrates the processing that takes place from the inputs to
the outputs.

### Predictor

The Predictor block receives a pose and a twist (pose derivative) and
extrapolates to obtain a predicted head pose (with a given latency).
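
Conceptually, the extrapolation composes the current head pose with the head
motion integrated over the prediction latency. A minimal sketch, assuming a
hypothetical `integrate()` helper that turns a twist applied over a duration
into a delta pose:

```
// Predict where the head will be `latency` seconds from now, assuming the
// current rate of motion (headTwist) is sustained over that interval.
// integrate() is a hypothetical helper returning the
// head-at-time-T to head-at-time-T+latency pose.
Pose3f predictedWorldToHead = worldToHead * integrate(headTwist, latency);
```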

### Bias

The Bias blocks establish the reference frame for the poses by having the
ability to set the current pose as the reference for future poses (recentering).
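
A minimal sketch of the idea, using illustrative names (this is not the
library's actual implementation): on recenter, the block latches the current
pose and subsequently reports poses relative to it.

```
// Illustrative bias block; assumes a default-constructed Pose3f is identity.
struct BiasBlock {
    Pose3f reference;  // latched "center" pose (worldToReference)

    // Latch the current pose as the new reference frame (recentering).
    void recenter(const Pose3f& worldToCurrent) { reference = worldToCurrent; }

    // Re-express an input pose relative to the latched reference:
    // referenceToInput = referenceToWorld * worldToInput.
    Pose3f apply(const Pose3f& worldToInput) const {
        return reference.inverse() * worldToInput;
    }
};
```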

### Orientation Compensation

The Orientation Compensation block applies the display orientation to the screen
pose to obtain the pose of the “logical screen” frame, in which the Y-axis is
pointing in the direction of the logical screen “up” rather than the physical
one.
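
In terms of pose composition, this amounts to rotating the physical screen pose
by the display-orientation angle about the screen's normal. A hedged sketch,
where `rotationAboutScreenNormal()` is a hypothetical helper rather than a
library API:

```
// Compose the physical screen pose with an in-plane rotation by the display
// orientation angle, yielding the "logical screen" pose.
Pose3f worldToLogicalScreen =
        worldToPhysicalScreen * rotationAboutScreenNormal(displayOrientation);
```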

### Screen-Relative Pose

The Screen-Relative Pose block is provided with a head pose and a screen pose
and estimates the pose of the head relative to the screen. Optionally, this
module may indicate that the user is likely not in front of the screen via the
“valid” output.
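
Ignoring the validity heuristic, the core computation follows directly from the
composition convention described above, since both inputs share the same world
frame:

```
// screenToHead = screenToWorld * worldToHead; both inputs are expressed
// relative to the same world frame, so the world frames cancel out.
Pose3f screenToHead = worldToScreen.inverse() * worldToHead;
```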

### Stillness Detector

The Stillness Detector blocks detect when their incoming pose stream has been
stable for a given amount of time (allowing for a configurable amount of error).
When the head is considered still, a recenter operation is triggered
(“auto-recentering”); when the screen is considered not still, the mode selector
uses this information to force static mode.
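
A minimal sketch of such a detector, using illustrative names and a hypothetical
`distance()` metric between poses (the library's actual implementation may
differ):

```
// Illustrative stillness detector: the stream is considered "still" once every
// sample seen during the last `windowDuration` stayed within `maxError` of the
// sample that started the current stable run.
struct StillnessDetector {
    float maxError;            // allowed pose deviation
    int64_t windowDuration;    // how long the stream must stay stable

    int64_t windowStart = -1;  // timestamp when the current stable run began
    Pose3f referencePose;      // pose at the start of the stable run

    bool onSample(int64_t timestamp, const Pose3f& pose) {
        if (windowStart < 0 || distance(referencePose, pose) > maxError) {
            // Movement detected (or first sample): restart the stable run.
            windowStart = timestamp;
            referencePose = pose;
            return false;
        }
        return timestamp - windowStart >= windowDuration;
    }
};
```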

### Mode Selector

The Mode Selector block aggregates the various sources of pose information into
a head-to-stage pose that is going to feed the virtualizer. It is controlled by
the “desired mode” signal that indicates whether the preference is to be in
static, world-relative or screen-relative mode.

The actual mode may diverge from the desired mode. It is determined as follows
(see the sketch after the list):

- If the desired mode is static, the actual mode is static.
- If the desired mode is world-relative:
    - If head and screen poses are fresh and the screen is stable (stillness
      detector output is true), the actual mode is world-relative.
    - Otherwise the actual mode is static.
- If the desired mode is screen-relative:
    - If head and screen poses are fresh and the “valid” signal is asserted, the
      actual mode is screen-relative.
    - Otherwise, apply the same rules as if the desired mode were world-relative.
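
The rules above translate almost directly into code. In this sketch the enum
values and parameter names are illustrative, not the library's actual
identifiers:

```
// Illustrative sketch of the mode-selection rules described above.
HeadTrackingMode selectActualMode(HeadTrackingMode desired, bool headPoseFresh,
                                  bool screenPoseFresh, bool screenStable,
                                  bool screenRelativeValid) {
    bool posesFresh = headPoseFresh && screenPoseFresh;
    if (desired == HeadTrackingMode::SCREEN_RELATIVE && posesFresh &&
        screenRelativeValid) {
        return HeadTrackingMode::SCREEN_RELATIVE;
    }
    if (desired != HeadTrackingMode::STATIC && posesFresh && screenStable) {
        // Covers desired == WORLD_RELATIVE and the screen-relative fallback.
        return HeadTrackingMode::WORLD_RELATIVE;
    }
    return HeadTrackingMode::STATIC;
}
```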

### Rate Limiter

A Rate Limiter block is applied to the final output to smooth out any abrupt
transitions caused by any of the following events:

- Mode switch.
- Display orientation switch.
- Recenter operation.
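
One common way to implement such a block (not necessarily what this library
does) is to move the emitted pose toward the target by at most a bounded step
per processing tick, using a hypothetical `moveTowards()` helper that limits the
translation and rotation deltas:

```
// Illustrative rate limiter: the emitted pose chases the target, but never
// jumps by more than the configured per-tick limits.
struct RateLimiter {
    Pose3f output;             // last emitted pose
    float maxTranslationStep;  // per-tick translation limit
    float maxRotationStep;     // per-tick rotation limit (radians)

    Pose3f onNewTarget(const Pose3f& target) {
        output = moveTowards(output, target, maxTranslationStep, maxRotationStep);
        return output;
    }
};
```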