# Chrome OS Update Process

[TOC]

System updates in more modern operating systems like Chrome OS and Android are
called A/B updates, over-the-air ([OTA]) updates, seamless updates, or simply
auto updates. In contrast to more primitive system updates (like Windows or
macOS), where the system is booted into a special mode to overwrite the system
partitions with newer updates, which may take several minutes or hours, A/B
updates have several advantages, including but not limited to:

* Updates maintain a workable system that remains on the disk during and after
  an update. This reduces the likelihood of corrupting a device into a
  non-usable state, and reduces the need for flashing devices manually or at
  repair and warranty centers.
* Updates can happen while the system is running (normally with minimum
  overhead) without interrupting the user. The only downside for users is a
  required reboot (or, in Chrome OS, a sign out, which automatically causes a
  reboot if an update was performed; the reboot takes about 10 seconds and is
  no different from a normal reboot).
* The user does not need to request an update (although they can). The
  update checks happen periodically in the background.
* If the update fails to apply, the user is not affected. The user will
  continue on the old version of the system and the system will attempt to
  apply the update again at a later time.
* If the update applies correctly but fails to boot, the system will roll back
  to the old partition and the user can still use the system as usual.
* The user does not need to reserve enough space for the update. The system
  has already reserved enough space in terms of two copies (A and B) of a
  partition. The system doesn’t even need any cache space on the disk;
  everything happens seamlessly from network to memory to the inactive
  partitions.

## Life of an A/B Update

In A/B update capable systems, each partition, such as the kernel or root (or
other artifacts like [DLC]), has two copies. We call these two copies active (A)
and inactive (B). The system is booted into the active partition (depending on
which copy has the higher priority at boot time) and when a new update is
available, it is written into the inactive partition. After a successful reboot,
the previously inactive partition becomes active and the old active partition
becomes inactive.

### Generation

Everything starts with generating OTA packages on (Google) servers for
each new system image. This is done by calling
[ota_from_target_files](https://cs.android.com/android/platform/superproject/+/master:build/make/tools/releasetools/ota_from_target_files.py)
with source and destination builds. This script requires `target_file.zip` to
work; image files are not sufficient.

### Distribution/Configuration

Once the OTA packages are generated, they are signed with specific keys
and stored in a location known to an update server (GOTA).
GOTA will then make this OTA package accessible via a public URL. Optionally,
operators can choose to make this OTA update available only to a specific
subset of devices.

### Installation

When the device's updater client initiates an update (either periodically or
user initiated), it first consults different device policies to see if the
update check is allowed. For example, device policies can prevent an update
check during certain times of the day, or require the update check time to be
scattered randomly throughout the day.

Once policies allow the update check, the updater client sends a request to
the update server (all this communication happens over HTTPS) and identifies its
parameters like its Application ID, hardware ID, version, board, etc.
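
The two example policies above (blocked times of day and random scattering) can be sketched as follows. This is a minimal illustration, not the updater client's actual policy code; the function names, the window representation, and the uniform scatter distribution are all assumptions made for this example.

```python
import random
from datetime import time, timedelta


def update_check_allowed(now, blocked_windows):
    """Return False if `now` falls inside any blocked [start, end) window.

    `blocked_windows` is a hypothetical list of (start, end) `time` pairs.
    """
    for start, end in blocked_windows:
        if start <= end:
            blocked = start <= now < end
        else:
            # The window wraps past midnight, e.g. 22:00 to 06:00.
            blocked = now >= start or now < end
        if blocked:
            return False
    return True


def scatter_delay(max_scatter, rng):
    """Pick a uniformly random delay between zero and `max_scatter`."""
    return timedelta(seconds=rng.uniform(0, max_scatter.total_seconds()))


# Example: no update checks during working hours, scattered over up to an hour.
windows = [(time(9, 0), time(17, 0))]
if update_check_allowed(time(20, 15), windows):
    delay = scatter_delay(timedelta(hours=1), random.Random())
```

Scattering spreads requests from many devices over time, so the update server does not see them all at once.
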

Some policies on the server might prevent the device from getting specific
OTA updates; these server-side policies are often set by operators. For
example, the operator might want to deliver a beta version of software to only
a subset of devices.

If the update server decides to serve an update payload, it will respond
with all the parameters needed to perform an update, like the URLs to download
the payloads, the metadata signatures, the payload size and hash, etc. The
updater client continues communicating with the update server after different
state changes, like reporting that it started to download the payload, that it
finished the update, or that the update failed with specific error codes.

The device will then proceed to actually install the OTA update. This consists
of roughly 3 steps.

#### Download & Install

Each payload consists of two main sections: metadata and extra data. The
metadata is basically a list of operations that should be performed for an
update. The extra data contains the data blobs needed by some or all of these
operations. The updater client first downloads the metadata and
cryptographically verifies it using the provided signatures from the update
server’s response. Once the metadata is verified as valid, the rest of the
payload can easily be verified cryptographically (mostly through SHA256 hashes).

Next, the updater client marks the inactive partition as unbootable (because it
needs to write the new updates into it). At this point the system cannot
roll back to the inactive partition anymore.

Then, the updater client performs the operations defined in the metadata (in the
order they appear in the metadata) and the rest of the payload is gradually
downloaded when these operations require their data. Once an operation is
finished, its data is discarded. This eliminates the need for caching the entire
payload before applying it.
During this process the updater client periodically
checkpoints the last operation performed, so that in the event of a failure or
system shutdown it can continue from the point where it left off without redoing
all operations from the beginning.

During the download, the updater client hashes the downloaded bytes and when the
download finishes, it checks the payload signature (located at the end of the
payload). If the signature cannot be verified, the update is rejected.

#### Hash Verification & Verity Computation

After the inactive partition is updated, the updater client will compute the
Forward-Error-Correction (also known as FEC, Verity) code for each partition,
and write the computed verity data to the inactive partitions. In some updates,
verity data is included in the extra data, so this step will be skipped.

Then, the entire partition is re-read, hashed and compared to a hash value
passed in the metadata to make sure the update was successfully written into
the partition. The hash computed in this step includes the verity code written
in the last step.

#### Postinstall

In the next step, the [Postinstall] scripts (if any) are called. From OTA's
perspective, these postinstall scripts are just black boxes. Usually postinstall
scripts will optimize existing apps on the phone and run file system garbage
collection, so that the device can boot fast after OTA. But these are managed by
other teams.

#### Finishing Touches

Then the updater client goes into a state that indicates the update has
completed and the user needs to reboot the system. At this point, until the user
reboots (or signs out), the updater client will not do any more system updates
even if newer updates are available. However, it does continue to perform
periodic update checks so we can have statistics on the number of active devices
in the field.
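
Looking back at the hash verification step described earlier (re-read the whole partition, hash it, compare with the value from the metadata), a minimal sketch could look like this. It assumes SHA-256 and a file-like object standing in for the partition; the function name and the read size are made up for the example and do not come from the updater client.

```python
import hashlib
import io

READ_SIZE = 1024 * 1024  # arbitrary read size chosen for this sketch


def partition_matches(partition, expected_sha256, size):
    """Re-read `size` bytes from a file-like object and compare SHA-256 digests."""
    hasher = hashlib.sha256()
    remaining = size
    while remaining > 0:
        chunk = partition.read(min(READ_SIZE, remaining))
        if not chunk:
            return False  # the partition is shorter than the metadata claims
        hasher.update(chunk)
        remaining -= len(chunk)
    return hasher.digest() == expected_sha256


# The hashed range covers the image data plus the verity data written after it.
written = b"new image bytes" + b"verity data"
partition = io.BytesIO(written + b"unused tail of the partition")
ok = partition_matches(partition, hashlib.sha256(written).digest(), len(written))
```

Reading in fixed-size chunks keeps memory use constant regardless of the partition size.
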

After the update has proved successful, the inactive partition is marked to have
a higher priority (on a boot, a partition with higher priority is booted
first). Once the user reboots the system, it will boot into the updated
partition and it is marked as active. At this point, after the reboot, the
[update_verifier](https://cs.android.com/android/platform/superproject/+/master:bootable/recovery/update_verifier/)
program runs, reads all dm-verity devices to make sure the partitions aren't
corrupted, and then marks the update as successful.

A/B updates are considered completed at this point. Virtual A/B updates have an
additional step after this, called "merging". Merging usually takes a few
minutes; after that, Virtual A/B updates are considered complete.

## Update Engine Daemon

The `update_engine` is a single-threaded daemon process that runs all the
time. This process is the heart of the auto updates. It runs with lower
priorities in the background and is one of the last processes to start after a
system boot. Different clients (like GMS Core or other services) can send requests
for update checks to the update engine. The details of how requests are passed
to the update engine are system dependent: in Chrome OS it is D-Bus (look at
the [D-Bus interface] for a list of all available methods), and on Android it is
binder.

There are many resiliency features embedded in the update engine that make auto
updates robust, including but not limited to:

* If the update engine crashes, it will restart automatically.
* During an active update it periodically checkpoints the state of the update
  and if it fails to continue the update or crashes in the middle, it will
  continue from the last checkpoint.
* It retries failed network communication.
* If it fails to apply a delta payload (due to bit changes on the active
  partition) a few times, it switches to a full payload.
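
The checkpoint-and-resume behavior in the list above can be sketched as follows. This is only an illustration of the idea: the JSON checkpoint file, the `next_operation` key, and the function name are hypothetical, and this sketch checkpoints after every operation, whereas the update engine checkpoints periodically.

```python
import json
import os
import tempfile


def apply_with_checkpoints(operations, apply_op, checkpoint_path):
    """Apply operations in order, resuming after the last checkpointed one."""
    next_index = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_index = json.load(f)["next_operation"]
    for index in range(next_index, len(operations)):
        apply_op(operations[index])
        # Write to a temporary file and rename it, so a crash in the middle of
        # a write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_operation": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # update finished; nothing left to resume


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
done = []
apply_with_checkpoints(["op0", "op1", "op2"], done.append, checkpoint)
```

If `apply_op` raises partway through, the next call with the same checkpoint path starts from the first operation that was not checkpointed.
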

The updater client writes its active preferences in
`/data/misc/update_engine/prefs`. These preferences help with tracking changes
during the lifetime of the updater client and allow it to properly continue the
update process after failed attempts or crashes.

### Interactive vs. Non-Interactive vs. Forced Updates

Non-interactive updates are updates that are scheduled periodically by the
update engine and happen in the background. Interactive updates, on the other
hand, happen when a user specifically requests an update check (e.g. by clicking
on the “Check For Update” button in Chrome OS’s About page). Depending on the
update server's policies, interactive updates have higher priority than
non-interactive updates (by carrying marker hints). The servers may decide not
to provide an update if they are under heavy load, etc. There are other internal
differences between these two types of updates too. For example, interactive
updates try to install the update faster.

Forced updates are similar to interactive updates (initiated by some kind of
user action), but they can also be configured to act as non-interactive. Since
non-interactive updates happen periodically, a forced non-interactive update
causes a non-interactive update at the moment of the request, not at a later
time. We can call a forced non-interactive update with:

```bash
update_engine_client --interactive=false --check_for_update
```

### Network

The updater client has the capability to download the payloads using Ethernet,
WiFi, or Cellular networks depending on which one the device is connected
to. Downloading over Cellular networks will prompt the user for permission, as
it can consume a considerable amount of data.

### Logs

In Chrome OS the `update_engine` logs are located in the `/var/log/update_engine`
directory.
Whenever `update_engine` starts, it starts a new log file with the
current date and time in the log file’s name
(`update_engine.log-DATE-TIME`). Many log files can be seen in
`/var/log/update_engine` after a few restarts of the update engine or after the
system reboots. The latest active log is symlinked to
`/var/log/update_engine.log`.

In Android the `update_engine` logs are located in `/data/misc/update_engine_log`.

## Update Payload Generation

The update payload generation is the process of converting a set of
partitions/files into a format that is both understandable by the updater client
(especially if it's a much older version) and is securely verifiable. This
process involves breaking the input partitions into smaller components and
compressing them in order to help with network bandwidth when downloading the
payloads.

`delta_generator` is a tool with a wide range of options for generating
different types of update payloads. Its code is located in
`update_engine/payload_generator`. This directory contains all the source code
related to the mechanics of generating an update payload. None of the files in
this directory should be included or used in any library/executable other than
`delta_generator`, which means this directory does not get compiled into the
rest of the update engine tools.

However, it is not recommended to use `delta_generator` directly, as it has far
too many flags. Wrappers like [ota_from_target_files](https://cs.android.com/android/platform/superproject/+/master:build/make/tools/releasetools/ota_from_target_files.py)
or [OTA Generator](https://github.com/google/ota-generator) should be used.
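
To make "breaking the input partitions into smaller components and compressing them" concrete, here is a small sketch that splits an image into 2 MiB chunks and keeps whichever of the raw, bzip2, or XZ encodings is smallest. It uses Python's standard `bz2` and `lzma` modules as stand-ins and is not how `delta_generator` is implemented; the function names are invented for the example, while the operation names are the ones this document uses for full payloads.

```python
import bz2
import lzma

CHUNK_SIZE = 2 * 1024 * 1024  # 2 MiB, the configurable default this document mentions


def encode_chunk(chunk):
    """Return (operation_name, blob) for the smallest encoding of one chunk."""
    candidates = [
        ("REPLACE", chunk),
        ("REPLACE_BZ", bz2.compress(chunk)),
        ("REPLACE_XZ", lzma.compress(chunk, format=lzma.FORMAT_XZ)),
    ]
    # min() keeps the first of equally sized candidates, so raw data wins ties.
    return min(candidates, key=lambda candidate: len(candidate[1]))


def full_payload_operations(image, chunk_size=CHUNK_SIZE):
    """Yield (offset, operation_name, blob) for each chunk of the image."""
    for offset in range(0, len(image), chunk_size):
        name, blob = encode_chunk(image[offset:offset + chunk_size])
        yield offset, name, blob


# Example: a highly compressible 5 MiB image becomes three small operations.
operations = list(full_payload_operations(bytes(5 * 1024 * 1024)))
```

Trying every encoding per chunk costs extra CPU time at generation, which is paid once on the server instead of on every download.
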

### Update Payload File Specification

Each update payload file has a specific structure defined in the table below:

| Field                   | Size (bytes) | Type                                 | Description |
| ----------------------- | ------------ | ------------------------------------ | ----------- |
| Magic Number            | 4            | char[4]                              | Magic string "CrAU" identifying this as an update payload. |
| Major Version           | 8            | uint64                               | Payload major version number. |
| Manifest Size           | 8            | uint64                               | Manifest size in bytes. |
| Manifest Signature Size | 4            | uint32                               | Manifest signature blob size in bytes (only in major version 2). |
| Manifest                | Varies       | [DeltaArchiveManifest]               | The list of operations to be performed. |
| Manifest Signature      | Varies       | [Signatures]                         | The signature of the first five fields. There could be multiple signatures if the key has changed. |
| Payload Data            | Varies       | List of raw or compressed data blobs | The list of binary blobs used by operations in the metadata. |
| Payload Signature Size  | Varies       | uint64                               | The size of the payload signature. |
| Payload Signature       | Varies       | [Signatures]                         | The signature of the entire payload except the metadata signature. There could be multiple signatures if the key has changed. |

### Delta vs. Full Update Payloads

There are two types of payload: Full and Delta. A full payload is generated
solely from the target image (the image we want to update to) and has all the
data necessary to update the inactive partition. Hence, full payloads can be
quite large in size. A delta payload, on the other hand, is a differential
update generated by comparing the source image (the active partitions) and the
target image and producing the diffs between these two images. It is similar in
spirit to the output of applications like `diff` or `bsdiff`.
Hence,
updating the system using the delta payloads requires the system to read parts
of the active partition in order to update the inactive partition (or
reconstruct the target partition). The delta payloads are significantly smaller
than the full payloads. The structure of the payload is the same for both types.

Payload generation is quite resource intensive and its tools are implemented
with high parallelism.

#### Generating Full Payloads

A full payload is generated by breaking the partition into 2MiB (configurable)
chunks and either compressing them using the bzip2 or XZ algorithms or keeping
them as raw data, depending on which produces the smallest data. Full payloads
are much larger than delta payloads and hence require a longer download time if
the network bandwidth is limited. On the other hand, full payloads are a bit
faster to apply because the system doesn’t need to read data from the source
partition.

#### Generating Delta Payloads

Delta payloads are generated by looking at both the source and target images'
data on a file and metadata basis (more precisely, the file system level on each
appropriate partition). The reason we can generate delta payloads is that Chrome
OS partitions are read only. So with high certainty we can assume the active
partitions on the client’s device are bit-by-bit equal to the original
partitions generated in the image generation/signing phase. The process for
generating a delta payload is roughly as follows:

1. Find all the zero-filled blocks on the target partition and produce a `ZERO`
   operation for them. The `ZERO` operation basically discards the associated
   blocks (depending on the implementation).
2. Find all the blocks that have not changed between the source and target
   partitions by directly comparing one-to-one source and target blocks and
   produce a `SOURCE_COPY` operation.
3.
   List all the files (and their associated blocks) in the source and target
   partitions and remove blocks (and files) which we have already generated
   operations for in the last two steps. Assign the remaining metadata (inodes,
   etc.) of each partition as a file.
4. If a file is new, generate a `REPLACE`, `REPLACE_XZ`, or `REPLACE_BZ`
   operation for its data blocks depending on which one generates a smaller
   data blob.
5. For each other file, compare the source and target blocks and produce a
   `SOURCE_BSDIFF` or `PUFFDIFF` operation depending on which one generates a
   smaller data blob. These two operations produce binary diffs between a
   source and target data blob. (Look at [bsdiff] and [puffin] for details of
   such binary differential programs!)
6. Sort the operations based on their target partitions’ block offset.
7. Optionally merge same or similar operations next to each other into larger
   operations for better efficiency and potentially smaller payloads.

Full payloads can only contain `REPLACE`, `REPLACE_BZ`, and `REPLACE_XZ`
operations. Delta payloads can contain any operations.

### Major and Minor versions

The major and minor versions specify the update payload file format and the
capability of the updater client to accept certain types of update payloads
respectively. These numbers are [hard coded] in the updater client.

Major version is basically the update payload file version specified in the
[update payload file specification] above (second field). Each updater client
supports a range of major versions. Currently, there are only two major
versions: 1 and 2. Both Chrome OS and Android are on major version 2 (major
version 1 is being deprecated). Whenever there are new additions that cannot be
fitted in the [Manifest protobuf], we need to uprev the major version.
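
As an illustration of where the major version sits in the file, the sketch below reads the fixed-size header fields listed in the [update payload file specification] table. The field order and sizes come from that table; the big-endian byte order is an assumption of this sketch and should be checked against the payload parsing code, and the function name and dictionary keys are invented for the example.

```python
import struct

MAGIC = b"CrAU"


def parse_payload_header(data):
    """Parse the fixed-size header fields from the start of a payload."""
    if data[:4] != MAGIC:
        raise ValueError("not an update payload: bad magic")
    # Major version and manifest size are both uint64 (assumed big-endian).
    major_version, manifest_size = struct.unpack_from(">QQ", data, 4)
    header = {"major_version": major_version, "manifest_size": manifest_size}
    offset = 4 + 8 + 8
    if major_version == 2:
        # The manifest signature size field only exists in major version 2.
        (header["manifest_signature_size"],) = struct.unpack_from(">I", data, offset)
        offset += 4
    header["manifest_offset"] = offset
    return header


# Example with a hand-built header: version 2, 1000-byte manifest, 256-byte signature.
example = MAGIC + struct.pack(">QQI", 2, 1000, 256)
header = parse_payload_header(example)
```

A client that sees a major version outside its supported range has no reliable way to interpret the rest of the file, since the layout after this point may differ.
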
Upreving the
major version should be done with utmost care because older clients do not know
how to handle the newer versions. Any major version uprev in Chrome OS should be
associated with a GoldenEye stepping stone.

Minor version defines the capability of the updater client to accept certain
operations or perform certain actions. Each updater client supports a range of
minor versions. For example, an updater client with minor version 4 (or less)
does not know how to handle a `PUFFDIFF` operation. So when generating a delta
payload for an image which has an updater client with minor version 4 (or less),
we cannot produce a `PUFFDIFF` operation for it. The payload generation process
looks at the source image’s minor version to decide the types of operations it
supports and only generates a payload that conforms to those restrictions.
Similarly, if there is a bug in a client with a specific minor version, an uprev
in the minor version helps avoid generating payloads that cause that bug to
manifest. However, upreving minor versions is quite expensive too in terms of
maintainability and it can be error prone. So one should exercise caution when
making such a change.

Minor versions are irrelevant in full payloads. Full payloads should always be
applicable to very old clients. The reason is that the updater clients
may not send their current version, so if we had different types of full
payloads, we would not know which version to serve to the client.

### Signed vs Unsigned Payloads

Update payloads can be signed (with private/public key pairs) for use in
production or be kept unsigned for use in testing. Tools like `delta_generator`
help with generating metadata and payload hashes or signing the payloads given
private keys.

## update_payload Scripts

[update_payload] contains a set of Python scripts used mostly to validate
payload generation and application.
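
As an example of the kind of check such validation can involve, the sketch below tests whether a payload's operations are acceptable for a client's minor version, following the `PUFFDIFF` rule from the Major and Minor versions section. It illustrates the idea only and is not the logic of the actual scripts; the table and function name are hypothetical, and only the `PUFFDIFF` entry is taken from this document.

```python
# Minimum minor version needed per operation. Only the PUFFDIFF entry comes
# from this document (minor version 4 or less cannot handle PUFFDIFF).
# Operations missing from the table are treated as always supported, which is
# a simplification made for this sketch.
MIN_MINOR_VERSION = {"PUFFDIFF": 5}


def unsupported_operations(operations, client_minor_version):
    """Return the operations, in first-seen order, that the client cannot handle."""
    rejected = []
    for op in operations:
        required = MIN_MINOR_VERSION.get(op, 0)
        if client_minor_version < required and op not in rejected:
            rejected.append(op)
    return rejected


# A payload containing PUFFDIFF must not be served to a minor version 4 client.
problems = unsupported_operations(["SOURCE_COPY", "PUFFDIFF", "REPLACE_XZ"], 4)
```

A generator can use the same table in the other direction, skipping any operation type whose minimum version is above the source image's minor version.
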
We normally test the update payloads using
an actual device (live tests). The [`brillo_update_payload`] script can be used
to generate a payload and test applying it on a host machine. These tests
can be viewed as dynamic tests without the need for an actual device. Other
`update_payload` scripts (like [`check_update_payload`]) can be used to
statically check that a payload is in the correct state and that its application
works correctly. These scripts actually apply the payload statically without
running the code in `payload_consumer`.

## Postinstall

[Postinstall] is a process called after the updater client writes the new image
artifacts to the inactive partitions. One of postinstall's main responsibilities
is to recreate the dm-verity tree hash at the end of the root partition. Among
other things, it also installs new firmware updates and runs any board-specific
processes. Postinstall runs in a separate chroot inside the newly installed
partition, so it is quite separated from the rest of the active running
system. Anything that needs to be done after an update and before the device is
rebooted should be implemented inside the postinstall.

## Building Update Engine

You can build `update_engine` the same as other platform applications:

### Setup

Run these commands at the top of the Android repository before building anything.
You only need to do this once per shell.

* `source build/envsetup.sh`
* `lunch aosp_cf_x86_64_only_phone-userdebug` (or replace
  `aosp_cf_x86_64_only_phone-userdebug` with your own target)

### Building

`m update_engine update_engine_client delta_generator`

## Running Unit Tests

[Running unit tests similar to other platforms]:

* `atest update_engine_unittests` You will need a device connected to
  your laptop and accessible via ADB to do this. Cuttlefish works as well.
* `atest update_engine_host_unittests` Runs a subset of tests on the host; no
  device required.

## Initiating a Configured Update

There are different methods to initiate an update:

* Click on the “Check For Update” button in the settings’ About page. There is
  no way to configure this kind of update check.
* Use the [`scripts/update_device.py`] program and pass a path to your OTA zip file.

## Note to Developers and Maintainers

When changing the update engine source code be extra careful about these things:

### Do NOT Break Backward Compatibility

At each release cycle we should be able to generate full and delta payloads that
can correctly be applied to older devices that run older versions of the update
engine client. So for example, removing or not passing arguments in the metadata
proto file might break older clients. Or passing operations that are not
understood by older clients will break them. Whenever changing anything in the
payload generation process, ask yourself this question: Would it work on older
clients? If not, do I need to control it with minor versions or any other means?

Especially regarding enterprise rollback, a newer updater client should be able
to accept an older update payload. Normally this happens using a full payload,
but care should be taken in order to not break this compatibility.

### Think About The Future

When creating a change in the update engine, think about five years from now:

* How can the change be implemented so that five years from now older clients
  don’t break?
* How is it going to be maintained five years from now?
* How can it make future changes easier without breaking older clients
  or incurring heavy maintenance costs?

### Prefer Not To Implement Your Feature In The Updater Client

If a feature can be implemented from the server side, do NOT implement it in the
updater client.
The updater client can be fragile at points, and small
mistakes can have catastrophic consequences. For example, if a bug is introduced
in the updater client that causes it to crash right before checking for an update
and we can't catch this bug early in the release process, then the
production devices which have already moved to the new buggy system may no
longer receive automatic updates. So, always ask whether the feature
being implemented can be done from the server side (with potentially minimal
changes to the updater client), or whether it can be moved to another service
with a minimal interface to the updater client. Answering these questions will
pay off greatly in the future.

### Be Respectful Of Other Code Bases

~~The current update engine code base is used in many projects like Android.~~

The Android and ChromeOS codebases have officially diverged.

We sync the code base among these two projects frequently. Try to not break Android
or other systems that share the update engine code. Whenever landing a change,
always think about whether Android needs that change:

* How will it affect Android?
* Can the change be moved to an interface, with stub implementations provided
  so as not to affect Android?
* Can Chrome OS or Android specific code be guarded by macros?

As a basic measure, if adding/removing/renaming code, make sure to change both
`build.gn` and `Android.bp`. Do not bring Chrome OS specific code (for example
other libraries that live in `system_api` or `dlcservice`) into the common code
of update_engine. Try to separate these concerns using best software engineering
practices.

### Merging from Android (or other code bases)

Chrome OS tracks the Android code as an [upstream branch].
To merge the Android
code to Chrome OS (or vice versa) just do a `git merge` of that branch into
Chrome OS, test it using whatever means and upload a merge commit.

```bash
repo start merge-aosp
git merge --no-ff --strategy=recursive -X patience cros/upstream
repo upload --cbr --no-verify .
```

[Postinstall]: #postinstall
[update payload file specification]: #update-payload-file-specification
[OTA]: https://source.android.com/devices/tech/ota
[DLC]: https://chromium.googlesource.com/chromiumos/platform2/+/master/dlcservice
[`chromeos-setgoodkernel`]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-setgoodkernel
[D-Bus interface]: /dbus_bindings/org.chromium.UpdateEngineInterface.dbus-xml
[this repository]: /
[UpdateManager]: /update_manager/update_manager.cc
[update_manager]: /update_manager/
[P2P update related code]: https://chromium.googlesource.com/chromiumos/platform2/+/master/p2p/
[`cros_generate_update_payloads`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/scripts/cros_generate_update_payload.py
[`chromite/lib/paygen`]: https://chromium.googlesource.com/chromiumos/chromite/+/master/lib/paygen/
[DeltaArchiveManifest]: /update_metadata.proto#302
[Signatures]: /update_metadata.proto#122
[hard coded]: /update_engine.conf
[Manifest protobuf]: /update_metadata.proto
[update_payload]: /scripts/
[Postinstall]: https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-postinst
[`update_engine` protobufs]: https://chromium.googlesource.com/chromiumos/platform2/+/master/system_api/dbus/update_engine/
[Running unit tests similar to other platforms]: https://chromium.googlesource.com/chromiumos/docs/+/master/testing/running_unit_tests.md
[Nebraska]: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/master/nebraska/
[upstream branch]:
https://chromium.googlesource.com/aosp/platform/system/update_engine/+/upstream
[`cros flash`]: https://chromium.googlesource.com/chromiumos/docs/+/master/cros_flash.md
[bsdiff]: https://android.googlesource.com/platform/external/bsdiff/+/master
[puffin]: https://android.googlesource.com/platform/external/puffin/+/master
[`update_engine_client`]: /update_engine_client.cc
[`brillo_update_payload`]: /scripts/brillo_update_payload
[`check_update_payload`]: /scripts/paycheck.py
[Dev Server]: https://chromium.googlesource.com/chromiumos/chromite/+/master/docs/devserver.md