# Simpleperf

Android Studio includes a graphical front end to Simpleperf, documented in
[Inspect CPU activity with CPU Profiler](https://developer.android.com/studio/profile/cpu-profiler).
Most users will prefer to use that instead of using Simpleperf directly.

Simpleperf is a native CPU profiling tool for Android. It can be used to profile
both Android applications and native processes running on Android. It can
profile both Java and C++ code on Android. The simpleperf executable can run on Android >= L,
and Python scripts can be used on Android >= N.

Simpleperf is part of the Android Open Source Project.
The source code is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/).
The latest document is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/README.md).

[TOC]

## Introduction

An introduction slide deck is [here](./introduction.pdf).

Simpleperf contains two parts: the simpleperf executable and Python scripts.

The simpleperf executable works similarly to linux-tools-perf, but has some specific features for
the Android profiling environment:

1. It collects more info in profiling data. Since the common workflow is "record on the device, and
   report on the host", simpleperf not only collects samples in profiling data, but also collects
   needed symbols, device info and recording time.

2. It delivers new features for recording.
   1) When recording DWARF based call graphs, simpleperf unwinds the stack before writing a sample
      to file. This is to save storage space on the device.
   2) Supports tracing both on CPU time and off CPU time with the --trace-offcpu option.
   3) Supports recording callgraphs of JITed and interpreted Java code on Android >= P.

3. It relates closely to the Android platform.
   1) Is aware of the Android environment, like using system properties to enable profiling, using
      run-as to profile in the application's context.
   2) Supports reading symbols and debug information from the .gnu_debugdata section, because
      system libraries are built with a .gnu_debugdata section starting from Android O.
   3) Supports profiling shared libraries embedded in apk files.
   4) It uses the standard Android stack unwinder, so its results are consistent with all other
      Android tools.

4. It builds executables and shared libraries for different usages.
   1) Builds static executables for the device. Since static executables don't rely on any library,
      simpleperf executables can be pushed to any Android device and used to record profiling data.
   2) Builds executables for different hosts: Linux, Mac and Windows. These executables can be used
      to report on hosts.
   3) Builds report shared libraries for different hosts. The report library is used by different
      Python scripts to parse profiling data.

Detailed documentation for the simpleperf executable is [here](#executable-commands-reference).

Python scripts are split into three parts according to their functions:

1. Scripts used for recording, like app_profiler.py, run_simpleperf_without_usb_connection.py.

2. Scripts used for reporting, like report.py, report_html.py, inferno.

3. Scripts used for parsing profiling data, like simpleperf_report_lib.py.

The Python scripts are tested on Python >= 3.9. Older versions may not be supported.
Detailed documentation for the Python scripts is [here](#scripts-reference).


## Tools in simpleperf

The simpleperf executables and Python scripts are located in simpleperf/ in NDK releases, and in
system/extras/simpleperf/scripts/ in AOSP. Their functions are listed below.

bin/: contains executables and shared libraries.

bin/android/${arch}/simpleperf: static simpleperf executables used on the device.

bin/${host}/${arch}/simpleperf: simpleperf executables used on the host, which only support reporting.

bin/${host}/${arch}/libsimpleperf_report.${so/dylib/dll}: report shared libraries used on the host.

*.py, inferno, purgatorio: Python scripts used for recording and reporting. Details are in [scripts_reference.md](scripts_reference.md).


## Android application profiling

See [android_application_profiling.md](./android_application_profiling.md).


## Android platform profiling

See [android_platform_profiling.md](./android_platform_profiling.md).


## Executable commands reference

See [executable_commands_reference.md](./executable_commands_reference.md).


## Scripts reference

See [scripts_reference.md](./scripts_reference.md).

## View the profile

See [view_the_profile.md](./view_the_profile.md).

## Answers to common issues

### Support on different Android versions

On Android < N, the kernel may be too old (< 3.18) to support features like recording DWARF
based call graphs.
On Android M - O, we can only profile C++ code and fully compiled Java code.
On Android >= P, the ART interpreter supports DWARF based unwinding. So we can profile Java code.
On Android >= Q, we can use the simpleperf shipped on device to profile released Android apps, with
  `<profileable android:shell="true" />`.


### Comparing DWARF based and stack frame based call graphs

Simpleperf supports two ways of recording call stacks with samples. One is DWARF based call graph,
the other is stack frame based call graph. Below is their comparison:

Recording DWARF based call graph:
1. Needs support of debug information in binaries.
2. Generally works well on both ARM and ARM64, for both Java code and C++ code.
3. Can only unwind 64K of stack for each sample. So it isn't always possible to unwind to the bottom.
   However, this is alleviated in simpleperf, as explained in the next section.
4. Takes more CPU time than stack frame based call graphs.
   So it has higher overhead, and can't
   sample at very high frequency (usually <= 4000 Hz).

Recording stack frame based call graph:
1. Needs support of stack frame registers.
2. Doesn't work well on ARM, because ARM is short of registers, and ARM and THUMB code have
   different stack frame registers. So the kernel can't unwind a user stack containing both ARM and
   THUMB code.
3. Also doesn't work well on Java code, because the ART compiler doesn't reserve stack frame
   registers. And it can't get frames for interpreted Java code.
4. Works well when profiling native programs on ARM64. One example is profiling surfaceflinger. It
   usually shows a complete flamegraph when it works well.
5. Takes much less CPU time than DWARF based call graphs. So the sample frequency can be 10000 Hz or
   higher.

So if you need to profile code on ARM or profile Java code, DWARF based call graph is better. If you
need to profile C++ code on ARM64, stack frame based call graphs may be better. In any case, you can
first try DWARF based call graph, which is also the default option when `-g` is used, because it
always produces reasonable results. If it doesn't work well enough, then try stack frame based call
graph instead.


### Fix broken DWARF based call graph

A DWARF-based call graph is generated by unwinding thread stacks. When a sample is recorded, the
kernel dumps up to 64 kilobytes of stack data. By unwinding the stack based on DWARF information,
we can get a call stack.

Two reasons may cause a broken call stack:
1. The kernel can only dump up to 64 kilobytes of stack data for each sample, but a thread can have
   a much larger stack. In this case, we can't unwind to the thread start point.

2. We need binaries containing DWARF call frame information to unwind stack frames. The binary
   should have one of the following sections: .eh_frame, .debug_frame, .ARM.exidx or .gnu_debugdata.

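
To check the second cause for a given library, one option is to list its section headers and look for the four section names mentioned above. The following is a minimal sketch, assuming binutils `readelf` is available on the host; `list_unwind_sections` is a hypothetical helper name and not part of simpleperf.

```sh
#!/bin/sh
# Hypothetical helper (not part of simpleperf): read `readelf -S -W` output on
# stdin and print each section that can carry call frame information. The
# section names are the ones listed in the text above.
list_unwind_sections() {
    # Put every whitespace-separated token on its own line, then keep exact
    # matches only, so that .eh_frame_hdr is not mistaken for .eh_frame.
    tr -s ' \t' '\n' |
        grep -x -e '\.eh_frame' -e '\.debug_frame' -e '\.ARM\.exidx' -e '\.gnu_debugdata' |
        sort -u
}

# Usage (the library path is a placeholder):
#   readelf -S -W path/to/libnative.so | list_unwind_sections
```

If the function prints nothing for a library, callchains passing through that library are likely to be cut when recording with DWARF based call graphs.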

Both problems can be mitigated, as described below.

For the missing stack data problem:
1. To alleviate it, simpleperf joins callchains (call stacks) after recording. If two callchains of
   a thread have an entry containing the same ip and sp address, then simpleperf tries to join them
   to make the callchains longer. So we can get more complete callchains by recording longer and
   joining more samples. This doesn't guarantee complete call graphs, but it usually works
   well.

2. Simpleperf stores samples in a buffer before unwinding them. If the buffer is low in free space,
   simpleperf may decide to truncate stack data for a sample to 1K. Hopefully, this can be recovered
   by the callchain joiner. But when a high percentage of samples are truncated, many callchains can be
   broken. We can tell if many samples are truncated in the record command output, like:

```sh
$ simpleperf record ...
simpleperf I cmd_record.cpp:809] Samples recorded: 105584 (cut 86291). Samples lost: 6501.

$ simpleperf record ...
simpleperf I cmd_record.cpp:894] Samples recorded: 7,365 (1,857 with truncated stacks).
```

   There are two ways to avoid truncating samples. One is increasing the buffer size, like
   `--user-buffer-size 1G`. But `--user-buffer-size` is only available in recent simpleperf versions.
   If that option isn't available, we can use `--no-cut-samples` to disable truncating samples.

For the missing DWARF call frame info problem:
1. Most C++ code generates binaries containing call frame info, in .eh_frame or .ARM.exidx sections.
   These sections are not stripped, and are usually enough for stack unwinding.

2. For C code and a small percentage of C++ code that the compiler is sure will not generate
   exceptions, the call frame info is generated in the .debug_frame section. The .debug_frame section
   is usually stripped with other debug sections.
   One way to fix this is to download unstripped binaries
   to the device, as described [here](#fix-broken-callchain-stopped-at-c-functions).

3. The compiler doesn't generate unwind instructions for function prologues and epilogues, because
   they operate on stack frames and will not generate exceptions. But profiling may hit these
   instructions, and fail to unwind them. This usually doesn't matter in a flame graph. But in a
   time based Stack Chart (like in Android Studio and Firefox profiler), this causes stack gaps once
   in a while. We can remove stack gaps via `--remove-gaps`, which is already enabled by default.


### Fix broken callchain stopped at C functions

When using DWARF based call graphs, simpleperf generates callchains during recording to save space.
The debug information needed to unwind C functions is in the .debug_frame section, which is usually
stripped in native libraries in apks. To fix this, we can download unstripped versions of the native
libraries to the device, and ask simpleperf to use them when recording.

To use simpleperf directly:

```sh
# create native_libs dir on device, and push unstripped libs in it (nested dirs are not supported).
$ adb shell mkdir /data/local/tmp/native_libs
$ adb push <unstripped_dir>/*.so /data/local/tmp/native_libs
# run simpleperf record with --symfs option.
$ adb shell simpleperf record xxx --symfs /data/local/tmp/native_libs
```

To use app_profiler.py:

```sh
$ ./app_profiler.py -lib <unstripped_dir>
```


### How to solve missing symbols in report?

The simpleperf record command collects symbols on device in perf.data. But if the native libraries
you use on device are stripped, this will result in a lot of unknown symbols in the report. A
solution is to build binary_cache on host.

```sh
# Collect binaries needed by perf.data in binary_cache/.
$ ./binary_cache_builder.py -lib NATIVE_LIB_DIR,...
```

The NATIVE_LIB_DIRs passed to the -lib option are the directories containing unstripped native
libraries on host. After running it, the native libraries containing symbol tables are collected
in binary_cache/ for use when reporting.

```sh
$ ./report.py --symfs binary_cache

# report_html.py searches binary_cache/ automatically, so you don't need to
# pass it any argument.
$ ./report_html.py
```


### Show annotated source code and disassembly

To show hot places at source code and instruction level, we need to show source code and
disassembly with event count annotation. Simpleperf supports showing annotated source code and
disassembly for C++ code and fully compiled Java code. Simpleperf supports two ways to do it:

1. Through report_html.py:
   1) Generate perf.data and pull it to the host.
   2) Generate binary_cache, containing elf files with debug information. Use the -lib option to add
      libs with debug info. Do it with
      `binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>`.
   3) Use report_html.py to generate report.html with annotated source code and disassembly,
      as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#report_html_py).

2. Through pprof:
   1) Generate perf.data and binary_cache as above.
   2) Run `pprof_proto_generator.py` to generate a pprof proto file.
   3) Use pprof to report a function with annotated source code, as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#pprof_proto_generator_py).


### Reduce lost samples and samples with truncated stack

When using `simpleperf record`, we may see lost samples or samples with truncated stack data. Before
saving samples to a file, simpleperf uses two buffers to cache samples in memory.
One is a kernel
buffer, the other is a userspace buffer. The kernel puts samples into the kernel buffer. Simpleperf
moves samples from the kernel buffer to the userspace buffer before processing them. If a buffer
overflows, we lose samples or get samples with truncated stack data. Below is an example.

```sh
$ simpleperf record -a --duration 1 -g --user-buffer-size 100k
simpleperf I cmd_record.cpp:799] Recorded for 1.00814 seconds. Start post processing.
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks).
                                 Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
simpleperf W cmd_record.cpp:911] Lost 18.5567% of samples in kernel space, consider increasing
                                 kernel buffer size(-m), or decreasing sample frequency(-f), or
                                 increasing sample period(-c).
simpleperf W cmd_record.cpp:928] Lost/Truncated 97.1233% of samples in user space, consider
                                 increasing userspace buffer size(--user-buffer-size), or
                                 decreasing sample frequency(-f), or increasing sample period(-c).
```

In the above example, we get 79 samples, 16 of which have truncated stack data. We lose 18
samples in the kernel buffer, and lose 2111 samples in the userspace buffer.

To reduce lost samples in the kernel buffer, we can increase the kernel buffer size via `-m`. To reduce
lost samples in the userspace buffer, or reduce samples with truncated stack data, we can increase
the userspace buffer size via `--user-buffer-size`.

We can also reduce the number of samples generated in a fixed time period, for example by reducing
the sample frequency using `-f`, reducing the number of monitored threads, or not monitoring multiple
perf events at the same time.


## Bugs and contribution

Bugs and feature requests can be submitted at https://github.com/android/ndk/issues.
Patches can be uploaded to android-review.googlesource.com as [here](https://source.android.com/setup/contribute/),
or sent to email addresses listed [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/OWNERS).

If you want to compile simpleperf C++ source code, follow the steps below:
1. Download AOSP main branch as [here](https://source.android.com/setup/build/requirements).
2. Build simpleperf.
```sh
$ . build/envsetup.sh
$ lunch aosp_arm64-trunk_staging-userdebug
$ mmma system/extras/simpleperf -j30
```

If built successfully, out/target/product/generic_arm64/system/bin/simpleperf is for ARM64, and
out/target/product/generic_arm64/system/bin/simpleperf32 is for ARM.

The source code of simpleperf Python scripts is in [system/extras/simpleperf/scripts](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/scripts/).
Most scripts rely on simpleperf binaries to work. To update binaries for scripts (using linux
x86_64 host and android arm64 target as an example):
```sh
$ cp out/host/linux-x86/lib64/libsimpleperf_report.so system/extras/simpleperf/scripts/bin/linux/x86_64/libsimpleperf_report.so
$ cp out/target/product/generic_arm64/system/bin/simpleperf_ndk64 system/extras/simpleperf/scripts/bin/android/arm64/simpleperf
```

Then you can try the latest simpleperf scripts and binaries in system/extras/simpleperf/scripts.
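
Before trying the scripts, it may help to confirm that both copies landed where the scripts look for them. The following is a small sketch for the linux x86_64 host and android arm64 target example above; `check_script_binaries` is a hypothetical helper, not part of simpleperf, and the relative paths are taken from the `cp` commands.

```sh
#!/bin/sh
# Hypothetical helper (not part of simpleperf): report whether the two binaries
# copied above exist under the given scripts directory. Returns non-zero if
# either one is missing.
check_script_binaries() {
    scripts_dir=$1
    status=0
    for rel in bin/linux/x86_64/libsimpleperf_report.so bin/android/arm64/simpleperf; do
        if [ -f "$scripts_dir/$rel" ]; then
            echo "found:   $rel"
        else
            echo "missing: $rel"
            status=1
        fi
    done
    return "$status"
}

# Usage, from the root of the AOSP tree:
#   check_script_binaries system/extras/simpleperf/scripts
```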