# Simpleperf

Android Studio includes a graphical front end to Simpleperf, documented in
[Inspect CPU activity with CPU Profiler](https://developer.android.com/studio/profile/cpu-profiler).
Most users will prefer to use that instead of using Simpleperf directly.

Simpleperf is a native CPU profiling tool for Android. It can be used to profile
both Android applications and native processes running on Android. It can
profile both Java and C++ code on Android. The simpleperf executable can run on Android >= L,
and Python scripts can be used on Android >= N.

Simpleperf is part of the Android Open Source Project.
The source code is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/).
The latest document is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/README.md).

[TOC]

## Introduction

An introduction slide deck is [here](./introduction.pdf).

Simpleperf contains two parts: the simpleperf executable and Python scripts.

The simpleperf executable works similarly to linux-tools-perf, but has some specific features for
the Android profiling environment:

1. It collects more info in profiling data. Since the common workflow is "record on the device, and
   report on the host", simpleperf not only collects samples in profiling data, but also collects
   needed symbols, device info and recording time.

2. It delivers new features for recording.
   1) When recording DWARF based call graphs, simpleperf unwinds the stack before writing a sample
      to file. This saves storage space on the device.
   2) It supports tracing both on CPU time and off CPU time with the `--trace-offcpu` option.
   3) It supports recording callgraphs of JITed and interpreted Java code on Android >= P.

3. It relates closely to the Android platform.
   1) It is aware of the Android environment, like using system properties to enable profiling and
      using run-as to profile in an application's context.
   2) It supports reading symbols and debug information from the .gnu_debugdata section, because
      system libraries are built with a .gnu_debugdata section starting from Android O.
   3) It supports profiling shared libraries embedded in apk files.
   4) It uses the standard Android stack unwinder, so its results are consistent with all other
      Android tools.

4. It builds executables and shared libraries for different usages.
   1) It builds static executables for the device. Since static executables don't rely on any
      library, simpleperf executables can be pushed to any Android device and used to record
      profiling data.
   2) It builds executables for different hosts: Linux, Mac and Windows. These executables can be
      used to report on hosts.
   3) It builds report shared libraries for different hosts. The report library is used by
      different Python scripts to parse profiling data.

Detailed documentation for the simpleperf executable is [here](#executable-commands-reference).

Python scripts are split into three parts according to their functions:

1. Scripts used for recording, like app_profiler.py, run_simpleperf_without_usb_connection.py.

2. Scripts used for reporting, like report.py, report_html.py, inferno.

3. Scripts used for parsing profiling data, like simpleperf_report_lib.py.

The Python scripts are tested on Python >= 3.9. Older versions may not be supported.
Detailed documentation for the Python scripts is [here](#scripts-reference).


## Tools in simpleperf

The simpleperf executables and Python scripts are located in simpleperf/ in NDK releases, and in
system/extras/simpleperf/scripts/ in AOSP. Their functions are listed below.

bin/: contains executables and shared libraries.

bin/android/${arch}/simpleperf: static simpleperf executables used on the device.

bin/${host}/${arch}/simpleperf: simpleperf executables used on the host. They only support reporting.

bin/${host}/${arch}/libsimpleperf_report.${so/dylib/dll}: report shared libraries used on the host.

*.py, inferno, purgatorio: Python scripts used for recording and reporting. Details are in [scripts_reference.md](scripts_reference.md).


## Android application profiling

See [android_application_profiling.md](./android_application_profiling.md).


## Android platform profiling

See [android_platform_profiling.md](./android_platform_profiling.md).


## Executable commands reference

See [executable_commands_reference.md](./executable_commands_reference.md).


## Scripts reference

See [scripts_reference.md](./scripts_reference.md).

## View the profile

See [view_the_profile.md](./view_the_profile.md).

## Answers to common issues

### Support on different Android versions

1. On Android < N, the kernel may be too old (< 3.18) to support features like recording DWARF
   based call graphs.
2. On Android M - O, we can only profile C++ code and fully compiled Java code.
3. On Android >= P, the ART interpreter supports DWARF based unwinding, so we can profile Java code.
4. On Android >= Q, we can use the simpleperf shipped on the device to profile released Android
   apps that declare `<profileable android:shell="true" />` in their manifest.
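
The version notes above can be read as a small lookup. The sketch below is only an illustration: the function name `profiling_notes` is hypothetical, and the API level numbers (M = 23, N = 24, P = 28, Q = 29) are added here for convenience and are not part of the notes themselves.

```python
from typing import List

# API levels assumed for this sketch: M = 23, N = 24, P = 28, Q = 29.
ANDROID_M, ANDROID_N, ANDROID_P, ANDROID_Q = 23, 24, 28, 29


def profiling_notes(api_level: int) -> List[str]:
    """Return the notes above that apply to a given Android API level."""
    notes = []
    if api_level < ANDROID_N:
        notes.append("kernel may be too old (< 3.18) for DWARF based call graphs")
    if ANDROID_M <= api_level < ANDROID_P:
        notes.append("only C++ code and fully compiled Java code can be profiled")
    if api_level >= ANDROID_P:
        notes.append("Java code can be profiled, including interpreted code")
    if api_level >= ANDROID_Q:
        notes.append("on-device simpleperf can profile released apps marked profileable")
    return notes


for level in (23, 28, 30):
    print(level, profiling_notes(level))
```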


### Comparing DWARF based and stack frame based call graphs

Simpleperf supports two ways of recording call stacks with samples. One is the DWARF based call
graph, the other is the stack frame based call graph. Below is a comparison:

Recording DWARF based call graph:

1. Needs support of debug information in binaries.
2. Usually behaves well on both ARM and ARM64, for both Java code and C++ code.
3. Can only unwind 64 KB of stack for each sample, so it isn't always possible to unwind to the
   bottom. However, this is alleviated in simpleperf, as explained in the next section.
4. Takes more CPU time than stack frame based call graphs, so it has higher overhead and can't
   sample at very high frequency (usually <= 4000 Hz).

Recording stack frame based call graph:

1. Needs support of stack frame registers.
2. Doesn't work well on ARM, because ARM is short of registers, and ARM and THUMB code have
   different stack frame registers. So the kernel can't unwind a user stack containing both ARM
   and THUMB code.
3. Also doesn't work well on Java code, because the ART compiler doesn't reserve stack frame
   registers, and it can't get frames for interpreted Java code.
4. Works well when profiling native programs on ARM64. One example is profiling surfaceflinger. It
   usually shows a complete flamegraph when it works well.
5. Takes much less CPU time than DWARF based call graphs, so the sample frequency can be 10000 Hz
   or higher.

So if you need to profile code on ARM or profile Java code, the DWARF based call graph is better.
If you need to profile C++ code on ARM64, stack frame based call graphs may be better. In any case,
you can first try the DWARF based call graph, which is also the default when `-g` is used, because
it produces reasonable results in most cases. If it doesn't work well enough, then try the stack
frame based call graph instead.
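
The rule of thumb above can be sketched as a small helper that picks a call graph option for `simpleperf record`. This is only an illustration: the function name and its two parameters are hypothetical, and the returned strings are the `-g` and `--call-graph fp` record options.

```python
def choose_call_graph_option(arch: str, has_java_code: bool) -> str:
    """Pick a `simpleperf record` call graph option using the comparison above.

    `arch` and `has_java_code` are hypothetical inputs for this sketch.
    """
    if arch == "arm64" and not has_java_code:
        # Native code on ARM64: stack frame based unwinding is cheap and
        # usually complete, so higher sample frequencies are possible.
        return "--call-graph fp"
    # ARM (mixed ARM/THUMB code) or Java code: use DWARF based unwinding.
    return "-g"  # same as --call-graph dwarf


print(choose_call_graph_option("arm64", has_java_code=False))
print(choose_call_graph_option("arm", has_java_code=False))
print(choose_call_graph_option("arm64", has_java_code=True))
```

Starting with `-g` everywhere and switching to `--call-graph fp` only when the DWARF based result is not good enough is an equally valid reading of the advice.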


### Fix broken DWARF based call graph

A DWARF-based call graph is generated by unwinding thread stacks. When a sample is recorded, the
kernel dumps up to 64 kilobytes of stack data. By unwinding the stack based on DWARF information,
we can get a call stack.

Two reasons may cause a broken call stack:
1. The kernel can only dump up to 64 kilobytes of stack data for each sample, but a thread can have
   a much larger stack. In this case, we can't unwind to the thread start point.

2. We need binaries containing DWARF call frame information to unwind stack frames. The binary
   should have one of the following sections: .eh_frame, .debug_frame, .ARM.exidx or .gnu_debugdata.

These problems can be mitigated as described below.

For the missing stack data problem:
1. To alleviate it, simpleperf joins callchains (call stacks) after recording. If two callchains of
   a thread have an entry containing the same ip and sp address, then simpleperf tries to join them
   to make the callchains longer. So we can get more complete callchains by recording longer and
   joining more samples. This doesn't guarantee complete call graphs, but it usually works well.

2. Simpleperf stores samples in a buffer before unwinding them. If the buffer is low in free space,
   simpleperf may decide to truncate stack data for a sample to 1K. Hopefully, this can be recovered
   by the callchain joiner. But when a high percentage of samples are truncated, many callchains
   can be broken. We can tell if many samples are truncated from the record command output, like:

```sh
$ simpleperf record ...
simpleperf I cmd_record.cpp:809] Samples recorded: 105584 (cut 86291). Samples lost: 6501.

$ simpleperf record ...
simpleperf I cmd_record.cpp:894] Samples recorded: 7,365 (1,857 with truncated stacks).
```

   There are two ways to avoid truncating samples. One is increasing the buffer size, like
   `--user-buffer-size 1G`. But `--user-buffer-size` is only available in recent simpleperf
   versions. If that option isn't available, we can use `--no-cut-samples` to disable truncating
   samples.
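
The callchain joining idea described above can be illustrated with a minimal sketch. It is not simpleperf's actual callchain joiner: the function name, the `(ip, sp)` tuple representation and the innermost-first frame order are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Frame = Tuple[int, int]  # (ip, sp) of one callchain entry


def join_callchains(chains: List[List[Frame]]) -> List[List[Frame]]:
    """Extend each callchain with callers seen in other callchains of the same thread.

    Each chain is ordered from the innermost frame to the outermost one.
    """
    # Pass 1: for every (ip, sp) entry, keep the longest list of callers seen after it.
    longest_callers: Dict[Frame, List[Frame]] = {}
    for chain in chains:
        for i, frame in enumerate(chain):
            callers = chain[i + 1:]
            if len(callers) > len(longest_callers.get(frame, [])):
                longest_callers[frame] = callers

    # Pass 2: keep appending known callers to the outermost frame of each chain.
    joined = []
    for chain in chains:
        result = list(chain)
        visited = set()  # guards against looping on inconsistent data
        while result and result[-1] not in visited and longest_callers.get(result[-1]):
            visited.add(result[-1])
            result.extend(longest_callers[result[-1]])
        joined.append(result)
    return joined


main, foo, bar, baz = (0x10, 0x7FF0), (0x20, 0x7F00), (0x30, 0x7E00), (0x40, 0x7D00)
complete = [bar, foo, main]
truncated = [baz, bar]  # stack data ran out before reaching foo and main
print(join_callchains([complete, truncated])[1] == [baz, bar, foo, main])  # True
```

The truncated chain is completed because another sample of the same thread contains the same `bar` entry together with its callers, which is why recording longer tends to give more complete callchains.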

For the missing DWARF call frame info problem:
1. Most C++ code generates binaries containing call frame info, in .eh_frame or .ARM.exidx sections.
   These sections are not stripped, and are usually enough for stack unwinding.

2. For C code and a small percentage of C++ code that the compiler is sure will not generate
   exceptions, the call frame info is generated in the .debug_frame section. The .debug_frame
   section is usually stripped together with other debug sections. One way to fix this is to put
   unstripped binaries on the device, as described
   [here](#fix-broken-callchain-stopped-at-c-functions).

3. The compiler doesn't generate unwind instructions for function prologues and epilogues, because
   they operate on stack frames and will not generate exceptions. But profiling may hit these
   instructions and fail to unwind them. This usually doesn't matter in a flame graph. But in a
   time based Stack Chart (like in Android Studio and Firefox profiler), this causes stack gaps
   once in a while. We can remove stack gaps via `--remove-gaps`, which is already enabled by
   default.


### Fix broken callchain stopped at C functions

When using DWARF based call graphs, simpleperf generates callchains during recording to save space.
The debug information needed to unwind C functions is in the .debug_frame section, which is usually
stripped from native libraries in apks. To fix this, we can push unstripped versions of the native
libraries to the device, and ask simpleperf to use them when recording.

To use simpleperf directly:

```sh
# create native_libs dir on device, and push unstripped libs in it (nested dirs are not supported).
$ adb shell mkdir /data/local/tmp/native_libs
$ adb push <unstripped_dir>/*.so /data/local/tmp/native_libs
# run simpleperf record with --symfs option.
$ adb shell simpleperf record xxx --symfs /data/local/tmp/native_libs
```

To use app_profiler.py:

```sh
$ ./app_profiler.py -lib <unstripped_dir>
```


### How to solve missing symbols in report?

The simpleperf record command collects symbols on the device and stores them in perf.data. But if
the native libraries you use on the device are stripped, this results in a lot of unknown symbols
in the report. A solution is to build binary_cache on the host.

```sh
# Collect binaries needed by perf.data in binary_cache/.
$ ./binary_cache_builder.py -lib NATIVE_LIB_DIR,...
```

The NATIVE_LIB_DIRs passed to the -lib option are the directories containing unstripped native
libraries on the host. After running it, the native libraries containing symbol tables are collected
in binary_cache/ for use when reporting.

```sh
$ ./report.py --symfs binary_cache

# report_html.py searches binary_cache/ automatically, so you don't need to
# pass it any argument.
$ ./report_html.py
```


### Show annotated source code and disassembly

To find hot spots at the source code and instruction level, we need to show source code and
disassembly annotated with event counts. Simpleperf supports showing annotated source code and
disassembly for C++ code and fully compiled Java code. There are two ways to do it:

1. Through report_html.py:
   1) Generate perf.data and pull it to the host.
   2) Generate binary_cache, containing elf files with debug information. Use the -lib option to
     add libs with debug info. Do it with
     `binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>`.
   3) Use report_html.py to generate report.html with annotated source code and disassembly,
     as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#report_html_py).

2. Through pprof:
   1) Generate perf.data and binary_cache as above.
   2) Run `pprof_proto_generator.py` to generate a pprof proto file.
   3) Use pprof to report a function with annotated source code, as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#pprof_proto_generator_py).


### Reduce lost samples and samples with truncated stack

When using `simpleperf record`, we may see lost samples or samples with truncated stack data. Before
saving samples to a file, simpleperf uses two buffers to cache samples in memory. One is a kernel
buffer, the other is a userspace buffer. The kernel puts samples into the kernel buffer. Simpleperf
moves samples from the kernel buffer to the userspace buffer before processing them. If a buffer
overflows, we lose samples or get samples with truncated stack data. Below is an example.

```sh
$ simpleperf record -a --duration 1 -g --user-buffer-size 100k
simpleperf I cmd_record.cpp:799] Recorded for 1.00814 seconds. Start post processing.
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks).
                                 Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
simpleperf W cmd_record.cpp:911] Lost 18.5567% of samples in kernel space, consider increasing
                                 kernel buffer size(-m), or decreasing sample frequency(-f), or
                                 increasing sample period(-c).
simpleperf W cmd_record.cpp:928] Lost/Truncated 97.1233% of samples in user space, consider
                                 increasing userspace buffer size(--user-buffer-size), or
                                 decreasing sample frequency(-f), or increasing sample period(-c).
```

In the above example, we get 79 samples, 16 of which have truncated stack data. We lose 18
samples in the kernel buffer, and 2111 samples in the userspace buffer.

To reduce lost samples in the kernel buffer, we can increase the kernel buffer size via `-m`. To
reduce lost samples in the userspace buffer, or reduce samples with truncated stack data, we can
increase the userspace buffer size via `--user-buffer-size`.

We can also reduce the number of samples generated in a fixed time period, for example by reducing
the sample frequency using `-f`, reducing the number of monitored threads, or not monitoring
multiple perf events at the same time.
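
The percentages in the warnings can be reproduced from the counts in the summary line. The sketch below parses the example output and recomputes them. The two formulas are inferred from the numbers in this example, not taken from the simpleperf source, and the helper names are hypothetical. The older `(cut N)` summary format is not handled.

```python
import re

# Summary lines copied from the example output above.
OUTPUT = """\
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks).
                                 Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
"""

NUMBER = r"(\d+(?:,\d{3})*)"  # matches counts such as 79 or 2,111


def to_int(text: str) -> int:
    return int(text.replace(",", ""))


def loss_percentages(output: str) -> tuple:
    """Return (kernel space lost %, user space lost/truncated %)."""
    recorded_match = re.search(
        rf"Samples recorded: {NUMBER} \({NUMBER} with truncated stacks\)", output)
    lost_match = re.search(rf"kernelspace: {NUMBER}, userspace: {NUMBER}", output)
    if recorded_match is None or lost_match is None:
        raise ValueError("unrecognized record summary")
    recorded, truncated = map(to_int, recorded_match.groups())
    lost_kernel, lost_user = map(to_int, lost_match.groups())
    # Assumed formulas: they reproduce the percentages printed in the example.
    kernel_pct = 100.0 * lost_kernel / (lost_kernel + recorded)
    user_pct = 100.0 * (lost_user + truncated) / (lost_user + recorded)
    return kernel_pct, user_pct


kernel_pct, user_pct = loss_percentages(OUTPUT)
print(f"{kernel_pct:.4f}% {user_pct:.4f}%")  # 18.5567% 97.1233%
```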


## Bugs and contribution

Bugs and feature requests can be submitted at https://github.com/android/ndk/issues.
Patches can be uploaded to android-review.googlesource.com as described [here](https://source.android.com/setup/contribute/),
or sent to the email addresses listed [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/OWNERS).

If you want to compile the simpleperf C++ source code, follow the steps below:
1. Download the AOSP main branch as described [here](https://source.android.com/setup/build/requirements).
2. Build simpleperf.
```sh
$ . build/envsetup.sh
$ lunch aosp_arm64-trunk_staging-userdebug
$ mmma system/extras/simpleperf -j30
```

If the build succeeds, out/target/product/generic_arm64/system/bin/simpleperf is for ARM64, and
out/target/product/generic_arm64/system/bin/simpleperf32 is for ARM.

The source code of the simpleperf Python scripts is in [system/extras/simpleperf/scripts](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/scripts/).
Most scripts rely on simpleperf binaries to work. To update the binaries used by the scripts (using
a Linux x86_64 host and an Android arm64 target as an example):
```sh
$ cp out/host/linux-x86/lib64/libsimpleperf_report.so system/extras/simpleperf/scripts/bin/linux/x86_64/libsimpleperf_report.so
$ cp out/target/product/generic_arm64/system/bin/simpleperf_ndk64 system/extras/simpleperf/scripts/bin/android/arm64/simpleperf
```

Then you can try the latest simpleperf scripts and binaries in system/extras/simpleperf/scripts.