Voozh

Ready to give LWN a try?
With a subscription to LWN, you can stay current with what is happening in the Linux and free-software community and take advantage of subscriber-only site features. We are pleased to offer you a free trial subscription, no credit card required, so that you can see for yourself. Please, join us!

By Jake Edge
July 1, 2009

We last looked at the perfcounters patches back in December, shortly after they appeared. Since that time, a great deal of work has been done, culminating in perfcounters being included into the mainline during the recently completed 2.6.31 merge window. Along the way, a tool to use perfcounters, called , was added to the mainline as well.

Adding to the kernel directory is one of the more surprising aspects of the perfcounters merge. Kernel hackers have long been leery of adding user-space tools into the kernel source tree, but Linus Torvalds was unconvinced by multiple complaints about that approach. He pointed to to explain:

It took literally months for the user mode tools to catch up and get the patches to support new functionality into CVS (or is it SVN?), and after that it took even longer for them to become part of a release and be picked up by distributions. In fact, I'm not sure it is part of a release even now - I had to make a bug report to Fedora to get atom and Nehalem support in my tools: I think they took the unofficial patch.

Others were not so sure that the being developed separately from the kernel was the root cause of those failures. Christoph Hellwig had other ideas: "I don't think oprofile has been a [disaster] because of any kind of split, but because the design has been a failure from day 1." But, Torvalds wants to try including the tool to see where it leads: "Let's give a _new_ approach a chance, and see if we can avoid the mistakes of yesteryear this time."

The tool itself is a fairly simple command-line tool, which can be built from the directory. It also includes some documentation, in the form of man pages that are also available via (as well as in HTML and other formats). At its simplest, it gathers and reports some statistics for a particular command:

 $ perf stat ./hackbench 10
 Time: 4.174

 Performance counter stats for './hackbench 10':

	8134.135358 task-clock-msecs # 1.859 CPUs
	 23524 context-switches # 0.003 M/sec
	 1095 CPU-migrations # 0.000 M/sec
	 16964 page-faults # 0.002 M/sec
	10734363561 cycles # 1319.669 M/sec
	12281522014 instructions # 1.144 IPC
	 121964514 cache-references # 14.994 M/sec
	 10280836 cache-misses # 1.264 M/sec

	4.376588249 seconds time elapsed.

This summarizes the performance events that occurred while running the micro-benchmark program. There are a combination of hardware events (cycles, instructions, cache-references, and cache-misses) as well as software events (task-clock-msecs, context-switches, CPU-migrations, and page-faults) that are derived from the kernel code and not the processor-specific performance monitoring unit (PMU). Currently, support for hardware events is available for Intel, AMD, and PowerPC PMUs, but other architectures still have support for the software events.

There is also a -like mode for observing which kernel functions are being executed most frequently in a continuously updating display:

 $ perf top -c 1000 -p 3216

 ------------------------------------------------------------------------------
 PerfTop: 360 irqs/sec kernel:65.0% [1000 cycles], (target_pid: 3216)
 ------------------------------------------------------------------------------

		 samples pcnt RIP kernel function
 ______ _______ _____ ________________ _______________

		 1214.00 - 5.3% - 00000000c045cb4d : lock_acquire
		 1148.00 - 5.0% - 00000000c045d1d3 : lock_release
		 911.00 - 4.0% - 00000000c045d377 : lock_acquired
		 509.00 - 2.2% - 00000000c05a0cbc : debug_locks_off
		 490.00 - 2.2% - 00000000c05a2f08 : _raw_spin_trylock
		 489.00 - 2.1% - 00000000c041d1d8 : read_hpet
		 488.00 - 2.1% - 00000000c04419b8 : run_timer_softirq
		 483.00 - 2.1% - 00000000c04d5f72 : do_sys_poll
		 477.00 - 2.1% - 00000000c05a34a0 : debug_smp_processor_id
		 462.00 - 2.0% - 00000000c043df85 : __do_softirq
		 404.00 - 1.8% - 00000000c074d93f : sub_preempt_count
		 353.00 - 1.5% - 00000000c074d9d2 : add_preempt_count
		 338.00 - 1.5% - 00000000c0408a76 : native_sched_clock
		 318.00 - 1.4% - 00000000c074b4c3 : _spin_lock_irqsave
		 309.00 - 1.4% - 00000000c044ea10 : enqueue_hrtimer

This is a static version of the output from looking at a largely quiescent firefox process (pid 3216), sampling every 1000 cycles.

There is quite a bit more that can do. There is a sub-function that gathers the performance counter data into a file which can be used by other commands:

 $ perf record ./hackbench 10
 Time: 4.348
 [ perf record: Captured and wrote 2.528 MB perf.data (~110448 samples) ]

 $ perf report --sort comm,dso,symbol | head -15

 #
 # (110146 samples)
 #
 # Overhead Command Shared Object Symbol
 # ........ ................ ......................... ......
 #
	10.70% hackbench [kernel] [k] check_bytes_and_report
	 9.07% hackbench [kernel] [k] slab_pad_check
	 5.67% hackbench [kernel] [k] on_freelist
	 5.28% hackbench [kernel] [k] lock_acquire
	 5.03% hackbench [kernel] [k] lock_release
	 3.19% hackbench [kernel] [k] init_object
	 3.02% hackbench [kernel] [k] lock_acquired
	 2.47% hackbench [kernel] [k] _raw_spin_trylock

This output shows the top eight kernel functions executed while running . The same data file can also be used by (when given a symbol name and the appropriate file) to show the disassembled code for a function, along with the number of samples recorded on each instruction. There is clearly a wealth of information that can be derived from the tool.

The original posting of the perfcounters patches came as somewhat of a surprise to Stéphane Eranian, who had long been working on another performance monitoring solution, "perfmon". While he is still a bit skeptical of perfcounters, which were originally proposed by Ingo Molnar and Thomas Gleixner, he has been reviewing the patches, and providing lengthy comments. Molnar, also responded at length, breaking his reply into multiple chunks which can be found in the thread.

Perfmon was targeted at exposing as much of the underlying PMU data as possible to user space, but Molnar explicitly rejects that goal:

So for every "will you support advanced PMU feature X, Y and Z" question you ask, the first-level answer is: 'please show the developer usecase and integrate it into our tools so we can see how it all works and how useful it is'.

"A tool might want to do this" is not a good enough answer. We now have a working OSS tool-space with 'perf' where such arguments for more PMU features can be made in very specific terms: patches, numbers and comparisons. Actual hands-on utility, happy developers and faster apps is what matters in the end - not just the list of PMU features we expose.

His focus, presumably shared with his co-maintainers Peter Zijlstra and Paul Mackerras, is to generalize performance measurement features so that they are not dependent on any particular CPU and that they fit well with developer work flow: "I do claim we had few if any sane performance analysis tools before under Linux, and i think we are still in the stone ages and still have a lot of work to do in this area." From Molnar's perspective, that ease of use for users and developers is one of the main areas where perfmon fell short.

Molnar is not shy about pointing out that perfcounters still needs a lot of work, but the framework is there, so features can be added to that. As yet, there is no documentation in the kernel directory, but one presumes that will be handled sometime soon. Overall, perfcounters and the tool look to be a highly useful addition to the kernel, one that should start providing benefits—in the form of better performance—in the near term.

Index entries for this article
Kernel	Performance monitoring

URL: https://lwn.net/Articles/339361/

⇱ Perfcounters added to the mainline [LWN.net]

Perfcounters added to the mainline