Benchmarking Python Functions the Easy Way: perftester

You can use perftester to benchmark Python functions the easy way

Dec 20, 2022


PYTHON PROGRAMMING

Recently, I described how to benchmark time with the timeit module. I explained that timeit constitutes the basic Python approach to benchmarking time, and promised to show you more. This article is my first step toward keeping this promise.

There, I described the two APIs that timeit offers: snippet-based and callable-based. The former is well known but the latter is not, probably because it’s much less natural and requires you to use a lambda function. Here, I will explore the perftester package, which allows for benchmarking callables, just like the latter API; unlike it, however, it offers an API that is simple and feels natural.

perftester, however, comes with much more than just time benchmarking – it enables you to benchmark callables in terms of both execution time and memory usage, but most of all – it’s a framework for performance testing of Python callables.

We’ll discuss this rich offer step by step, starting off with benchmarking. Since the two types of benchmarking – of time and of memory – are quite different, we will focus here on benchmarking time, putting benchmarking of memory consumption aside; we will discuss this topic some other time. We will then be ready to discuss perftester as a testing framework – to the best of my knowledge, the first Python framework for testing performance of callables.

Basic usage

The above-mentioned article showed that the timeit module is easy to use. While it’s true, perftester can be even easier. Its API enables you to write concise and clear benchmarks of Python functions and other callables. In order to analyze this API, let’s use a particular example.

Imagine we have a list x of items of any type. Given an integer n, we want to extend the list so that each element is repeated n times. We do not want to simply multiply the list (x*n), because that repeats the whole list; instead, the copies of each element should stay next to each other, in the original order.

So, we expect the following behavior:

>>> extend([1, 4, 'a'], 2)
[1, 1, 4, 4, 'a', 'a']
>>> extend([1, 4, 'a'], 3)
[1, 1, 1, 4, 4, 4, 'a', 'a', 'a']
>>> extend([2, 2, 4, 1], 2)
[2, 2, 2, 2, 4, 4, 1, 1]
>>> extend([1, -1, 1, -1], 3)
[1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
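For contrast, here is a quick check of what plain list multiplication does. It repeats the whole list rather than each element, which is why it does not give the behavior shown above. The variable names are only illustrative.

```python
x = [1, 4, 'a']

# Multiplying a list repeats the whole sequence, one copy after another.
whole_list_repeated = x * 2
print(whole_list_repeated)  # [1, 4, 'a', 1, 4, 'a']

# What we want instead: each element repeated in place.
expected = [1, 1, 4, 4, 'a', 'a']
print(whole_list_repeated == expected)  # False
```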

This is one version of the function:

# extender.py
def extend(x: list, n: int) -> list:
    """Extend x n number of times, keeping the original order.

    >>> extend([1, 4, 'a'], 2)
    [1, 1, 4, 4, 'a', 'a']
    >>> extend([1, 4, 'a'], 3)
    [1, 1, 1, 4, 4, 4, 'a', 'a', 'a']
    >>> extend([2, 2, 4, 1], 2)
    [2, 2, 2, 2, 4, 4, 1, 1]
    >>> extend([1, -1, 1, -1], 3)
    [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
    """
    modified_x = []
    for x_i in x:
        for _ in range(n):
            modified_x.append(x_i)
    return modified_x

As you see, I added a docstring with doctests. If you’d like to learn about this useful testing framework, you can read about it in the Towards Data Science article below:

Python Documentation Testing with doctest: The Easy Way

To run the tests, use the following shell command, which assumes that you saved the above file as extender.py and that you’re in this very folder in the shell.

$ python -m doctest extender.py

No output means that all the tests have passed.
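If you prefer to run the doctests from Python rather than from the shell, the standard doctest module can do that, too. The snippet below is a minimal sketch: it redefines extend() with two of its doctests so that it is self-contained, and the names finder, runner and results are just illustrative choices.

```python
import doctest


def extend(x: list, n: int) -> list:
    """Extend x n number of times, keeping the original order.

    >>> extend([1, 4, 'a'], 2)
    [1, 1, 4, 4, 'a', 'a']
    >>> extend([2, 2, 4, 1], 2)
    [2, 2, 2, 2, 4, 4, 1, 1]
    """
    modified_x = []
    for x_i in x:
        for _ in range(n):
            modified_x.append(x_i)
    return modified_x


# Collect the doctests attached to extend() and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(extend, name="extend", globs={"extend": extend}):
    runner.run(test)

# The summary tells how many examples were attempted and how many failed.
results = runner.summarize(verbose=False)
print(results)
```

As with the shell command, nothing is reported for passing tests; only the summary is printed here.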

Okay, so we’re ready to benchmark the extend() function. Let’s create a main.py Python file to run the benchmarks, located in the same folder as the extender.py file:

# main.py
import extender
import perftester

if __name__ == "__main__":
    t = perftester.time_benchmark(
        extender.extend,
        [1, 1, 4, 4, 'a', 'a'],
        3
    )
    print(t)

This will use the default values of the arguments of perftester.time_benchmark(), that is, Number=100_000 and Repeat=5. If you’re wondering why on earth anyone would use upper case for the first letters of argument names, you will find the explanation in the Appendix, located close to the end of the article.

The above code produced the following results on my machine:

We will not focus on the actual results, as they are not of particular interest to us. On a different machine, we could get different results. Therefore, perftester also offers relative results, which should be approximately machine independent. We will discuss this in a moment.

First, however, let’s have a closer look at the output. It’s not too readable, is it? That’s why perftester offers us a nice little solution, the pp() function. Its name stands for pretty print, and its pretty printing is based on two things:

the pprint() function from the pprint module of the standard library,
the signif_object() function from the third-party rounder package.

The rounder package enables you to round numbers in any Python object in a very simple way. If you’re interested, you can read more in the following article:

rounder: Rounding Numbers in Complex Python Objects

Let’s see what perftester.pp() does with the above dictionary. This is the code of our main.py module:

import extender
import perftester

if __name__ == "__main__":
    t = perftester.time_benchmark(
        extender.extend,
        [1, 1, 4, 4, 'a', 'a'],
        3
    )
    perftester.pp(t)

And this is the output:

Quite nice, isn’t it? We obtained this with just one function, perftester.pp(), so you may want to remember it: perftester.pp as in perftester pretty print.


We can now analyze the output. This is what we have there:

min, mean, max: the minimum, mean and maximum, taken across all runs (there are Repeat runs), of the mean execution time of a single call of the function. So each value is a per-call time, unlike in timeit.timeit() and timeit.repeat(), which both report the total time of all calls in a run. Thus, perftester benchmarks are comparable between experiments; timeit benchmarks are not, at least not without additional calculations. Among these three, min is the most interesting to us, because in benchmarking we should look at the best result (you can find more in the above-mentioned article on timeit benchmarking).
min_relative: this is another value of interest. Relative benchmarks are performed against the execution time of an empty function (i.e., one that does nothing, just pass). They should be more or less consistent between machines that run the same OS, but they are unlikely to be consistent across different OSs (even on the same machine, as follows from my experiments).
raw_times and raw_times_relative: these two show raw values, that is, the mean execution time of the benchmarked function in each run (we have Repeat runs), and the corresponding relative time (that is, the former divided by the mean execution time of the empty function). These values are seldom of interest; see below for an example, though.
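To make these quantities more tangible, here is a minimal sketch of how similar values could be derived with timeit alone. It is not perftester’s implementation: sketch_time_benchmark() and empty_function() are hypothetical names, and dividing by the mean time of the empty function is my assumption about how the relative values are scaled.

```python
import timeit
from statistics import mean


def empty_function():
    """Baseline that does nothing; it approximates the cost of a bare call."""
    pass


def sketch_time_benchmark(func, *args, number=10_000, repeat=5, **kwargs):
    """Return per-call timing statistics for func(*args, **kwargs).

    Hypothetical helper for illustration; not perftester's implementation.
    """
    # timeit.repeat() returns the TOTAL time of `number` calls for each run,
    # so divide by `number` to get the mean time of a single call per run.
    totals = timeit.repeat(
        lambda: func(*args, **kwargs), number=number, repeat=repeat
    )
    raw_times = [total / number for total in totals]

    # Time the empty function the same way to obtain a baseline (assumption:
    # the baseline is the mean per-call time of the empty function).
    baseline_totals = timeit.repeat(empty_function, number=number, repeat=repeat)
    baseline = mean(total / number for total in baseline_totals)

    raw_times_relative = [t / baseline for t in raw_times]
    return {
        "min": min(raw_times),
        "mean": mean(raw_times),
        "max": max(raw_times),
        "min_relative": min(raw_times_relative),
        "raw_times": raw_times,
        "raw_times_relative": raw_times_relative,
    }


result = sketch_time_benchmark(sorted, [3, 1, 2], number=1_000, repeat=3)
print(result["min"], result["min_relative"])
```

Because every value is divided by number, results obtained with different number settings can be compared directly, which is the point made in the first item of the list.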


In some situations, we may wish to look at min itself, without relating it to the empty function, because it gives the minimum execution time of the benchmarked function on our machine. We definitely should be interested in this value. We may then have a look at raw_times, too, since they show how fast the function is on our machine with all the background processes running, so in a real-life scenario. We see that for the arguments we used, our function needs, on average, 1.95e-06 seconds; so, running it a million times will take almost 2 seconds. In the best run, the mean execution time was 1.84e-06 seconds, so not much smaller. The variation does not seem to be big, as we see from raw_times.

Normally we benchmark functions for various combinations of arguments, in order to understand how the function performs in various scenarios. We will do this below, when comparing two functions.

I suppose many of you have thought that I could have done a better job when writing this function… And you were right! It’s not really the best of functions I’ve coded in my life. A for loop to create a list? A list comprehension should, of course, do a much better job, also in terms of performance; see more here:

A Guide to Python Comprehensions

Let’s improve the extend() function, then. But since we want to check if our changes improve performance, we will change the function name so that our extender module has both versions. Here’s the code of our new function, extend_2():

# added to extender.py

def extend_2(x: list, n: int) -> list:
    """Extend x n number of times, keeping the original order.

    >>> extend_2([1, 4, 'a'], 2)
    [1, 1, 4, 4, 'a', 'a']
    >>> extend_2([1, 4, 'a'], 3)
    [1, 1, 1, 4, 4, 4, 'a', 'a', 'a']
    >>> extend_2([2, 2, 4, 1], 2)
    [2, 2, 2, 2, 4, 4, 1, 1]
    >>> extend_2([1, -1, 1, -1], 3)
    [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
    """
    return [x_i for x_i in x for _ in range(n)]

All the doctests pass, so the function works as expected. The function is visibly shorter and clearer – and more elegant – than the original extend(), which is good.
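Before comparing performance, it does not hurt to confirm that the two versions really return the same results, also for inputs not covered by the doctests. The sketch below repeats both definitions (without docstrings) to be self-contained; the list of test cases is my own choice.

```python
def extend(x: list, n: int) -> list:
    modified_x = []
    for x_i in x:
        for _ in range(n):
            modified_x.append(x_i)
    return modified_x


def extend_2(x: list, n: int) -> list:
    return [x_i for x_i in x for _ in range(n)]


# Compare the two versions on a handful of inputs, including edge cases.
cases = [
    ([1, 4, 'a'], 2),
    ([2, 2, 4, 1], 3),
    ([], 5),         # empty list
    ([1, 2, 3], 0),  # n = 0 gives an empty list
    ([None], 1),
]
all_equal = all(extend(x, n) == extend_2(x, n) for x, n in cases)
print(all_equal)  # True
```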

So, let’s benchmark both functions. But since we want to compare two functions, we should not use just one combination of arguments, as it’s possible that for small n, the gain (if any) will be different than that for large n. Here’s the revisited code of the main module:

# main.py
import extender
import perftester

from collections import namedtuple

Benchmarks = namedtuple("Benchmarks", "extend extend_2 better")

if __name__ == "__main__":
    orig_list = [1, 1, 4, 4, 'a', 'a']
    results = {}
    for n in (2, 5, 10, 100, 1000, 10_000):
        number = int(1_000_000 / n)
        t = perftester.time_benchmark(
            extender.extend,
            orig_list,
            n,
            Number=number
        )
        t_2 = perftester.time_benchmark(
            extender.extend_2,
            orig_list,
            n,
            Number=number
        )
        better = 'extend' if t['min'] < t_2['min'] else 'extend_2'
        nn = f"{n: 6}"
        results[nn] = Benchmarks(t['min'], t_2['min'], better)
    perftester.pp(results)

And here’s the output:

{'     2': Benchmarks(extend=1.531e-06, extend_2=1.358e-06, better='extend_2'),
 '     5': Benchmarks(extend=2.185e-06, extend_2=1.739e-06, better='extend_2'),
 '    10': Benchmarks(extend=3.524e-06, extend_2=2.308e-06, better='extend_2'),
 '   100': Benchmarks(extend=2.513e-05, extend_2=1.288e-05, better='extend_2'),
 '  1000': Benchmarks(extend=0.0002717, extend_2=0.0001432, better='extend_2'),
 ' 10000': Benchmarks(extend=0.002942, extend_2=0.001435, better='extend_2')}

The new version, based on a list comprehension, is faster for every n we tried, and the gain generally grows with n. For n=10000, extend_2() is about two times faster than the original extend().
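To put numbers on this gain, we can divide the two minimum times for each n. The short calculation below only reuses the values from the output above; the names min_times and speedups are illustrative.

```python
# Minimum per-call times (in seconds) copied from the output above:
# n -> (extend, extend_2)
min_times = {
    2: (1.531e-06, 1.358e-06),
    5: (2.185e-06, 1.739e-06),
    10: (3.524e-06, 2.308e-06),
    100: (2.513e-05, 1.288e-05),
    1000: (0.0002717, 0.0001432),
    10_000: (0.002942, 0.001435),
}

# Ratio above 1 means extend_2() was faster than extend().
speedups = {n: round(t / t_2, 2) for n, (t, t_2) in min_times.items()}
for n, ratio in speedups.items():
    print(f"n={n:>6}: extend_2 is {ratio:.2f}x faster")
```

The ratio goes from a little above 1 for n=2 to about 2 for n=10000, though it is not strictly increasing from one n to the next.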

We should now benchmark the functions for lists of different lengths, but my aim here is not to compare these two functions but to show you how to simply benchmark functions using perftester. Thus, I will leave these additional benchmarks for you as an exercise.

Advanced usage

perftester functions use default settings, which are usually what we need. Sometimes, we may wish to change Number or Repeat, as we did above, choosing Number based on the number of operations to be done by a function. Sometimes we may want to change Repeat, too. When the benchmark I am conducting is important, I usually increase both Number and Repeat. I do so also when I get small differences; I increase these two arguments to make the benchmarks more stable.

If you want to use the same Number and Repeat for all the benchmarks, you do not have to do that manually every time you run the perftester.time_benchmark() function. You can change it once, in the perftester.config object, which regulates the behavior of the perftester functions.

To do so, it’s enough to run the following:

perftester.config.set_defaults("time", Number=1_000_000, Repeat=10)

This changes the default values of Number and Repeat for all functions to be benchmarked. These defaults are not used when you provide Number or Repeat (or both) directly in the call to perftester.time_benchmark().

The above command changed the defaults for every function to be benchmarked. You can do so also for a particular function. For instance:

perftester.config.set(foo, "time", Number=1_000_000, Repeat=10)

will change Number and Repeat for function foo() – this function must have been defined before. So, you cannot change settings for a function that has not been defined yet.

You can also change the defaults for only one of the two arguments:

perftester.config.set_defaults("time", Number=1_000_000)
perftester.config.set(foo, "time", Number=1_000)

The argument you did not change will simply keep its default setting.

As already mentioned, relative benchmarks are performed against the performance of an empty function, stored as perftester.config.benchmark_function. Such an approach makes sense, as this function represents the overhead cost of calling a function. So, the remaining execution time was spent on doing what the benchmarked function was intended to do.

Sometimes, you may want to replace this empty function with another one; the relative benchmarks will then be done against the performance of that function. It’s simple to do: you can overwrite perftester.config.benchmark_function with another function; for example:

def foo():
    return [i for i in range(10)]

perftester.config.benchmark_function = foo

You can learn more about this topic from the documentation file below, from perftester’s repository:

perftester/benchmarking_against_another_function.md at implement-profiling-decorator ·…

Conclusion

Benchmarking Python functions and other callables with perftester is easy. Actually, it’s easier than with timeit. It is enough to call the perftester.time_benchmark() function, whose API is as simple and intuitive as it can be. The only thing to remember is that the arguments Number and Repeat start with upper-case letters; the same goes for the Func argument, but you will rarely use it as a keyword argument, as it’s the first argument of perftester.time_benchmark() and provides the function to be benchmarked. So, of the two calls below, the former will be far more frequent:

# Rather this:
t = perftester.time_benchmark(
    extender.extend,
    [1, 1, 4, 4, 'a', 'a'],
    3
)
# than this (positional arguments cannot follow a keyword argument,
# so x and n have to be passed by keyword, too):
t = perftester.time_benchmark(
    Func=extender.extend,
    x=[1, 1, 4, 4, 'a', 'a'],
    n=3
)

This does not mean that perftester is generally simpler than the timeit functions timeit() and repeat(). While perftester makes it easier to benchmark the execution time of callables, timeit is easier when it comes to benchmarking code snippets formatted as strings, such as "[i**2 for i in range(1000)]". You can benchmark such snippets using perftester, but you would have to define a function that does what the snippet does. The benchmark would then measure not only the execution time of the snippet, but also the overhead of calling a function. Hence, when you have a code snippet to benchmark, you should choose the timeit module.
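For completeness, this is what benchmarking such a snippet with timeit can look like. It is a minimal sketch with arbitrarily chosen number and repeat; the division by number is there only to obtain per-execution times.

```python
import timeit

# Benchmark a code snippet given as a string; no function definition is needed.
snippet = "[i**2 for i in range(1000)]"
number = 1_000
totals = timeit.repeat(snippet, number=number, repeat=3)

# Each value in `totals` is the time of `number` executions, so divide
# to get the mean time of a single execution in each run.
per_run = [total / number for total in totals]
print(min(per_run))
```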


In terms of benchmarking callables, however, perftester shines. Its API is dedicated to this very scenario while timeit’s API is not. You can do it with timeit, but you need to define a zero-argument lambda. Compare these two benchmarks of our extend() function:

import timeit

import extender
import perftester

# perftester, using default settings
perftester.time_benchmark(extender.extend, [1, 1, 4, 4, 'a', 'a'], 3)
# timeit, using default settings
timeit.repeat(lambda: extender.extend([1, 1, 4, 4, 'a', 'a'], 3))

# perftester, changed settings
perftester.time_benchmark(
    extender.extend, [1, 1, 4, 4, 'a', 'a'], 3,
    Number=1000, Repeat=3,
)
# timeit, changed settings
timeit.repeat(
    lambda: extender.extend([1, 1, 4, 4, 'a', 'a'], 3),
    number=1000, repeat=3
)

The calls to perftester.time_benchmark() are more natural and much easier to understand at first glance. The necessity to use lambda in the timeit function makes this call much less readable.

One more thing. You may choose to import perftester using the full name, as I did in this article. But you can also do it as

import perftester as pt

as is used in the package’s repository. The full-name import is a little clearer but definitely longer, so choose whichever you prefer.

Be aware that when benchmarking, perftester uses the very same backend as timeit does, as the former in fact calls the latter. So the difference lies in the API only. This difference alone would not justify learning a new framework just to benchmark a function. But as I wrote above, perftester comes with much more than just benchmarking execution time, and this is why I believe you will not regret the time spent on learning this package – and so, on reading this article. You can benchmark not only time but also memory, and, most of all, the package enables you to write performance tests in terms of both time and memory. I decided to cover these different use cases in separate articles, in order to deal with one topic at a time – and to make learning easier.

Thank you for reading. I hope to deliver the next articles on perftester soon, and you will see that perftester offers something you’ve not seen before: a performance testing framework for Python callables.

In the meantime, you can use it for benchmarking time. If you want to read more about the package and its use, you can do so in perftester’s GitHub repository:

GitHub – nyggus/perftester: A lightweight Python package for performance testing of Python…

Appendix

On Func, Number and Repeat

You may wonder why these arguments (of which Number and Repeat are keyword arguments) start with upper-case letters. It’s a fair question, as this seems unPythonic; I answer it in this appendix, based on the explanation you can find in the package’s repository.

This approach minimizes the risk that the function to be benchmarked has arguments with the same names as those of the perftester function you’re planning to use. In such a case, you would have to pass the same keyword argument twice in one call, and this would mean a SyntaxError. Out there, in the Python code base, there are quite a few functions that have an argument named func, number or repeat. But far fewer – if any – functions have an argument named Func, Number or Repeat. And this is exactly why perftester arguments start with upper-case letters.
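The name clash is easy to reproduce without perftester. In the sketch below, bench() is a hypothetical stand-in for a benchmarking function and repeat_value() is a made-up function to be benchmarked; the clashing call is kept in a string and passed to compile(), because a module containing it directly would not even load.

```python
def repeat_value(value, number):
    """Function to be benchmarked; note its own argument named `number`."""
    return [value] * number


def bench(func, *args, Number=3, **kwargs):
    """Hypothetical wrapper: call func Number times, return the last result."""
    result = None
    for _ in range(Number):
        result = func(*args, **kwargs)
    return result


# If the wrapper's setting were also called `number`, the call would have to
# repeat the keyword, and Python refuses to even compile such a call.
clashing_call = "bench(repeat_value, value=1, number=3, number=1000)"
try:
    compile(clashing_call, "<example>", "eval")
    clash_error = None
except SyntaxError as exc:
    clash_error = exc
print(type(clash_error).__name__)  # SyntaxError

# With an upper-case setting there is no clash: `number` goes to
# repeat_value(), `Number` stays with the wrapper.
print(bench(repeat_value, value=1, number=3, Number=10))  # [1, 1, 1]
```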

If, nevertheless, a function does have an argument named Func, Number or Repeat, there’s a solution. You can create a functools.partial() object from it and benchmark that object instead. You can read more about functools.partial() here:

functools – Higher-order functions and operations on callable objects

Below, you will find an example. Imagine you have a function foo() whose arguments are Number and Repeat. To use perftester.time_benchmark(), you need to do the following:

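from functools import partial

import perftester

def foo(Number, Repeat):
    return [Number] * Repeat

# Bind foo()'s own Number and Repeat first...
foo_partial = partial(foo, Number=20.5, Repeat=100)
# ...so that these Number and Repeat go to perftester.time_benchmark().
perftester.time_benchmark(foo_partial, Number=1000, Repeat=10)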

First of all, however, you should not use Number and Repeat as argument names in foo(). I showed this solution because you may find yourself in a situation in which you want to benchmark a function that does so, written by someone else; better safe than sorry.

functools.partial() turns out to be a very useful solution in quite a few other use cases, not only this one. So, it’s good to know this function anyway.

Resources

Benchmarking Python code with timeit

functools – Higher-order functions and operations on callable objects

GitHub – nyggus/perftester: A lightweight Python package for performance testing of Python…

GitHub – nyggus/rounder: Python package for rounding floats and complex numbers in complex Python…

rounder: Rounding Numbers in Complex Python Objects

A Guide to Python Comprehensions

Written By

Marcin Kozak


Benchmark, Data Science, Performance, Performance Testing, Python Programming


URL: https://towardsdatascience.com/benchmarking-python-functions-the-easy-way-perftester-77f75596bc81/