
A Survey of Intermediate Python Features

I’ve been writing Python for about 9 years now. In that time, I’ve stumbled on a lot of language features that confused me at first or that I wish I had learned about earlier; now that several friends of mine are learning Python, I figured I’d give a brief overview of some of the major language features that I consider to be the next steps after the basics. (The basics are the type of stuff you’d see in Automate the Boring Stuff: primitives, containers, control flow, functions, classes, I/O, etc.)

My goal here isn’t to provide an exhaustive discussion of any of these features, but rather to point out that they exist, demonstrate one or two of the most common use cases, and clear up common misconceptions. I won’t cover features that are common to most programming languages unless I think there’s something important or substantially different about how they’re used in Python.

if __name__ == "__main__":

This is a common stumbling block for new Python programmers when looking at someone else’s code. Many open source projects, in their main file, will have

if __name__ == "__main__":
    ...

__name__ is a global string variable that indicates whether the file is being run as the top-level script (in which case __name__ is equal to "__main__") or as an imported module (in which case __name__ is just the module’s name).

Usually, it’s standard to structure a .py file to have only imports, globals, and function definitions at the top level, and have this check at the very end of the file to determine whether a main() function (or anything besides definitions) should be executed. If you’re publishing any code publicly, it’s usually good to include this so it’s easier for other people to reuse your code without triggering a whole bunch of side effects when they import what you’ve written.
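As a minimal sketch (the main function and its name are a convention I’m assuming here, not something Python requires), a typical file might be laid out like this:

GREETING = "hello"  # module-level constant

def greet(name):
    return f"{GREETING}, {name}!"

def main():
    # all of the file's side effects live here
    print(greet("world"))

if __name__ == "__main__":
    main()

Anyone who imports this file gets GREETING, greet, and main without the print ever running.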

Sentinel Values

A sentinel value (also known as a flag value) is a unique value to which we attach some kind of special meaning. For example, in C, strings end with the null character \0. There are a couple reasons to dislike flag arguments (Martin Fowler criticizes them here; there’s also boolean blindness) and a couple workarounds, but if you need a flag or sentinel, you should know how to implement it.

Besides True and False, usually None is the first choice for a sentinel because it suffices most of the time. However, there are two reasons this might be a bad choice: first, you’re assigning extra meaning to None, which might be a code smell. Second, you might need to work over arbitrary values, including None. The best way to create a sentinel is just:

sentinel_value = object()

object() creates a unique object that has its own hash and will never be equal to anything else (unless you monkey patch the __eq__ method, but that’s on you). You can create any number of sentinels this way and they’ll all be treated as different values.
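As a quick sketch, two sentinels created this way are distinct from each other, from None, and from everything else:

MISSING = object()
DEFAULT = object()

print(MISSING == DEFAULT)  # False
print(MISSING == None)     # False
print(MISSING is MISSING)  # True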

Strings

Formatting Strings

Python’s philosophy is that there should be one clear, unambiguous way to accomplish a goal. This is mostly true except for string formatting, for which there are no less than ten thousand ways to accomplish the same thing. The most modern way to do it is with f-strings:

i = 10
f"the value of i is {i}"

This outputs the value of i is 10. There are a handful of other techniques, like the str.format() method and regular string concatenation, but f-strings are the most performant and easiest to read.

It’s also worth mentioning that if you want to print the value of a variable quickly, you can print f"{x=}", which will print x= followed by the value of x. For example, if I have a list l = [1, 2, 3], then printing f"{l=}" prints l=[1, 2, 3].

Concatenating Strings

It’s also important to avoid concatenating many strings with +. Because strings are immutable, writing s = s1 + s2 + s3 + ... + sn or using + or += in a loop builds a brand-new string at every step, which gets quadratically expensive; it’s better to collect the pieces and use ''.join(strings):

strings = []
for i in range(10):
    strings.append(str(i))
string = ''.join(strings)

Booleans

Comparison Chaining

Often, we have one value x and want to ensure that it falls within a specific range: that is, we want a < x (or a <= x) and x < b (or x <= b). Python allows us to chain this comparison by writing a < x < b (replacing < with <= as needed). This essentially simplifies to a < x and x < b. It’s also important to note that if we write something like

a < f() < b

then f will only be called once, rather than twice.

However, comparison chaining introduces some weird edge cases. For example, the following two statements

  • (False == False) in [False]
  • False == (False in [False])

are both False, as we would expect. However, False == False in [False] is True, because it simplifies to False == False and False in [False]. Both expressions on either side of and are True, so the whole thing is True; when we parenthesize this statement as we did in the two examples above, they cease to be chained expressions and just evaluate to True in [False] and False == True, which are both False.

Truthy and Falsy

Truthy and falsy sound like words only children would ever use, so naturally computer scientists use them all the time. Truthiness is a property of an object that determines whether it evaluates to True or False when it’s (implicitly) cast to a boolean. For example, the empty string "", None, 0, and empty collections like lists [] and dictionaries {} will evaluate to False when cast to booleans. Falsy values tend to be identity elements: a + 0 = a for any number a, s + "" = s for any string s, and so on. (None is the odd one out here because it doesn’t have any operations defined on it, but it gives off a “falsy vibe”, so that’s what it is.) Non-empty strings and collections are truthy, as are nonzero numbers and so on.

Newcomers to Python often write things like:

if len(mylist) == 0:
    ...

However, if and while statements implicitly cast their conditions to booleans, so we can simplify this to

if mylist:
    ...

No need to even call bool()!

There are three important notes about truthiness:

  1. A truthy value is not necessarily equal to True and a falsy value is not necessarily equal to False: "" == False and [] == False are both False, as is 2 == True. (The one wrinkle is that bool is a subclass of int, so 1 == True and 0 == False actually are True. Python might play fast and loose with its types, but at least it’s not JavaScript.) However, you can cast to a boolean explicitly: bool(1) == True and bool("") == False are both True.
  2. Many functions like any and all just check for truthiness instead of requiring values to be exactly True or False.
  3. New Python classes are truthy by default: unless a class defines a __bool__ method (or a __len__ method that can return 0), its instances always evaluate to True. By overriding the __bool__ method, you can set the truth value for a class you’ve written. For example, if I have a collection that implements len(), a reasonable __bool__ override might be:
def __bool__(self):
    return bool(len(self))

Consider Using in Instead of or

Suppose I have code like this:

if x == 1 or x == 3 or x == 4:
    ...

It’s usually neater to just write

if x in (1, 3, 4):
    ...

Indexing

Python allows you to get slices when indexing. Suppose we have a list l = list(range(10)). Then

  • l[start:] gets everything whose index i satisfies start <= i, so l[5:] evaluates to [5, 6, 7, 8, 9]. Note that the item at index start is included.
  • l[:stop] gets everything whose index i satisfies i < stop, so l[:5] evaluates to [0, 1, 2, 3, 4]. Note that the item at index stop is excluded.
  • l[start:stop] gets everything whose index i satisfies start <= i < stop, so l[2:6] evaluates to [2, 3, 4, 5].
  • l[start:stop:step] gets everything whose index is one of start, start + step, start + 2 * step, and so on, stopping before the index reaches stop. You can also omit some of the entries here: l[::2] gets every 2nd element (so those with even indices), l[::3] gets every 3rd element, l[1::2] gets every 2nd element starting from l[1] (so those with odd indices), etc.
  • Negative list indices, like l[-1], index from the right, so l[-1] gets the last element, l[-2] gets the item second from the last, and so on. You can combine this with slicing to get l[-5:-2], which gets the fifth-, fourth-, and third- to last elements: [5, 6, 7]
  • Finally, you can also assign to a slice. For example, the documentation points out that l[len(l):] = [x] is equivalent to l.append(x).

(If you think this is complicated, wait until you try numpy!)
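Putting the rules above together in a quick sketch:

l = list(range(10))

print(l[5:])     # [5, 6, 7, 8, 9]
print(l[:5])     # [0, 1, 2, 3, 4]
print(l[2:6])    # [2, 3, 4, 5]
print(l[::2])    # [0, 2, 4, 6, 8]
print(l[-5:-2])  # [5, 6, 7]

l[len(l):] = [10]  # same as l.append(10)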

Looping, Iterables, and Iterators

Any item that we can iterate over with a for loop in Python is called an iterable. When I write a for loop like

for i in range(10):
    ...

Python does a handful of things. It evaluates range(10) to get a range object back. Python then calls the __iter__ method of that range object, which produces something called an iterator. Iterators are basically a stream of data that let us loop over an iterable object once. The for loop then assigns i = iterator.__next__() and executes the code in the body of the for loop, then repeats this process over and over until iterator.__next__() raises a StopIteration exception to indicate that it’s done producing values, at which point the loop ends.
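To make that concrete, here’s a rough sketch of what the loop above expands to, using the iter and next builtins (which call __iter__ and __next__ under the hood):

iterator = iter(range(10))
while True:
    try:
        i = next(iterator)
    except StopIteration:
        break
    ...  # the body of the for loop runs here with i bound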

It’s worth mentioning that this is why

for element in mylist:
    element = ...

fails to mutate the values in mylist but

for i, element in enumerate(mylist):
    mylist[i] = ...

succeeds.

We usually only deal with iterators when we’re designing them ourselves, but we deal with iterables all the time. Functions like range, zip, map, and filter return iterables, and collections like lists are also iterables.

Mutating a Collection with a Loop

This shows why the following code attempting to reassign the values of a list with a for loop fails to change the list:

for item in mylist:
    item = ...

Here, the loop variable item is just a name bound to the current element; reassigning item rebinds the name without touching the list. You can still access item’s methods and fields and mutate item internally, but changing what’s in the list requires operating on the list itself. Looping over each index and assigning into the list by index accomplishes this:

for i in range(len(mylist)):
    mylist[i] = ...

Generators

Lists are more or less the standard way to store sequential data in Python; however, there are several problems with lists:

  • They require us to know every element that has to be in the list
  • They require us to store all the list data in memory
  • They always hold a specific number of elements at any given time

Generators are an alternative to lists that don’t have any of these pitfalls. Generators are also iterables, but rather than storing a pre-computed collection of elements in memory like a list does, a generator generates each element “on demand” in sequence; this is called lazy evaluation.

Generators are particularly useful for computing things which are only needed on an “on-demand” basis and usually only used in sequence, like the elements of range objects.

We can write our own generator like this:

def generator_function():
    i = 0
    while True:
        yield i
        i += 2

g = generator_function()

Here, generator_function is a function that produces a generator object, and g is an instance of a generator object produced by generator_function. From here, if we repeatedly call g.__next__(), it will produce 0, 2, 4, and so on ad infinitum. You can think of a generator function like a function where the return statement is replaced by yield. Every time the generator’s __next__ method is called, it runs until it hits the yield statement, hands back that value, and then pauses until it’s called again, at which point it resumes from where it left off. In this case, that means the second time we call g.__next__(), the line i += 2 is the first one evaluated, and execution continues around the while loop until it hits yield again. Generators can also pull from other collections, read from a file, or do pretty much anything a regular function can do; they’re very flexible.

Generators can’t be indexed, because indexing would require the generator to call .__next__ several times, which produces side effects. It would also mean that, to index elements that had already been computed, the generator would need a mechanism for storing them, which might defeat the memory-saving purpose of having a generator in the first place. Of course, you can still repeatedly pull elements out of a generator by calling its .__next__ method and saving the results in a list, or simply call the list function on the generator, but the latter will cause your program to hang if the generator never stops generating!

If you want the generator to stop generating, just return from the generator function (or let execution fall off the end); Python raises StopIteration on your behalf. (Don’t raise StopIteration yourself inside a generator: since PEP 479, that gets converted into a RuntimeError.)

def generator_function():
    i = 0
    while True:
        yield i
        i += 2
        if i > 10:
            return

No, this doesn’t mean that you need to wrap all your generators in try/except: Python already internally checks for the StopIteration exception while looping, but it doesn’t check for it if you’re just calling the .__next__ method raw; in that case you’ll have to catch the exception yourself.

Because of the prevalence of iterables, most of Python’s standard library functions assume their inputs are only iterables (rather than something with more functionality) and produce iterables as outputs. This is both more memory efficient and more flexible than just using lists for everything, which was common in Python 2. There are also other kinds of iterables like sequences and so on. (range objects are actually sequences, and that’s why you can index them.)

Enumerating

Frequently in Python, we want to loop over a collection of n-tuples. The most common situation involves the fantastic standard library enumerate function, which allows us to keep track of the index and associated element of the collection we’re looping over. If I have a list like mylist = ['a', 'b', 'c'], then in the code

for i in enumerate(mylist):
    ...

the variable i will range over the tuples (0, "a"), (1, "b"), and (2, "c"). Similarly, I might have a dictionary

mydict = {
    "x": 3,
    "y": 2,
    "z": 1,
}

Looping over mydict.items() will produce a similar result: ("x", 3), ("y", 2), and ("z", 1). This is unfortunately kind of a pain if I want to refer to i[0] and i[1] individually rather than the tuple itself. I could write

for i in mytuples:
    first, second = i
    ...

but Python allows us to unpack in the loop itself:

for i, element in enumerate(mylist):
    ...

for key, value in mydict.items():
    ...

This can be extended to three, four, or more variables, depending on how many values your iterable produces at each step.

Comprehensions

If you have experience with functional programming, you may be familiar with functions like map and filter. Map essentially takes a function and a list and applies that function to every element in the list. Filter takes a predicate and a list and removes all items from the list that don’t satisfy the predicate. These functions are very useful, but in Python are usually overshadowed by much more “Pythonic” expressions: list, set, and dictionary comprehensions and generator expressions.

map as a List Comprehension

To see where list comprehensions come in handy, let’s look at the naive way to create a list that just includes the first five even numbers:

evens = []
for i in range(5):
    evens.append(i * 2)

This produces the list [0, 2, 4, 6, 8]. However, there’s a more elegant way to do this:

evens = [i * 2 for i in range(5)]

This is much more compact and easy to read (once you get used to it). It’s meant to mimic mathematical set-builder notation, and is also an expression rather than a statement. We can also incorporate ternary expressions:

odd_or_even = ["even" if i % 2 == 0 else "odd" for i in range(5)]

This is essentially the preferred way to do map; the general form of a basic list comprehension is

mylist = [expression(x) for x in iterable]

You can also construct the iterable using a nested list comprehension, but that can get ugly quickly.

Because str.join() takes an iterable as its argument, we can clean up the string concatenation code we had earlier from this:

strings = []
for i in range(10):
    strings.append(str(i))
string = ''.join(strings)

to this:

string = ''.join(str(i) for i in range(10))

filter as a List Comprehension

Python has a similar alternative to filter, too:

mylist = [x for x in range(10) if x % 2 == 0]

This also generates the list [0, 2, 4, 6, 8]. Notice that when if x % 2 == 0 is false, nothing gets added to the list: there’s no need to specify some value for when the condition is False. These two kinds of list comprehensions can be combined to accomplish a map and a filter in a single expression!

Dictionary and Set Comprehensions

Two other kinds of comprehensions I’d like to talk about are set comprehensions and dictionary comprehensions. As the names imply, these are the equivalent of list comprehensions for sets and dictionaries.

For sets, comprehensions work exactly the same as for lists, except with curly braces, and with the added bonus that sets don’t allow duplicates:

myset = {expression(i) for i in iterable}

This can be an easy way to screen out duplicates.

For dictionaries, the syntax is slightly different:

mydict = {key_expression(i) : value_expression(i) for i in iterable}

Don’t forget the : or you might accidentally create a set comprehension instead of a dict comprehension!

Generator Expressions

Generator expressions are the way I most often create generators. They’re essentially the same as a list or set comprehension, except they use ( ) instead of [ ] or { }. Importantly, unlike lists, sets, or dictionaries, this generator expression is still evaluated lazily. This also directly produces a generator, rather than a generator function.
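For example, here’s a small sketch: the expression below builds a generator of squares without ever materializing a list, and when a generator expression is the sole argument to a function (like sum), you can drop the extra parentheses:

squares = (i * i for i in range(1_000_000))

print(next(squares))                  # 0
print(next(squares))                  # 1
print(sum(i * i for i in range(10)))  # 285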

You might notice that there doesn’t appear to be a way to do tuple comprehensions, and that’s correct. Instead, you can just cast:

mytuple = tuple(expression(x) for x in iterable)

Control Flow

with … as …

Often, we want to handle an object that needs some kind of “setup” and some kind of “teardown”. Common examples are files or other kinds of streams: we want to open the file and read or write to it, but we have to close it at the end or else we’re squatting on important resources like memory and file locks, any changes we write might not be properly saved, and so on. Usually, this means we need to remember to call close(), typically in the finally part of a try/finally statement. Doing this a lot results in a lot of boilerplate.

The with statement is the best practice for handling objects like this, and essentially replaces try/finally. Rather than writing

f = open("path/to/file")
try:
    ...
finally:
    f.close()

We can instead write

with open("path/to/file") as f:
    ...

You can also manage multiple objects at the same time:

with expression1 as thing1, expression2 as thing2:
    ...

We can do this with any context manager objects, meaning objects that have an __enter__ and an __exit__ method defined.
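As a rough sketch of that protocol, here’s a minimal (hypothetical) Timer context manager you could use in a with statement:

import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self  # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False  # don't suppress exceptions raised in the block

with Timer() as t:
    sum(range(1_000_000))
print(t.elapsed)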

match

Python provides the match statement for structural pattern matching, another common and powerful feature from functional programming. Despite the similarities to switch statements in languages with C-style syntax, Python’s match is not a switch statement, and treating them the same may introduce unwanted side effects and mutation.

The use cases for match involve controlling flow based on the structure and format of data, rather than just by value. This is one of the topics that would be hard to give a full tutorial of here, but I can direct you to a solid tutorial. I can also provide an example: if I were designing a simple command line shell, it might look something like

command = input()

match command.split():
    case ["quit"] | ["exit"]:
        ...
    case ["update"]:
        ...
    case ["run", program_name]:
        ...
    case ["upgrade", *package_names]:
        ...
    case _:
        ...

Let’s break down a couple of these cases:

  • The ["quit"] | ["exit"] case will match both the values ["quit"] and ["exit"], with the pipe | functioning as an or.
  • The ["update"] case will match exactly the value ["update"].
  • The ["run", program_name] case will match a two-element list with "run" as the first element and anything as the second element; that second value will be treated as a variable named program_name in the body of that case.
  • The ["upgrade", *package_names]: case will match a list with "upgrade" as the first element and any number of additional elements, which will be bundled together in a list called package_names.
  • The _ case is a wildcard that matches anything; case _: is to match what default: is to a switch.

You can also match positional attributes and stuff like abstract syntax trees (I assume this is how you would make tooling like parsers and linters with Python).

Ternary

There are many cases where a variable’s value is determined by a conditional, like

if cond:
    x = value1
else:
    x = value2

This chunk of code works, but it’s a couple lines for something relatively simple and programmers from other languages might think it’s a bit weird that x is defined in a smaller scope than where it will eventually be used. Fortunately, Python allows us to rewrite this into a single line, like so:

x = value1 if cond else value2

This code does more or less the same thing, but in addition to being more aesthetically pleasing, it differs in one crucial aspect: the first chunk of code is made up of statements, while the second chunk of code is one expression. There are many places an expression can go that a statement can’t, like in a list comprehension, so ternary is not only more concise, but also more flexible.

for ... else

Sometimes, we need to break out of a loop. However, we might also want to execute a piece of code if and only if the loop ran to completion without breaking. This functionality is provided by for ... else: the else block runs only when the loop finishes without hitting a break.

print("Please input three odd integers.")
numbers = []
for i in range(3):
    numbers.append(int(input()))

for n in numbers:
    if n % 2 == 0:
        break
else:
    print("You inputted an even integer!")

In most cases this control flow can be accomplished differently, and there aren’t many cases where you absolutely must run a piece of code only when the loop doesn’t break, but it’s worth recognizing in case you see it in the wild, and it can be simpler than other control flow.

The Walrus Operator

The := operator is a controversial inclusion from Python 3.8, affectionately called the walrus operator. The difference between x = y and x := y is that the former is a statement and the latter is an expression that evaluates to y.

There are several places you might want to use the walrus. Most generally, you want the walrus when you need to use a piece of data in a conditional statement like if or while and you need to save that data for use in the body of that conditional. For example, this is good in places where you have a function that behaves somewhat like __next__:

while (data := file.read(64)) != '':
    ...  # do something with data

while (item := myqueue.pop()) > 0:
    ...  # do something with item

Or when you have a function that you don’t need or want to compute twice, and don’t want to keep the result around for longer than you need:

if (result := expensive_function(x)) > 0.5:
    ...  # do something with result

A word of warning: don’t get clever with the walrus. For example, you might be tempted to write

while len(mystring := mystring[:-1]) > 10:
    pass

This is often abuse of the walrus. I don’t think this makes code that’s more elegant or easier to read; it often leads to large conditionals and empty bodies and it can be annoying to add things to the loop body later on.

Functions

Positional-Only Parameters

In Python, you can treat most regular parameters as keyword parameters: if I have a function like

def foo(a, b):
    print(a, b)

Then if I write foo(b = 2, a = 1), I’ll see 1 2 printed on the console. This means that by default, arguments can be passed by position or by keyword.

There are a couple reasons that one might prefer to have positional-only parameters, so Python gives us a way to require arguments to be passed by position and disallow them from being passed as keywords. When defining a function, we can add a / as its own parameter to denote that all parameters to the left of the / are positional-only, and a * to denote that all parameters to the right of the * are keyword-only. Parameters between the / and * can be either. For example, in the following function

def function(arg1, arg2, /, arg3, arg4, *, arg5, arg6):
    ...

arg1 and arg2 must be passed by position, arg3 and arg4 can be passed by position or by keyword, and arg5 and arg6 must be passed by keyword. All of these arguments are required.
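As a quick sketch, here’s which calls to the function above are and aren’t allowed:

function(1, 2, 3, 4, arg5=5, arg6=6)            # fine
function(1, 2, arg3=3, arg4=4, arg5=5, arg6=6)  # also fine
# function(arg1=1, arg2=2, arg3=3, arg4=4, arg5=5, arg6=6)  # TypeError: arg1 and arg2 are positional-only
# function(1, 2, 3, 4, 5, 6)                                # TypeError: arg5 and arg6 are keyword-only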

*args, **kwargs, and the Unpacking Operators

These are two very common function parameters for accepting an unknown number of additional arguments. *args is used to take in additional optional positional arguments, which are collected into a tuple named args. **kwargs is used to take in additional optional keyword arguments, which are collected into a dictionary named kwargs. You can call these collections anything you want, and you might only want to use one at once, but the names args and kwargs are used by convention.
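Here’s a small sketch of what this looks like from the definition side (report is just a made-up example function):

def report(*args, **kwargs):
    print(args)    # a tuple of the extra positional arguments
    print(kwargs)  # a dict of the extra keyword arguments

report(1, 2, 3, name="python", year=1991)
# prints (1, 2, 3) and then {'name': 'python', 'year': 1991}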

It’s also worth explaining what the * and ** do. These are both operators; * unpacks an iterable (like a list or tuple) and ** unpacks a dict. So if I have a function like

def f(s1, s2):
    return s1 + s2

values = ["a", "b"]

f(*values)

would work as if I had called f("a", "b"). ** does the same for a dictionary with keyword arguments:

def f(s1, s2 = ""):
    return s1 + s2

values = {"s1": "a", "s2": "b"}

f(**values)

Note that this works with required arguments (s1) and optional keyword arguments (s2).

First-Class Functions

Python has first-class functions, which is a fancy way of saying that functions are also objects. This allows Python to support a wide range of useful features from functional programming: you can pass functions in as arguments to other functions, store functions in data structures like lists or dictionaries, you can access a function’s properties with ., and so on. For example, in game design, it’s common to map key presses to actions to allow players to remap their control scheme. This is typically done using the command pattern which might look (very roughly) like this:

def pause():
    ...

def jump():
    ...

def interact():
    ...

# maps a player's key press to an action
# (ESC, SPACE, and E stand in for key-code constants defined elsewhere, e.g. by your game library)
actions = {
    ESC : pause,
    SPACE : jump,
    E : interact,
}

Note that we don’t use the parentheses (so we write pause instead of pause()) to indicate that we are referring to the function itself, rather than the result of calling that function. If player input gets stored in a variable like key_press, you could call any of these functions with something like actions[key_press](). Note again that we place the parentheses after actions[key_press], because actions[key_press] will evaluate to a function.

However, first-class function support means that Python doesn’t support function overloading, because Python would have to allow for two objects with the same name. Fortunately, Python’s dynamic typing eliminates the need for function overloading in many cases (for example, I often don’t need to write one function to take an int and another to take a float like I’d have to in C++ or C#), but there are corner cases where we do want function overloading. There are a handful of ways to achieve the same thing, like being clever with keyword arguments or just doing some extra checks in the function body, but you’ll have to judge for yourself what’s appropriate for each case.

Lambdas

A lambda is an “anonymous function”, meaning a function that doesn’t have to be defined with a name or even given a name later.

We can define a lambda like so:

lambda: "hello!"

lambda x: ' '.join(list(x))

lambda x, y: abs(len(x) - len(y))

lambda x, y = 0: x + y

So the syntax for a lambda is generally:

lambda arguments: expression(arguments)

Lambdas are useful because you can define them wherever you can put an expression, so you can define them as an argument to a function or as an element of a collection. They’re also useful when you want a one-time function but don’t want it hanging around polluting your namespace.

Of course, we could still bind a lambda to a name if we wanted to. The following ways for defining f are essentially equivalent:

f = lambda x: x + 3

def f(x):
    return x + 3

Unfortunately, Python’s lambdas are somewhat weak for a few main reasons:

  • Lambdas have to be only one expression, which considerably limits how much you can put into a lambda in practice
  • Lambdas can’t be asynchronous, so you need to define an async function instead
  • You can’t add type hints to lambdas, because lambdas and type hints have conflicting syntax (they need : to mean two different things)

Closures

Closures are an important feature of functional programming and other languages that have first-class functions. Essentially, it’s a way to have a function that also comes with an environment, usually as a way for a function to have static variables or as a way to dynamically create functions with different behavior, like in this example:

def add_function_factory(y):
    def f(x):
        return x + y
    return f

Here, the add_function_factory “encloses” the function f. Importantly, calling add_function_factory will create a new function f each time, so we can generate functions with different behavior:

add_three = add_function_factory(3)
append_exclamation = add_function_factory("!")

However, if you use closures to mutate variables, you might run into an issue. This works:

def closure():
    x = "hello"
    def inner():
        print(x)
    return inner
closure()()

But this doesn’t:

def closure():
    x = 0
    def inner():
        print(x)
        x += 1
    return inner
closure()()

The second chunk of code raises an UnboundLocalError before print is even called! Because inner assigns to x, Python treats x as a local variable of inner, and that local hasn’t been assigned yet when print runs. To fix this, you need to declare x as a nonlocal variable:

def closure():
    x = "hello"
    def inner():
        nonlocal x
        print(x)
        x += 1
    return inner

f = closure()

Now, calling f() repeatedly produces 0, 1, 2, and so on rather than an error.

Mutable Default Arguments

This is a Python quirk everyone encounters sooner or later. Suppose we had a function with an argument that had a mutable default value, like this one:

def func(x = []):
    x.append(0)
    print(x)

Most Python newcomers expect that if we run this function three times, it’ll simply print out [0] each time. However, we actually get [0], [0, 0], [0,0,0] instead. This is because Python retains the list between calls; x is only initialized to [] when the def statement is evaluated, and every subsequent call to func that uses the default value of x will use that same list and retain any mutations done to it between calls. If you’re familiar with pointers, you’re familiar with this kind of behavior.

This behavior can be advantageous, but most people encounter it for the first time when they just wanted to initialize an argument to a fresh default value. This is a circumstance where a sentinel value is probably the right choice:

sentinel = object()
def func(x = sentinel):
    if x is sentinel:
        x = []
    ...

This will set x to [] if and only if no value was specified for x, which is exactly what we wanted. (The only problem is that we have a sentinel hanging around in our namespace, but this basically never matters and is easily fixed with modules.)

Decorators

Decorators are a common way to modify the behavior of a function by wrapping it in another function. There are often cases where we want to wrap many different functions with the same functionality: logging, caching, mocking, and so on. We can create a decorator like so:

def decorator(func):
    ...  # runs once, when the decorator is applied to a function
    def wrapper(*args, **kwargs):
        # behavior before decorated function runs
        result = func(*args, **kwargs) # function being decorated
        # behavior after decorated function runs
        return result
    return wrapper

@decorator
def myfunction(arg1, arg2):
    ...

This is the basic way to create a decorator that takes no arguments. The decorator takes in the function func as an argument and decorates it with the wrapper, which can do something before func runs, runs func and saves its result (don’t forget that part!), does something after func runs, and finally returns the value of func (or whatever you want the decorated func to return). When def myfunction runs, myfunction will be automatically wrapped; from then on, any call to myfunction will execute wrapper, with myfunction taking the place of func. Any arguments passed to myfunction will be passed into wrapper; because we don’t know in advance what kinds of arguments decorated functions will take, it’s a good idea to always use *args, **kwargs for the wrapper and pass those in to the decorated func.

What if we need our decorator to take arguments? In that case, we have to create another level of closures and write a decorator factory:

def decorator_factory(arguments):
    ... # do something with arguments
    def decorator(func):
        ... # do something with arguments, and the function
        def wrapper(*args, **kwargs):
            # behavior before decorated function runs
            result = func(*args, **kwargs) # function being decorated
            # behavior after decorated function runs
            return result
        return wrapper
    return decorator

This does essentially the same thing a decorator does, except it allows us additional space to specify how decorator will construct the wrapper. For example, if I’m writing a custom logging decorator that needs to take in a logfile argument:

import os

def log(logfile):
    # make sure the directory that will hold the logfile exists
    os.makedirs(os.path.dirname(logfile), exist_ok=True)
    def decorator(func):
        def wrapper(*args, **kwargs):
            with open(logfile, "a") as f:
                # write args and kwargs to f
                result = func(*args, **kwargs)
                # write result and any other info to f
            return result
        return wrapper
    return decorator

We then decorate a function like so:

@log("path/to/log/file")
def myfunction(arg1, arg2, arg3):
    ...

It’s also important to see that you can apply multiple decorators:

@decorator2
@decorator1
def func():
    ...

decorator1 will be applied first, and then decorator2 will wrap the wrapper that decorator1 applies to func.

It’s important to note that this is distinct from the standard Gang of Four decorator pattern in object oriented programming. Python’s decorators are more akin to static attribute-oriented programming, whereas the standard decorator pattern is applied dynamically at runtime. (Also, if the @ seems reminiscent of Javadoc, it’s supposed to be!)

Memoization and Caching

There are often situations where you have a pure function that is expensive to call. This is often the case in dynamic programming problems, where the naive solution will have exponential time complexity, but by recognizing that repeated recursive calls are made with the same arguments we can optimize the function to run in polynomial time by caching results.

The standard solution is just caching the function’s results, like so:

cache = dict()

def function(arguments):
    if arguments not in cache:

        ... # compute solution

        cache[arguments] = solution

    return cache[arguments]

It’s a bit tedious to put this code everywhere, and we often need to hide the cache in an object field or a closure or something. Thankfully, Python has a simple, thread-safe @cache decorator that does this for us:

from functools import cache

@cache
def function(arguments):
    ... # compute solution
    return solution
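For a concrete (if well-worn) sketch, the naive recursive Fibonacci function goes from exponential to linear time just by adding the decorator:

from functools import cache

@cache
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))  # returns immediately; the uncached version would effectively never finish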

Type Hints

Python has a very strange relationship to typing. Its dynamic typing is one of its greatest strengths or greatest weaknesses depending on who you ask, and is a strange transition for people coming from languages like Rust, Java, C#, or C++.

For clarity and for third party tooling, you can annotate functions and variables with type hints; the standard library’s typing module provides helpers like Callable and Optional. For example, if I have functions like

def list_and_reverse(s):
    return list(reversed(s))

def line(a, b = 0):
    return lambda x: a * x + b

I could annotate them like so:

from typing import Callable

def list_and_reverse(s: str) -> list[str]:
    return list(reversed(s))

def line(a: float, b: float = 0) -> Callable[[float], float]:
    return lambda x: a * x + b

This indicates that list_and_reverse is a function taking a string and returning a list of strings, and that line is a function taking a required float and an optional float and returning a function (more broadly, a Callable) that takes a float and returns a float. Functions that don’t return anything can have their return type annotated as None.

You can also do this for variables and class fields. For example, I may instantiate a variable that I intend to represent a dictionary from strings to ints or None, for example, and might want to document that at declaration, even if I don’t have keys or values for the dictionary yet. I could do this by writing

from typing import Optional
myvalues: dict[str, Optional[int]] = dict()

There are a handful of important types worth noting:

  • Any represents data that can be anything, and is the most useful type. It’s also the type you use when you’re feeling lazy and want your type checker to stop yelling at you.
  • Union represents data that can be one of a handful of other types. Union[str, int] represents data that can be either a string or integer.
  • Optional is a special case of Union that is a union of exactly one type with None. Optional[int] equals Union[int, None].

It’s worth mentioning that just importing a class is enough to use it as a type annotation, like from collections import Counter lets you use Counter as an annotation. You can also create type aliases, like vector = list[float]. You could then annotate a function like so:

def dot(v1: vector, v2: vector) -> float:
    return sum(a * b for a, b in zip(v1, v2, strict=True))

It’s important to mention that type annotations, except in a handful of cases like dataclasses, don’t change what your program does: it’s possible for you to completely ignore the type annotations, and Python won’t care. Without any kind of third party tools, it’s basically just a fancy docstring. However, third party tools like mypy will care, and will require your program to agree with the static analysis it performs. This can be really good for hunting bugs and ensuring style compliance.

Classes

vars

Python essentially treats objects like dictionaries, where the field names are the keys, methods are just first-class functions as values, and classes can be thought of roughly as dictionary factories. For an object x, vars(x) returns the __dict__ attribute of x (which can also be accessed with x.__dict__), which is essentially a way to directly interact with x as a dictionary. (There are a couple kinds of objects that don’t have __dict__ attributes, usually objects with __slots__.) You can also call vars() without an argument, which returns the local variables, equivalent to calling locals(). (There’s a related function dir that does something similar, but its behavior is a bit complicated and it’s best to just read the documentation.) All of this is touching on Python’s object model, which is a topic also best read about in the documentation.
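A quick sketch of what that looks like, with a made-up Point class:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(vars(p))     # {'x': 1, 'y': 2}
vars(p)["x"] = 10  # same effect as p.x = 10
print(p.x)         # 10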

Monkey Patching

There are many cases where we might want to dynamically alter the behavior of a class at runtime: maybe we want to fit one class to be used with a different interface, maybe we need a function to behave differently, or maybe we want a class to keep track of additional data. For whatever reason, the adapter pattern, inheritance, or composition might be undesirable or infeasible. Because Python’s classes are mutable, we can dynamically add or replace fields and methods like so:

import types

class MyClass:

    def say_something(self):
        print("Hello")

instance = MyClass()

def say_goodbye(self):
    self.things_said += 1
    print("Goodbye")

instance.things_said = 0
# bind say_goodbye to this particular instance so it receives self like a normal method
instance.say_something = types.MethodType(say_goodbye, instance)

Monkey patching is one of the forbidden techniques of Python programming, and leads to situations where two instances of the same class do different things or have different fields while still having the same type. For this reason, it’s important to use it in limited situations and take time to contrast it with the aforementioned alternatives.

hasattr, getattr, setattr

If you’ve been monkey patching objects, you might wonder whether a given object x has a particular field foo defined. One option might be to just check if x.foo evaluates to something in a try/except block, but there’s an easier way: the hasattr function takes an object x and a string and determines if x has a field with a name matching that string. hasattr(x, "foo") will be True if x.foo exists and False otherwise.

Note that the field name, "foo", is a string. There are other cases where we have might have the name of a field as a string, and want to actually access that field to get or set its value. In this case, we can use getattr to get the field of x with the associated name: getattr(x, "foo") is equivalent to x.foo. Similarly, we can use setattr to set the value of a field: setattr(x, "foo", bar) is equivalent to x.foo = bar.

Static Methods with @staticmethod

The @staticmethod decorator changes a method to no longer require a self argument, so it can be called without an instance of the class.

def say_goodbye():
    print("Goodbye")

class MyClass:

    @staticmethod
    def say_hello():
        print("Hello!")

    say_goodbye = staticmethod(say_goodbye)

Here, we’re using staticmethod both as a decorator to declare say_hello as static, and as a regular function to turn the existing say_goodbye function into a static method. We can then call MyClass.say_hello() and MyClass.say_goodbye().

Multiple Constructors and @classmethod

At a glance, it appears that Python doesn’t support multiple constructors because every class can have only one __init__ method. Without any other options, the only workarounds would be a lot of keyword arguments (which gets messy fast if we require specific combinations of arguments to be passed) or the builder pattern. (The builder pattern is actually my favorite design pattern, but it’s a lot of code and Python doesn’t let us make the constructor private to enforce using the builder.)

Fortunately, we have a solution with the @classmethod decorator, which essentially creates a static method that is aware of the class that it’s attached to. Among other things, this is useful for creating multiple constructors for the same class. For example, suppose I’m creating a goblin NPC for a video game:

import random

class Goblin:

    def __init__(self, level, health, attack):
        self.level = level
        self.health = health
        self.attack = attack

    @classmethod
    def random_constructor(cls):
        level = random.randint(1, 100)
        return cls(level, level * 5, level)

Here, we’re using @classmethod to create a second constructor, random_constructor, for the Goblin class. Here, cls will be the class the method is being called on, so cls(level, level * 5, level) is equivalent to Goblin(level, level * 5, level). Everything still ultimately funnels through the same base constructor __init__, but we can now expose as many named constructors as we like.

An immediate question this raises is: why use this instead of a static method? We just as easily could’ve written

@staticmethod
def random_constructor():
    level = random.randint(1, 100)
    return Goblin(level, level * 5, level)

The reason is that this will still create a Goblin, even if we call it from a subclass: it doesn’t respect inheritance. We’re also out of luck if we want to define a class method on an abstract base class. If I were to create a subclass like this:

class GoblinWarrior(Goblin):
    ...

goblin_warrior = GoblinWarrior.random_constructor()

Then we just get back a Goblin, rather than a GoblinWarrior. Using @classmethod avoids this, because the cls parameter is determined based on which class the method is being called through: when we call Goblin.random_constructor(), cls is Goblin, and when we call GoblinWarrior.random_constructor(), cls is GoblinWarrior. By calling cls like a function, we can choose our constructor dynamically. We can also call static methods (including other class methods) dynamically using cls.method!

Dataclasses

Dataclasses are a very neat feature that allows for automatically generating common methods for classes whose primary purpose is bundling relevant fields or data together. For example, we may have a class like

class GameItem:
    name = "..."
    description = "..."
    gold_value = ...
    weight = ...
    ...

Where name and description are strings, gold_value is an int, and weight is a float, there may be other fields, and so on. The point is that this class represents a handful of data bundled together, and there’s a whole bunch of basic functionality we’d like to implement: an __init__ method that just takes all of these fields as parameters and assigns them accordingly, an __eq__ method so we can determine when two instances are equal with ==, a __repr__ method to create a string representation of the object, and so on.

These tend to be boilerplate functions which are somewhat tedious to write. To save time, Python has a @dataclass decorator:

from dataclasses import dataclass

@dataclass
class GameItem:
    name: str
    description: str
    gold_value: int
    weight: float

These few lines of code are equivalent to the entirety of the first definition, method definitions and all: roughly 20 lines of code (or more) have been compressed into only 8. Dataclasses come with a whole handful of other parameters to handle boilerplate like ordering and freezing, or you can simply override and add additional functionality to the dataclass yourself.
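To see the generated methods in action, a small sketch:

sword = GameItem("Sword", "A pointy implement.", 100, 7.5)
print(sword)  # GameItem(name='Sword', description='A pointy implement.', gold_value=100, weight=7.5)
print(sword == GameItem("Sword", "A pointy implement.", 100, 7.5))  # True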

Abstract Base Classes

If you’ve taken a class in object-oriented programming, you’ll know there are often times when we want to abstract one or more classes’ behavior into a single abstract class to prevent code duplication and allow polymorphism. The problem is that we often don’t want an abstract class to be instantiable. Although Python doesn’t allow other typical guards like public and private, it does allow abstract base classes. Simply have your class inherit from abc.ABC and mark the methods subclasses must implement with @abstractmethod, like so:

from abc import ABC, abstractmethod

class MyAbstractClass(ABC):

    @abstractmethod
    def do_something(self):
        ...

As long as it has at least one abstract method, MyAbstractClass won’t be instantiable, and subclasses must override every abstract method before they can be instantiated.

Enum Classes

Python doesn’t have dedicated syntax for enums, but the standard library’s enum module fills the gap. Making an enum is as simple as

from enum import Enum

class Direction(Enum):
    North = 0
    East = 1
    South = 2
    West = 3

Direction will then behave a bit like a collection: if I set x = Direction.North, the expression x in Direction will be True. You can also check Direction.North.name to get "North", and Direction.North.value to get 0.

Mocking with @patch and MagicMock

In software development, we often want to test code in isolation without interacting with other components that that code might be coupled with. For example, the function that I’m testing might call another function that triggers a system call, API call, side effect, or expensive computation that we don’t want to occur during our tests. We also might want to verify that our function behaves correctly when another function returns a weird value, raises an exception, or otherwise misbehaves.

The typical solution is mocking the functions or objects that our code depends on. Fortunately, Python has an excellent builtin library for mocking. Suppose we want to test a function like this:

def foo():
    result = expensive_function(val1, val2, arg = val3)
    ...

We can mock this call like so:

import unittest
from unittest.mock import patch

class Test_foo(unittest.TestCase):

    @patch('expensive_function')
    def test_makes_correct_calls(self, expensive_function_mock):
        expensive_function_mock.return_value = ... # whatever we need the expensive_function to return
        foo()
        expensive_function_mock.assert_called_with(val1, val2, arg = val3)

This is a standard testing setup until we see the @patch decorator. This decorator takes a string naming the function or object to replace (in real code, this is usually its full import path, like 'mymodule.expensive_function') and swaps it out for a mock, which it passes into test_makes_correct_calls as the parameter we’ve chosen to call expensive_function_mock.

This expensive_function_mock is a MagicMock object, and they are absolutely magic. Because expensive_function is callable (it’s a function), we can set its return_value to be whatever we choose. We then run foo and assert that by the end of foo’s run, our expensive_function_mock was called with the arguments we expected, which we type in exactly as we expected them to be passed to expensive_function: (val1, val2, arg = val3). If that assertion holds, our test passes; if it doesn’t, our test will fail, exactly as it would if a regular unittest.TestCase assertion failed. Magic!

Let’s look at a more complicated function to test:

def save_content(url, username, password):
    api_login(username, password) # API call
    content = download(url) # downloading stuff
    if validate_result(content): # could be an expensive function call!
        write_file(content) # side effect
    else:
        handle_problem()

We might not want to make API calls that potentially eat into a rate limit or spam a service, eat up bandwidth downloading stuff, call an expensive function that makes our test run longer and take up more resources, or write new files that we then have to clean up. Even if we do want to do one or more of these things, we probably don’t want to do them all at once. We also might want to force validate_result to fail so we can see what happens in the else case, but it might be prohibitively hard to get that result, and we might encounter issues if we’re calling handle_problem() in such a contrived scenario. So, let’s mock these calls in our test:

import unittest
from unittest.mock import patch

class Test_save_content(unittest.TestCase):

    @patch('handle_problem')
    @patch('write_file')
    @patch('validate_result')
    @patch('download')
    @patch('api_login')
    def test_file_writes_on_success(
        self,
        api_login_mock,
        download_mock,
        validate_result_mock,
        write_file_mock,
        handle_problem_mock
        ):

        mock_content = object()
        download_mock.return_value = mock_content
        validate_result_mock.return_value = True

        save_content("mock url", "mock username", "mock password")

        api_login_mock.assert_called_with("mock username", "mock password")
        download_mock.assert_called_with("mock url")
        validate_result_mock.assert_called_with(mock_content)
        write_file_mock.assert_called_with(mock_content)
        handle_problem_mock.assert_not_called()

Note that the patching and function arguments are in reverse order: remember that the bottom decorator gets evaluated first, and the top decorator gets evaluated last. In this test, we set the download_mock.return_value to be a sentinel that we check was passed into various functions later, and we’re also asserting that handle_problem_mock was not called at any point during the test.

Finally, if we’re testing something that requires dependency injection like this:

def foo(dependency):
    dependency.method(data)
    dependency.field += 1
    dependency["key"] = 200
    ...

We can just create our own MagicMock and pass that in directly:

import unittest
from unittest.mock import MagicMock

class Test_foo(unittest.TestCase):

    def test_foo_dependency_injection(self):

        dependency_mock = MagicMock()
        dependency_mock.field = 0

        foo(dependency_mock)

        dependency_mock.method.assert_called_with(data)
        self.assertEqual(dependency_mock.field, 1)
        dependency_mock.__setitem__.assert_called_with("key", 200)

Note that MagicMock dynamically creates things like MagicMock.method when they’re first accessed; the fields and methods of a MagicMock are created on the fly, and are themselves MagicMocks. You can configure MagicMocks very flexibly: there are many different flavors of MagicMock, and MagicMock comes with default magic methods (like MagicMock.__int__) that you can set yourself. For a full survey of the features, I highly recommend you read the documentation directly.

Other Useful Libraries

An essential part of learning any language is familiarizing oneself with the standard library. Python is no exception, and its standard library is particularly rich and robust. Now that most of the essential syntax and language features are out of the way, we can cover some of the more interesting features.

Builtins

Besides the data types I’ve already spoken about, I’d like to mention frozenset, which is an immutable variant of set that allows it to be hashed. Regular sets can’t be hashed, so to use a set as a dictionary key or as a member of another set, you need a frozenset.

There are also a handful of useful functions:

  • all(iterable), which returns True if all items in iterable are truthy and returns False otherwise
  • any(iterable), which returns True if any item in iterable is truthy and returns False otherwise
  • callable(object) returns True if object is callable and returns False otherwise
  • dir(object) returns a list of the object’s attribute names as strings (use vars or getattr to get the associated values)
  • getattr(object, name), setattr(object, name, value), and delattr(object, name) get, set, and delete the field of object with the specified name (given as a string).

As well as standard functions like abs, len, round, reversed, sorted, sum, and zip. (I hope you have already seen all of those.)

The collections Module

The collections module in the standard library is also excellent and has a lot of valuable functions and classes that are just barely too niche to be included by default. Here are some of the highlights:

  • Counter is a dictionary subclass that counts how often a value occurs; you can think of it like a Dict[Any, int]. You can pass an iterable to its constructor and it’ll work out how often each element appears. For example,

    Counter("hello")
    

    produces {"l": 2, "h": 1, "e": 1, "o": 1}. By default, indexing something not in the counter just returns 0. You can also add or subtract counters, or multiply them by an integer. (So they kind of work a little bit kinda like vectors.)

  • defaultdict is a dictionary where elements that aren’t in the dictionary can be defined to have a default value. You pass a default_factory function into the defaultdict’s constructor; when the defaultdict is queried for a key that isn’t in the dictionary, its value is set to default_factory() and returned (see the sketch after this list).
  • OrderedDict is a dictionary where the key-value pairs have an ordering
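As a small sketch of defaultdict, here’s the classic grouping idiom, where missing keys get a fresh empty list from the list factory:

from collections import defaultdict

grouped = defaultdict(list)  # default_factory is list, so missing keys start as []
for word in ["apple", "avocado", "banana", "blueberry", "cherry"]:
    grouped[word[0]].append(word)

print(grouped["a"])  # ['apple', 'avocado']
print(grouped["z"])  # [] -- created on first access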

The itertools Module

Finally, itertools is a great and fairly large standard library module with functions for manipulating iterables. They’re indispensable for hard interview questions and help with combinatorics-related problems. Several of the highlights are listed here, with a couple of them sketched after the list:

  • accumulate returns the sequence of partial sums of the given sequence
  • batched splits the iterable into batches of length n, where only the last batch is allowed to have length less than n
  • chain chains all the given iterables in order into a single iterable
  • combinations returns all the subsequences of the given iterable with length exactly r
  • combinations_with_replacement does the same, but allows the same element to be repeated within a subsequence
  • cycle cycles the given iterable so it repeats after it’s exhausted
  • pairwise returns all pairs of consecutive elements in the iterable
  • permutations returns all possible permutations of length n from the specified iterable
  • product returns the cartesian product of its argument iterables
  • repeat repeats the given argument n times
  • zip_longest zips together all the iterables and returns a new iterable with length equal to the length of the longest iterable. All the shorter iterables are padded with fillvalue
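A minimal sketch of a couple of these:

from itertools import chain, combinations, product

print(list(chain([1, 2], [3, 4])))      # [1, 2, 3, 4]
print(list(combinations("abc", 2)))     # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(list(product([0, 1], repeat=2)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]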

Fun fact: I once got an interview question that gave me two strings s1 and s2, where s2 was s1 after being scrambled and having a random character replaced at random with a new character not already in s1; my challenge was to find out which character had been replaced. My interviewer had never seen Counters before, and I used them to get the answer in one tenth of the usual time. (The follow-up question asked me to get the answer in constant space. The solution is a neat one-liner!)

Other Standard Library Modules

I’ve already discussed a lot of the highlights in Python’s standard library, and I think it’s worth recapping the big libraries for completeness’s sake and because there are some niche but powerful libraries included.

  • argparse for parsing command line arguments
  • asyncio for asynchronous I/O
  • csv for working with csv files
  • datetime for working with date and time data
  • email and smtplib for working with email
  • graphlib for topologically sorting graphs from discrete mathematics
  • json for reading and writing JSON data
  • os for methods for file system paths, accessing environment variables, ids, etc.
  • random for pseudorandom number generation
  • re for regular expressions
  • shutil for higher-level file operations
  • socket for low-level networking
  • tarfile and zipfile for working with archives
  • threading for making and managing threads and concurrent execution
  • time for working with time, sleeping, etc.

Third Party Libraries and Tooling

Every Python project I maintain uses pre-commit to ensure that all code that goes into the repo satisfies style and quality guidelines. They can be a bit annoying at first, but I swear by them. I also highly recommend investigating poetry as an alternative package manager to pip, because pip sometimes has issues managing Python installations. I have a template repository on GitHub that incorporates both these tools.

Major third party packages to know about are:

  • requests is the most common library for doing http requests, although I am partial to httpx, which is designed to have the same functionality and design as requests, but also supports making requests asynchronously.
  • beautifulsoup is great for HTML parsing and webscraping
  • pygame for developing games
  • numpy, pandas, and scipy are the standard suite for doing mathematical computations
  • scikit-learn, Keras, TensorFlow, and PyTorch are probably the most-used machine learning libraries in the world
  • nltk for natural language processing
  • matplotlib and seaborn for data visualization
  • FastAPI for designing website APIs and getting them off the ground quickly (it has mostly replaced Flask)
  • Django for doing heavy-duty fullstack website development and database management
  • pillow for image manipulation

Performance

Python’s performance is a deep rabbithole, so I’ll be brief: if you want performance, look elsewhere. Obviously Python won’t be blazingly fast because it’s an interpreted rather than a compiled language, but exactly what specifically incurs performance costs will probably be surprising and unintuitive. For example, basic things like storing magic numbers in variables may have significant performance costs. If you’d like a deep dive on why Python’s wall-clock performance is so bad, Jake Vanderplas made an excellent writeup here. (The TL;DR is that it mostly comes down to dynamic typing resulting in overhead on basic operations and objects being scattered across different locations in memory.)

If you absolutely must use Python, a handful of the aforementioned libraries like numpy and its derivatives use C APIs to sidestep Python’s inefficient object model, and each library has its own tricks to squeeze more performance that way (here’s sklearn’s), but the APIs aren’t always flexible enough to offload everything to C. If you’re absolutely married to Python and need a performant library, you should check out Python’s C APIs.

Advanced Topics and Further Reading

Python topics I’d consider advanced are mostly inner workings of the language that occasionally leak out: global variables, C APIs, disassembling, implementation details, __future__, and many more obscure topics. This will all differ based on your background experience and needs, and tends to be fairly niche.

If you’re still hungry for more, I’d recommend looking at Peter Norvig’s pytudes repository, a collection of studies in writing better Python.

This post is licensed under CC BY 4.0 by the author.