A Survey of Intermediate Python Features
I’ve been writing Python for about 9 years now. In that time, I’ve stumbled on a lot of language features that confused me at first or that I wish I had learned about earlier; now that several friends of mine are learning Python, I figured I’d give a brief overview of some of the major language features that I consider to be the next steps after the basics. (The basics are the type of stuff you’d see in Automate the Boring Stuff: primitives, containers, control flow, functions, classes, I/O, etc.)
My goal here isn’t to provide an exhaustive discussion of any of these features, but rather to just point out that they exist and demonstrate one or two of the most common use cases and clear up common misconceptions. I won’t cover features that are common to most programming languages unless I think there’s something important or substantially different about how they’re used in Python.
if __name__ == "__main__":
This is a common stumbling block for new Python programmers when looking at someone else’s code. Many open source projects, in their main file, will have
    if __name__ == "__main__":
        ...
__name__ is a global string variable that indicates whether the file is being run as the top-level script (in which case __name__ is equal to "__main__"), or as an imported module (in which case __name__ is just the module’s name).
It’s standard to structure a .py file to have only imports, globals, and function definitions at the top level, and have this check at the very end of the file to determine whether a main() function (or anything besides definitions) should be executed. If you’re publishing any code publicly, it’s usually good to include this so it’s easier for other people to reuse your code without triggering a whole bunch of side effects when they import what you’ve written.
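As a minimal sketch of that layout (the names greet and main are made up for illustration), a small module might look like this:

```python
def greet(name):
    """Return a greeting; safe to import and reuse from other modules."""
    return f"Hello, {name}!"


def main():
    # Side effects (printing, reading input, etc.) live here.
    print(greet("world"))


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Importing this file from another module gives access to greet without printing anything.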
Sentinel Values
A sentinel value (also known as a flag value) is a unique value to which we attach some kind of special meaning. For example, in C, strings end with the null character \0. There are a couple reasons to dislike flag arguments (Martin Fowler criticizes them here, there’s also boolean blindness) and a couple workarounds, but if you need a flag or sentinel, you should know how to implement it.
Besides True and False, usually None is the first choice for a sentinel because it suffices most of the time. However, there are two reasons this might be a bad choice: first, you’re assigning extra meaning to None, which might be a code smell. Second, you might need to work over arbitrary values including None. The best way to create a sentinel is just:
    sentinel_value = object()
object() creates a unique object that has its own hash and will never be equal to anything else (unless you monkey patch the __eq__ method, but that’s on you). You can create any number of sentinels this way and they’ll all be treated as different values.
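Here is one way a sentinel might be used, as a rough sketch: a function that needs to tell “no default was passed” apart from “the default is None”. The names _MISSING and first_item are hypothetical, chosen only for this example.

```python
_MISSING = object()  # hypothetical module-level sentinel


def first_item(items, default=_MISSING):
    """Return the first element of items, or default if items is empty."""
    for item in items:
        return item
    if default is _MISSING:
        # No default was supplied, so an empty input is an error.
        raise ValueError("items is empty and no default was given")
    return default
```

Note the comparison uses is rather than ==, since what matters is the sentinel’s identity. With this sketch, first_item([], default=None) returns None, while first_item([]) raises.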
Strings
Formatting Strings
Python’s philosophy is that there should be one clear, unambiguous way to accomplish a goal. This is mostly true except for string formatting, for which there are no less than ten thousand ways to accomplish the same thing. The most modern way to do it is with f-strings:
    i = 10
    f"the value of i is {i}"
This evaluates to the string: the value of i is 10. There are a handful of other techniques, like the str.format() method and regular string concatenation, but f-strings are generally the fastest and easiest to read.
It’s also worth mentioning that if you want to print the value of a variable quickly, you can print f"{x=}" (available since Python 3.8), which will print x= followed by the value of x. For example, if I have a list l = [1, 2, 3], then printing f"{l=}" prints l=[1, 2, 3].
Concatenating Strings
It’s also important to avoid concatenating many strings with +. Writing s = s1 + s2 + s3 + ... + sn or using + or += in a loop can cause performance issues, because strings are immutable and each concatenation builds a new string, so it’s better to use ''.join(strings):
    strings = []
    for i in range(10):
        strings.append(str(i))
    string = ''.join(strings)
Booleans
Comparison Chaining
Often, we have one value x and want to ensure that it falls within a specific range: that is, we want a < x (or a <= x) and x < b (or x <= b). Python allows us to chain these comparisons by writing a < x < b (replacing < with <= as needed). This is essentially equivalent to a < x and x < b. It’s also important to note that if we write something like
    a < f() < b
then f will only be called once, rather than twice.
However, comparison chaining introduces some weird edge cases. For example, the following two expressions
    (False == False) in [False]
    False == (False in [False])
are both False, as we would expect. However, False == False in [False] is True, because it is a chained comparison that expands to False == False and False in [False]. The expressions on both sides of and are True, so the whole thing is True; when we parenthesize the expression as we did in the two examples above, they cease to be chained comparisons and just evaluate to True in [False] and False == True, which are both False.
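To see the single-evaluation behavior concretely, here is a small sketch with a hypothetical function f that records each call:

```python
calls = []


def f():
    """Hypothetical function that records every call it receives."""
    calls.append("called")
    return 5


chained = 0 < f() < 10            # f() is evaluated once
calls_after_chained = len(calls)

unchained = 0 < f() and f() < 10  # f() is evaluated twice here
calls_after_unchained = len(calls) - calls_after_chained
```

Both expressions are true, but the chained form calls f once and the explicit and form calls it twice.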
Truthy and Falsy
Truthy and falsy sound like words only children would ever use, so naturally computer scientists use them all the time. Truthiness is a property of an object that determines whether it evaluates to True or False when it’s (implicitly) cast to a boolean. For example, the empty string "", None, 0, and empty collections like lists [] and dictionaries {} will evaluate to False when cast to booleans. Falsy values tend to be identity elements: a + 0 = a for any number a, s + "" = s for any string s, and so on. (None is the odd one out here because it doesn’t have any operations defined on it, but it gives off a “falsy vibe”, so that’s what it is.) Non-empty strings and collections are truthy, as are nonzero numbers and so on.
Newcomers to Python often write things like:
    if len(mylist) == 0:
        ...
However, if and while statements implicitly cast to booleans, and an empty list is falsy, so we can simplify this to
    if not mylist:
        ...
No need to even call bool()!
There are three important notes about truthiness:
- A truthy value is not necessarily equal to True and a falsy value is not necessarily equal to False: 2 == True is False, as is "" == False. (The integers 1 and 0 are the exceptions, because bool is a subclass of int: 1 == True and 0 == False are both True. Python might play fast and loose with its types, but at least it’s not JavaScript.) However, you can cast to a boolean explicitly: bool(2) == True and bool("") == False are both True.
- Many functions like any and all just check for truthiness instead of requiring values to be exactly True or False.
- Instances of new Python classes are truthy by default: if a class defines neither __bool__ nor __len__, Python treats all of its instances as true. By overriding the __bool__ method, you can set the truth value for a class you’ve written. For example, if I have a collection that implements len(), a reasonable __bool__ override might be:
    def __bool__(self):
        return bool(len(self))
(Python already falls back to __len__ when __bool__ is missing, so this override just makes the behavior explicit.)
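A short sketch of custom truthiness, using two hypothetical classes (Playlist and Plain) invented for this example:

```python
class Playlist:
    """Hypothetical collection used to illustrate custom truthiness."""

    def __init__(self, songs=None):
        self.songs = list(songs or [])

    def __len__(self):
        return len(self.songs)

    def __bool__(self):
        # An empty playlist is falsy, a non-empty one is truthy.
        return bool(len(self))


class Plain:
    """Defines neither __bool__ nor __len__, so instances are always truthy."""


empty = Playlist()
full = Playlist(["a", "b"])
plain = Plain()
```

With these definitions, if empty: skips its body, while if full: and if plain: both run theirs.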
Consider Using in Instead of or
Suppose I have code like this:
    if x == 1 or x == 3 or x == 4:
        ...
It’s usually neater to just write
    if x in (1, 3, 4):
        ...
Indexing
Python allows you to get slices when indexing. Suppose we have a list l = list(range(10)). Then:
- l[start:] gets everything whose index i satisfies start <= i, so l[5:] evaluates to [5, 6, 7, 8, 9]. Note that the item at index start is included.
- l[:stop] gets everything whose index i satisfies i < stop, so l[:5] evaluates to [0, 1, 2, 3, 4]. Note that the item at index stop is excluded.
- l[start:stop] gets everything whose index i satisfies start <= i < stop, so l[2:6] evaluates to [2, 3, 4, 5].
- l[start:stop:step] gets everything whose index is one of start, start + step, start + 2 * step, and so on, stopping before the index reaches stop. Importantly, you can also omit some of the entries here: l[::2] gets every 2nd element (so those with even indices), l[::3] gets every 3rd element, l[1::2] gets every 2nd element starting from l[1] (so those with odd indices), etc.
- Negative list indices, like l[-1], index from the right, so l[-1] gets the last element, l[-2] gets the item second from the last, and so on. You can combine this with slicing to get l[-5:-2], which gets the fifth-, fourth-, and third-to-last elements: [5, 6, 7].
- Finally, you can also assign to a slice. For example, the documentation points out that l[len(l):] = [x] is equivalent to l.append(x).
(If you think this is complicated, wait until you try numpy!)
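The slicing rules above can be checked directly. This is just a sketch using the same list l:

```python
l = list(range(10))

tail = l[5:]         # indices 5 and up
head = l[:5]         # indices below 5
middle = l[2:6]      # indices 2, 3, 4, 5
evens = l[::2]       # every 2nd element, starting at index 0
odds = l[1::2]       # every 2nd element, starting at index 1
near_end = l[-5:-2]  # fifth-, fourth-, and third-to-last elements

# Slice assignment: appending through an empty slice at the end of the list.
l[len(l):] = [10]
```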
Looping, Iterables, and Iterators
Any object that we can iterate over with a for loop in Python is called an iterable. When I write a for loop like
    for i in range(10):
        ...
Python does a handful of things. It evaluates range(10) to get a range object back. Python then calls iter() on that range object (which invokes its __iter__ method), producing something called an iterator. An iterator is basically a stream of data that lets us loop over an iterable object once. The for loop then assigns i = iterator.__next__() and executes the code in the body of the for loop, then repeats this process over and over until iterator.__next__() raises a StopIteration exception to indicate that it’s done producing values, at which point the loop ends.
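Spelled out by hand, the loop machinery described above looks roughly like the sketch below. manual_for is a made-up helper for illustration; the real for loop is implemented inside the interpreter.

```python
def manual_for(iterable, body):
    """Roughly what `for item in iterable: body(item)` does under the hood."""
    iterator = iter(iterable)      # calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)  # calls iterator.__next__()
        except StopIteration:
            break                  # the iterator is exhausted, so the loop ends
        body(item)


seen = []
manual_for(range(3), seen.append)
```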
It’s worth mentioning that this is why
    for element in mylist:
        element = ...
fails to mutate the values in mylist but
    for i, element in enumerate(mylist):
        mylist[i] = ...
succeeds.
We usually only deal with iterators when we’re designing them ourselves, but we deal with iterables all the time. Functions like range, zip, map, and filter return iterables (zip, map, and filter in fact return iterators, which are themselves iterable), and collections like lists are also iterables.
Mutating a Collection with a Loop
This shows why the following code attempting to reassign the values of a list with a for loop fails to change the list:
    for item in mylist:
        item = ...
Here, the loop variable item is just a name bound to the current element of the list; assigning to item rebinds that name and leaves the list untouched. You can still access item’s methods and fields and mutate item internally, but changing what’s in the list requires operating on the list itself. Looping over the indices and assigning through the list accomplishes this:
    for i in range(len(mylist)):
        mylist[i] = ...
Generators
Lists are more or less the standard way to store sequential data in Python; however, there are several problems with lists:
- They require us to know every element that has to be in the list
- They require us to store all the list data in memory
- They always hold a specific number of elements at any given time
Generators are an alternative to lists that don’t have any of these pitfalls. Generators are also iterables, but rather than storing a pre-computed collection of elements in memory like a list does, a generator generates each element “on demand” in sequence; this is called lazy evaluation.
Generators are particularly useful for computing things which are only needed on an “on-demand” basis and usually only used in sequence, like the elements of range objects.
We can write our own generator like this:
    def generator_function():
        i = 0
        while True:
            yield i
            i += 2

    g = generator_function()
Here, generator_function is a function that produces a generator object, and g is a generator object produced by generator_function. From here, if we repeatedly call g.__next__() (or, more idiomatically, next(g)), it will produce 0, 2, 4, and so on ad infinitum. You can think of a generator function like a function where the return statement is replaced by yield. Every time the generator’s __next__ method is called, it runs until it hits the yield statement, returns that value, and then pauses until it’s called again, at which point it runs from where it left off. In this case, that means the second time we call g.__next__(), the line i += 2 is the first one evaluated, and execution then continues around the while loop. Generators can also pull from other collections, read from a file, or do pretty much anything a regular function can do; they’re very flexible.
Generators can’t be indexed, because reaching a given element would require calling .__next__ several times, which advances the generator as a side effect. Indexing elements that have already been computed would also require a mechanism for storing them, which might defeat the memory-saving purpose of having a generator in the first place. Of course, you can repeatedly pull elements out of a generator by calling its .__next__ method and saving the results in a list. You can also simply call the list function on a generator, but this will cause your program to hang if the generator never stops generating!
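One way to take a bounded number of items from an endless generator is itertools.islice from the standard library. This is a sketch; the generator is renamed evens here purely for readability.

```python
from itertools import islice


def evens():
    """Infinite generator of even numbers, mirroring the example above."""
    i = 0
    while True:
        yield i
        i += 2


# list(evens()) would never finish; islice stops after five items.
first_five = list(islice(evens(), 5))
```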
If you want the generator to stop generating, just return from the generator function:
    def generator_function():
        i = 0
        while True:
            yield i
            i += 2
            if i > 10:
                return
When the function returns, Python raises StopIteration on the generator’s behalf. (Raising StopIteration yourself inside a generator used to work, but since Python 3.7 it is converted into a RuntimeError.) This doesn’t mean that you need to wrap all your generators in try/except: a for loop already checks for the StopIteration exception internally. Nothing checks for it if you’re just calling the .__next__ method raw, though; you’ll have to catch that exception yourself.
Because of the prevalence of iterables, most of Python’s standard library functions assume only that their inputs are iterables (rather than something with more functionality) and produce iterables as outputs. This is both more memory efficient and more flexible than just using lists for everything, which was common in Python 2. There are also more specialized kinds of iterables, like sequences. (range objects are actually sequences, and that’s why you can index them.)
Enumerating
Frequently in Python, we want to loop over a collection of n-tuples. The most common situation involves the fantastic standard library enumerate function, which allows us to keep track of the index and associated element of the collection we’re looping over. If I have a list like mylist = ['a', 'b', 'c'], then in the code
    for i in enumerate(mylist):
        ...
i will range over the tuples (0, "a"), (1, "b"), and (2, "c"). Similarly, I might have a dictionary
    mydict = {
        "x": 3,
        "y": 2,
        "z": 1,
    }
Looping over mydict.items() will produce a similar result: ("x", 3), ("y", 2), and ("z", 1). This is unfortunately kind of a pain if I want to refer to i[0] and i[1] individually rather than the tuple itself. I could write
    for i in mytuples:
        first, second = i
        ...
but Python allows us to unpack in the loop itself:
    for i, element in enumerate(mylist):
        ...

    for key, value in mydict.items():
        ...
This can be extended to three, four, or more variables, depending on how many your iterator produces.
Comprehensions
If you have experience with functional programming, you may be familiar with functions like map and filter. Map essentially takes a function and a list and applies that function to every element in the list. Filter takes a predicate and a list and removes all items from the list that don’t satisfy the predicate. These functions are very useful, but in Python are usually overshadowed by much more “Pythonic” expressions: list, set, and dictionary comprehensions and generator expressions.
map as a List Comprehension
To see where list comprehensions come in handy, let’s look at the naive way to create a list that just includes the first five even numbers:
    evens = []
    for i in range(5):
        evens.append(i * 2)
This produces the list [0, 2, 4, 6, 8]. However, there’s a more elegant way to do this:
    evens = [i * 2 for i in range(5)]
This is much more compact and easy to read (once you get used to it). It’s meant to mimic mathematical set-builder notation, and is also an expression rather than a statement. We can also incorporate ternary expressions:
    odd_or_even = ["even" if i % 2 == 0 else "odd" for i in range(5)]
This is essentially the preferred way to do map; the general form of a basic list comprehension is
    mylist = [expression(x) for x in iterable]
You can also construct the iterable using a nested list comprehension, but that can get ugly quickly.
Because str.join() takes an iterable as its argument, we can clean up the string concatenation code we had earlier from this:
    strings = []
    for i in range(10):
        strings.append(str(i))
    string = ''.join(strings)
to this:
    string = ''.join(str(i) for i in range(10))
filter as a List Comprehension
Python has a similar alternative to filter, too:
    mylist = [x for x in range(10) if x % 2 == 0]
This also generates the list [0, 2, 4, 6, 8]. Notice that when x % 2 == 0 is false, nothing gets added to the list: there’s no need to specify some value for when the condition is False. These two kinds of list comprehensions can be combined to accomplish a map and a filter in a single expression!
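For instance, a combined map-and-filter comprehension might look like the following sketch, shown next to the equivalent built-in map and filter calls for comparison:

```python
# Map and filter in one comprehension: square only the even numbers.
even_squares = [x * x for x in range(10) if x % 2 == 0]

# The same result using the built-in map and filter with lambdas.
same_thing = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, range(10))))
```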
Dictionary and Set Comprehensions
Two other kinds of comprehensions I’d like to talk about are set comprehensions and dictionary comprehensions. As the names imply, these are the equivalent of list comprehensions for sets and dictionaries.
For sets, comprehensions work exactly the same as they do for lists, with the added bonus that sets don’t allow duplicates:
    myset = {expression(i) for i in iterable}
This can be an easy way to screen out duplicates.
For dictionaries, the syntax is slightly different:
    mydict = {key_expression(i): value_expression(i) for i in iterable}
Don’t forget the : or you might accidentally create a set comprehension instead of a dict comprehension!
Generator Expressions
Generator expressions are the way I most often create generators. They’re essentially the same as a list or set comprehension, except they use ( ) instead of [ ] or { }. Importantly, unlike list, set, or dictionary comprehensions, a generator expression is evaluated lazily. It also directly produces a generator, rather than a generator function.
You might notice that there doesn’t appear to be a way to do tuple comprehensions, and that’s correct. Instead, you can just cast:
    mytuple = tuple(expression(x) for x in iterable)
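The laziness of generator expressions can be observed with a helper that records when it runs. This is a sketch; noisy_square is a hypothetical function written only for this demonstration.

```python
evaluated = []


def noisy_square(x):
    """Hypothetical helper that records each call so we can see laziness."""
    evaluated.append(x)
    return x * x


squares = (noisy_square(x) for x in range(5))  # no squares computed yet
calls_before = len(evaluated)

first = next(squares)                          # computes only the first element
calls_after_first = len(evaluated)

rest = list(squares)                           # drains the remaining elements
```

Swapping the parentheses for square brackets would call noisy_square five times immediately.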
Control Flow
with … as …
Often, we want to handle an object that needs some kind of “setup” and some kind of “teardown”. Common examples are files or other kinds of streams, where we want to open the file, read or write to it, and then close it at the end; otherwise we’re squatting on important resources like memory and file locks, any changes we write might not be properly saved, etc. Usually, this means we need to remember to call close(), usually in the finally part of a try/finally statement. Doing this a lot results in a lot of boilerplate.
The with statement is the best practice for handling objects like this, and essentially replaces try/finally. Rather than writing
    f = open("path/to/file")
    try:
        ...
    finally:
        f.close()
we can instead write
    with open("path/to/file") as f:
        ...
You can also manage multiple objects at the same time:
    with expression1 as thing1, expression2 as thing2:
        ...
We can do this with any context manager object, meaning any object that has an __enter__ and an __exit__ method defined.
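A minimal sketch of a hand-written context manager follows. The class name Tracked and its log argument are invented for illustration; the point is only to show when __enter__ and __exit__ run.

```python
class Tracked:
    """Hypothetical context manager that records setup and teardown."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")   # setup
        return self                # this is what `as` binds to

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append("exit")    # teardown, runs even if the body raised
        return False               # do not swallow exceptions


events = []
try:
    with Tracked(events) as t:
        events.append("body")
        raise ValueError("boom")
except ValueError:
    events.append("caught")
```

The teardown step is recorded before the exception reaches the except clause, which is the same guarantee try/finally gives.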
match
Python (3.10 and later) provides the match statement for structural pattern matching, another common and powerful feature from functional programming. Despite the similarities to switch statements in languages with C-style syntax, Python’s match is not a switch statement, and treating them the same may introduce unwanted side effects: for example, a bare name in a case pattern binds (and possibly overwrites) a variable of that name rather than comparing against it.
The use cases for match are for control flow based on the structure and format of data, rather than just by value. This is one of the topics that would be hard to give a full tutorial of here, but I can direct you to a solid tutorial. I can also provide an example: if I were designing a simple command line shell, it might look something like
    command = input()
    match command.split():
        case ["quit"] | ["exit"]:
            ...
        case ["update"]:
            ...
        case ["run", program_name]:
            ...
        case ["upgrade", *package_names]:
            ...
        case _:
            ...
Let’s break down a couple of these cases:
- The ["quit"] | ["exit"] case will match both the values ["quit"] and ["exit"], with the pipe | functioning as an or.
- The ["update"] case will match exactly the value ["update"].
- The ["run", program_name] case will match a two-element list with "run" as the first element and anything as the second element; that second value will be bound to a variable named program_name in the body of that case.
- The ["upgrade", *package_names] case will match a list with "upgrade" as the first element and any number of additional elements, which will be bundled together in a list called package_names.
- The _ case matches anything; case _: is to match what default: is to a switch. The _ is a wildcard.
You can also match positional attributes and stuff like abstract syntax trees (I assume this is how you would make tooling like parsers and linters with Python).
Ternary
There are many cases where a variable’s value is determined by a conditional, like
    if cond:
        x = value1
    else:
        x = value2
This chunk of code works, but it’s a couple lines for something relatively simple and programmers from other languages might think it’s a bit weird that x is defined in a smaller scope than where it will eventually be used. Fortunately, Python allows us to rewrite this into a single line, like so:
    x = value1 if cond else value2
This code does more or less the same thing, but in addition to being more aesthetically pleasing it is different in one crucial aspect: the first chunk of code is made up of several statements, while in the second the whole choice is made by one expression, value1 if cond else value2. There are many places that an expression can go that a statement can’t, like in a list comprehension, so the ternary is not only more concise, but also more flexible.
for ... else
Sometimes, we need to break out of a loop. However, we might also want to execute a piece of code if and only if the loop finishes without hitting a break. This functionality is provided by for ... else:
    print("Please input three odd integers.")
    numbers = []
    for i in range(3):
        numbers.append(int(input()))

    for n in numbers:
        if n % 2 == 0:
            print("You inputted an even integer!")
            break
    else:
        print("All three integers are odd.")
In most cases this control flow can be accomplished differently, and there aren’t many cases where you absolutely must execute a piece of code only when the loop never breaks, but it’s worth recognizing in case you see it in the wild, and it might be simpler than other control flow.
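A common use of for ... else is a search loop, where the else branch handles the “nothing found” case. The sketch below uses a hypothetical function named first_even:

```python
def first_even(numbers):
    """Return a message about the first even number, if there is one."""
    for n in numbers:
        if n % 2 == 0:
            message = f"found even number {n}"
            break
    else:
        # Runs only when the loop finished without hitting break.
        message = "no even numbers"
    return message
```

The else branch also runs when the input is empty, since the loop then finishes without ever reaching break.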
The Walrus Operator
The := operator is a controversial addition from Python 3.8, affectionately called the walrus operator. The difference between x = y and x := y is that the former is a statement and the latter is an expression that evaluates to y.
There are several places you might want to use the walrus. Most generally, you want the walrus when you need to use a piece of data in the condition of an if or while statement and you need to save that data for use in the body. For example, this is good in places where you have a function that behaves somewhat like __next__:
    while (data := file.read(64)) != '':
        ...  # do something with data

    while (item := myqueue.pop()) > 0:
        ...  # do something with item
Or when you have a function that you don’t need or want to compute twice, and don’t want to keep the result around for longer than you need:
    if (result := expensive_function(x)) > 0.5:
        ...  # do something with result
A word of warning: don’t get clever with the walrus. For example, you might be tempted to write
    while len(mystring := mystring[:-1]) > 10:
        pass
This is often abuse of the walrus. I don’t think this makes code that’s more elegant or easier to read; it often leads to large conditionals and empty bodies and it can be annoying to add things to the loop body later on.
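A runnable version of the read-in-chunks pattern might look like the sketch below, where an in-memory io.StringIO stands in for an open text file:

```python
import io

stream = io.StringIO("abcdefghij")  # stand-in for an open text file
chunks = []

# read(4) returns "" at end of file, which ends the loop.
while (data := stream.read(4)) != "":
    chunks.append(data)
```

Without the walrus, the same loop needs either a read before the loop plus another at the end of the body, or a while True with an explicit break.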
Functions
Positional-Only Parameters
In Python, you can treat most regular parameters as keyword parameters: if I have a function like
    def foo(a, b):
        print(a, b)
then if I write foo(b = 2, a = 1), I’ll see 1 2 printed on the console. This means that by default, arguments can be passed by position or by keyword.
There are a couple reasons that one might prefer to have positional-only parameters, so Python (3.8 and later) gives us a way to require arguments to be passed by position and disallow them being passed as keywords. When defining a function, we can add a bare / to the parameter list to denote that all parameters to the left of the / are positional-only, and a bare * to denote that all parameters to the right of the * are keyword-only. Parameters between the / and * can be passed either way. For example, in the following function
    def function(arg1, arg2, /, arg3, arg4, *, arg5, arg6):
        ...
arg1 and arg2 must be passed by position, arg3 and arg4 can be passed by position or by keyword, and arg5 and arg6 must be passed by keyword. All of these arguments are required.
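The rules for / and * can be exercised with a few calls. In this sketch the function body just returns its arguments as a tuple so the calls can be compared:

```python
def function(arg1, arg2, /, arg3, arg4, *, arg5, arg6):
    return (arg1, arg2, arg3, arg4, arg5, arg6)


# arg3 and arg4 may go either way; arg5 and arg6 must be keywords.
ok_positional = function(1, 2, 3, 4, arg5=5, arg6=6)
ok_keyword = function(1, 2, arg4=4, arg3=3, arg5=5, arg6=6)

try:
    function(arg1=1, arg2=2, arg3=3, arg4=4, arg5=5, arg6=6)
    positional_only_enforced = False
except TypeError:
    # arg1 and arg2 sit left of the /, so keywords are rejected.
    positional_only_enforced = True

try:
    function(1, 2, 3, 4, 5, 6)
    keyword_only_enforced = False
except TypeError:
    # arg5 and arg6 sit right of the *, so positions are rejected.
    keyword_only_enforced = True
```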
*args, **kwargs, and the Unpacking Operators
These are two very common function parameters for accepting an unknown number of additional arguments. *args is used to take in additional positional arguments as a tuple. That tuple will be called args. **kwargs is used to take in additional keyword arguments as a dictionary. That dictionary will be called kwargs. You can call these collections anything you want, and you might only want to use one at once, but the names args and kwargs are used by convention.
It’s also worth explaining what the * and ** do. These are both unpacking operators; * unpacks a list (or any other iterable) and ** unpacks a dict. So if I have a function like
    def f(s1, s2):
        return s1 + s2

    values = ["a", "b"]
    f(*values)
then the last line works as if I had called f("a", "b"). ** does the same for a dictionary with keyword arguments:
    def f(s1, s2 = ""):
        return s1 + s2

    values = {"s1": "a", "s2": "b"}
    f(**values)
Note that this works with required arguments (s1) and optional keyword arguments (s2).
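On the definition side, packing works as in the following sketch. describe is a hypothetical function that simply hands back what it received:

```python
def describe(*args, **kwargs):
    """Hypothetical function that accepts any mix of arguments."""
    return args, kwargs


args, kwargs = describe(1, 2, name="x")

# Unpacking at the call site mirrors packing in the definition.
positional = [1, 2]
keywords = {"name": "x"}
same_args, same_kwargs = describe(*positional, **keywords)
```

Note that args arrives as a tuple even when the caller unpacked a list.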
First-Class Functions
Python has first-class functions, which is a fancy way of saying that functions are also objects. This allows Python to support a wide range of useful features from functional programming: you can pass functions in as arguments to other functions, store functions in data structures like lists or dictionaries, access a function’s properties with ., and so on. For example, in game design, it’s common to map key presses to actions to allow players to remap their control scheme. This is typically done using the command pattern which might look (very roughly) like this:
    def pause():
        ...

    def jump():
        ...

    def interact():
        ...

    # maps a player's key press to an action
    actions = {
        ESC: pause,
        SPACE: jump,
        E: interact,
    }
Note that we don’t use the parentheses (so we write pause instead of pause()) to indicate that we are referring to the function itself, rather than the result of calling that function. If player input gets stored in a variable like key_press, you could call any of these functions with something like actions[key_press](). Note again that we place the parentheses after actions[key_press], because actions[key_press] will evaluate to a function.
However, Python doesn’t support function overloading: a function is just an object bound to a name, so defining a second function with the same name simply rebinds the name and replaces the first. Fortunately, Python’s dynamic typing eliminates the need for function overloading in many cases (for example, I often don’t need to write one function to take an int and another to take a float like I’d have to in C++ or C#), but there are corner cases where we do want function overloading. There are a handful of ways to achieve the same thing, like being clever with keyword arguments or just doing some extra checks in the function body, but you’ll have to judge for yourself what’s appropriate for each case.
Lambdas
A lambda is an “anonymous function”, meaning a function that doesn’t have to be defined with a name or even given a name later.
We can define a lambda like so:
    lambda: "hello!"
    lambda x: ' '.join(list(x))
    lambda x, y: abs(len(x) - len(y))
    lambda x, y = 0: x + y
So the syntax for a lambda is generally:
    lambda arguments: expression(arguments)
Lambdas are useful because you can define them wherever you can put an expression, so you can define them as an argument to a function or an element of a collection. They’re also useful when you want a one-time function but don’t want it hanging around polluting your namespace.
Of course, we could still bind a lambda to a name if we wanted to. The following ways of defining f are essentially equivalent:
    f = lambda x: x + 3

    def f(x):
        return x + 3
Unfortunately, Python’s lambdas are somewhat weak for a few reasons:
- Lambdas have to be only one expression, which considerably limits how much you can put into a lambda in practice
- Lambdas can’t be asynchronous, so you need to define an async function instead
- You can’t add type hints to lambdas, because lambdas and type hints have conflicting syntax (they need : to mean two different things)
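Two typical places lambdas show up, sketched with made-up data: as a throwaway key function, and as values in a dictionary.

```python
words = ["banana", "fig", "cherry", "kiwi"]

# Sort by length; the lambda exists only for the duration of this call.
by_length = sorted(words, key=lambda w: len(w))

# Lambdas can also live in data structures, like any other function object.
operations = {"double": lambda x: x * 2, "negate": lambda x: -x}
results = [operations[name](5) for name in ("double", "negate")]
```

In the first case key=len would do the same job; the lambda form is shown because it generalizes to any expression of w.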
Closures
Closures are an important feature of functional programming and other languages that have first-class functions. Essentially, it’s a way to have a function that also comes with an environment, usually as a way for a function to have static variables or as a way to dynamically create functions with different behavior, like in this example:
    def add_function_factory(y):
        def f(x):
            return x + y
        return f
Here, add_function_factory “encloses” the function f, and f remembers the value of y it was created with. Importantly, calling add_function_factory will create a new function f each time, so we can generate functions with different behavior:
    add_three = add_function_factory(3)
    append_exclamation = add_function_factory("!")
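Calling the generated functions shows that each one keeps its own y. This sketch repeats the factory so that it runs on its own:

```python
def add_function_factory(y):
    def f(x):
        return x + y  # y is remembered from the enclosing call
    return f


add_three = add_function_factory(3)
append_exclamation = add_function_factory("!")

seven = add_three(4)
greeting = append_exclamation("hello")
```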
However, if you use closures to mutate variables, you might run into an issue. This works:
    def closure():
        x = "hello"
        def inner():
            print(x)
        return inner

    closure()()
But this doesn’t:
    def closure():
        x = 0
        def inner():
            print(x)
            x += 1
        return inner

    closure()()
The second chunk of code raises an UnboundLocalError before print is even called! Because inner assigns to x, Python treats x as a local variable of inner throughout its body, so print(x) tries to read a local that hasn’t been assigned yet. To fix this, you need to declare x as a nonlocal variable:
    def closure():
        x = 0
        def inner():
            nonlocal x
            print(x)
            x += 1
        return inner

    f = closure()
Now, calling f() repeatedly prints 0, 1, 2, and so on rather than raising an error.
Mutable Default Arguments
This is a Python quirk everyone encounters sooner or later. Suppose we had a function with an argument that had a mutable default value, like this one:
    def func(x = []):
        x.append(0)
        print(x)
Most Python newcomers expect that if we run this function three times, it’ll simply print out [0] each time. However, we actually get [0], [0, 0], [0, 0, 0] instead. This is because Python retains the list between calls; x is only initialized to [] when the def statement is evaluated, and every subsequent call to func that uses the default value of x will use that same list and retain any mutations done to it between calls. If you’re familiar with pointers, you’re familiar with this kind of behavior.
This behavior can be advantageous, but most people encounter it for the first time when they just wanted to initialize an argument to a fresh default value. This is a circumstance where a sentinel value is probably the right choice:
    sentinel = object()

    def func(x = sentinel):
        if x is sentinel:
            x = []
        ...
This will set x to [] if and only if no value was specified for x, which is exactly what we wanted. (The only problem is that we have a sentinel hanging around in our namespace, but this basically never matters and is easily fixed with modules.)
Decorators
Decorators are a common way to modify the behavior of a function by wrapping it in another function. There are often cases where we want to wrap many different functions with the same functionality: logging, caching, mocking, and so on. We can create a decorator like so:
    def decorator(func):
        ...
        def wrapper(*args, **kwargs):
            # behavior before decorated function runs
            result = func(*args, **kwargs) # function being decorated
            # behavior after decorated function runs
            return result
        return wrapper

    @decorator
    def myfunction(arg1, arg2):
        ...
This is the basic way to create a decorator that takes no arguments. The decorator takes in the function func
as an argument and decorates it with the wrapper
, which can do something before func
runs, runs func
and saves its result (don’t forget that part!), does something after func
runs, and finally returns the value of func
(or whatever you want the decorated func
to return). When def myfunction
runs, myfunction
will be automatically wrapped; from then on, any call to myfunction
will execute the wrapper wrapper
, with myfunction
taking the place of func
. Any arguments passed to myfunction
will be passed into wrapper
; because we don’t know in advance what kinds of arguments decorated functions will take, it’s a good idea to always use *args, **kwargs
for the wrapper and pass that in to the decorated func
.
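As a concrete instance of this pattern, here is a minimal sketch of a decorator that counts calls. The names count_calls and add are invented for the example. functools.wraps is optional, but it’s the usual way to keep the decorated function’s name and docstring intact.

```python
import functools

def count_calls(func):
    @functools.wraps(func)  # copy func's name and docstring onto the wrapper
    def wrapper(*args, **kwargs):
        wrapper.calls += 1               # behavior before the decorated function runs
        result = func(*args, **kwargs)   # the function being decorated
        return result
    wrapper.calls = 0
    return wrapper

@count_calls
def add(a, b):
    """Add two numbers."""
    return a + b

print(add(1, 2))       # 3
print(add(a=3, b=4))   # 7
print(add.calls)       # 2
print(add.__name__)    # add
```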
What if we need our decorator to take arguments? In that case, we have to create another level of closures and write a decorator factory:
def decorator_factory(arguments):
    ...  # do something with arguments
    def decorator(func):
        ...  # do something with arguments, and the function
        def wrapper(*args, **kwargs):
            # behavior before decorated function runs
            result = func(*args, **kwargs)  # function being decorated
            # behavior after decorated function runs
            return result
        return wrapper
    return decorator
This does essentially the same thing a decorator does, except it allows us additional space to specify how decorator
will construct the wrapper
. For example, if I’m writing a custom logging decorator that needs to take in a logfile
argument:
import os

def log(logfile):
    # make sure the directory that will hold the log file exists
    os.makedirs(os.path.dirname(logfile) or ".", exist_ok=True)
    def decorator(func):
        def wrapper(*args, **kwargs):
            with open(logfile, "a") as f:
                f.write(f"args={args!r} kwargs={kwargs!r}\n")  # write args and kwargs to logfile
                result = func(*args, **kwargs)
                f.write(f"result={result!r}\n")  # write result and any other info to logfile
                return result
        return wrapper
    return decorator
We then decorate a function like so:
1
2
3
@log("path/to/log/file")
def myfunction(arg1, arg2, arg3):
...
It’s also important to see that you can apply multiple decorators:
1
2
3
4
@decorator2
@decorator1
def func():
...
decorator1
will be applied first, and then decorator2
will wrap the wrapper that decorator1
applies to func
.
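A quick way to convince yourself of the ordering is to use two decorators whose effects are visible in the output. This is just a sketch; bold, italic, and greet are made-up names.

```python
def bold(func):
    def wrapper():
        return "<b>" + func() + "</b>"
    return wrapper

def italic(func):
    def wrapper():
        return "<i>" + func() + "</i>"
    return wrapper

@bold      # applied second, so it ends up outermost
@italic    # applied first, so it sits closest to the function
def greet():
    return "hello"

print(greet())   # <b><i>hello</i></b>
```

Stacking the decorators this way behaves like writing greet = bold(italic(greet)) by hand.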
It’s important to note that this is distinct from the standard Gang of Four decorator pattern in object oriented programming. Python’s decorators are more akin to static attribute-oriented programming, whereas the standard decorator pattern is applied dynamically at runtime. (Also, if the @
seems reminiscent of Javadoc, it’s supposed to be!)
Memoization and Caching
There are often situations where you have a pure function that is expensive to call. This is often the case in dynamic programming problems, where the naive solution will have exponential time complexity, but by recognizing that repeated recursive calls are made with the same arguments we can optimize the function to run in polynomial time by caching results.
The standard solution is just caching the function’s results, like so:
cache = dict()

def function(arguments):
    if arguments not in cache:
        solution = ...  # compute solution
        cache[arguments] = solution
    return cache[arguments]
It’s a bit tedious to put this code everywhere, and we often need to hide the cache
in an object field or a closure or something. Thankfully, Python has a simple, thread-safe @cache
decorator that does this for us:
from functools import cache

@cache
def function(arguments):
    solution = ...  # compute solution
    return solution
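For a concrete case, here is a sketch using the classic recursive Fibonacci function, which is my own choice of example. It assumes Python 3.9 or later, where functools.cache is available. cache_info() is provided by the decorator and reports how many calls were actually computed.

```python
from functools import cache

@cache
def fib(n):
    # Naive recursion is exponential; with caching each n is computed once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))            # 12586269025
info = fib.cache_info()
print(info.misses)        # 51: one real computation for each n from 0 to 50
```

Note that the arguments have to be hashable, since they’re used as dictionary keys behind the scenes.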
Type Hints
Python has a very strange relationship to typing. Its dynamic typing is one of its greatest strengths or greatest weaknesses depending on who you ask, and is a strange transition for people coming from languages like Rust, Java, C#, or C++.
For clarity and third party tooling, you can annotate functions and variables with the typing library. For example, if I have functions like
def list_and_reverse(s):
    return list(reversed(s))

def line(a, b=0):
    return lambda x: a * x + b
I could annotate them like so:
from typing import Callable

def list_and_reverse(s: str) -> list[str]:
    return list(reversed(s))

def line(a: float, b: float = 0) -> Callable[[float], float]:
    return lambda x: a * x + b
This indicates that list_and_reverse
is a function taking a string and returning a list of strings, and that line
is a function taking a required float
, an optional float
, and returns a function (more broadly, a Callable
) that takes a float
and returns a float
. Functions that don’t return a value can be annotated with a return type of None
.
You can also do this for variables and class fields. For example, I may create a variable that I intend to be a dictionary from strings to either ints or None
, and might want to document that at declaration, even if I don’t have keys or values for the dictionary yet. I could do this by writing
from typing import Optional

myvalues: dict[str, Optional[int]] = dict()
There are a handful of important types worth noting:
- Any represents data that can be anything, and is the most useful type. It’s also the type you use when you’re feeling lazy and want your type checker to stop yelling at you.
- Union represents data that can be one of a handful of other types. Union[str, int] represents data that can be either a string or integer.
- Optional is a special case of Union that is a union of exactly one type with None. Optional[int] equals Union[int, None].
It’s worth mentioning that just importing a class is enough to use it as a type annotation; for example, from collections import Counter
lets you use Counter
as an annotation. You can also create type aliases, like vector = list[float]
. You could then annotate a function like so:
def dot(v1: vector, v2: vector) -> float:
    return sum(a * b for a, b in zip(v1, v2, strict=True))
It’s important to mention that type annotations, except in a handful of cases like dataclasses, don’t change what your program does: it’s possible for you to completely ignore the type annotations, and Python won’t care. Without any kind of third party tools, it’s basically just a fancy docstring. However, third party tools like mypy will care, and will require your program to agree with the static analysis it performs. This can be really good for hunting bugs and ensuring style compliance.
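Here is a tiny sketch of what “Python won’t care” looks like in practice. The function double is a made-up example; __annotations__ is the standard attribute where the hints are stored.

```python
def double(x: int) -> int:
    return x * 2

# The annotation says int, but nothing checks it at runtime.
print(double("ab"))            # abab
print(double.__annotations__)  # {'x': <class 'int'>, 'return': <class 'int'>}
```

A static checker such as mypy would flag the double("ab") call, but the interpreter runs it without complaint.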
Classes
vars
Python essentially treats objects like dictionaries, where the field names are the keys, methods are just first class functions as values, and classes can be thought of roughly as dictionary factories. For an object x
, vars(x)
returns the __dict__ attribute of x
(which can also be accessed with x.__dict__
), which is essentially a way to directly interact with x
as a dictionary. (There are a couple objects that don’t have __dict__
attributes, usually __slots__
objects.) You can also just call vars()
without an argument, which returns the local variables, equivalent to calling locals()
. (There’s a related function dir
that does something similar, but its behavior is a bit complicated and it’s best to just read the documentation.) All of this is touching on Python’s object model, which is a topic also best read through the documentation.
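A short sketch of the “objects are dictionaries” idea, using a throwaway Point class of my own:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(vars(p))                 # {'x': 1, 'y': 2}
print(vars(p) is p.__dict__)   # True

vars(p)["z"] = 3               # writing to the dict creates an attribute
print(p.z)                     # 3
```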
Monkey Patching
There are many cases where we might want to dynamically alter the behavior of a class at runtime: maybe we want to fit one class to be used with a different interface, maybe we need a function to behave differently, or maybe we want a class to keep track of additional data. For whatever reason, the adapter pattern, inheritance, or composition might be undesirable or infeasible. Because Python’s classes are mutable, we can dynamically add or replace fields and methods like so:
import types

class MyClass:
    def say_something(self):
        print("Hello")

instance = MyClass()

def say_goodbye(self):
    self.things_said += 1
    print("Goodbye")

instance.things_said = 0
# bind the function to this instance so that it receives self when called
instance.say_something = types.MethodType(say_goodbye, instance)
Monkey patching is one of the forbidden techniques of Python programming, and leads to situations where two instances of the same class do different things or have different fields while still having the same type. For this reason, it’s important to use it in limited situations and take time to contrast it with the aforementioned alternatives.
hasattr, getattr, setattr
If you’ve been monkey patching objects, you might wonder whether a given object x
has a particular field foo
defined. One option might be to just check if x.foo
evaluates to something in a try/except
block, but there’s an easier way: the hasattr
function takes an object x
and a string and determines if x
has a field with a name matching that string. hasattr(x, "foo")
will be True
if x.foo
exists and False
otherwise.
Note that the field name, "foo"
, is a string. There are other cases where we might have the name of a field as a string, and want to actually access that field to get or set its value. In this case, we can use getattr
to get the field of x
with the associated name: getattr(x, "foo")
is equivalent to x.foo
. Similarly, we can use setattr
to set the value of a field: setattr(x, "foo", bar)
is equivalent to x.foo = bar
.
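Putting the three together in a small sketch (Config and its fields are invented for the example). One extra detail worth knowing: getattr accepts an optional third argument that is returned when the field doesn’t exist.

```python
class Config:
    pass

cfg = Config()
setattr(cfg, "debug", True)          # same as cfg.debug = True

print(hasattr(cfg, "debug"))         # True
print(getattr(cfg, "debug"))         # True
print(hasattr(cfg, "verbose"))       # False
print(getattr(cfg, "verbose", 0))    # 0: the optional third argument is a fallback
```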
Static Methods with @staticmethod
The @staticmethod
decorator changes a method to no longer require a self
argument, so it can be called without an instance of the class.
def say_goodbye():
    print("Goodbye")

class MyClass:
    @staticmethod
    def say_hello():
        print("Hello!")

    say_goodbye = staticmethod(say_goodbye)
Here, we’re using staticmethod
both as a decorator to declare say_hello
as static, and as a regular function to turn the existing say_goodbye
function into a static method. We can then call MyClass.say_hello()
and MyClass.say_goodbye()
.
Multiple Constructors and @classmethod
At a glance, it appears that Python doesn’t support multiple constructors because every class can have only one __init__
method. Without any other options, the only workarounds would be a lot of keyword arguments (which gets messy fast if we require specific combinations of arguments to be passed) or the builder pattern. (The builder pattern is actually my favorite design pattern, but it’s a lot of code and Python doesn’t let us make the constructor private to enforce using the builder.)
Fortunately, we have a solution with the @classmethod
decorator, which essentially creates a static method that is aware of the class that it’s attached to. Among other things, this is useful for creating multiple constructors for the same class. For example, suppose I’m creating a goblin NPC for a video game:
import random

class Goblin:
    def __init__(self, level, health, attack):
        self.level = level
        self.health = health
        self.attack = attack

    @classmethod
    def random_constructor(cls):
        level = random.randint(1, 100)
        return cls(level, level * 5, level)
Here, we’re using @classmethod
to create a second constructor, random_constructor
, for the Goblin
class. Here, cls
will be the class that the method is called on, so when we call Goblin.random_constructor(), cls(level, level * 5, level)
is equivalent to Goblin(level, level * 5, level)
. Unfortunately, we still have to call the same base constructor __init__
.
An immediate question this raises is: why use this instead of a static method? We just as easily could’ve written
@staticmethod
def random_constructor():
    level = random.randint(1, 100)
    return Goblin(level, level * 5, level)
The reason is that this will still create a Goblin
, even if we call it from a subclass: it doesn’t respect inheritance. We’re also out of luck if we want to define a class method on an abstract base class. If I were to create a subclass like this:
class GoblinWarrior(Goblin):
    ...

goblin_warrior = GoblinWarrior.random_constructor()
Then we just get back a Goblin
, rather than a GoblinWarrior
. Using @classmethod
avoids this, because the cls
parameter is determined by which class the method is being called through: when we call Goblin.random_constructor()
then cls
is Goblin
, and when we call GoblinWarrior.random_constructor()
then cls
is GoblinWarrior
. By calling cls
like a function, we can choose our constructor dynamically. We can also call static methods (including other class methods) dynamically using cls.method
!
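To make the difference concrete, here is a runnable sketch that puts both versions side by side on the same class. The static version is renamed random_goblin (a name I made up) so the two can coexist.

```python
import random

class Goblin:
    def __init__(self, level, health, attack):
        self.level = level
        self.health = health
        self.attack = attack

    @classmethod
    def random_constructor(cls):
        level = random.randint(1, 100)
        return cls(level, level * 5, level)      # builds whichever class we were called on

    @staticmethod
    def random_goblin():
        level = random.randint(1, 100)
        return Goblin(level, level * 5, level)   # always builds a plain Goblin

class GoblinWarrior(Goblin):
    pass

print(type(GoblinWarrior.random_constructor()).__name__)   # GoblinWarrior
print(type(GoblinWarrior.random_goblin()).__name__)        # Goblin
```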
Dataclasses
Dataclasses are a very neat feature that allows for automatically generating common methods for classes whose primary purpose is bundling relevant fields or data together. For example, we may have a class like
class GameItem:
    name = "..."
    description = "..."
    gold_value = ...
    weight = ...
    ...
Where name and description are strings, gold_value is an int, and weight is a float, there may be other fields, and so on. The point is that this class represents a handful of data bundled together, and there’s a whole bunch of basic functionality we’d like to implement: an __init__
method that just takes all of these fields as parameters and assigns them accordingly, an __eq__
method so we can determine when two instances are equal with ==
, a __repr__
method to create a string representation of the object, and so on.
These tend to be boilerplate functions which are somewhat tedious to write. To save time, Python has a @dataclass
decorator:
from dataclasses import dataclass

@dataclass
class GameItem:
    name: str
    description: str
    gold_value: int
    weight: float
These several lines of code are equivalent to the entirety of the first definition, function definitions and all. Roughly 20 lines of code (or more) have been compressed into only 8. Dataclasses come with a whole handful of other parameters to handle boilerplate like ordering and freezing, or you can simply override and add additional functionality to the dataclass yourself.
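Here is a sketch of the generated methods in action, with the frozen and order parameters turned on. The item values are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class GameItem:
    name: str
    description: str
    gold_value: int
    weight: float

sword = GameItem("Sword", "A sharp blade", 50, 3.5)
same_sword = GameItem("Sword", "A sharp blade", 50, 3.5)
apple = GameItem("Apple", "A crisp snack", 1, 0.2)

print(sword == same_sword)   # True: generated __eq__ compares field by field
print(apple < sword)         # True: order=True compares fields as a tuple
print(apple)
# GameItem(name='Apple', description='A crisp snack', gold_value=1, weight=0.2)
```

With frozen=True, trying to assign to a field such as sword.weight raises an error, and instances become hashable.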
Abstract Base Classes
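If you’ve taken a class in object-oriented programming, you’ll know there are often times where we want to abstract one or more classes’ behavior into a single abstract class to prevent code duplication and allow polymorphism. The problem is that we often don’t want an abstract class to be instantiable. Although Python doesn’t enforce other typical guards like public and private, it does allow abstract base classes. Have your class inherit from abc.ABC
and mark at least one method with @abstractmethod
, like so:
from abc import ABC, abstractmethod

class MyAbstractClass(ABC):
    @abstractmethod
    def do_something(self):  # placeholder name for illustration
        ...
MyAbstractClass
won’t be instantiable, and neither will any subclass that doesn’t override every abstract method. (A class that inherits from ABC but declares no abstract methods can still be instantiated.)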
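A runnable sketch of how this plays out, using hypothetical Shape and Square classes:

```python
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        ...

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):          # overriding the abstract method makes Square concrete
        return self.side ** 2

abstract_error = None
try:
    Shape()
except TypeError as error:
    abstract_error = error
    print("cannot instantiate:", type(error).__name__)

print(Square(3).area())      # 9
```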
Enum Classes
Enums in Python aren’t a builtin language construct, but they’re available through the standard library. Making an enum is as simple as
from enum import Enum

class Direction(Enum):
    North = 0
    East = 1
    South = 2
    West = 3
Direction
will then behave a bit like a collection: if I set x = Direction.North
, the expression x in Direction
will be True
. You can also check Direction.North.name
to get "North"
, and Direction.North.value
to get 0
.
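A few more things you can do with the same Direction enum, as a quick sketch:

```python
from enum import Enum

class Direction(Enum):
    North = 0
    East = 1
    South = 2
    West = 3

print(Direction(2))                      # Direction.South: look up a member by value
print(Direction["West"])                 # Direction.West: look up a member by name
print([d.name for d in Direction])       # ['North', 'East', 'South', 'West']
print(Direction.North is Direction(0))   # True: members are singletons
```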
Mocking with @patch and MagicMock
In software development, we often want to test code in isolation without interacting with other components that that code might be coupled with. For example, the function that I’m testing might call another function that triggers a system call, API call, side effect, or expensive computation that we don’t want to occur during our tests. We also might want to verify that our function behaves correctly when another function returns a weird value, raises an exception, or otherwise misbehaves.
The typical solution is mocking the functions or objects that our code depends on. Fortunately, Python has an excellent builtin library for mocking. Suppose we want to test a function like this:
def foo():
    result = expensive_function(val1, val2, arg=val3)
    ...
We can mock this call like so:
import unittest
from unittest.mock import patch

class Test_foo(unittest.TestCase):
    # patch needs a dotted path to the name being replaced; "mymodule" is a
    # placeholder for whichever module defines foo
    @patch('mymodule.expensive_function')
    def test_makes_correct_calls(self, expensive_function_mock):
        expensive_function_mock.return_value = ...  # whatever we need the expensive_function to return
        foo()
        expensive_function_mock.assert_called_with(val1, val2, arg=val3)
This is a standard testing setup until we see the @patch
decorator. This decorator takes a string naming the function or object to replace, written as a dotted import path such as package.module.name
, and swaps it for a mock while the test runs. The mock is passed in to test_makes_correct_calls
as an extra argument, which we’ve chosen to call expensive_function_mock
.
This expensive_function_mock
is a MagicMock object, and they are absolutely magic. Because expensive_function
is callable (it’s a function), we set a return_value
to be whatever we choose. We then run foo
and assert that by the end of foo
’s run, our expensive_function_mock
was called with the arguments we expected, which we type in exactly as we expected them to be passed to the expensive_function
: (val1, val2, arg = val3)
. If that assertion holds, our test passes; if it doesn’t, our test will fail, exactly as it would if a regular unittest.TestCase
assertion failed. Magic!
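If you want something you can paste into a file and run, here is a self-contained sketch. It swaps the author’s expensive_function for time.sleep (my substitution, so that there’s a real dotted path to patch) and uses patch as a context manager rather than a decorator. slow_greeting is a made-up function.

```python
import time
from unittest.mock import patch

def slow_greeting(name):
    time.sleep(5)          # stands in for an expensive call
    return f"Hello, {name}"

# "time.sleep" is the dotted path: module "time", attribute "sleep".
with patch("time.sleep") as sleep_mock:
    greeting = slow_greeting("world")    # returns immediately; sleep is mocked

sleep_mock.assert_called_once_with(5)    # raises AssertionError if the call differs
print(greeting)                          # Hello, world
```

Once the with block exits, time.sleep is restored to the real function.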
Let’s look at a more complicated function to test:
def save_content(url, username, password):
    api_login(username, password)  # API call
    content = download(url)  # downloading stuff
    if validate_result(content):  # could be an expensive function call!
        write_file(content)  # side effect
    else:
        handle_problem()
We might not want to make API calls that potentially eat into a rate limit or spam a service, eat up bandwidth downloading stuff, call an expensive function that makes our test run longer and take up more resources, or write new files that we then have to clean up. Even if we do want to do one or more of these things, we probably don’t want to do them all at once. We also might want to force validate_result
to fail so we can see what happens in the else
case, but it might be prohibitively hard to get that result, and we might encounter issues if we’re calling handle_problem()
in such a contrived scenario. So, let’s mock these calls in our test:
import unittest
from unittest.mock import patch

class Test_save_content(unittest.TestCase):
    # "mymodule" is a placeholder for the module that defines save_content
    @patch('mymodule.handle_problem')
    @patch('mymodule.write_file')
    @patch('mymodule.validate_result')
    @patch('mymodule.download')
    @patch('mymodule.api_login')
    def test_file_writes_on_success(
        self,
        api_login_mock,
        download_mock,
        validate_result_mock,
        write_file_mock,
        handle_problem_mock
    ):
        mock_content = object()
        download_mock.return_value = mock_content
        validate_result_mock.return_value = True

        save_content("mock url", "mock username", "mock password")

        api_login_mock.assert_called_with("mock username", "mock password")
        download_mock.assert_called_with("mock url")
        validate_result_mock.assert_called_with(mock_content)
        write_file_mock.assert_called_with(mock_content)
        handle_problem_mock.assert_not_called()
Note that the patching and function arguments are in reverse order: remember that the bottom decorator gets evaluated first, and the top decorator gets evaluated last. In this test, we set the download_mock.return_value
to be a sentinel that we check was passed into various functions later, and we’re also asserting that handle_problem_mock
was not called at any point during the test.
Finally, if we’re testing something that requires dependency injection like this:
def foo(dependency):
    dependency.method(data)
    dependency.field += 1
    dependency["key"] = 200
    ...
We can just create our own MagicMock
and pass that in directly:
import unittest
from unittest.mock import MagicMock

class Test_foo(unittest.TestCase):
    def test_foo_dependency_injection(self):
        dependency_mock = MagicMock()
        dependency_mock.field = 0

        foo(dependency_mock)

        dependency_mock.method.assert_called_with(data)
        self.assertEqual(dependency_mock.field, 1)
        dependency_mock.__setitem__.assert_called_with("key", 200)
Note that MagicMock dynamically creates things like MagicMock.method
when they’re first accessed; the fields and methods of a MagicMock
are created dynamically, and are also MagicMocks. You can configure MagicMocks
very flexibly: there are many different flavors of MagicMock
, and MagicMock
has default methods (like MagicMock.__int__
) that you can set yourself. For a full survey of the features, I highly recommend you read the documentation directly.
Other Useful Libraries
An essential part of learning any language is familiarizing oneself with the standard library. Python is no exception, and its standard library is particularly rich and robust. Now that most of the essential syntax and language features are out of the way, we can cover some of the more interesting features.
Builtins
Besides the data types I’ve already spoken about, I’d like to mention frozenset
, which is an immutable counterpart to set
that can be hashed. Regular set
s can’t be hashed, so to use a set as a dictionary key or as a member of another set, you need a frozenset
.
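A small sketch of the difference (the dictionary contents are invented for the example):

```python
# A frozenset works as a dictionary key.
edge_labels = {frozenset({"a", "b"}): "edge a-b"}
label = edge_labels[frozenset({"b", "a"})]   # element order is irrelevant
print(label)                                 # edge a-b

# A regular set is unhashable, so the same thing fails.
error_name = None
try:
    bad = {{1, 2}: "value"}
except TypeError as error:
    error_name = type(error).__name__
print(error_name)                            # TypeError
```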
There are also a handful of useful functions:
- all(iterable), which returns True if all items in iterable are truthy and returns False otherwise
- any(iterable), which returns True if any item in iterable is truthy and returns False otherwise
- callable(object) returns True if object is callable and returns False otherwise
- dir(object) returns a list of the names of object’s attributes as strings (names only; use getattr to fetch the associated values)
- getattr(object, str), setattr(object, str, value), and delattr(object, str) get, set, and delete the field of object with name equal to the specified string.
As well as standard functions like abs
, len
, round
, reversed
, sorted
, sum
, and zip
. (I hope you have already seen all of those.)
The collections Module
The collections
module in the standard library is also excellent and has a lot of valuable functions and classes that are just barely too niche to be included by default. Here are some of the highlights:
- Counter is a dictionary that counts how often a value occurs; you can think of it like a Dict[Any, int]. You can pass an iterable to its constructor and it’ll work out how often each element appears. For example, Counter("hello") produces Counter({"l": 2, "h": 1, "e": 1, "o": 1}). By default, indexing something not in the counter just returns 0. You can also add or subtract counters, or intersect and union them with & and |. (So they work a little bit like vectors.)
- defaultdict is a dictionary where elements that aren’t in the dictionary can be defined to have a default value. You pass a default_factory function into the defaultdict’s constructor; when the defaultdict is indexed with a key that isn’t in the dictionary, default_factory() is called with no arguments, and its result is stored under that key and returned.
- OrderedDict is a dictionary where the key-value pairs have an ordering. (Regular dicts have also preserved insertion order since Python 3.7; OrderedDict adds order-aware extras such as move_to_end.)
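Here is a brief sketch of Counter and defaultdict in use. The words and letters are arbitrary sample data.

```python
from collections import Counter, defaultdict

letters = Counter("hello")
print(letters["l"])                      # 2
print(letters["z"])                      # 0: missing keys count as zero
print(letters - Counter("help"))         # Counter({'l': 1, 'o': 1})

groups = defaultdict(list)               # list() is called with no arguments
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)         # no need to check whether the key exists
print(dict(groups))                      # {'a': ['apple', 'avocado'], 'b': ['banana']}
```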
The itertools Module
Finally, itertools
is a great and fairly large standard library module with functions for manipulating iterables. They’re indispensable for hard interview questions and help with combinatorics-related problems. Several of the highlights are:
- accumulate returns the sequence of partial sums of the given sequence
- batched splits the iterable into batches of length n, where only the last batch is allowed to have length less than n (Python 3.12+)
- chain chains all the given iterables in order into a single iterable
- combinations returns all the subsequences of the given iterable with length exactly r
- combinations_with_replacement returns all the subsequences of the given iterable with length exactly r, and allows the same element to be repeated in a subsequence
- cycle cycles the given iterable so it repeats after it’s exhausted
- pairwise returns all pairs of consecutive elements in the iterable
- permutations returns all possible permutations of length r from the specified iterable
- product returns the cartesian product of its argument iterables
- repeat repeats the given argument n times (or forever, if n is omitted)
- zip_longest zips together all the iterables and returns a new iterable with length equal to the length of the longest iterable. All the shorter iterables are padded with fillvalue
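A quick tour of several of these on small sample inputs. I’ve left out batched and pairwise so the sketch also runs on older Python versions.

```python
from itertools import accumulate, chain, combinations, permutations, product, zip_longest

print(list(accumulate([1, 2, 3, 4])))          # [1, 3, 6, 10]
print(list(chain("ab", [1, 2])))               # ['a', 'b', 1, 2]
print(list(combinations("abc", 2)))            # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(len(list(permutations("abc", 2))))       # 6
print(list(product([0, 1], "xy")))             # [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]
print(list(zip_longest("abc", [1], fillvalue=None)))
# [('a', 1), ('b', None), ('c', None)]
```

All of these return lazy iterators, which is why each call is wrapped in list() before printing.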
Fun fact: I once got an interview question that gave me two strings s1 and s2, where s2 was s1 after being scrambled and having one character replaced at random with a new character not already in s1; my challenge was to find out which character had been replaced. My interviewer had never seen Counters before, and I used them to get the answer in one tenth of the usual time. (The follow-up question asked me to get the answer in constant space. The solution is a neat one-liner!)
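For the curious, here is one way the Counter approach might look. This is my own reconstruction under the problem statement above, not necessarily the exact code from the interview, and it is not the constant-space answer. The function name and sample strings are made up.

```python
from collections import Counter

def replaced_character(s1, s2):
    # Subtracting counters keeps only positive counts, so the single
    # character that s1 has more of than s2 is the one that was replaced.
    return next(iter(Counter(s1) - Counter(s2)))

print(replaced_character("listen", "sxlent"))   # i
```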
Other Standard Library Modules
I’ve already discussed a lot of the highlights in Python’s standard library, and I think it’s worth recapping the big libraries for completeness’s sake and because there are some niche but powerful libraries included.
- argparse for parsing command line arguments
- asyncio for asynchronous I/O
- csv for working with csv files
- datetime for working with date and time data
- email and smtplib for working with email
- graphlib for working with graphs from discrete mathematics
- json for reading and writing JSON data
- os for methods for file system paths, accessing environment variables, ids, etc.
- random for pseudorandom number generation
- re for regular expressions
- shutil for higher-level file operations
- socket for low-level networking
- tarfile and zipfile for working with archives
- threading for making and managing threads and concurrent execution
- time for working with time, sleeping, etc.
Third Party Libraries and Tooling
Every Python project I maintain uses pre-commit to ensure that all code that goes into the repo satisfies style and quality guidelines. The hooks can be a bit annoying at first, but I swear by them. I also highly recommend investigating poetry as an alternative package manager to pip, because pip sometimes has issues managing Python installations. I have a template repository on GitHub that incorporates both these tools.
Major third party packages to know about are:
- requests is the most common library for doing http requests, although I am partial to httpx, which is designed to have the same functionality and design as requests, but also supports making requests asynchronously.
- beautifulsoup is great for HTML parsing and webscraping
- pygame for developing games
- numpy, pandas, and scipy are the standard suite for doing mathematical computations
- scikit-learn, Keras, TensorFlow, and PyTorch are probably the most-used machine learning libraries in the world
- nltk for natural language processing
- matplotlib and seaborn for data visualization
- FastAPI for designing website APIs and getting them off the ground quick (which has mostly replaced Flask)
- Django for doing heavy-duty fullstack website development and database management
- pillow for image manipulation
Performance
Python’s performance is a deep rabbithole, so I’ll be brief: if you want performance, look elsewhere. Obviously Python won’t be blazingly fast because it’s an interpreted rather than a compiled language, but exactly what incurs performance costs will probably be surprising and unintuitive. For example, basic things like storing magic numbers in variables may have significant performance costs. If you’d like a deep dive on why Python’s wall-clock performance is so bad, Jake Vanderplas made an excellent writeup here. (The TL;DR is that it mostly comes down to dynamic typing resulting in overhead on basic operations and objects being scattered across different locations in memory.)
If you absolutely must use Python, a handful of the aforementioned libraries like numpy
and its derivatives use C APIs to sidestep Python’s inefficient object model, and each library has its own tricks to squeeze more performance that way (here’s sklearn’s), but the APIs aren’t always flexible enough to offload everything to C. If you’re absolutely married to Python and need a performant library, you should check out Python’s C APIs.
Advanced Topics and Further Reading
Python topics I’d consider advanced are mostly inner workings of the language that occasionally leak out: global variables, C APIs, disassembling, implementation details, __future__
, and many more obscure topics. This will all differ based on your background experience and needs, and tends to be fairly niche.
If you’re still hungry for more, I’d recommend looking at Peter Norvig’s pytudes repository, which is a collection of studies in writing better Python.