A Survey of Intermediate Python Features
I’ve been writing Python continuously since about 2015. In that time, I’ve stumbled on a lot of language features that confused me at first or that I wish I had learned about earlier; now that several friends of mine are learning Python, I figured I’d give a brief overview of some of the major language features that I consider to be the next steps after the basics. (The basics are the type of stuff you’d see in Automate the Boring Stuff: primitives, containers, control flow, functions, classes, I/O, etc.)
My goal here isn’t to provide an exhaustive discussion of any of these features, but rather to just point out that they exist and demonstrate one or two of the most common use cases and clear up common misconceptions. I won’t cover features that are common to most programming languages unless I think there’s something important or substantially different about how they’re used in Python.
if __name__ == "__main__":
This is a common stumbling block for new Python programmers when looking at someone else’s code. Many open source projects, in their main file, will have
if __name__ == "__main__":
    ...
at the bottom of their files. To break this line of code down, __name__ is a global string variable that indicates whether the file is being run as top-level (in which case __name__ is equal to "__main__"), or as an imported module (in which case __name__ is just the module’s name).
Usually, it’s standard to structure a .py file to have only imports, globals, and function definitions at the top level, and have this check at the very end of the file to determine whether a main() function (or anything besides definitions) should be executed. If you’re publishing any code publicly, it’s usually good to include this so it’s easier for other people to reuse your code without triggering a whole bunch of side effects (or running your entire program!) when they import your code.
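As a minimal sketch of that layout (greet and main are made-up names for illustration, not part of any particular project), a file might look like this:

```python
def greet(name):
    """Hypothetical helper that other modules may want to import."""
    return f"hello, {name}"


def main():
    # Only the "program" part has side effects such as printing.
    print(greet("world"))


# Runs when the file is executed directly, but not when it is imported.
if __name__ == "__main__":
    main()
```

Importing this file from elsewhere gives access to greet without printing anything.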
Sentinel Values
A sentinel value (also known as a flag value) is a unique value to which we attach some kind of special meaning. For example, in C, strings end with the null character \0. There are a couple reasons to dislike flag arguments (Martin Fowler criticizes them here, there’s also boolean blindness) and a couple workarounds, but if you need a flag or sentinel, you should know how to implement it.
Besides True and False, usually None is the first choice for a sentinel because it suffices most of the time. However, there are two reasons this might be a bad choice: first, you’re assigning extra meaning to None, which might be a code smell. Second, you might need to work over arbitrary values including None. The best way to create a sentinel is just:
sentinel_value = object()
object() creates a unique object that has its own hash and will never be equal to anything else (unless you monkey patch the __eq__ method, but that’s on you). You can create any number of sentinels this way and they’ll all be treated as different values.
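Here is a rough sketch of where this helps, assuming a lookup function where None is a perfectly valid stored value. The names _MISSING and get_setting are hypothetical:

```python
_MISSING = object()  # hypothetical module-level sentinel


def get_setting(settings, key, default=_MISSING):
    """Look up key; None may be a real stored value, so it can't mean 'absent'."""
    value = settings.get(key, _MISSING)
    if value is not _MISSING:  # compare sentinels with `is`, not ==
        return value
    if default is _MISSING:
        raise KeyError(key)
    return default


settings = {"timeout": None}
print(get_setting(settings, "timeout"))     # the stored None comes back as-is
print(get_setting(settings, "retries", 3))  # missing key falls back to the default
```

Because the sentinel is compared by identity, no caller can accidentally pass a value that collides with it.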
Strings
Formatting Strings
Python’s philosophy is that there should be one clear, unambiguous way to accomplish a goal. This is mostly true except for string formatting, for which there are no less than ten thousand ways to accomplish the same thing. The most modern way to do it is with f-strings:
i = 10
f"the value of i is {i}"
This evaluates to the string the value of i is 10. There are a handful of other techniques, like the str.format() method and regular string concatenation, but f-strings are generally the fastest and easiest to read.
It’s also worth mentioning that if you want to print the value of a variable quickly, you can print f"{x=}" (available since Python 3.8), which will print x= followed by the value of x. For example, if I have a list l = [1, 2, 3], then printing f"{l=}" prints l=[1, 2, 3].
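A few of these forms side by side, as a small sketch (the variable names are just for illustration, and the = form assumes Python 3.8 or later):

```python
i = 10
pi = 3.14159
l = [1, 2, 3]

message = f"the value of i is {i}"  # plain interpolation
rounded = f"{pi:.2f}"               # a format spec goes after the colon
debug = f"{l=}"                     # name, equals sign, then the repr of the value

print(message)
print(rounded)
print(debug)
```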
Concatenating Strings
It’s also important to avoid building up a string from many pieces with +. Writing s = s1 + s2 + s3 + ... + sn, or using + or += in a loop, can cause performance issues, because each addition creates a new temporary string object that has to be copied and later garbage collected. It’s better to call the .join method on the separator string:
string = ''.join(strings)
In this case, '' is the separator, so it’ll join all these strings with just the empty string in between. If you want to combine the strings with a space in between each, consider ' '.join(strings); if you want a newline, use '\n'.join(strings), etc.
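A quick sketch of the different separators. The lists here are invented for the example; note that join only accepts strings, so other values need converting first:

```python
words = ["spam", "eggs", "ham"]

no_separator = "".join(words)
with_spaces = " ".join(words)
with_newlines = "\n".join(words)

# join requires strings, so convert numbers on the way in
numbers = [1, 2, 3]
comma_separated = ",".join(str(n) for n in numbers)

print(no_separator, with_spaces, comma_separated)
```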
Booleans
Comparison Chaining
Often, we have one value x and want to ensure that it falls within a specific range: that is, we want a < x (or a <= x) and x < b (or x <= b). Python allows us to chain these comparisons by writing a < x < b (replacing < with <= as needed). This is essentially shorthand for a < x and x < b. It’s also important to note that if we write something like
a < f() < b
then f will only be called once, rather than twice.
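One way to convince yourself of this is to count the calls. This is only a sketch, and f here is a stand-in that records each time it runs:

```python
calls = []  # one entry is appended per call to f


def f():
    calls.append("called")
    return 5


a, b = 1, 10

chained = a < f() < b
calls_for_chained = len(calls)

calls.clear()
unchained = a < f() and f() < b
calls_for_unchained = len(calls)

print(chained, calls_for_chained)      # same result, one call
print(unchained, calls_for_unchained)  # same result, two calls
```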
However, comparison chaining introduces some weird edge cases. For example, the following two statements
(False == False) in [False]
False == (False in [False])
are both False, as we would expect. However, False == False in [False] is True, because it is a chained comparison that simplifies to (False == False) and (False in [False]). Both expressions on either side of and are True, so the whole thing is True. When we parenthesize the statement as we did in the two examples above, it ceases to be a chained expression and just evaluates to True in [False] and False == True respectively, which are both False.
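The three variants can be checked directly (the variable names are only labels for this sketch):

```python
chained = False == False in [False]          # (False == False) and (False in [False])
left_grouped = (False == False) in [False]   # True in [False]
right_grouped = False == (False in [False])  # False == True

print(chained, left_grouped, right_grouped)
```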
Truthy and Falsy
Truthy and falsy sound like words only children would ever use, so naturally computer scientists use them all the time. Truthiness is a property of an object that determines whether it evaluates to True or False when it’s (implicitly) cast to a boolean. For example, the empty string "", None, 0, and empty collections like lists [] and dictionaries {} will evaluate to False when cast to booleans. Falsy values tend to be identity elements: a + 0 = a for any number a, s + "" = s for any string s, and so on. (None is the odd one out here because it doesn’t have any operations defined on it, but it gives off a “falsy vibe”, so that’s what it is.) Non-empty strings and collections are truthy, as are nonzero numbers and so on.
Newcomers to Python often write things like:
if len(mylist) == 0:
    ...
However, if and while statements implicitly cast to booleans, and an empty list is falsy, so we can simplify this to
if not mylist:
    ...
No need to even call bool()!
There are three important notes about truthiness:
- A truthy value is not necessarily equal to True, and a falsy value is not necessarily equal to False: 2 == True is False, as is "" == False. (The exceptions are 1 == True and 0 == False, which are both True because bool is a subclass of int. Python might play fast and loose with its types, but at least it’s not JavaScript.) However, you can cast to a boolean explicitly: bool(2) == True and bool("") == False are both True.
- Many functions like any and all just check for truthiness instead of requiring values to be exactly True or False.
- Instances of new Python classes are truthy by default: if a class defines neither __bool__ nor __len__, Python treats its instances as true. By overriding the __bool__ method, you can set the truth value for a class you’ve written. For example, if I have a collection that implements len(), a reasonable __bool__ override might be the following (Python already falls back on __len__ when __bool__ is missing, so this just makes the behavior explicit):
def __bool__(self):
    return bool(len(self))
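To see these rules together, here is a small sketch. Basket and Plain are hypothetical classes invented for the example:

```python
class Basket:
    """Hypothetical container used only to illustrate __bool__."""

    def __init__(self, items=()):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __bool__(self):
        return bool(len(self))


class Plain:
    """Defines neither __bool__ nor __len__, so instances are always truthy."""


empty_is_truthy = bool(Basket())
full_is_truthy = bool(Basket(["apple"]))
plain_is_truthy = bool(Plain())

# any() and all() test truthiness rather than equality with True
some_truthy = any([0, "", "x"])
all_truthy = all([1, "x", [0]])

print(empty_is_truthy, full_is_truthy, plain_is_truthy, some_truthy, all_truthy)
```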
Consider Using in Instead of or
Suppose I have code like this:
if x == 1 or x == 3 or x == 4:
    ...
It’s usually neater to just write:
if x in (1, 3, 4):
    ...
Indexing
Python allows you to get slices when indexing. Suppose we have a list l = list(range(10)). Then
- l[start:] gets everything whose index i satisfies start <= i, so l[5:] evaluates to [5, 6, 7, 8, 9]. Note that the item at index start is included.
- l[:stop] gets everything whose index i satisfies i < stop, so l[:5] evaluates to [0, 1, 2, 3, 4]. Note that the item at index stop is excluded.
- l[start:stop] gets everything whose index i satisfies start <= i < stop, so l[2:6] evaluates to [2, 3, 4, 5].
- l[start:stop:step] gets everything whose index is one of start, start + step, start + 2 * step, and so on, stopping before it reaches stop. Importantly, you can also omit some of the entries here: l[::2] gets every 2nd element (so those with even indices), l[::3] gets every 3rd element, l[1::2] gets every 2nd element starting from l[1] (so those with odd indices), etc.
- Negative list indices, like l[-1], index from the right, so l[-1] gets the last element, l[-2] gets the item second from the last, and so on. You can combine this with slicing to get l[-5:-2], which gets the fifth-, fourth-, and third-to-last elements: [5, 6, 7].
- Finally, you can also assign to a slice. For example, the documentation points out that l[len(l):] = [x] is equivalent to l.append(x).
(If you think this is complicated, wait until you try numpy!)
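The slicing rules above can be tried out in a few lines. This sketch uses the same l = list(range(10)), and the other variable names are just labels:

```python
l = list(range(10))

from_five = l[5:]          # index 5 is included
up_to_five = l[:5]         # index 5 is excluded
middle = l[2:6]
even_indices = l[::2]
odd_indices = l[1::2]
every_third = l[::3]
last = l[-1]
near_the_end = l[-5:-2]

# Slice assignment: appending by assigning to the empty slice at the end
extended = list(l)
extended[len(extended):] = [10]

print(from_five, up_to_five, middle)
print(even_indices, odd_indices, every_third)
print(last, near_the_end, extended)
```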
Looping, Iterables, and Iterators
Any item that we can iterate over with a for loop in Python is called an iterable. When I write a for loop like
for i in range(10):
    ...
Python does a handful of things. It evaluates range(10) to get a range object back. Python then calls the built-in iter function on that range object (which invokes its __iter__ method), producing something called an iterator. An iterator is basically a stream of data that lets us loop over an iterable object once. The for loop then assigns i = iterator.__next__() and executes the code in the body of the for loop, then repeats this process over and over until iterator.__next__() raises a StopIteration exception to indicate that it’s done producing values, at which point the loop ends.
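Written out by hand, the steps a for loop performs look roughly like this. It is a sketch of the behavior rather than how the interpreter is literally implemented, and collected is just a list used to show the results:

```python
iterator = iter(range(3))  # the built-in iter() calls range(3).__iter__()
collected = []

while True:
    try:
        i = next(iterator)  # the built-in next() calls iterator.__next__()
    except StopIteration:
        break               # the iterator is exhausted, so the loop ends
    collected.append(i)     # stands in for the body of the for loop

print(collected)
```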
It’s worth mentioning that this is why
for element in mylist:
    element = ...
fails to mutate the values in mylist, but
for i, element in enumerate(mylist):
    mylist[i] = ...
succeeds.
We usually only deal with iterators when we’re designing them ourselves, but we deal with iterables all the time. Functions like range, zip, map, and filter return iterables, and collections like lists are also iterables.
Mutating a Collection with a Loop
This shows why the following code attempting to reassign the values of a list with a for loop fails to change the list:
for item in mylist:
    item = ...
Here, the loop variable item is just a name bound to the current element of the list, so assigning to item rebinds that name and leaves the list untouched. You can still access item’s methods and fields and mutate item internally, but changing what’s in the list requires operating on the list itself. Looping over the indices and assigning through the list accomplishes this:
for i in range(len(mylist)):
    mylist[i] = ...
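A short sketch of the difference, using made-up lists:

```python
mylist = [1, 2, 3]

for item in mylist:
    item = item * 10       # rebinds the name item; the list is untouched
after_rebinding = list(mylist)

for i in range(len(mylist)):
    mylist[i] = mylist[i] * 10  # assigns through the list itself

# Mutating an element in place is visible through the list, though
rows = [[1], [2]]
for row in rows:
    row.append(0)

print(after_rebinding, mylist, rows)
```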
Generators
Lists are more or less the standard way to store sequential data in Python; however, there are several problems with lists:
- They require us to know every element that has to be in the list up front
- They require us to store all the list data in memory
- They always hold a specific number of elements at any given time
Generators are an alternative to lists that don’t have any of these pitfalls. Generators are also iterables, but rather than storing a pre-computed collection of elements in memory like a list does, a generator generates each element “on demand” in sequence; this is called lazy evaluation.
Generators are particularly useful for computing things that are only needed on an “on-demand” basis and are usually only used in sequence, like the elements of range objects.
We can write our own generator like this:
def generator_function():
    i = 0
    while True:
        yield i
        i += 2

g = generator_function()
Here, generator_function is a function that produces a generator object, and g is a generator object produced by generator_function. From here, if we repeatedly call g.__next__() (usually spelled next(g)), it will produce 0, 2, 4, and so on ad infinitum. You can think of a generator function like a function where the return statement is replaced by yield. Every time we ask a generator for its next value, it runs until it hits the yield statement, returns that value, and then pauses until it’s asked again, at which point it runs from where it left off. In this case, that means the second time we called g.__next__(), the line i += 2 was the first one evaluated, and the next lines were all in the while loop. Generators can also pull from other collections, read from a file, or do pretty much anything a regular function can do; they’re very flexible.
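Here is a small sketch of pulling values out of an infinite generator. The name even_numbers is made up, and itertools.islice is one standard-library way to take a bounded number of items:

```python
from itertools import islice


def even_numbers():
    """Yield 0, 2, 4, ... forever."""
    i = 0
    while True:
        yield i
        i += 2


g = even_numbers()
first = next(g)                  # runs up to the first yield
second = next(g)                 # resumes at i += 2, then yields again
next_three = list(islice(g, 3))  # take three more without running forever

print(first, second, next_three)
```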
Generators can’t be indexed, because reaching a given element would require calling .__next__ several times, which consumes the earlier elements and triggers any side effects along the way. Being able to index elements that have already been computed would also require a mechanism for storing them, which might defeat the memory-saving purpose of having a generator in the first place. Of course, you can repeatedly pull elements out of a generator by calling its .__next__ method and saving the results in a list. You can also simply call the list function on a generator, but this will cause your program to hang if the generator never stops generating!
If you want the generator to stop generating, just return from the generator function:
def generator_function():
    i = 0
    while True:
        yield i
        i += 2
        if i > 10:
            return
When a generator function returns, Python raises the StopIteration exception on its behalf. (Raising StopIteration yourself inside a generator used to work, but since Python 3.7 it is converted into a RuntimeError.) This doesn’t mean that you need to wrap all your generators in try/except: Python already checks for StopIteration internally while looping, but it doesn’t check for it if you’re just calling the .__next__ method raw; in that case you’ll have to catch the exception yourself.
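To illustrate both sides of that, here is a sketch with a finite generator. The name evens_up_to is hypothetical: a loop or list() stops cleanly, while a raw next() call on an exhausted generator raises StopIteration.

```python
def evens_up_to(limit):
    """Yield even numbers from 0 through limit, then finish."""
    i = 0
    while i <= limit:
        yield i
        i += 2
    # Falling off the end (or an explicit return) ends the generator.


values = list(evens_up_to(10))  # list() handles StopIteration for us

g = evens_up_to(0)
first = next(g)
try:
    next(g)                     # nothing left to yield
    exhausted = False
except StopIteration:
    exhausted = True

print(values, first, exhausted)
```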
Because of the prevalence of iterables, most of Python’s standard library functions assume only that their inputs are iterables (rather than something with more functionality) and produce iterables as outputs. This is both more memory efficient and more flexible than just using lists for everything, which was common in Python 2. There are also more specific kinds of iterables, like sequences. (range objects are actually sequences, and that’s why you can index them.)
Enumerating
Frequently in Python, we want to loop over a collection of n-tuples. The most common case is the fantastic built-in enumerate function, which allows us to keep track of the index and associated element of the collection we’re looping over. If I have a list like mylist = ['a', 'b', 'c'], then the code
for i in enumerate(mylist):
    ...
makes i range over the tuples (0, "a"), (1, "b"), and (2, "c"). Similarly, I might have a dictionary
mydict = {
    "x": 3,
    "y": 2,
    "z": 1,
}
Looping over mydict.items() will produce a similar result: ("x", 3), ("y", 2), and ("z", 1). This is unfortunately kind of a pain if I want to refer to i[0] and i[1] individually rather than the tuple itself. I could write
for i in mytuples:
    first, second = i
    ...
but Python allows us to unpack in the loop itself:
for i, element in enumerate(mylist):
    ...

for key, value in mydict.items():
    ...
This can be extended to three, four, or more variables, depending on how many your iterator produces.
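A sketch of unpacking with two and three loop variables. The flags list and the result lists are invented for the example, and zip is used to produce 3-tuples:

```python
mylist = ["a", "b", "c"]
mydict = {"x": 3, "y": 2, "z": 1}
flags = [True, False, True]

indexed = []
for i, element in enumerate(mylist):
    indexed.append(f"{i}:{element}")

pairs = []
for key, value in mydict.items():
    pairs.append(f"{key}={value}")

triples = []
for element, value, flag in zip(mylist, mydict.values(), flags):
    triples.append((element, value, flag))

print(indexed, pairs, triples)
```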
Comprehensions
If you have experience with functional programming, you may be familiar with functions like map and filter. Map essentially takes a function and a list and applies that function to every element in the list. Filter takes a predicate and a list and removes all items from the list that don’t satisfy the predicate. These functions are very useful, but in Python are usually overshadowed by much more “Pythonic” expressions: list, set, and dictionary comprehensions and generator expressions.
map as a List Comprehension
To see where list comprehensions come in handy, let’s look at the naive way to create a list that just includes the first five even numbers:
evens = []
for i in range(5):
    evens.append(i * 2)
This produces the list [0, 2, 4, 6, 8]. However, there’s a more elegant way to do this:
evens = [i * 2 for i in range(5)]
This is much more compact and easy to read (once you get used to it). It’s meant to mimic mathematical set-builder notation, and is also an expression rather than a statement. We can also incorporate ternary expressions:
odd_or_even = ["even" if i % 2 == 0 else "odd" for i in range(5)]
This is essentially the preferred way to do map; the general form of a basic list comprehension is
mylist = [expression(x) for x in iterable]
You can also construct the iterable using a nested list comprehension, but that can get ugly quickly.
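For comparison, here is the same mapping written with map and as a comprehension, plus a small nested comprehension. This is only a sketch with made-up variable names:

```python
doubled_with_map = list(map(lambda i: i * 2, range(5)))
doubled_with_comprehension = [i * 2 for i in range(5)]

odd_or_even = ["even" if i % 2 == 0 else "odd" for i in range(5)]

# A nested comprehension: one inner list per row
grid = [[row * col for col in range(3)] for row in range(2)]

print(doubled_with_map, doubled_with_comprehension)
print(odd_or_even, grid)
```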
filter as a List Comprehension
Python has a similar alternative to filter, too:
mylist = [x for x in range(10) if x % 2 == 0]
This also generates the list [0, 2, 4, 6, 8]. Notice that when if x % 2 == 0 is false, nothing gets added to the list: there’s no need to specify some value for when the condition is False. These two kinds of list comprehensions can be combined to accomplish a map and a filter in a single expression!
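As a sketch, here is filter next to its comprehension form, followed by a combined map and filter (the variable names are only illustrative):

```python
evens_with_filter = list(filter(lambda x: x % 2 == 0, range(10)))
evens_with_comprehension = [x for x in range(10) if x % 2 == 0]

# Map and filter together: square only the even numbers
even_squares = [x * x for x in range(10) if x % 2 == 0]

print(evens_with_filter, evens_with_comprehension, even_squares)
```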
Dictionary and Set Comprehensions
Two other kinds of comprehensions I’d like to talk about are set comprehensions and dictionary comprehensions. As the names imply, these are the equivalent of list comprehensions for sets and dictionaries.
For sets, comprehensions work exactly like list comprehensions but use braces, with the added bonus that sets don’t allow duplicates:
myset = {expression(i) for i in iterable}
This can be an easy way to screen out duplicates.
For dictionaries, the syntax is slightly different:
mydict = {key_expression(i) : value_expression(i) for i in iterable}
Don’t forget the : or you might accidentally create a set comprehension instead of a dict comprehension!
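A small sketch of both forms, using an invented list of words:

```python
words = ["apple", "avocado", "banana", "apple"]

first_letters = {word[0] for word in words}    # set: duplicates collapse
lengths = {word: len(word) for word in words}  # dict: key : value

print(first_letters, lengths)
```

Note that the repeated "apple" shows up only once in each result.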
Generator Expressions
Generator expressions are the way I most often create generators. They’re essentially the same as a list or set comprehension, except they use ( ) instead of [ ] or { }. Importantly, unlike list, set, or dictionary comprehensions, a generator expression is evaluated lazily. It also directly produces a generator, rather than a generator function.
You might notice that there doesn’t appear to be a way to do tuple comprehensions, and that’s correct. Instead, you can just cast:
mytuple = tuple(expression(x) for x in iterable)
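A brief sketch of how a generator expression behaves (the names are just for illustration). When a generator expression is the only argument to a function, the extra parentheses can be dropped:

```python
squares = (x * x for x in range(5))  # nothing is computed yet

first = next(squares)                # computes only the first value
rest = list(squares)                 # consumes whatever is left

total = sum(x * x for x in range(5))       # no extra parentheses needed
as_tuple = tuple(x * x for x in range(5))

print(first, rest, total, as_tuple)
```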
Control Flow
with ... as ...
Often, we want to handle an object that needs some kind of “setup” and some kind of “teardown”. Common examples are files or other kinds of streams, where we want to open the file, read or write to it, and we have to close it at the end or else we’re squatting on important resources like memory and file locks, any changes we write might not be properly saved, etc. Usually, this means we need to remember to call close(), usually in the finally part of a try/finally statement. Doing this a lot results in a lot of boilerplate.
The with statement is the best practice for handling objects like this, and essentially replaces try/finally. Rather than writing
f = open("path/to/file")
try:
    ...
finally:
    f.close()
we can instead write
with open("path/to/file") as f:
    ...
You can also manage multiple objects at the same time:
with expression1 as thing1, expression2 as thing2:
    ...
We can do this with any context manager objects, meaning objects that have an __enter__ and an __exit__ method defined.
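As a minimal sketch of writing one yourself, here is a made-up Recorder class that logs when it is entered and exited. It shows that __exit__ runs even if the body raises:

```python
class Recorder:
    """Hypothetical context manager that logs its setup and teardown."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self                # this is what `as` binds to

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append("exit")
        return False               # False means: do not swallow exceptions


log = []
with Recorder(log) as recorder:
    log.append("body")

error_log = []
try:
    with Recorder(error_log):
        raise ValueError("boom")
except ValueError:
    error_log.append("caught")

print(log, error_log)
```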
match
Python provides the match statement (added in Python 3.10) for structural pattern matching, another common and powerful feature from functional programming. Despite the similarities to switch statements in languages with C-style syntax, Python’s match is not a switch statement, and treating the two the same can cause surprises: for example, a bare name in a case pattern captures the matched value into that variable instead of comparing against the variable’s current value.
The use cases for match are for control flow based on the structure and format of data, rather than just by value. This is one of the topics that would be hard to give a full tutorial of here, but I can direct you to a solid tutorial. I can also provide an example: if I were designing a simple command line shell, it might look something like
command = input()

match command.split():
    case ["quit"] | ["exit"]:
        ...
    case ["update"]:
        ...
    case ["run", program_name]:
        ...
    case ["upgrade", *package_names]:
        ...
    case _:
        ...
Let’s break down a couple of these cases:
- The ["quit"] | ["exit"] case will match both the values ["quit"] and ["exit"], with the pipe | functioning as an or.
- The ["update"] case will match exactly the value ["update"].
- The ["run", program_name] case will match a two-element list with "run" as the first element and anything as the second element; that second value will be treated as a variable named program_name in the body of that case.
- The ["upgrade", *package_names] case will match a list with "upgrade" as the first element and any number of additional elements, which will be bundled together in a list called package_names.
- The _ case is a wildcard that matches anything; case _: is to match what default: is to a switch.
You can also match objects by class and positional attributes, which is handy for stuff like abstract syntax trees (I assume this is how you would make tooling like parsers and linters with Python).
Ternary
There are many cases where a variable’s value is determined by a conditional, like
