Table of Contents
We write a program to solve a problem or make a tool that we can repeatedly solve a similar problem. For the latter, it is inevitable that we come back to revisit the program we wrote, or someone else is reusing the program we write. There is also a chance that we will encounter data that we didn’t foresee at the time we wrote our program. After all, we still want our program to work. There are some techniques and mentalities we can use in writing our program to make our code more robust.
After finishing this tutorial, you will learn
- How to prepare your code for the unexpected situation
- How to give an appropriate signal for situations that your code cannot handle
- What are the good practices to write a more robust program
Let’s get started!
Sanitation and Assertive Programming
When we write a function in Python, we usually take in some argument and return some value. After all, this is what a function supposed to be. As Python is a duck-typing language, it is easy to see a function accepting numbers to be called with strings. For example:
This code works perfectly fine, as the
+ operator in Python strings means concatenation. Hence there is no syntax error; it’s just not what we intended to do with the function.
This should not be a big deal, but if the function is lengthy, we shouldn’t learn there is something wrong only at a later stage. For example, our program failed and terminated because of a mistake like this only after spending hours in training a machine learning model and wasting hours of our time waiting. It would be better if we could proactively verify what we assumed. It is also a good practice to help us communicate to other people who read our code what we expect in the code.
One common thing a fairly long code would do is to sanitize the input. For example, we may rewrite our function above as the following:
Or, better, convert the input into a floating point whenever it is possible:
The key here is to do some “sanitization” at the beginning of a function, so subsequently, we can assume the input is in a certain format. Not only do we have better confidence that our code works as expected, but it may also allow our main algorithm to be simpler because we ruled out some situations by sanitizing. To illustrate this idea, we can see how we can reimplement the built-in
This is a simplified version of
range() that we can get from Python’s built-in library. But with the two
if statements at the beginning of the function, we know there are always values for variables
c. Then, the
while loop can be written as such. Otherwise, we have to consider three different cases that we call
range(2,10,3), which will make our
while loop more complicated and error-prone.
Another reason to sanitize the input is for canonicalization. This means we should make the input in a standardized format. For example, a URL should start with “http://,” and a file path should always be a full absolute path like
/etc/passwd instead of something like
/tmp/../etc/././passwd. Canonicalized input is easier to check for conformation (e.g., we know
/etc/passwd contains sensitive system data, but we’re not so sure about
You may wonder if it is necessary to make our code lengthier by adding these sanitations. Certainly, that is a balance you need to decide on. Usually, we do not do this on every function to save our effort as well as not to compromise the computation efficiency. We do this only where it can go wrong, namely, on the interface functions that we expose as API for other users or on the main function where we take the input from a user’s command line.
However, we want to point out that the following is a wrong but common way to do sanitation:
assert statement in Python will raise the
AssertError exception (with the optional message provided) if the first argument is not
True. While there is not much practical difference between raising
AssertError and raising
ValueError on unexpected input, using
assert is not recommended because we can “optimize out” our code by running with the
-O option to the Python command, i.e.,
assert in the code
script.py will be ignored in this case. Therefore, if our intention is to stop the code from execution (including you want to catch the exception at a higher level), you should use
if and explicitly raise an exception rather than use
The correct way of using
assert is to help us debug while developing our code. For example,
While we develop this function, we are not sure our algorithm is correct. There are many things to check, but here we want to be sure that if we extracted every even-indexed item from the input, it should be at least half the length of the input array. When we try to optimize the algorithm or polish the code, this condition must not be invalidated. We keep the
assert statement at strategic locations to make sure we didn’t break our code after modifications. You may consider this as a different way of unit testing. But usually, we call it unit testing when we check our functions’ input and output conformant to what we expect. Using
assert this way is to check the steps inside a function.
If we write a complex algorithm, it is helpful to add
assert to check for loop invariants, namely, the conditions that a loop should uphold. Consider the following code of binary search in a sorted array:
def binary_search(array, target):
“””Binary search on array for target
array: sorted array
target: the element to search for
index n on the array such that array[n]==target
if the target not found, return -1
s,e = 0, len(array)
while s < e:
m = (s+e)//2
if array[m] == target:
elif array[m] > target:
e = m
elif array[m] < target:
s = m+1
assert m != (s+e)//2, "we didn't move our midpoint"
assert statement is to uphold our loop invariants. This is to make sure we didn’t make a mistake on the logic to update the start cursor
s and end cursor
e such that the midpoint
m wouldn’t update in the next iteration. If we replaced
s = m+1 with
s = m in the last
elif branch and used the function on certain targets that do not exist in the array, the assert statement will warn us about this bug. That’s why this technique can help us write better code.
Guard Rails and Offensive Programming
It is amazing to see Python comes with a
NotImplementedError exception built-in. This is useful for what we call offensive programming.
While the input sanitation is to help align the input to a format that our code expects, sometimes it is not easy to sanitize everything or is inconvenient for our future development. One example is the following, in which we define a registering decorator and some functions:
NotImplementedError with a custom error message in our function
activate(). Running this code will print you the result for the first two calls but fail on the third one as we haven’t defined the
tanh function yet:
As you can imagine, we can raise
NotImplementedError in places where the condition is not entirely invalid, but it’s just that we are not ready to handle those cases yet. This is useful when we gradually develop our program, which we implement one case at a time and address some corner cases later. Having these guard rails in place will guarantee our half-baked code is never used in the way it is not supposed to. It is also a good practice to make our code harder to be misused, i.e., not to let variables go out of our intended range without notice.
In fact, the exception handling system in Python is mature, and we should use it. When you never expect the input to be negative, raise a
ValueError with an appropriate message. Similarly, when something unexpected happens, e.g., a temporary file you created disappeared at the midway point, raise a
RuntimeError. Your code won’t work in these cases anyway, and raising an appropriate exception can help future reuse. From the performance perspective, you will also find that raising exceptions is faster than using if-statements to check. That’s why in Python, we prefer “it’s easier to ask for forgiveness than permission” (EAFP) over “look before you leap” (LBYL).
The principle here is that you should never let the anomaly proceed silently as your algorithm will not behave correctly and sometimes have dangerous effects (e.g., deleting wrong files or creating cybersecurity issues).
Good Practices to Avoid Bugs
It is impossible to say that a piece of code we wrote has no bugs. It is as good as we tested it, but we don’t know what we don’t know. There are always potential ways to break the code unexpectedly. However, there are some practices that can promote good code with fewer bugs.
First is the use of the functional paradigm. While we know Python has constructs that allow us to write an algorithm in functional syntax, the principle behind functional programming is to make no side effect on function calls. We never mutate something, and we don’t use variables declared outside of the function. The “no side effect” principle is powerful in avoiding a lot of bugs since we can never mistakenly change something.
When we write in Python, there are some common surprises that we find mutated a data structure unintentionally. Consider the following:
It is trivial to see what this function does. However, when we call this function without any argument, the default is used and returned us
. When we call it again, a different default is used and returned us
[1,1]. It is because the list
 we created at the function declaration as the default value for argument
a is an initiated object. When we append a value to it, this object is mutated. The next time we call the function will see the mutated object.
Unless we explicitly want to do this (e.g., an in-place sort algorithm), we should not use the function arguments as variables but should use them as read-only. And in case it is appropriate, we should make a copy of it. For example,
This code intended to keep a log of what we did in the list
LOGS, but it did not. While we work on the names “Alice,” “Bob,” and then “Charlie,” the three records in
LOGS will all be “Charlie” because we keep the mutable dictionary object there. It should be corrected as follows:
Then we will see the three distinct names in the log. In summary, we should be careful if the argument to our function is a mutable object.
The other technique to avoid bugs is not to reinvent the wheel. In Python, we have a lot of nice containers and optimized operations. You should never try to create a stack data structure yourself since a list supports
pop(). Your implementation would not be any faster. Similarly, if you need a queue, we have
deque in the
collections module from the standard library. Python doesn’t come with a balanced search tree or linked list. But the dictionary is highly optimized, and we should consider using the dictionary whenever possible. The same attitude applies to functions too. We have a JSON library, and we shouldn’t write our own. If we need some numerical algorithms, check if you can get one from NumPy.
Another way to avoid bugs is to use better logic. An algorithm with a lot of loops and branches would be hard to follow and may even confuse ourselves. It would be easier to spot errors if we could make our code clearer. For example, making a function that checks if the upper triangular part of a matrix contains any negative would be like this:
But we also use a Python generator to break this into two functions:
We wrote a few more lines of code, but we kept each function focused on one topic. If the function is more complicated, separating the nested loop into generators may help us make the code more maintainable.
Let’s consider another example: We want to write a function to check if an input string looks like a valid floating point or integer. We require the string to be “
0.12” and not accept “
.12“. We need integers to be like “
12” but not “
12.“. We also do not accept scientific notations like “
1.2e-1” or thousand separators like “
1,234.56“. To make things simpler, we also do not consider signs such as “
+1.23” or “
We can write a function to scan the string from the first character to the last and remember what we saw so far. Then check whether what we saw matched our expectation. The code is as follows:
isfloat() above is messy with a lot of nested branches inside the for-loop. Even after the for-loop, the logic is not entirely clear for how we determine the Boolean value. Indeed we can use a different way to write our code to make it less error-prone, such as using a state machine model:
Visually, we implement the diagram below into code. We maintain a state variable until we finish scanning the input string. The state will decide to accept a character in the input and move to another state or reject the character and terminate. This function returns True only if it stops at the acceptable states, namely, “integer” or “decimal.” This code is easier to understand and more structured.
In fact, the better approach is to use a regular expression to match the input string, namely,
However, a regular expression matcher is also running a state machine under the hood.
There is way more to explore on this topic. For example, how we can better segregate responsibilities of functions and objects to make our code more maintainable and easier to understand. Sometimes, using a different data structure can let us write simpler code, which helps make our code more robust. It is not a science, but almost always, bugs can be avoided if the code is simpler.
Finally, consider adopting a coding style for your project. Having a consistent way to write code is the first step in offloading some of your mental burdens later when you read what you have written. This also makes you spot mistakes easier.
In this tutorial, you have seen the high-level techniques to make your code better. It can be better prepared for a different situation, so it works more rigidly. It can also be easier to read, maintain, and extend, so it is fit for reuse in the future. Some techniques mentioned here are generic to other programming languages as well.
Specifically, you learned:
- Why we would like to sanitize our input, and how it can help make our program simpler
- The correct way of using
assertas a tool to help development
- How to use Python exceptions appropriately to give signals in unexpected situations
- The pitfall in Python programming in handling mutable objects