The built-in Python collections library is a treasure trove of useful tools. I will focus on the two structures that I find the most useful: Counter and defaultdict. Understanding these data structures will help you make your code more concise, readable, and easy to debug.
A Counter object takes an iterable and counts how many times each unique item appears in it. The results are stored in a dictionary-like structure where the unique items are keys and the counts are values. For example, the following code takes a list of words and returns the counts of each word:
from collections import Counter

text = "apple banana orange apple apple orange"
counts = Counter(text.split())
print(counts.most_common())
# [('apple', 3), ('orange', 2), ('banana', 1)]
print(counts['apple'])
# 3
print(counts['pear'])
# 0
You can retrieve values from a Counter in the same way as you would with a normal Python dictionary. Note that Counter has the very nice property that if you query it for a key that doesn’t exist (like ‘pear’ above), it returns 0 rather than giving you a KeyError.
Another very useful feature of Counter objects is that they can be merged with a simple + operator. This makes combining counts of items from different locations/files a breeze.
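As a minimal sketch of that merge, here are two small counters added together. The variable names and the sample strings are made up for illustration.

```python
from collections import Counter

# Word counts from two hypothetical sources (e.g. two files)
counts_a = Counter("apple banana apple".split())
counts_b = Counter("banana orange".split())

# Adding two Counters sums the counts key by key
merged = counts_a + counts_b

print(merged["apple"])   # 2
print(merged["banana"])  # 2
print(merged["orange"])  # 1
```

Keys that appear in only one of the counters are carried over with their original count.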
This saves a lot of time and lines of code. I use Counter a lot in text-processing/NLP tasks, and it definitely makes my life easier. Here are a few final tips and tricks for working with Counter:
- use dict() to convert a counter into a plain Python dictionary.
- use the most_common() function with no arguments to return a list of (item, count) tuples, sorted in descending order by count.
- count characters in a string using Counter. This works because a string is an iterable in Python.
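The three tips above can be tried out with a short sketch like this one. It reuses the word list from the earlier example; the variable names are only illustrative.

```python
from collections import Counter

counts = Counter("apple banana orange apple apple orange".split())

# Tip 1: convert to a plain dictionary (keys keep first-seen order)
as_dict = dict(counts)
print(as_dict)  # {'apple': 3, 'banana': 1, 'orange': 2}

# Tip 2: most_common() with no arguments returns every (item, count) pair
print(counts.most_common())  # [('apple', 3), ('orange', 2), ('banana', 1)]

# Tip 3: a string is an iterable of characters, so Counter counts characters
char_counts = Counter("banana")
print(char_counts["a"])  # 3
```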
A defaultdict is a great alternative to the basic dictionary data structure when you don’t want to worry about KeyErrors and special cases. You simply create a defaultdict with a default of your choice, and the data structure will automatically assign the default value to any previously unseen keys. The important thing to understand is that the argument to the defaultdict constructor should be a callable. This includes the following:
- list: default is an empty list
- int: default is 0
- lambda expressions: very flexible, can make anything callable
- set: default is an empty set
This is a very useful data structure, because it removes the need to check if an item exists in a dictionary before incrementing/modifying its value.
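To make the list of callables concrete, here is a small sketch that uses each of them as a default factory. The word list and variable names are invented for the example.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "apple"]

counts = defaultdict(int)             # unseen keys start at 0
by_letter = defaultdict(list)         # unseen keys start as []
unique_by_letter = defaultdict(set)   # unseen keys start as set()
labels = defaultdict(lambda: "unknown")  # any zero-argument callable works

for w in words:
    # No "if key in dict" checks are needed before modifying a value
    counts[w] += 1
    by_letter[w[0]].append(w)
    unique_by_letter[w[0]].add(w)

print(counts["apple"])        # 2
print(by_letter["a"])         # ['apple', 'avocado', 'apple']
print(unique_by_letter["b"])  # {'banana'}
print(labels["missing"])      # unknown
```

One thing to keep in mind: simply reading a missing key, as in the last line, also inserts that key with its default value.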
Let’s see a practical example of when we could use a defaultdict to write really elegant, concise code. The code below uses defaultdict to implement the training loop of a bigram language model from scratch in 5 lines of code! For more context on n-gram language models, you can check out a previous post I wrote here, in which I did NOT use defaultdict. Notice the amount of code we save by using defaultdict.
The trick is the nested use of defaultdict on Line 6 above. A language model is trained to learn the probabilities of words in context. We want to have a nested data structure where the outer-layer key specifies the context (i.e. the previous word in the case of a bigram model) and the inner-layer key specifies the current word. We want to be able to ask questions like: “In the training data, how many times was the word ‘the’ followed by the word ‘cat’?”
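Here is a minimal sketch of what such a nested structure could look like in a bigram counter. It is an illustration under my own assumptions rather than a definitive implementation: the class name BigramModel, the train method, and the toy sentences are hypothetical, and tokenization is just a whitespace split.

```python
from collections import defaultdict


class BigramModel:
    def __init__(self):
        self.d = defaultdict(lambda: defaultdict(int))  # context -> word -> count

    def train(self, sentences):
        # Count how often each word follows each context (previous) word
        for sentence in sentences:
            words = sentence.split()
            for prev, curr in zip(words, words[1:]):
                self.d[prev][curr] += 1


model = BigramModel()
model.train(["the cat sat", "the cat ran", "the dog sat"])

# How many times was "the" followed by "cat" in the training data?
print(model.d["the"]["cat"])  # 2
print(model.d["the"]["dog"])  # 1
```

Neither level needs an existence check: the outer defaultdict creates the inner one on first access, and the inner one starts every count at 0.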
Note that the inner defaultdict is actually just doing the exact same thing as a Counter, so we can replace the above line of code with the following line and have the same result:
self.d = defaultdict(Counter)
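As a quick check of that equivalence, the sketch below fills both variants from the same made-up list of (previous word, current word) pairs and compares the counts. The variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical bigram pairs: (previous word, current word)
pairs = [("the", "cat"), ("the", "cat"), ("the", "dog")]

nested = defaultdict(lambda: defaultdict(int))
with_counter = defaultdict(Counter)

for prev, curr in pairs:
    nested[prev][curr] += 1
    with_counter[prev][curr] += 1

print(nested["the"]["cat"], with_counter["the"]["cat"])  # 2 2

# The Counter version also gives us most_common() for free
print(with_counter["the"].most_common(1))  # [('cat', 2)]
```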
Thank you for reading this far! I hope you try using the Counter and defaultdict structures in your next Python project. Let me know in the comments if you have any other good use cases for them! If you found this interesting, check out my other Python-related articles: