If you use Python for your data science projects, it’s very likely that pandas is one of the most used libraries. For me, as a biomedical scientist, I use pandas almost every day to process and analyze data. From time to time, I find that I reinvented the wheel because I didn’t know the existence of related functions that could do the job easily. In this article, I’d like to share some functions that many people didn’t know when they started to use pandas.
1. nsmallest and nlargest
When we deal with numeric values, we often need to find the extreme values of specific columns for data quality check. We can certainly sort the values and select the top rows, as shown below.https://towardsdatascience.com/media/782bd194507cfd69f0a9950ee65d5317Reinvented Wheel to Get Extreme Records
However, using the
sort_values to get the extreme records is reinventing the wheel, because pandas has implemented functions with better performance to achieve the same results. The following code shows you such usage. It’s much cleaner, right?
>>> df.nlargest(3, "Score")
1 Jennifer 100
3 Joshua 99
0 John 95
>>> df.nsmallest(3, "Score")
4 Jack 88
6 Jason 89
5 Jeremy 92
2. idxmin and idxmax
One time, I needed to find out the largest records by group. For instance, suppose that we have the following DataFrame. I want to extract the records whose scores are the highest in their respective groups.https://towardsdatascience.com/media/63dc8db97528b2f7b035616433c99671Example DataFrame
My first reaction was using the
groupby function with the
max function, as below. Although it tells me the highest scores for each group, it doesn’t include other columns (e.g., name) that I need. Maybe I could’ve used these max values to filter the original records. My gut feeling was it was too complicated. There must be something else for this.
Name: score, dtype: int64
After I did some research, I was finally able to find the solution that I was looking for — using the
idxmax function. This function is to find out the index of the largest value of the specified column, with which we can retrieve the entire records using the
loc property. The following code snippet shows you such usage. Isn’t it much more neat?
>>> record_indices = df.groupby("group")["score"].idxmax()
Name: score, dtype: int64
group name score
0 1 A1 97
4 2 B2 95
6 3 A3 100
In a similar manner, if you’re interested in finding out the records with the smallest values, there is a counterpart for the
idxmin, which is specialized in retrieving smallest values’ indices.
When we deal with continuous variables, it’s sometimes necessary to create ordinal values based on the quantiles of the existing numeric values. Theoretically, some people, including me initially, may tend to find out the quantiles of a certain column and create the categorical column by hard-coding the cutoffs.
However, pandas has an easy option to achieve this operation — using the
qcut function, which create distinct values based on their rankings. Using the DataFrame that we defined in the previous section, we can dichotomize the data based on the
Score column, as shown below.
>>> df["score_dichotomized"] = pd.qcut(df["score"], 2, labels=['bottom50', 'top50'])
group name score score_dichotomized
0 1 A1 97 top50
1 1 B1 93 bottom50
2 1 C1 92 bottom50
3 2 A2 94 top50
4 2 B2 95 top50
5 2 C2 93 bottom50
6 3 A3 100 top50
7 3 B3 92 bottom50
8 3 C3 93 bottom50
When I dealt with date data, I just used the basic operations. For instance, if I needed to create a series of dates, I would just use the start date and create additional dates by using the
timedelta function from the built-in
Suppose that I need to create the dates for the first week in 2021, as the index of a DataFrame. Here’s a possible solution.
>>> import datetime
>>> start_date = datetime.datetime(2021, 1, 1)
>>> first_week = [start_date + datetime.timedelta(days=x) for x in range(7)]
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'], dtype='datetime64[ns]', freq=None)
However, there is a better way — using the
date_range function. The following code snippet shows you such usage.
>>> pd.date_range(start='1/1/2021', periods=7)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'], dtype='datetime64[ns]', freq='D')
There are always great things to discover in pandas. When you find your current implementations are cumbersome, chances are that there are better solutions, possibly with just a simple function call, to solve your problems.