If you use Python for your data science projects, pandas is very likely one of your most-used libraries. As a biomedical scientist, I use pandas almost every day to process and analyze data. From time to time, I find that I have reinvented the wheel because I didn't know about existing functions that could do the job easily. In this article, I'd like to share some functions that many people don't know about when they start using pandas.

1. nsmallest and nlargest

When we deal with numeric values, we often need to find the extreme values of specific columns for data quality checks. We can certainly sort the values and select the top rows.
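The embedded snippet showing this sort-based approach didn't survive; here is a sketch of what it might have looked like, reconstructed from the outputs shown in this section. Note that the row at index 2 ("Josie", 94) is a hypothetical filler, since the top-3 and bottom-3 outputs don't reveal it.

```python
import pandas as pd

# Reconstructed example data; index 2 ("Josie", 94) is a hypothetical filler row
df = pd.DataFrame({
    "Student": ["John", "Jennifer", "Josie", "Joshua", "Jack", "Jeremy", "Jason"],
    "Score": [95, 100, 94, 99, 88, 92, 89],
})

# The "reinvented wheel": sort the column, then take the top rows
top3 = df.sort_values("Score", ascending=False).head(3)
bottom3 = df.sort_values("Score").head(3)
```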

However, using sort_values to get the extreme records is reinventing the wheel, because pandas provides dedicated functions with better performance that achieve the same result. The following code shows such usage. It's much cleaner, right?

>>> df.nlargest(3, "Score")
    Student  Score
1  Jennifer    100
3    Joshua     99
0      John     95
>>> df.nsmallest(3, "Score")
  Student  Score
4    Jack     88
6   Jason     89
5  Jeremy     92

2. idxmin and idxmax

One time, I needed to find the largest records by group. For instance, suppose that we have the following DataFrame, and I want to extract the records whose scores are the highest in their respective groups.
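The embedded snippet defining this example DataFrame is missing; the following sketch reconstructs it from the outputs shown in this section and the next.

```python
import pandas as pd

# Reconstruction of the example DataFrame; values are taken from the
# groupby/idxmax and qcut outputs shown later in the article
df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "name": ["A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3"],
    "score": [97, 93, 92, 94, 95, 93, 100, 92, 93],
})
```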

My first reaction was to use the groupby function together with max, as below. Although it tells me the highest score for each group, it doesn't include the other columns (e.g., name) that I need. I could have used these max values to filter the original records, but my gut feeling was that this was too complicated. There had to be something else.

>>> df.groupby("group")["score"].max()
group
1     97
2     95
3    100
Name: score, dtype: int64

After some research, I finally found the solution I was looking for: the idxmax function. This function finds the index of the largest value in the specified column, with which we can retrieve the entire record using the loc property. The following code snippet shows such usage. Isn't it much neater?

>>> record_indices = df.groupby("group")["score"].idxmax()
>>> record_indices
group
1    0
2    4
3    6
Name: score, dtype: int64
>>> df.loc[record_indices]
   group name  score
0      1   A1     97
4      2   B2     95
6      3   A3    100

In a similar manner, if you're interested in the records with the smallest values, idxmax has a counterpart, idxmin, which retrieves the indices of the smallest values.
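The article doesn't show idxmin in action, so here is a quick sketch using the same example data:

```python
import pandas as pd

# Same example DataFrame as in this section
df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "name": ["A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3"],
    "score": [97, 93, 92, 94, 95, 93, 100, 92, 93],
})

# idxmin mirrors idxmax: it returns the index of the smallest score per group,
# which loc then uses to pull out the full records
lowest = df.loc[df.groupby("group")["score"].idxmin()]
```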

3. qcut

When we deal with continuous variables, it's sometimes necessary to create ordinal values based on the quantiles of the existing numeric values. Some people, including me initially, may be tempted to compute the quantiles of a column and create the categorical column by hard-coding the cutoffs.
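That hard-coded approach might look something like the following sketch, which computes the median by hand and feeds it to pd.cut as an explicit cutoff:

```python
import pandas as pd

df = pd.DataFrame({"score": [97, 93, 92, 94, 95, 93, 100, 92, 93]})

# The manual approach: compute the quantile, then hard-code it as a bin edge
median = df["score"].quantile(0.5)
df["score_dichotomized"] = pd.cut(
    df["score"],
    bins=[df["score"].min() - 1, median, df["score"].max()],
    labels=["bottom50", "top50"],
)
```

It works, but it takes several steps and gets more tedious as the number of quantile bins grows.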

However, pandas has an easy option for this operation: the qcut function, which creates discrete categories based on value rankings. Using the DataFrame that we defined in the previous section, we can dichotomize the data based on the score column, as shown below.

>>> df["score_dichotomized"] = pd.qcut(df["score"], 2, labels=["bottom50", "top50"])
>>> df
   group name  score score_dichotomized
0      1   A1     97              top50
1      1   B1     93           bottom50
2      1   C1     92           bottom50
3      2   A2     94              top50
4      2   B2     95              top50
5      2   C2     93           bottom50
6      3   A3    100              top50
7      3   B3     92           bottom50
8      3   C3     93           bottom50

4. date_range

When I dealt with dates, I used to rely on basic operations only. For instance, if I needed to create a series of dates, I would take the start date and generate the additional dates with the timedelta function from the built-in datetime module.

Suppose that I need to create the dates for the first week in 2021, as the index of a DataFrame. Here’s a possible solution.

>>> import datetime
>>> start_date = datetime.datetime(2021, 1, 1)
>>> first_week = [start_date + datetime.timedelta(days=x) for x in range(7)]
>>> pd.DatetimeIndex(first_week)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'], dtype='datetime64[ns]', freq=None)

However, there is a better way: the date_range function. The following code snippet shows such usage.

>>> pd.date_range(start='1/1/2021', periods=7)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'], dtype='datetime64[ns]', freq='D')
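Beyond a start date and a count of periods, date_range also accepts an end date and a frequency string, which the timedelta approach can't match without extra logic. A quick sketch:

```python
import pandas as pd

# Business days (Mon-Fri) between two dates
bdays = pd.date_range(start="2021-01-01", end="2021-01-10", freq="B")

# Weekly dates (anchored on Sunday by default)
weekly = pd.date_range(start="2021-01-03", periods=3, freq="W")
```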

Conclusions

There are always great things to discover in pandas. When you find your current implementation cumbersome, chances are that there is a better solution, possibly just a simple function call away.
