Introduction

Out with the old, and in with the new. More specifically, CatBoost [2] may be replacing XGBoost for many data scientists and ML engineers moving forward. Not only is it a great algorithm for data science competitions, it also offers real benefits in professional, production settings for a variety of reasons.

Oftentimes, complex machine learning algorithms can take what seems like forever to train, and then lack critical plotting tools that help explain the features as well as the model training itself. Perhaps the biggest benefit of CatBoost is right there in its name, which we will expound upon below. With that being said, let’s take a deeper dive into three of the main benefits of CatBoost.

Categorical Features are More Powerful

Photo by Mikhail Vasilyev on Unsplash [3].

One of the most frustrating parts of data science is, well, the data. That data can come in a variety of forms, but perhaps the one that causes the most problems is the categorical feature type. More specifically, this type can show up as a string, object, or categorical dtype.

Current space/problems

The current state for most, if not all, other machine learning algorithms is that they ingest categorical features via one-hot encoding. This transformation means you will end up with many more columns, up to hundreds or even thousands, each valued at either 0 or 1. Of course, this method is sometimes useful, like when there are just two categories, but when you have features like IDs that can take thousands or more unique values, the resulting sparse dataframe can simply make your model take too long to train, especially if it runs frequently in a production environment (a short sketch follows the list below).

Here is the current space highlighted:

  • Too many one-hot-encoded features/sparse
  • Too many columns
  • Training slows
  • Features are less powerful in general
  • Matching training/testing/inference data in production can be difficult
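To make the column explosion concrete, here is a minimal sketch, using pandas and a hypothetical high-cardinality `user_id` column, of what one-hot encoding does to the width of a dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one high-cardinality categorical feature (an ID)
df = pd.DataFrame({
    "user_id": [f"id_{i}" for i in np.random.randint(0, 5000, size=10_000)],
    "country": np.random.choice(["US", "DE", "IN"], size=10_000),
})

# One-hot encoding explodes 2 columns into thousands of mostly-zero columns
encoded = pd.get_dummies(df, columns=["user_id", "country"])
print(df.shape)       # (10000, 2)
print(encoded.shape)  # roughly (10000, ~5000) -- wide, sparse, and slow to train on
```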

CatBoost space/benefits

CatBoost uses ordered target encoding, which essentially allows you to keep the feature/column in its original state, making it easier to collaborate with ML engineers and software engineers. You will not have to worry about matching one-hot encodings across several features, and you can interpret the features as they were intended. Not only that, but this encoding also yields more meaningful feature importances.

Here is the CatBoost space highlighted:

  • No one-hot-encodings/sparse dataframe
  • Keeps original format of dataframe, making collaboration easier as well
  • Training is faster
  • Categorical features are more important
  • Model is more accurate
  • You can now work with features that you could not use before, like IDs or other categorical features with very high unique counts

Overall, by encoding your categorical features the way CatBoost does, the model tends to be more accurate, as tested and compared various times and outlined in more detail within their documentation. So, it is important to not just focus on numeric features, but also on the categorical features that have often been neglected by past machine learning algorithms.
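As a minimal sketch, with made-up column names and illustrative parameter values, this is roughly how categorical columns can be handed to CatBoost directly via the cat_features argument, with no one-hot encoding step:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Hypothetical training data: categorical columns stay as plain strings
train_df = pd.DataFrame({
    "user_id": ["id_1", "id_2", "id_3", "id_1", "id_2", "id_3"],
    "country": ["US", "DE", "IN", "US", "DE", "IN"],
    "age":     [34, 27, 45, 31, 29, 40],
})
labels = [1, 0, 1, 0, 1, 0]

# Point CatBoost at the categorical columns -- no one-hot encoding required
model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(train_df, labels, cat_features=["user_id", "country"])
```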

Integrated Plotting Features

Summary plot of CatBoost implementation of SHAP [4].

With newer libraries emerging, easier plotting techniques have become more prevalent, and CatBoost is no exception. There are a few plots worth highlighting from this library, including the training plot as well as the feature importance plot.

Training plot

This plot lets you see each iteration of your model just by setting plot=True in the fit method. You can see the train vs. test curves as well, which show the behavior that is expected over time as the metric is assessed. For example, if you were optimizing your model with MAE (mean absolute error), the plot would show the training and testing curves eventually flattening, stopping before overfitting, so you know the optimal number of iterations your model needs (a short sketch follows the list below).

  • View train vs test iterations
  • View output of every X iteration for more specific accuracy results
  • Transparency and granularity on training
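Here is a minimal sketch, with a tiny made-up dataset, of turning on the live training plot with plot=True and an eval_set; note that the interactive plot renders in a Jupyter notebook environment:

```python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical data: one categorical column, one numeric column
df = pd.DataFrame({
    "store": ["a", "b", "c", "a", "b", "c", "a", "b"],
    "units": [10, 25, 7, 12, 22, 9, 11, 24],
})
target = [100, 250, 70, 120, 220, 90, 110, 240]

X_train, y_train = df.iloc[:6], target[:6]
X_valid, y_valid = df.iloc[6:], target[6:]

model = CatBoostRegressor(iterations=300, loss_function="MAE", verbose=100)

# plot=True renders the live train vs. validation curves in a notebook
model.fit(
    X_train, y_train,
    cat_features=["store"],
    eval_set=(X_valid, y_valid),
    plot=True,
)
```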

SHAP Feature Importance

If you are already a data scientist, you will know that SHAP [5] is one of the best tools for assessing and visualizing feature importance. Say you want to see the top 10 features of your model, ranked; you can do that easily with this library feature. The main one I use is the summary plot, which shows all your features ranked, categorical and numeric alike. You can see each individual data point and whether it pushes the target value up or down. This plot is incredibly powerful when sharing results and feature analysis with stakeholders because it is very user-friendly and easy to understand (see the sketch after the list below).

  • Summary plot for showing top features of the model
  • The rank of each feature’s effect on the target value
  • Individual force_plot for a specific prediction (very good granularity when diving deeper into a single prediction, with each feature’s pull colored red or blue for a positive or negative effect)
  • Explains model features simply for stakeholder collaboration and presentation
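As a minimal sketch, reusing the fitted model and the hypothetical X_train/y_train split from the training-plot example above (and assuming the shap package is installed), CatBoost can compute SHAP values natively and hand them to shap for plotting:

```python
import shap
from catboost import Pool

# CatBoost computes SHAP values natively; the last column holds the expected value
train_pool = Pool(X_train, y_train, cat_features=["store"])
shap_values = model.get_feature_importance(train_pool, type="ShapValues")

expected_value = shap_values[0, -1]
shap_values = shap_values[:, :-1]

# Summary plot: every feature ranked, categorical and numeric alike
shap.summary_plot(shap_values, X_train)

# Force plot for a single prediction (initjs is needed for notebook rendering)
shap.initjs()
shap.force_plot(expected_value, shap_values[0, :], X_train.iloc[0, :])
```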

Overall, there are many more plots, of course, which are included in the official documentation, but these two might be the most common ones to use, because they are incredibly powerful and useful for your own understanding of the model, as well as for others without a data science background.

Efficient Processing

Photo by Kurt Cotoaga on Unsplash [6].

The last benefit of CatBoost is a product of the first. Because you do not have a sparse dataframe, training a model that uses plenty of categorical features is much faster than with other algorithms like XGBoost or Random Forest.

Here are some more detailed efficiency benefits in regard to CatBoost training and prediction (a short sketch follows the list):

  • Categorical feature processing allows for quicker training
  • Overfitting detector will stop model training when necessary automatically
  • Sensible default parameters work well out of the box, so you do not have to waste time tuning, which can honestly take hours or even weeks; in my experience, CatBoost’s defaults consistently hold up against manual tuning or grid and randomized grid search
  • GPU training
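As a minimal sketch (parameter values are illustrative, and it reuses the hypothetical train/validation split from the training-plot example above), the overfitting detector and GPU training are both simple options to switch on:

```python
from catboost import CatBoostRegressor

# task_type="GPU" trains on the GPU if one is available; switch to "CPU" otherwise
model = CatBoostRegressor(
    iterations=2000,
    task_type="GPU",
    verbose=200,
)

# early_stopping_rounds acts as the overfitting detector: training halts
# automatically once the validation metric stops improving
model.fit(
    X_train, y_train,
    cat_features=["store"],
    eval_set=(X_valid, y_valid),
    early_stopping_rounds=50,
)
```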

Newer, popular algorithms increasingly focus on speed and efficiency, and this is something CatBoost does incredibly well.

Summary

Data scientists have a lot to consider when choosing a machine learning algorithm. Every algorithm has its pros and cons, and each new release usually tips the balance further toward the pros; to be frank, CatBoost is mainly all pros.

Here are three top benefits of the CatBoost library:

  • Categorical Features are More Powerful
  • Integrated Plotting Features
  • Efficient Processing

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these benefits of the CatBoost library. Why or why not? Have you used or heard of this library? What other benefits do you think are important to point out in regard to this library? These could certainly be explored even further, but I hope I was able to shed some light on some of the benefits of CatBoost.

Thank you for reading!
