Have you ever passed your data through a list of functions and classes without knowing for sure what the output looks like?

You might save the data and then check it in a Jupyter Notebook to make sure the output is what you expected. This approach works, but it is cumbersome.

Another common issue is that it’s hard to understand the relationships between functions when a Python script contains both the code that defines the functions and the code that executes them.

Your code looks even more complex and hard to follow as the project grows.

Wouldn’t it be nicer if you could visualize how the inputs and outputs of different functions are connected, like below?

Image by Author

That is where Kedro comes in handy.

What is Kedro?

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It borrows concepts from software engineering best practices and applies them to machine learning code.

Kedro allows you to:

  1. Create a data science project from a cookie-cutter template
  2. Create a data science pipeline
  3. Slice a pipeline
  4. Modularize a pipeline
  5. Configure your data and parameters through a YAML file
  6. Effortlessly analyze nodes’ outputs in a Jupyter Notebook
  7. Visualize your pipeline
  8. Create documentation for your project

In this article, I will go through each of these features and explain how they can be helpful for your data science projects.

To install Kedro, type:

pip install kedro

Set Up a Data Science Project

Create a Data Science Project from a Cookie-Cutter Template

Have you ever contemplated how to structure your data science project so that it is logical and reasonably standardized? Wouldn’t it be nice if you could create a well-defined and standard project structure with a single command?

That can easily be done with Kedro. After installing Kedro, you can create a new empty project using:

$ kedro new

After you answer a series of questions, a new project will be created with a structure like the one below:

Image by Author

If we look at the project structure at a higher level, we can see that there are 6 main directories:

Image by Author
  • conf: Store configuration files
  • data: Store data
  • docs: Store documentation of the project
  • logs: Store log files
  • notebooks: Store Jupyter Notebooks
  • src: Store the main code

Install Dependencies

A Kedro project needs some basic dependencies to be installed before you can use it. These dependencies are specified in src/requirements.txt. To install them, type:

$ kedro install

All the dependencies needed to run the pipelines will then be installed in your environment.

Now that we have learned how to set up a data science project, let’s understand how to create a pipeline with Kedro.

Create a Pipeline

To create a new pipeline, type:

$ kedro pipeline create <NAME>

Since a data science project often has two steps, data processing and model training, we will create a pipeline called data_engineering and a pipeline called data_science:

$ kedro pipeline create data_engineering
$ kedro pipeline create data_science

These two pipeline directories will be created under src/project_name:

Image by Author

Each pipeline consists of 4 files:

  • : specifies information about the pipeline
  • : contains nodes
  • : contains pipelines


A pipeline consists of multiple nodes. Each node is a Python function.

For each node, there are input(s) and output(s):

Visualization of the node get_classes:

Image by Author

Inputs and outputs of each node can be specified using None, a string, a list of strings, or a dictionary. Note that these strings are abstract names, not the real values.

Why are names useful? If we know the name of each function’s inputs and outputs, we can easily grab a specific input or output by calling its name. Thus, there is less ambiguity in your code.
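To make the idea of named inputs and outputs concrete, here is a minimal plain-Python sketch. It is not Kedro's implementation: the run_node helper, the dataset names, and the sample rows are all hypothetical, and they only illustrate how a function can be wired to data by name instead of by value.

```python
from typing import Any, Callable, Dict, List


def get_classes(rows: List[Dict[str, Any]], target_col: str) -> List[str]:
    """Return the sorted unique values of the target column."""
    return sorted({row[target_col] for row in rows})


def run_node(func: Callable, inputs: List[str], output: str,
             datasets: Dict[str, Any]) -> None:
    """Look up each input by name, call the function, store the result by name."""
    args = [datasets[name] for name in inputs]
    datasets[output] = func(*args)


# Hypothetical data store keyed by abstract names, not by real values or paths.
datasets: Dict[str, Any] = {
    "raw_data": [
        {"length": 5.1, "species": "setosa"},
        {"length": 6.3, "species": "virginica"},
        {"length": 4.9, "species": "setosa"},
    ],
    "target_col": "species",
}

run_node(get_classes, inputs=["raw_data", "target_col"], output="classes",
         datasets=datasets)
print(datasets["classes"])  # ['setosa', 'virginica']
```

In Kedro itself, this wiring is declared with the node function from kedro.pipeline, which takes the Python function together with the names of its inputs and outputs.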

Image by Author

You can find all node definition syntax here.


A pipeline consists of a list of nodes. It connects the outputs of one node to the inputs of another node.

For example, 'classes' (the output of the function get_classes) and 'encoded_data' (the output of the function encode_categorical_columns) are used as the inputs of the function split_data.
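The following plain-Python sketch shows how a pipeline can work out the execution order purely from dataset names. It is a simplified stand-in, not Kedro's runner: the run_pipeline helper, the function bodies, the sample rows, and the output names train and test are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for a Kedro node: (function, input names, output names).
Node = Tuple[Callable, List[str], List[str]]


def get_classes(rows, target_col):
    """Return the sorted unique labels in the target column."""
    return sorted({row[target_col] for row in rows})


def encode_categorical_columns(rows, target_col):
    """One-hot encode the target column."""
    classes = get_classes(rows, target_col)
    return [
        {**{k: v for k, v in row.items() if k != target_col},
         **{cls: int(row[target_col] == cls) for cls in classes}}
        for row in rows
    ]


def split_data(encoded_rows, classes, test_data_ratio):
    """Split rows into train/test and separate features (X) from targets (y)."""
    n_test = int(len(encoded_rows) * test_data_ratio)

    def to_xy(rows):
        X = [{k: v for k, v in row.items() if k not in classes} for row in rows]
        y = [{k: row[k] for k in classes} for row in rows]
        return X, y

    return to_xy(encoded_rows[n_test:]), to_xy(encoded_rows[:n_test])


def run_pipeline(nodes: List[Node], datasets: Dict[str, Any]) -> Dict[str, Any]:
    """Run each node as soon as all of its named inputs are available."""
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in datasets for name in n[1])]
        if not ready:
            raise ValueError("Some inputs are never produced")
        for func, inputs, outputs in ready:
            result = func(*(datasets[name] for name in inputs))
            if len(outputs) == 1:
                result = (result,)
            datasets.update(zip(outputs, result))
        pending = [n for n in pending if n not in ready]
    return datasets


# Nodes are deliberately listed out of order: the names decide the order.
pipeline: List[Node] = [
    (split_data, ["encoded_data", "classes", "test_data_ratio"], ["train", "test"]),
    (get_classes, ["raw_data", "target_col"], ["classes"]),
    (encode_categorical_columns, ["raw_data", "target_col"], ["encoded_data"]),
]

result = run_pipeline(pipeline, {
    "raw_data": [
        {"length": 5.1, "species": "setosa"},
        {"length": 6.3, "species": "virginica"},
        {"length": 4.9, "species": "setosa"},
        {"length": 5.8, "species": "versicolor"},
    ],
    "target_col": "species",
    "test_data_ratio": 0.25,
})
print(result["classes"])
print(result["test"])
```

Because split_data needs 'classes' and 'encoded_data', it only runs after the two nodes that produce them, even though it appears first in the list.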

Visualization of the pipeline above:

Image by Author

Register and Run the Pipeline

Now that we have the nodes and the pipelines, let’s register these pipelines in the file src/project_name/ :
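Conceptually, registering pipelines means mapping a short name to each pipeline, plus a default that runs when no name is given. The sketch below mimics that idea in plain Python, with lists of node names standing in for real Kedro pipelines. The "ds" key, the data science node names, and the select_pipeline helper are hypothetical.

```python
from typing import Dict, List, Optional


def register_pipelines() -> Dict[str, List[str]]:
    """Map pipeline names to pipelines (lists of node names stand in for pipelines)."""
    data_engineering = ["get_classes", "encode_categorical_columns", "split_data"]
    data_science = ["train_model", "evaluate_model"]  # hypothetical node names
    return {
        "de": data_engineering,
        "ds": data_science,
        # Kedro uses the "__default__" key when no pipeline name is given.
        "__default__": data_engineering + data_science,
    }


def select_pipeline(name: Optional[str] = None) -> List[str]:
    """Pick the named pipeline, or the default one when no name is given."""
    return register_pipelines()[name or "__default__"]


print(select_pipeline("de"))
print(select_pipeline())
```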

After we have registered our pipelines, we can run these pipelines using:

$ kedro run

If you only want to run a specific pipeline, add --pipeline=<NAME> to the command kedro run:

$ kedro run --pipeline=de


Slice a Pipeline

If you prefer to run only a portion of a pipeline, you can slice the pipeline. Four options to slice a pipeline are:

  • --from-nodes: Run the pipeline starting from certain nodes
  • --to-nodes: Run the pipeline until reaching certain nodes
  • --from-inputs: Run the pipeline starting from the nodes that take certain datasets as inputs
  • --to-outputs: Run the pipeline until reaching the nodes that produce certain outputs

For example,

$ kedro run --to-nodes=encode_categorical_columns

… allows you to run the pipeline until reaching the node encode_categorical_columns.
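To see what slicing "up to a node" means, here is a plain-Python sketch that collects a target node together with every node it depends on. It only illustrates the idea and is not Kedro's code: the NODES table and the to_nodes helper are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# node name -> (input dataset names, output dataset names); hypothetical pipeline
NODES: Dict[str, Tuple[List[str], List[str]]] = {
    "get_classes": (["raw_data"], ["classes"]),
    "encode_categorical_columns": (["raw_data"], ["encoded_data"]),
    "split_data": (["encoded_data", "classes"], ["train", "test"]),
}


def to_nodes(nodes: Dict[str, Tuple[List[str], List[str]]], target: str) -> Set[str]:
    """Return the target node plus every node it depends on, directly or indirectly."""
    # Map each dataset name to the node that produces it.
    producers = {out: name for name, (_, outs) in nodes.items() for out in outs}
    selected: Set[str] = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in selected:
            continue
        selected.add(name)
        for dataset in nodes[name][0]:
            if dataset in producers:  # raw inputs have no producer node
                stack.append(producers[dataset])
    return selected


print(sorted(to_nodes(NODES, "encode_categorical_columns")))
print(sorted(to_nodes(NODES, "split_data")))
```

In this sketch, slicing to encode_categorical_columns selects only that node because it reads raw data directly, while slicing to split_data pulls in all three nodes.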

Modularize a Pipeline

Sometimes, you might want to reuse the same pipeline for different purposes. Kedro allows you to create modular pipelines, which are isolated and can be reused.

For example, instead of writing two separate pipelines, “cook lunch pipeline” and “cook dinner pipeline”, you can write a single pipeline called “cook pipeline”.

Then, turn “cook pipeline” into “cook lunch pipeline” and “cook dinner pipeline” by replacing the inputs and outputs of “cook pipeline” with new values.

Image by Author

Modular pipelines are nice because they are portable and easier to develop, test, and maintain. Find instructions on how to modularize your Kedro pipeline here.
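The reuse idea described above can be sketched in plain Python: write the pipeline once, then create copies whose inputs and outputs are renamed. This is an illustration only, loosely modeled on how namespaced pipelines behave, not Kedro's API. The remap helper and all node and dataset names are hypothetical.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, List[str], List[str]]  # (function name, input names, output names)

# A generic, reusable pipeline written once.
cook_pipeline: List[Node] = [
    ("defrost", ["frozen_food"], ["raw_food"]),
    ("grill", ["raw_food"], ["meal"]),
]


def remap(nodes: List[Node], mapping: Dict[str, str], namespace: str) -> List[Node]:
    """Copy a pipeline, renaming mapped datasets and prefixing the rest with a namespace."""
    def rename(name: str) -> str:
        # Unmapped (intermediate) datasets get a prefix so copies never collide.
        return mapping.get(name, f"{namespace}.{name}")

    return [
        (func, [rename(i) for i in ins], [rename(o) for o in outs])
        for func, ins, outs in nodes
    ]


lunch = remap(cook_pipeline, {"frozen_food": "frozen_veg", "meal": "lunch"}, "lunch")
dinner = remap(cook_pipeline, {"frozen_food": "frozen_meat", "meal": "dinner"}, "dinner")
print(lunch)
print(dinner)
```

Both copies share the same steps, but each reads its own input and writes its own output, and the intermediate dataset raw_food is kept separate by the prefix.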

Configure Your Parameters and Data Through a YAML File


Kedro also allows you to specify the parameters for a function using a YAML file. This is very nice because you can view all of your parameters from a file without digging into the source code.

To configure your project with a configuration file, start by putting the parameters used for the data_engineering pipeline in the conf/base/parameters/data_engineering.yml file.

Now, accessing a parameter from the pipeline is simple. All we need to do is add params: before the name of the parameter we want to access.

For example, we use params:test_data_ratio to access the parameter test_data_ratio from the configuration file.
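Here is a plain-Python sketch of how an input name with the params: prefix can be told apart from a regular dataset name. It is not Kedro's implementation: the resolve_input helper, the parameter values, and the simplified split_data function are assumptions, and a dictionary stands in for the parsed YAML file.

```python
from typing import Any, Dict, List

# Stand-in for the parsed contents of the parameters YAML file (hypothetical values).
parameters: Dict[str, Any] = {"test_data_ratio": 0.25, "target_col": "species"}


def resolve_input(name: str, datasets: Dict[str, Any], params: Dict[str, Any]) -> Any:
    """Return a parameter for names starting with 'params:', otherwise a dataset."""
    prefix = "params:"
    if name.startswith(prefix):
        return params[name[len(prefix):]]
    return datasets[name]


def split_data(rows: List[int], test_data_ratio: float):
    """Split rows into a train part and a test part."""
    n_test = int(len(rows) * test_data_ratio)
    return rows[n_test:], rows[:n_test]


datasets = {"encoded_data": list(range(8))}
node_inputs = ["encoded_data", "params:test_data_ratio"]

args = [resolve_input(name, datasets, parameters) for name in node_inputs]
train, test = split_data(*args)
print(len(train), len(test))  # 6 2
```

Changing the ratio then only requires editing the configuration, not the function or the pipeline definition.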

Data Catalog

Now you might wonder: how do we access the data from the pipeline?

Kedro allows you to load and save data with Data Catalog. Data Catalog is located under conf/base/catalog.yml.

Data Catalog is a configuration file that allows you to specify the dataset type and the location where the data is saved.

For example, to save the encoded_data output from the encode_categorical_columns node, we can add the name encoded_data to the conf/base/catalog.yml file. Under encoded_data, we specify the dataset’s type and its location.

pandas.CSVDataSet tells Kedro that we want to save encoded_data as a CSV and load it using pandas. Kedro also supports other dataset types such as pickle, parquet, excel, plot, Spark, SQL, etc. Find all datasets that Kedro supports here.
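To illustrate what a catalog entry does, the sketch below implements a tiny catalog in plain Python. It uses the standard csv module instead of pandas, and a dictionary instead of the YAML file, so it is a stand-in rather than Kedro's Data Catalog. The MiniCatalog class, the "csv" type name, the semicolon delimiter, and the sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_csv(rows: List[Dict[str, Any]], filepath: str, **save_args) -> None:
    """Write rows to a CSV file, forwarding save_args to the csv writer."""
    with open(filepath, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]), **save_args)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(filepath: str, **load_args) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dicts, forwarding load_args to the csv reader."""
    with open(filepath, newline="") as f:
        return [dict(row) for row in csv.DictReader(f, **load_args)]


# Dataset type name -> (load function, save function)
DATASET_TYPES = {"csv": (load_csv, save_csv)}


class MiniCatalog:
    """Look up how and where to load or save a dataset from its name alone."""

    def __init__(self, config: Dict[str, Dict[str, Any]]) -> None:
        self._config = config

    def save(self, name: str, data: Any) -> None:
        entry = self._config[name]
        _, saver = DATASET_TYPES[entry["type"]]
        saver(data, entry["filepath"], **entry.get("save_args", {}))

    def load(self, name: str) -> Any:
        entry = self._config[name]
        loader, _ = DATASET_TYPES[entry["type"]]
        return loader(entry["filepath"], **entry.get("load_args", {}))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the catalog file: name -> type, location, optional arguments.
    config = {
        "encoded_data": {
            "type": "csv",
            "filepath": str(Path(tmp) / "encoded_data.csv"),
            "save_args": {"delimiter": ";"},
            "load_args": {"delimiter": ";"},
        }
    }
    catalog = MiniCatalog(config)
    catalog.save("encoded_data", [{"length": 5.1, "setosa": 1},
                                  {"length": 6.3, "setosa": 0}])
    loaded = catalog.load("encoded_data")

print(loaded)
```

The code that produces or consumes encoded_data never mentions a file path or a file format. Note that the csv module reads every value back as a string, which a pandas-based dataset would not do.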

Using a Data Catalog to manage loading and saving of data is very nice because you can specify:

  • the locations of each file
  • how the data is saved and loaded

… by editing only one file. The next time you forget the locations of your datasets, you can look at this configuration file to find them.

Load Data Catalog and Parameters From Jupyter Notebook

Load Data Catalog

Have you ever wanted to quickly check the outputs of a function from a Jupyter Notebook? Normally, you need to first save the data, then load it into a Jupyter Notebook:

Image by Author

With Kedro, saving and loading data don’t require extra code. To start a Kedro Jupyter session, type:

$ kedro jupyter notebook

After running the pipeline that produces the output encoded_data specified in the catalog.yml file, we can easily load the data in the notebook using catalog.load, for example catalog.load("encoded_data").

If you want to specify how to save and load the output encoded_data, add load_args and save_args under encoded_data.

Note that the configuration above will be equivalent to:

Load Parameters

All parameters inside conf/base can also be easily loaded with context.params.

Image by Author

Run a Pipeline

Sometimes you might want to experiment with new Python functions and observe their outputs in the notebook. Luckily, Kedro allows you to run a pipeline inside a notebook:

For quick testing, you can also convert functions from a Jupyter Notebook into Kedro nodes.

Visualize Your Pipelines

If you are confused about the structure of your pipeline, you can visualize the entire pipeline using kedro-viz. Start by installing kedro-viz:

pip install kedro-viz

Then type:

$ kedro viz

… to visualize your pipelines. A web page will open automatically in your browser. You should see something like the below:

Image by Author

The visualization captures the relationships between datasets and nodes. Clicking a node shows more information about that node, such as its parameters, inputs, outputs, file path, and code.

GIF by Author

Create Documentation for the Project

Documenting your project makes it easier for your team to understand your project and how to use your code. Kedro makes it easy to create documentation that is based on the code structure of your project and includes any docstrings defined in your code.

To create documentation for your project, type:

$ kedro build-docs

The documentation for your project will then be created automatically under docs/build/html! You can browse it by either opening docs/build/html/index.html or running:

$ kedro build-docs --open

… to automatically open the documentation after building.

Your documentation should look similar to the one below:

Image by Author
Image by Author


Congratulations! You have just learned how to create reproducible and maintainable data science projects using Kedro. It might take a little bit of time to learn Kedro, but once your data science project is set up with Kedro, you will find it much less overwhelming to maintain and update.

I hope this article gives you the motivation to use Kedro in your existing or future data science projects.

The source code of the demo project in this article can be found here: GitHub – khuyentran1401/kedro_demo: A data science project using Kedro (generated using Kedro 0.17.4).

I like to write about basic data science concepts and play with different algorithms and data science tools. You can connect with me on LinkedIn and Twitter.
