Have you ever passed your data through a chain of functions and classes without knowing for sure what the output would look like?
You might try to save the data then check it in your Jupyter Notebook to make sure the output is as you expected. This approach works, but it is cumbersome.
Another common issue is that it is hard to see the relationships between functions when a Python script mixes the code that defines functions with the code that executes them.
Your code becomes even more complex and harder to follow as the project grows.
Wouldn’t it be nicer if you could visualize how the inputs and outputs of different functions are connected, like in the diagram below?
That is where Kedro comes in handy.
What is Kedro?
Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It borrows concepts from software engineering best practices and applies them to machine learning code.
Kedro allows you to:
- Create a data science project from a cookie-cutter template
- Create a data science pipeline
- Slice a pipeline
- Modularize a pipeline
- Configure your data and parameters through a YAML file
- Effortlessly analyze nodes’ outputs in a Jupyter Notebook
- Visualize your pipeline
- Create documentation for your project
In this article, I will go through each of these features and explain how they can be helpful for your data science projects.
To install Kedro, type:
pip install kedro
Set Up a Data Science Project
Create a Data Science Project from a Cookie-Cutter Template
Have you ever contemplated how to structure your data science project so that it is logical and reasonably standardized? Wouldn’t it be nice if you could create a well-defined, standard project structure in one line of code?
That could be easily done with Kedro. After installing Kedro, you can create a new empty project using:
$ kedro new
After answering a series of questions, a new project will be created with a structure like the one below:
If we look at the project structure at a high level, we can see that there are six main directories:
conf: Store configuration files
data: Store data
docs: Store documentation of the project
logs: Store log files
notebooks: Store Jupyter Notebooks
src: Store the main code
Kedro requires some basic dependencies before you can use it. These dependencies are specified in src/requirements.txt. To install them, type:
$ kedro install
All the dependencies needed to run the pipelines will then be installed in your environment.
Now that we have learned how to set up a data science project, let’s see how to create a pipeline with Kedro.
Create a pipeline
To create a new pipeline, type:
$ kedro pipeline create <NAME>
Since a data science project often involves two stages, data processing and model training, we will create a pipeline called data_engineering and a pipeline called data_science:
$ kedro pipeline create data_engineering
$ kedro pipeline create data_science
These two pipeline directories will be created under src/<package_name>/pipelines.
Each pipeline consists of four files:
__init__.py: makes the pipeline directory a Python package
README.md: specifies information about the pipeline
nodes.py: contains the nodes
pipeline.py: contains the pipeline
A pipeline consists of multiple nodes. Each node is a Python function.
For each node, there are input(s) and output(s):
Visualization of the node
Inputs and outputs of each node can be specified using None, a string, a list of strings, or a dictionary. Note that these strings are abstract names, not the real values.
Why are names useful? If we know the name of each function’s inputs and outputs, we can easily grab a specific input or output by calling its name. Thus, there is less ambiguity in your code.
You can find all node definition syntax here.
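To make the idea concrete, here is a minimal plain-Python sketch of how a node ties a function to named inputs and outputs. This is a simplified stand-in for Kedro's own node abstraction, not Kedro itself, and the function and data names are made up for illustration:

```python
# A "node" pairs a function with the *names* of its inputs and outputs.
# Simplified stand-in for a Kedro node, using made-up data.

def get_classes(data, target_col):
    """Return the sorted unique values of the target column."""
    return sorted(set(data[target_col]))

# Inputs and outputs are abstract names, not real values.
node = {
    "func": get_classes,
    "inputs": ["raw_data", "target_col"],
    "outputs": "classes",
}

# A small in-memory store mapping names to actual values.
catalog = {
    "raw_data": {"label": ["cat", "dog", "cat"]},
    "target_col": "label",
}

# Running the node: look up inputs by name, store the output by name.
args = [catalog[name] for name in node["inputs"]]
catalog[node["outputs"]] = node["func"](*args)
print(catalog["classes"])  # ['cat', 'dog']
```

Because every value is referenced by name, any other node can now consume "classes" without knowing which function produced it.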
A pipeline consists of a list of nodes. It connects the outputs of one node to the inputs of another node.
For example, in the code below, 'classes' (the output of the function get_classes) and 'encoded_data' (the output of the function encode_categorical_columns) are used as inputs of the function split_data.
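That wiring can be sketched in plain Python. The sketch below is a simplified stand-in for how a pipeline passes one node's named outputs to the next node's inputs; the toy function bodies are assumptions for illustration:

```python
# A pipeline is a list of nodes; each node's named outputs become
# available as inputs to later nodes. Simplified sketch, not Kedro itself.

def get_classes(data):
    return sorted(set(data["label"]))

def encode_categorical_columns(data):
    # Toy "encoding": map each label to an integer.
    mapping = {v: i for i, v in enumerate(sorted(set(data["label"])))}
    return [mapping[v] for v in data["label"]]

def split_data(classes, encoded_data):
    # Toy "split": first half for training, second half for testing.
    half = len(encoded_data) // 2
    return {"classes": classes,
            "train": encoded_data[:half],
            "test": encoded_data[half:]}

pipeline = [
    {"func": get_classes, "inputs": ["raw_data"], "outputs": "classes"},
    {"func": encode_categorical_columns, "inputs": ["raw_data"], "outputs": "encoded_data"},
    {"func": split_data, "inputs": ["classes", "encoded_data"], "outputs": "split"},
]

catalog = {"raw_data": {"label": ["cat", "dog", "cat", "dog"]}}
for node in pipeline:
    args = [catalog[name] for name in node["inputs"]]
    catalog[node["outputs"]] = node["func"](*args)

print(catalog["split"]["train"])  # [0, 1]
```

Note how split_data never calls get_classes or encode_categorical_columns directly; it only asks for their outputs by name, which is what makes the graph easy to rewire and visualize.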
Visualization of the pipeline above:
Register and Run the Pipeline
Now that we have the nodes and the pipelines, let’s register these pipelines in the project’s pipeline registry.
After we have registered our pipelines, we can run these pipelines using:
$ kedro run
If you only want to run a specific pipeline, add --pipeline=<NAME> to the kedro run command:
$ kedro run --pipeline=de
Here, de is the name under which the data_engineering pipeline is assumed to be registered.
Slice a Pipeline
If you prefer to run only a portion of a pipeline, you can slice the pipeline. Four options to slice a pipeline are:
--from-nodes: Run the pipeline starting from certain nodes
--to-nodes: Run the pipeline until reaching certain nodes
--from-inputs: Run the pipeline starting from the nodes that produce certain inputs
--to-outputs: Run the pipeline until reaching the nodes that produce certain outputs
For example:
$ kedro run --to-nodes=encode_categorical_columns
… allows you to run the pipeline until it reaches the node encode_categorical_columns.
Modularize a Pipeline
Sometimes, you might want to reuse the same pipeline for different purposes. Kedro allows you to create modular pipelines, which are isolated and can be reused.
For example, instead of writing two separate pipelines “cook lunch pipeline” and “cook dinner pipeline”, you can write a pipeline called “cook pipeline”.
Then, turn the “cook pipeline” into a “cook meat pipeline” and a “cook vegetable pipeline” by swapping the inputs and outputs of the “cook pipeline” with new values.
Modular pipelines are nice since they are portable, easier to develop, test, and maintain. Find instructions on how to modularize your Kedro pipeline here.
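The renaming trick behind modular pipelines can be sketched in a few lines of plain Python. This is not Kedro's API, just an illustrative toy showing how the same nodes can be reused under remapped input/output names:

```python
# Sketch of what a modular pipeline does: reuse the same nodes while
# remapping their input/output names. Toy stand-in, not Kedro's API.

def remap(nodes, mapping):
    """Return a copy of the nodes with input/output names replaced."""
    return [{
        "func": node["func"],
        "inputs": [mapping.get(n, n) for n in node["inputs"]],
        "outputs": mapping.get(node["outputs"], node["outputs"]),
    } for node in nodes]

def cook(ingredient):
    return f"cooked {ingredient}"

# One reusable "cook pipeline"...
cook_pipeline = [{"func": cook, "inputs": ["ingredient"], "outputs": "meal"}]

# ...turned into two specialized pipelines by renaming its inputs/outputs.
cook_meat = remap(cook_pipeline, {"ingredient": "meat", "meal": "meat_meal"})
cook_vegetable = remap(cook_pipeline, {"ingredient": "vegetable", "meal": "vegetable_meal"})

catalog = {"meat": "pork", "vegetable": "carrot"}
for node in cook_meat + cook_vegetable:
    args = [catalog[name] for name in node["inputs"]]
    catalog[node["outputs"]] = node["func"](*args)

print(catalog["meat_meal"])       # cooked pork
print(catalog["vegetable_meal"])  # cooked carrot
```

The cooking logic is written once; only the name mapping differs between the two reuses.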
Configure Your Parameters and Data Through a YAML File
Kedro also allows you to specify the parameters for a function using a YAML file. This is very nice because you can view all of your parameters from a file without digging into the source code.
To configure your project with a configuration file, start by putting the parameters used by the data_engineering pipeline into the conf/base/parameters.yml file.
Now, accessing a parameter from the pipeline is simple. All we need to do is add the prefix params: before the name of the parameter we want to access. In the example below, we use params:test_data_ratio to access the parameter test_data_ratio from the configuration file.
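For instance, a parameters file for this pipeline might look like the following; the value is an assumption for illustration:

```yaml
# conf/base/parameters.yml -- illustrative value
test_data_ratio: 0.2
```

A node can then declare "params:test_data_ratio" among its inputs, and Kedro will pass the value 0.2 to the function at run time.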
Now you might wonder: how do we access the data from the pipeline?
Kedro allows you to load and save data with the Data Catalog. The Data Catalog is located at conf/base/catalog.yml.
The Data Catalog is a configuration file that allows you to specify the data type and the location where the data is saved.
For example, to save the encoded_data output from the encode_categorical_columns node, we can insert the name encoded_data into the conf/base/catalog.yml file. Under encoded_data, we specify the dataset’s type and its location.
pandas.CSVDataSet tells Kedro that we want to save encoded_data as a CSV file and load it using pandas. Kedro also supports other dataset types such as pickle, parquet, Excel, plot, Spark, SQL, etc. Find all datasets that Kedro supports here.
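A catalog entry for encoded_data might look like the following; the filepath is an assumption for illustration:

```yaml
# conf/base/catalog.yml -- the filepath below is illustrative
encoded_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/encoded_data.csv
```

With this entry in place, any node that lists encoded_data as an output will have its result written to that CSV automatically, and any node that lists it as an input will read it back.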
Using a Data Catalog to manage loading and saving of data is very nice because you can specify:
- the locations of each file
- how the data is saved and loaded
… by editing only one file. The next time you forget where your datasets are located, you can look them up in this configuration file.
Load Data Catalog and Parameters From Jupyter Notebook
Load Data Catalog
Have you ever wanted to quickly check the outputs of a function from a Jupyter Notebook? Normally, you need to first save the data to disk, then load it into the Jupyter Notebook.
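That manual round trip typically looks something like the sketch below, which uses pandas and a temporary directory so the example is self-contained; the column names and values are made up:

```python
# Manual save-then-load round trip that Kedro's Data Catalog replaces.
# A temporary directory keeps the sketch self-contained.
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"label": ["cat", "dog"], "encoded": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "encoded_data.csv"
    # In the script: save the output to disk...
    df.to_csv(path, index=False)
    # ...then, in the notebook: load it back to inspect it.
    loaded = pd.read_csv(path)

print(loaded["encoded"].tolist())  # [0, 1]
```

Two extra steps, and a file path to keep track of, every time you want to peek at an intermediate result.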
With Kedro, saving and loading data don’t require extra code. To start a Kedro’s Jupyter session, type:
$ kedro jupyter notebook
After running the pipeline that produces the output encoded_data specified in the Data Catalog, we can easily load the data in the notebook using catalog.load('encoded_data').
If you want to control how the output encoded_data is saved and loaded, add an entry for it to the Data Catalog, as shown in the previous section. Note that such a catalog entry is roughly equivalent to saving the data with df.to_csv and loading it back with pd.read_csv, except that Kedro handles both steps for you.
All parameters inside conf/base can also be easily loaded with context.params.
Run a Pipeline
Sometimes you might want to experiment with new Python functions and observe their outputs in the notebook. Luckily, Kedro allows you to run a pipeline inside a notebook, for example with session.run() in recent Kedro versions.
For quick testing, you can also convert functions from a Jupyter Notebook into Kedro nodes.
Visualize your Pipelines
If you are confused about the structure of your pipeline, you can visualize the entire pipeline using kedro-viz. Start with installing kedro-viz:
pip install kedro-viz
$ kedro viz
… to visualize your pipelines. A website will automatically open in your browser. You should see something like the image below:
The visualization captures the relationships between datasets and nodes. Clicking a node will provide you more information about that node such as its parameters, inputs, outputs, file path, and code.
Create Documentation for the Project
Documenting your project makes it easier for your team to understand it and use your code. Kedro makes it easy to create documentation based on your project’s code structure, including any docstrings defined in the code.
To create documentation for your project, type:
$ kedro build-docs
And documentation for your project will be automatically created under docs/build/html! You can browse the documentation by opening docs/build/html/index.html, or run:
$ kedro build-docs --open
… to open the documentation automatically after building.
Your documentation should look similar to below:
Congratulations! You have just learned how to create reproducible and maintainable data science projects using Kedro. It might take a little time to learn, but once your data science project is set up with Kedro, you will find it much less overwhelming to maintain and update your projects.
I hope this article will give you the motivation to use Kedro in your existing or future data science project.
The source code of the demo project in this article can be found in the GitHub repository khuyentran1401/kedro_demo.