Motivation

Have you ever passed your data through a series of functions and classes without knowing for sure what the output would look like?

You might try to save the data then check it in your Jupyter Notebook to make sure the output is as you expected. This approach works, but it is cumbersome.

Another common issue is that it’s hard to understand the relationships between functions when looking at a Python script that contains both the code that defines the functions and the code that executes them.

Your code becomes even more complex and harder to follow as the project grows.

Wouldn’t it be nicer if you could visualize how the inputs and outputs of different functions are connected, like below?

Image by Author

That is when Kedro comes in handy.

What is Kedro?

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It borrows concepts from software engineering best practices and applies them to machine learning code.

Kedro allows you to:

  1. Create a data science project from a cookie-cutter template
  2. Create a data science pipeline
  3. Slice a pipeline
  4. Modularize a pipeline
  5. Configure your data and parameters through a YAML file
  6. Effortlessly analyze nodes’ outputs in a Jupyter Notebook
  7. Visualize your pipeline
  8. Create documentation for your project

In this article, I will go through each of these features and explain how they can be helpful for your data science projects.

To install Kedro, type:

pip install kedro

Set Up a Data Science Project

Create a Data Science Project from a Cookie-Cutter Template

Have you ever contemplated how to structure your data science project so that it is logical and reasonably standardized? Wouldn’t it be nice if you could create a well-defined and standard project structure with a single line of code?

That can easily be done with Kedro. After installing Kedro, you can create a new empty project using:

$ kedro new

After going through a series of questions, a new project will be created with a structure like the one below:

Image by Author

If we look at the project structure at a higher level, we can see that there are six main directories:

Image by Author
  • conf: Store configuration files
  • data: Store data
  • docs: Store documentation of the project
  • logs: Store log files
  • notebooks: Store Jupyter Notebooks
  • src: Store the main code

Install Dependencies

Kedro requires some basic dependencies before you can run a project. These dependencies are specified in src/requirements.txt. To install them, type:

$ kedro install

And all necessary dependencies to run the pipelines will be installed in your environment.

Now that we have learned how to set up a data science project, let’s see how to create a pipeline with Kedro.

Create a Pipeline

To create a new pipeline, type:

$ kedro pipeline create <NAME>

Since a data science project often consists of two steps, data processing and model training, we will create a pipeline called data_engineering and a pipeline called data_science:

$ kedro pipeline create data_engineering
$ kedro pipeline create data_science

These two pipeline directories will be created under src/project_name:

Image by Author

Each pipeline consists of four files:

  • __init__.py
  • README.md: specifies information about the pipeline
  • nodes.py: contains the node functions
  • pipeline.py: contains the pipeline definition

Node

A pipeline consists of multiple nodes. Each node is a Python function.

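For example, the nodes used later in this article (get_classes, encode_categorical_columns, and split_data) could be plain Python functions along these lines. This is only a minimal sketch that assumes a target column named 'target'; the actual implementations in the demo project may differ:

from typing import List, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split


def get_classes(data: pd.DataFrame) -> List[str]:
    """Return the unique classes of the target column (assumed here to be 'target')."""
    return sorted(data["target"].unique().tolist())


def encode_categorical_columns(data: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categorical feature columns, leaving the target column untouched."""
    features = pd.get_dummies(data.drop(columns=["target"]), drop_first=True)
    return features.join(data["target"])


def split_data(
    data: pd.DataFrame, classes: List[str], test_data_ratio: float
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Split the encoded data into train and test sets, stratified by the target classes."""
    y = data["target"].astype(pd.CategoricalDtype(categories=classes))
    X = data.drop(columns=["target"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_data_ratio, stratify=y
    )
    return X_train, X_test, y_train, y_test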

For each node, there are input(s) and output(s):

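As a rough sketch, a function becomes a node by wrapping it with kedro.pipeline.node and naming its inputs and outputs (the dataset names raw_data, classes, and encoded_data are illustrative; the real names in the demo project may differ):

from kedro.pipeline import node

# wrap get_classes: reads the dataset named "raw_data", produces "classes"
get_classes_node = node(func=get_classes, inputs="raw_data", outputs="classes")

# wrap encode_categorical_columns: reads "raw_data", produces "encoded_data"
encode_node = node(func=encode_categorical_columns, inputs="raw_data", outputs="encoded_data")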

Visualization of the node get_classes:

Image by Author

The inputs and outputs of each node can be specified using None, a string, a list of strings, or a dictionary. Note that these strings are abstract names, not the actual values.

Why are names useful? If we know the name of each function’s inputs and outputs, we can easily grab a specific input or output by calling its name. Thus, there is less ambiguity in your code.

Image by Author

You can find all node definition syntax here.

Pipeline

A pipeline consists of a list of nodes. It connects the outputs of one node to the inputs of another node.

For example, in the pipeline below, 'classes' (the output of the function get_classes) and 'encoded_data' (the output of the function encode_categorical_columns) are used as the inputs of the function split_data.
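
A rough sketch of such a pipeline, reusing the node functions and dataset names from the sketches above (the exact definitions in the demo project may differ):

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(func=get_classes, inputs="raw_data", outputs="classes"),
            node(func=encode_categorical_columns, inputs="raw_data", outputs="encoded_data"),
            node(
                func=split_data,
                inputs=["encoded_data", "classes", "params:test_data_ratio"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
            ),
        ]
    )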

Visualization of the pipeline above:

Image by Author

Register and Run the Pipeline

Now that we have the nodes and the pipelines, let’s register these pipelines in the file src/project_name/pipeline_registry.py:

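A minimal sketch of what pipeline_registry.py might look like, assuming the project package is named kedro_demo (the short names "de" and "ds" are the ones used with kedro run below):

from typing import Dict

from kedro.pipeline import Pipeline

from kedro_demo.pipelines import data_engineering as de
from kedro_demo.pipelines import data_science as ds


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines."""
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }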

After we have registered our pipelines, we can run these pipelines using:

$ kedro run

If you only want to run a specific pipeline, add --pipeline=<NAME> to the kedro run command:

$ kedro run --pipeline=de

Cool!

Slice a Pipeline

If you prefer to run only a portion of a pipeline, you can slice the pipeline. Four options to slice a pipeline are:

  • --from-nodes : Run the pipeline starting from certain nodes
  • --to-nodes : Run the pipeline until reaching certain nodes
  • --from-inputs : Run the pipeline starting from the nodes that produce certain inputs
  • --to-outputs : Run the pipeline until reaching the nodes that produce certain outputs

For example,

$ kedro run --to-nodes=encode_categorical_columns

… allows you to run the pipeline until reaching the node encode_categorical_columns.

Modularize a Pipeline

Sometimes, you might want to reuse the same pipeline for different purposes. Kedro allows you to create modular pipelines, which are isolated and can be reused.

For example, instead of writing two separate pipelines “cook lunch pipeline” and “cook dinner pipeline”, you can write a pipeline called “cook pipeline”.

Then, turn the “cook pipeline” into a “cook meat pipeline” and a “cook vegetable pipeline” by swapping the inputs and outputs of the “cook pipeline” with new values.

Image by Author

Modular pipelines are nice since they are portable and easier to develop, test, and maintain. Find instructions on how to modularize your Kedro pipeline here.
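
As a rough sketch of the idea, Kedro's pipeline() helper can remap the inputs and outputs of an existing pipeline; everything below (the cook function and the dataset names) is purely illustrative:

from kedro.pipeline import Pipeline, node, pipeline


def cook(food: str) -> str:
    return f"cooked {food}"


# a generic pipeline that reads a "food" dataset and produces "cooked_food"
cook_pipeline = Pipeline([node(func=cook, inputs="food", outputs="cooked_food")])

# reuse it with different inputs and outputs
cook_meat_pipeline = pipeline(
    cook_pipeline, inputs={"food": "meat"}, outputs={"cooked_food": "cooked_meat"}
)
cook_vegetable_pipeline = pipeline(
    cook_pipeline, inputs={"food": "vegetables"}, outputs={"cooked_food": "cooked_vegetables"}
)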

Configure Your Parameters and Data Through a YAML File

Parameters

Kedro also allows you to specify the parameters for a function using a YAML file. This is very nice because you can view all of your parameters from a file without digging into the source code.

To configure your project with a configuration file, start by putting the parameters used for the data_engineering pipeline in the conf/base/parameters/data_engineering.yml file.

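For example, the file could contain something as simple as the following (the value is illustrative; only test_data_ratio is used later in this article):

# conf/base/parameters/data_engineering.yml
test_data_ratio: 0.2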

Now, accessing a parameter from the pipeline is simple. All we need to do is add params: before the name of the parameter we want to access.

In the example below, we use params:test_data_ratio to access the parameter test_data_ratio from the configuration file.

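A minimal sketch, reusing the split_data node from the pipeline above (dataset names are illustrative):

from kedro.pipeline import node

split_data_node = node(
    func=split_data,
    inputs=["encoded_data", "classes", "params:test_data_ratio"],  # params:<name> pulls the value from the YAML file
    outputs=["X_train", "X_test", "y_train", "y_test"],
)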

Data Catalog

Now you might wonder: how do we access the data from the pipeline?

Kedro allows you to load and save data with the Data Catalog, which is located at conf/base/catalog.yml.

The Data Catalog is a configuration file that specifies the type of each dataset and the location where it is saved.

For example, to save the encoded_data output from the encode_categorical_columns node,

… we can add an entry named encoded_data to the conf/base/catalog.yml file. Under encoded_data, we specify the dataset’s type and its location.

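A sketch of such an entry (the filepath is illustrative):

# conf/base/catalog.yml
encoded_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/encoded_data.csv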

pandas.CSVDataSet tells Kedro that we want to save encoded_data as a CSV and load it using pandas. Kedro also supports other dataset types such as pickle, Parquet, Excel, plot, Spark, SQL, etc. Find all datasets that Kedro supports here.

Using a Data Catalog to manage loading and saving of data is very nice because you can specify:

  • the locations of each file
  • how the data is saved and loaded

… by editing only one file. The next time you forget where your datasets are located, you can look them up in this configuration file.

Load Data Catalog and Parameters From Jupyter Notebook

Load Data Catalog

Have you ever wanted to quickly check the outputs of a function from a Jupyter Notebook? Normally, you need to first save the data:

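For example, assuming encoded_data is a pandas DataFrame, you might write something like:

# save the output manually so it can be inspected later (file name is illustrative)
encoded_data.to_csv("encoded_data.csv", index=False)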

… then load it into a Jupyter Notebook:

Image by Author

With Kedro, saving and loading data don’t require extra code. To start a Kedro’s Jupyter session, type:

$ kedro jupyter notebook

After running the pipeline that produces the output encoded_data specified in the catalog.yml file,

… we can easily load the data in the notebook using catalog.load:

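For example, inside the notebook started by kedro jupyter notebook, a catalog variable is already available:

# load the dataset registered in catalog.yml and take a quick look at it
encoded_data = catalog.load("encoded_data")
encoded_data.head()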

If you want to specify how to save and load the output encoded_data, add load_args and save_args under encoded_data.
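
A sketch of what that could look like (the specific arguments are illustrative):

# conf/base/catalog.yml
encoded_data:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/encoded_data.csv
  load_args:
    sep: ','
  save_args:
    index: False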

Note that the configuration above will be equivalent to:

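With the illustrative arguments above, that configuration corresponds roughly to:

import pandas as pd

# saving encoded_data (save_args)
encoded_data.to_csv("data/02_intermediate/encoded_data.csv", index=False)

# loading encoded_data (load_args)
encoded_data = pd.read_csv("data/02_intermediate/encoded_data.csv", sep=",")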

Load Parameters

All parameters inside conf/base can also be easily loaded with context.params.

Image by Author
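
For example, in a Kedro Jupyter session (a minimal sketch):

# `context` is available in a notebook started with `kedro jupyter notebook`
context.params["test_data_ratio"]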

Run a Pipeline

Sometimes you might want to experiment with new Python functions and observe their outputs in the notebook. Luckily, Kedro allows you to run a pipeline inside a notebook:

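For example, in recent Kedro versions the Jupyter session also exposes a session object that can run a registered pipeline; a minimal sketch, using the short name "de" from the registry sketch above:

# run the data_engineering pipeline registered under the short name "de"
session.run(pipeline_name="de")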

For quick testing, you can also convert functions from a Jupyter Notebook into Kedro nodes.

Visualize Your Pipelines

If you are confused about the structure of your pipeline, you can visualize the entire pipeline using kedro-viz. Start with installing kedro-viz:

pip install kedro-viz

Then type:

$ kedro viz

… to visualize your pipelines. A website will automatically open in your browser. You should see something like this:

Image by Author

The visualization captures the relationships between datasets and nodes. Clicking a node will provide you more information about that node such as its parameters, inputs, outputs, file path, and code.

GIF by Author

Create Documentation for the Project

Documenting your project makes it easier for your team to understand your project and how to use your code. Kedro makes it easy for you to create documentation based on the code structure of your project and includes any docstrings defined in your code.

To create documentation for your project, type:

$ kedro build-docs

And documentation for your project will be automatically created under docs/build/html! You can browse the documentation by either opening docs/build/html/index.html or running:

$ kedro build-docs --open

… to automatically open the documentation after building.

Your documentation should look similar to below:

Image by Author
Image by Author

Conclusion

Congratulations! You have just learned how to create reproducible and maintainable data science projects using Kedro. It might take a little bit of time to learn Kedro, but once your data science project is set up with Kedro, you will find it much less overwhelming to maintain and update your projects.

I hope this article will give you the motivation to use Kedro in your existing or future data science project.

The source code of the demo project in this article can be found in the khuyentran1401/kedro_demo repository on GitHub.

I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.
