We live in the era of big data. Terabytes of data flow around us constantly, and some issues are inevitable in such a tremendous flow.

Data-related issues are a reality we need to cope with. To build reliable and accurate products, it is crucial to monitor data quality continuously.

Great Expectations is a Python library that helps us validate, document, and profile our data so that we can be confident it is sound and matches what we expect.

Great Expectations provides several functions to evaluate data from many different perspectives. For example, the expect_column_values_to_be_unique function checks whether all values in a column are unique.

It returns not only a simple true or false answer but also other useful information, such as the total number of values and the number of unexpected values.

In this article, we will explore the Great Expectations library by applying some of its functions to a sample dataset.

We can easily install it with pip.

pip install great_expectations
# in a jupyter notebook
!pip install great_expectations

We can now import it.

import great_expectations as ge

Let’s start with creating a data frame to use for the examples. I have previously created a sales dataset filled with mock data.

It is important to note that we need to have a compatible data frame to be able to apply the functions in the Great Expectations library.

One option is to convert a Pandas data frame using the from_pandas function.

import pandas as pd
sales = pd.read_csv("Sales.csv")
df = ge.from_pandas(sales)

Another option is to use the read_csv function of Great Expectations directly.

df = ge.read_csv("Sales.csv")
df.head()
df (image by author)

The id column should always be unique and duplicate id values might have severe consequences. We can easily check for the uniqueness of the values in this column.

df.expect_column_values_to_be_unique(column="id")
# output
{
"meta": {},
"result": {
"element_count": 1000,
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"unexpected_percent_total": 0.0,
"unexpected_percent_nonmissing": 0.0,
"partial_unexpected_list": []
},
"success": true,
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}

The functions of the Great Expectations library return a JSON-style result object containing several pieces of information. We can assign it to a variable and extract a specific piece of information.

In the example above, what we are actually interested in is whether the success value is true.

uniqueness = df.expect_column_values_to_be_unique(column="id")
uniqueness["success"]
# output
True

If it is false, then we should look for further details.
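To make "looking for further details" concrete, here is a minimal sketch that reads a failed result and summarizes it. It works on a plain dictionary parsed from JSON text rather than on a live Great Expectations object. The failing values are made up for illustration, and summarize is a hypothetical helper, not part of the library.

```python
import json

# Hypothetical failed result, trimmed to a few of the keys shown above.
raw = """
{
  "result": {
    "element_count": 1000,
    "unexpected_count": 2,
    "unexpected_percent": 0.2,
    "partial_unexpected_list": [417, 417]
  },
  "success": false
}
"""
outcome = json.loads(raw)


def summarize(outcome):
    """Return a one-line summary of a validation result dictionary."""
    if outcome["success"]:
        return "check passed"
    details = outcome["result"]
    return (
        f'{details["unexpected_count"]} of {details["element_count"]} values '
        f'failed, e.g. {details["partial_unexpected_list"]}'
    )


print(summarize(outcome))
```

The same key lookups should apply to the object returned by an expectation, since the example above already indexes it with ["success"].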

Suppose we expect the price of a product to be between 1 and 10000. Let’s check if the price column fits our expectations.

df.expect_column_values_to_be_between(
column="price", min_value=1, max_value=10000
)
# output
{
"meta": {},
"result": {
"element_count": 1000,
"missing_count": 0,
"missing_percent": 0.0,
"unexpected_count": 2,
"unexpected_percent": 0.2,
"unexpected_percent_total": 0.2,
"unexpected_percent_nonmissing": 0.2,
"partial_unexpected_list": [
0.76,
0.66
]
},
"success": false,
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}

The success value is false because two values, 0.76 and 0.66, fall below the minimum of 1. That is 0.2 percent of the 1000 rows.
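To show where numbers like unexpected_count and unexpected_percent come from, here is a plain-Python sketch of a range check. It is not the library's implementation: check_between is a hypothetical helper, the prices are made up, and the cap of 20 sample values is an arbitrary choice for this sketch.

```python
def check_between(values, min_value, max_value):
    """Sketch of a range check: which values fall outside [min_value, max_value]?"""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    percent = 100 * len(unexpected) / len(values) if values else 0.0
    return {
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
        # Keep only a sample of the offending values (cap chosen for this sketch).
        "partial_unexpected_list": unexpected[:20],
        "success": not unexpected,
    }


# Hypothetical prices; the two values below 1 mirror the failures above.
prices = [12.5, 0.76, 499.0, 0.66, 1.0, 10000]
report = check_between(prices, min_value=1, max_value=10000)
print(report["success"], report["partial_unexpected_list"])
```

Both bounds are treated as inclusive here, so 1.0 and 10000 pass. With the article's data, the same percentage formula gives 100 * 2 / 1000 = 0.2.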

We can also check if the values in a categorical column are in a given set.

df.expect_column_values_to_be_in_set(
column = "product_group",
value_set = ["PG1","PG2","PG3","PG4", "PG5"]
)
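Since the output is not shown, here is a rough plain-Python sketch of what a set-membership check looks for. It is not the library's code: check_in_set is a hypothetical helper and the sample values are invented.

```python
def check_in_set(values, value_set):
    """Sketch of a membership check: flag values that are not in value_set."""
    allowed = set(value_set)
    unexpected = [v for v in values if v not in allowed]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": sorted(set(unexpected)),
    }


# Hypothetical column values: "PG6" is unknown and "pg1" has the wrong case.
groups = ["PG1", "PG3", "PG6", "PG2", "pg1"]
report = check_in_set(groups, ["PG1", "PG2", "PG3", "PG4", "PG5"])
print(report)
```

The comparison is exact, so a differently cased label such as "pg1" counts as unexpected in this sketch.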

I’m not pasting the output for the remaining examples because it is kind of lengthy. You can always practice on your own though.

The Great Expectations library has several other useful functions. It would take several pages to show an example of each. Feel free to take a look at the glossary of expectations to see all of the available functions.

Another function that I find very useful is expect_column_values_to_be_increasing.

Consider the following data frame:

my_df

If we expect the values in the quantity column to always be increasing, we can use the function I have just mentioned.

my_df.expect_column_values_to_be_increasing(column="Quantity")
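The idea behind an increasing check can be sketched in plain Python by comparing each value with the one before it. This is an illustration rather than the library's implementation: check_increasing and its strictly flag are assumptions of this sketch, and the quantities are made up. Check the library's documentation for how it treats repeated values.

```python
def check_increasing(values, strictly=False):
    """Sketch: compare each value with its predecessor and collect violations."""
    pairs = zip(values, values[1:])
    if strictly:
        # A repeated value counts as a violation.
        violations = [(prev, cur) for prev, cur in pairs if cur <= prev]
    else:
        # Repeated values are tolerated; only a drop is a violation.
        violations = [(prev, cur) for prev, cur in pairs if cur < prev]
    return {"success": not violations, "violations": violations}


# Hypothetical quantities with one tie (12, 12) and one drop (15 to 14).
quantities = [10, 12, 12, 15, 14]
print(check_increasing(quantities))
print(check_increasing(quantities, strictly=True))
```

Returning the offending pairs, rather than only a flag, makes it easier to locate where the order breaks.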

We can also check uniqueness based on multiple columns. For instance, we might expect each combination of product group and id to appear only once.

Here is how we implement this check using the Great Expectations library:

df.expect_compound_columns_to_be_unique(
column_list=["product_group","id"]
)
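What compound uniqueness means can be sketched in plain Python by counting tuples built from the listed columns. This is only an illustration of the idea: check_compound_unique is a hypothetical helper and the rows are invented.

```python
from collections import Counter


def check_compound_unique(rows, column_list):
    """Sketch: a combination of columns is unique if no key tuple repeats."""
    keys = [tuple(row[col] for col in column_list) for row in rows]
    counts = Counter(keys)
    duplicated = sorted(key for key, n in counts.items() if n > 1)
    return {"success": not duplicated, "duplicated_keys": duplicated}


# Hypothetical rows for illustration.
rows = [
    {"product_group": "PG1", "id": 1},
    {"product_group": "PG2", "id": 1},  # same id, different group: a new pair
    {"product_group": "PG1", "id": 2},
    {"product_group": "PG1", "id": 1},  # repeats the first row's pair
]
report = check_compound_unique(rows, ["product_group", "id"])
print(report)
```

Here the id 1 appears three times, yet only the pair ("PG1", 1) is reported, because uniqueness is judged on the combination and not on each column separately.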

The Great Expectations library is a standard and easily maintainable solution for data quality checks. We have covered only a small part of its functions. I strongly suggest checking the glossary of expectations.

Thank you for reading. Please let me know if you have any feedback.

Author

Writing about Data Science, AI, ML, DL, Python, SQL, Stats, Math | linkedin.com/in/soneryildirim/ | twitter.com/snr14