Have You Ever “Tested” Your Data Pipelines?
A comprehensive guide to make your data pipelines testable, maintainable and reliable
👋 Hi folks, thanks for reading my newsletter! My name is Yunna Wei, and I write about the modern data stack, MLOps implementation patterns and data engineering best practices. The purpose of this newsletter is to help organizations select and build an efficient Data+AI stack with real-world experiences and insights, and to help individuals build their data and AI careers with deep-dive tutorials and guidance. Please consider subscribing if you haven’t already, and reach out on LinkedIn if you ever want to connect!
Why is it necessary to test your data pipelines?
Embedding appropriate tests into your data pipelines makes them less bug-prone and also ensures the data goes through proper quality checks before flowing to end data consumers.
The two key components of any data pipeline are “code” and “data”. Code is the tool that manages how to Extract, Transform and Load (ETL) the data, while data is the ingredient of the pipeline. To be honest, most data pipeline complexity lives in the data, not in the code. To build and operate a reliable data pipeline, standard code testing alone is not enough. Hence, when we talk about testing data pipelines, we need to make sure both the code and the data are tested properly.
Therefore, today’s article is divided into two parts:
Code testing — Similar to testing traditional software / applications. Code tests include unit tests, integration tests and end-to-end system tests. Code testing is generally conducted as part of Continuous Integration (CI) pipelines to make sure the code quality is decent and that the data ingestion, data transformation and data loading functions behave as expected.
Data testing — Set expectations on critical data elements and make sure the data passes these expectation checks before it is served to end data consumers. Data testing requires continuous effort, and becomes even more important once the data pipelines are deployed to production. Beyond testing data quality, it is also essential to monitor the test results so that any data quality violations can be fixed immediately.
I will first talk about code testing and then data testing. In the end, I will share some open-source tools and frameworks that you can leverage to start adding necessary tests to your data pipelines.
Code Testing
Code testing for data pipelines is not much different from code testing for software. It generally includes unit testing, integration testing and end-to-end system testing. However, data pipelines have a couple of characteristics that make code testing more challenging.
Firstly, data pipelines are extremely data-dependent, so you will need to generate sample data — sometimes quite large volumes of it — for testing purposes.
Secondly, executing data pipelines requires heavy dependencies on external systems, including processing systems like Spark and Databricks, and data warehouses such as Snowflake, Redshift and Databricks SQL. Therefore you need to find ways to test independently — separating tests of the data processing logic from tests of the connections and interactions with these external systems.
Let’s talk about unit testing and integration testing separately and understand how each works specifically for data pipelines.
Unit testing — Unit tests are very low level and close to the source code of an application. They consist of testing the individual methods and functions used in your data pipelines. The purpose of unit tests for data pipelines is to catch errors without provisioning a heavy external environment. These errors could include refactoring errors, syntax errors in interpreted languages, configuration errors, graph structure errors, and so on. For example, you may need quite a few data transformation functions to derive the final results; you can unit-test these transformation functions to make sure they generate data as expected. As we discussed at the beginning of this article, data is the ingredient of any data pipeline, so to unit-test the functions used “in” the pipeline you also need data available for the test. There are two popular ways to obtain sufficient data for testing your pipeline code. The first is to simulate fake test data based on the distribution and statistical characteristics of the true data. The second is to copy a sample of true data into a development or staging environment for testing purposes. For the second option, it is critical to make sure there is no data privacy, security or compliance violation when copying data into an environment that is less stringent than production.
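For example, here is a minimal pytest sketch for a hypothetical transformation function, using a small handcrafted pandas DataFrame as fake input; the function `add_total_price` and its columns are illustrative assumptions, not from a real pipeline:

```python
# test_transformations.py
import pandas as pd
import pytest

# Hypothetical transformation under test: derive a total_price column.
def add_total_price(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["total_price"] = out["quantity"] * out["unit_price"]
    return out

def test_add_total_price_computes_expected_values():
    # Small, handcrafted fake data instead of a real extract.
    df = pd.DataFrame({"quantity": [1, 2, 3], "unit_price": [10.0, 5.0, 2.5]})

    result = add_total_price(df)

    assert list(result["total_price"]) == [10.0, 10.0, 7.5]
    # The transformation should not mutate its input.
    assert "total_price" not in df.columns

def test_add_total_price_rejects_missing_columns():
    with pytest.raises(KeyError):
        add_total_price(pd.DataFrame({"quantity": [1]}))
```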
Integration testing — Integration tests verify that the different modules or services used by your data pipelines work well together. The most important integration tests for data pipelines cover interactions with data platforms, such as data warehouses, data lakes (mainly cloud storage locations) and data source applications (OLTP databases and SaaS applications such as Salesforce and Workday). As we all know, the three key steps of a data pipeline are Extract, Transform and Load. At least two of them (extract and load), and sometimes all three, need to interact with the above-mentioned data platforms. Additionally, your data pipelines may also need to interact with a messaging system, such as Slack or Teams, to send notifications or alerts for key events in your pipelines. Therefore it is critical to run integration tests against all the external platforms and systems that your pipelines heavily interact with.
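As a sketch of this pattern, the test below marks itself as an integration test and exercises a hypothetical load function against an in-memory SQLite database standing in for the warehouse; in CI you would typically point the same test at a staging Snowflake / Redshift / Databricks SQL endpoint instead:

```python
# test_load_integration.py
import sqlite3

import pandas as pd
import pytest

# Hypothetical load step under test: write a DataFrame into a warehouse table.
def load_to_warehouse(df: pd.DataFrame, conn, table_name: str) -> None:
    df.to_sql(table_name, conn, if_exists="replace", index=False)

# Custom marker (register "integration" in pytest.ini) so these slower
# tests can be selected or skipped independently of the unit tests.
@pytest.mark.integration
def test_load_round_trip():
    # In-memory SQLite stands in for the real warehouse connection here;
    # a true integration test would target a staging warehouse instead.
    conn = sqlite3.connect(":memory:")
    df = pd.DataFrame({"id": [1, 2], "amount": [9.99, 5.00]})

    load_to_warehouse(df, conn, "orders")
    loaded = pd.read_sql_query("SELECT * FROM orders", conn)

    assert len(loaded) == 2
    assert set(loaded.columns) == {"id", "amount"}
```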
Most data pipelines are written in Python. To automate code testing for your data pipelines, you will need a Python testing framework such as pytest, which is probably the most widely used one. It can be used to write various types of software tests, including unit tests, integration tests, end-to-end tests, and functional tests.
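To illustrate, here is a small sketch of how a pytest fixture can centralize synthetic test data and share it across tests; the `fake_orders` fixture, its column names and ranges are illustrative assumptions, not part of any particular pipeline:

```python
# conftest.py-style fixture: one place to build deterministic synthetic data.
import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def fake_orders() -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)  # fixed seed keeps tests reproducible
    return pd.DataFrame(
        {
            "order_id": np.arange(1, 101),
            "amount": rng.uniform(1.0, 500.0, size=100).round(2),
            "country": rng.choice(["AU", "NZ", "US"], size=100),
        }
    )

def test_amounts_are_positive(fake_orders):
    assert (fake_orders["amount"] > 0).all()

def test_only_expected_countries(fake_orders):
    assert set(fake_orders["country"]).issubset({"AU", "NZ", "US"})
```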
Other than a testing framework, a data pipeline orchestration tool can also be leveraged to make code testing easier. As far as I know, Dagster — a pipeline orchestrator — provides workflows and functions that allow you to unit-test your data applications, separate business logic from environments, and set explicit expectations on uncontrolled inputs. If you know of other high-quality orchestration tools that also make code testing easier, please feel free to let me know; I am always keen to learn more.
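Whichever orchestrator you use, the underlying idea is to keep business logic in pure functions that can be unit-tested with in-memory data, and to push all interactions with external systems into thin wrappers. A rough sketch, with illustrative function names:

```python
import pandas as pd

# Pure business logic: easy to unit-test with an in-memory DataFrame,
# no Spark cluster, warehouse or orchestrator required.
def deduplicate_customers(df: pd.DataFrame) -> pd.DataFrame:
    return df.sort_values("updated_at").drop_duplicates("customer_id", keep="last")

# Thin wrapper: the only part that touches external systems. In an
# orchestrator such as Dagster this would become an op or asset, while
# deduplicate_customers stays a plain, independently testable function.
def deduplicate_customers_job(read_from_lake, write_to_warehouse) -> None:
    raw = read_from_lake("customers")           # injected reader
    clean = deduplicate_customers(raw)          # pure, unit-tested logic
    write_to_warehouse("dim_customers", clean)  # injected writer
```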
Now that we have covered code testing, let’s move on to data testing, which is more interesting, more dynamic and perhaps more challenging.
Let’s get started!
Data Testing
As explained above, data testing is about setting expectations on critical data elements and making sure the data flowing through your pipelines passes these expectation checks before it is served to end data consumers. If any of these data quality expectations are violated, the relevant notifications / alerts should be raised and corresponding fixes applied.
Different from code testing, which is generally conducted at build / deployment time, data testing is conducted continuously, every time a new stream or batch of data is ingested and processed. You can think of data testing as a continuous series of acceptance tests: you make assumptions and set expectations about newly arrived data and test in real time to make sure those assumptions and expectations are met.
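As a concrete sketch, a per-batch validation gate might look like the following; `validate_batch`, the column names and the thresholds are all made up for illustration:

```python
import pandas as pd

class DataQualityError(Exception):
    """Raised when a new batch fails its data quality expectations."""

def validate_batch(batch: pd.DataFrame) -> None:
    """Check a newly arrived batch against expectations before loading it."""
    failures = []
    if batch.empty:
        failures.append("batch is empty")
    if batch["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if (batch["amount"] < 0).any():
        failures.append("negative amounts")
    if failures:
        # In a real pipeline this is where you would send a Slack / Teams
        # alert and quarantine the batch instead of loading it.
        raise DataQualityError("; ".join(failures))

# Called on every new batch, e.g. at the start of the load step:
# validate_batch(new_batch)
# load_to_warehouse(new_batch)
```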
The reasons that data testing is absolutely necessary are as follows:
Firstly, data is the basis of many critical business decisions. As more and more organizations move towards data-driven decision-making, data plays an increasingly important role in the operations of modern organizations. Good-quality data therefore directly improves the quality and relevance of business decisions.
Secondly, data always changes. Unlike code, which is generally static and clean, data is extremely dynamic. Many factors can change your data: shifts in business operations, macro-economic changes, and events such as the COVID-19 pandemic all bring significant changes to the underlying data. Additionally, in most scenarios data is dirty and requires some cleaning before it is usable. Data testing therefore ensures that significant changes or drift can be detected in time, and that corrupted data can be filtered out and rejected appropriately.
Data testing can be roughly divided into the following categories:
Table-level Tests: Table-level tests focus on understanding the overall shape of a table. Below are some sample table-level tests (a plain-pandas sketch of a few of them follows the list):
#row-wise
expect_table_row_count_to_equal
expect_table_row_count_to_equal_other_table
expect_table_row_count_to_be_between
#column-wise
expect_table_column_count_to_equal
expect_table_column_count_to_be_between
expect_table_columns_to_match_ordered_list
expect_table_columns_to_match_set
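These names follow Great Expectations’ naming convention. To make the idea concrete, here is a rough plain-pandas sketch of what a few table-level checks amount to (this is not the GX API itself; the expected columns and thresholds are illustrative):

```python
import pandas as pd

def check_table_shape(df: pd.DataFrame) -> dict:
    """Simple table-level checks, loosely mirroring the expectations above."""
    expected_columns = ["order_id", "customer_id", "amount", "created_at"]  # illustrative
    return {
        "row_count_between_1_and_1m": 1 <= len(df) <= 1_000_000,
        "column_count_equals_expected": len(df.columns) == len(expected_columns),
        "columns_match_ordered_list": list(df.columns) == expected_columns,
        "columns_match_set": set(df.columns) == set(expected_columns),
    }
```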
Column-level Tests: Generally speaking, there are two types of column-level tests — single-column tests and multi-column tests.
Single-column tests tend to focus on setting expectations on the statistical properties of individual columns. For numerical columns, they check the column max, min, average, distribution, and so on. For categorical columns, they check the most common values, the distinct values, and so on. Multi-column tests are more about checking the relationships between columns; for example, a multi-column test might expect the values in column A to be greater than those in column B. Below are some column-level tests (a plain-pandas sketch follows the list):
#Single column tests
expect_column_average_to_be_within_range_of_given_point
expect_column_max_to_be_between
expect_column_min_to_be_between
expect_column_mean_to_be_between
expect_column_median_to_be_between
expect_column_distinct_values_to_be_in_set
expect_column_most_common_value_to_be_in_set
#Multi-column tests
expect_column_pair_values_to_be_equal
expect_column_pair_values_a_to_be_greater_than_b
expect_column_pair_values_to_have_diff
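And here is a similar plain-pandas sketch of single-column and multi-column checks (again illustrative, not the GX API; the column names and bounds are made up):

```python
import pandas as pd

def check_columns(df: pd.DataFrame) -> dict:
    """Single-column and multi-column checks, loosely mirroring the names above."""
    return {
        # Single-column checks
        "amount_min_between_0_and_10": 0 <= df["amount"].min() <= 10,
        "amount_max_between_100_and_100k": 100 <= df["amount"].max() <= 100_000,
        "amount_mean_between_10_and_500": 10 <= df["amount"].mean() <= 500,
        "status_values_in_set": set(df["status"].unique()) <= {"new", "paid", "refunded"},
        # Multi-column checks
        "shipped_not_before_created": (df["shipped_at"] >= df["created_at"]).all(),
        "gross_greater_than_net": (df["gross_amount"] > df["net_amount"]).all(),
    }
```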
Of course, you can always customize your data tests based on your business domain knowledge, to make sure all the tests are relevant to your data and the business context it operates in.
There are a few open-source libraries designed to help data teams implement data quality tests, such as Great Expectations (GX), Deequ and PyDeequ.
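As a taste, here is a minimal sketch using the legacy pandas-backed API of Great Expectations; note that recent GX releases replaced this with a Data Context based workflow, so treat the exact calls as version-dependent and illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})

# Wrap the DataFrame so expectation methods become available (legacy GX API).
ge_df = ge.from_pandas(df)

row_check = ge_df.expect_table_row_count_to_be_between(min_value=1, max_value=1_000)
amount_check = ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

print(row_check.success, amount_check.success)
```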
I have already published another article that specifically talks about how to leverage Great Expectations (GX), Deequ and PyDeequ to embed reliability into your data pipelines. You can find it here:
Data Engineering Best Practice — Embedding Reliability and Integrity into Your Data Pipelines
Summary
To conclude this article, it is important to reiterate that the goal of testing your data pipelines is to build highly reliable and trustworthy pipelines, so that you can deliver high-quality data and information to downstream consumers. At the same time, you should avoid over-testing: only test the code, features and data that are relevant to your pipeline quality and to the business use cases your pipelines serve. Therefore, before you write tests, work out which tests are necessary, and talk to your business stakeholders to draw on their domain knowledge about which data tests and expectations are most relevant and necessary to them.