What Makes a Good Data Pipeline — A Pre-Production Checklist for Data Engineers
The most essential skill for a data engineer is building highly scalable and reliable data pipelines
👋 Hi folks, thanks for reading my newsletter! My name is Yunna Wei, and I write about the modern data stack, MLOps implementation patterns and data engineering best practices. The purpose of this newsletter is to help organizations select and build an efficient Data+AI stack with real-world experiences and insights, and to help individuals build their data and AI careers with deep-dive tutorials and guidance. Please consider subscribing if you haven’t already, and reach out on LinkedIn if you ever want to connect!
In one of my previous articles on this subject, namely Learning the Core of Data Engineering — Building Data Pipelines, I talked about the 8 key components of building a data pipeline, also called an Extract, Transform and Load (ETL) pipeline.
In today’s article, I will go into more detail to explain what makes a data pipeline good enough to deploy in a production environment. Hopefully this can also serve as a kind of “checklist” before you deploy business-critical data pipelines into production.
I will start by explaining what can go wrong for a data pipeline in a production environment, then explain what is required to prevent a pipeline from going wrong (as much as possible) and to ensure safe and quick recovery if it does run into errors and failures. Unless you are extremely lucky, there is a very high chance that your pipelines will go wrong at some stage.
Let’s see what you can do to have peace of mind when running your data pipelines in production.
Data Volume Fluctuates
In a real-world production environment, data volume does not always remain stable. It can go up or down for various reasons, such as business expansion, new customer acquisition, marketing promotions and so on. In fact, data volume fluctuates in most scenarios, so data pipelines need to be able to auto-scale up or down to handle the fluctuations. The first item on the “checklist” is therefore to understand whether your data volumes fluctuate, and if they do, to leverage data infrastructure that provides auto-scaling capabilities to respond to volume changes in real time. You want to avoid the following scenarios:
Infrastructure over-provisioning, which will result in unnecessary waste.
Infrastructure under-provisioning, which will likely cause the pipelines to fail, or to complete but miss their Service Level Agreement (SLA).
Hence, understanding how data volumes are likely to change, and having the elasticity in the underlying infrastructure to respond to those changes immediately, is critical.
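For example, if you run your pipelines on Apache Spark, dynamic allocation is one way to get that elasticity (managed platforms such as Databricks, EMR and Dataproc expose similar autoscaling settings at the cluster level). A minimal sketch, with illustrative executor bounds:

```python
from pyspark.sql import SparkSession

# Let Spark add and remove executors as the workload changes,
# instead of pinning a fixed cluster size.
spark = (
    SparkSession.builder
    .appName("autoscaling-etl")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # illustrative lower bound
    .config("spark.dynamicAllocation.maxExecutors", "20")   # illustrative upper bound
    # Needed on Spark 3.x clusters that do not run an external shuffle service
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```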
Data Latency Changes
Data latency refers to how fast data is ingested from the source and delivered to end users for consumption. Users’ latency requirements can change as business requirements change, so the data pipeline needs to be defined flexibly enough to accommodate the change without significant changes to the source code. If it is a batch pipeline, you can leverage workflow orchestration to adjust the schedule frequency. If it is a streaming pipeline, you can revise how frequently the pipeline is triggered.
Take Apache Spark Structured Streaming as an example: you can adjust the streaming pipeline’s trigger interval in the following ways (a short code sketch follows the list):
Default — If no trigger setting is explicitly specified, then by default, the streaming pipeline will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
Fixed interval micro-batches — The streaming pipeline will be executed in micro-batch mode, where micro-batches are kicked off at user-specified intervals.
One-time micro-batch — The streaming pipeline will execute only one micro-batch to process all the available data, and then stop on its own.
Available-now micro-batch — Similar to the one-time micro-batch trigger, the streaming pipeline will process all the available data and then stop on its own. The difference is that it will process the data in (possibly) multiple micro-batches based on the source options (e.g. maxFilesPerTrigger for a file source).
Continuous with fixed checkpoint interval — The streaming pipeline will be executed in the new low-latency, continuous processing mode.
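A minimal PySpark sketch of how these triggers are set (the source and sink paths are hypothetical; availableNow requires Spark 3.3+, and continuous mode is only supported for certain sources and sinks such as Kafka):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latency-demo").getOrCreate()

# Hypothetical Delta source and sink paths, for illustration only.
events = spark.readStream.format("delta").load("/data/bronze/events")

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events")
    # Choose ONE trigger, depending on the latency requirement:
    .trigger(processingTime="5 minutes")   # fixed-interval micro-batches
    # .trigger(availableNow=True)          # drain all available data, then stop (Spark 3.3+)
    # .trigger(once=True)                  # a single micro-batch, then stop
    # .trigger(continuous="1 second")      # continuous mode (Kafka-style sources/sinks only)
    .start("/data/silver/events")
)
```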
If you foresee that data latency requirements are likely to vary in the future, you need to select a compute engine and framework that can accommodate latency changes easily. Apache Spark Structured Streaming plus Delta Lake as the data storage format is a good combination, as it provides the capability to unify batch and streaming data pipelines.
Data Quality Check and Management
Validating data quality rules and guaranteeing reliable data across end-to-end data pipelines is essential, as poor-quality data can lead to poor decision making and/or poor Machine Learning (ML) models. When data engineers develop and deploy data pipelines, they need to:
First, rigorously conduct data quality checks. These include technical validations, such as statistical checks, missing-value checks, data type checks and data duplication checks, as well as business checks, which are generally defined by the business owners or users of the data assets.
Second, report and alert on violations of critical data quality rules. These reports will be very helpful when you debug pipeline issues caused by deteriorating data quality.
Third, understand the root causes of poor data quality and provide corresponding fixes based on what the root causes are.
There are a few open-source libraries that are specifically designed for data quality management, such as:
great_expectations — a shared, open standard for data quality. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Deequ — a library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Python users may also be interested in PyDeequ, a Python interface for Deequ.
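As a minimal illustration, here is what a few technical checks might look like with the classic (pre-1.0) great_expectations Pandas API; the dataset and expectations are hypothetical, and API details vary between versions:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical orders dataset, for illustration only.
orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3, None],
    "amount": [10.0, 25.5, -3.0, 12.0],
}))

# Technical checks: missing values, value ranges, duplicates.
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("amount", min_value=0)
orders.expect_column_values_to_be_unique("order_id")

# Evaluate every expectation and fail this pipeline step if any are violated.
results = orders.validate()
if not results["success"]:
    raise ValueError(f"Data quality checks failed: {results}")
```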
Some commercial ETL frameworks, like Databricks Delta Live Tables, also provide data quality management capabilities. Delta Live Tables prevents bad data from flowing into tables through validation and integrity checks, and avoids data quality errors with predefined error policies (fail, drop, alert or quarantine the data). In addition, you can monitor data quality trends over time to gain insight into how your data is evolving and where changes may be necessary.
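For reference, a rough sketch of what such expectations look like in Delta Live Tables (this only runs inside a Databricks DLT pipeline, and the table, column names and constraints are hypothetical):

```python
import dlt  # available only in a Databricks Delta Live Tables pipeline

@dlt.table(comment="Orders with basic quality gates applied")
@dlt.expect("non_negative_amount", "amount >= 0")               # record the violation, keep the row
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # drop violating rows
@dlt.expect_or_fail("known_currency", "currency IN ('USD', 'EUR')")  # fail the update
def clean_orders():
    # 'raw_orders' is a hypothetical upstream dataset in the same pipeline.
    return dlt.read("raw_orders")
```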
It is always best practice to include data quality checks and validation steps in your data pipelines in order to deliver highly reliable data. I am going to write another article specifically about embedding data quality checks into your data pipelines. Please feel free to follow me if you’d like to be notified when it is published.
Data Schema Evolution
As business context and requirements change quickly, the data schema, the blueprint that defines the shape of the data (such as its columns, data types and metadata), changes relatively quickly as well. Good data pipelines should be able to handle schema changes flexibly while maintaining data quality.
Schema enforcement for data quality — in order to maintain and ensure data quality, schema enforcement is necessary: when data engineers write and persist data that does not match the destination table’s predefined schema, the writes are rejected and the pipeline fails.
Schema evolution for pipeline flexibility — as explained above, it is not realistic to expect the data schema to remain the same forever. Your data pipelines should be flexible enough to allow users to easily change a table’s current schema to accommodate data that changes over time. Schema evolution is typically used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
Depending on the nature of your data pipelines and the business use cases they empower, data engineers need to take schema changes into consideration when authoring data pipelines. Use schema enforcement to uphold the highest standards for your data and ensure the highest level of data integrity. Schema evolution complements enforcement by making it easy for intended schema changes to take place automatically.
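A minimal sketch of both behaviours using Delta Lake (the table and source paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Hypothetical incoming batch that may contain new columns.
new_orders = spark.read.json("/landing/orders/2023-06-01")

# Schema enforcement (the default): if new_orders has columns or types that do
# not match the target table, this append is rejected and the task fails.
new_orders.write.format("delta").mode("append").save("/tables/orders")

# Schema evolution: explicitly opt in to adding any new columns to the table.
(
    new_orders.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tables/orders")
)
```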
Pipeline Monitoring and Observability
Understanding and logging how your pipeline works (audit logs, job metrics, data quality checks, pipeline progress, and data lineage) is essential for finding the root causes of errors and recovering the pipeline quickly. Having a good level of visibility and observability gives you confidence to deploy your data pipelines into a production environment.
Data engineers can leverage workflow orchestration tools for pipeline logging. Most workflow orchestrators support the standard Python logging levels (critical, error, warning, info and debug) and generally allow you to customize log messages through configuration, to give you more visibility into the tasks and jobs of your workflows.
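A minimal, orchestrator-agnostic sketch of task logging with the standard Python logging module (the pipeline and function names are hypothetical; most orchestrators will surface these records in their UI):

```python
import logging

# Generic configuration; most orchestrators will capture these records
# and display them alongside the task run.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
)
logger = logging.getLogger("orders_pipeline")  # hypothetical pipeline name


def load_orders(path: str) -> int:
    logger.info("Starting load from %s", path)
    try:
        rows = 0  # ... read, transform and write the data here ...
        logger.info("Loaded %d rows from %s", rows, path)
        return rows
    except Exception:
        logger.exception("Load failed for %s", path)  # includes the full traceback
        raise
```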
Additionally, data engineers can integrate workflow orchestrators with monitoring solutions such as Prometheus and Grafana for real-time metrics, critical alerts and notifications.
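As an illustration, here is a minimal sketch of exposing pipeline metrics with the prometheus_client library; the metric names and scrape port are assumptions, and Grafana would then chart and alert on these series:

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names for a single pipeline.
ROWS_PROCESSED = Counter(
    "pipeline_rows_processed_total", "Rows processed by the pipeline"
)
LAST_SUCCESS = Gauge(
    "pipeline_last_success_timestamp", "Unix time of the last successful run"
)

# Prometheus scrapes http://<host>:8000/metrics (port is illustrative).
start_http_server(8000)


def run_pipeline() -> None:
    rows = 1_000  # ... actual pipeline work would go here ...
    ROWS_PROCESSED.inc(rows)
    LAST_SUCCESS.set_to_current_time()
```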
If your data pipelines are business-critical and have a strict SLA, it is definitely worthwhile to implement a robust logging and monitoring mechanism.
Error Handling and Pipeline Recovery
Pipeline recovery is more than just fixing the errors and re-running the workflows. More importantly, while fixing errors, you must make sure the underlying data has not been corrupted by them. For example, if a task that writes data to the destination location fails halfway through, recovery can be quite tricky, depending on the storage location and data format. If you are writing to a data format that provides Atomicity, Consistency, Isolation and Durability (ACID) transaction guarantees, snapshot isolation and concurrency support, failing at the write stage should not be too much trouble, and you can safely re-run the task without worrying about data quality and reliability.
Delta Lake is such a format: it provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS and HDFS. However, if you are writing to a data format that does not have such reliability guarantees, you will have to clean up manually before you can re-run the failed job.
Once you have the confidence that your data quality is not compromised due to pipeline failures, you can leverage the following methods for error handling and pipeline recovery.
First, get notified. Set meaningful notifications for your pipelines so that you are alerted when there are serious errors and failures.
Second, have the pipelines automatically retry on failure. Data engineers can define the number of retries, as well as how long to wait between retries (a minimal sketch of this idea follows this list).
Third, manually fix the failures when automatic retries do not work. To be able to fix errors quickly, having sufficient and detailed logs is essential.
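Most orchestrators (Airflow, Prefect, Dagster and others) let you declare retries and retry delays on a task. As a library-agnostic sketch of the underlying idea, with hypothetical defaults:

```python
import logging
import time

logger = logging.getLogger("orders_pipeline")


def run_with_retries(task, max_retries: int = 3, base_delay_seconds: int = 60):
    """Run a task, retrying on failure with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception:
            logger.exception("Attempt %d/%d failed", attempt, max_retries)
            if attempt == max_retries:
                raise  # give up and surface the error for manual investigation
            time.sleep(base_delay_seconds * 2 ** (attempt - 1))
```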
Summary
I hope this blog provides you with some useful tips that help make your data pipelines less error-prone and run in a highly scalable and reliable manner. Below is a quick summary, as a checklist:
Check if your pipeline is able to auto-scale to handle data volume fluctuations;
Check if you can adjust the schedule / trigger frequency to deal with changing data latency requirements;
Understand data quality expectations and make sure you validate these expectations in your pipeline and report critical data quality violations;
Understand how the data schema might evolve and build mechanisms into your pipelines for schema enforcement and schema evolution;
Have at least some monitoring and observability in place; the stricter the SLA of your data pipelines, the more rigorous the monitoring and observability should be;
When there is a pipeline failure, first check whether the underlying data has been corrupted or there is any data quality issue. If so, it is critical to clean up the corrupted data before you re-run the pipeline;
Finally, understand the root causes of pipeline failures and build in some automatic pipeline recovery, such as retries.