What Does It Take to Make Your Data Ingestion Efficient and Reliable
Every data pipeline starts with data ingestion. Getting data ingestion in good order lays a solid foundation for scalable and reliable data pipelines.
👋 Hi folks, thanks for reading my newsletter! My name is Yunna Wei, and I write about the modern data stack, MLOps implementation patterns and data engineering best practices. The purpose of this newsletter is to help organizations select and build an efficient Data+AI stack with real-world experiences and insights, and to assist individuals in building their data and AI careers with deep-dive tutorials and guidance. Please consider subscribing if you haven’t already, and reach out on LinkedIn if you ever want to connect!
In my previous article, I talked about learning the core of data engineering: building data pipelines. You can find that article here:
Learn the Core of Data Engineering — Building Data Pipelines
As we all know, every data pipeline starts with data ingestion. How you configure data ingestion will, to a large extent, determine the overall throughput and latency of your end-to-end data pipelines. Getting data ingestion in good order lays a solid foundation for efficient, scalable and reliable data pipelines.
Therefore, I would like to share some best practices specifically around data ingestion. Hopefully this can also serve as a kind of “checklist” before you deploy business-critical data ingestion solutions into a production environment.
Now let’s get started!
1. Understand the data sources
First, you need to fully understand where you are ingesting data from. In general, data sources can be divided into the following three categories:
File-based data sources (data lake): For file-based data sources, the two most critical considerations are where and how the data is stored. “Where” normally refers to a data lake location, either an on-premises distributed file system such as HDFS, or a cloud storage location such as AWS S3, Azure Blob Storage or Google GCS.
“How” refers to the format and layout in which the data is stored in the data lake.
The popular data formats for big data workloads include Delta Lake, Apache Parquet, Apache ORC, Apache Avro, JSON, CSV and binary. Each data format carries its own features in terms of storage format (row-based or column-based), data structure (structured, semi-structured or unstructured), encoding, schema (self-describing or not) and compression algorithms. All of these will impact how you design your data ingestion options.
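For example, a self-describing columnar format like Parquet embeds its schema and per-column statistics in the files themselves. A quick way to see this is the small sketch below, assuming pyarrow is installed; the file name is hypothetical:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")                  # hypothetical file
print(pf.schema_arrow)                                  # schema lives in the file footer (self-describing)
print(pf.metadata.num_row_groups)                       # data is organised into row groups...
print(pf.metadata.row_group(0).column(0).statistics)    # ...each with per-column min/max statistics
```

These embedded statistics are also what enable the filter push-down discussed later in this article.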
Data layout refers to how these data files are partitioned and indexed. Using the right partitioning and indexing keys can significantly improve data ingestion efficiency by avoiding scans of unnecessary files and ingesting only the required data.

Database-based data sources (OLTP and OLAP): Database-based data sources can be roughly divided into two categories. One is Online Transaction Processing (OLTP), whose primary purpose is to support business transactions, and the other is Online Analytical Processing (OLAP), which is built mainly to support analytical workloads. OLAP databases are also called data warehouses.
When ingesting data from OLTP databases, the primary consideration is to avoid placing substantial overhead on the database; in particular, you should avoid read-heavy queries against it. Therefore, the preferred approach is to ingest data from a read replica instead of the primary database instance.
When ingesting data from OLAP databases, the primary consideration is whether you can push the compute down to the data warehouse, processing and filtering the data at the source to reduce network I/O before ingesting it. This is very useful when you have a large volume of data to ingest.
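As a small, hedged illustration of push-down for a database-based source, the PySpark sketch below uses the JDBC reader’s `query` option so the filtering and aggregation run inside the source engine and only the much smaller result set crosses the network. The connection URL, table and credentials are hypothetical, the appropriate JDBC driver is assumed to be on the classpath, and for an OLTP source the URL would point at a read replica rather than the primary instance:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-examples").getOrCreate()

# Push the heavy lifting down to the source engine: only the aggregated
# result set is transferred over the network, not the raw fact table.
pushdown_query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    WHERE sale_date >= DATE '2023-01-01'
    GROUP BY customer_id
"""

sales_summary = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://replica-host:5432/analytics")  # read replica, not the primary
    .option("query", pushdown_query)     # executed inside the source database
    .option("user", "ingest_user")
    .option("password", "***")
    .option("fetchsize", "10000")        # rows fetched per round trip
    .load())
```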
For database-based data sources, a more efficient ingestion pattern than bulk ingestion is Change Data Capture (CDC), the process of identifying and capturing changes made to data in a database and delivering only those changes to a downstream process or system. If possible, you should always consider leveraging CDC to ingest data from source databases.

SaaS-based data sources: SaaS-based data sources mainly refer to enterprise Software as a Service (SaaS) applications, such as Workday, Salesforce, SAP and so on. Most SaaS applications provide data models and native connectors that you can leverage to access their data. You can also use commercial data integration providers, such as Fivetran, whose managed connectors to popular SaaS applications can simplify your data ingestion efforts from these source applications.
2. Aim for exactly-once data ingestion
Efficiency in data ingestion ultimately comes from avoiding ingesting and processing the same data more than once; this is also called incremental data ingestion. When you design your data ingestion pipelines and decide which framework to leverage, you should definitely take into account how you will achieve “exactly-once” data ingestion, without any potential data loss of course. Depending on what the data sources are and which ingestion frameworks you use, there are multiple ways to achieve this.

If your data sources are file-based, you can leverage Spark Structured Streaming. Structured Streaming provides end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs: it tracks the read position of each stream by recording offsets, and the Spark SQL engine uses the checkpoint and write-ahead log to record the offset range of the data processed in each trigger. The streaming sinks are designed to be idempotent for handling re-processing. Together, replayable sources and idempotent sinks allow Structured Streaming to ensure end-to-end exactly-once semantics under any failure.

If your data sources are database-based, you can use primary keys to track the read progress of each batch and avoid unnecessarily duplicated ingestion. You can even build a tracking system yourself to record what data has been ingested for each batch trigger. The key is not to duplicate effort on data that has already been ingested and processed.
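To make the file-based case concrete, here is a minimal sketch of incremental ingestion with Spark Structured Streaming (Spark 3.3+ for `availableNow`). The bucket paths and schema are hypothetical, and the Delta sink assumes Delta Lake is available in your environment:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# `spark` as in the earlier sketch.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("updated_at", TimestampType()),
])

# The file source remembers which files it has already read; those offsets are
# persisted in the checkpoint location, so a restart resumes where it left off
# instead of re-ingesting (or losing) data.
orders = (spark.readStream
    .format("json")
    .schema(order_schema)
    .load("s3://my-bucket/landing/orders/"))

(orders.writeStream
    .format("delta")                                               # idempotent sink
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/orders/")
    .outputMode("append")
    .trigger(availableNow=True)                                    # process all available data, then stop
    .start("s3://my-bucket/bronze/orders/"))
```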
3. Leverage predicate / filter push-down as much as possible
The ultimate goal of predicate / filter push-down is to reduce data scanning by processing and filtering data at the source before it is ingested.

For file-based data sources, you can leverage disk partitioning and file metadata for predicate push-down. For example, Parquet natively carries various metadata: file metadata, column (chunk) metadata and page header metadata. Parquet filter push-down relies on the minimum and maximum value statistics in the row-group metadata to filter and prune data at the row-group level. This is one of the main reasons Parquet is such a popular file format for big data workloads.

For database-based data sources, particularly data warehouses, you can push complicated queries down to the warehouse engine and only ingest the returned result (which generally has a much smaller data size) when possible.
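To illustrate the file-based case, a minimal sketch (hypothetical paths and columns) might look like this: the first filter prunes whole partitions via the partition key, while the second is pushed down into the Parquet scan, where row groups are skipped using the min/max statistics in their metadata:

```python
# `spark` as in the earlier sketches; the events dataset is hypothetical and is
# laid out on disk as .../events/event_date=YYYY-MM-DD/... (partitioned by event_date).
events = (spark.read
    .parquet("s3://my-bucket/events/")
    .where("event_date = '2023-05-01'")   # partition pruning: other partitions are never listed or read
    .where("country = 'AU'"))             # predicate push-down: row groups whose min/max stats exclude 'AU' are skipped

events.explain()  # the physical plan shows PartitionFilters and PushedFilters on the Parquet scan
```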
4. Manage schema properly for flexibility and data quality
A data schema defines the shape of the data, such as column names, data types and metadata. It is best to know the schema details before you start building the data ingestion pipeline, as it is preferable to pre-define the schema yourself rather than asking the ingestion engine to infer it for you (schema inference). Schema inference requires the engine to first go over the data to derive the input schema automatically, which costs one extra pass over the data and can therefore increase ingestion latency.

Besides schema inference, there is also schema enforcement, meaning the specified or inferred schema is strictly applied to the source data, and schema evolution (schema merge), meaning schema changes are detected automatically and new columns are added to the existing schema in a mutually compatible manner. Depending on how well you know the schema, and how frequently it changes, your data ingestion pipelines should provide options to manage these schema-related behaviours.
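The sketch below shows these behaviours side by side in PySpark; paths are hypothetical, and the schema-evolution example assumes a Delta Lake sink:

```python
# `spark` and `order_schema` as defined in the earlier sketches.

# Pre-defined schema: the engine skips the extra pass needed to infer it.
orders = spark.read.schema(order_schema).json("s3://my-bucket/landing/orders/")

# Schema inference: convenient, but the engine first scans the files just to derive the schema.
orders_inferred = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/landing/orders_csv/"))

# Schema evolution on a Delta Lake sink: compatible new columns (say, a new
# `discount` field appearing in the source) are merged into the table schema.
(orders.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("s3://my-bucket/bronze/orders/"))
```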
5. Detect corrupted files and records before ingesting
Data ingestion is the first step of a data pipeline. Therefore, detecting potentially corrupted files and records and rejecting them at this very early stage is critical to delivering high-quality data. There are a few mechanisms that can be leveraged to detect corrupted files and records:
Schema enforcement — Apply schema enforcement to validate that the ingested data matches the pre-defined schema, and fail the whole pipeline when any discrepancies are identified so that an immediate fix can be made.
Embed self-defined data quality checks — Define data quality rules and check that they are met before the data flows into the next stage. Additionally, report the data quality check results to relevant stakeholders so that corrective actions can be taken. If you are keen to understand more about embedding reliability and integrity into your data pipelines, you can refer to this earlier article I wrote. A small code sketch of both mechanisms follows this list.
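As a hedged sketch of both mechanisms, reusing the hypothetical `spark` session and `order_schema` from the earlier examples:

```python
from pyspark.sql import functions as F

# Schema enforcement at read time: FAILFAST aborts the job as soon as a record
# does not match the pre-defined schema (PERMISSIVE plus a corrupt-record column
# is an alternative if you prefer to quarantine bad rows instead of failing).
orders = (spark.read
    .schema(order_schema)
    .option("mode", "FAILFAST")
    .json("s3://my-bucket/landing/orders/"))

# Self-defined data quality rule: no missing order IDs and no negative amounts.
bad_rows = orders.filter(F.col("order_id").isNull() | (F.col("amount") < 0)).count()
if bad_rows > 0:
    raise ValueError(f"Data quality check failed: {bad_rows} invalid records detected")
```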
Summary
I hope this blog provides you with some useful tips to make your data ingestion less error-prone and able to run in a highly efficient and reliable manner. Below is a quick summary, as a checklist:
Understand where the data comes from — Each type of source (file-based, database-based, SaaS-based) implies different data ingestion options;
Aim for exactly-once data ingestion — Design the pipelines to avoid duplicated ingestion effort;
Leverage predicate / filter push-down as much as possible — Reduce data ingestion load by pushing computation down to the data sources when possible;
Manage schema properly for flexibility and data quality — Understand how the data schema may evolve and build mechanisms into your pipelines for schema enforcement and schema evolution;
Detect corrupted files and records before ingesting — Embed data quality checks and reject files and records that violate these checks at the very early stage of the data pipelines.
Please feel free to subscribe to my newsletter if you want to be notified when new articles are published. I generally publish one or two articles on data and AI every week.
Thanks so much for your support!