Learn the Core of Data Engineering — Building Data Pipelines
Master the Core Skills of Data Engineering to Become a Data Engineer
👋 Hi folks, thanks for reading my newsletter! My name is Yunna Wei, and I write about the modern data stack, MLOps implementation patterns and data engineering best practices. The purpose of this newsletter is to help organizations select and build an efficient Data+AI stack with real-world experiences and insights, and to help individuals build their data and AI careers with deep-dive tutorials and guidance. Please consider subscribing if you haven’t already, and reach out on LinkedIn if you ever want to connect!
The Importance of Data Engineering
Data engineers play an extremely critical role in developing and managing enterprise data platforms. There is increasingly high demand for data engineers who build ETL (Extract, Transform and Load) pipelines to turn raw data into usable information. Without these data pipelines, it is very difficult for data scientists to build accurate Machine Learning (ML) models, for data analysts to develop insightful Business Intelligence (BI) dashboards, and for business decision makers to make data-driven, better-informed decisions and identify innovative business opportunities. Hence, data engineering provides the foundation for data science and analytics, and constitutes an important aspect of the overall enterprise data landscape and business operations.
Therefore, I think it will be very useful to share a series of practical guides on how to develop and deploy data pipelines, which are the core responsibility of data engineers. Data pipelines come in different scales and levels of complexity, depending on data latency and data volume, and can be developed in different languages and frameworks (Python, SQL, Scala, Spark and so on). In total, I will provide 4 modules of data engineering practical guides:
Module 1 : Building Data Pipelines with Python
Module 2 : Building Data Pipelines with SQL
Module 3 : Building Data Pipelines with Spark (mainly PySpark and Spark SQL)
Module 4 : Building Real-Time Data Pipelines (Structured streaming and Apache Flink)
For each module, I will explain the key data engineering concepts and design considerations, as well as provide sample code for the implementations.
Key Components of a Data Engineering Pipeline
Regardless of how complex your data pipelines are or which languages and frameworks you use, any data pipeline will include the following key components.
The first component is data extraction. Every data pipeline starts with ingesting and accessing data from data sources. These data sources could be data lakes, data warehouses, application databases, APIs or real-time event stores, such as Kafka. The data can also be extracted and loaded at different latencies, such as batch, micro-batch and streaming, depending on how quickly the data needs to be accessed. While planning a data pipeline, data engineers need to understand the ingestion pattern required for each source.
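As a rough illustration, a batch extraction step in Python could look like the sketch below; the API endpoint, connection string and table name are purely hypothetical placeholders:

```python
# Minimal batch-extraction sketch: pull data from a REST API and an application database.
# All endpoints, credentials and table names below are hypothetical placeholders.
import pandas as pd
import requests
from sqlalchemy import create_engine


def extract_from_api(url: str) -> pd.DataFrame:
    """Fetch a batch of JSON records from a REST API and return them as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def extract_from_database(connection_uri: str, table: str) -> pd.DataFrame:
    """Read a full snapshot of a table from an application database."""
    engine = create_engine(connection_uri)
    return pd.read_sql_table(table, engine)


if __name__ == "__main__":
    orders = extract_from_api("https://api.example.com/orders")
    customers = extract_from_database("postgresql://user:password@host:5432/appdb", "customers")
```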
After the data is extracted from data sources, it is processed and transformed based on the consumption requirements of downstream applications, such as Machine Learning (ML) models or Business Intelligence (BI) dashboards. Depending on the data latency and data volume, data can be transformed using different computation frameworks. For example, if the data volume is much bigger than any single compute node can hold, data engineers will turn to a large-scale, parallel computing engine such as Apache Spark. I will provide a deep-dive session on Spark very soon.
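To make this concrete, here is a small pandas transformation sketch that turns hypothetical raw orders into a daily revenue table; for data volumes beyond a single machine, the same logic would be expressed in PySpark instead:

```python
# Minimal transformation sketch in pandas (column names are hypothetical).
import pandas as pd


def transform_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Clean raw orders and aggregate them into a daily revenue table."""
    # Drop records that are missing the key fields we need downstream.
    orders = orders.dropna(subset=["order_id", "amount"])
    # Derive a calendar date from the raw order timestamp.
    orders["order_date"] = pd.to_datetime(orders["order_ts"]).dt.date
    # Aggregate to one row per day.
    daily = (
        orders.groupby("order_date", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "daily_revenue"})
    )
    return daily
```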
Once the data is processed, transformed and ready to be used by downstream data consumers, it needs to be loaded into specific data sink locations where end users and end applications will be granted access. There are different modes to load the data, be it append, insert, update or overwrite, depending on the data storage configuration and data format, as well as how the data is ingested from the source system. For example, we will cover log-based Change Data Capture (CDC) for SQL Server, a pattern for incrementally loading data that captures only the changes recorded in the database logs.
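As a simple sketch of the loading step, the snippet below writes the same DataFrame either into a warehouse table (append vs. overwrite) or into a data lake as partitioned Parquet; the table names and paths are hypothetical, and CDC-based incremental loading will be covered separately:

```python
# Minimal loading sketch showing append vs. overwrite modes (names and paths are hypothetical).
import pandas as pd
from sqlalchemy import create_engine


def load_to_warehouse(df: pd.DataFrame, connection_uri: str, table: str, mode: str = "append") -> None:
    """Write a DataFrame into a warehouse table, either appending rows or replacing the table."""
    engine = create_engine(connection_uri)
    if_exists = "replace" if mode == "overwrite" else "append"
    df.to_sql(table, engine, if_exists=if_exists, index=False)


def load_to_data_lake(df: pd.DataFrame, path: str) -> None:
    """Write a DataFrame to the data lake as Parquet files, partitioned by date."""
    df.to_parquet(path, partition_cols=["order_date"], index=False)
```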
As data goes through the different stages of being extracted, transformed and loaded, it is critical to perform data validations to make sure high-quality data is delivered to the end data consumers. Therefore, data engineers not only need to extract, transform and load the data, but also need to embed data profiling, constraint checks and quality verification in the data pipelines. There are quite a few ways to conduct data validations. For example, when data engineers / data architects design data models in the data warehouse, they can define schemas and add constraints on the critical columns to make sure the schema matches and the constraints pass before writing data into the destination tables. There are also Python open-source libraries for data quality checks, such as Great Expectations, Pandera and Deequ/PyDeequ, which provide open standards and shared constraints for data verification. Some data platform providers, such as Databricks, offer an ETL framework called Delta Live Tables (DLT) that lets users define quality checks as part of the pipeline definition.
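As one example of embedding such checks in code, here is a minimal sketch using Pandera (one of the libraries mentioned above); the column names and rules are illustrative only:

```python
# Minimal data-quality check sketch with Pandera; columns and rules are illustrative.
import pandas as pd
import pandera as pa

# A simple contract for the raw orders data.
orders_schema = pa.DataFrameSchema(
    {
        "order_id": pa.Column(int, nullable=False, unique=True),
        "order_ts": pa.Column(str, nullable=False),
        "amount": pa.Column(float, checks=pa.Check.ge(0), nullable=False),
    }
)


def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Raise a SchemaError (and stop the pipeline) if the data violates the contract."""
    return orders_schema.validate(df)
```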
Data pipelines are normally not linear; that is, a pipeline rarely has just one data source and one data sink. Many pipelines have multiple data sources and multiple jobs and stages. It is therefore very important to manage these dependencies and to orchestrate, schedule and run these data flows accordingly. Workflow orchestration tools not only schedule data pipelines and manage the dependencies within them, they also provide functions such as retries, logging, caching, notifications and observability. Therefore, it is necessary for data engineers to become familiar with at least one workflow orchestration tool. There are several open-source frameworks, such as Apache Airflow, Prefect and Argo. In upcoming blogs, I will explain workflow orchestration in more detail, and demo how to coordinate and run an end-to-end data pipeline with a workflow orchestration tool.
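As a taste of what this looks like in code, below is a minimal flow sketch using Prefect (one of the tools mentioned above); the task bodies are placeholders standing in for the real extract / transform / load logic:

```python
# Minimal orchestration sketch with Prefect; task bodies are placeholders.
from prefect import flow, task


@task(retries=3, retry_delay_seconds=60)
def extract() -> list[dict]:
    # In a real pipeline this would call the source API or database.
    return [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]


@task
def transform(raw: list[dict]) -> list[dict]:
    # Keep only valid records; real logic would live in a shared module.
    return [record for record in raw if record["amount"] >= 0]


@task
def load(clean: list[dict]) -> None:
    print(f"Loaded {len(clean)} records")


@flow(log_prints=True)
def daily_orders_pipeline() -> None:
    raw = extract()
    clean = transform(raw)
    load(clean)


if __name__ == "__main__":
    daily_orders_pipeline()
```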
Once you have finished all the development work, it is time to test and deploy your data pipelines in a production environment. There are quite a few considerations before “productionizing” a data pipeline. On the technical side, considerations include code versioning, Continuous Integration / Continuous Delivery (CI/CD), error handling, data version control, monitoring, alerts and notifications, and Infrastructure-as-Code (IaC). On the business side, considerations include Service-Level Agreements (SLAs), cost optimization and budget constraints.
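On the CI/CD point, a typical first step is to put unit tests around the transformation logic so the pipeline can be verified automatically before deployment. The sketch below shows what such a test could look like with pytest, assuming the earlier hypothetical transform_orders function lives in a pipeline.transform module:

```python
# Minimal pytest sketch for the (hypothetical) transform step; a CI pipeline would run this on every commit.
import pandas as pd

from pipeline.transform import transform_orders  # hypothetical module path


def test_transform_orders_aggregates_daily_revenue():
    raw = pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "order_ts": ["2023-01-01 10:00", "2023-01-01 12:00", "2023-01-02 09:00"],
            "amount": [10.0, 5.0, 7.5],
        }
    )
    result = transform_orders(raw)
    # Two order days should collapse into two aggregated rows.
    assert list(result["daily_revenue"]) == [15.0, 7.5]
```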
Once the pipelines are ready to be deployed, data engineers also need to consider whether there are any data governance, security or privacy concerns that need to be addressed in the data pipelines. In most organizations, data governance and security are owned by a separate team; where data engineers can help, however, is in automating some of the governance effort within the data pipelines. For example, if personally identifiable information must be identified and then deleted or anonymized before the data is served to end users, data engineers can embed and implement these requirements automatically in the pipelines.
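As a simple example of automating such a requirement, the sketch below replaces hypothetical PII columns with salted hashes before the data is served; downstream users can still join on the hashed values without ever seeing the raw identifiers:

```python
# Minimal PII-masking sketch; column names and the salt handling are illustrative only.
import hashlib

import pandas as pd

PII_COLUMNS = ["email", "phone_number"]  # hypothetical PII columns


def anonymize(df: pd.DataFrame, salt: str) -> pd.DataFrame:
    """Replace PII columns with salted SHA-256 hashes before serving the data."""
    df = df.copy()
    for column in PII_COLUMNS:
        if column in df.columns:
            df[column] = df[column].astype(str).map(
                lambda value: hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            )
    return df
```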
All of the above will not happen without a solid data infrastructure. The two most fundamental pieces of data infrastructure are compute and storage. Many organizations have migrated their IT infrastructure to a cloud platform (AWS, Azure or GCP), which provides more options for their data infrastructure. On the compute side, there are roughly four types of workloads, each with its own most efficient compute engines: ETL / big-data processing workloads, data warehousing workloads, Machine Learning (ML) workloads and streaming workloads. For example, the most popular framework for processing big data in parallel is Spark. On the storage side, data is normally stored in cloud storage locations such as AWS S3, Azure Blob Storage and Google Cloud Storage (GCS), and the most common file formats are Parquet, Avro, ORC, JSON and CSV. Recently, new storage formats such as Delta, Apache Hudi and Iceberg have been adopted rapidly, as they add an additional data management layer that overcomes some of the critical disadvantages of data lakes, such as the lack of ACID transactions, the lack of schema enforcement and the potential for data corruption. Databricks, as the original creator of open-source Delta, has proposed a new lakehouse architecture, which builds data warehouse capabilities on top of data lakes so that users can reap the benefits of both data warehouses and data lakes. In fact, Delta is the essence of the lakehouse architecture. Therefore, data engineers need to understand how each storage format works and when to use which, depending on the use case.
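To make the storage-format point concrete, the PySpark sketch below writes the same DataFrame once as plain Parquet and once as Delta; the paths are hypothetical, and the Delta write assumes the delta-spark package has been installed and configured on the cluster:

```python
# Minimal sketch comparing Parquet and Delta writes in PySpark.
# Paths are hypothetical; the Delta write assumes delta-spark is installed and configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-formats").getOrCreate()

df = spark.read.json("/data/raw/orders/")  # hypothetical raw data location

# Plain Parquet: columnar and compressed, but no ACID transactions or schema enforcement.
df.write.mode("overwrite").parquet("/data/curated/orders_parquet/")

# Delta: adds a transaction log on top of Parquet files, enabling ACID writes,
# schema enforcement and time travel on the data lake.
df.write.format("delta").mode("overwrite").save("/data/curated/orders_delta/")
```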
This has been a very high-level introduction to the key components that data engineers will encounter when building data pipelines. In the upcoming blogs, I will explain the key concepts and design considerations of each of the aforementioned components and, more importantly, demo how to implement them with different frameworks and languages.
Learning the Core of Data Engineering — Recap and Announcement
To recap, there will be 4 modules of data engineering practical guides. You can click each of the modules below to obtain a preview of the module curriculum.
Module 1 : Building Data Pipelines with Python
Module 2 : Building Data Pipelines with SQL
Module 3 : Building Data Pipelines with Spark (mainly PySpark and Spark SQL)
Module 4 : Building Real-Time Data Pipelines (Structured Streaming and Apache Flink)
From the next blog onward, I will kick off the first module, Building Data Pipelines with Python. For each module, I will explain the key concepts and design considerations behind each of the components (Data Extraction, Data Transformation, Data Loading, Data Quality Validation and Checks, Workflow Orchestration, Data Pipeline Deployment, Data Security and Governance, Data Infrastructure) and, more importantly, demo how to implement them. Stay tuned.
Thanks for reading!