Why “Fundamentals of Data Engineering” Belongs on Your Desk - Not Your Wishlist
Honest Book Review: "Fundamentals of Data Engineering" by Joe Reis and Matt Housley
Recently, I gave a talk at the Women Coding Community Book Club about the book “Fundamentals of Data Engineering” by Joe Reis and Matt Housley. After the session, I decided to write a follow-up review article, combining a detailed review of the book with key educational insights beneficial for data professionals at all levels.
Disclaimer: Everything in this article reflects my personal opinion and experience.
Overview
Unlike many tech-focused books I have read, “Fundamentals of Data Engineering” focuses on core principles and the data engineering lifecycle. That is why it remains a timeless resource, valuable for both beginners and experts. It offers clarity on essential concepts and a strategic view of the data engineering landscape.
Why Read This Book?
So, what makes this book so valuable in a world full of tech tutorials?
First, it’s organized around the Data Engineering Lifecycle. It does not attempt to teach you a single tool or programming language. It takes a step back and follows the logical flow of data from the moment it’s created to the moment it provides business value. Once you understand this flow, you can rationally evaluate any technology, understand where it fits in, and learn it if necessary.
Second, it’s written for a broad audience. It doesn’t matter whether you are a software engineer, data analyst, or data scientist; if you are interested in data engineering, this book is for you. It provides a common language and framework, bridging the gap between different roles.
Third, while it is not an easy read for absolute beginners, it can serve as a fantastic syllabus. This book will keep you from falling into the common trap of learning a few random tools without understanding how they connect.
Finally, the authors draw a clear distinction between the broader Data Lifecycle, which covers data for its entire existence, and the Data Engineering Lifecycle, which is the subset of stages that data engineers directly control. This focus is what makes the book so practical and actionable for anyone interested in data engineering.
Key Learnings and Educational Insights
As I mentioned already, the book is written around the Data Engineering Lifecycle.
The authors present five crucial lifecycle stages:
Generation (or Source Systems): As data engineers, we consume rather than own data sources. This means that we have to work with what the owners give us. Therefore, a deep working understanding of sources is crucial. We need to consider how data is generated, its frequency, its velocity, and its variety.
Storage: The foundation that underpins every other stage. The choices we make between different storage systems influence accessibility, performance, and, of course, costs. Understanding concepts such as write- and read-oriented storage, hot, warm, and cold data, and different access patterns is vital. It’s also beneficial to understand the raw ingredients of the storage system you’re going to use, including its architecture and design.
Ingestion: It is the process of moving data from the source system into the analytical storage layer. The authors note that ingestion and storage are often the most significant bottlenecks in the entire lifecycle. Why? Because source systems are outside our direct control. If anything changes in the source, we need to be ready for that change. If not, and our ingestion process fails, it will have a ripple effect on everything downstream.
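Because source systems can change without warning, one common defensive pattern is to validate each incoming batch against the schema we expect before loading it, so an upstream change fails fast instead of silently corrupting everything downstream. The sketch below is illustrative, not from the book, and the field names are hypothetical:

```python
# A minimal sketch of schema validation at ingestion time.
# EXPECTED_FIELDS and the record shape are invented for illustration.

EXPECTED_FIELDS = {"user_id", "event_type", "timestamp"}

def validate_batch(records):
    """Raise early if the source schema has drifted from what we expect."""
    for i, record in enumerate(records):
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            # Failing loudly here protects every downstream stage.
            raise ValueError(
                f"record {i}: source schema changed, missing {sorted(missing)}"
            )
    return records
```

In a real pipeline this kind of check would usually live in a dedicated validation layer or a data contract tool, but the principle is the same: detect source changes at the boundary, not three stages later.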
Transformation: This is where we convert data from its original, raw form into a usable format for downstream consumers. Transformation is where a huge amount of data engineers’ work happens, and it’s primarily driven by business logic. The transformation process can be divided into several stages:
Basic transformations might include casting data into the correct type, removing unnecessary records, or standardizing formats.
Then, we perform more complex work. This is where data modeling becomes essential. We might normalize the data, or we might build a denormalized star schema for analytics.
Finally, we might perform large-scale aggregations to pre-compute business metrics for a report, or featurize data to create predictive signals for an ML model.
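The stages above can be sketched in a few lines of plain Python. This is a toy example with invented order records and column names; the middle modeling stage is omitted for brevity:

```python
# Hypothetical raw records as they might arrive from a source system:
# strings everywhere, inconsistent casing, and a broken row.
RAW_ORDERS = [
    {"order_id": "1", "amount": "19.99", "country": "us"},
    {"order_id": "2", "amount": "5.00",  "country": "US"},
    {"order_id": "3", "amount": None,    "country": "de"},  # unusable record
]

def basic_clean(records):
    """Stage 1: cast types, drop unusable records, standardize formats."""
    cleaned = []
    for r in records:
        if r["amount"] is None:  # remove records we cannot use
            continue
        cleaned.append({
            "order_id": int(r["order_id"]),   # cast to the correct type
            "amount": float(r["amount"]),
            "country": r["country"].upper(),  # standardize the format
        })
    return cleaned

def aggregate_by_country(records):
    """Stage 3: pre-compute a business metric (revenue per country)."""
    totals = {}
    for r in records:
        totals[r["country"]] = round(
            totals.get(r["country"], 0.0) + r["amount"], 2
        )
    return totals

print(aggregate_by_country(basic_clean(RAW_ORDERS)))  # {'US': 24.99}
```

In practice these steps would run in SQL, Spark, or a framework like dbt, but the shape of the work - clean, model, aggregate - stays the same.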
Serving: In my opinion, this is the most important layer. It is where we finally get value from the data we’ve transformed. As the book says, data projects must be intentional. Therefore, we must continually ask, “What is the business purpose of the data we’re collecting and cleaning?” The serving layer is the answer to that question.
The main use cases for the serving layer are:
Analytics: Operational, Business Intelligence, or Customer-facing Analytics.
Machine Learning: Serving trustworthy data to ML models for training and inference.
Reverse ETL: This is a powerful pattern. Instead of just analyzing data in a warehouse, we push the transformed data back into operational tools that business stakeholders use every day, like Salesforce.
Undercurrents
The lifecycle stages are what we do, but the undercurrents are how we do it successfully and professionally. We need to note that undercurrents are continuous disciplines.
Security: It is the key to all data processes. We often follow the principle of least privilege, which has two parts: give a user access to only the essential data needed, and only for the duration necessary. We also need to protect data in transit (when it travels through the network) and at rest (when it’s stored in your preferred storage system).
Data Management: This is a broad category that includes data governance, data quality, and metadata management. It also covers an ethical responsibility to handle data correctly and comply with regulations such as GDPR and CCPA.
DataOps: It’s all about automation, monitoring, and CI/CD for our data pipelines. We need to version-control our code, write automated tests, and establish monitoring and incident response procedures.
Architecture: The primary objective is to develop scalable, resilient, and cost-effective designs that facilitate ongoing evolution. The authors offer valuable advice: Never aim for the best architecture, but rather for the least worst architecture. Some key principles for data architecture are:
Build loosely coupled systems: The idea behind loosely coupled systems is to have independent components, where a failure in one area does not cascade and bring down the entire system.
Plan for failure: As the famous saying goes, “Everything fails, all the time”. Therefore, we need to be as ready as possible for those failures.
Architect for scalability: It’s about building an elastic system that can scale up to handle increased load while also scaling down to save money.
Embrace FinOps: Every architectural decision carries a cost, and that cost should be part of the design conversation.
Orchestration: Orchestration is the coordination of all our data processes. A common misconception is that an orchestrator is just a scheduler similar to cron. But it’s much more than that. Cron is only aware of time. An orchestration engine is aware of dependencies. It uses a Directed Acyclic Graph (DAG) to ensure that Task C only runs after both Task A and Task B have completed. It also handles retries, alerting, backfills, and more.
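The dependency idea can be illustrated with the standard library alone. This toy example (not any real orchestrator’s API) uses `graphlib` to compute a valid execution order, so `task_c` is only launched once both of its upstream tasks have finished:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are invented for illustration.
dag = {
    "task_a": set(),
    "task_b": set(),
    "task_c": {"task_a", "task_b"},  # runs only after both upstreams
}

def run_pipeline(dag):
    """Execute tasks in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        print(f"running {task}")  # a real engine would launch the task here
    return order

run_pipeline(dag)  # task_a and task_b first (either order), then task_c
```

Real orchestrators such as Airflow or Dagster layer retries, alerting, and backfills on top of exactly this dependency-resolution core.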
Software Engineering: Data pipelines are complex software. They demand mature engineering practices, such as code reviews, testing, and the use of Infrastructure as Code tools. Data engineering is increasingly about adopting good software engineering practices.
Selecting Technologies
Making informed technology decisions involves assessing team capabilities, determining whether to build or buy, evaluating deployment speed, and choosing between serverless and traditional solutions. Ultimately, the most effective technologies are those that seamlessly integrate with your existing infrastructure, security model, CI/CD pipelines, and monitoring stack. A tool’s interoperability is often more important than its standalone performance in a benchmark war.
Core Concepts
Schema Evolution: Managing changes to data schemas effectively.
Change Data Capture (CDC): Capturing incremental data changes for real-time ingestion.
Schema-on-Read vs. Schema-on-Write: Balancing query flexibility and ingestion efficiency.
Data Governance: Ensuring data quality, accountability, and compliance.
CI/CD in Data Pipelines: Automating deployment and testing.
Data Lakehouse: Combining warehouse and lake features.
Data Warehouse: Centralized storage optimized for analytical queries.
Data Lake: Low-cost storage for raw data in any format, structured or unstructured.
Batch Processing: Scheduled, periodic data handling.
Stream Processing: Real-time data ingestion and analysis.
Push, Pull, Polling: Different methods for moving data.
Event-Driven Architecture: Systems reacting to events in real-time.
Directed Acyclic Graph (DAG): Managing complex job dependencies.
FinOps: Managing cloud spend as a shared engineering discipline.
Infrastructure as Code (IaC): Automating infrastructure provisioning and management.
Kimball & Inmon Models: Data modeling approaches for data warehouses.
Data Vault: Flexible modeling technique for complex data environments.
Partitioning and Clustering: Techniques for optimizing data storage and retrieval.
Compression: Reducing storage size and improving performance.
Encryption: Securing data at rest and in transit.
Data Contracts: Formal agreements defining data quality and expectations.
Data Lineage: Tracking data origins and transformations.
Serverless Architecture: Running applications without managing servers.
Elastic Scalability: Automatically scaling resources up and down based on demand.
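To make one of the concepts above concrete, here is a toy sketch of Change Data Capture via snapshot comparison. Production CDC systems usually read the database’s transaction log instead, and the table contents here are invented:

```python
def diff_snapshots(old, new):
    """Emit insert/update/delete change events between two keyed snapshots."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key, row in old.items():
        if key not in new:
            changes.append(("delete", key, row))
    return changes

yesterday = {1: {"name": "Ada"}, 2: {"name": "Joe"}}
today     = {1: {"name": "Ada L."}, 3: {"name": "Matt"}}
print(diff_snapshots(yesterday, today))
```

Feeding only these change events downstream, rather than re-copying whole tables, is what makes CDC suitable for near-real-time ingestion.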
Final Thoughts
“Fundamentals of Data Engineering” by Joe Reis and Matt Housley provides invaluable insights for mastering data engineering. If you are starting your journey, use this book as your foundation. Learn about the core principles of the data engineering lifecycle, and use it as a reference to explore concepts in more detail. Return to it as you progress to understand better how various technologies fit into the overall framework.
I am confident that applying these principles to your career will unlock new opportunities.
Enjoyed the article? Support my newsletter: Paypalme Link. 🚀
Author’s Notes
✅ Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.
🔔 You can also follow my work on: