Data Engineering Fundamentals
Executive Summary
Data engineering is the backbone of effective analytics and data science. Without timely, high-quality data, even the best models and insights fall flat. In this series, I share my reflections as a data scientist and analytics freelancer on:
- The core principles of data engineering inspired by Joe Reis and Matt Housley’s foundational work.
- Practical tools and best practices that I currently prefer, emphasising automation, disciplined SQL, and sensible tool choices.
- The importance of robust data modelling and the critical role of domain-savvy data architects.
- Emerging architectural paradigms like data mesh and data fabric, and guidance on when to consider them.
This series is written from a Python/SQL-first perspective, focusing on reproducibility, version control, and avoiding vendor lock-in. I’m not affiliated with any vendors mentioned, and this reflects my personal experience and preferences.
Part 1: Core Principles of Data Engineering
Explore the foundational dimensions of data engineering, inspired by Reis and Housley’s Fundamentals of Data Engineering. Understand why timely, quality data is essential and learn about key axes like batch vs streaming, the 5 Vs, data accessibility, granularity, and change management.
Part 2: Practical Tools and Better Practices
A deep dive into the tools and practices that make data engineering effective today:
- Why automation and observability matter — Prefect.io vs Airflow
- Disciplined SQL workflows with SQLMesh
- Using DuckDB for local development and MVPs
- The growing importance of synthetic data (e.g. Faker)
- Avoiding premature scaling with heavyweight platforms
- New tools to watch, such as MotherDuck
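To give a flavour of two of these ideas in combination — synthetic data plus a local analytical database — here is a minimal sketch. It uses only the Python standard library, with `random` standing in for Faker and `sqlite3` standing in for DuckDB (which offers the same local workflow with a richer analytical SQL dialect); the `orders` table and its columns are invented for illustration.

```python
import random
import sqlite3

random.seed(42)  # seed for reproducible synthetic data

# Faker-style synthetic rows, built here with stdlib random as a stand-in
FIRST_NAMES = ["Alice", "Bob", "Chen", "Dara", "Elif"]
REGIONS = ["EMEA", "APAC", "AMER"]

rows = [
    (i, random.choice(FIRST_NAMES), random.choice(REGIONS),
     round(random.uniform(10, 500), 2))
    for i in range(1, 101)
]

# In-memory database for local development / an MVP
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, region TEXT, amount REAL)"
)
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# The kind of quick sanity-check query you run constantly during local dev
for region, total in con.execute(
    "SELECT region, ROUND(SUM(amount), 2) FROM orders "
    "GROUP BY region ORDER BY region"
):
    print(region, total)
```

Swapping `sqlite3` for `duckdb` (and `random` for Faker) keeps the same shape while adding columnar performance and direct querying of Parquet/CSV files.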
Part 3: Data Modelling and the Role of the Data Architect
Discuss the universal star schema approach to simplify and standardise analytics data models. Highlight the indispensable value of data architects with deep domain knowledge who ensure data solutions align with real business needs. Also, reflect on the messy reality of corporate data versus the clean datasets often used in tutorials.
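As a concrete illustration of the star schema pattern, the sketch below builds one fact table surrounded by two dimension tables and runs the join-and-slice query the shape is designed for. It uses Python's stdlib `sqlite3` as a stand-in for a real warehouse, and all table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: descriptive attributes, one row per entity
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, iso_date TEXT, quarter TEXT);

-- Fact table: one row per business event, foreign keys into each dimension
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    amount      REAL
);

INSERT INTO dim_customer VALUES (1, 'Acme Ltd', 'Enterprise'), (2, 'Beta GmbH', 'SMB');
INSERT INTO dim_date     VALUES (10, '2025-01-15', 'Q1'), (11, '2025-04-02', 'Q2');
INSERT INTO fact_sales   VALUES (1, 10, 1200.0), (2, 10, 300.0), (1, 11, 950.0);
""")

# The analytics pattern a star schema enables: join facts to dimensions,
# then group by any descriptive attribute
for segment, quarter, total in con.execute("""
    SELECT c.segment, d.quarter, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_date d USING (date_id)
    GROUP BY c.segment, d.quarter
    ORDER BY c.segment, d.quarter
"""):
    print(segment, quarter, total)
```

The value of the shape is that every analytical question becomes the same query skeleton: join the fact table to whichever dimensions you need, then group and aggregate — which is exactly what makes it easy to standardise across teams.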
Part 4: Emerging Architectures — Data Mesh and Data Fabric
Understand the differences between data mesh and data fabric architectures, their organisational and technical implications, and when to consider each. Learn about hybrid approaches that combine the best of both worlds to balance agility and governance.
Part 5: Navigating Constraints — Greenfield vs. Brownfield
Explore the reality of greenfield vs. brownfield data engineering projects. Learn how to assess technical landscapes, data quality, and organisational context to make informed choices and favour reversible decisions.
Selected References
Selected references that informed this series of articles.
- 15 Data Engineering Best Practices to Follow in 2025 (lakefs.io)
- 10 Essential Data Engineering Tools To Use in 2025 (infomineo.com)
- Small Data Engineering tools/techniques, r/dataengineering (reddit.com)
- Data Engineering Best Practices (nexla.com)
- 14 Essential Data Engineering Tools to Use in 2025 (datacamp.com)
- Data Engineering: Components, Skills & Best Practices [2025 Guide] (dagster.io)
- Is Data Modelling Still Important In Modern Data Architecture? (matillion.com)
- The Role of Data Architecture and Data Modelling Strategy (cuelogic.com)