
DuckDB combined with MotherDuck and the innovative DuckLake format enables data scientists and analysts to seamlessly transition from local, high-performance analytics to scalable cloud lakehouses, reducing complexity, improving query efficiency, and lowering infrastructure costs for modern data workflows. This post discusses a new short course that explores this part of the data engineering ecosystem.

I'm primarily a problem-solving Data Scientist, focused on extracting insights and building models, but that often means dipping into "lite" data engineering to make analysis possible: cleaning pipelines, querying disparate sources, and occasionally going deeper than I'd like because, well, someone has to do it. Tools like DuckDB have been a lifesaver for this, enabling fast, local analytics without heavy infrastructure. Recently, I completed the free "DuckDB for Data Engineers: From Local to Cloud with MotherDuck" short course at Learn Data Engineering by Andreas Kretz. It explores scaling DuckDB workflows to the cloud via MotherDuck and introduces the emerging DuckLake format. Even after using DuckDB for well over a year, I picked up valuable new approaches—reminding me that there's no single "right way" in data work; we all evolve our methods differently.

Why This Course Stands Out: Practicality at Its Core

What impressed me most was the course's emphasis on real-world examples. It doesn't just throw theory at you; it dives into hands-on scenarios that mimic everyday data challenges. From querying local files to scaling up in the cloud, the modules build progressively, making it accessible yet challenging. If you're like me, someone who's already got DuckDB in your toolkit but wants to refine your approach, this is gold. I found myself nodding along to setups I'd done differently in the past, and it sparked ideas for optimising my own pipelines.

One standout takeaway? The deep dive into SQL directives like EXPLAIN and EXPLAIN ANALYZE. I'd skimmed these before, but the course breaks them down with clear examples, showing how to diagnose query performance bottlenecks. It's the kind of practical tip that pays dividends immediately: the next time you're wondering why a join is dragging, these tools will be your first stop.
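To make that concrete, here's a minimal sketch of the kind of check this enables, using DuckDB's Python API against hypothetical Parquet files (orders.parquet and customers.parquet are placeholders, not course material):

```python
import duckdb

con = duckdb.connect()

query = """
    SELECT o.customer_id, SUM(o.amount) AS total_spend
    FROM 'orders.parquet' o
    JOIN 'customers.parquet' c USING (customer_id)
    GROUP BY o.customer_id
"""

# EXPLAIN prints the planned operators without executing the query.
print(con.sql("EXPLAIN " + query))

# EXPLAIN ANALYZE runs the query and reports per-operator timings,
# which is where slow joins and scans show up.
print(con.sql("EXPLAIN ANALYZE " + query))
```

EXPLAIN shows the plan; EXPLAIN ANALYZE executes it and attaches timings, which is usually enough to spot the offending join or scan.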

Areas for Possible Enhancements: Platform and Tooling Suggestions

That said, the course has a strong Windows/WSL bent, which makes sense for accessibility but could be broadened. Many data pros work on native macOS or Linux, or in containerised environments, so expanding the examples to cover those would make it even more universal. It's not a deal-breaker, but a nod to cross-platform quirks would elevate it.

On the Python side, it promotes pyenv for managing isolated environments, which is solid advice. However, I'd nudge toward something more contemporary like uv. uv streamlines virtual environments and package management with speed and simplicity, aligning better with modern workflows. See my previous post on uv.

The course embeds SQL queries directly within Python scripts. I much prefer loading queries from well-named .sql files in a dedicated sql/ directory. This approach makes queries easier to identify, edit, and format, and lets them benefit from syntax highlighting in editors. Plus, they can be run directly in the DuckDB CLI or UI for quick testing, promoting better separation of concerns and reusability.
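As a rough sketch of the pattern (the file names, database, and query are hypothetical, not taken from the course):

```python
from pathlib import Path
import duckdb

SQL_DIR = Path("sql")  # dedicated directory of well-named .sql files

def load_query(name: str) -> str:
    """Read a query such as sql/daily_revenue.sql as a plain string."""
    return (SQL_DIR / f"{name}.sql").read_text()

con = duckdb.connect("analytics.duckdb")  # placeholder local database
daily_revenue = con.sql(load_query("daily_revenue")).df()
```

The same daily_revenue.sql file can be pasted straight into the DuckDB CLI or UI while you iterate on it.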

The DuckLake S3 integration shines, but it is very AWS-centric. DuckLake's potential as a table format for lakehouses is exciting, blending metadata efficiency with DuckDB's query prowess. Generalising the examples to other S3-compatible providers (like MinIO or Google Cloud Storage) would make it accessible to a wider audience.
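For instance, pointing DuckDB's S3 layer at a MinIO endpoint is mostly a matter of defining a secret; the sketch below uses placeholder credentials, endpoint, and bucket names, and the same idea carries over to other S3-compatible stores:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Placeholder MinIO credentials and endpoint; swap in your own.
con.sql("""
    CREATE SECRET minio_secret (
        TYPE S3,
        KEY_ID 'minioadmin',
        SECRET 'minioadmin',
        ENDPOINT 'localhost:9000',
        URL_STYLE 'path',
        USE_SSL false
    )
""")

# S3 paths now resolve through that endpoint rather than AWS.
print(con.sql("SELECT COUNT(*) FROM 's3://my-bucket/events/*.parquet'"))
```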

Wrapping Up: Worth Your Time?

Overall, this is an excellent short course: engaging, succinct, and packed with great takeaways. Whether you're new to DuckDB or a veteran looking to integrate MotherDuck and DuckLake, it's a worthwhile investment of a few hours. Kudos to the creators for making it free and focused on bridging local-to-cloud gaps. If you're in data science or engineering, check it out; you might just tweak your toolkit in unexpected ways. All of the code for the course is available in the author's companion repo.

Have you taken this course or experimented with DuckLake? Drop your thoughts in the comments.

Appendix: Some DuckLake Basics

  • Why DuckLake? Existing open table formats like Apache Iceberg and Delta Lake add powerful features (ACID transactions, time travel, schema evolution) to data lakes but introduce complexity through file-based metadata, which can lead to performance bottlenecks and thousands of tiny files. DuckLake simplifies this by leveraging a relational database for metadata management, reducing overhead while providing similar (or enhanced) lakehouse capabilities.

  • What is DuckLake? An open lakehouse table format (released in May 2025 by the DuckDB team) that stores data in Parquet files on object storage (e.g., S3-compatible) and all metadata (schemas, file pointers, snapshots, transactions) in a standard SQL database of your choice (e.g., DuckDB, PostgreSQL, MySQL), depending on your use case. It supports multi-table ACID transactions, partitioning, and time travel, without requiring external catalogs.

  • How does it work? Attach a DuckLake "database" in DuckDB via the ducklake extension (e.g., ATTACH 'ducklake:my_lake.ducklake' AS my_lake;). Query and modify tables with standard SQL; DuckDB handles reading/writing Parquet files and updating the metadata catalog transparently. MotherDuck offers managed DuckLake support for cloud scaling. It's designed for simplicity and speed, especially in metadata-heavy operations, while remaining engine-agnostic in principle. A minimal sketch follows after this list.
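To tie those pieces together, here's a minimal sketch using DuckDB's Python API. The install/attach pattern follows the ducklake extension's documented usage; the metadata file, data path, and table are placeholders, and the snapshot listing assumes a recent extension version:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Metadata (schemas, snapshots, file pointers) lives in my_lake.ducklake;
# table data is written as Parquet files under DATA_PATH.
con.sql("ATTACH 'ducklake:my_lake.ducklake' AS my_lake (DATA_PATH 'lake_data/')")

# Standard SQL against the lake: DuckDB writes Parquet and updates the
# metadata catalog transparently.
con.sql("CREATE TABLE my_lake.trips AS SELECT * FROM 'trips.parquet'")
con.sql("UPDATE my_lake.trips SET fare = 0 WHERE fare < 0")

# Each change creates a snapshot, which is what enables time travel.
print(con.sql("SELECT * FROM my_lake.snapshots()"))
```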