A new, yet old, data validation package
Exec Summary
Excited to explore pointblank, the new Python package from Posit for data validation (https://posit-dev.github.io/pointblank/)! It builds on the original R version (https://rstudio.github.io/pointblank/). 🎉 Ensuring data quality is crucial for reliable analysis, decision-making, and genAI (think data governance!). It offers a simple way to define validation rules for Pandas, Polars, and DuckDB tables (and more!), generating clear reports. I'm in the process of testing it out and will post a link to my technical deep dive shortly. #datagovernance #datavalidation #python #posit #opensource
Technical Evaluation
Initial Experience with pointblank for Data Validation
Posit's new pointblank package offers a promising approach to data validation in Python. My initial experience, using it with the Titanic dataset and DuckDB, has been encouraging.
TODO: mention Great Expectations, Pandera, and Deequ for comparison (others? commercial data quality tools?), and link to my post on Titanic data quality.
Setup and Integration
Installation was straightforward: pip install pointblank (or uv add pointblank). The package integrates seamlessly with Pandas and Polars and, importantly for my use case, DuckDB. Working directly with DuckDB tables is a major advantage, as it keeps data handling and validation efficient within a single environment.
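As a minimal sketch of that DuckDB workflow: pointblank accepts Ibis table expressions, so a DuckDB table can be validated in place. The file name titanic.duckdb and table name titanic below are hypothetical stand-ins for your own database.

import ibis
import pointblank as pb

# Connect to a local DuckDB database and reference a table lazily via Ibis
# ("titanic.duckdb" and the "titanic" table are assumed names for this sketch)
con = ibis.duckdb.connect("titanic.duckdb")
titanic = con.table("titanic")

# The Ibis/DuckDB table can be passed to Validate just like a DataFrame
validation = (
    pb.Validate(data=titanic, label="Titanic via DuckDB")
    .col_vals_not_null("survived")
    .interrogate()
)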
Validation Implementation
Using the Validate class is intuitive. I was able to define a series of validation steps for the Titanic dataset, including:
- Checking for missing values in key columns (e.g., "survived", "pclass", "age").
- Ensuring values within specific ranges (e.g., "age" between 0 and 70, "fare" between 0 and 500).
- Verifying that categorical columns ("pclass", "embarked", "survived", "sex") contain only allowed values.
import pointblank as pb

# sample_df: a Pandas or Polars DataFrame (or DuckDB table) holding the Titanic data
validation = (
    pb.Validate(data=sample_df, label="Example Validation")
    # Key columns must not contain missing values
    .col_vals_not_null("survived")
    .col_vals_not_null("pclass")
    .col_vals_not_null("sex")
    .col_vals_not_null("age")
    .col_vals_not_null("ticket")
    .col_vals_not_null("fare")
    .col_vals_not_null("embarked")
    # Numeric values must fall within plausible ranges
    .col_vals_between("age", 0, 70, na_pass=True)
    .col_vals_between("fare", 0, 500)
    # Categorical columns may only contain the allowed values
    .col_vals_in_set("pclass", {1, 2, 3})
    .col_vals_in_set("embarked", {"C", "Q", "S"})
    .col_vals_in_set("survived", {0, 1})
    .col_vals_in_set("sex", {"male", "female"})
    # Run all validation steps against the data
    .interrogate()
)
There is also the option to enlist an LLM to draft the validation rules, though depending on your dataset and the sensitivity of its contents, you may want to consider the implications of sharing it with a model.
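As a hedged sketch of that LLM-assisted route: the pointblank docs describe a DraftValidation helper that asks a model to propose a validation plan. The exact signature and the provider:model string below are assumptions to verify against the current documentation.

import pointblank as pb

# Assumption: DraftValidation(data=..., model=...) per the pointblank docs;
# the "provider:model" identifier format is also an assumption
draft = pb.DraftValidation(
    data=sample_df,  # the same Titanic DataFrame as above
    model="anthropic:claude-3-5-sonnet-latest",
)
print(draft)  # prints a suggested Validate(...) chain to copy, review, and edit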
The resulting validation report provides a clear overview of the results, indicating the number of tests run, passed, and failed. The get_data_extracts() method is useful for inspecting the specific records that failed validation.
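For example, continuing from the validation object above (the step index i=1 is illustrative; extracts are keyed by step number):

# Render the full validation report (also displayed automatically in notebooks)
report = validation.get_tabular_report()

# Pull the rows that failed a given validation step, keyed by step number
failing_rows = validation.get_data_extracts(i=1)
print(failing_rows)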
Key Observations and Next Steps
pointblank aligns well with data governance principles by providing a structured way to define and enforce data quality rules. The ability to generate reports and extract failing records facilitates the identification and remediation of data quality issues.
My next steps involve exploring:
- Thresholds: defining acceptable failure rates and triggering actions when those thresholds are exceeded (https://posit-dev.github.io/pointblank/user-guide/thresholds.html). This will allow for automated monitoring of data quality; see the sketch after this list.
- Actions: implementing actions to take when validation fails, such as logging errors or sending alerts (https://posit-dev.github.io/pointblank/user-guide/actions.html).
- Custom validations: creating custom validation functions to address specific data quality requirements.
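A hedged sketch of what I expect that to look like, based on the linked user guide pages; the Thresholds and Actions parameter names should be verified against the installed version:

import pointblank as pb

validation = (
    pb.Validate(
        data=sample_df,
        label="Titanic with thresholds",
        # Failure-rate levels (fractions of rows) at which each severity trips
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        # What to do when a step crosses the "error" level; a string is logged,
        # a callable could instead send an alert
        actions=pb.Actions(error="Validation step exceeded the error threshold"),
    )
    .col_vals_between("age", 0, 70, na_pass=True)
    .interrogate()
)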
Overall, pointblank appears to be a valuable tool for data validation, offering a balance of ease of use and powerful features. I'm excited to continue exploring its capabilities.
https://posit-dev.github.io/pointblank/