
A new, yet old, data validation package

Exec Summary

Excited to explore pointblank, the new Python package from Posit for data validation (https://posit-dev.github.io/pointblank/)! It builds on the original R version (https://rstudio.github.io/pointblank/). 🎉 Ensuring data quality is crucial for reliable analysis, decision-making, and genAI (think data governance!). pointblank offers a simple way to define validation rules for Pandas, Polars, and DuckDB tables (and more!), generating clear reports. I'm testing it out now and will post a link to my technical deep dive shortly. #datagovernance #datavalidation #python #posit #opensource

Technical Evaluation

Initial Experience with pointblank for Data Validation

Posit's new pointblank package offers a promising approach to data validation in Python. My initial experience, using it with the Titanic dataset and DuckDB, has been encouraging.

For context, other open-source data validation tools include Great Expectations, Pandera, and Deequ, and a range of commercial data quality products operate in the same space. This post also ties into my earlier look at data quality issues in the Titanic dataset.

Setup and Integration

The installation was straightforward: pip install pointblank (or uv add pointblank). The package seamlessly integrates with Pandas, Polars, and, importantly for my use case, DuckDB. The ability to work directly with DuckDB tables is a major advantage, as it allows for efficient data handling and validation within a single environment.

Validation Implementation

Using the Validate class is intuitive. I was able to define a series of validation steps for the Titanic dataset, including:

  • Checking for missing values in key columns (e.g., "survived", "pclass", "age").
  • Ensuring values within specific ranges (e.g., "age" between 0 and 70, "fare" between 0 and 500).
  • Verifying that categorical columns ("pclass", "embarked", "survived", "sex") contain only allowed values.
import pointblank as pb

# sample_df holds the Titanic dataset (e.g. a Pandas or Polars DataFrame,
# or a DuckDB table pulled into one).
validation = (
    pb.Validate(data=sample_df, label="Example Validation")
    .col_vals_not_null("survived")
    .col_vals_not_null("pclass")
    .col_vals_not_null("sex")
    .col_vals_not_null("age")
    .col_vals_not_null("ticket")
    .col_vals_not_null("fare")
    .col_vals_not_null("embarked")
    .col_vals_between("age", 0, 70, na_pass=True)
    .col_vals_between("fare", 0, 500)
    .col_vals_in_set("pclass", {1, 2, 3})
    .col_vals_in_set("embarked", {"C", "Q", "S"})
    .col_vals_in_set("survived", {0, 1})
    .col_vals_in_set("sex", {"male", "female"})
    .interrogate()
)

There is also the option of enlisting an LLM to help write the validation rules, though depending on your dataset and the sensitivity of its contents, you may want to consider the implications of sharing it with a model.

The resulting report gives a clear overview of the results, indicating the number of tests run, passed, and failed. The get_data_extracts() function is useful for inspecting the specific records that failed validation.
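Conceptually, those failing-record extracts correspond to simple filters on each rule's predicate. Here is a plain pandas sketch (toy values, not the real Titanic data, and not pointblank's internal implementation) of what a "between" failure looks like:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "age": [22.0, 95.0, None, 40.0],
    "fare": [7.25, 512.33, 71.28, 8.05],
})

# Equivalent of col_vals_between("age", 0, 70, na_pass=True):
# a row fails only when a non-null age falls outside [0, 70].
age_fails = sample_df[sample_df["age"].notna() & ~sample_df["age"].between(0, 70)]

# Equivalent of col_vals_between("fare", 0, 500); without na_pass,
# a null fare would also count as a failure.
fare_fails = sample_df[~sample_df["fare"].between(0, 500)]
```

Here `age_fails` holds the single row with age 95, and `fare_fails` the single row with fare 512.33 — exactly the records you would want surfaced for remediation.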

Key Observations and Next Steps

pointblank aligns well with data governance principles by providing a structured way to define and enforce data quality rules. The ability to generate reports and extract failing records facilitates the identification and remediation of data quality issues.
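To make "enforcement" concrete, here is a conceptual gate in plain pandas (not pointblank's API) that mirrors two of the validation steps above and halts a pipeline when a rule fails:

```python
import pandas as pd

def enforce_titanic_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Raise if a data quality rule fails; otherwise pass the data through.

    Mirrors col_vals_not_null("survived") and col_vals_in_set("pclass", {1, 2, 3})
    from the pointblank example as plain pandas checks.
    """
    failed = []
    if df["survived"].isna().any():
        failed.append("survived_not_null")
    if (~df["pclass"].isin({1, 2, 3})).any():
        failed.append("pclass_in_set")
    if failed:
        raise ValueError(f"data quality rules failed: {failed}")
    return df
```

A downstream step would call `enforce_titanic_rules(sample_df)` and only ever see data that has cleared the rules; pointblank's threshold and reporting machinery plays this role far more flexibly.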

My next steps involve exploring:

  • How pointblank compares with Great Expectations, Pandera, and Deequ.
  • LLM-assisted generation of validation rules, and the data-sensitivity trade-offs that come with it.
  • Folding validation reports and failing-record extracts into a broader data governance workflow.

Overall, pointblank appears to be a promising tool for data validation, offering a balance of ease of use and powerful features. I'm excited to continue exploring its capabilities.

https://posit-dev.github.io/pointblank/