
I thought I was tracking the full DuckDB extensions ecosystem. Then I stumbled across read_rdf — a real extension in the wild that wasn’t in my reports — and realised my monitoring was only as complete as the official registry. This post covers what I learned, how big the gap might be, and the tooling I built to discover the long tail safely.


The moment I realised: my report wasn’t the whole story

In my earlier posts, I walked through why I built an automated monitor for DuckDB extensions and how it helps de-risk upgrades.

That tool deliberately focuses on DuckDB’s official extension ecosystem:

  • Core extensions sourced from DuckDB docs.
  • Community extensions sourced from the curated registry in duckdb/community-extensions.

It’s a sensible default. The official lists are what most users should rely on.

But then I came across read_rdf — a DuckDB extension hosted at nonodename/read_rdf — and it wasn’t in my reports.

That raised three questions I couldn’t ignore:

  1. Why wasn’t it being tracked?
  2. By how much am I under-counting the ecosystem?
  3. Can I rectify that gap without turning the project into a noisy web crawler?

Why “not in the registry” doesn’t mean “not real”

The official community registry is great, but it is (by design) a curated list. That means:

  • An extension can be legitimate and useful, but simply not submitted yet.
  • Some repos may be experimental or early-stage and never make it into the registry.
  • Some extensions might be internal to organisations, but still public.

So the absence of a repo in the registry is not a judgement on quality — it’s often just a signal about process and maturity.

A new goal: discover candidates, don’t automatically bless them

I didn’t want to dilute the existing daily reports (which are meant to be trustworthy and low-noise).

Instead, I framed a separate piece of work:

  • Discover likely extension repos on GitHub.
  • Surface candidates for manual review.
  • Measure the gap between the official registry and what exists “in the wild”.
  • Eventually, feed verified extras back into the main tool as a clearly labelled category (e.g. “unofficial / candidate”).

This is an important distinction:

  • Discovery is a recall problem: find as much as possible.
  • Inclusion in the main report is a trust problem: only include what we can justify.

The approach: signals that suggest a repo is an extension

GitHub doesn’t have a first-class “this repo is a DuckDB extension” flag.

And I also learned something important: a DuckDB extension does not have to be distributed via the official community repository.

DuckDB supports multiple extension delivery patterns:

  • Core repository (default): INSTALL httpfs; or INSTALL httpfs FROM core;
  • Community repository (official): INSTALL avro FROM community;
  • Third-party repositories: INSTALL myext FROM 'https://example.com/repo';
  • Local files: LOAD '/path/to/myext.duckdb_extension'; (and INSTALL '/path/to/...')
  • Direct remote URLs: INSTALL 'https://…/myext.duckdb_extension';

So the discovery problem isn’t “find extensions in the community repo”. It’s “find extension projects and identify how (and if) they’re distributed”.

With that in mind, I leaned on a handful of pragmatic signals:

1) Topic search

Many extension authors tag repos with a topic like duckdb-extension.

This is high-precision (usually correct), but low-recall (many repos won’t bother).

2) Code and artefact search

Extensions often contain build/config artefacts or identifiers such as:

  • .duckdb_extension files
  • extension_config.cmake
  • occurrences of DUCKDB_EXTENSION / duckdb_extension

Code search tends to find a lot more, but also brings more noise (anything that mentions these strings).
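The two strategies map onto different GitHub search endpoints. As a minimal sketch of the query construction (the helper names are mine, not the actual scripts'; the real tooling adds auth headers, caching, and pagination on top):

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"

def topic_search_url(page: int = 1, per_page: int = 100) -> str:
    """High-precision signal: repos explicitly tagged with the topic."""
    qs = urlencode({"q": "topic:duckdb-extension", "page": page, "per_page": per_page})
    return f"{GITHUB_API}/search/repositories?{qs}"

def code_search_url(query: str, page: int = 1) -> str:
    """Higher-recall, noisier signal: code search for extension artefacts,
    e.g. query = 'filename:extension_config.cmake'."""
    qs = urlencode({"q": query, "page": page, "per_page": 100})
    return f"{GITHUB_API}/search/code?{qs}"
```

Keeping URL construction separate from fetching makes the queries easy to unit-test and to cache by URL.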

3) Deduplication, enrichment, and validation

Search results can be incomplete and may omit useful metadata.

So the workflow becomes:

  1. Collect candidates via multiple searches.
  2. Deduplicate by repo owner/name.
  3. Enrich metadata (stars, last push date, description) from the repo API.
  4. Validate candidates using two complementary approaches:
    • API-only structure checks (fast, scalable): look for extension build artefacts like extension_config.cmake, .duckdb_extension, and DuckDB extension markers in CMakeLists.txt.
    • Runtime smoke tests (high confidence when positive): attempt to load an extension binary (if the repo publishes one), or attempt INSTALL/LOAD where applicable.
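Steps 2 and 4 (the API-only variant) can be sketched as a couple of pure functions. Function names are mine; the real scripts also walk the repo's git tree via the API rather than taking a path list:

```python
def dedupe_candidates(candidates: list[dict]) -> list[dict]:
    """Deduplicate search hits by owner/name, keeping the first occurrence."""
    seen: set[str] = set()
    unique = []
    for repo in candidates:
        key = repo["full_name"].lower()  # e.g. "nonodename/read_rdf"
        if key not in seen:
            seen.add(key)
            unique.append(repo)
    return unique

def looks_like_extension(paths: list[str]) -> bool:
    """API-only structure check over a git tree listing (no clone needed)."""
    has_config = any(p.endswith("extension_config.cmake") for p in paths)
    has_binary = any(p.endswith(".duckdb_extension") for p in paths)
    return has_config or has_binary
```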

The tooling: discovery + validation scripts built for safe iteration

I ended up with a small toolchain of stand-alone scripts:

  • scripts/discover_additional_extensions.py — broad discovery (topic + code search) with caching and rate-limit handling.
  • scripts/analyse_discovered_extensions.py — dedupe, subtract the already-known core/community set, and produce a “novel candidates” list.
  • scripts/validate_extension_candidates.py — validate candidates using repo-structure signals (including a git tree scan) and optional runtime smoke tests.

Key design goals:

  • Proactive rate limit handling (including backoff).
  • Early stop so I can build confidence before scaling up.
  • Caching so re-runs don’t smash the GitHub API.
  • Separation of concerns: discovery for recall, validation for precision, and the main daily report stays trustworthy.
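The backoff behaviour can be sketched as a simple exponential schedule. This is a simplified illustration, not the scripts' actual implementation — in practice you would also honour GitHub's rate-limit reset headers:

```python
import time

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

def call_with_backoff(fetch, max_retries: int = 5, base: float = 1.0):
    """Retry `fetch` (a zero-arg callable) on RuntimeError, sleeping between tries."""
    for delay in backoff_delays(max_retries, base=base):
        try:
            return fetch()
        except RuntimeError:
            time.sleep(delay)
    return fetch()  # final attempt; let any exception propagate
```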

The outputs are intentionally simple: JSON/CSV lists that are easy to review and refine.
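Plain stdlib covers this kind of output. A sketch, assuming each candidate record carries full_name, stars, pushed_at, and description fields (the field set is illustrative):

```python
import csv
import json
from pathlib import Path

FIELDS = ["full_name", "stars", "pushed_at", "description"]

def write_candidates(candidates: list[dict], out_dir: str = "output") -> None:
    """Write the candidate list as JSON (full records) and CSV (review-friendly)."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / "candidates.json").write_text(json.dumps(candidates, indent=2))
    with open(out / "candidates.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(candidates)
```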

What I learned (so far)

Even a conservative run found hundreds of candidates.

That doesn’t mean there are hundreds of high-quality missing extensions — it means:

  • The official registry is not the full set of repos that look like extensions.
  • Installability is a separate axis: some legitimate extension repos won’t be installable via INSTALL <name> because they’re not published in the default repositories (core/community). They may only be available via third-party repos, direct URLs, or local builds.
  • Discovery at scale needs filtering, scoring, and manual verification.

In other words: the ecosystem is larger than the official registry, but “larger” includes plenty of false positives and multiple distribution paths.
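To make "filtering and scoring" concrete, here is a toy heuristic for ranking candidates for manual review. Every field name and weight here is an illustrative guess, not the tool's actual logic:

```python
def score_candidate(repo: dict) -> int:
    """Toy ranking heuristic: higher score = review sooner. Weights are guesses."""
    score = 0
    name = repo.get("full_name", "").lower()
    if "duckdb" in name:
        score += 2          # name mentions duckdb
    if "duckdb-extension" in repo.get("topics", []):
        score += 3          # explicit topic tag is the strongest signal
    if repo.get("stars", 0) >= 10:
        score += 1          # some community traction
    if name.startswith("duckdb/"):
        score -= 5          # core-org repos are already tracked; deprioritise
    return score
```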

Next: quantify, filter, and integrate carefully

From here, there are three practical next steps:

  1. Measure overlap: compare discovered candidates to the official lists, and quantify the delta.

  2. Refine the candidate set: remove obvious false positives (e.g. the DuckDB core repo itself), and add heuristics that prioritise likely extension repos.

  3. Integrate as a distinct category: if we do add candidates into the main pipeline, they should be clearly labelled:

    • Official core
    • Official community
    • Unofficial / candidate

That keeps the daily report trustworthy, while still making the project a more comprehensive map of the ecosystem.
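Step 1, measuring overlap, is essentially set arithmetic over repo full names. A minimal sketch:

```python
def ecosystem_delta(official: set[str], discovered: set[str]) -> dict:
    """Quantify the gap between the official registry and in-the-wild discovery.
    Both inputs are sets of lowercase "owner/name" strings."""
    return {
        "overlap": sorted(official & discovered),        # tracked and rediscovered
        "novel": sorted(discovered - official),          # the long tail
        "registry_only": sorted(official - discovered),  # discovery recall misses
    }
```

The "registry_only" bucket is worth watching too: official extensions the discovery pass failed to find indicate gaps in the search signals themselves.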


This is a companion post in the DuckDB extensions series. For the original motivation and the main monitoring tool, see Mapping the DuckDB Extension Ecosystem: From Problem to Solution.