Posted on :: 620 Words :: Tags: , , , ,

Exploring Streamlit's Connection API with DuckDB as the target

Exec Summary

In a recent article, Using DuckDB in Streamlit, the author explored integrating DuckDB with Streamlit for data analysis. While the article provides a functional approach, it overlooks Streamlit's built-in st.connection API, a feature designed to simplify and enhance database interactions within Streamlit applications.

This post offers an alternative perspective, demonstrating how st.connection streamlines DuckDB integration and unlocks several key advantages. We'll re-use the NL railway data example from the original article to illustrate the benefits of this approach.

The Traditional Approach (and its limitations)

The original article demonstrates connecting to DuckDB using standard Python code. While this works, it requires manual connection management, caching, and can lead to verbose code (notwithstanding the initial code required for the custom class).

st.connection: A Simpler, More Powerful Solution

Streamlit's st.connection API provides a more elegant and efficient way to manage database connections. It handles connection creation, caching, and resource management automatically, reducing boilerplate code and improving performance. Moreover it allows a consistent and unified method to connect to different database types.

DuckDBConnectionLite Class

DuckDB Connection Types

The DuckDBConnectionLite class presented here extends Streamlit's BaseConnection for DuckDB. It is designed to flexibly handle multiple connection scenarios:

  • In-memory database: For quick, ephemeral analysis, use :memory: or None as the target (e.g., connection_target=":memory:").
  • Local DuckDB file: To persist data, specify a file path ending in .duckdb (e.g., connection_target="mydata.duckdb"). This could easily be tweaked/extended for example to support SQLite.
  • MotherDuck cloud database: For cloud-hosted DuckDB, use the md: prefix and ensure you set your MOTHERDUCK_TOKEN environment variable (e.g., connection_target="md:my_cloud_db"). I typically use the python-dotenv package with a .env file to manage such secrets (tokens).
  • Dynamic table from URL: Instantly create a table from a remote CSV or Parquet file by providing its URL (e.g., connection_target="https://example.com/data.csv").

This versatility allows you to adapt your Streamlit app to a wide range of data sources and deployment scenarios with minimal code changes. e.g. Swapping between a local database for development and a cloud-hosted database for production.

Note that the return type of the query method is chosen to be a Pandas DataFrame, which is compatible with Streamlit's st.dataframe() function.

import duckdb
import os
from pathlib import Path
from loguru import logger
from streamlit.connections import BaseConnection

class DuckDBConnectionLite(BaseConnection):
    def _connect(self, connection_target: str = None) -> duckdb.DuckDBPyConnection:
        try:
            # Handle in-memory database
            if connection_target in [None, "", "memory", ":memory:"]:
                conn = duckdb.connect(":memory:")
                logger.info("Connected to in-memory DuckDB database.")
            
            # Handle file-based database
            elif connection_target.endswith(".duckdb"):
                db_path = Path(connection_target).resolve()
                conn = duckdb.connect(str(db_path))
                logger.info(f"Connected to DuckDB file at {db_path}.")
            
            # Handle MotherDuck connection
            elif connection_target.startswith("md:"):
                token = os.getenv("MOTHERDUCK_TOKEN")
                if not token:
                    raise ValueError("MOTHERDUCK_TOKEN environment variable is required for md: connections.")
                conn = duckdb.connect(f"{connection_target}?motherduck_token={token}")
                logger.info(f"Connected to MotherDuck database with target: {connection_target}")
            
            # Handle URLs for dynamic table creation
            elif connection_target.startswith(("http://", "https://")):
                conn = duckdb.connect(":memory:")
                table_name = "auto_table"
                conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM '{connection_target}'")
                logger.info(f"Connected to in-memory DuckDB and created table from URL: {connection_target}")
            
            else:
                raise ValueError(f"Unsupported connection target format: {connection_target}")
            
            return conn
        except Exception as e:
            logger.error(f"Failed to establish DuckDB connection for target '{connection_target}': {e}")
            raise


    def query(self, sql: str):
        if not self._instance:
            raise RuntimeError("No active DuckDB connection. Call _connect() first.")
        try:
            return self._instance.execute(sql).df()
        except Exception as e:
            logger.error(f"Failed to execute query: {e}")
            raise

Example: NL Railway Data with st.connection and DuckDBConnectionLite

Here's how we can connect to DuckDB and load the NL railway data using st.connection and a custom DuckDBConnectionLite class:

import streamlit as st
import duckdb
from pathlib import Path
from loguru import logger

import streamlit as st
from databooth.duckdb_connection_lite import DuckDBConnectionLite

def get_railway_connection():
    """
    Establishes a connection to an in-memory DuckDB database
    and loads the railway data using DuckDBConnectionLite.
    """
    try:
        # Create or retrieve the connection using st.connection
        conn = st.connection(
            name="railway_db",
            type=DuckDBConnectionLite,
            connection_target=":memory:"
        )
        
        # Load railway data into DuckDB tables
        conn._instance.sql("CREATE TABLE IF NOT EXISTS services AS SELECT * FROM 'https://blobs.duckdb.org/nl-railway/services-2024.csv.gz'")
        conn._instance.sql("CREATE TABLE IF NOT EXISTS stations AS SELECT * FROM 'https://blobs.duckdb.org/nl-railway/stations-2023-09.csv'")
        
        return conn
    except Exception as e:
        st.error(f"Failed to load data: {e}")
        return None

Example: Using the Connection in Streamlit

def main():
    st.title("NL Railway Data with DuckDB and Streamlit")

    db = get_railway_connection()
    if db:
        query = "SELECT * FROM services LIMIT 10"
        try:
            df = db.query(query)
            st.dataframe(df)
        except Exception as e:
            st.error(f"Query failed: {e}")

if __name__ == "__main__":
    main()

Benefits of Using st.connection

  • Simplified Code: st.connection reduces the amount of boilerplate code required to manage database connections
  • Automatic Caching: Streamlit automatically caches the connection object, improving performance by reusing existing connections. This is especially important for frequently accessed databases.
  • Centralised Configuration: Connection parameters can be stored in Streamlit's secrets.toml file, providing a secure and organised way to manage credentials.
  • Customisable Connections: You can integrate your own connection classes (like DuckDBConnectionLite) to extend st.connection's functionality and tailor it to your specific needs.
  • Improved Readability: The code becomes more readable and easier to maintain, as connection logic is encapsulated within the st.connection decorator.

Conclusion

Streamlit's st.connection API offers a standardised approach to managing DuckDB connections in Streamlit applications. By simplifying connection management, providing automatic caching, and promoting code readability, st.connection empowers developers to build more efficient and maintainable data-driven applications. Consider embracing st.connection and unlock the full potential of DuckDB within your Streamlit projects.

Keen to hear any feedback or questions you may have!

References

[1] https://docs.streamlit.io/develop/api-reference/connections/st.connection