Exploring Streamlit's Connection API with DuckDB as the target
Exec Summary
In a recent article, Using DuckDB in Streamlit, the author explored integrating DuckDB with Streamlit for data analysis. While the article provides a functional approach, it overlooks Streamlit's built-in st.connection
API, a feature designed to simplify and enhance database interactions within Streamlit applications.
This post offers an alternative perspective, demonstrating how st.connection
streamlines DuckDB integration and unlocks several key advantages. We'll re-use the NL railway data example from the original article to illustrate the benefits of this approach.
The Traditional Approach (and its limitations)
The original article demonstrates connecting to DuckDB using standard Python code. While this works, it requires manual connection management, caching, and can lead to verbose code (notwithstanding the initial code required for the custom class).
st.connection
: A Simpler, More Powerful Solution
Streamlit's st.connection
API provides a more elegant and efficient way to manage database connections. It handles connection creation, caching, and resource management automatically, reducing boilerplate code and improving performance. Moreover it allows a consistent and unified method to connect to different database types.
DuckDBConnectionLite Class
DuckDB Connection Types
The DuckDBConnectionLite
class presented here extends Streamlit's BaseConnection
for DuckDB. It is designed to flexibly handle multiple connection scenarios:
- In-memory database: For quick, ephemeral analysis, use
:memory:
orNone
as the target (e.g.,connection_target=":memory:"
). - Local DuckDB file: To persist data, specify a file path ending in
.duckdb
(e.g.,connection_target="mydata.duckdb"
). This could easily be tweaked/extended for example to support SQLite. - MotherDuck cloud database: For cloud-hosted DuckDB, use the
md:
prefix and ensure you set yourMOTHERDUCK_TOKEN
environment variable (e.g.,connection_target="md:my_cloud_db"
). I typically use thepython-dotenv
package with a.env
file to manage such secrets (tokens). - Dynamic table from URL: Instantly create a table from a remote CSV or Parquet file by providing its URL (e.g.,
connection_target="https://example.com/data.csv"
).
This versatility allows you to adapt your Streamlit app to a wide range of data sources and deployment scenarios with minimal code changes. e.g. Swapping between a local database for development and a cloud-hosted database for production.
Note that the return type of the query
method is chosen to be a Pandas DataFrame, which is compatible with Streamlit's st.dataframe()
function.
import duckdb
import os
from pathlib import Path
from loguru import logger
from streamlit.connections import BaseConnection
class DuckDBConnectionLite(BaseConnection):
def _connect(self, connection_target: str = None) -> duckdb.DuckDBPyConnection:
try:
# Handle in-memory database
if connection_target in [None, "", "memory", ":memory:"]:
conn = duckdb.connect(":memory:")
logger.info("Connected to in-memory DuckDB database.")
# Handle file-based database
elif connection_target.endswith(".duckdb"):
db_path = Path(connection_target).resolve()
conn = duckdb.connect(str(db_path))
logger.info(f"Connected to DuckDB file at {db_path}.")
# Handle MotherDuck connection
elif connection_target.startswith("md:"):
token = os.getenv("MOTHERDUCK_TOKEN")
if not token:
raise ValueError("MOTHERDUCK_TOKEN environment variable is required for md: connections.")
conn = duckdb.connect(f"{connection_target}?motherduck_token={token}")
logger.info(f"Connected to MotherDuck database with target: {connection_target}")
# Handle URLs for dynamic table creation
elif connection_target.startswith(("http://", "https://")):
conn = duckdb.connect(":memory:")
table_name = "auto_table"
conn.execute(f"CREATE TABLE {table_name} AS SELECT * FROM '{connection_target}'")
logger.info(f"Connected to in-memory DuckDB and created table from URL: {connection_target}")
else:
raise ValueError(f"Unsupported connection target format: {connection_target}")
return conn
except Exception as e:
logger.error(f"Failed to establish DuckDB connection for target '{connection_target}': {e}")
raise
def query(self, sql: str):
if not self._instance:
raise RuntimeError("No active DuckDB connection. Call _connect() first.")
try:
return self._instance.execute(sql).df()
except Exception as e:
logger.error(f"Failed to execute query: {e}")
raise
Example: NL Railway Data with st.connection
and DuckDBConnectionLite
Here's how we can connect to DuckDB and load the NL railway data using st.connection
and a custom DuckDBConnectionLite
class:
import streamlit as st
import duckdb
from pathlib import Path
from loguru import logger
import streamlit as st
from databooth.duckdb_connection_lite import DuckDBConnectionLite
def get_railway_connection():
"""
Establishes a connection to an in-memory DuckDB database
and loads the railway data using DuckDBConnectionLite.
"""
try:
# Create or retrieve the connection using st.connection
conn = st.connection(
name="railway_db",
type=DuckDBConnectionLite,
connection_target=":memory:"
)
# Load railway data into DuckDB tables
conn._instance.sql("CREATE TABLE IF NOT EXISTS services AS SELECT * FROM 'https://blobs.duckdb.org/nl-railway/services-2024.csv.gz'")
conn._instance.sql("CREATE TABLE IF NOT EXISTS stations AS SELECT * FROM 'https://blobs.duckdb.org/nl-railway/stations-2023-09.csv'")
return conn
except Exception as e:
st.error(f"Failed to load data: {e}")
return None
Example: Using the Connection in Streamlit
def main():
st.title("NL Railway Data with DuckDB and Streamlit")
db = get_railway_connection()
if db:
query = "SELECT * FROM services LIMIT 10"
try:
df = db.query(query)
st.dataframe(df)
except Exception as e:
st.error(f"Query failed: {e}")
if __name__ == "__main__":
main()
Benefits of Using st.connection
- Simplified Code:
st.connection
reduces the amount of boilerplate code required to manage database connections - Automatic Caching: Streamlit automatically caches the connection object, improving performance by reusing existing connections. This is especially important for frequently accessed databases.
- Centralised Configuration: Connection parameters can be stored in Streamlit's
secrets.toml
file, providing a secure and organised way to manage credentials. - Customisable Connections: You can integrate your own connection classes (like
DuckDBConnectionLite
) to extendst.connection
's functionality and tailor it to your specific needs. - Improved Readability: The code becomes more readable and easier to maintain, as connection logic is encapsulated within the
st.connection
decorator.
Conclusion
Streamlit's st.connection
API offers a standardised approach to managing DuckDB connections in Streamlit applications. By simplifying connection management, providing automatic caching, and promoting code readability, st.connection
empowers developers to build more efficient and maintainable data-driven applications. Consider embracing st.connection
and unlock the full potential of DuckDB within your Streamlit projects.
Keen to hear any feedback or questions you may have!
References
[1] https://docs.streamlit.io/develop/api-reference/connections/st.connection