DuckDB – NHS Quickstart
🦆 In-process SQL · Parquet/CSV · Zero server · Fast prototypes & validation
Why DuckDB in the NHS
DuckDB gives you fast SQL over files (CSV/Parquet) with no server to run. It's ideal for quick audits, reproducible examples, schema checks before data lands in SQL Server, and lightweight dashboards.
Great for: BI Analyst · Data Scientist · Clinician-Researcher.
⚙️ 10-minute install
python -m venv .venv && . .venv/bin/activate # Windows: .venv\Scripts\activate
pip install duckdb pandas pyarrow
pyarrow speeds up Parquet/Arrow IO.
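It also unlocks DuckDB's Arrow integration: query results convert straight to Arrow tables. A quick smoke test of the install (a sketch; the file name is illustrative):
arrow_check.py
import duckdb

# Results convert to a pyarrow.Table without a detour through pandas
tbl = duckdb.sql("SELECT 42 AS answer").arrow()
print(tbl)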
👋 “Hello NHS” – three ways
- A) Python dataframe β SQL
- B) Query Parquet/CSV files
- C) CLI only (no Python)
hello_duck.py
import duckdb, pandas as pd
df = pd.DataFrame({
"practice_id": ["A","B","C"],
"appointments": [120, 140, 128]
})
# Register dataframe then run SQL
duckdb.register("t", df)
out = duckdb.sql("SELECT practice_id, appointments FROM t ORDER BY appointments DESC").df()
print(out)
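Registering is optional: DuckDB's replacement scans let SQL reference an in-scope dataframe by its Python variable name, so this also works appended to hello_duck.py:
# No register() needed: DuckDB resolves the local variable `df` by name
out2 = duckdb.sql("SELECT * FROM df WHERE appointments > 125").df()
print(out2)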
files_query.py
import duckdb
# Single file
print(duckdb.sql("""
SELECT * FROM 'data/appointments.parquet' WHERE appointments > 100
""" ).df())
# Folder or glob of files
print(duckdb.sql("""
SELECT month, SUM(appointments) AS total_appts
FROM read_parquet('data/kpi/*.parquet')
GROUP BY month ORDER BY month
""" ).df())
# Start an interactive DuckDB shell
duckdb nhs.duckdb
-- inside the shell
CREATE TABLE appts AS SELECT * FROM read_csv_auto('data/appointments.csv');
SELECT practice_id, SUM(appointments) AS total FROM appts GROUP BY practice_id;
COPY (SELECT * FROM appts ORDER BY appointments DESC) TO 'out/appts.parquet' (FORMAT PARQUET);
.quit
🔄 From SQL Server → Parquet → DuckDB (safe prototyping)
Extract a de-identified slice once, then iterate locally at speed.
extract_sqlserver_to_parquet.py
import os, urllib.parse, pandas as pd
from sqlalchemy import create_engine
params = urllib.parse.quote_plus(
"DRIVER={ODBC Driver 18 for SQL Server};"
f"SERVER={os.getenv('SQLSERVER_SERVER')};"
f"DATABASE={os.getenv('SQLSERVER_DATABASE')};"
"Trusted_Connection=Yes;Encrypt=Yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
sql = """
SELECT practice_id,
CAST(CONVERT(char(7), start_time, 126) AS varchar(7)) AS month,
COUNT(*) AS appointments
FROM dbo.appointments
WHERE start_time >= DATEADD(MONTH,-3,SYSUTCDATETIME())
GROUP BY practice_id, CONVERT(char(7), start_time, 126);
"""
df = pd.read_sql(sql, engine)
os.makedirs("data", exist_ok=True)
df.to_parquet("data/kpi_3m.parquet", index=False)
print("Wrote data/kpi_3m.parquet", df.shape)
Then analyse with DuckDB:
analyse_in_duckdb.py
import duckdb
q = duckdb.sql("""
SELECT month, practice_id, appointments,
appointments / NULLIF(SUM(appointments) OVER (PARTITION BY month),0) AS share
FROM read_parquet('data/kpi_3m.parquet')
ORDER BY month, practice_id
""" ).df()
print(q.head())
✅ Lightweight data checks
Use DuckDB SQL to catch problems early.
checks.sql
-- 1) Rowcount by month (look for gaps)
SELECT month, COUNT(*) AS n
FROM read_parquet('data/kpi_3m.parquet') GROUP BY month ORDER BY month;
-- 2) Nulls where not expected
SELECT COUNT(*) AS null_practice
FROM read_parquet('data/kpi_3m.parquet') WHERE practice_id IS NULL;
-- 3) Simple range checks
SELECT COUNT(*) AS neg_appointments
FROM read_parquet('data/kpi_3m.parquet') WHERE appointments < 0;
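For the schema checks mentioned earlier, DuckDB's DESCRIBE and SUMMARIZE statements give a fast overview before anything lands in SQL Server (a sketch; file name illustrative):
schema_peek.py
import duckdb

# Inferred column names and types
print(duckdb.sql("DESCRIBE SELECT * FROM 'data/kpi_3m.parquet'").df())
# Per-column profile: min, max, null percentage, approximate distinct counts
print(duckdb.sql("SUMMARIZE SELECT * FROM 'data/kpi_3m.parquet'").df())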
You can save each check as a separate .sql and run via the CLI in CI.
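If you'd rather not depend on the duckdb CLI binary in CI, a small Python runner does the same job. The layout (checks/*.sql, one statement per file) and the failure convention are illustrative:
run_checks.py
import glob
import sys
from pathlib import Path
import duckdb

failed = False
for path in sorted(glob.glob("checks/*.sql")):
    result = duckdb.sql(Path(path).read_text()).df()
    print(f"== {path}")
    print(result)
    # Our convention: a 1x1 result holding a count > 0 means the check failed
    if result.shape == (1, 1) and int(result.iloc[0, 0]) > 0:
        failed = True
sys.exit(1 if failed else 0)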
🧳 Save results for dashboards
publish_for_dash.py
import duckdb, os
os.makedirs("out", exist_ok=True)
duckdb.sql("""
COPY (
SELECT month, practice_id, appointments
FROM read_parquet('data/kpi_3m.parquet')
ORDER BY month, practice_id
) TO 'out/kpi_clean.parquet' (FORMAT PARQUET);
""" )
Point Dash/Evidence.dev at out/kpi_clean.parquet for fast load times.
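If the dashboard filters by month, push the filter into the scan instead of loading the whole file. A minimal sketch (the function name and example month value are illustrative):
import duckdb

def load_month(month: str):
    # Parameterised query; the filter is applied during the Parquet scan
    return duckdb.execute(
        "SELECT practice_id, appointments FROM 'out/kpi_clean.parquet' WHERE month = ?",
        [month],
    ).df()

print(load_month("2024-01"))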
📌 Tips & patterns
- Prefer Parquet for Trust exports (columnar + compressed).
- Use read_parquet('folder/*.parquet') for partitioned data.
- Keep a small synthetic sample in the repo for demos/tests (see the generator sketch after this list).
- Version your analysis SQL files; run them in CI for quick validation.
- For SQL-heavy workflows, consider dbt-duckdb locally, then port final models to SQL Server views.
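A throwaway generator for that synthetic sample (a sketch; the seed, sizes, and paths are illustrative):
make_sample.py
import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed keeps demo output reproducible
months = ["2024-01", "2024-02", "2024-03"]
practices = [f"P{i:03d}" for i in range(1, 11)]

sample = pd.DataFrame(
    [(m, p, int(rng.integers(50, 200))) for m in months for p in practices],
    columns=["month", "practice_id", "appointments"],
)
os.makedirs("data/sample", exist_ok=True)
sample.to_parquet("data/sample/kpi_synthetic.parquet", index=False)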
🛡️ IG & safety checklist
- Use de-identified/synthetic extracts for local work.
- Keep secrets out of notebooks/scripts; use env vars or secret stores.
- Avoid exporting free-text; apply small-number suppression before sharing (see the sketch after this list).
- Document data sources, owners, and update cadence in the repo README.
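For that suppression step, a minimal pass that masks small counts before anything leaves the analysis environment (the threshold of 5 is an assumption; confirm your Trust's IG rule):
suppress_small_numbers.py
import os
import duckdb

os.makedirs("out", exist_ok=True)
suppressed = duckdb.sql("""
    SELECT month, practice_id,
           -- Mask counts of 1-4; 5 is an assumed threshold, check local IG policy
           CASE WHEN appointments < 5 THEN NULL ELSE appointments END AS appointments
    FROM read_parquet('data/kpi_3m.parquet')
""").df()
suppressed.to_parquet("out/kpi_suppressed.parquet", index=False)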
📊 Measuring impact
- Velocity: time from data drop → first chart/insight.
- Quality: validation checks passing in CI.
- Reusability: number of analyses that run from a clean clone.
- Cost: zero server spend for prototypes; minimal storage via Parquet.
🔗 See also
What's next?
You've completed the Learn → DuckDB stage. Keep momentum: