DuckDB – NHS Quickstart
🦆 In-process SQL · Parquet/CSV · Zero server · Fast prototypes & validation
Why DuckDB in the NHS
DuckDB gives you fast SQL over files (CSV/Parquet) with no server to run. It's ideal for quick audits, reproducible examples, schema checks before data lands in SQL Server, and lightweight dashboards.
Great for: BI Analyst · Data Scientist · Clinician-Researcher.
⚙️ 10-minute install
python -m venv .venv && . .venv/bin/activate # Windows: .venv\Scripts\activate
pip install duckdb pandas pyarrow
pyarrow speeds up Parquet/Arrow IO.
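It also unlocks DuckDB's Arrow integration: query results convert straight to Arrow tables. A quick smoke test of the install (a sketch; the file name is illustrative):
arrow_check.py
import duckdb

# Results convert to a pyarrow.Table without a detour through pandas
tbl = duckdb.sql("SELECT 42 AS answer").arrow()
print(tbl)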
👋 “Hello NHS” – three ways
- A) Python dataframe β SQL
- B) Query Parquet/CSV files
- C) CLI only (no Python)
hello_duck.py
import duckdb, pandas as pd
df = pd.DataFrame({
"practice_id": ["A","B","C"],
"appointments": [120, 140, 128]
})
# Register dataframe then run SQL
duckdb.register("t", df)
out = duckdb.sql("SELECT practice_id, appointments FROM t ORDER BY appointments DESC").df()
print(out)
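Registering is optional: DuckDB's replacement scans let SQL reference an in-scope dataframe by its Python variable name, so this also works appended to hello_duck.py:
# No register() needed: DuckDB resolves the local variable `df` by name
out2 = duckdb.sql("SELECT * FROM df WHERE appointments > 125").df()
print(out2)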
files_query.py
import duckdb
# Single file
print(duckdb.sql("""
SELECT * FROM 'data/appointments.parquet' WHERE appointments > 100
""" ).df())
# Folder or glob of files
print(duckdb.sql("""
SELECT month, SUM(appointments) AS total_appts
FROM read_parquet('data/kpi/*.parquet')
GROUP BY month ORDER BY month
""" ).df())
# Start an interactive DuckDB shell
duckdb nhs.duckdb
-- inside the shell
CREATE TABLE appts AS SELECT * FROM read_csv_auto('data/appointments.csv');
SELECT practice_id, SUM(appointments) AS total FROM appts GROUP BY practice_id;
COPY (SELECT * FROM appts ORDER BY appointments DESC) TO 'out/appts.parquet' (FORMAT PARQUET);
.quit
🔄 From SQL Server → Parquet → DuckDB (safe prototyping)
Extract a de-identified slice once, then iterate locally at speed.
extract_sqlserver_to_parquet.py
import os, urllib.parse, pandas as pd
from sqlalchemy import create_engine
params = urllib.parse.quote_plus(
"DRIVER={ODBC Driver 18 for SQL Server};"
f"SERVER={os.getenv('SQLSERVER_SERVER')};"
f"DATABASE={os.getenv('SQLSERVER_DATABASE')};"
"Trusted_Connection=Yes;Encrypt=Yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
sql = """
SELECT practice_id,
CAST(CONVERT(char(7), start_time, 126) AS varchar(7)) AS month,
COUNT(*) AS appointments
FROM dbo.appointments
WHERE start_time >= DATEADD(MONTH,-3,SYSUTCDATETIME())
GROUP BY practice_id, CONVERT(char(7), start_time, 126);
"""
df = pd.read_sql(sql, engine)
os.makedirs("data", exist_ok=True)
df.to_parquet("data/kpi_3m.parquet", index=False)
print("Wrote data/kpi_3m.parquet", df.shape)
Then analyse with DuckDB:
analyse_in_duckdb.py
import duckdb
q = duckdb.sql("""
SELECT month, practice_id, appointments,
appointments / NULLIF(SUM(appointments) OVER (PARTITION BY month),0) AS share
FROM read_parquet('data/kpi_3m.parquet')
ORDER BY month, practice_id
""" ).df()
print(q.head())
✅ Lightweight data checks
Use DuckDB SQL to catch problems early.
checks.sql
-- 1) Rowcount by month (look for gaps)
SELECT month, COUNT(*) AS n
FROM read_parquet('data/kpi_3m.parquet') GROUP BY month ORDER BY month;
-- 2) Nulls where not expected
SELECT COUNT(*) AS null_practice
FROM read_parquet('data/kpi_3m.parquet') WHERE practice_id IS NULL;
-- 3) Simple range checks
SELECT COUNT(*) AS neg_appointments
FROM read_parquet('data/kpi_3m.parquet') WHERE appointments < 0;
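For the schema checks mentioned earlier, DuckDB's DESCRIBE and SUMMARIZE statements give a fast overview before anything lands in SQL Server (a sketch; file name illustrative):
schema_peek.py
import duckdb

# Inferred column names and types
print(duckdb.sql("DESCRIBE SELECT * FROM 'data/kpi_3m.parquet'").df())
# Per-column profile: min, max, null percentage, approximate distinct counts
print(duckdb.sql("SUMMARIZE SELECT * FROM 'data/kpi_3m.parquet'").df())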
You can save each check as a separate .sql and run via the CLI in CI.
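If you'd rather not depend on the duckdb CLI binary in CI, a small Python runner does the same job. The layout (checks/*.sql, one statement per file) and the failure convention are illustrative:
run_checks.py
import glob
import sys
from pathlib import Path
import duckdb

failed = False
for path in sorted(glob.glob("checks/*.sql")):
    result = duckdb.sql(Path(path).read_text()).df()
    print(f"== {path}")
    print(result)
    # Our convention: a 1x1 result holding a count > 0 means the check failed
    if result.shape == (1, 1) and int(result.iloc[0, 0]) > 0:
        failed = True
sys.exit(1 if failed else 0)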
🧳 Save results for dashboards
publish_for_dash.py
import duckdb, os
os.makedirs("out", exist_ok=True)
duckdb.sql("""
COPY (
SELECT month, practice_id, appointments
FROM read_parquet('data/kpi_3m.parquet')
ORDER BY month, practice_id
) TO 'out/kpi_clean.parquet' (FORMAT PARQUET);
""" )
Point Dash/Evidence.dev at out/kpi_clean.parquet for fast load times.
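If the dashboard filters by month, push the filter into the scan instead of loading the whole file. A minimal sketch (the function name and example month value are illustrative):
import duckdb

def load_month(month: str):
    # Parameterised query; the filter is applied during the Parquet scan
    return duckdb.execute(
        "SELECT practice_id, appointments FROM 'out/kpi_clean.parquet' WHERE month = ?",
        [month],
    ).df()

print(load_month("2024-01"))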
📌 Tips & patterns
- Prefer Parquet for Trust exports (columnar + compressed).
- Use read_parquet('folder/*.parquet') for partitioned data.
- Keep a small synthetic sample in the repo for demos/tests (see the generator sketch after this list).
- Version your analysis SQL files; run them in CI for quick validation.
- For SQL-heavy workflows, consider dbt-duckdb locally, then port final models to SQL Server views.
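A throwaway generator for that synthetic sample (a sketch; the seed, sizes, and paths are illustrative):
make_sample.py
import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed keeps demo output reproducible
months = ["2024-01", "2024-02", "2024-03"]
practices = [f"P{i:03d}" for i in range(1, 11)]

sample = pd.DataFrame(
    [(m, p, int(rng.integers(50, 200))) for m in months for p in practices],
    columns=["month", "practice_id", "appointments"],
)
os.makedirs("data/sample", exist_ok=True)
sample.to_parquet("data/sample/kpi_synthetic.parquet", index=False)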
🛡️ IG & safety checklist
- Use de-identified/synthetic extracts for local work.
- Keep secrets out of notebooks/scripts; use env vars or secret stores.
- Avoid exporting free-text; apply small-number suppression before sharing (see the sketch after this list).
- Document data sources, owners, and update cadence in the repo README.
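For that suppression step, a minimal pass that masks small counts before anything leaves the analysis environment (the threshold of 5 is an assumption; confirm your Trust's IG rule):
suppress_small_numbers.py
import os
import duckdb

os.makedirs("out", exist_ok=True)
suppressed = duckdb.sql("""
    SELECT month, practice_id,
           -- Mask counts of 1-4; 5 is an assumed threshold, check local IG policy
           CASE WHEN appointments < 5 THEN NULL ELSE appointments END AS appointments
    FROM read_parquet('data/kpi_3m.parquet')
""").df()
suppressed.to_parquet("out/kpi_suppressed.parquet", index=False)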
📊 Measuring impact
- Velocity: time from data drop → first chart/insight.
- Quality: validation checks passing in CI.
- Reusability: number of analyses that run from a clean clone.
- Cost: zero server spend for prototypes; minimal storage via Parquet.
🔗 See also
What's next?
You've completed the Learn → DuckDB stage. Keep momentum: