A Beginner’s Guide to Building a DuckDB-Python Analysis Pipeline with SQL, Data Frames, Parquet, UDFs, and Performance Profiling

pleasuremandarya@gmail.com 13/04/2026

0 17 12 minutes read

A Beginner’s Guide to Building a DuckDB-Python Analysis Pipeline with SQL, Data Frames, Parquet, UDFs, and Performance Profiling

In this course, we build a comprehensive, hands-on understanding DuckDB-Python by using its features directly through code in Colab. We start with the basics of connection management and data generation, then move on to real analysis workflows, including Pandas queries, Polars, and Arrows objects without manual loading, converting results to multiple formats, and writing plain SQL for windowing functions, pivots, macros, recursive CTEs, and joins. As we progress, we also explore performance-oriented capabilities such as batching, profiling, partitioned storage, multi-threaded access, remote file querying, and efficient deployment patterns, so we can learn not only what DuckDB can do, but also how to use it as a deep analysis engine within Python.

import subprocess, sys


for pkg in ["duckdb", "pandas", "pyarrow", "polars"]:
   try:
       subprocess.check_call(
           [sys.executable, "-m", "pip", "install", "-q", pkg],
           stderr=subprocess.DEVNULL,
       )
   except subprocess.CalledProcessError:
       subprocess.check_call(
           [sys.executable, "-m", "pip", "install", "-q", "--break-system-packages", pkg],
           stderr=subprocess.DEVNULL,
       )


import duckdb
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
import numpy as np
import json, os, time, threading, tempfile
from datetime import date, datetime, timedelta


print(f"DuckDB version : {duckdb.__version__}")
print(f"Pandas version : {pd.__version__}")
print(f"PyArrow version: {pa.__version__}")
print(f"Polars version : {pl.__version__}")
print("=" * 72)


WORKDIR = tempfile.mkdtemp(prefix="duckdb_tutorial_")
os.chdir(WORKDIR)
print(f"Working directory: {WORKDIR}n")


print("=" * 72)
print("SECTION 1: Connection Management")
print("=" * 72)


con = duckdb.connect()
print(con.sql("SELECT 'Hello from in-memory DuckDB!' AS greeting").fetchone()[0])


DB_PATH = os.path.join(WORKDIR, "tutorial.duckdb")
pcon = duckdb.connect(DB_PATH)
pcon.sql("CREATE OR REPLACE TABLE persisted(id INT, val TEXT)")
pcon.sql("INSERT INTO persisted VALUES (1,'alpha'), (2,'beta')")
print("Persisted rows:", pcon.sql("SELECT count(*) FROM persisted").fetchone()[0])
pcon.close()


pcon2 = duckdb.connect(DB_PATH)
print("After re-open :", pcon2.sql("SELECT * FROM persisted ORDER BY id").fetchall())
pcon2.close()


con_cfg = duckdb.connect(config={
   "threads": 2,
   "memory_limit": "512MB",
   "default_order": "DESC",
})
print("Configured threads:", con_cfg.sql("SELECT current_setting('threads')").fetchone()[0])
con_cfg.close()


with duckdb.connect() as ctx:
   ctx.sql("SELECT 42 AS answer").show()


print()


print("=" * 72)
print("SECTION 2: Synthetic Data Generation")
print("=" * 72)


con = duckdb.connect()


con.sql("""
   CREATE OR REPLACE TABLE sales AS
   SELECT
       i                                                     AS order_id,
       '2023-01-01'::DATE + (i % 365)::INT                  AS order_date,
       CASE (i % 5)
           WHEN 0 THEN 'Electronics'
           WHEN 1 THEN 'Clothing'
           WHEN 2 THEN 'Groceries'
           WHEN 3 THEN 'Furniture'
           ELSE        'Books'
       END                                                   AS category,
       ROUND(10 + random() * 990, 2)                         AS amount,
       CASE (i % 3)
           WHEN 0 THEN 'US'
           WHEN 1 THEN 'EU'
           ELSE        'APAC'
       END                                                   AS region,
       CASE WHEN random() < 0.1 THEN TRUE ELSE FALSE END    AS returned
   FROM generate_series(1, 100000) t(i)
""")


con.sql("SUMMARIZE sales").show()
print()


print("=" * 72)
print("SECTION 3: Zero-Copy DataFrame Integration")
print("=" * 72)


pdf = pd.DataFrame({
   "product": ["Widget", "Gadget", "Doohickey", "Thingamajig"],
   "price":   [9.99, 24.50, 4.75, 15.00],
   "stock":   [120, 45, 300, 78],
})
print("Query Pandas DF directly:")
con.sql("SELECT product, price * stock AS inventory_value FROM pdf ORDER BY inventory_value DESC").show()


plf = pl.DataFrame({
   "city":   ["Montreal", "Toronto", "Vancouver", "Calgary"],
   "temp_c": [-12.5, -5.0, 3.0, -18.0],
})
print("Query Polars DF directly:")
con.sql("SELECT city, temp_c, temp_c * 9/5 + 32 AS temp_f FROM plf WHERE temp_c < 0").show()


arrow_tbl = pa.table({
   "sensor_id": [1, 2, 3, 4, 5],
   "reading":   [23.1, 47.8, 12.3, 99.0, 55.5],
})
print("Query Arrow table directly:")
con.sql("SELECT sensor_id, reading FROM arrow_tbl WHERE reading > 30").show()


print()


print("=" * 72)
print("SECTION 4: Result Conversion")
print("=" * 72)


q = "SELECT category, SUM(amount) AS total FROM sales GROUP BY category ORDER BY total DESC"


print("→ Python list  :", con.sql(q).fetchall()[:2], "...")
print("→ Pandas DF    :n", con.sql(q).df().head(3))
print("→ Polars DF    :n", con.sql(q).pl().head(3))
print("→ Arrow Table  :", con.sql(q).arrow().schema)
print("→ NumPy arrays :", {k: v[:2] for k, v in con.sql(q).fetchnumpy().items()})
print()

Set up a full DuckDB-Python environment by installing the necessary libraries and importing all the necessary tutorial modules. We create our working directory, set up a DuckDB connection, and test both in-memory and persistent data usage and basic configuration options. We’re also generating large synthetic sales datasets and starting work with direct integration of DuckDB with Pandas, Polars, and PyArrow, showing how DuckDB naturally fits into Python-based workflows.

print("=" * 72)
print("SECTION 5: Relational API")
print("=" * 72)


rel = (
   con.table("sales")
      .filter("NOT returned")
      .aggregate("category, region, SUM(amount) AS revenue, COUNT(*) AS orders")
      .filter("revenue > 1000000")
      .order("revenue DESC")
      .limit(10)
)
print("Relational API result:")
rel.show()


top_cats = con.sql("SELECT DISTINCT category FROM sales ORDER BY category LIMIT 3")
print("Top categories relation fed into next query:")
con.sql("SELECT s.* FROM sales s SEMI JOIN top_cats ON s.category = top_cats.category LIMIT 5").show()


print()


print("=" * 72)
print("SECTION 6: Window Functions & Advanced SQL")
print("=" * 72)


con.sql("""
   WITH daily AS (
       SELECT
           order_date,
           region,
           SUM(amount) AS daily_rev
       FROM sales
       WHERE NOT returned
       GROUP BY order_date, region
   )
   SELECT
       order_date,
       region,
       daily_rev,
       SUM(daily_rev) OVER (
           PARTITION BY region ORDER BY order_date
       ) AS cum_revenue,
       AVG(daily_rev) OVER (
           PARTITION BY region ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
   FROM daily
   QUALIFY row_number() OVER (PARTITION BY region ORDER BY order_date DESC) <= 3
   ORDER BY region, order_date DESC
""").show()


print("PIVOT table:")
con.sql("""
   PIVOT sales
   ON region
   USING SUM(amount)
   GROUP BY category
   ORDER BY category
""").show()


print()


print("=" * 72)
print("SECTION 7: Complex / Nested Types")
print("=" * 72)


con.sql("""
   CREATE OR REPLACE TABLE users AS
   SELECT
       i AS user_id,
       {'first': 'User_' || i::TEXT, 'last': 'Surname_' || (i % 100)::TEXT}  AS name,
       [i * 10, i * 20, i * 30]                                              AS scores,
       MAP {'tier': CASE WHEN i % 2 = 0 THEN 'gold' ELSE 'silver' END,
            'region': CASE WHEN i % 3 = 0 THEN 'US' ELSE 'EU' END}          AS metadata
   FROM generate_series(1, 5) t(i)
""")


print("Struct field access, list indexing, map extraction:")
con.sql("""
   SELECT
       user_id,
       name.first                  AS first_name,
       scores[1]                   AS first_score,
       list_aggregate(scores,'sum') AS total_score,
       metadata['tier']            AS tier
   FROM users
""").show()


print("Unnesting a list column:")
con.sql("""
   SELECT user_id, unnest(scores) AS individual_score
   FROM users
   WHERE user_id <= 3
""").show()


print()


print("=" * 72)
print("SECTION 8: Python UDFs")
print("=" * 72)


def celsius_to_fahrenheit(c):
   return c * 9 / 5 + 32


con.create_function("c2f", celsius_to_fahrenheit, ["DOUBLE"], "DOUBLE")
con.sql("SELECT city, temp_c, c2f(temp_c) AS temp_f FROM plf").show()


import pyarrow.compute as pc


def vectorized_discount(prices):
   """Apply a 15% discount to all prices."""
   return pc.multiply(prices, 0.85)


con.create_function(
   "discount",
   vectorized_discount,
   ["DOUBLE"],
   "DOUBLE",
   type="arrow",
)
print("Vectorized UDF (discount):")
con.sql("SELECT product, price, discount(price) AS sale_price FROM pdf").show()


print()


print("=" * 72)
print("SECTION 9: File I/O")
print("=" * 72)


con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.csv'  (HEADER, DELIMITER ',')")
con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.parquet' (FORMAT PARQUET)")
con.sql("COPY (SELECT * FROM sales LIMIT 100)  TO 'sales_sample.json'    (FORMAT JSON, ARRAY true)")


print("Files written:", [f for f in os.listdir('.') if 'sales_sample' in f])


print("nCSV row count     :", con.sql("SELECT count(*) FROM 'sales_sample.csv'").fetchone()[0])
print("Parquet row count :", con.sql("SELECT count(*) FROM 'sales_sample.parquet'").fetchone()[0])
print("JSON row count    :", con.sql("SELECT count(*) FROM 'sales_sample.json'").fetchone()[0])


print("nParquet file metadata:")
con.sql("SELECT * FROM parquet_metadata('sales_sample.parquet')").show()


print()

We go from a basic query to more expressive analysis patterns using DuckDB’s relational API and advanced SQL features. We work with sorted and aggregated sales relationships, use window functions to get cumulative and collapsible statistics, and reshape data using a pivot table for cross-sectional analysis. We also explore complex nested types, listing Python UDFs, and file I/O, which helps us see how DuckDB handles both structured analysis and virtual data engineering tasks in one place.

print("=" * 72)
print("SECTION 10: Hive-Partitioned Parquet")
print("=" * 72)


con.sql("""
   COPY sales TO 'partitioned_data' (
       FORMAT PARQUET,
       PARTITION_BY (region, category),
       OVERWRITE_OR_IGNORE
   )
""")
print("Partitioned directory tree (first 15 items):")
for i, (root, dirs, files) in enumerate(os.walk("partitioned_data")):
   for f in files:
       fp = os.path.join(root, f)
       print(f"  {fp}")
   if i > 15:
       print("  ...")
       break


print("nReading partitioned data back (US + Electronics only):")
con.sql("""
   SELECT count(*) AS cnt, MIN(amount), MAX(amount)
   FROM read_parquet('partitioned_data/**/*.parquet', hive_partitioning = true)
   WHERE region = 'US' AND category = 'Electronics'
""").show()


print()


print("=" * 72)
print("SECTION 11: Prepared Statements")
print("=" * 72)


result = con.execute(
   "SELECT * FROM sales WHERE category = $1 AND amount > $2 LIMIT 5",
   ["Electronics", 500.0]
).fetchdf()
print("Parameterized query result:n", result)


con.sql("SET VARIABLE target_region = 'EU'")
con.sql("""
   SELECT category, AVG(amount) AS avg_amt
   FROM sales
   WHERE region = getvariable('target_region')
   GROUP BY category
   ORDER BY avg_amt DESC
""").show()


print()


print("=" * 72)
print("SECTION 12: Transactions")
print("=" * 72)


con.sql("CREATE OR REPLACE TABLE accounts(id INT, balance DOUBLE)")
con.sql("INSERT INTO accounts VALUES (1, 1000), (2, 500)")


con.begin()
try:
   con.sql("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
   con.sql("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
   con.commit()
   print("Transaction committed successfully")
except Exception as e:
   con.rollback()
   print(f"Transaction rolled back: {e}")


con.sql("SELECT * FROM accounts").show()
print()


print("=" * 72)
print("SECTION 13: Appender (Bulk Insert)")
print("=" * 72)


con.sql("CREATE OR REPLACE TABLE fast_load(id INT, name TEXT, value DOUBLE)")


bulk_df = pd.DataFrame({
   "id":    range(50_000),
   "name":  [f"item_{i}" for i in range(50_000)],
   "value": [float(i) * 1.1 for i in range(50_000)],
})


start = time.perf_counter()
con.append("fast_load", bulk_df)
elapsed = time.perf_counter() - start


print(f"Bulk-inserted {con.sql('SELECT count(*) FROM fast_load').fetchone()[0]:,} rows in {elapsed:.4f}s")
con.sql("SELECT * FROM fast_load LIMIT 5").show()
print()


print("=" * 72)
print("SECTION 14: Replacement Scans (query any Python var)")
print("=" * 72)


my_dict_data = pd.DataFrame({
   "name": ["Alice", "Bob", "Charlie"],
   "age":  [30, 25, 35],
})
con.sql("SELECT * FROM my_dict_data WHERE age > 28").show()
print()

We focus on maintaining and implementing patterns that are particularly useful in real-world workflows, starting with Hive-style distributed Parquet output and selective learning from distributed data. We then use parameterized queries, runtime variables, and transaction control to make our queries safer, more powerful, and more reliable. Finally, we explore high-speed batch insertion by using the appender interface and use dynamic scanners to query Python objects directly by name, further reinforcing the tight integration of DuckDB with the Python runtime.

print("=" * 72)
print("SECTION 15: SQL Macros")
print("=" * 72)


con.sql("""
   CREATE OR REPLACE MACRO revenue_tier(amt) AS
       CASE
           WHEN amt > 800 THEN 'High'
           WHEN amt > 400 THEN 'Medium'
           ELSE                'Low'
       END
""")
print("Scalar macro:")
con.sql("SELECT category, amount, revenue_tier(amount) AS tier FROM sales LIMIT 8").show()


con.sql("""
   CREATE OR REPLACE MACRO top_by_category(cat, n) AS TABLE
       SELECT * FROM sales WHERE category = cat ORDER BY amount DESC LIMIT n
""")
print("Table macro — top 3 Electronics:")
con.sql("SELECT * FROM top_by_category('Electronics', 3)").show()
print()


print("=" * 72)
print("SECTION 16: Recursive CTE")
print("=" * 72)


con.sql("""
   CREATE OR REPLACE TABLE employees(id INT, name TEXT, manager_id INT);
   INSERT INTO employees VALUES
       (1, 'CEO',     NULL),
       (2, 'VP Eng',  1),
       (3, 'VP Sales',1),
       (4, 'Sr Eng',  2),
       (5, 'Jr Eng',  4),
       (6, 'Sales Rep', 3);
""")


print("Org chart via recursive CTE:")
con.sql("""
   WITH RECURSIVE org AS (
       SELECT id, name, manager_id, 0 AS depth, name AS path
       FROM employees WHERE manager_id IS NULL
       UNION ALL
       SELECT e.id, e.name, e.manager_id, o.depth + 1,
              o.path || ' → ' || e.name
       FROM employees e JOIN org o ON e.manager_id = o.id
   )
   SELECT repeat('  ', depth) || name AS hierarchy, path
   FROM org
   ORDER BY path
""").show()
print()


print("=" * 72)
print("SECTION 17: Full-Text Search (FTS)")
print("=" * 72)


try:
   con.install_extension("fts")
   con.load_extension("fts")


   con.sql("""
       CREATE OR REPLACE TABLE documents(id INT, body TEXT);
       INSERT INTO documents VALUES
           (1, 'DuckDB is a fast in-process analytical database'),
           (2, 'Python integration allows querying Pandas DataFrames'),
           (3, 'Parquet files can be read directly without loading'),
           (4, 'Window functions and CTEs make complex analytics easy'),
           (5, 'The columnar engine processes data blazingly fast');
   """)


   con.sql("PRAGMA create_fts_index('documents', 'id', 'body', stemmer="english")")


   print("FTS search for 'fast analytical':")
   con.sql("""
       SELECT id, body, fts_main_documents.match_bm25(id, 'fast analytical') AS score
       FROM documents
       WHERE score IS NOT NULL
       ORDER BY score DESC
   """).show()
except Exception as e:
   print(f"(Skipped — FTS extension not available: {e})")
print()


print("=" * 72)
print("SECTION 18: AsOf Joins")
print("=" * 72)


con.sql("""
   CREATE OR REPLACE TABLE stock_prices AS
   SELECT * FROM (VALUES
       ('2024-01-01 09:30'::TIMESTAMP, 'AAPL', 150.0),
       ('2024-01-01 10:00'::TIMESTAMP, 'AAPL', 152.5),
       ('2024-01-01 10:30'::TIMESTAMP, 'AAPL', 151.0),
       ('2024-01-01 11:00'::TIMESTAMP, 'AAPL', 153.0)
   ) AS t(ts, ticker, price);


   CREATE OR REPLACE TABLE trades AS
   SELECT * FROM (VALUES
       ('2024-01-01 09:45'::TIMESTAMP, 'AAPL', 100),
       ('2024-01-01 10:15'::TIMESTAMP, 'AAPL', 200),
       ('2024-01-01 10:50'::TIMESTAMP, 'AAPL', 150)
   ) AS t(trade_ts, ticker, shares);
""")


print("AsOf Join — match each trade to the most recent price:")
con.sql("""
   SELECT
       t.trade_ts,
       t.shares,
       p.ts   AS price_ts,
       p.price,
       t.shares * p.price AS trade_value
   FROM trades t
   ASOF JOIN stock_prices p
       ON t.ticker = p.ticker AND t.trade_ts >= p.ts
""").show()
print()

We build reusable query logic using scalars and table macros, reducing repetition and making our SQL more modular. We then create an iterative CTE to traverse the worker class, showing how DuckDB can handle structured recursive logic within SQL. Then, we explore full-text search and AsOf joins, showing that DuckDB supports not only standard analysis but also more advanced search and time-aware matching.

print("=" * 72)
print("SECTION 19: Query Profiling")
print("=" * 72)


print("EXPLAIN output:")
con.sql("EXPLAIN SELECT category, SUM(amount) FROM sales GROUP BY category").show()


con.sql("PRAGMA enable_profiling = 'json'")
con.sql("PRAGMA profiling_output="profile.json"")
con.sql("SELECT category, AVG(amount), STDDEV(amount) FROM sales GROUP BY category")
con.sql("PRAGMA disable_profiling")


if os.path.exists("profile.json"):
   with open("profile.json") as f:
       profile = json.load(f)
   print(f"Query timing: {profile.get('timing', 'N/A')}s")


print()


print("=" * 72)
print("SECTION 20: Multi-Threaded Queries")
print("=" * 72)


results = {}


def worker(thread_id, db_path=None):
   """Each thread creates its own connection."""
   local_con = duckdb.connect()
   local_con.sql(f"""
       CREATE TABLE t AS SELECT i, i*{thread_id} AS val
       FROM generate_series(1, 10000) t(i)
   """)
   total = local_con.sql("SELECT SUM(val) FROM t").fetchone()[0]
   results[thread_id] = total
   local_con.close()


threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(4)]
for t in threads:
   t.start()
for t in threads:
   t.join()


print("Thread results:", results)
print()


print("=" * 72)
print("SECTION 21: Quick Benchmark — DuckDB vs Pandas")
print("=" * 72)


N = 1_000_000
big_df = pd.DataFrame({
   "group": np.random.choice(["A","B","C","D","E"], N),
   "value": np.random.randn(N) * 100,
   "ts":    pd.date_range("2020-01-01", periods=N, freq="s"),
})


start = time.perf_counter()
_ = big_df.groupby("group")["value"].agg(["sum","mean","std","min","max"])
pd_time = time.perf_counter() - start


start = time.perf_counter()
_ = con.sql("""
   SELECT "group",
          SUM(value), AVG(value), STDDEV(value), MIN(value), MAX(value)
   FROM big_df
   GROUP BY "group"
""").fetchall()
duck_time = time.perf_counter() - start


print(f"Pandas  : {pd_time:.4f}s")
print(f"DuckDB  : {duck_time:.4f}s")
print(f"Speedup : {pd_time/duck_time:.1f}x")
print()


print("=" * 72)
print("SECTION 22: Reading Remote Parquet over HTTPS")
print("=" * 72)


try:
   con.install_extension("httpfs")
   con.load_extension("httpfs")


   print("Querying a remote Parquet file (DuckDB taxi sample):")
   con.sql("""
       SELECT count(*) AS total_rows
       FROM read_parquet(
           '
       )
   """).show()
except Exception as e:
   print(f"(Skipped — httpfs not available or network restricted: {e})")


print()


print("=" * 72)
print("SECTION 23: Custom Enum Types")
print("=" * 72)


con.sql("CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad')")
con.sql("""
   CREATE OR REPLACE TABLE survey(
       respondent_id INT,
       feeling       mood,
       comment       TEXT
   )
""")
con.sql("""
   INSERT INTO survey VALUES
       (1, 'happy',   'Great product!'),
       (2, 'neutral', 'It is okay'),
       (3, 'sad',     'Did not work'),
       (4, 'happy',   'Love it')
""")
con.sql("SELECT feeling, count(*) AS cnt FROM survey GROUP BY feeling ORDER BY feeling").show()
print()

We examine how DuckDB executes queries by using EXPLAIN programs and JSON profiler output to better understand performance behavior. We also demonstrate thread-based parallel implementations by creating separate DuckDB connections to multiple worker threads and securely collecting their results. To complete the performance angle, we put DuckDB against Pandas in a large group integration, and extend the workflow with Parquet remote access over HTTPS and the creation of a custom ENUM type.

print("=" * 72)
print("SECTION 24: Lambda Functions in SQL")
print("=" * 72)


con.sql("""
   SELECT
       [1, 2, 3, 4, 5]                                   AS original,
       list_transform([1, 2, 3, 4, 5], x -> x * x)       AS squared,
       list_filter([1, 2, 3, 4, 5], x -> x % 2 = 0)     AS evens_only,
       list_reduce([1, 2, 3, 4, 5], (a, b) -> a + b)     AS total
""").show()
print()


print("=" * 72)
print("SECTION 25: Multi-Format Export")
print("=" * 72)


summary_q = """
   SELECT
       category,
       region,
       COUNT(*)       AS orders,
       SUM(amount)    AS revenue,
       AVG(amount)    AS avg_order
   FROM sales
   WHERE NOT returned
   GROUP BY category, region
   ORDER BY revenue DESC
"""


con.sql(f"COPY ({summary_q}) TO 'summary.csv'     (HEADER)")
con.sql(f"COPY ({summary_q}) TO 'summary.parquet'  (FORMAT PARQUET, COMPRESSION ZSTD)")
con.sql(f"COPY ({summary_q}) TO 'summary.json'     (FORMAT JSON, ARRAY TRUE)")


sizes = {f: os.path.getsize(f) for f in ['summary.csv','summary.parquet','summary.json']}
print("File sizes:")
for name, sz in sizes.items():
   print(f"  {name:25s}  {sz:>8,} bytes")


print()


con.close()


print("=" * 72)
print("TUTORIAL COMPLETE")
print("=" * 72)
print("""
Sections covered:
 1.  Connection management (in-memory, persistent, config)
 2.  Synthetic data generation with generate_series()
 3.  Zero-copy querying of Pandas, Polars, PyArrow
 4.  Result conversion (.df(), .pl(), .arrow(), .fetchnumpy())
 5.  Relational API (filter / aggregate / order / limit)
 6.  Window functions, QUALIFY, PIVOT
 7.  Complex types (struct, list, map, unnest)
 8.  Python UDFs (scalar & vectorized)
 9.  File I/O (CSV, Parquet, JSON)
 10. Hive-partitioned Parquet writes & reads
 11. Prepared statements & parameterized queries
 12. Transactions (BEGIN / COMMIT / ROLLBACK)
 13. Appender for high-speed bulk insert
 14. Replacement scans (query Python dicts by name)
 15. SQL macros (scalar + table macros)
 16. Recursive CTEs
 17. Full-text search (FTS extension)
 18. AsOf joins
 19. Query profiling & EXPLAIN
 20. Multi-threaded access
 21. Performance benchmark (DuckDB vs Pandas)
 22. Remote Parquet over HTTPS (httpfs)
 23. Custom ENUM types
 24. Lambda functions in SQL
 25. Multi-format export with compression
""")

We explore DuckDB’s support for lambda-powered list transformations directly in SQL, providing a short way to manipulate list data within queries. We then export the summary analysis result to multiple file formats, including CSV, Parquet, and JSON, highlighting DuckDB’s flexibility in delivering the underlying data. We close the tutorial by cutting through and printing an organized recap of all the sections covered, giving a clear end-to-end picture of the full DuckDB-Python workflow.

In conclusion, we have provided a practical, code-level view of how DuckDB-Python fits into modern database workflows. We’ve seen that it’s not just an embedded SQL engine in Python, but a very flexible analysis system that works seamlessly with DataFrames, arrow tables, local files, remote datasets, custom functions, and advanced SQL architecture in one place. We’ve moved from simple in-memory queries to more production-related patterns like persistent databases, parameterized execution, transactions, partitioned Parquet, profiling, and performance measurement, helping us understand both the breadth and depth of DuckDB’s capabilities. And, we’ve come up with a reusable Colab workflow that shows us how to combine the Python ecosystem with DuckDB’s speed, SQL expression, and interoperability to build fast, efficient, and scalable data analysis pipelines.

Check out Full Use Codes here. Also, feel free to follow us Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? contact us

pleasuremandarya@gmail.com 13/04/2026

0 17 12 minutes read

A Beginner’s Guide to Building a DuckDB-Python Analysis Pipeline with SQL, Data Frames, Parquet, UDFs, and Performance Profiling

pleasuremandarya@gmail.com

Leave a Reply Cancel reply

Cybersecurity in 2026: AI Attacks, Identity‑First Defense, and the New Playbook for Resilience

Top 10 Best Earning Apps in 2026 (Legit, Paying Fast & With Official Download Links)

“Master Google AI Overviews in 2026 with the GEO Playbook”

How Agentic AI Will Change Workflows in 2026 (Complete Guide)

“12 Best Fully Funded Scholarships for International Students in 2026 (Apply Now with Official Links)”

Smartphone shipments fell by 11pc after persistent chip shortages

pleasuremandarya@gmail.com

OpenAI Revokes macOS App Certification After Malicious Axios Supply Incident

Mythos testing begins as governments raise cyber concerns

Related Articles

Meet Blume: An Open-Source, Zero-Config Documentation Framework That Exports AI-Ready Documents From a Markdown Folder

Anthropic Claude Sonnet 5 vs Sonnet 4.6 vs Opus 4.8: Agentic Coding Benchmarks, API Pricing, and Cost Performance Tradeoffs Compared

How MIT students help prevent cyber attacks | MIT News

Building a VideoAgent-Style Multi-Agent System: Objective Analysis, Scheduling Graphs, and Toolpaths for Video Editing Tasks