DuckDB
DuckDB is an open-source, high-performance, in-process SQL database management system (DBMS) for analytics:
- Designed for OLAP - DuckDB is designed for online analytical processing (OLAP) workloads, rather than transactional (OLTP) applications.
- Embedded - DuckDB operates within the same process as your application or notebook, eliminating network overhead.
- Versatile - DuckDB can handle diverse data formats, such as CSV, JSON, Parquet, and Apache Arrow. It also integrates with databases like MySQL, SQLite, and Postgres.
- Easy to use - DuckDB provides a rich SQL dialect, with support for arbitrary and nested correlated subqueries, window functions, collations, and complex types.
- Fast - DuckDB can efficiently process and query gigabytes of data from various sources, thanks to its columnar-vectorized query engine.
- Embeddable - DuckDB enables users to analyze data at the edge, which can improve response times and preserve bandwidth.
Commands
brew install duckdb
Performance Optimization
Appender
If you're streaming data into DuckDB, individual INSERT statements quickly become a bottleneck.
DuckDB's Appender API bypasses the SQL layer entirely. No parsing, no query planning. You write directly to the columnar storage format, which means you can handle real-time ingestion without the usual speed/batch size trade-off.
Stream rows through a low-level API; data is buffered in batches before being written to disk. You're essentially using a binary protocol instead of SQL strings.
Good for:
- Kafka consumers or message queue ingestion
- Log aggregation pipelines
- IoT sensor data collection
- Any scenario where data arrives continuously
A few things to watch out for: the Appender is order- and type-sensitive, so you must match the table's columns exactly, in order, with no inference. A single constraint violation fails the entire batch, with no partial inserts. And each Appender instance writes to a single table.
Available in C, C++, Go, Java, and Rust. For batch ETL or small datasets, regular INSERT is simpler and fine. But for streaming? This is the tool.
Tutorials
Links
- My First Billion (of Rows) in DuckDB | by João Pedro | Towards Data Science
- How fast is DuckDB really? | Blog | Fivetran
- Benchmarking Ourselves over Time at DuckDB – DuckDB
- "One Size Fits All": An Idea Whose Time Has Come and Gone (Stonebraker & Çetintemel)
- GitHub - duckdb/duckdb: DuckDB is an analytical in-process SQL database management system
- DuckDB – An in-process SQL OLAP database management system
- GitHub - duckdb-in-action/examples
- Introduction to DuckDB: A Guide for Data Analysis | DataCamp
- Handling Billions of Rows with SQL in Minutes Using DuckDB | Towards Data Science
- QuackETL | DuckDB-Powered Lightweight ETL: An Extensible Framework for Seamless Data Integration - YouTube
- DuckDB in 100 Seconds - YouTube
- Announcing DuckDB 1.4.2 LTS – DuckDB