DuckDB — What’s the Hype About?
This was a blog post that I already planned to write during the spring when I saw that the hype around DuckDB started taking new heights. Since then the discussion around DuckDB has only intensified in the developer and data engineering community. I currently see two trends within the data community with high engagement levels: DuckDB and if Rust will take over data engineering. But what’s the hype around DuckDB really about? Let’s scratch the surface a little bit.
A lot of today’s acceleration in the data space can be coupled with the explosive rise of cloud data warehouses over the last few years. Cloud data warehouses have become the cornerstone of data stacks: companies and organizations of all sizes use a data warehouse to power analytics use cases. Snowflake’s meteoric rise — culminated by its blockbuster IPO in September 2020 that became the largest software IPO in history — has been the poster child of this trend.
When looking at the 3 Vs of Big Data (Velocity, Volume, Variety), many in the data community that I’ve spoken with lately have said that the most required dimension during the past years has been velocity.
As noted by Mehdi Ouazza (Staff Data Engineer at Trade Republic) — the truth is that everyone doesn’t have “Big” data — but a requirement for low latency consumption from micro-service on data assets processed out of your OLTP database is a common use case.
As Mehdi says, if one looks at some product trends (RocksDB, DuckDB, Clickhouse), they all provide an easier interface for low-latency consumption. Even some cloud data warehouse giants have invested in these applications, such as Snowflake Unistore.
However, the current cloud data warehouse paradigm is still heavily skewed for a client-server use case and ignores a growing segment of users. As noted by Tomasz Tunguz (investor at Redpoint Ventures):
“Most workloads aren’t massive. Instead of requiring a scale-out database in the sky, most analyses are faster with an optimized database on your computer that can leverage the cloud when needed.”
DuckDB is changing this.
As visible through the Google Trends search data above — during the past few months there has been a growing discussion and palpable hype around DuckDB in the data community.
The growing momentum
“I’ve heard a lot of good things about DuckDB lately and have found podcasts with both Jordan Tigani (founder of MotherDuck and BigQuery celebrity) and Hannes Mühleisen (creator founder of DuckDB Labs) really good. Hence I had to give it a try. My first program was to create a DuckDB table by reading directly from a BigQuery table using the BigQuery Storage Read API since it supports arrow tables (and no compute). Turned out to be really easy, sharing as a gist. Can’t wait to experiment some more with DuckDB and with bigger data volumes, it sure has huge potential.”
And Robert is definitely not the only one excited about DuckDB when reading social media posts (I recommend just doing a search on the hashtag #duckdb on LinkedIn or Twitter). E.g. Abhishek Choudhary (Senior Lead Data Engineer Bayer) recently wrote the following on LinkedIn:
“Opinion: One of the most exciting new technologies for Data Engineering/ Data Science is DuckDB. DuckDB is insanely fast and with Apache Arrow, the duo is capable of delivering astonishing results. Another important point behind DuckDB is it’s simple. It doesn’t claim any groundbreaking stuff but sticks to the core of simple and faster data access.”
However, my favorite social media comment about DuckDB is most likely by Josh Wills, in a Twitter thread that discusses the “How Snowflake fails” blog post by the always entertaining Benn Stancil (I recommend subscribing to his Substack):
Find below some additional screenshots of tweets showcasing the interest that DuckDB currently catalyzes in the data community:
Building a managed solution on top of DuckDB
It’s a pretty classic playbook — take an open source tool showcasing momentum and build a service on top of it. E.g. Databricks did this with Spark and Confluent with Kafka.
Jordan Tigani, long-time product leader of BigQuery at Google (a BigQuery celebrity as noted by Robert Sahlin earlier), announced in May that he’s co-founding a serverless cloud version of DuckDB called MotherDuck. Joining him is his Google colleague Tino Tereshko.
Besides MotherDuck, we have DuckDB Labs, which is a commercial company formed by Hannes Mühleisen and the other creators of DuckDB in July 2021 to provide support, custom extensions, and even custom versions of the product as a way to monetize it.
As Lauren Balik noted in her “6 Reality-Based Predictions for Data in 2023” blog post — venture capitalists and data professionals are right to be flocking to DuckDB.
This interest materialized yesterday with MotherDuck announcing their $47.5M funding round led by e.g. a16z (early investors in e.g. Databricks) and Redpoint Ventures (early investors in e.g. Snowflake). MotherDuck and DuckDB Labs also announced a strategic partnership at the same time.
Jordan Tigani (co-founder at MotherDuck) commented the following to TechCrunch when announcing the funding round:
“Users want easy and fast answers to their questions — they don’t want to wait for the cloud… The fact is that a modern laptop is faster than a modern data warehouse. Cloud data vendors are focused on the performance of 100TB queries, which is not only irrelevant for the vast majority of users, but also distracts from vendors’ ability to deliver a great user experience.”
But what’s this hype all about? Let’s scratch the surface a little bit.
DuckDB is an easy-to-use open source in-process OLAP database (that processes data in memory and doesn’t require a dedicated server/service) — described by many in simplified terms as the SQLite equivalent for analytical OLAP workloads.
On HackerNoon, it was once described as “mutant offspring of SQLite and Redshift”.
As noted by the MoterDuck team, as an in-process database, DuckDB is a storage and compute engine that enables developers, data scientists, data engineers and data analysts to power their code with extremely fast analyses using plain SQL. Further, DuckDB has the capability to analyze data where it might live, e.g. on the laptop or in the cloud. Additionally, DuckDB comes with a simple CLI for quick prototyping — without the need for setup, permissions, creating and managing tables, etc.
Based on reading threads on e.g. HackerNews, Reddit and Twitter, there seems to be a lot to like about DuckDB, e.g.:
- Its performance for analytical workloads on single-node machines seems to be impressive and the setup is pain-free (you can technically start exploring DuckDB within 5 minutes).
- DuckDB is embeddable — like SQLite — and is optimized for analytics. The big deal here is the embeddable part (like a library without bringing in the typical PostgreSQL dependency), eliminating the network latency you usually get when talking to a database.
- DuckDB has also really low deployment effort — `pip install duckdb` and you are off to the races.
- Further DuckDB is fast — compared to querying Postgres, DuckDB is 80X faster and when benchmarking other systems we can see similarly impressive results.
These are some of the reasons DuckDB has witnessed impressive growth over the past 12 months.
In practice, any CPU can be mobilized to perform powerful analytics via DuckDB. Further, DuckDB is portable and modular, with no external dependencies. In concrete terms, this means that you can run DuckDB on a cloud virtual machine, in a cloud function, in the browser, or on your laptop as mentioned prior.
Let’s take a step back
As Kojo mentions, an emerging category of data warehouses sits at the intersection of analytical queries and embedded deployments. To illustrate why this is so compelling, he catgeorizes databases along two axes:
As visible above and noted by Kojo, current databases are optimized for analytical or transactional workloads. Analytical workloads — also called Online Analytical Processing (OLAP) — are complex queries on historical data. For example, you may want to analyze user signups broken down by demographics such as age and location. On the other hand, transactional workloads — also referred to as Online Transactional Processing (OLTP) — are optimized for quick real-time reads and writes.
Let’s move ahead to deployment types.
As visible above and noted by Kojo, current database technologies are deployed as stand-alone or embedded solutions. Stand-alone databases are typically deployed in a client-server paradigm. The database sits on a centralized server and is queried by a client application. Embedded databases run within the host process of whatever application is accessing the database.
Now some magic. When we merge these two axes we can see an innovation gap! As Kojo underlines, current innovation in OLAP databases has focused on stand-alone OLAP databases such as Snowflake, ClickHouse, and Redshift (don’t know why he left out BigQuery). This has led us to a situation where embedded analytics use cases have been overlooked and underserved. DuckDB is changing this.
Use cases for DuckDB
Airbyte has in their glossary a short summary of example use cases for DuckDB:
- Ultra-fast analytical use-case locally. E.g., a Taxi example in the Airbyte glossary includes a 10 Year, 1.5 Billion row Taxi data example that still works on a laptop. See benchmarks here.
- It can be used as an SQL wrapper with zero copies (on top of parquets in S3).
- Bring your data to the users instead of having big roundtrips and latency by doing REST calls. Instead, you can put data inside the client. You can do 60 frames per second as data is where the query is.
- DuckDB on Kubernetes for a zero-copy layer to read S3 in the Data Lake! Inspired by this Tweet. The cheapest and fastest option to get started.
Based on documentation, DuckDB should be used when:
- Processing and storing tabular datasets, e.g. from CSV or Parquet files
- Doing interactive data analysis, e.g. joining & aggregate multiple large tables
- Having concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
- Having large result set transfer to client
Based on documentation, DuckDB should not be used when:
- Having high-volume transactional use cases (e.g. tracking e-commerce orders)
- Writing to a single database from multiple concurrent processes
- Having large client/server installations for centralized enterprise data warehousing
To learn more about use cases for DuckDB, listen to this The Data Engineering Podcast episode with Hannes Mühleisen, one of the creators of DuckDB (use case discussion starts at ca 14min).
There are many database management systems out there. But as noted by the DuckDB creators: there is no one-size-fits-all database system. All take different trade-offs to better adjust to specific use cases. DuckDB is no different.
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. Enter DuckDB.
Further, it seems like DuckDB also allows an entire community of SQL enthusiasts to be instantly productive in Python without ever learning more than very basic Pandas. There’s a growing number of data community members who never use Pandas for anything complex anymore because they favor SQL.
1. We are living in the great disaggregation of the central data platforms era. The more extreme compute decentralized paradigm I can think of is a grid of laptops. The combo of technology like parquet + pyaArrow with vectorized execution makes it efficient to query large datasets in personal devices.
2. With increased data literacy and improved coding skills in the vast majority of data workers, insight consumption is far from static — dashboards — but exploratory and self-service. So I envision data analysts accessing data in cloud storage, running embedded analysis locally with duckDB.
3. SQL is more alive than ever before — period.
4. Zero deployment effort — `pip install duckdb` and you are in
5. Open Source — There is a vibrant community forming, with support in key pieces like pandas, dbt or apachesuperset , not to mention new startups like DuckDB Labs and MotherDuck
What do you think about the future of DuckDB?
PS. I recommend watching this “What is DuckDB” video by The Seattle Data Guy where he discusses together with Joseph Machado (Senior Data Engineer at LinkedIn) about how DuckDB has entered the world of data by storm.